Data Warehouse Design Process
The demand to store and analyze vast data for business growth necessitates careful design of data warehouse architecture. Various approaches exist for utilizing its components effectively. Guidelines ensure adherence to standards in designing segments of the warehouse. This article explores different frameworks and approaches, emphasizing the importance of aligning architecture with business needs.
The Two Approaches to Designing the Architecture of a Data Warehouse
When we are designed to build an architecture of a data warehouse, it must always be taken care of for the data model that needs to be integrated. With that case coming up, a data model provides a framework and a set of best practices to follow when designing the architecture or troubleshooting issues. As a data warehouse is a heterogeneous collection of different data sources that are organized under a unified schema, we can broadly follow two approaches for constructing a data warehouse which are explained below :
- The Top-down approach
- The Bottom-up approach
1. The Top-down approach:
This approach was coined by Inmon, and it can be defined as the data warehouse in this approach acts as a central information repository for the complete enterprise, and then the data marts are created from it after the complete data warehouse has been set up.
Elaborating on the "Top-Down" design approach, where we pick data from different sources which are validated, reformatted, and saved in a normalized (up to 3NF) database as the data warehouse as here we consider the data warehouse as a subject-oriented, time-variant, non-volatile and integrated data repository for the entire organization.
The Top-Down approach is considered a data-driven approach as the information is collected and integrated first, and then the business requirements by subjects(or themes) are subjected to building the data marts. The data marts that we built from this approach while designing the architecture of the data warehouse will still have consistency when the data overlap as this approach supports a single integrated data source.
It is widely observed that from the top-down approach to the architecture of the data warehouse, the data marts are built from the data which are selected for specific business subjects or particular departments as the data collected here stores the atomic information, that is, the lowest level of granularity.
Below is the pictorial representation of the TOP_DOWN APPROACH of the Data Warehouse Architecture:
Components of the Top-Down Approach
A. The External Sources: The data or the raw data is collected from the external source, that is, the source of truth, which the organization needs to decide as a best practice for designing the architecture of the data warehouse. The external source is a source from where the data is collected, irrespective of the type of data. The Data is mainly of three different forms, which are : • Structured ( CSV, excel sheets, relational database, etc) • Semi-structured (HTML, JSON, XML) • Unstructured (audio, video, pdf, etc)
B. The Stage Area: After the extraction of data is done from the external sources, we see that the data does not follow a particular format that is, some are logical values, numerical values, etc., so to make the standardized data format we need to validate this source data before we load into the data warehouse.
And to solve this we have the ETL tool. The ETL tool or the Extract-Transform-Load tool helps in cleansing and transforming the data to serve the business needs.
E(Extracted): Here, the raw data is extracted from the external data source.
T(Transform): Here, the raw data that we received is transformed into the standard format, which is universally acceptable and reliable, along with serving the business needs. We make use of the Query Tools at this stage of data transformation.
L(Load): Here, the refined-transformed data is loaded into the data warehouse for further analysis to gain insights that can help a business grow.
C. The Data Marts: The third component in the Top-down approach while designing the architecture of the data warehouse is the Datamart, which can also be suggested as part of the storage component. The data mart stores the information or the transformed data, which is of a particular function/theme of an enterprise mostly handled by a single authority.
We can have as many data marts in an organization that majorly depends upon the subjects/theme. This is widely known that the data mart is also called the subset of the data warehouse as the data stored in the data mart is the same as in the data warehouse compartmentalized under different themes.
D. The Data Mining:
The fourth major component in the Top-down approach while designing the architecture of the data warehouse is data mining. The raw data was gathered at the External source after the transformation reached the data warehouse. Now this cleansed and transformed data will be of no use until the analyst makes the best use of the same. So data mining can be explained as the ability to analyze the transformed data to find out the hidden patterns that are present in the database or the data warehouse with the help of an algorithm of data mining.
As we know, every architecture has its ability to be modified, and so we have certain advantages and disadvantages associated:
Advantages of the Top-Down Approach
There is a consistency that is maintained with the data marts as it is a subset of the data warehouse where the data was cleansed, transformed, and stored. It helps create a consistent dimensional view of the data marts even when they overlap. This also goes for the easier and quicker approach towards creating the data marts, which can then be used by the analyst.
This is a widely used approach as this is considered the strongest business model for designing the architecture of the data warehouse.
The Disadvantages of the Top-Down Approach
The major disadvantage of the top-down approach is the cost, the time taken in designing the architecture of the data warehouse and its maintenance is also very costly.
2. The Bottom-up Approach
This approach was coined by Kimball as it can be defined as data after extraction from the source is cleansed and transformed by the staging area, after which the data is sent to the data marts of each theme/subject, and then it is loaded up in the data warehouse.
Elaborating on the Bottom-Up design approach, here, similar to the Top-Up approach we pick data from different sources, which are validated, cleansed, reformatted, and saved in a data mart instead of the data warehouse of the particular themes/ subjects.
The data marts which are created and store the transformed data start to provide the reporting capability to the analyst. As the data marts are based on particular subjects, these address only a single business area.
After the data is saved in data marts, these are then loaded and integrated into the data warehouse.
It can also be described as a copy of the transaction data specifical architecture for query and analysis, which is the same for the star schema.
Below is the pictorial representation of the BOTTOM-UP APPROACH of the Data Warehouse Architecture:
Advantages of the Bottom-Up Approach
- The analyst can start generating the reports early as the data marts are readily available after the transformation process in the staging area, which means that working towards the decision can be started early, therefore, enabling better decision-making.
- As the number of data marts that are created can be more which means we can extend the data warehouse
- The cost and time taken in designing the architecture of the data warehouse by the bottom-up approach are low comparatively.
- The documents can be generated quickly as the data marts are saved with data early.
- This approach helps in developing the new data marts and then integrating them with the other data marts.
- This approach allows the analysts of the team to learn and grow
The Disadvantage of the Bottom-Up Approach
- As the dimensional view of data marts is not consistent, as was seen in the Top-Down approach, we can say that the model is not strong enough comparatively.
- The cost of implementation of the project is high.
- As the architecture of the data warehouse needs to be flexible to accustom to any changes concerning the needs generated within the organization, we can say that the Bottom-up approach is inflexible to it.
The Data Warehouse Architecture: With Staging Area
The architecture of the data warehouse with the staging area can be defined as the consolidation of the basic architecture of the data warehouse along with the staging area. The staging area is referred to as the place where the data collected from external sources in its raw and uncleaned/unformatted format is brought so that it can be processed before entering into the data warehouse.
The staging area is where we clean and process our operational information before we put it into the warehouse. With the help of the staging area, the data cleansing and consolidation for operational methods can be simplified before we update the same in data marts or data warehouses as per the approach the enterprises take serving their requirements.
Below is the pictorial representation of the Data Warehouse Architecture: With Staging Area of the Data Warehouse Architecture:
The Data Warehouse Architecture: With Staging Area and Data Marts
The architecture of the data warehouse with the staging area and data marts can be defined as the consolidation of the basic architecture of the data warehouse along with the staging area and the data marts. This architecture is different from the above as here we store the transformed data into the data marts concerning its subjects.
Sometimes it may be that we want to customize our data warehouse architecture for different groups within our organization, which can be done by adding the data marts serving the purpose of each of its groups. The data mart can be defined as the segment of a data warehouse that provides information for analysis on a section, unit, department, or operation in the organization and reports over the same, which can help the business take actions that help the organization as a whole.
For example, if an organization receives feedback forms from its customers, that raw data will first be gathered and sent to the staging area to clean and transform it. It will then be divided among the different data marts like purchasing, inventory, customer personal details, etc. These data marts are readily available concerning the bottom-down approach, and the user or the analyst can start its reporting, analysis, and mining to obtain insights that can, for example, improve its product or re-customize its prices.
Below is the pictorial representation of the Data Warehouse Architecture - With Staging Area and Data marts of the Data Warehouse Architecture:
The Data Warehouse Architecture: Best Practices to Define a Crisp Architecture
To design Data Warehouse Architecture, you need to follow the below given best practices to make your architecture flexible enough to quickly adapt to changes that can serve the needs of the enterprise :
- It is always a good practice while designing a data warehouse as this makes it more reliable in understanding the business logic or the use case.
- To be able to choose the appropriate designing approach as a top-down and bottom-up approach in Data Warehouse
- It must always be taken care as to which data model needs to be integrated and not just taken up consolidated. With that case coming up, a data model provides a framework and a set of best practices to follow when designing the architecture or troubleshooting issues. For the same, we can consider standard architectures of data warehouse like we should be considering the 3NF, Data Vault modeling, and star schema.
- Always Create a data model first by starting to identify the organization’s business logic and then Understanding what data is vital to the organization and how it will flow through the data warehouse.
- For information retrieval, the architecture should always be optimized so that it can easily pick up and use Data Warehouse Models like dimensional mode, denormalized, or hybrid approach.
- While a vast amount of data is present around us, it becomes even more important to have a single source of trust, that is, an organization must have a verified source from where it receives data that can help it grow.
- It is important to process the data quickly and accurately and at the same time, consolidate it up into a single repository. With that being said, the designing of the Meta-Data architecture is also important as it allows sharing of metadata between components of the Data Warehouse easier.
- It can not be emphasized enough how automating the task can save us time which can be used to speed up decision-making therefore, automation tools need to be used to help when dealing with vast amounts of data.
- When we talk about data warehouses, we mostly revolve around the ETL process but as a best practice, it is important to spend time carefully designing the data acquisition and cleansing process for the Data warehouse
- To improve on efficiency it's recommended to Enforce coding standards.
Conclusion
- The architecture of the data warehouse can be defined as the overall architecture of data communication, processing, and final records being presented to the end clients who are computing within the enterprise.
- As a data warehouse is a heterogeneous collection of different data sources that are organized under a unified schema, we can broadly follow 2 approaches for constructing a data warehouse which is:
- The Top-down approach ~ Coined by Inmon
- The Bottom-up approach ~ Coined by Kimbell
- The Top-down approach stores the data first in the data warehouse, followed by the data marts. On the other hand, The Bottom-Up approach stores the data first in data marts followed by the data warehouse. Both have their advantages and disadvantages, therefore the organization can take the approach that suits their business needs.
- The data warehouse architecture can be classified into two broad categories: The DW-With Staging Area And The DW-With Staging Area and Datamarts.
- While designing the architecture of the data warehouse, one needs to reflect upon the best practices listed above in the manual to design an effective, flexible, and scalable architecture to meet the changing needs.