Architecture of Data Warehouse
With vast of data getting generated every minute, there is a drive to generate insights from the data, which can help create products and services that can serve humankind. With that being said, the first and foremost step is to collect the data from external sources. This housing structure where we store the data is called the data warehouse.
For the past few decades, the data warehouse architecture has been the pillar of corporate data ecosystems, and despite numerous alterations over the last five years in the area of Big Data, cloud computing, predictive analysis, and information technologies, data warehouses have only gained more significance.
Types of Data Warehouse Architectures
• The Single-tier Data Warehouse architecture • The Two-tier Data Warehouse architecture • The Three-Tier Data Warehouse Architecture
The Architecture of a Data Warehouse can be quite intriguing at first as it can be considered an information system that has historical commutative data from multiple sources, and to layer down, we have different approaches for constructing the data warehouse layers and the three different types of architecture of Data Warehouse as can be explained from below:
The Single-Tier Data Warehouse Architecture
The Single-Tier architecture of the data warehouse can be considered as a cumulation of three layers that is physical source layer, the virtual data warehouse, and the analysis layer, which can have reporting or OLAP tools.
In this method of the single-tier architecture of data warehouse, the data warehouse is virtual, which in turn means that the data warehouse is implemented as a multidimensional view of operational data, which is mostly created by specific middleware. It can also be called an intermediate processing layer which is not periodically used in practice.
The purpose of having just a single layer of physical source layer in the architecture of a data warehouse is mostly to minimize the amount of data stored to reach the goal, which in turn removes data redundancies.
This single-tier architecture of data warehouse has a primary drawback which is that it doesn't have a component that separates analytical and transactional processing. The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing.
The Analysis queries are done on to operational data after the middleware interprets them. In this way, queries affect transactional workloads, which sums up why the single-tier architecture is not a frequently practiced approach.
Below is the pictorial representation of the single-tier architecture of the data warehouse:
The Two-Tier Data Warehouse Architecture
With the major drawback of single-tier architecture not having a separation of layers for analytical and transactional processing, The two-tier architecture of data warehouse came into the picture. This two-tier architecture vanishes the drawback of the single-tier as it has a separation between the layers which plays an essential role in maintaining the two-tier architecture.
The two-tier architecture of the data warehouse comprises the following two tiers:
1. The Data Tier
2. The Client Tier
1. The Data Tier
The Data Tier in the two-tier architecture of the data warehouse can be defined as the layer where actual data is stored after various ETL processes are used to load data into the database or the data warehouse.
The staging area where the ETL processes are used in the Data tier helps you ensure that all data loaded into the warehouse is cleansed and in the appropriate format.
This staging area in the two-tier architecture of the data warehouse ensures that data from all sources are collected before the data warehouse layer. With the addition of the staging area between the sources and the storage repository, i.e., the data warehouse, we make sure that all data loaded into the warehouse is in a defined format.
Though it is analyzed as a Data Tier of the Two-Tier architecture, it consists of three subsequent data flow stages to process the data, which can be understood as below:
The Data Tier consists of the following Three layers:
- The Source Layer
- The Data Staging Layer
- The Data Warehouse Layer
A. The Source Layer
This layer is the first layer of the two-tier architecture of the data warehouse. As the data warehouse system uses a heterogeneous source of data, this source layer acts as a repository to store the data from multiple sources that can be stored initially in corporate relational databases. This data can also be stored on legacy databases, or sometimes it comes from sources that are outside the enterprise walls.
B. The Data Staging Layer
This layer is the second layer of the two-tier architecture of the data warehouse. This layer obtains the data stored in the source layer, which is then extracted, cleansed, and integrated to club heterogeneous sources into one standard schema.
The cleansing of the raw data is done to remove inconsistencies and fill gaps in the raw data. We have widely heard of the process called ETL, expanded Extraction, Transformation, and Loading, which plays an important part in this layer of the two-tier architecture of the data warehouse.
The ETL tools then combine multiple schemas into a standard schema, extract, transform, cleanse, validate, filter, and finally load source data into a data warehouse.
C. The Data Warehouse Layer
This layer is the third layer of the two-tier architecture of the data warehouse. The information received from the data staging layer is then saved to one logically centralized individual repository, which is termed as the data warehouse.
Many times the data warehouses can be directly accessed, but it has been observed that sometimes it can also be integrated as a source for creating data marts, which replicate data warehouse contents partially. These are mostly designed for specific enterprise departments.
The Meta-data repositories, which are also an important part of the architecture of the data warehouse, store information about the sources, data staging, access procedures, data mart schema, users, and so on.
As discussed above, once the data coming from multiple external sources is cleansed and transformed, it gets stored in the data warehouse.
Now to actually understand what the data wants to tell us, we introduce the Second tier of the Two-tier Architecture of the data warehouse, That is, The Client Tier.
2. The Client Tier
The Client Tier in the two-tier architecture of the data warehouse can be defined as the layer where data is stored in the data warehouse and can be used by the client to generate insights that can help the organization invent or transform the existing solution/products based on the trends that can be analyzed by the reports generated.
This Client Tier is the front-end application where the clean data is reflected so that the users/client can start their analysis. Many times, the data warehouse reports are hidden in the GUI to show the required reports.
The Client Tier consists of a single layer called the Analysis Layer.
The Analysis layer:
This layer is the fourth layer of the two-tier architecture of the data warehouse. The prime focus in this layer is the integration of data that is efficient and flexible enough to accessibly issue reports, simulate hypothetical business scenarios and dynamic analysis of information.
The analysis layer in the architecture of the data warehouse features aggregate information navigators and efficient query optimizers.
The Two-tier architecture has a primary drawback is that it is not expandable, which in turn also resonates with it not supporting a large number of end-users. We also see connectivity problems because of network limitations that are faced with this type of architecture of data warehouse.
Below is the pictorial representation of the two-tier architecture of the data warehouse:
The Three-Tier Data Warehouse Architecture
Solving the limitations of connectivity problems because of network limitations of the two-tier architecture of the data warehouse. We have the most widely used architecture for data warehouse systems.
We have the three-tier architecture of the data warehouse. This three-tier architecture of the data warehouse consists of the source layer (containing heterogeneous source systems), the reconciled layer, and the data warehouse layer (containing both data marts and data warehouses).
Understanding architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra file storage space used through the redundant reconciled layer. It also makes the analytical tools a little further away from being real-time.
The three tiers of the three-tier architecture of the data warehouse can be classified as below:
1. The Bottom Tier
The bottom tier consists of the database of the Data warehouse servers as the bottom tier for the architecture. We can define this database as a relational database system.
This layer is responsible for cleaning, transforming, and loading the data present in the database from the bottom tier using back-end tools. This layer can also be termed the data warehouse layer (containing both data marts and data warehouses)
2. The Middle Tier
The middle tier consists of the OLAP server. This is widely implemented by using either the ROLAP or the MOLAP model. As this is the middle tier of the architecture of the data warehouse, this acts as a mediator between the end-user and the database.
This tier is presented as an application layer that has an abstract view of the database when a user visits this tier. The middle layer of this three-tier architecture is the reconciled layer which resides between the source data and the data warehouse. As it can create a standard reference data model for a whole organization, this can be considered as its major advantage.
The reconciled layer also helps in separating the problems of data extraction at the source end and integrating that data into the data warehouse population. Be it Producing daily analysis sheets or reports (which cannot be satisfactorily prepared using the enterprise applications) or Periodically generating data flows to feed external processes which can benefit from cleaning and integration of the data in many cases like the reconciled layer (the middle tier of the architecture of data warehouse) is majorly used to directly accomplish better operational tasks.
3. The Top-Tier
The top tier consists of the front-end client layer as the top tier for the architecture. This is widely classified as the tier consisting of the API and tools that we use for connecting and gathering data out from the data warehouse. This is also used as Query tools, reporting tools, managed query tools, Analysis tools, and Data mining tools. This layer can be considered as the source layer (containing heterogeneous source systems.
Below is the pictorial representation of the three-tier architecture of the data warehouse:
The Properties of Data Warehouse Architecture
While designing the architecture of the data warehouse, we need to keep in mind the various properties it holds so that we are making the best choices for defining the architecture of the data warehouse.
The five main properties that the Architecture of a data warehouse must hold are as below:
1) Security
It is very important to have a secure architecture of the data warehouse as the data which is processed and stored in the data warehouse should not be accessible by any unlicensed third-party users.
It's highly recommended to monitor accesses that are necessary as the strategic data stored in the data warehouses can range from personal data of the customers or any such data which, when intervened by anyone, can lead to privacy issues.
Therefore, Security is one of the important properties while designing the architecture of the Data Warehouse
2) Administerability
The next property that needs to be kept in mind while designing the Architecture of the Data Warehouse is Administerability. For maintaining Administerability, it is always recommended to have a simple version of the Data Warehouse management; that is it should not be complicated.
3) Scalability
With digitalization ruling the industry, the scalability of architecture plays an important role in designing the architecture of a data warehouse.
This property of scalability must be accommodated when designing as with the rapid growth of digital transformation, it is obvious to make sure that the hardware and software architectures should be kept simple to upgrade the data volume as quickly as possible, which in turn can be managed and processed on demand and can meet the number of user's requirements quickly to have a progressive increase.
4) Extensibility
The fourth property that covers the ability to adjust to the different client needs and gives the Architecture of a Data Warehouse the liberty to accommodate the changes is Extensibility.
With the new technologies emerging and the amount of data getting created, it is highly recommended that the Extensibility of the architecture of the data warehouse be spot on.
We should have an architecture that can upgrade with respect to the demand without redesigning the whole system.
5) Separation
The last important property that needs to be kept in mind while designing the Architecture of the Data Warehouse is Separation. When the amount of data in the data warehouse stored starts to increase can be issues that can start to come up concerning the overlapping of the process, such as analytical or transactional.
Therefore it is always a good point to incorporate proper Separation analysis while designing the Architecture of the Data Warehouse to keep both the Analytical and transactional as apart as possible.
Below is the pictorial representation of The Properties of the Data Warehouse that must be followed to design the architecture of the data warehouse:
Conclusion
- The Data warehouse can be described as an information storage repository that contains data from single or multiple sources dated historically or recently.
- The architecture of the data warehouse can be defined as the overall architecture of data communication, processing, and final records being presented to the end clients who are computing within the enterprise.
- The architecture of data warehouses is of three different types: Single-tier architecture, Two-tier architecture, and Three-tier architecture.
- The Properties of Data Warehouse architecture are centered around five pillars: Security, Administerability, Scalability, Extensibility, and Separation.
- It's important to always be aware of the Property and the Characteristics that need to be recollected while designing the architecture of the data warehouse.