Last Updated: 23-02-2022 (work in progress)
This codelab is the second in a series on designing and implementing data warehouses using Kimball's method. After completing the series, you will be able to build a simple data warehouse from scratch using only SQL code.
In this codelab, we will look at data quality and what that means in a data warehouse context. Afterwards, we will start on the extract and transform steps of the ETL process. We will work with the NorthWind example, and the ETL will be based on the dimensional design from the first codelab in the series.
Before we look into Data Quality and extraction and transformation of data, we will first recap what we have learned in codelab 1.
This codelab consists of a number of videos for you to watch as well as small exercises that should be completed as you move along. Exercises are clearly marked and are allocated extra time. If you complete an exercise early, you can move on by clicking "next".
You must work in the relevant programs along with completing the steps and watching the videos in this codelab.
It is a prerequisite for this learning path that you have completed the first codelab in the series "Introduction to Data Warehousing".
In codelab one we looked at the difference between transactional systems and data warehouses. We learned about the dimensional design approach and the star schema. We completed two small exercises on creating ER diagrams for transactional systems and ER diagrams for data warehouses based on the small Mountain Hotel exercise. If you have followed along with my videos, you should now have an ER diagram for the NorthWind dimensional model as well as a database with corresponding tables for both the dimensions and the fact tables. This forms the basis for continuing with this codelab.
Recall the 4 (+1) step dimensional design approach:
1. Select the business process
2. Declare the grain
3. Identify the dimensions
4. Identify the facts
(+1: Draw the star schema)
At the end of the previous codelab, we finished a dimensional model for the NorthWind case. If you followed along and created your own ER diagrams, you should have an ER diagram that looks like this:
Figure: Dimensional Model for NorthWind
As you can see in the above example, I have added the Date Dimension in grey here (this was not shown in the videos). Date and Time are dealt with separately from other dimensions in a data warehousing context and we will return to them later in the series. For now you can ignore the date dimension.
In addition to the above ER diagram, you should also have a new database in SSMS called NorthWindDW which contains the initial stage tables corresponding to the ER diagram above.
Your stage tables can be created with SQL code that looks like this (example provided for Dim Customer):
/****** DimCustomer Stage table created if it does not already exist ******/
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[stage].[DimCustomer]') AND type in (N'U'))
CREATE TABLE [stage].[DimCustomer](
    [CustomerId] [nvarchar](5) NOT NULL,
    [CompanyName] [nvarchar](40) NULL,
    [City] [nvarchar](15) NULL,
    [Region] [nvarchar](15) NULL,
    [PostalCode] [nvarchar](10) NULL,
    [Country] [nvarchar](15) NULL,
    CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED
    (
        [CustomerId] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
Please note that in the example above, the unique key constraint is enforced because the stage.DimCustomer table has a primary key constraint added in the schema definition (CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED).
Since unique primary keys are often (or at least should be) enforced by the source system, we can decide not to enforce them in the data warehouse staging tables (we should still take care to check the ETL process for possible issues that might impact our data quality!). This may improve insert speed, since we do not incur the overhead of checking for duplicate occurrences of the primary key on all dimension tables in each batch insert. How you end up designing your stage load will depend on your latency requirements; if you process streaming data as part of your source, you will have to consider a different approach from the one shown in this codelab. For the remainder of this series, I will keep the key constraints from the source tables in the staging tables. Once we have transformed the data and are ready to load it into our final modelled schema in the data warehouse, the constraints will be replaced with constraints on the newly generated surrogate keys.
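If you do choose the constraint-free route, the stage table can simply be created without the primary key. Below is a minimal sketch of that alternative (my own illustration, not part of the videos):

/****** Alternative: DimCustomer stage table without a primary key constraint ******/
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[stage].[DimCustomer]') AND type in (N'U'))
CREATE TABLE [stage].[DimCustomer](
    [CustomerId] [nvarchar](5) NOT NULL,
    [CompanyName] [nvarchar](40) NULL,
    [City] [nvarchar](15) NULL,
    [Region] [nvarchar](15) NULL,
    [PostalCode] [nvarchar](10) NULL,
    [Country] [nvarchar](15) NULL
) ON [PRIMARY]
GO
-- Duplicate CustomerId values are now accepted at insert time,
-- so the ETL flow must detect and handle duplicates before the final load.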
ETL stands for Extract, Transform and Load. It is the process represented by the cogwheels in the figure below:
Figure: The Data Warehouse sits in the middle between source systems and data products designed for the end user
In the ETL process, what we are trying to achieve is extracting data from one or more source systems (typically transactional systems such as ERP systems, CRM systems and/or other transactional databases that the company uses to capture data relevant to its business processes).
We then transform that data from one or more, typically normalized, data models (recall our example with the Mountain Hotel from codelab 1) into a data model for our data warehouse (here, a star schema). We do this "exercise" in re-modelling data to be able to integrate data from multiple sources and put it into a format that is easily consumable by an analytics user.
The transformation is not only limited to denormalization by way of dimensional modelling, but can also stem from other transformation needs (deduplication, data quality issues, etc.). In this codelab we will focus on the transformations that occur because of the denormalization stemming from the dimensional design approach, but also a few transformations due to data quality issues.
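To make the idea of denormalization concrete, here is a small sketch of flattening normalized source tables into one wide row per product (my own illustration, assuming the standard NorthWind source schema in a database called Northwind; a product dimension is not part of our design above, so this is for illustration only):

-- Denormalization: flatten the normalized Products/Categories/Suppliers
-- tables into a single, wide row per product.
SELECT p.ProductID,
       p.ProductName,
       c.CategoryName,
       s.CompanyName AS SupplierName
FROM   Northwind.dbo.Products p
LEFT JOIN Northwind.dbo.Categories c ON c.CategoryID = p.CategoryID
LEFT JOIN Northwind.dbo.Suppliers  s ON s.SupplierID = p.SupplierID;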
We have basic transformations such as selection, matching and data cleansing - e.g. replacing null with "No value" or "Unknown" - but we may also do major transformations such as standardizing data, character conversion, calculations and derived values, conversion of units, aggregation, deduplication and key restructuring (we will look at the latter later when we start to talk about surrogate keys).
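Here is a short sketch of two of these basic transformations (my own illustration based on the standard NorthWind source tables; the replacement texts "No value" and "Unknown" are just examples):

-- Column cleansing: replace NULL with a default text value while reading
SELECT CustomerID,
       CompanyName,
       ISNULL(Region, 'No value')    AS Region,
       ISNULL(PostalCode, 'Unknown') AS PostalCode
FROM   Northwind.dbo.Customers;

-- Calculated/derived value: a line total computed from existing columns
SELECT OrderID,
       ProductID,
       UnitPrice * Quantity * (1 - Discount) AS LineTotal
FROM   Northwind.dbo.[Order Details];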
"The ETL data cleansing process is often expected to fix dirty data, yet at the same time the data warehouse is expected to provide an accurate picture of the data as it was captured by the organization's production systems" (Kimball, p. 381).
In his book, Kimball suggests three types of data quality screens: column screens, structure screens and business rule screens. Further, he suggests three courses of action: halting the process, suspending rows for later processing, and tagging data but allowing it through. There is a fourth option as well, which is overriding issues in the ETL process, but this choice should be carefully considered because of the need to strike a balance between accuracy (as in "equal to data in the operational system") and clean data that is useful for the business. Imagine a "dirty" customer dimension where the same customer appears more than once. What happens when we try to group our sales by the customer dimension? We will see multiple different "aggregates of sales" for seemingly different customers - and we will have to figure out which customers are the same and add the aggregates by hand to get the actual, correct sales amount for that customer.
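As a small, hedged illustration of what a column screen combined with the "tag and let through" strategy could look like in SQL (the error table, its columns and the rule below are my own inventions for this sketch, not taken from Kimball's book or the videos):

-- A simple column screen: find stage rows that violate a rule
-- (here: customers with no postal code) and record them in an
-- error/audit table instead of halting the load.
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[stage].[DQErrorEvent]') AND type in (N'U'))
CREATE TABLE [stage].[DQErrorEvent](
    [TableName]  [nvarchar](128) NOT NULL,
    [KeyValue]   [nvarchar](50)  NOT NULL,
    [ScreenName] [nvarchar](100) NOT NULL,
    [DetectedAt] [datetime]      NOT NULL DEFAULT (GETDATE())
)
GO

INSERT INTO [stage].[DQErrorEvent] (TableName, KeyValue, ScreenName, DetectedAt)
SELECT 'stage.DimCustomer', CustomerId, 'PostalCode is NULL', GETDATE()
FROM   [stage].[DimCustomer]
WHERE  PostalCode IS NULL;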
Remember that when you work with real-life data, you are dealing with data that may be noisy, either from "messy" business processes or from source systems that may be error prone. Systems can yield data that is inconsistent (e.g. across the system landscape - a customer exists in one system with one id, but has a different id in another system), simply wrong (incorrect data) or missing. When bugs exist in the source system they will (hopefully!) eventually get fixed over time, but they will live on in (historical) data. So in addition to considering data quality issues stemming from messy business processes, the DQ considerations should also address system-specific issues and be adjusted when the system landscape changes.
In the video below, I use the DQ typology of Pipino, Lee and Wang (2002) to assess possible issues related to data quality based on the fictional case of Cookie Cutter Inc. Please watch the video:
Video 1: Data Quality Dimensions
Please consider:
Given that we are not working with an actual case company, but rather relying on a static database to build our data warehouse on, we are not in a position to go back and challenge the business on their data entry processes, so we are left with the four options presented above (halt, suspend, tag and fix). In the above example with Cookie Cutter Inc. we could choose the strategy that Kimball suggests: flag data and pass it along - then build an audit dimension to be used by the end data user.
Since data in the NorthWind database is not very "dirty" and only minor issues exist (mostly related to missing values), we will use the "fix strategy" to deal with null attributes. First, however, we need to inspect data quality. Please watch the video below where I show some basic data profiling based on NorthWind:
Video 2: Data Quality Inspection
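If you want to try some of this profiling yourself, simple queries along these lines will do (my own sketch, assuming the source database is called Northwind with the standard NorthWind schema; not necessarily the exact queries shown in the video):

-- Count missing values per column in the Customers source table
SELECT COUNT(*)                                            AS TotalRows,
       SUM(CASE WHEN Region     IS NULL THEN 1 ELSE 0 END) AS MissingRegion,
       SUM(CASE WHEN PostalCode IS NULL THEN 1 ELSE 0 END) AS MissingPostalCode
FROM   Northwind.dbo.Customers;

-- Check for duplicate business keys
SELECT CustomerID, COUNT(*) AS Occurrences
FROM   Northwind.dbo.Customers
GROUP BY CustomerID
HAVING COUNT(*) > 1;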
The transformations we will do related to data quality issues are fixing the null postal codes in the customer dimension and the null values on the region attribute in both Dim Customer and Dim Employee. However, in your own real-life data warehousing project there are other transformations that you might want to consider doing in your ETL flow (see further up). I can give one example that I have worked on in my career: an agribusiness that was in the business of M&A (mergers and acquisitions) with small farms needed a financial overview of their business across a setup with many different ERP systems. Integrating the many diverse ERP systems was costly and not a viable option, since over time the farms would be "ported" to the acquiring company's ERP system. Therefore, a solution was chosen where the data warehouse would be built solely on .csv exports from the different ERP systems, and complex mapping was done to ensure a uniform representation of data across the system landscape. Once the acquired farms were successfully ported to the acquiring firm's ERP system, the "link" could easily be closed, but history maintained.
On the following page, you will see how to extract data from the source tables, insert it into the staging tables, and then transform the same data, making sure steps are taken to address the data quality issues identified above.
Video 3: ETL - staging data
Complete your initial ETL load for all stage tables (dimensions excl. date in NW) including all relevant data cleansing steps.
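If you get stuck on the exercise, the overall shape of a staging load with simple cleansing for one dimension is sketched below (my own illustration, not a copy of the code in the video; the replacement texts and the source database name Northwind are assumptions):

-- Extract: empty the stage table and reload it from the source system
TRUNCATE TABLE [stage].[DimCustomer];

INSERT INTO [stage].[DimCustomer] (CustomerId, CompanyName, City, Region, PostalCode, Country)
SELECT CustomerID, CompanyName, City, Region, PostalCode, Country
FROM   Northwind.dbo.Customers;

-- Transform: fix the data quality issues identified during profiling
UPDATE [stage].[DimCustomer]
SET    Region = 'No region'
WHERE  Region IS NULL;

UPDATE [stage].[DimCustomer]
SET    PostalCode = 'Unknown'
WHERE  PostalCode IS NULL;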
There are three different types of data loads that you need to know about. The first is the initial load, where we populate all our tables in the data warehouse for the first time. During this load we also create any persistent or static dimensions that we might need in the data warehouse (dimensions that never or rarely change, e.g. date and time dimensions). Persistent dimensions are created in the ETL process and not loaded from the source system.
The second type is the incremental load, where we only load a subset of data to capture ongoing changes in the underlying data sources in a periodic manner. For example, we might decide to load fact data intra-daily without also refreshing the associated dimension tables. Finally, we have the full refresh, which completely erases the content of one or more tables and reloads them.
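As a hedged sketch of what an incremental load could look like (stage.FactOrders and its columns are hypothetical names used only for this illustration, not the design used later in the series):

-- Incremental load: only stage fact rows that are new since the last load.
-- The cutoff value is a placeholder; it would normally be read from a
-- load-control/metadata table maintained by the ETL process.
DECLARE @LastLoadDate datetime = '1998-04-01';

INSERT INTO [stage].[FactOrders] (OrderID, CustomerId, OrderDate, LineTotal)
SELECT o.OrderID,
       o.CustomerID,
       o.OrderDate,
       od.UnitPrice * od.Quantity * (1 - od.Discount) AS LineTotal
FROM   Northwind.dbo.Orders o
JOIN   Northwind.dbo.[Order Details] od ON od.OrderID = o.OrderID
WHERE  o.OrderDate > @LastLoadDate;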
In the video below, I will show you how to finalize the initial ETL load for your NorthWind Data Warehouse. Specifically, I will show you how to create the EDW area for your data, how to create a persistent date dimension (one that does not get updated from source tables), and how to generate and set surrogate keys.
Video: Finalizing initial ETL data load
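To give you an idea of what these steps can involve before you watch, here is a minimal sketch of an EDW-layer dimension with a database-generated surrogate key and a simple date dimension populated directly in the ETL. The video shows the approach actually used in this series; the edw schema name, the column set of DimDate and the date range below are my own assumptions.

IF NOT EXISTS (SELECT * FROM sys.schemas WHERE name = 'edw')
    EXEC('CREATE SCHEMA edw');
GO

-- EDW dimension: the surrogate key is generated by the database (IDENTITY),
-- and the business key from the source is kept as an ordinary attribute.
CREATE TABLE [edw].[DimCustomer](
    [CustomerKey] [int] IDENTITY(1,1) NOT NULL,
    [CustomerId]  [nvarchar](5)  NOT NULL,
    [CompanyName] [nvarchar](40) NULL,
    [City]        [nvarchar](15) NULL,
    [Region]      [nvarchar](15) NULL,
    [PostalCode]  [nvarchar](10) NULL,
    [Country]     [nvarchar](15) NULL,
    CONSTRAINT [PK_EdwDimCustomer] PRIMARY KEY CLUSTERED ([CustomerKey] ASC)
);
GO

INSERT INTO [edw].[DimCustomer] (CustomerId, CompanyName, City, Region, PostalCode, Country)
SELECT CustomerId, CompanyName, City, Region, PostalCode, Country
FROM   [stage].[DimCustomer];
GO

-- Persistent date dimension: generated in the ETL, never loaded from a source system.
CREATE TABLE [edw].[DimDate](
    [DateKey]  [int]  NOT NULL PRIMARY KEY,  -- e.g. 19970315
    [FullDate] [date] NOT NULL,
    [Year]     [int]  NOT NULL,
    [Month]    [int]  NOT NULL,
    [Day]      [int]  NOT NULL
);
GO

-- Roughly covers the NorthWind order date range
DECLARE @d date = '1996-01-01';
WHILE @d <= '1998-12-31'
BEGIN
    INSERT INTO [edw].[DimDate] (DateKey, FullDate, [Year], [Month], [Day])
    VALUES (YEAR(@d) * 10000 + MONTH(@d) * 100 + DAY(@d), @d, YEAR(@d), MONTH(@d), DAY(@d));
    SET @d = DATEADD(day, 1, @d);
END
GO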
In order to show the database design for the data warehouse, you should include the ER diagram (dimensional model) as well as source-to-target mappings, which should provide information about the logical structure of the data warehouse (including data types) and document any transformations done to data due to business rules.
Further, you should make sure to document the flow of activities. In which order are the steps executed? For this simple example, the order is pretty straightforward, but as your data warehouse grows it becomes extremely important to document the order of execution so that other developers can understand the interdependencies between steps (e.g. cleansing data that has not yet been extracted is not a meaningful activity). The flow of activities can be documented with activity diagrams and sequence diagrams (UML).
Further, make sure you include detailed documentation of activities by including all source code and relevant comments (you may also decide to subject your entire source code to version control - but that's a topic for another class).
If you like my work, please consider sharing it!
This codelab is based on materials that I - together with colleagues - developed for my students at VIA University College in Horsens as well as the Kimball book: The Data Warehouse Lifecycle Toolkit and papers from ACM.
The adaptation for this codelab has been done by myself and is using only materials produced by me to avoid copyright infringements.
Any errors in this material are mine alone.