Last Updated: 22-02-2022
This codelab is the first in a series on designing and implementing data warehouses using Kimballs methods. After completing the series, you will be able to build a simple data warehouse from scratch using only SQL code.
In this codelab, you will be introduced to data warehouses and we will get started on creating our first dimensional model.
Afterwards, we will start implementing our Data Warehouse by setting up staging tables to receive data from our source systems.
This codelab consists of a number of videos for you to watch as well as small exercises that should be completed as you move along. Exercises are clearly marked and are allocated extra time. If you complete an exercise early, you can move on by clicking "next".
You must work in the relevant programs along with completing the steps and watching the videos in this codelab.
It is a prerequisite for this learning path that you have a fundamental understanding of database concepts. You may wish to complete my codelab on Basic Database Concepts if you find this codelab difficult to complete.
Please help me understand who uses these codelabs:
Feel free to connect with me on LinkedIn and let me know what you think about this material.
Please see the video below for a general introduction to data warehouses. In the video, you will see an example of the types of business processes that a data warehouse can capture data from.
Video 1: Introduction to Data Warehousing
Data Warehouses as central repositories of data
Figure: The Data Warehouse sits in the middle between source systems and data products designed for the end user
During this series of codelabs, we will go over the steps involved in the Extraction, Transformation and Loading of Data Warehouses (the cog wheels in the figure above). This session focuses on the "E" of ETL - extraction of data.
Data Warehouses are designed to solve business problems across a variety of domains. In the example with Sophie in the video above, designing Data Warehouses allows the business user to inquire about the sales process, understanding lead time from suppliers and gauging customer satisfaction. From a managerial perspective, Data Warehouses (and the data products derived from Data Warehouses) can help solve problems across many different levels:
Figure: Levels of insights
Often the day-to-day operational tasks can be adequately served from the company's transactional system because looking up a single order in a system is efficient (remember, the system was designed to handle single operations), but once we need to aggregate data and apply different filters to data on the tactical, strategic and transformative levels,, the Data Warehouse becomes more valuable.
Please follow the install guide below to prepare for the exercises in this codelab.
Step 1 (optional): Install Astah
If you want to follow along with the same programs that I use in the videos in this learning path, you will need to download and install a trial version of Astah Professional from here: https://astah.net/downloads/
If you don't want to install Astah, you are free to use any of the free online diagramming tools available or even just create diagrams by hand.
Step 2: Install SQL Server
You can download your own limited SQL Server here:
Select to download the developer edition. The developer edition is - at the time of writing - free for unlimited time.
Step 3: Install SQL Server Management Studio (SSMS)
You also need to download and install SQL Server Management Studio 2018
Step 3: Install NorthWind source database
If you have never worked with SSMS before, you may watch this 16 minute tutorial covering the most basic elements of working in SSMS (including how to make SSMS draw visual database diagrams!)
Video: Introduction to SSMS
Dimensional modelling is "a logical design technique for structuring data so that it
is intuitive for business users, and delivers fast query performance" (Kimball, 2008 p.234). Dimensional models divide the world into facts and measurements.
Facts are typically numeric data about some measures that the company may want to know about. Facts are the "numbers of an event". They are generated by one or more business processes (i.e. one fact could be the number of units sold in the sales process of a company).
Dimensions are the context of those measures that make up the fact. They are usually textual data and describe business entities (e.g. customers, products, suppliers).
Facts are related to multiple dimensions. Dimensions may be referenced by multiple facts. In fact, we try to design dimensions so that they are useful for multiple facts.
We have three different types of facts: additive, semi-additive and factless facts. Additive facts can be summed across all dimensions, semi-additive facts may be summed across most dimensions and averaged across one or more dimensions. Typically averaged across time - for example the inventory count of a product. Summing the inventory count of a product for every day of a year yields a number that doesn't make any sense.
Just like facts, dimensions also come in different types. An important distinction is between dynamic and static dimensions. Dynamic dimensions are fed by data streams from source systems. They change as data in operational systems change - i.e. capture the change of a product category or the name of a customer. Static dimensions can be defined once-and-for-all and do not change. A good example of a static dimension is the date dimension. I was born on a Friday - and no matter how much time passes, it will always be a Friday. No matter what happens in the future, the dates 4/4, 6/6, 8/8, 10/10 and 12/12 always fall on the same weekday.
You may watch the video below to see how a restaurant generates data in multiple business processes when conducting their business:
Video: Kimball explained using Overcooked 2
Kimball advocates a four-step dimensional design process (p. 246). The four step-process involves the following four steps - but I like to add in a fifth step:
When we build data warehouses using star schemas, we denormalize data as they arrive from source systems. Dimensions have redundancy/hierarchies. See the example below of how my absolute favorite chocolate moves from a highly normalized form to a denormalized, flat dimension (Mondelēz, if you want to sponsor this - I totally take payment in chocolate):
Figure: Controlled redundancy in a product dimension
Watch the video below to see how to create UML-style ER Diagrams in Astah. You will need to know how to create ER diagrams to complete the exercise on the next page. If you want, you can also create your diagrams by hand on paper.
Video: UML-Style ER Diagrams in Astah
Remember that nice serving of cheese noodles we had after skiing in the Swiss Alps? Below is the conceptual model we built back in the first codelab. Usually clients never hand out their source systems straight of the bat when you start building data warehouse products for them. Instead, they will talk to you about how they are on a quest to become data driven (I absolutely hate that phrase, but that's a story for another time) and they need to build insights into their sales process (and a bunch of others at the same time) to enable EVERYONE to become data driven. Go through the dimensional design approach and identify the facts and dimensions in the diagram below. Then draw your star schema.
Figure: Conceptual Model of a POS system.
Menu item (or product), waiter (or sales person), and date (using order date)
Amount and quantity
Keys for the dimension tables are inferred. They are not show in the conceptual model, but as we know from any basic database course, unique key constraints are enforced to avoid duplicate rows in entity tables.
In this section, you will be introduced to the case that I use throughout my codelabs: NorthWind. You should follow along with the videos that I present in this section. Students who take my university class get to work on their own projects building slightly more complex and larger data warehouses based on a client they get to pick themselves - if you have one or more source systems at your disposal, please feel free to follow along and base your developments on that instead.
(Source: Mostly just my imagination based on the sample database NorthWind that has been retired from Microsoft)
NorthWind Traders is a wholesale company. The company buys and sells specialty food items. They engage in the B2B market (Business to Business) meaning that their customers are primarily small retail stores and supermarkets. NorthWind Traders is a small company that employs only a handful of employees and most of them are sales representatives who earn money on commissions. Their customers are located all around the world. The Vice President of Sales is called Andrew Fuller.
Andrew Fuller has requested that you help him build a data warehouse so that he is able to integrate multi-source data, model time-variant data and do basic descriptive analysis in a front-end visualisation tool of his choosing (he will get back to you on that part later in the semester as he is still trying to figure out which front end tool is best for the business).
Since Andrew is working in Sales, he has kindly requested that you start out by modelling the sales process for NorthWind. Maybe next year, you can move onto Finance, Logistics and Marketing so make sure you conform dimensions for the business. Sales is the most important thing to get started on and when you're done, we will have to negotiate if you should do finance or logistics first. But keep in mind that you will be working on those business areas and associated processes eventually.
Sales in NorthWind are commissioned by a sales representative (SR) to a customer. In order to get a basic overview of the sales, Andrew would like you to include the sales amount as well as the number of products ordered by the customer by each sales representative. Perhaps if there was a way to see if different types of employees performed better, then that would be great as well.
When NorthWind sells an item, let's say for $1000 with a 5% sales tax, then accounts receivables is debited for $1050, cost-of-goods sold is debited $650 corresponding to the price of producing (the price NorthWind paid to suppliers) that item. Then the account for revenue is credited $1000, Inventory is credited $650 and the sales tax account gets credited $50. Because of how the ERP-system is designed each of those changes on the books is entered into the Grand Ledger Table in the source system. All other financial transactions (salaries, etc) are likewise handled in the GL module. The finance business partner needs to be able to sort transactions based on accounts.
Andrew is worried that too many orders are shipped later than the date the customer requested when placing the order. The reason for this is because he is getting more and more complaints from customers through e-mail. One customer was particularly angry with him because the milk arrived 3 days late. "If I don't have milk, I don't have customers", was the basic sentiment of the email but written in a much less appealing tone. Andrew wonders if the problem relates to NorthWinds own distribution channels to the customer or if it can be traced back to certain product categories or suppliers. Oh yeah, and since it doesn't matter with delivery times for discontinued orders – you need to be able to sort on that as well.
Andrew knows that Dairy products is an especially important category that functions as a point attractor in food retailers.
Andrew is also worried that some of his employees may be slacking because they are not clearing as many orders as the other employees. He would like to be able to see who is clearing the most orders and what the total value is for each employee. In addition, he is worried that one of his employees may be engaging in fraud (nepotism) by offering higher than usual discounts to one customer, who happens to be that employee's cousin.
Andrew is also looking to renegotiate shipping contracts and would like to have an overview of orders shipped in the different regions as a basis for negotiating shipping prices with his three shippers. Perhaps there is a way to improve costs in logistics if he can somehow see which customer segments are served with the different shipping partners based on goods sold.
Logistics in NorthWind deals with all supply chain issues. That means they are both responsible for receiving goods from suppliers (each product NorthWind sells is bought from suppliers as NorthWind does not manufacture anything themselves). In order to be able to track logistics performance, it's important that NorthWind can sort products based on suppliers. There are also some requirements related to inventory management and stock reorder, but the business people haven't had the time to consider their needs in this area yet, so you will need to come back to them later about that.
Marketing needs to be able to see which products were on promotion and which customers bought during promotions. Specifics are still a bit unclear here, but the Marketing Manager will get back to you during the next few weeks.
In a normal business, things of course change over time. Therefore, it's mandatory that Andrew can see how all of the mentioned things change over time. That goes for all the areas of the business.
Repeat the 5-step process that you just went over with the Mountain Hotel for NorthWind, but this time based on the case background description provided above for NorthWind (or your own source system).
Watch the videos below to see how to create a dimensional design for NorthWind.
Video: Dimensional Design (Part 1)
Video: Dimensional Design (Part 2)
Watch the video below to see how to set up your stage tables in SSMS. I will show you how to export SQL in Astah, how to script SQL from within SSMS to document your datawarehouse. Then finally, how to create your datawarehouse database and creating stage tables for your ETL load.
Video: Getting started with your data warehouse
If you like my work, please consider sharing it!
This codelab is based on materials that I - together with colleagues - developed for my students at VIA University College in Horsens as well as the Kimball book: The Data Warehouse Lifecycle Toolkit and resources from Microsoft.
The adaptation for this codelab has been done by myself and is using only materials produced by me to avoid copyright infringements.
Any errors in this material are mine alone.