Last Updated: 07-02-2022
This codelab takes you through some of the basic concepts of relational database models.
The codelab does not replace an introductory course to database-management systems, but should provide you with sufficient knowledge in order to take the course Data Analytics Infrastructure.
I will cover a minimal set of theories from any introductory course to databases (see list below).
A database is a collection of data. That collection of data is managed by a database management system (DBMS). A DBMS is expected to allow users:
A relational database model (RDBMS) allows for users to save structured data and define the relationship between data. In transactional systems, the database is typically designed for a single purpose and for a specific group of users. The database model represents a small part of the real world (a "mini world").
RDBMS are designed for the effective storage and handling of data. To do this data is often normalized, redundancy (having the same data appear multiple times) is kept to a minimum and the database is designed for single-row operations with no history (i.e. if a customer changes name, the database will only reflect the current name of that customer).
Data Warehouses, in contrast to transactional database systems, are designed for many users. The Data Warehouse is used for business analytics purposes typically serving users across many business domains (many "mini worlds"). The Data Warehouse should ideally offer business users a shared view on the company data - a single source of the truth - so that users irregardless of their background can view the same data and (hopefully) arrive at a shared understanding of best course of action on any business question that warranted analysing data. Some notable differences between transactional database systems and Data Warehouses are:
Designed to record
Supports transaction processing
Use requires specific knowledge
Designed to analyze
Supports cross-topical querying
Aggregate operations on large sets
Same querying on all models
Handling of complex history of change
Figure 1: Design Philosophies (DAI lecture notes)
The major components of RDBMS are the query processor which allows a user to query ("ask questions") of data, the transaction manager and the storage manager. Each of these is responsible for the effective operation of the databases they manage. For example the transaction manager ensures ACID properties.
Atomicity: Either all of a transaction is executed, or none of it.
Consistency: Data meets the expectations that we have - for example that a table reservation cannot be reserved by more than one party.
Isolation: If two or more transactions run concurrently, they must not affect one another.
Durability: Effects of transactions should not get lost even if the system fails immediately after the transaction completes.
Microsoft SQL Server (MSSQL) - Which is the RDBMS that I use for the codelabs hosted on my site. Uses t-sql syntax.
PostGres SQL - A open-source RDBMS system. Uses PostgresSQL Syntax.
MySQL - Also an open-source RDBMS system. Uses mySQL syntax.
All of the systems mentioned above use a similar syntax, but with slight variations. T-SQL has the greatest contrast compared to the others. For example Postgres syntax for selecting two tuples of data is:
SELECT tuple1, tuple2 whereas in t-sql it is:
SELECT [tuple1], [tuple2].
Similarly, data-types vary between for example Microsoft and Postgres. Another notable difference between MSSQL and Postgres is that MSSQL is case insensitive when comparing to strings.
A relational database management system typically stores data in tables:
Figure 2: Tables, tuples and columns
Within that table each column refers to an attribute and each row (also called tuple) refers to an instance of the entity the table is modelling.
Entities are nothing more than "objects that exist". An entity can be a thing, person, place, unit or object.
Entities are analogous to classes in Object Oriented Programming - so for example "The Little Mermaid" is not an entity, but an example of the entity "Movies" and "Barack Obama" is not an entity, but an example of the entity "Presidents" (or even just an example of the entity Persons - but we will get back to that later).
Entities have attributes. Attributes are properties of entities. For example, a movie will have a title ("The Little Mermaid") and a release year ("1989"). Just like a person can have attributes such as name ("Barack Obama"), hair color ("Black") as well as other attributes such as social security number, profession, etc.
Attributes can be simple or composed. Simple attributes have independent existence, whereas composites have multiple components which have independent existence from one another. For example, an entity Person (let's say me) can have the composite attribute FullName ("Astrid Hanghøj"), but each of the parts exist independently of one another - even if I changed my first name, I could still keep my last name.
When you approach modelling a database system for any given scenario, you can identify entities by looking for nouns - for example a thing (product), a person (employee), a place (store).
Relations connect two or more entity-sets. For example a President - presides - over a Country. An actor (or actress) - stars in - a movie. A movie is - owned by - a studio. A teacher - teaches - a class.
Relationships can have different multiplicities. The multiplicity of a relationship specifies the cardinality (number of instances of an entity) that can be associated with the instances of another entity.
When looking for relationships, look for verbs.
In database modelling we have multiple different types of relationships (unary, binary, ternary, quaternary). For example, the relationship between a Movie, a Studio and an Actor is a ternary relationship. In Data Warehousing - at least based on Kimballs principles - the most common type of relationship is binary, therefore I will not go into more detail here. But since you will be working on extracting data from source systems, it is important that you are aware of many different aspects of database modelling - including types of relations.
Every relationship type can be characterised in terms of its cardinalities which specifies the minimum and maximum number of relationship instances that an individual entity can participate in. For binary relationships, the different types of cardinalities can be: 1:1 (each individual entity can and must participate in exactly a one-to-one relationship with another entity), 1:* (one-to-many, each individual entity must relate to one or more of another entity), *:1 (many to one, many individual entities must relate to one and just one of another entity), and *:* (many-to-many, many individual entities relate to many others).
We can depict entities and their relationships using Entity-Relationship Diagrams (ERD). An example of such a diagram is depicted below (based on Ullman and Widom, 1997):
Figure 3: Entity-Relation Diagram depicting three entities and two relationships
Entities are modelled in square boxes, relationships are written in diamonds and attributes are listed in circles. Attributes are properties of entities.
When designing database systems, we typically work with three different types of models:
Typically used for requirement specification. May only show entities and relationships. The conceptual model may or may not have attributes listed for entities.
Shows logical organisation of data and is used to implement a database. The model will show which entities are stored in the database and their attributes including the data types of attributes. The logical model shows the cardinality of relations.
Shows a specific implementation of a database including any rules that may have been applied to the model to ensure consistency and/or controlled redundancy of data.
A key is one or more attributes in a table that uniquely defines that row in that table. The definition of an entity key is a set, K, of one or more attributes such that given any two distinct entities, E1 and E2 in the entity set, cannot have identical values for each of the attributes in the key, K (Ullman and Widom, 1997 p. 64).
If we consider the entity Movies in Figure one, then we have 5 different attributes shown. We may well consider the title as a candidate to be the key. However, a quick google search will tell us that the good folks behind The Avengers weren't particularly unique when they came up with the name for their hit movie (there's a 1960ies Spy series with the same name - I haven't watched it, but maybe I should. It looks good.).
Thus, we cannot alone define the title as key, since we will have two instances of the entity set Movies (The movie The Avengers with Spiderman and spy-series The Avengers without Spiderman) share the same key, title (The Avengers). Thus, we have to come up with multiple attributes that can serve as they key. Let's pick title AND year and solve the problem:
Figure 4: Keys indicated in the entity set for Movies
Thus, from traditional database management systems we have three different types of keys that we should try to remember: The Primary Key (that uniquely defines a row), candidate keys (keys that may serve as primary keys) and composite keys (a collection of attributes that together form the key).
In Data Warehousing, we talk about keys using a slightly different terminology than that used in traditional database design. Since we do not deal with designing databases as transactional systems, but rather the design of data warehouses to hold data from transactional systems, the primary keys of tables in the source systems have already been defined. In Data Warehousing we will typically talk about two different types of keys: business keys and surrogate keys.
Business keys are provided by the business (typically it is the primary key of an entity set in the source system). We retrieve these from the source system, but will typically not show them to the end user as they seldom hold any valuable business meaning. Perhaps except if the business key in for example a product table is the equivalent of that product's SKU (stock keeping unit). If the company internally talks about SKUs in their day-to-day operations as opposed to the product name (maybe multiple products exist with similar sounding-names?). Otherwise, keys are hidden from the end user.
Surrogate keys are, quite literally, substitute keys that we can generate to replace the business key in our entity relationships. We use surrogate keys because they allow us the ability to track changes across attributes (e.g. we might be interested in what happens if as an actor, you change your name from Emily Stone to Emma Stone with the ability to see which movies you starred in regardless of your name... perhaps your name is important to make it in Hollywood?). Surrogate keys make it possible to integrate data from disparate data sources without having to figure out what to do if primary keys are not unique across systems and you therefore end up with two non-identical rows that share the same business key.
I like to think of them like this:
Figure 5: Business Keys and Surrogate Keys
Business keys are defined by the business. They are typically primary keys of some entity tables in some source system that the business owns. Surrogate keys are keys that you make up. They are different (do not share the same value as the business key) and replace the business key in the final model in our data warehouse.
In DBMS: Typically enforce unique key constraint on table to avoid duplicate rows. Enforcing unique key constraints on an entity's primary key ensures that every row (or tuple) in our table occurs only once (not duplicate rows of the same instance of an entity). Constraints are part of the schema for the database. If we somehow erroneously decided that title was sufficient to be the primary key of the Movies entity, when declaring the unique key constraint on title we would make it impossible to enter both The Avengers the movie and The Avengers the tv series.
We can also put constraints on a single value (an attribute) for example requiring that attribute to exist (not be null - the value must simply exist when an entry is recorded in the database and it cannot arrive later).
Similarly, we constrain data when we choose data types. For example, if we chose INT to be the data type for title in Movies, we would mainly be able to store period dramas and other movies that have conveniently named their movies with titles that correspond to a specific year: e.g. 1408 (starring John Cusack), 1900 (about class struggles in Italy), 1941 (a shitty movie about Pearl Harbor) and 1984 (if you don't know this one - read the book!).
We can also enforce other semantic constraints using check constraints in t-sql to limit the values that are accepted by one or more columns. That could be useful if we want to restrict the number of genres we can enter for the genre attribute in Movies. From a business viewpoint this could make sense in some cases (an email must contain an @ character?), whereas in others it is valuable to capture as much data as possible - for example about the genre or features of movies so you can reverse engineer Hollywood and arrive at a 76,897 micro-genres!
Tables linked through foreign key constraints in order to ensure that certain attributes must make sense. If Barack Obama is (or was) a president (exists on the President table), the Country U.S.A. must exist in the Country Table.
We do (or can) use constraints in the data warehouse after transforming and loading data, however we may typically not enable foreign-key constraints while loading data since referential integrity is maintained by the source-system and nothing "happens" to the data as we load it.
Below is a receipt from a nice meal at a Mountain Hotel in Switzerland. Perhaps what one might enjoy after a nice, long day of skiing (I'll take the cheese spätzle, please!).
Figure 6: A simple receipt from Wikimedia Commons
On a piece of paper, propose a database model of the Mountain Hotel (Berghotel)'s point of sale (POS) database system. Unless we know the exact POS system that this restaurant uses, we cannot for certain know the logical or physical model of the database, so a conceptual database model will have to suffice, but try to consider making it as well-formed as possible (include attributes and think about relationships).
Take your time before moving on to the next slide to see a possible solution.
Figure 7: A proposed conceptual model for the Berghotel POS
At least three entities can be defined from the receipt on Figure 6. One may also define the Berghotel itself as an entity - in the entity set Restaurants - or infer the existence of an entity Customer.
The waiter, Ursula, has issued the receipt (receipt no. 4572) for the table (7/01) for a total amount due of EUR 54,50 including taxes on the 30th of July 2007. In order to avoid redundancy, these attributes are typically associated with an Order Header table in the POS.
The Order Line table, on the other hand, holds information about the goods ordered on that particular order. In this case, two lattes, an ice cream, a serving of cutlet, and a serving of cheese noodles (spätzle - highly recommended!).
Each of these line items refer to products sold at the mountain hotel. Each of these products has a name and a unit price.
If we want to be more specific about the diagram above, we can also identify candidate keys such as Staff Number for Waiter and Menu Item Number (or Product Number) for product, but in the above example I have chosen to only represent what can be extracted from the receipt in Figure 6.
Further, we may also approach a logical model by beginning to define data types (Waiter name should be a string) and constraints (for example, line Item is defined by the existence of a receipt).
The following codelab has been produced based on - in part - my colleagues' lecture notes from DBS at VIA University College, lecture notes from the course DAI at VIA University College and the book "A First Course In Database Systems" by Ullman and Widom (1997), as well as material from numerous google searches (thanks, Disney). References noted where relevant.
Any errors in this material are mine alone.