Data Modeling · 27. Dezember 2021

Tables, columns, and some things related

A data model contains at least one table. If the data model has only a single table, this data model is special (at least). Here you will find an article that explains The danger of one-table solutions

Tables and their columns are the very foundation of Power BI data models.

The alleged simple object table can test our understanding more than once, especially if we consider that the physical store of the data is a columnar in-memory data store. After we have spent some time (sometimes this can be a lot of time) defining the tables that form the data model, data will be stored column-wise instead of row-wise.

Next to that, reading about extended tables also might raise the question, "Why should I model if there is a single view of all the tables?" If you are wondering what extended tables are, read this article, written by Jeffrey Wang some time ago: The Logic behind the Magic of DAX Cross Table Filtering The chapter relationships will provide more details about relationships in general, certain types of relationships, and the concept of extended tables..

Don't feel bad if you read the article more than once 😊 I have read the article at least a hundred times, and I'm still reading it from now and then.

Tables

There is a straightforward explanation of what tables are:

Tables use rows and columns to structure data.

It's important to understand that much more can be said about tables, especially columns, e.g., considering different data types and how they will affect our data modeling. When this becomes important, this will be addressed in a dedicated article.

This article concentrates on the two table types in dimensional modeling

Dimension tables
Fact tables

Dimension tables

Dimension tables provide context to the fact tables. Context can be:

When did something happen (dimension date/time)
Who purchased something (dimension customer/client)
What has been purchased (dimension customer/client)

The following picture shows a very simple representation of a dimension table representing dates:

A date table provides to a fact: the when.

It's important to understand that a row represents a single object instance inside a dimension table. Here the modeled object is the date.

As we are talking about data modeling for Power BI: dimension tables are used on the one side of a relationship. It's important to remember that defining a relationship requires a single column in both tables. It's impossible to define a relationship based on multiple columns (aka a compound key). If you have a background in relational data modeling, you might consider this a limitation; maybe it is. But, it's also one of the ingredients that make the data analysis using Power BI or Analysis Services Tabular data models so fast. You will find a detailed explanation of why data analysis is so fast in the chapter (relationships)

It's crucial to remember that the instance modeled with the dimension table must be identifiable by a single column. This column is then called the key column of the dimension.

If there is a table in your model where multiple columns form the key column, you must concatenate these columns and create a single column.

Often dimension tables are wide, meaning they have a lot of columns. These columns can be considered attributes of the key column that provide additional context and are fueling the analysis of the facts. Sometimes it's easy deriving attributes from the value of the key column itself, like the year or the weekday. Often attributes like product color or product category are available in source systems. When we are defining the attributes needed, we must be aware that there is a logical relationship between the key column and each attribute column. This logical relationship requires that a key-value only is attributed to one value inside the attribute column. For example, only one value from the attribute column Year is attributable to the key-value "2019-10-04"; this is "2019."

As natural and straightforward as the above may sound, we likely find attributes where this unique relationship between the key column and an attribute column does not exist. If we want to analyze our sales data by our customers' hobbies, it's more than likely that many customers have more than one hobby. If we find an attribute like this, then we have to create additional tables and define many-to-many relationships. You will find more details on how this will affect the data model (many-to-many)

Sometimes it turns out that defining an attribute becomes complex e.g.classifying a customer into new, recurring, etc. Defining the states can reveal that the state of a customer can change over time and out of a sudden you might consider creating more tables being able to model a simple "attribute" column.

Identifying the dimensions is simple. Defining the attributes columns can be complex, especially if the columns violate the below-mentioned laws. If this happens, more complex data modeling techniques are required.

Dimension tables - laws:

A row inside a dimension table represents an instance of the modeled object (a date, a customer, a product, …)
Power BI data modeling requires that a single column identifies the instance. This column is called key column and forms the relationships with one or more fact tables (at least most of the time).

Fact tables

From my experience, all Power BI initiatives started with a fact table in mind, an idea like

We do need to analyze our sales
Let's track our payments
What's the conversion rate from visitor to client

Whenever I imagine a fact table, I imagine an event (a specific sale) or a process (from order placement to delivery). Next, I itemize how I want to analyze the event/the process. If you are going to explore your sales data, you likely come up with items like this: sales quantity, sales amount, discount … These items are called facts. Facts are measuring the event/the process you want to analyze.

Starting a new analytical initiative can be challenging. The Kimball Enterprise Data Warehouse Bus Architecture will help if the process is not familiar. It will help to focus thoughts and help to identify the facts most commonly used for similar analytical initiatives: Enterprise Data Warehouse Bus Architecture - Kimball Group

The structure of a fact table is different from a dimension table. A fact table does not have to have a single column that identifies a single row. If it does, you may want to get rid of it because of its memory footprint (see here for a more detailed explanation about the impact of the column size: Why size matters – column size and data modeling). Instead, the key columns of the dimension tables are forming a compound key. Next to the dimension key columns, there are fact columns, and each fact represents a measurement for a specific row. The following picture shows an example of a fact table:

I often start the development of a model by itemizing the facts. Itemizing all the required facts allows for clarity about three important characteristics of each fact. These characteristics are dimensionality, granularity, periodicity of measurement. All three characteristics will affect data modeling.

Dimensionality describes the dimensions attributed to a specific fact. Facts are often required for calculations, either a calculated column or a measure. We must check if each dimension we want to use for the overall analysis of the event / the process can be attributed to the fact. If not, we should store this fact in a separate fact table. Suppose a percentage value has to be used to calculate a discount, but this percentage is attributed only to a product and a date. In that case, we should avoid storing the percentage value in the fact table above. No matter if the percentage value is used to calculate each row's discount value.

Granularity describes the level of detail available for a particular fact. When itemizing the facts, it's also essential to check if the fact is known at the most detailed level for the dimension. Let's consider the percentage value from above. It can be possible that the value is provided/available "only" for the product category level, but not for the product level, the most detailed level inside the product dimension.

Periodicity of measurement describes the nature of the measurement itself. There are three different types of measurement

Atomic transaction fact
An individual event is tracked inside this type of fact table, simple aggregation functions like SUM, MIN, MAX, AVERAGE, COUNT aggregate measurements by all the dimensions.
Snapshot fact
Snapshot fact tables store periodical measurements. The recorded facts must not be aggregated across all dimensions. An example of this is the number of employees (a fundamental fact). This fact might be measured at the end of each month but must not be aggregated by a simple aggregation like SUM to determine the number of employees at the end of a given quarter.
Accumulated snapshot fact
Accumulated snapshot facts store information about multiple events in a single row, like the date of order placement, the date of shipment, and the delivery date. This kind of fact table allows analyzing the elapsed time between different states. The chapter A different approach to the event-in-progress problem provides more insight into this kind of analysis.

The book "Star schema – the complete reference" by Chris Adamson contains more information about the different types of fact tables.

Fact tables -laws:·

Only store facts in the same fact table if they match
- Dimensionality
- Granularity
- Periodicity of measurement
There are no NULL values for dimension keys

The next chapter (article) describes how calculated columns and measures will affect data modeling.

Enjoy reading and happy data modeling!

Kommentar schreiben

Kommentare: 0