This article introduces one of the most important artifacts of a Power BI data model, maybe the most important one. This discussion can, of course, become philosophical fast; nevertheless, here is some general advice:
- If you are asked to optimize some DAX measures' performance and face a data model with more than one table but without any relationship – RUN!
- If you are asked to optimize some DAX measures' performance and face a data model that contains a single table – RUN!
Relationships - the simple explanation
There is a straightforward explanation for what relationships are: a relationship connects two tables and allows the VertiPaq engine (aka the xVelocity in-memory analytics engine) to unfold its magic.
Of course, there is a more lengthy explanation too, which is why we need more than one article to explain relationships in greater detail. Hopefully, after reading all the articles in this chapter, you will understand why this tune is playing in my head whenever I'm able to create a 1-to-many relationship with the cross filter direction property set to Single: The Thermals - The Sword By My Side [Official Lyric Video] - YouTube, and why I consider the 1-to-many (Single) relationship to be the sword of the tabular data model.
Relationships - a more lengthy explanation
The best way to explain relationships, what they are and why they are tremendously important, is to start with a data model that contains some tables but no relationships. The picture below shows this data model:
You might recognize the tables if you are familiar with the sample dataset called ContosoRetailDW (link to ContosoRetailDW). From top left to top right, the first three tables represent the product dimension, and the fourth table, FactOnlineSales (bottom right corner), represents the fact table of this simple data model.
You might wonder why this model contains three tables forming the product dimension instead of the single one you might have expected if you follow the dimensional modeling approach (creating a star schema). One reason can be that this is how these tables exist in the relational data source; in this case, however, the snowflake modeling is for demonstration purposes.
Later in this article, I try to answer whether it's a good idea to use snowflaked dimensions instead of denormalized dimension tables. For now, it's not relevant, as none of these tables are related.
I assume you are familiar with the situation of repeating values in Power BI visuals like the one below:
The repeating values are due to a missing relationship between DimProduct and FactOnlineSales. In my experience, a missing relationship is the most common reason for these repeating values.
Before relationships make the issue of repeating values a faint memory of darker times, it's essential to understand why the above table visual contains, for each ColorName, the same repeating value: the sum across all values from the column SalesAmount. Whenever a visual uses a text column, the VertiPaq engine starts its magic and shows the distinct values of that column. The data type of the column ColorName (table DimProduct) is text. The column SalesAmount (table FactOnlineSales) has the data type currency, a numeric column. When used on visuals, numeric columns are aggregated using the configured aggregation function; for SalesAmount, the aggregation function is SUM. To be precise, the VertiPaq engine does not know how the columns are used on any visual; the engine only executes the DAX query generated by Power BI Desktop or the Power BI Service after the report has been published.
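To make this more tangible, here is a simplified sketch of the kind of DAX query Power BI generates for a table visual like the one above (the actual generated query contains additional constructs, and the expression alias is made up for illustration):

```dax
// Simplified sketch of a generated DAX query for a table visual
// showing ColorName next to the sum of SalesAmount.
// Without a relationship, every ColorName shows the same grand total.
EVALUATE
SUMMARIZECOLUMNS(
    'DimProduct'[ColorName],
    "Sum of SalesAmount", SUM( 'FactOnlineSales'[SalesAmount] )
)
```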
Missing relationships between tables (I'm not talking about inactive relationships) can be created virtually by using the DAX function TREATAS (DAX Guide: TREATAS). The measure below creates a virtual relationship between the tables DimProduct and FactOnlineSales:
SalesAmount TREATAS =
CALCULATE(
    SUM( 'FactOnlineSales'[SalesAmount] )
    , TREATAS( VALUES( 'DimProduct'[ProductKey] ) , 'FactOnlineSales'[ProductKey] )
)
Using the measure creates the expected result:
If you think that DAX provides more flexibility than creating relationships, you might be correct, but this flexibility has its price: performance.
The picture below shows the timing of the measure SalesAmount TREATAS, using DAX Studio:
The total execution time of the DAX query generated from the report page TREATAS is 37ms (cold cache) on my machine.
As this article focuses on relationships, I assume that you already know how to create one. The focus here is on 1-to-many (cardinality) relationships with the Cross filter direction property set to Single.
Other articles will explain different types of cardinality or the implications (not to say the danger) of applying the value Both to the Cross filter direction property.
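As a side note, the cross filter behavior of an existing relationship can also be overridden for a single calculation with the DAX function CROSSFILTER; a sketch (the measure name is made up for illustration):

```dax
// Sketch: overriding the Single cross filter direction of the existing
// ProductKey relationship to Both, for this calculation only.
SalesAmount BothDirections =
CALCULATE(
    SUM( 'FactOnlineSales'[SalesAmount] ),
    CROSSFILTER( 'FactOnlineSales'[ProductKey], 'DimProduct'[ProductKey], BOTH )
)
```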
The following picture shows this relationship for the tables DimProduct (the table on the one-side) and the table FactOnlineSales (the table on the many-side):
After creating the 1-to-many relationships between the snowflaked dimension tables and the fact table
(DimProductCategory -> DimProductSubCategory -> DimProduct -> FactOnlineSales),
we can create the same visual without needing any explicit measure. The Server Timings of the query from the report page Relationships show that relationships speed up query performance by 38%:
To understand why the query performance is so much faster compared to the DAX solution, it's important to consider three aspects:
- Filter propagation
- The relationship artifact
- The extended table
A dedicated article will explain these foundational concepts in more detail.
Filter propagation means that the VertiPaq engine propagates the values of the column ProductKey from the table DimProduct to the fact table, leveraging the relationship; then the engine applies the aggregation/the measure to the filtered rows of the fact table. The following picture shows filter propagation in a very simplified way, assuming that a user has selected the ColorName value Orange:
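The same propagation can be made explicit in a measure. In the sketch below, the filter on DimProduct is applied inside CALCULATE; the engine propagates the surviving ProductKey values across the relationship to FactOnlineSales before aggregating (the measure name is illustrative):

```dax
// Sketch: a filter on the one-side table (DimProduct) propagates across
// the 1-to-many relationship and filters the rows of the fact table.
Sales Orange =
CALCULATE(
    SUM( 'FactOnlineSales'[SalesAmount] ),
    'DimProduct'[ColorName] = "Orange"
)
```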
Yes, it’s true: row numbers are used to identify the rows. These row numbers are stored inside the relationship artifact. They cannot be accessed using DAX or inside Power BI visualizations. Nevertheless, these row numbers are crucial, for performance but also from a technical point of view, as we have to remember that the data is stored in a columnar way. For this reason, it’s necessary to provide a method that ensures all the aggregating and filtering creates the expected result.
If you are interested in more technical details (some are gory), read the explanation Alberto Ferrari and Marco Russo provide in their epic book “The Definitive Guide to DAX.”
Rephrasing the above explanation, relationships are crucial to identifying the values (the rows) that will be aggregated: 8 + 16 + 64.
If relationships cannot be used, the complete fact table has to be scanned to identify the matching rows. Scanning that many rows is why the TREATAS solution is not as fast as the solution that can leverage relationships.
The memory footprint of relationships
The relationship artifact is vital for the performance of the VertiPaq engine. I consider a relationship an artifact because a data refresh creates or updates the relationship object. This object can be "visualized" using VertiPaq Analyzer or DAX Studio.
The following two pictures show the size of the relationship artifacts:
Snowflaked dimension tables:
Denormalized dimension table:
The difference in total size between both models is slight. Of course, this can vary between models, and the memory consumption of the relationship artifacts can differ significantly.
This tiny difference might be a reason to consider adopting a snowflaked dimension design from a relational data warehouse into the Power BI data model.
I imagine filter propagation as following a path that leads (filters) to the rows in the fact table being aggregated. Traversing the longer path that comes with a snowflaked dimension design takes more time than following the short path of a denormalized design. Also, the internal query (not the DAX query) becomes more complex with snowflaked dimensions. The following picture shows this complex query:
The following picture shows the query to answer the same question using denormalized dimension design:
The VertiPaq engine executes the simpler query faster, even if a millisecond does not matter – at least from my perspective.
But (as always), this difference can become dramatic as soon as the involved aggregations and measures become more demanding.
Does this mean that denormalized dimension design is always better than snowflaked dimensions? Of course not. It depends (as always).
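If you decide to denormalize a snowflaked dimension inside the model, calculated columns using the DAX function RELATED are one option. A sketch, assuming the cascading 1-to-many relationships from above exist; the column names are illustrative:

```dax
// Sketch: calculated columns on DimProduct that pull attributes down
// the relationship chain, so the snowflaked tables can be hidden
// (or removed after denormalizing in the source).
ProductSubcategoryName = RELATED( 'DimProductSubCategory'[ProductSubcategoryName] )
ProductCategoryName = RELATED( 'DimProductCategory'[ProductCategoryName] )
```

RELATED traverses a chain of cascading many-to-one relationships, which is why the second column can reach DimProductCategory directly from DimProduct.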
I consider a relationship a sword. If there is a snowflaked dimension, there is also a dagger, and sometimes even a little pocket knife. Handling two or more sharp things at the same time can be dangerous; think of the havoc three or more sharp blades can create if you are juggling these sharp-edged things.
Does this mean there should be only one relationship in the data model? Of course not.
Multiple swords wielded by different dimensions are a band of brothers who can easily tackle vast amounts of data.
Thanks for reading