One of the powerful features of the Tabular data model is the ability to extend the model by adding calculated columns and measures. Both objects are defined using DAX, but besides this similarity, they are entirely different beasts.
This article is not about how to write the best DAX regarding performance and memory consumption. It discusses the differences between calculated columns and measures and when to use which.
Another essential aspect to consider is the nature of the data model: whether it is based on a DirectQuery connection or on imported data, and whether it is a composite model.
Suppose you create a measure that performs well on imported data. You might encounter performance degradation when using the same measure with a DirectQuery source or in a composite model. If the data model uses DirectQuery, you have to be even more careful when creating calculated columns, as this might break query folding.
This article is about imported data.
This article comes with a simple pbix file, available here (it has a fact table with 200M rows).
How a data model comes to life
This series of articles focuses on the data model. The steps necessary to get the data from its various data sources into the model are not covered. But it's important to have an idea of what is going on when we hit the button "Close & Apply" inside Power Query:
The data will be loaded into the data store of the VertiPaq engine. I call this engine the heart of the Power BI dataset. This heart also beats in the Azure Analysis Services Tabular semantic model, SQL Server Analysis Services Tabular, and Excel Power Pivot.
When data is loaded into the storage of the VertiPaq engine, column data will be compressed. This compression allows squeezing large amounts of data into a Power BI dataset. The compression happens because the VertiPaq store is a column-oriented in-memory data store. These data stores are optimal for answering analytical questions quickly while wading through vast amounts of data. Next to the compression of column data, relationships will be "materialized," and some other things will happen as well. The important part for this article is the compression of column data. If you are interested in some of the gory details of column compression, I recommend reading this article:
Inside VertiPaq - Compress for success.
No matter the details of column data compression, it's crucial to notice that a column consumes memory (precious RAM). Even though memory is much cheaper than it was years ago, it remains a scarce resource.
Types of columns
A calculated column (created using DAX) consumes memory even if the column is not used in any visual or referenced by a measure (what – why are there calculated columns that are not in use?).
Calculated columns do not benefit from the same column data compression as imported columns, because they are computed only after the source data has been loaded and compressed. As a result, a calculated column will consume more memory than a column loaded from a data source or created using Power Query, and this affects query performance as well. For this reason, remember Roche's Maxim of Data Transformation: data should be transformed as far upstream as possible, and as far downstream as necessary. In other words, create a column as far upstream as you can. Here you will find the article by Matthew Roche that brings this universal truth of data transformation into existence:
Roche's Maxim of Data Transformation
As far upstream as possible means as close as possible to the point where the data is created, maybe in the LOB application that generates the source data or inside a data warehouse.
A calculated column or a DAX measure - take one
Only create a calculated column if you want to use this column as an axis inside a visual or as a slicer. Avoid creating such columns with DAX, as they will not benefit from optimal column compression.
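If such a column is really needed and cannot be created further upstream, a minimal sketch could look like the following calculated column on the sample fact table. The banding requirement and the column name are hypothetical and only serve as an illustration:
fact1 band =
SWITCH (
    TRUE (),
    'fact200million'[fact1] < 0, "negative",
    'fact200million'[fact1] < 100, "0 to 99",
    "100 or more"
)
Keep in mind that even a small text column like this adds a dictionary on top of 200M rows, so its benefit as a slicer has to outweigh the memory cost.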
Column creation upstream or a DAX measure - take two
Assuming we are close friends with the data warehouse team, we might be tempted to create the column inside the data warehouse. This temptation is valid for a couple of reasons:
- There are more people out there who can create a performant SQL statement than people who can create a performant DAX statement. Using SQL instead of DAX reduces the overall complexity necessary to create a solution.
- A column from the source system will benefit from column data compression.
- A column inside a data warehouse can be used by many applications, whereas a DAX measure can only be consumed by applications able to connect to a tabular model.
Maybe the third point is the most tempting reason to reuse an already existing column inside the Power BI dataset.
We know that queries are answered by two engines, the formula engine and the storage engine, and that the storage engine (VertiPaq) contributes a great deal to the fantastic performance of the Power BI dataset. This performance is due to the storage engine's ability to scan large amounts of column data and apply simple aggregations like SUM, AVERAGE, or other aggregation functions. Knowing this, creating a numeric column outside of the Power BI dataset seems to be the perfect solution. As we are creating the numeric column much further upstream, we know that this column will benefit from column compression when the data is loaded into the Power BI dataset.
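To make this concrete, assume the product of [fact1] and the related rate had already been computed upstream and loaded as a regular column, here hypothetically named [fact1Rate]. A plain aggregation over such a column can be answered by the storage engine alone; a minimal sketch:
// sketch, assuming a hypothetical pre-computed column 'fact200million'[fact1Rate]
total fact1 times rate (precomputed) =
SUM ( 'fact200million'[fact1Rate] )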
But wait, we must be aware that the unique values inside a column will largely determine the size of the dataset. The more unique values, the larger the dataset, the slower the query performance, and the uglier the user experience.
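A quick way to check column cardinality is the COLUMNSTATISTICS function, run as a DAX query (for example in DAX Studio). A sketch; the exact set of output columns may differ between engine versions:
// returns one row per column, including its number of unique values (cardinality)
EVALUATE
COLUMNSTATISTICS ()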
The number of unique values might outweigh the speed of the storage engine. The next articles in this series provide additional insights.
Finally, the answer to the question whether a numeric column that will be aggregated should be created outside the Power BI dataset or implemented as a measure is: it depends.
Am I safe when I use measures for numeric values that will be aggregated or referenced by other measures?
My dear Power BI friends - of course not.
Assuming we have to calculate something like quantity times price, or fact1 times rate (see the pbix file that comes with this article), this type of business logic requires using an iterator function.
The measure from the pbix file looks like this:
fact1 times rate =
SUMX (
    'fact200million',
    'fact200million'[fact1] * RELATED ( 'rules'[rate] )
)
This measure is fast; it uses the table expansion enabled by the one-to-many relationship between the tables 'rules' and 'fact200million'.
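To double-check the measure outside a visual, a small DAX query (for example in DAX Studio) can be used. A sketch that groups by the [rate] column of the 'rules' table:
// evaluates the measure once per distinct rate
EVALUATE
SUMMARIZECOLUMNS (
    'rules'[rate],
    "fact1 times rate", [fact1 times rate]
)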
As time goes by, new requirements will emerge. Maybe one of these requirements is to determine the minimum value of the measure [fact1 times rate]. A first guess might be to create something like this:
MINX ( 'fact200million', [fact1 times rate] )
The above snippet has been taken from the measure "min value of measure (very slow)." Using this measure in a card visual without any active calendar-table-related slicers will likely create an error message in the visual.
If you do not receive an error, please share the spec of the machine you are using 😊
The error appears because the snippet leads to nested iterators, meaning the operation can no longer be performed by the storage engine alone. Nested iterators force the storage engine to materialize intermediate results, and these materialized results are passed from the storage engine to the formula engine for further processing. This process takes time, and it also consumes a lot of RAM.
One goal of creating the “optimal” data model is to reduce the number of nested iterators or the depth of nested iterators. To tackle the issues that come with nested iterators, it can become necessary to expand the business logic encapsulated in the measure “fact1 times rate” and rewrite the logic inside a second measure, “min value of measure expanded”:
min value of measure expanded =
MINX (
    'fact200million',
    'fact200million'[fact1] * RELATED ( 'rules'[rate] )
)
The clear answer to the question "Do you recommend re-using [measures] or rewriting the business logic whenever needed?" is: it depends!
When writing/composing/conjuring our DAX statements, we must be aware of other datasets: the dataset we are currently working on will compete with them for resources. From my experience, the ideal measure is sufficiently fast and consumes as little RAM as possible. Later this year, I will write at least one article about performance vs. memory consumption.
This article from SQLBI shows some ways nested iterators can be optimized and also provides some ideas on how to find the proper balance between expanding the business logic and reducing the number of iterators: Optimizing nested iterators in DAX
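One pattern in this balancing act is to move the iteration from the huge fact table to the much smaller lookup table. The following is only a rough sketch, not taken from the linked article, and it is only equivalent to the expanded measure if every fact row matches a rule and all rates are non-negative:
// sketch: iterate the small 'rules' table; the context transition inside
// CALCULATE filters 'fact200million' down to the rows of the current rule
min value via rules (sketch) =
MINX (
    'rules',
    'rules'[rate] * CALCULATE ( MIN ( 'fact200million'[fact1] ) )
)
Whether such a rewrite pays off again depends on the data and on the rest of the model.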