Mincing Data - Gain insight from data

This site will be a container for all my musings about the analytical pipeline.

For this reason it will be necessary to define the Analytical Pipeline (at least define this pipeline from my point of view). From a general perspective (be aware that this is my perspective) five activities are necessary to build an analytical pipeline, these steps are

source

ingest

process

store

deliver


 The overall goal of an analytical pipeline is to answer an analytical question. To achieve this overall goal, different data sources have to be targeted and their data has to be ingested into the pipeline and properly processed. During these first steps the data ingested into the pipeline often has to be stored in one or more different data stores, each is used or its special type of usage, finally the result of the data processing has to be delivered to its users, its audience. Depending on the nature of the question, different types of processing methods and also different data stores may be used along the flow of the data throughout the pipeline.

 

I put a direction to these activities, but I also added this direction to spur your critical mind, because

  • source data is of course also stored somewhere
  • to target the source can become a challenge in the pipeline
  • to deliver processed data to the its audience can become a another challenge  if you try to deliver to mobile workforce
Successfully ingested data often gets processed more than once to answer different questions, and during its existence (inside an analytical data platform) this data will be transformed into various shapes, wrangled by different algorithms, and ingested into different "data stores".

For this reason I believe that these activities are tightly related, and the above mentioned sequence of these activities will just aid as a guidance.

 

I will use blog posts to describe how different activities are combined to answer analytical questions. In most of my upcoming blog posts I will link to different topics from the activities used in the pipeline. Each activity has its own menu and is by itself representing an essential part in analytical pipeline.

 

Hopefully this site will help its readers as much as it helps me to focus on each activity always knowing that most of the time more than one activity has to be mastered to find an answer to an analytical question.


Power BI, DAX, and relational division - a common search task with a nerdy name

Note: here is the Power BI file, the file contains the data model and all the DAX

 

I tend to differentiate between two categories of DAX statements

  • DAX statements that calculate a numeric expression and
  • DAX statements that are searching for specific members inside a column or table

Sometimes, things become a little blurry as a DAX statement may contain parts of both categories.

A simple example of calculating a numeric expression, e.g., is the calculation of the YoY Growth of Total sales. An example of a search is looking for the TOP N customers.

 

This article is about searching.

 

Quite often, I encounter tasks that are similar to the following one: Find all the customers from the US, with a Total Sales > xyz. Depending on the data model, this type of question can be answered quickly, even without using DAX.

But sometimes, questions become more tricky. Take some time and consider the following question: Find all the customers from the US, with a Total Sales > xyz, who bought red and green products together. Sometimes rigor is put to this question by adding "and only red and green products, no other colored products in addition" to the question.

Answering this kind of question using Power BI has to tackle the following underlying challenges:

  • Multiple selections of values (colors) are combined by an OR and not by an AND
  • Selecting only two colors means that the customers have to be identified who bought the selected colors and just the selected colors

Some theoretical background

I'm not aware of any other term or phrase used to name this specific task that is different from this one: relational division. For this reason, I will stick to this phrase, and admittedly I like the nerdy touch that comes with it. After introducing the name, it makes sense to explain what's behind the name. Relational Division is one of the basic operations of relational algebra invented by E.F. Codd ("A Relational Model of Data for Large Shared Data Banks", 1970). This operation uses one table (the divisor) to divide another table (the dividend), the result (the quotient) of this relational division, often a single column that represents a subset of the domain of the dividend, then can be used to further refine the analysis of the available data.

If you think this sounds strange or weird, you are right. But being able to name a problem very often is the first step in solving the specific problem and similar problems more easily. Nevertheless, it's time to provide an example. The data has been taken from here (http://dbazine.com/ofinterest/oi-articles/celko1/), as this is very popular regarding the relational operation – relational division.

 

The task is simple: Find the pilots who can fly the planes inside a hangar.

 

As we are using Power BI, the 1st step is to look at the data model:

mehr lesen 0 Kommentare