This site will be a container for all my musings about the analytical pipeline.
For this reason it is necessary to define the Analytical Pipeline, or at least to define this pipeline from my point of view. From a general perspective (be aware that this is my perspective), five activities are necessary to build an analytical pipeline. These activities are:

- connect (target the data sources)
- ingest (pull the data into the pipeline)
- store (persist the data in one or more data stores)
- process (transform the data)
- deliver (hand the result to its audience)
The overall goal of an analytical pipeline is to answer an analytical question. To achieve this goal, different data sources have to be targeted and their data has to be ingested into the pipeline and properly processed. During these first steps, the ingested data often has to be stored in one or more data stores, each used for its own type of usage. Finally, the result of the data processing has to be delivered to its users, its audience. Depending on the nature of the question, different processing methods and different data stores may be used along the flow of the data through the pipeline.
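To make this flow a little more concrete, here is a minimal sketch of the five activities as plain Python functions. Everything in it (the function names, the sample rows, the in-memory data store) is a hypothetical illustration of the idea, not part of any real pipeline or product.

```python
from statistics import mean

def connect(source: str) -> list[dict]:
    # Stand-in for targeting a data source (a database, a REST API, a file).
    # The hard-coded rows are made-up sample data.
    return [{"region": "EMEA", "revenue": 120}, {"region": "APAC", "revenue": 95}]

def ingest(rows: list[dict]) -> list[dict]:
    # Stand-in for pulling the raw data into the pipeline.
    return [dict(row) for row in rows]

def store(rows: list[dict], data_store: dict) -> None:
    # Stand-in for persisting the data in a store suited to its usage.
    data_store["raw"] = rows

def process(data_store: dict) -> dict:
    # Stand-in for transforming the data to answer the analytical question.
    return {"avg_revenue": mean(row["revenue"] for row in data_store["raw"])}

def deliver(result: dict) -> None:
    # Stand-in for handing the result to its audience, e.g. a report.
    print(result)

# Wire the five activities together in their "guidance" order.
data_store: dict = {}
store(ingest(connect("sales")), data_store)
deliver(process(data_store))
```

In a real pipeline each of these stand-ins would be replaced by a connector, a data store, a processing engine, and a reporting layer, and the boundaries between them would be far less tidy, which is exactly the point of the next paragraph.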
I put a direction to these activities, but I added this direction mainly to spur your critical mind, because in practice the flow is rarely that linear. I believe these activities are tightly related, and the sequence mentioned above should serve only as guidance.
I will use blog posts to describe how different activities are combined to answer analytical questions. In most of my upcoming blog posts I will link to different topics from the activities used in the pipeline. Each activity has its own menu and by itself represents an essential part of the analytical pipeline.
Hopefully this site will help its readers as much as it helps me to focus on each activity, always knowing that most of the time more than one activity has to be mastered to find an answer to an analytical question.
I love building Power BI solutions, solutions that help my colleagues make better decisions. I'm very much interested in the overall data architecture and love creating data models that sometimes look like a star, but often don't because there are more than six tables 😱. To feed these models with data, sometimes we have to do heavy data massaging, and sometimes we don't. Content creation starts when the model is done (at least for a few months); this last step is often called data visualization.

When doing all the fun things mentioned above, it's easy to overlook a very important fact: when we hit "Get data," a data source is created inside the Power BI file [1]. This works seamlessly 99.999% of the time, except when someone migrated to a new SAP BW version but forgot to tell the client team that a new connector must be rolled out to client machines, or when the data source is a REST API endpoint with poor documentation that does not reveal how to authenticate. "Get data" works so seamlessly in Power BI Desktop that people wonder why they cannot configure an automatic data refresh after publishing the Power BI file to their workspace and suddenly need a data gateway connection [2].

I started working with my current employer in 2017, and since then tens of thousands of Power BI apps, many more workspaces [3], and a bunch of datasets have come into existence. On average, there are 1.6 datasets per app, counting only apps used in production. Since that start, colleagues of mine have moved many data sources from our basement to another basement owned by someone else. Some of these databases have been migrated to Azure SQL DBs; some are still running on virtual machines.
The on-premises gateway is the untiring component that uncomplainingly "manages" a great deal of our data sources. To be precise (and precision is essential when discussing data sources and gateway management): of course, the on-premises gateway is not managing the connections; it's more like a humble and silent servant, never seen, never praised.
This is the start of a series of articles inspired by the feedback I got on one of my recent blog articles, "Have data will travel, or should I stay or should I go".
The series will encompass the following aspects:
[1] With the advent of the Power BI project file format (pbip), I will no longer use the term pbix to describe the artifact I create using Power BI Desktop. Instead, I will call this artifact "PBI file", no matter whether it's in the pbix or the pbip file format.
[2] Do you wonder why people do not know they need a data gateway connection when they want to create an automatic refresh sourcing data from an on-premises data source? I assume they did not read the "Power BI 101 - the survival guide" document, which was sent to them automatically after they installed Power BI Desktop 😉
[3] I do not count "My workspace"; I never do that.