Mincing Data - Gain insight from data

This site will be a container for all my musings about the analytical pipeline.

For this reason it is necessary to define the analytical pipeline (at least from my point of view). From a general perspective (be aware that this is my perspective), five activities are necessary to build an analytical pipeline. These activities are:

  • source
  • ingest
  • process
  • store
  • deliver


The overall goal of an analytical pipeline is to answer an analytical question. To achieve this goal, different data sources have to be targeted, and their data has to be ingested into the pipeline and properly processed. During these first steps, the ingested data often has to be stored in one or more data stores, each used for its special type of usage. Finally, the result of the data processing has to be delivered to its users, its audience. Depending on the nature of the question, different processing methods and different data stores may be used along the flow of the data through the pipeline.
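Purely as an illustration of this flow, here is a minimal Python sketch that chains the five activities. All names, the CSV source "orders.csv", and the SQLite store are my own assumptions, not a recommendation for any particular tooling.

```python
# Minimal sketch of the five activities: source -> ingest -> process -> store -> deliver.
# The CSV source, the SQLite store, and every name below are illustrative assumptions.
import csv
import sqlite3


def ingest(source_path: str) -> list[dict]:
    """Ingest: read raw rows from the targeted source."""
    with open(source_path, newline="") as f:
        return list(csv.DictReader(f))


def process(rows: list[dict]) -> list[tuple]:
    """Process: shape the raw rows to answer the analytical question (here: revenue per order)."""
    return [(r["order_id"], float(r["quantity"]) * float(r["unit_price"])) for r in rows]


def store(rows: list[tuple], db_path: str = "pipeline.db") -> None:
    """Store: persist the processed rows in a data store suited to their usage."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS revenue (order_id TEXT, amount REAL)")
    con.executemany("INSERT INTO revenue VALUES (?, ?)", rows)
    con.commit()
    con.close()


def deliver(db_path: str = "pipeline.db") -> None:
    """Deliver: hand the processed result to its audience (here: simply printed)."""
    con = sqlite3.connect(db_path)
    for order_id, amount in con.execute("SELECT order_id, amount FROM revenue"):
        print(order_id, amount)
    con.close()


if __name__ == "__main__":
    store(process(ingest("orders.csv")))  # assumes the source file "orders.csv" exists
    deliver()
```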

 

I put these activities into a sequence, but I also added this direction to spur your critical mind, because

  • source data is of course also stored somewhere
  • to target the source can become a challenge in the pipeline
  • to deliver processed data to its audience can become another challenge if you try to deliver to a mobile workforce
Successfully ingested data often gets processed more than once to answer different questions, and during its existence (inside an analytical data platform) this data will be transformed into various shapes, wrangled by different algorithms, and ingested into different "data stores".

For this reason I believe that these activities are tightly related, and the above-mentioned sequence serves only as guidance.

 

I will use blog posts to describe how different activities are combined to answer analytical questions. In most of my upcoming blog posts I will link to different topics from the activities used in the pipeline. Each activity has its own menu and represents an essential part of the analytical pipeline in its own right.

 

Hopefully this site will help its readers as much as it helps me to focus on each activity, always knowing that most of the time more than one activity has to be mastered to answer an analytical question.


A song to the unnoticed heroes of every successful Power BI solution - the on-premises data gateway, the vnet gateway, and the shareable cloud connection

I love building Power BI solutions, solutions that help my colleagues make better decisions. I’m very much interested in the overall data architecture and love creating data models that sometimes look like a star, but often they don’t because there are more than six tables 😱. To feed these models with data, sometimes we have to do heavy data massaging, and sometimes we don’t. Content creation starts when the model is done (at least for a few months). This last step is often called data visualization. When doing all the fun things mentioned above, it’s easy to overlook a very important fact: when we hit “Get data,” a data source is created inside the Power BI file [1]. This works seamlessly 99.999% of the time, except when someone migrated to a new SAP BW version but forgot to tell the client team that a new connector must be rolled out to client machines, or the data source is a REST API endpoint with poor documentation that does not reveal how to authenticate. “Get data” works so seamlessly in Power BI Desktop that people wonder why they cannot configure an automatic data refresh after publishing the Power BI file to their workspace and suddenly need a data gateway connection [2].

I started working with my current employer in 2017, and since then tens of thousands of Power BI apps, many more workspaces [3], and a bunch of datasets have come into existence. On average, there are 1.6 datasets per app, counting only apps used in production. Since that start, colleagues of mine have moved many data sources from our basement to another basement owned by someone else. Some of these databases have been migrated to Azure SQL DBs; some are still running on virtual machines.

 

The on-premises gateway is the untiring component that uncomplainingly “manages” a great deal of our data sources. To be precise, and precision is essential when discussing data source and gateway management, the on-premises gateway is of course not managing the connections; it’s more like a humble and silent servant, never seen, never praised.

 

This is the start of a series of articles inspired by the feedback I got on one of my latest blog articles, “Have data will travel, or should I stay or should I go”.

The series will encompass the following aspects:

  1. Gateways and cloud connections - how do they work? I use this article to explain how the gateways and cloud connections work, using my own words coupled with some experience. Please be aware that my article is not meant to replace the fantastic documentation. In this article, I will focus on the on-premises data gateway because the vnet gateway (unfortunately) and the new cloud connection are still in preview. Nevertheless, the latter two will not be forgotten.
  2. Power BI on-premises data gateway: what you need to know
  3. What you must tell your colleagues (excerpt from: Power BI 101 - the survival guide). Depending on the type of organization you are part of, colleagues are likely asking you to create a new gateway connection, or they can create new connections on their own. The two points below are what I consider the most important things they have to know. I will elaborate more on these two topics in the article and add other aspects:
    1. all data sources forming a dataset must leverage the same gateway cluster
    2. if people are going to leave the team, make sure the team is aware of the data source names and knows the passwords of any data sources involved
  4. Ideas on how to manage your on-premises data gateways from a technical point of view with a touch of an organizational perspective. This article will be about building clusters, aligning your gateway clusters with different workloads, like import vs. DirectQuery, or aligning gateway clusters with different environments, like dev, test, and prod.
  5. Some gory technical and security-related details
  6. Differences between the different types of gateways/connections, advantages and disadvantages
  7. Avoid unnecessary data travel whenever possible (done; see the above-mentioned article). This article is about avoiding unnecessary data travel, assuming the data and the Premium capacity are located in the same Azure region, which is not the home region of your Power BI tenant.

[1] With the advent of the Power BI project file format (pbip), I will no longer use the term pbix to describe the artifact I create using Power BI Desktop. Instead, I will call this artifact “PBI file,” no matter whether it’s in the pbix or the pbip file format.

[2] Do you wonder why people do not know they need a data gateway connection when they want to create an automatic refresh sourcing data from an on-premises data source? I assume that they did not read the “Power BI 101 - the survival guide” document, which was sent to them automatically after they had installed Power BI Desktop 😉

 

[3] I do not count “My workspace”; I never do that.
