Data Sources

Not long ago it was taken for granted that the source of any analytical pipeline would be a relational database. Today there are more kinds of sources available to a pipeline than ever before: from a simple text file to data in motion, from data stored in databases (relational or not) to data scraped from websites or accessed through special APIs. All of these data sources have their own intricate workings. Some of them rely on data stores that I will also use along the data pipeline, either as an intermediate touch point or as the final store from which the data is delivered to the consumer.

Today it's quite common to use two categories to describe the data that is captured and ingested into an analytical pipeline: data in motion and data at rest.

Most of the time, data in motion flows through a network of systems before it eventually comes to rest. Capturing data in motion is therefore always about capturing it as early as possible, as soon after its origin as we can, long before it reaches its final landing zone. Such a landing zone may not even exist for all the data we want to capture; in that case we are talking about transient data in motion.

It may seem much simpler to capture data at rest, but as time (or this site) will tell, this is far from the case. We have to ingest large amounts of data initially to set the pipeline in motion. After that, we have to set up mechanisms that detect changes to the data piles, so that we avoid ingesting the complete pile into our pipeline over and over again.
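To make the idea of change detection a bit more tangible, here is a minimal sketch in Python. It assumes the "pile" is a folder of files and that a small JSON manifest of content hashes is kept between runs; the names `detect_changes`, `file_fingerprint`, and the manifest layout are made up for illustration and are not part of any particular tool. Real sources often offer something better (modification timestamps, change feeds, log-based change data capture), so treat this as one possible approach only.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def file_fingerprint(path: Path) -> str:
    """Return a SHA-256 hex digest of the file's content."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so large files do not have to fit into memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def detect_changes(source_dir: Path, manifest_path: Path) -> List[Path]:
    """Return files that are new or modified since the last run.

    The manifest (a hypothetical JSON file mapping relative paths to
    hashes) is rewritten on every call. Keep it outside source_dir,
    otherwise it would be reported as a changed file itself.
    """
    previous = {}
    if manifest_path.exists():
        previous = json.loads(manifest_path.read_text())

    current = {}
    changed = []
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file():
            continue
        key = str(path.relative_to(source_dir))
        current[key] = file_fingerprint(path)
        if previous.get(key) != current[key]:
            changed.append(path)

    manifest_path.write_text(json.dumps(current, indent=2))
    return changed
```

The first call returns every file (the initial load); later calls return only the files whose content differs from the previous run. Deleted files simply drop out of the manifest in this sketch, so a pipeline that needs to propagate deletions would have to compare the old and new key sets as well.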

Here I will describe different data sources and how to tackle some of the problems that come with them. Happily, it's not just about problems, but sometimes also about the possibilities that come with using a specific data source. Without some of these data sources the analytical pipeline would not be possible at all, so a data source can also be an integral part of the analytical pipeline.

OneDrive / Excel

Sure, you may say that the combination of OneDrive / Excel is not a data source. From a certain point of view I would call it a container that can hold various objects, and one of my favorite objects is a simple Excel file. Once again you are correct: Excel is not a data store that is able to solve all your Big Data problems, not to mention all the Data Quality issues that may arise due to the lack of data types, but ...