21 February 2018

Analyzing R - with Power BI

For a couple of years now, R has been one of my favorite tools for analyzing and visualizing data. So it seems quite natural to analyze what is going on in the R universe, all the more so because I can use another great tool from my data analytics arsenal: Power BI.

Here you will find the first release of a Power BI report that provides some information about the R universe :-)

If you are not familiar with R, these sites are a good starting point:

https://cran.r-project.org/ (the official homepage)
https://www.datacamp.com/onboarding/learn?from=home&technology=r (learn R)

This is the first post in a series about the R universe, in which I use R and Power BI, and maybe some stranger things, to analyze what's going on.

What I really like about the R universe is the vast number of available R packages. I have never encountered an analytical challenge that couldn't be solved by using an existing package or a combination of packages, plus of course some R programming :-) . But this vast number of packages can also become a problem when you have to decide which one should be the go-to package for a certain kind of challenge.

Sure, there are the Task Views: curated lists that group R packages addressing closely related topics. There are lists for "TimeSeries", "WebTechnologies", and …

Here you will find a nice overview of the available Task Views: http://www.maths.lancs.ac.uk/~rowlings/R/TaskViews/ and of course the homepage also provides the lists.

Unfortunately, some of these lists do not cover all of the recently released packages. I can imagine that keeping them current is a heavy task, given the rapid growth in the number of available R packages.

For this reason I will start this project by providing answers to the following questions:

What packages have been released recently or have been updated

Development (meaning number of packages) of Task Views over time

Number of downloaded packages

Hopefully this will help me to find the proper package to solve an analytical question.

In later posts I will expand my analysis using graph processing and also some text mining, but for now it's just this :-)

I'm harvesting the data by doing some web scraping (using the package rvest) on pages like the following (examples):

https://cran.r-project.org/web/packages/ggplot2/index.html

https://cran.r-project.org/src/contrib/Archive/ggplot2/

The first link provides information about the "ggplot2" package. Here I'm especially interested in the assignment of the package to Task Views, shown in the entry "In views". The package ggplot2 is assigned to the Task View "Graphics" (no surprise there) and also to "Phylogenetics" (https://en.wikipedia.org/wiki/Phylogenetics).

The second link provides information about the evolution of the ggplot2 package, that is, its release history. Here I'm interested in the dates of the releases. This information helps me understand whether the package is actively maintained or not. But be aware that packages vary widely in nature: some "just" provide datasets, whereas others have become essential to my work with R. It's very likely that packages that only provide a dataset are updated less frequently than other packages.
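To make the two scraping targets a bit more concrete, here is a minimal sketch of how the "In views" entry and the release dates could be pulled out with rvest. It is not my final scraping code (that comes in a later post). The function names task_views_of and release_dates_of are made up for this illustration, and the code assumes that the package page is an HTML table with an "In views:" row and that the archive page is a directory listing with one table row per .tar.gz file. It needs a recent rvest (1.0 or later) for html_element() and html_text2().

```r
library(rvest)

# Task Views listed in the "In views" row of a CRAN package page.
# `page` is a parsed HTML document, e.g. the result of read_html(url).
# Assumption: the page is a table whose first cell holds the row label.
task_views_of <- function(page) {
  rows  <- html_elements(page, "tr")
  label <- html_text2(html_element(rows, "td"))  # NA for rows without <td>
  hit   <- rows[!is.na(label) & grepl("^In.views", label)]
  if (length(hit) == 0) return(character(0))     # package is in no Task View
  html_text2(html_elements(hit[[1]], "a"))       # one link per Task View
}

# File names and dates from a CRAN archive listing.
# Assumption: every release is one table row containing the .tar.gz
# file name and a date written as YYYY-MM-DD.
release_dates_of <- function(page) {
  date_rx <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
  txt <- html_text2(html_elements(page, "tr"))
  txt <- txt[grepl("\\.tar\\.gz", txt) & grepl(date_rx, txt)]
  data.frame(
    file = regmatches(txt, regexpr("[^[:space:]]+\\.tar\\.gz", txt)),
    date = as.Date(regmatches(txt, regexpr(date_rx, txt)))
  )
}

# Live calls need network access, so they only run in an interactive session.
if (interactive()) {
  pkg <- read_html("https://cran.r-project.org/web/packages/ggplot2/index.html")
  print(task_views_of(pkg))
  arc <- read_html("https://cran.r-project.org/src/contrib/Archive/ggplot2/")
  print(head(release_dates_of(arc)))
}
```

As far as I know the archive folder only lists superseded versions, so the date of the current release would have to come from the package page itself.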

From a data modeling perspective, note that there are two many-to-many relationships:

Package (m:n) Task View

A package can be assigned to none, one, or many Task Views

Package (m:n) Release Date

A package has one or more release dates

Answering the question "How have the Task Views developed over time?" can become challenging in itself, depending on the capabilities of your analytical tool. How I modeled this in Power BI will be described in one of the upcoming posts.
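Just to illustrate the shape of the data (this is not the Power BI model), the two relationships can be held in two bridge tables and combined in plain R. The sketch below uses made-up packages ("pkgA", "pkgB") and made-up dates, and the function name view_size_over_time is hypothetical. It also rests on a strong assumption: a package counts towards a Task View from the year of its first release, because the scraped pages only show the current assignment, not when the package was added to the view.

```r
# Bridge table 1: package <-> Task View (made-up sample rows)
package_view <- data.frame(
  package = c("pkgA", "pkgA", "pkgB"),
  view    = c("Graphics", "Phylogenetics", "Graphics")
)

# Bridge table 2: package <-> release date (made-up sample rows)
package_release <- data.frame(
  package = c("pkgA", "pkgA", "pkgB"),
  release = as.Date(c("2015-03-01", "2016-07-15", "2017-01-10"))
)

# Number of packages per Task View at the end of each year.
# Assumption: a package belongs to its views from its first release on.
view_size_over_time <- function(package_view, package_release) {
  # Keep only the earliest release of every package.
  ordered <- package_release[order(package_release$package,
                                   package_release$release), ]
  first_release <- ordered[!duplicated(ordered$package), ]

  # Inner join: packages without any Task View drop out here.
  joined <- merge(package_view, first_release, by = "package")
  joined$year <- as.integer(format(joined$release, "%Y"))

  # One row per (view, year), with a cumulative package count.
  out <- expand.grid(
    view = sort(unique(joined$view)),
    year = seq(min(joined$year), max(joined$year)),
    stringsAsFactors = FALSE
  )
  out$packages <- unname(mapply(
    function(v, y) sum(joined$view == v & joined$year <= y),
    out$view, out$year
  ))
  out
}

print(view_size_over_time(package_view, package_release))
```

The cumulative count is the part that tends to be awkward in a reporting tool, because every package has to be counted in all years after its first release, not only in the year in which it appeared.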

To get an idea of the number of downloads of each package, I use the publicly available logs from the RStudio CRAN mirror. This is also the reason why I use this command

install.packages(pkgs, repos = "http://cran.rstudio.com/")

whenever I install / download a package; pkgs is a character vector with the names of the packages to install. Considering the number of mirrors available, decisions based on the metric "number of downloads" have built-in uncertainty, because the logs only cover this one mirror.
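For the curious, here is a rough sketch of how one day of these logs could be turned into download counts. It assumes that the daily files are published as gzipped CSV files under http://cran-logs.rstudio.com/ with one row per download and a column named package. The helper names log_url and downloads_per_package are made up for this example.

```r
# URL of the daily log file for a given date.
# Assumption: files are named <year>/<YYYY-MM-DD>.csv.gz on the log server.
log_url <- function(day) {
  day <- as.Date(day)
  sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz",
          format(day, "%Y"), format(day))
}

# Downloads per package for one day's log.
# `log` is a data frame with one row per download and a `package` column.
downloads_per_package <- function(log) {
  counts <- sort(table(log$package), decreasing = TRUE)
  data.frame(package = names(counts), downloads = as.integer(counts))
}

# The download needs network access, so it only runs interactively.
if (interactive()) {
  tmp <- tempfile(fileext = ".csv.gz")
  download.file(log_url("2018-02-20"), tmp, mode = "wb")
  print(head(downloads_per_package(read.csv(tmp))))
}
```

Each daily file covers this one mirror only, so the counts are a lower bound on the real number of downloads rather than a total.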

In one of my next posts in this series I will also describe how I do the webscraping in much more detail, but for now, have fun finding out what's available right now.

Tags: R, PowerBI, webscraping
