SQL Server 2017 Machine Learning Services with R

Analytical barriers

Many companies encounter barriers when trying to analyse their data. These barriers are usually knowledge scarcity (not all departments have the knowledge to analyse the data) and data dispersity (usually data arrives from different sources).

Enterprises divide responsibilities according to the roles or functions of their employees. Such division of work has a positive effect, especially when an enterprise is large. Small to mid-sized enterprises usually adopt such roles as well, but they normally define them at a coarser level because they have fewer employees.

With rapid market changes, the emergence of new technologies, and the need for faster adaptation, experts have observed several recurring barriers:

Data scarcity and data dispersity
Complex (and often outdated) architecture
Lack of knowledge
Low productivity
Slow adaptation to market changes (long time to market)

Many enterprises face at least one of these barriers, and Microsoft has addressed them by bringing the R language to SQL Server. Embracing an open source language and open source technology broadens the pool of knowledge, enables enterprises to use community knowledge and community solutions, and opens up and democratizes analytics. Rather than waiting on data scientists and specialists with academic subject-matter knowledge, enterprises can now share this knowledge more easily, and many data munging and data engineering tasks can be offloaded to other roles and people. This offloading also bridges the traditional gap between IT and statisticians, which resulted in low and slow productivity in the past. The gap in knowledge and skills can now be overcome by mixing different roles and tasks using R in SQL Server: data wranglers or data stewards can have R code at their disposal that helps them gain insight into data without needing to understand the complexity of the statistics behind it. Working across platforms no longer brings surprises, because IT people can use the R code provided by statisticians and data scientists, and data scientists can start learning the skills and languages found in IT.

Such interconnected and shared knowledge between two or more departments will also increase productivity. When productivity increases, statistical and predictive models can be deployed faster, changed or adapted to consumer and market changes, and made available to data engineers, wranglers, and analysts. This is certainly the way for an enterprise to improve its innovation path, maximize the potential of open source, and broaden the sandbox of experiments using different methods and models.

The last step in addressing these barriers is tackling data scarcity and complex infrastructure. As a rule, the bigger the enterprise, the higher the likelihood that its infrastructure is complex. In a complex infrastructure, data resides on different layers, at different granularities, on different platforms, with different vendors, and in different silos, which moves data analysis a step further from realization. With the introduction of R, this complexity can be overcome with simpler data connectors that offer an easier way to extract the data.

As the R language becomes more and more popular and important in the field of analytics, it is ready for enterprises of different scales and can be used independently of vendors, regardless of whether your solution is on-premises or in the cloud. The need for data movement also decreases because of the ability to access and read data directly from any hybrid system and extract only what is needed. All barriers that are present in enterprises can now be addressed faster, with less bureaucracy, better integration, and less effort.

Another important aspect of embracing an open source language, with which many big enterprises still struggle, is the general provision of open source solutions. This aspect should not be overlooked and must be taken into consideration. Microsoft took the R language on board in several steps:

Joining the R Consortium, which supports the R Foundation and key organizations working closely on developing, distributing, and maintaining the R engine, and which supports R-related infrastructure projects. One of these projects was the R-hub project (led by Gábor Csárdi), which delivered a service for developing, testing, and validating R packages.
Creating the MRAN repository of R packages under the CC license, and making CRAN packages available to the Microsoft R engine distribution and compatible with R distributions.
Making the Intel MKL (Math Kernel Library) computational functions, which improve the performance of R statistical computations, available out of the box when you download the R engine from the MRAN repository. MKL provides implementations of Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK), families of linear algebra functions that are enhanced for parallel computation. Such functions include matrix factorization, Cholesky matrix decomposition, vector and matrix additions, scalar multiplications, and so on.
Rewriting many R functions from Fortran to C++ to improve performance.
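
To make the BLAS and LAPACK operations mentioned in the list above more concrete, here is a minimal sketch in base R. The matrix sizes and variable names are illustrative assumptions; the same code runs on any R distribution, and only the linear algebra library underneath differs:

```r
# Illustrative data only: a small random matrix (sizes are arbitrary assumptions)
set.seed(2908)
X <- matrix(runif(50 * 5), 50, 5)

# Matrix product, delegated to BLAS: A = t(X) %*% X is 5 x 5 and
# symmetric positive-definite because X has full column rank
A <- crossprod(X)

# Cholesky decomposition, delegated to LAPACK: A = t(R) %*% R, R upper triangular
R <- chol(A)
chol_ok <- isTRUE(all.equal(crossprod(R), A))

# Solving the linear system A x = b, also delegated to LAPACK
b <- rep(1, 5)
x <- solve(A, b)
solve_ok <- isTRUE(all.equal(as.vector(A %*% x), b))

chol_ok
solve_ok
```

Both checks should return TRUE. A faster BLAS/LAPACK implementation would change how quickly these calls finish, not what they return (up to floating-point rounding).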

We can quickly support the claim about the MKL computational functions by comparing the R engine distribution available on CRAN with the one available on MRAN. As we have already seen, BLAS and LAPACK are vector and matrix oriented, so we will benchmark a computation on a matrix across the two R engine distributions.

The comparison is made on CRAN R 3.3.0 and MRAN R Open 3.3.0, with the following code:

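# Matrix creation
set.seed(2908)
M <- 20000
n <- 100
Mat <- matrix(runif(M * n), M, n)

# Matrix multiply: Mat %*% t(Mat) gives an M x M result
# [1] keeps only the user CPU time reported by system.time()
system.time(
  Mat_MM <- Mat %*% t(Mat), gcFirst = TRUE
)[1]

# Matrix multiply with crossprod: crossprod(Mat) computes t(Mat) %*% Mat (n x n)
system.time(
  Mat_CP <- crossprod(Mat), gcFirst = TRUE
)[1]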
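
When reading the two timings, it may help to keep in mind that the two calls do not compute the same product: crossprod(Mat) is t(Mat) %*% Mat, whereas the first call computes Mat %*% t(Mat), which is what tcrossprod(Mat) returns. The following sketch checks this on a deliberately small matrix; the sizes and the name Small are assumptions chosen only so that it runs quickly:

```r
# Deliberately small, hypothetical sizes so the check runs in a fraction of a second
set.seed(2908)
m <- 200
n <- 10
Small <- matrix(runif(m * n), m, n)

# The explicit product from the first timing is m x m
dim_mm <- dim(Small %*% t(Small))

# crossprod() returns the n x n product t(Small) %*% Small
dim_cp <- dim(crossprod(Small))
same_cp <- isTRUE(all.equal(crossprod(Small), t(Small) %*% Small))

# tcrossprod() is the counterpart of Small %*% t(Small)
same_tcp <- isTRUE(all.equal(tcrossprod(Small), Small %*% t(Small)))

dim_mm
dim_cp
same_cp
same_tcp
```

The two timings therefore measure different amounts of work as well as different code paths, so each is best compared across the two R distributions rather than against the other.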

The results, with times in seconds, are as follows:

In the following figure, you can see the difference in performance between CRAN and MRAN R engine:

Figure 1

The graph shows that simple linear algebra on matrices or vectors performs faster by a factor of 10 with MRAN (tested on my local client: Intel i7, 4 CPUs, 20 GB RAM). Note that when you run this test on your local machine, you should observe the RAM and disk storage consumption; you will see that the MRAN operation is very lightweight in comparison to the CRAN operation when it comes to RAM consumption.
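
Before running the benchmark locally, it can be useful to estimate how much memory the results alone will occupy. The sketch below is a back-of-the-envelope calculation that assumes 8 bytes per double-precision value and ignores temporary copies made during the computation; the variable names are illustrative:

```r
# Sizes taken from the benchmark above
M <- 20000
n <- 100
bytes_per_double <- 8

# Mat %*% t(Mat) is an M x M matrix of doubles
size_MM_gib <- (M * M * bytes_per_double) / 1024^3

# crossprod(Mat) is an n x n matrix of doubles
size_CP_kib <- (n * n * bytes_per_double) / 1024

# Sanity check of the 8-bytes-per-value assumption on a small matrix
# (object.size() also counts a small fixed header)
small_bytes <- as.numeric(object.size(matrix(0, n, n)))

size_MM_gib   # roughly 2.98 GiB
size_CP_kib   # 78.125 KiB
small_bytes
```

The M x M result needs close to 3 GiB on either distribution, so a machine with little free RAM may start swapping to disk regardless of which R engine is used.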

When you download the MRAN distribution of R Open, note that there will be additional information on MKL multithreaded performance functions available:

Figure 2: Source: https://mran.microsoft.com/download/
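
On an R Open installation, the number of threads that MKL uses can, to the best of my knowledge, be inspected through the RevoUtilsMath package that ships with that distribution. The following is a hedged sketch: it assumes that RevoUtilsMath and its getMKLthreads() function are present, and it falls back gracefully on a plain CRAN installation where the package does not exist:

```r
# Assumption: RevoUtilsMath is only available on Microsoft R Open / Microsoft R
# distributions; on CRAN R this branch is skipped
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  # Number of threads MKL is currently allowed to use
  mkl_threads <- RevoUtilsMath::getMKLthreads()
  message("MKL threads in use: ", mkl_threads)
} else {
  mkl_threads <- NA_integer_
  message("RevoUtilsMath not found; this R is probably not linked against MKL.")
}

# Version of the LAPACK library this R session is using (base R)
lapack_version <- La_version()
lapack_version
```

If the first branch runs, the reported thread count indicates how much parallelism the matrix benchmark can draw on.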

Many more steps were taken to reassure consumers, developers, wranglers, and managers from bigger enterprises that the R language is here to stay. Microsoft promised that, besides this provision, there is also support for general governance and that, if a company so decides, it can also receive R support at an enterprise level.

Furthermore, to support the idea of using the open source R language, one must understand the general architecture of R. The R engine is written by a core group of roughly 20 people with access to its source code (even though only six work on day-to-day R development). The members of this group not only maintain the code but are also contributors, bug fixers, and developers. So, the R engine is open source, which means that it is free software (under the GNU General Public License), but the engine is not maintained that openly. On the other hand, R libraries (or packages) are mostly community-driven contributions, which means that people in the community are free to develop and create a variety of functions to support statistical calculations, visualizations, working with datasets, and many other aspects.

In the months following the release of SQL Server 2016 (from summer 2016 onward), Microsoft also changed what is available in the different editions of SQL Server. If you visit the SQL Server 2016 editions overview at https://www.microsoft.com/en-us/sql-server/sql-server-2016-editions, you can see that under advanced analytics, basic R integration is available in all editions of SQL Server 2016, and advanced R integration (with full parallelism of ScaleR functions in the RevoScaleR package) is available only in the Enterprise and Developer editions.