Using R Code in Transact-SQL (SQL Server R Services)
SQL Server 2016 and later

Updated: October 14, 2016


Applies To: SQL Server 2016
This tutorial provides very short examples of the syntax for working with R in SQL Server R Services.
This is a good place to start if you are new to SQL Server R Services and want to know the basics of how to define R script in a
T-SQL stored procedure, how to use data as input, and how to define outputs.
Prerequisites: To run R code in SQL Server R Services, you must be connected to an instance of SQL Server where R Services is
already installed.

Part 1 - Basic Operations


In the following sections, you'll learn how to extend the stored procedure by adding parameters, and how to configure the
inputs and outputs.

1.1. Check whether R is Working in SQL Server


This query uses the system stored procedure, sp_execute_external_script, which is how you call R in the context of SQL Server.
This example passes a SQL query as input to R, and R returns that data frame to SQL Server.
1. In the Windows Start Menu, locate Microsoft SQL Server Management Studio.
2. If you cannot find it, you might need to install it: Download SQL Server Management Studio.
3. In the Connect to Server dialog box, type the name of a server or instance that has SQL Server R Services enabled. For
these examples, be sure to log in using an account that has the ability to create a new database, run SELECT statements,
and view tables.
4. Open a new Query window and paste in this statement, just to check that everything is operational:
sql
exec sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- InputDataSet',
    @input_data_1 = N'select 1 as hello'
with result sets (([hello] int not null));
go

1.2. Create some simple test data


Before developing a complex data science solution, it's important to understand the differences between SQL Server and R.

1. Create a temporary table of data by running the following T-SQL statement.


sql
CREATE TABLE #MyData ([Col1] int not null) ON [PRIMARY]
INSERT INTO #MyData VALUES (1);
INSERT INTO #MyData VALUES (10);
INSERT INTO #MyData VALUES (100);
GO

2. When the table has been created, use the following statement to query the table:
sql
SELECT * from #MyData

Results

Col1
1
10
100

1.3. Get the test data using R script


After the table has been created, run the following statement:
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'OutputDataSet <- InputDataSet;'
    , @input_data_1 = N'SELECT * FROM #MyData;'
WITH RESULT SETS (([NewColName] int NOT NULL));

It should return the same values but with a new column name.
Notes
The @language parameter defines the language extension to call, in this case, R.
In the @script parameter, you define the commands to pass to the R runtime as Unicode text. You can also add the text to
a variable of type nvarchar and then call the variable.

The line N'OutputDataSet <- InputDataSet;' passes the input data contained in the default variable name
InputDataSet, to R and then back to the results without any further operations. Note that R is case-sensitive; therefore,
both the input and output variable names must use the correct casing or an error will be raised.
To specify a different input or output variable name, use the @input_data_1_name or @output_data_1_name parameter, and type a
valid SQL identifier. In the following example, the names of the output and input variables have been changed to SQLOut and
SQLIn respectively:
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'SQLOut <- SQLIn;'
    , @input_data_1 = N'SELECT 12 as Col;'
    , @input_data_1_name = N'SQLIn'
    , @output_data_1_name = N'SQLOut'
WITH RESULT SETS (([NewColName] int NOT NULL));

Notes
The required parameters @input_data_1 and @output_data_1 must be specified first, in order to use the optional
parameters @input_data_1_name and @output_data_1_name.
SQL and R do not support the same data types; therefore, type conversions very often take place when sending data from
SQL Server to R and vice versa. For more information, see Working with R Data Types.
Only one input dataset can be passed as a parameter, and you can return only one dataset. However, you can call other
datasets from inside your R code and you can return outputs of other types in addition to the dataset. You can also add
the OUTPUT keyword to any parameter to have it returned with the results.
The schema for the returned dataset (an R data.frame) is defined by the WITH RESULT SETS clause. Try omitting this and
see what happens.
Tabular results are returned in the Values pane. Messages returned by the R runtime are provided in the Messages pane.

1.4. Generate results using R


You can also generate values using just the R script and leave the input query string in @input_data_1 blank. Or, you can use a
valid SQL SELECT statement as a placeholder but not use the SQL results in the R script.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N' mytextvariable <- c("hello", " ", "world");
        OutputDataSet <- as.data.frame(mytextvariable);'
    , @input_data_1 = N'SELECT 1 as Temp1'
WITH RESULT SETS (([col] char(20) NOT NULL));

Results

col
hello
(blank)
world
Now, try a different version of the Hello World sample provided above.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'OutputDataSet <- data.frame(c("hello"), " ", c("world"));'
    , @input_data_1 = N''
WITH RESULT SETS (([col1] varchar(20), [col2] char(1), [col3] varchar(20)));

Results

col1     col2     col3
hello             world

Note that both statements create a vector with three values, but the second example returns three columns with a single row,
and the first returns a single column with three rows. Why?
The reason is that R provides many ways to work with columns of values: vectors, matrices, arrays, and lists. These operations,
while powerful and flexible, do not always conform to the expectations of SQL developers. Some R functions will perform implicit
data object conversions on lists and matrices.
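
The difference is easier to see by running the two R expressions on their own and checking the dimensions of each result. The following is a minimal sketch that assumes only base R; the names one_column and three_columns are illustrative and do not appear in the tutorial.

```r
# A character vector passed to as.data.frame() becomes ONE column,
# with one row per element of the vector.
mytextvariable <- c("hello", " ", "world")
one_column <- as.data.frame(mytextvariable)
dim(one_column)     # 3 rows, 1 column

# Separate arguments passed to data.frame() each become their OWN column,
# so three length-1 values give one row with three columns.
three_columns <- data.frame(c("hello"), " ", c("world"))
dim(three_columns)  # 1 row, 3 columns
```

Checking dim() before writing the WITH RESULT SETS clause is one way to confirm how many columns the clause has to declare.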

Tip

Always verify your results and determine how many columns of data your R code will return, and what the data types will be.
Regardless of whether your R code uses matrices, vectors, or some other data structure, remember that the result that is
output from the R script to the stored procedure must be a data.frame.

Part 2 - Data conversion and other issues


This section provides a quick overview of some issues to be aware of when running your R code in SQL Server.
Data types: understanding cast and convert options as well as implicit conversions
Tabular result sets: anticipating the ways that R can change columns and rows of data

2.1 Implicit coercion of data types

R and SQL Server don't use the same data types, so you must be aware of the restrictions when you move data between R and
the database. The following examples illustrate some common issues.
1. Run the following statement to perform matrix multiplication using R. In this script, the single column of three values is
converted to a single-column matrix. Then, R implicitly coerces the second variable, y, to a single-row matrix to make
the two arguments conform.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'
        x <- as.matrix(InputDataSet);
        y <- array(12:15);
        OutputDataSet <- as.data.frame(x %*% y);'
    , @input_data_1 = N'SELECT [Col1] from #MyData;'
WITH RESULT SETS (([Col1] int, [Col2] int, [Col3] int, Col4 int));

Results

Col1     Col2     Col3     Col4
12       13       14       15
120      130      140      150
1200     1300     1400     1500

2. Now run the next script, which is similar, and see what happens when you change the length of the array.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'
        x <- as.matrix(InputDataSet);
        y <- array(12:14);
        OutputDataSet <- as.data.frame(y %*% x);'
    , @input_data_1 = N'SELECT [Col1] from #MyData;'
WITH RESULT SETS (([Col1] int));

Results

Col1
1542

This time R returns a single value as the result. This result is valid because the two arguments are vectors of the same length;
therefore, R will return the inner product as a matrix.
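
Both products can be reproduced in a plain R session to see the shapes involved. This is a sketch under the assumption that a small hand-built data frame stands in for InputDataSet; the names y4, y3, outer_result, and inner_result are illustrative.

```r
# Stand-in for InputDataSet: the three values stored in #MyData.
InputDataSet <- data.frame(Col1 = c(1L, 10L, 100L))
x <- as.matrix(InputDataSet)    # 3 x 1 matrix

# x %*% y: the length-4 array is treated as a 1 x 4 row,
# so the product is a 3 x 4 matrix.
y4 <- array(12:15)
outer_result <- x %*% y4
dim(outer_result)

# y %*% x: the length-3 array is treated as a 1 x 3 row,
# so the product is a 1 x 1 matrix holding 12*1 + 13*10 + 14*100.
y3 <- array(12:14)
inner_result <- y3 %*% x
dim(inner_result)
inner_result[1, 1]              # 1542
```

Because the shape of the product depends on the order and length of the operands, the column list in WITH RESULT SETS has to change with it: four columns in the first case, one in the second.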

2.2 Merge or multiply columns of different length


The following script illustrates some behaviors of R that might not conform to the expectations of a database developer.
The script defines a new numeric array of length 6 and stores it in the R variable df1. The numeric array is then combined with
the integers of the #MyData table, which contains 3 values, to make a new data frame, df2.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'
        df1 <- as.data.frame(array(1:6));
        df2 <- as.data.frame(c(InputDataSet, df1));
        OutputDataSet <- df2'
    , @input_data_1 = N'SELECT [Col1] from #MyData;'
WITH RESULT SETS (([Col2] int not null, [Col3] int not null));

Results

Col2     Col3
1        1
10       2
100      3
1        4
10       5
100      6

There are many functions in R that create tabular output but perform quite different operations on the values depending on the
R data object. Because this Transact-SQL stored procedure requires that both inputs and outputs be passed as a data.frame, you
will frequently be using functions to convert columns and rows to and from data frames.
If you ever have any doubt as to which R data object is being used, add the R str() function or one of the identity functions
(is.matrix, is.vector, and so on) to inspect the results and get the actual schema and value types.
For more information, see this article by Hadley Wickham on R Data Structures.
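
The recycling behavior in this section, and the inspection functions just mentioned, can be tried in a standalone R session. The sketch below assumes a hand-built data frame in place of InputDataSet; the intermediate name combined is illustrative.

```r
# Stand-in for InputDataSet: the three values stored in #MyData.
InputDataSet <- data.frame(Col1 = c(1L, 10L, 100L))

df1 <- as.data.frame(array(1:6))
combined <- c(InputDataSet, df1)  # c() on data frames yields a plain list
is.data.frame(combined)           # FALSE
is.list(combined)                 # TRUE

# as.data.frame() recycles the 3-value column to match the 6-value column.
df2 <- as.data.frame(combined)
nrow(df2)                         # 6
df2[[1]]                          # 1 10 100 1 10 100

# str() prints the column names and types of the final object.
str(df2)
```

Recycling only succeeds here because 6 is a whole multiple of 3; with lengths that do not divide evenly, data.frame() raises an error instead.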

2.3 Identify data types and verify schemas


You can see that it is important to know exactly how many columns will be returned from your R code, and what the data type of
each column will be.

You can use the function str() in your R script to have the data schema of the R object returned as an informational message.
For example, the following statement returns the schema of the #MyData table.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'str(InputDataSet);'
    , @input_data_1 = N'SELECT * FROM #MyData;'
WITH RESULT SETS undefined;

Results
STDOUT message (s) from external script:
'data.frame': 3 obs. of 1 variable:
STDOUT message (s) from external script:
$ Col1: int 1 10 100

2.4 Cast or convert columns


When you send data from SQL Server to R, frequently you will need to cast or convert data types to ensure that they can be
handled appropriately.
When you use a SQL query as input to R code, multiple data transformations take place:
SQL Server pushes the data from the query to the R process managed by the Launchpad service and converts it to an
internal representation.
The R runtime loads the data into a data.frame variable and performs its own operations on the data.
The database engine returns the data to Management Studio or another client using SQL Server data types.
1. Run the following query on the AdventureWorksDW data warehouse. The input data query defines a table of sales data to
use in creating a forecast.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'str(InputDataSet);'
    , @input_data_1 = N'
        SELECT ReportingDate
        , CAST(ModelRegion as varchar(50)) as ProductSeries
        , Amount
        FROM [AdventureworksDW2016CTP3].[dbo].[vTimeSeries]
        WHERE [ModelRegion] = ''M200 Europe''
        ORDER BY ReportingDate ASC;'
WITH RESULT SETS undefined;

2. Now review the results of the str function to see how R handled the input data.
Results

STDOUT message(s) from external script: 'data.frame': 37 obs. of 3 variables:
STDOUT message(s) from external script: $ ReportingDate: POSIXct, format: "2010-12-24 23:00:00" "2010-12-24 23:00:00"
STDOUT message(s) from external script: $ ProductSeries: Factor w/ 1 levels "M200 Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
STDOUT message(s) from external script: $ Amount : num 3400 16925 20350 16950 16950
Notes
Some SQL Server data types are not supported by R. Therefore, to avoid errors, you should specify columns individually
and cast some columns.
The string predicate in the WHERE clause must be enclosed by two sets of single quotation marks.
The datetime column has been returned as the R data type, POSIXct.
The column [ProductSeries] has been correctly identified as a factor.
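
To get a feel for these R types without connecting to AdventureWorksDW, you can build a small data frame by hand. The sketch below is illustrative only: the sales data frame is a hypothetical stand-in that reuses two rows of values from the str() output above, and the UTC time zone is an assumption.

```r
# Hypothetical stand-in for the rows returned by the input query.
sales <- data.frame(
  ReportingDate = as.POSIXct(c("2010-12-24 23:00:00", "2010-12-24 23:00:00"),
                             tz = "UTC"),
  ProductSeries = factor(c("M200 Europe", "M200 Europe")),
  Amount        = c(3400, 16925)
)

# sapply(..., class) gives a compact view of each column's R type;
# POSIXct columns report two classes ("POSIXct" and "POSIXt").
sapply(sales, class)

# Convert a factor back to plain text before doing string operations on it.
sales$ProductSeries <- as.character(sales$ProductSeries)
class(sales$ProductSeries)   # "character"
```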

2.5 Use multiple inputs


Only one input dataset can be passed as a parameter. However, you can call other datasets from inside your R code, using
RODBC.
For example, the following sample makes a call to the RODBC package, specifying the data in a SELECT statement.
sql
@script = N'
    <other R code here>;
    library(RODBC);
    connStr <- paste("Driver=SQL Server;Server=", instance_name, ";Database=",
        database_name, ";Trusted_Connection=true;", sep = "");
    dbhandle <- odbcDriverConnect(connStr)
    OutputDataSet <- sqlQuery(dbhandle, "SELECT * from <table_name>");
    <more R code combining the data>'

We recommend that you use RODBC to get smaller datasets, such as lookups or lists of factors, and use the @input_data
parameter to get larger datasets, such as those used for training a model, from SQL Server.

2.6 Generate multiple outputs


In SQL Server 2016, the output of R from the stored procedure sp_execute_external_script is limited to a single data.frame or
dataset. This limitation might be removed in future.
However, you can return outputs of other types in addition to the dataset. For example, you might train a model using a single
dataset as input, but return a table of statistics as the output, plus the trained model as an object.
Moreover, you can add the OUTPUT keyword to any input parameter to have it returned with the results.
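
On the R side, a script that produces both a table of statistics and a trained model might look like the following sketch. It assumes the built-in cars dataset as training data and uses base R's serialize() to flatten the model into a raw vector; how that vector is then mapped to an OUTPUT parameter is not covered in this tutorial, so treat that step as an assumption. The names model_bytes and restored are illustrative.

```r
# Train a small model on a dataset that ships with base R.
model <- lm(dist ~ speed, data = cars)

# The data.frame output: a table of coefficient statistics.
OutputDataSet <- as.data.frame(coef(summary(model)))

# The additional output: the model flattened to a raw (binary) vector.
model_bytes <- serialize(model, connection = NULL)
is.raw(model_bytes)

# The bytes can be turned back into a usable model later.
restored <- unserialize(model_bytes)
predict(restored, data.frame(speed = 10))
```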

2.7 Parallel processing


If you are working with a large data set and the SQL query that gets your data can generate a parallel query plan, you can run R
script in parallel by setting the @parallel parameter to 1. This is typically useful if you are scoring a large result set. If you use this
option, you must specify the output results schema in advance using WITH RESULT SETS.
However, this parameter won't have any effect if you need to train a model that uses all the rows; for such cases, we recommend
you use the ScaleR packages. These packages can distribute processing automatically without needing to specify @parallel = 1 in
the call to sp_execute_external_script.

Part 3. Wrapping R Functions in Stored Procedures


R can perform many advanced statistical functions that might require complicated code to reproduce using T-SQL. However,
with R Services, you can run R utility scripts in a stored procedure to support tasks like these:
Apply R's mathematical and statistical functions to SQL Server data
Create table-valued functions or scalar functions that use R code
Typically, you would encapsulate the calls to these R functions in stored procedures, to make it easier to pass in parameters. To
illustrate this process, the same function is provided in R code, in an ad hoc stored procedure call to sp_execute_external_script,
and in a stored procedure that you can use to parameterize the R function.

3.1. Generate random numbers


The following statement uses one of the R functions from the stats package, which is loaded by default when R Services is
installed. The rnorm function shown here generates 20 random numbers with a normal distribution, given a mean of 100.
Here's the R code that does the work.
r
as.data.frame(rnorm(20, mean = 100));

This statement calls the function from T-SQL and outputs the results to SQL Server.
sql
EXEC sp_execute_external_script
    @language = N'R'
    , @script = N'
        OutputDataSet <- as.data.frame(rnorm(20, mean = 100));'
    , @input_data_1 = N' ;'
WITH RESULT SETS (([Density] float NOT NULL));

Next, you wrap the stored procedure in another stored procedure to make it easier to pass in parameters. You must define each
of the input parameters in the @params argument, and map each parameter to its corresponding R parameter by name.
sql
CREATE PROCEDURE MyRNorm (@mynorm int, @mymean int)
AS
    EXEC sp_execute_external_script
        @language = N'R'
        , @script = N'
            OutputDataSet <- as.data.frame(rnorm(mynorm, mymean));'
        , @input_data_1 = N' ;'
        , @params = N'@mynorm int, @mymean int'
        , @mynorm = @mynorm
        , @mymean = @mymean
    WITH RESULT SETS (([Density] float NOT NULL));

Call the new stored procedure and pass in values.


sql
exec MyRNorm @mynorm = 20, @mymean = 100
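
For comparison, the same parameterization can be written as an ordinary R function. This is only a sketch: my_rnorm is a hypothetical name, and the fixed seed is an assumption added so that repeated runs return the same numbers.

```r
# Hypothetical R-side equivalent of MyRNorm: mynorm is the number of
# values to draw and mymean is the mean of the normal distribution.
my_rnorm <- function(mynorm, mymean) {
  # Positional arguments map to rnorm(n, mean); sd keeps its default of 1.
  as.data.frame(rnorm(mynorm, mymean))
}

set.seed(42)                 # fixed seed so repeated runs match
OutputDataSet <- my_rnorm(mynorm = 20, mymean = 100)
nrow(OutputDataSet)          # 20
```

In the stored procedure, the names listed in @params (without the @ prefix) play the role of the function arguments here: they become R variables that the script can use directly.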

3.2 More uses for R utility functions


This example calls one of R's utility packages and uses its memory.limit() function to get the memory limit for the current environment.
The utils package is installed but not loaded by default.
sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'
        library(utils);
        mymemory <- memory.limit();
        OutputDataSet <- as.data.frame(mymemory);'
    , @input_data_1 = N' ;'
WITH RESULT SETS (([Col1] int not null));

The next example gets the largest integer value that is supported on the current computer, using the R .Machine
variable, and outputs it to the console.
r
localmax <- .Machine$integer.max;
localmax;

sql
execute sp_execute_external_script
    @language = N'R'
    , @script = N'
        localmax <- .Machine$integer.max;
        OutputDataSet <- as.data.frame(localmax);'
    , @input_data_1 = N'select [Col1] from #MyData;'
WITH RESULT SETS (([MaxIntValue] int not null));
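
The value returned here is the upper bound of R's 32-bit integer type, which is why it fits the int column declared in WITH RESULT SETS. The short sketch below, using only base R, shows what happens at that boundary; the name overflow is illustrative.

```r
localmax <- .Machine$integer.max
localmax                       # 2147483647, that is, 2^31 - 1

# Integer arithmetic past the limit yields NA (R also issues a warning).
overflow <- suppressWarnings(localmax + 1L)
is.na(overflow)                # TRUE

# Double arithmetic carries on, but the result no longer fits a SQL int.
localmax + 1                   # 2147483648
```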

However, it isn't always the case that R will do the job better. Set-based operations in SQL Server might be far more efficient for
some operations that data scientists would traditionally perform in R. For an example of a performance comparison of R
functions and T-SQL custom functions, see the Data Science End-to-End solution.

We recommend that you evaluate on a case-by-case basis whether it makes more sense to perform a given operation using R,
using T-SQL, or some other tool.

Additional Resources
Data Science Deep Dive: Using the RevoScaleR packages: This walkthrough provides hands-on experience with common data
science tasks
Data Science End-to-End solution: This walkthrough illustrates a development and deployment process that balances SQL and R
approaches
Advanced analytics for the SQL Developer: Illustrates the complete model operationalization process for the SQL developer
2016 Microsoft
