Real-time Analytics at the Edge: Identifying Abnormal Equipment Behavior and Filtering Data near the Edge for
Internet of Things Applications
By Ryan Gillespie and Saurabh Gupta, SAS Institute, Inc.
Location Analytics: Minority Report Is Here—Real-Time Geofencing Using SAS® Event Stream Processing
By Frederic Combaneyre, SAS Institute, Inc.
Listening for the Right Signals – Using Event Stream Processing for Enterprise Data
By Tho Nguyen, Teradata Corporation
Fiona McNeill, SAS Institute, Inc.
Visit sas.com/books for additional books and resources.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies. © 2017 SAS Institute Inc. All rights reserved. M1673525 US.0817
About This Book
While you won’t find a canonical definition of IoT in this ebook, the papers included in this special collection demonstrate
how SAS is using its technology to address our customers’ IoT needs, including streaming data, edge computing, prescriptive
analytics, and much more.
The following papers are excerpts from the SAS Global Users Group Proceedings. For more SUGI and SAS Global Forum
Proceedings, visit the online versions of the Proceedings.
For many more helpful resources, please visit support.sas.com and sas.com/books.
Do you have questions about a SAS Press book that you are reading? Contact the author through saspress@sas.com.
viii The Internet of Things with SAS: Special Collection
Foreword
Life was simple. And then the mouse showed up. Not the furry kind, mind you, but the clicky kind.
Some of you may remember when we only had keyboards to interact with our computer monitors. We had to use the Tab,
Shift-Tab, Enter, and arrow keys to move our cursor from field to field on the screen. If you were an end user back then, you
would probably describe the experience as controlled, somewhat tedious, and often slow – but, frankly, it is all we knew. And
if you were an application/database developer, like I was, creating these controlled digital experiences was admittedly mind-
numbing at times but necessary to satisfy the business requirements of our company or client.
Developers, I would argue, were hit the hardest. They had to upgrade their keyboard-controlled, character-based applications
to keyboard- and mouse-controlled GUI (Graphical User Interface) apps. It was a painful transition. Long gone were the days
of systematically controlling a user’s every move with the keyboard. End users, by comparison, had it easy. All they had to
do was get used to offloading their navigation activity from the keyboard to this new clicky thing called a mouse.
The mouse in all its simplicity proved to provide more freedom, flexibility, and speed – and ultimately changed the way we
interacted with computers. It took a couple of years to transition fully to a mouse-driven world, but, once the transition was
made, there was no going back.
Today, we are in the midst of another significant digital transformation: the Internet of Things (IoT). On the surface, the IoT
is moving us towards a smarter, more connected world. However, at its core, the IoT is about data. Big data, IoT data, sensor
data – call it what you want – but it is data that’s fueling this transformational shift.
Just as the mouse changed how we interacted with computers, the Internet of Things is changing how we interact with data.
How we collect it. How we process it. How we store it. How we govern it. How we manage it. How we analyze it. And,
ultimately, how we make decisions with it. Not only do we want to make decisions based on data stored in our enterprise data
warehouse (that is, data at rest), we now need the ability to make decisions on-the-fly, in real or near real time (that is, data in
motion).
This is where SAS comes in. SAS has been in the data and analytics business for more than 40 years – well before the mouse
made its debut – helping companies analyze and understand all their data, at rest and in motion. The papers included in this
special collection demonstrate how SAS is using its technology to address our customers’ IoT needs, including streaming
data, edge computing, prescriptive analytics, and much more.
Modernizing Data Management with Event Streams
Evan Guarnaccia, Fiona McNeill, and Steve Sparano, SAS Institute Inc.
As the Internet of Things (IoT) continues to grow, a natural increase in the volume and variety of data follows. SAS® Event Stream Processing offers the flexibility, versatility, and speed to tackle these issues and adapt as the landscape of IoT changes. This paper distinguishes the advantages of adapters and connectors and shows how SAS® Event Stream Processing can leverage both Hadoop and YARN technologies to scale while still meeting the needs of streaming data analysis and large, distributed data repositories.
Location Analytics: Minority Report Is Here—Real-Time Geofencing Using SAS® Event Stream Processing
Frederic Combaneyre, SAS Institute Inc.
Geofencing is one of the most promising and exciting concepts that has developed with the advent of the Internet of Things (IoT). Examples include receiving commercial ads and offers based on your personal taste and past purchases when you enter a mall, tracking vessels to detect where a ship is located, and forecasting and optimizing ship harbor arrivals. This paper explains how to implement real-time geofencing on streaming data with SAS® Event Stream Processing and achieve high-performance processing in terms of millions of events per second over hundreds of millions of geofences.
Listening for the Right Signals – Using Event Stream Processing for Enterprise Data
Tho Nguyen, Teradata Corporation, and Fiona McNeill, SAS Institute Inc.
With the emergence of the Internet of Things (IoT), ingesting streams of data and analyzing events in real time become even more critical. The interconnectivity of IoT from web and mobile applications provides organizations with even richer contextual data and more profound volumes to decipher in order to harness insights. Capturing all of the internal and external data streams is the first step to enable listening for the important signals that customers are emitting, based on their event activity. Having the ability to permeate identified patterns of interest throughout the enterprise requires deep integration between event stream processing and foundational enterprise data management applications. This paper describes the innovative ability to consolidate real-time data ingestion with controlled and disciplined universal data access – from SAS® and Teradata™.
Prescriptive Analytics – Providing the Instructions to Do What’s Right
Tho Nguyen, Teradata Corporation, and Fiona McNeill, SAS Institute Inc.
Automation of everyday activities holds the promise of consistency, accuracy, and relevancy. When applied to business operations, the additional benefits of governance, adaptability, and risk avoidance are realized. Prescriptive analytics empowers both systems and front-line workers to take the desired company action – each and every time. And with data streaming from transactional systems, from the Internet of Things (IoT), and from any other source, doing the right thing with exceptional processing speed embodies the responsive necessity that customers depend on. This paper describes how SAS® and Teradata are enabling prescriptive analytics – in current business environments and in the emerging IoT.
We hope these selections provide you with a useful overview of the many tools and techniques that are available to help you
as we shift from a data-at-rest to a data-in-motion world.
If this whets your appetite, check out The Non-Geek’s A-to-Z Guide to The Internet of Things, a white paper listing 101 common terms related to the Internet of Things. Because IoT is evolving so quickly, it’s not exhaustive, but rather a quick go-to resource for the technically savvy data professional who wants to get a handle on this vast IoT ecosystem, explained sans technical “geek speak.”
Tamara Dull
Director of Emerging Technologies
SAS Best Practices
Tamara Dull is the Director of Emerging Technologies for SAS Best Practices, a thought leadership
team at SAS Institute. Through key industry engagements, and provocative articles and
publications, she delivers a pragmatic perspective on big data, the Internet of Things, open source,
privacy, and cybersecurity. Tamara began her high-tech journey long before the internet was born,
and has held both technical and management positions for multiple technology vendors,
consultancies, and a non-profit. Tamara is listed in the IoT Institute's "25 Most Influential Women
in IoT" and Onalytica’s Big Data Top 100 Influencers and Brands lists for the last three years. She
is also an advisory board member for the Internet of Things Community.
Paper SAS6367-2016
Streaming Decisions: How SAS® Puts Streaming Data to Work
Fiona McNeill, David Duling, and Stephen Sparano, SAS Institute Inc.
ABSTRACT
Sensors, devices, social conversation streams, web movement, and all things in the Internet of Things
(IoT) are transmitting data at unprecedented volumes and rates. SAS® Event Stream Processing ingests
thousands and even hundreds of millions of data events per second, assessing both the content and the
value. The benefit to organizations comes from doing something with those results, and eliminating the
latencies associated with storing data before analysis happens. This paper bridges the gap. It describes
how to use streaming data as a portable, lightweight micro-analytics service for consumption by other
applications and systems.
INTRODUCTION
The Internet has created a culture of people conditioned to expect immediate access to information.
Mobile networks have created a society reliant on instant communication. The Internet of Things (IoT) is
forming a new era, blending these revolutionary technologies, and establishing information access and
communication between objects. This provides a seminal opportunity for organizations to realign their
services, products, and even identity to an operational environment that responds in real time.
Hopefully by now, the debate over “real time versus the right time” is over. For the purposes of this paper, real time corresponds to latency so short that events are impacted as they occur. As such, real-time activity is in contrast to the more traditional, offline use of data, where business intelligence and analytics have been used to make both tactical and strategic decisions. Real time is a time-sensitive need, essential when a decision must occur to avoid impending and undesirable threats, or to take advantage of fleeting opportunities.
In order for organizations to operate in real time, some fundamentals are required. Data input must be emitted and received in real time, as it is being generated – as it is with sensors transmitting object status and health. The data needs to be assessed in real time, extracting the inherent meaning from the data elements as they are being ingested. Lastly, the data needs to drive decisions and the instructions for low-latency actions. These are the characteristics associated with streaming data.
Unlike other types of data, streaming data is transferred at high speed – on the order of hundreds, thousands, and even millions of events per second – and at a consistent rate (save for interrupted transmissions associated with network outages). Popular types of streaming data include streaming television broadcasts and financial market data. Such data are continuous, dynamic events that flow across sufficient bandwidth, so fast that there is no humanly perceived time lag between one event and the next. Given the high volume and high velocity of streaming data, it’s not surprising that the receipt, ingestion, and decisioning of this data are left to powerful computing technology that can scale to assure the high-volume, low-latency actions of objects connected in the IoT, enabling them to communicate and respond in real time.
tracking from smart meters, binary readings of on/off status from machinery, RFID tags, sensors
readings of temperature and pressure from oil drills, banking transaction systems, and more.
• Semi-structured or unstructured, such as data that is generated by computer machine logs, social
media streams, weather alert bulletins, live camera feeds, and operational and ERP systems (free
form notes and comments are included with the structured records in most operational/ERP systems),
to name a few.
It’s often assumed that streaming data generated from sensors, devices, and machines is consistent and accurate, unlike human-generated content, which is known to be fraught with misspellings, stylistic differences, translation loss, and so on. However, sensor data and its cohorts also suffer from inconsistent and incorrect data: bad readings (a temperature sensor goes awry), missed readings (interruptions in transmission), and the need to consolidate different readings, typically associated with multi-sensor assessments that have different specifications or protocols. For example, dialysis machines communicate using different languages, transmitting over USB, Ethernet, different serial interfaces (RS-232, RS-485, RS-422, and so on), and Wi-Fi®.
Streaming data, as with any other type of data, suffers from data quality issues that must be addressed in order to assess, analyze, and act on it. Big data repositories, like Hadoop®, provide a currently popular answer: capture streaming data and then cleanse it for analysis. And while this might be viable initially, it’s only a short-term stopgap. With the expansion of IoT, and the corresponding explosion in streaming data on the horizon, even low-cost commodity storage for big data will soon be too expensive to economically address the needs of streaming data. Yet even if an unlimited budget existed (it doesn’t), when real-time answers are demanded, the latency associated with first storing streaming data, then cleansing it, then analyzing it adds incremental processing time – delaying actions until they are no longer in real time.
You can reduce the transmission, storage, and assessment costs of streaming data by cleansing and analyzing it near the source of data generation, pushing the required processing to the edges of the IoT. Aggregators, gateways, and controllers are natural levees at which to cleanse multiple sources of aggregated data, minimizing downstream pollution from dirty events. Embeddable technology, provided by SAS® Event Stream Processing, aggregates, cleanses, normalizes, and filters streaming data while it is in motion – and before it is stored. SAS Event Stream Processing is poised to even process data at the sensor processing chip itself.
Unlike traditional database management systems, which are designed for static data in conventional stores – and even big data repositories, with queries to file systems – streaming data management requires flexible query processing in which the query is not performed once or in batch, but is permanently installed and executed continuously. SAS includes pre-built data quality routines in the SAS Event Stream Processing query definition. In this way, the necessary streaming data correction, cleansing, and filtering is applied to data in motion, which in turn reduces the pollution of data lakes with bad and irrelevant data. Of equal, if not greater, importance, streaming data quality paves the way for streaming analytics, doing the data preparation required for analytically sound real-time actions.
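The continuous-query idea can be sketched in a few lines of Python. This is a hedged illustration of the concept only – not SAS Event Stream Processing syntax – and the event schema and quality rule are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Hypothetical event schema, for illustration only."""
    sensor_id: str
    temperature: float

def valid(event):
    """Pre-built data quality rule: reject physically implausible readings."""
    return -50.0 <= event.temperature <= 150.0

def continuous_query(stream):
    """A permanently installed query: it runs for as long as the stream
    produces events, cleansing each one in motion, before storage."""
    for event in stream:
        if valid(event):
            yield event  # only clean events reach downstream storage/analytics

# Two bad readings are filtered out before anything is stored.
stream = [Event("a", 21.5), Event("a", 999.0),
          Event("b", -80.0), Event("b", 22.1)]
clean = list(continuous_query(stream))
```

The key property is that the query is written once and applied to every event as it arrives, rather than being run against a store after the fact.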
STREAMING ANALYTICS
Tom Davenport, a thought leader in the field of analytics, has said that the “Analytics of Things is more important than the Internet of Things” (Davenport, 2015). Arguable, perhaps, by those in the communications industry, but the point is that understanding the data through which connected objects communicate is critical to having successful conversations between the ‘things’. IoT provides an opportunity to reconsider how we use analytics and make it pervasive – to drive useful and effective conversations between things.
In many cases, we can apply the same types of analytics to streaming data that we use in traditional batch model execution. The difference is that, unlike traditional analysis, which requires data to be stored before it’s analyzed, event streams are analyzed before the data is stored. The following types of analytics are applicable to IoT data as part of the continuous query:
• Predictive analytics identifies future likelihoods of events that have not yet happened
SAS Event Stream Processing provides procedural windows to include both descriptive and predictive
algorithms defined in SAS DATA step, SAS DS2, and other languages. As with traditional analysis, these
models are built in SAS® High-Performance Data Mining, SAS® Factory Miner, SAS® Contextual Analysis,
SAS® Forecast Server, and any other SAS® product or solution that generates SAS DATA step or SAS
DS2 code. For this broad selection of algorithms, the models are built on an event history that has been stored. As with traditional analysis, models are built, tested, and validated. The resulting model code, however, is included in the SAS Event Stream Processing continuous query as a pre-defined, procedural calculation. As part of the continuous query, the model scores individual events as they are ingested. In other words, the analytics are performed on live data streams – synonymous with the term ‘streaming analytics’.
Taking advantage of machine learning and deep learning techniques, SAS Event Stream Processing includes a growing suite of methods to build and score models solely from streaming data, without out-of-stream model development or event history. In this case, algorithms such as K-means clustering are both defined and applied to events in motion, learning from new events. This exciting field of new techniques further expands the streaming analytics methods available for streaming data.
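The idea of a model that learns from events in motion can be sketched with a streaming (online) K-means update, where each arriving event nudges its nearest centroid rather than the model being refit against stored history. This is a hedged Python illustration of the general technique, not SAS Event Stream Processing’s implementation; the initial centroids and points are invented:

```python
import math

class StreamingKMeans:
    """Minimal online k-means: each event updates the nearest centroid
    incrementally, so the model learns without stored event history."""
    def __init__(self, centroids):
        self.centroids = [list(c) for c in centroids]  # initial guesses
        self.counts = [0] * len(centroids)             # events seen per cluster

    def update(self, point):
        # assign the event to its nearest centroid
        k = min(range(len(self.centroids)),
                key=lambda i: math.dist(self.centroids[i], point))
        self.counts[k] += 1
        eta = 1.0 / self.counts[k]  # per-cluster step size shrinks over time
        # move the centroid a fraction of the way toward the new event
        self.centroids[k] = [c + eta * (p - c)
                             for c, p in zip(self.centroids[k], point)]
        return k

km = StreamingKMeans([(0.0, 0.0), (10.0, 10.0)])
for pt in [(0.5, 0.2), (9.8, 10.1), (0.1, 0.4), (10.2, 9.7)]:
    km.update(pt)  # centroids drift toward the two event clusters
```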
STREAMING DECISIONS
The focus of this paper is enabling prescriptive analytics in stream, a term we describe as ‘streaming decisions’. Streaming decisions define the instructions for real-time actions based on live, streaming events. They are of particular importance to actions taken by objects in the IoT. They combine descriptive and predictive algorithms with the business rules that trigger when the models are relevant to the current streaming event data scenarios. In other words, they are the instructions an IoT object needs to take the right action – something of core importance to the adoption and successful proliferation of autonomous IoT activity.
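Reduced to its essentials, a streaming decision is a model score combined with a business rule that triggers an action. The Python sketch below illustrates that pattern; the score function, field names, and threshold are hypothetical stand-ins, not SAS Decision Manager logic:

```python
def failure_score(event):
    """Hypothetical stand-in for a published predictive model; a real
    deployment would execute generated model score code instead."""
    return 0.9 if event["vibration"] > 7.0 else 0.1

def streaming_decision(event, threshold=0.5):
    """Streaming decision: a business rule decides when the model's
    score triggers a real-time, prescriptive action for this event."""
    score = failure_score(event)
    if score >= threshold and event["in_service"]:
        return "route_to_service_center"  # prescribed action
    return "continue"                     # no action required

# A high-vibration, in-service vehicle triggers the prescribed action.
action = streaming_decision({"vibration": 8.2, "in_service": True})
```

The rule (`threshold`, `in_service`) is what makes the decision prescriptive rather than merely predictive: the score alone says what is likely, the rule says what to do.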
As we distribute analytics further out to the edges of the IoT, a classification provides some guiding principles for when one type of analysis, and the corresponding actions, is more applicable than another. The following types of analytics are described on the IoTHub (2016):
• Edge analytics is analysis performed on the same device from which the data is streaming.
• In-stream analytics is analysis that occurs as data streams from one device to another, or from multiple sensors to an aggregation point.
• At-rest analytics is analysis processed after the event has passed, based on saved historical event data and/or other stored information.
In general, the closer to the edge, the less event data there is to analyze. At the edge, there is just that
one object/sensor/device, with its limited supply of data. As mentioned, data quality issues are present at
the edge, and events can be aggregated in windows of time to correct and filter out the irrelevant noise
from the signal of interest. Analytical calculations are more limited due to the data restrictions, and
prescriptive analytics (say, instructions emanating from another object) are limited to real-time actions that
can be performed in isolation – like commands to turn up or down, turn on or off.
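Windowed aggregation at the edge can be illustrated with a short Python sketch. The window size, threshold, and action names are invented for illustration – a real edge deployment would use SAS Event Stream Processing windows – but the principle is the same: act on the smoothed value so one bad reading does not, by itself, trigger an action:

```python
from collections import deque

class EdgeWindow:
    """Edge-side sketch: smooth the last few readings in a small time
    window and act on the average, filtering out one-off noise."""
    def __init__(self, size=3, high=50.0):
        self.window = deque(maxlen=size)  # the last `size` readings
        self.high = high                  # illustrative action threshold

    def ingest(self, reading):
        self.window.append(reading)
        smoothed = sum(self.window) / len(self.window)  # filter the noise
        # the kind of isolated, real-time action an edge device can take alone
        return "turn_down" if smoothed > self.high else "hold"

edge = EdgeWindow(size=3, high=50.0)
# A one-off spike (90.0) is averaged away, so no action fires.
actions = [edge.ingest(r) for r in [25.0, 26.0, 90.0, 25.0]]
```

A sustained rise in the readings, by contrast, pushes the window average past the threshold and triggers the action.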
As more objects are related to each other in-stream, at aggregation points, the data are richer (emanating
from several sources) and correspondingly there are more data quality issues. The contextual
understanding of the scenario is also richer (with more event data over time, space, and so on) and, in
turn, more complex patterns of interest can be identified. Streaming decisions can thus be made that
relate to more objects, even becoming a series of inter-connected actions.
As is typical of any analysis, the decision to apply different types of models depends on the data as well as on the business problem the analysis solves. More often than not, IoT analytic solutions require multiphase analytics – that is, models defined in the traditional, stored-data paradigm and scored for new analytical insight, as well as in-stream model derivation/calculation and analytics applied at the edge. SAS® does this. With the same integrated code, and with over 150 (at last count) adapters and connectors linking streaming data, SAS Event Stream Processing is used to define the complete continuous query, which can be as simple or as complex as the business problem itself. Moreover, built into SAS Event Stream Processing is the ability to automatically issue alerts and notifications for real-time situational awareness and understanding of event status.
When we consider the IoT, we are describing an analytically driven network of objects that communicate with each other. When we automate actions between objects, especially when there is no human intervention, the risks associated with rogue actions – as well as the technical debt that accumulates from both machine learning algorithms (Sculley et al., 2015) and from any unmanaged advanced analytics environment – will outweigh the advantages. As such, the IoT demands a governed, reliable, and secure environment for streaming analytics and the associated prescribed analytic actions. SAS® Decision Manager is a prescriptive analytics solution with fully traceable workflows, versioning, and audit trails that assure command and control over streaming analytics for real-time, reliable, and accurate IoT applications.
SCALING TO DATA STREAMS
Decisions in Action
Streaming data, as we have seen, represents data from sources such as customers’ web clicks, call data records, fleet vehicle GPS, point-of-sale systems, and now, more commonly, sensors on corporate assets, such as machines on the manufacturing floor or sensors in an electric power grid.
Data from these systems has historically been extracted, transformed, or cleansed from the source, and then loaded into data warehouses for storage and later analysis. But, as described above, soon – if not already – organizations will conclude that they can’t afford to store it all, and they certainly can’t afford the lag times of analyzing data after it has been stored. In business operations, the value of data diminishes the longer we wait to use it, so we need new ways to analyze it sooner – closer to where it originates – and that means we need new ways to tap into the value of data streams.
Tapping into data while it’s still in motion, before it’s stored, empowers actions to be applied sooner – before the data’s value diminishes, before an opportunity is missed, or before a threat goes unprevented. This diminishing value of data can be seen in Figure 1, which depicts the relationship between our ability to ingest and analyze the data for an action and the value of making that decision sooner rather than later.
DRIVING INTERNET OF THINGS (IOT) ACTIONS
In this nascent, hyper-connected world of IoT, data is being generated rapidly, and businesses want effective approaches to leverage their analytical resources – not only to analyze and gain insights from the rising tide of data, but also to take action from it, to obtain the most real-time value.
IDC research has found that only 0.5% of the data being generated through the IoT is analyzed to derive value (see Figure 2). This means that only 0.5% of the data from “things” was being analyzed at that time, leaving a rich set of opportunities to understand and act upon untapped. And while this research is from 2012, and more organizations have since begun to examine IoT data for deriving business value, the vast majority of organizations have not yet used IoT data for business operations. The report also points out the amount of untagged data (often associated with unstructured text data) that would be more useful if it were tagged and analysis-ready. However, with the aforementioned traditional technique of storing first and then analyzing, tagging content for use is often prohibitive because of the consequent large storage costs – even if we assume that all potentially useful unstructured text could be stored (Gantz and Reinsel, 2012).
Figure 2. IDC: The Untapped Big Data Gap (Gantz and Reinsel, 2012)
With increasing numbers of objects, sensors, and devices joining the connected network of the IoT, the big data gap is growing. Processing data at the scale and speed associated with streaming data generated from the ‘things’ requires new architectures that can analyze and make decisions on the streaming, in-motion data, filtering out the irrelevant from any downstream activity – including data storage. As such, there is a growing necessity to move analytics, the associated operational decisions, and the corresponding actions closer to the data – and in some cases, right into the data streams, near the point of data generation.
SAS Event Stream Processing and SAS® Decision Manager together enable organizations to analyze data while it’s still in motion and to apply prescriptive actions sooner, deriving maximum value from live events before the value of the streaming data diminishes. The communication between the prescriptive instructions from SAS Decision Manager and the real-time analytical determination embedded within SAS Event Stream Processing is achieved using SAS® Micro Analytic Services1. SAS Micro Analytic Services provides the ability to quickly execute decisions based on the results of in-stream scoring.
1 SAS Micro Analytic Services are included within the SAS Decision Manager offering.
SAS Decision Manager and SAS Event Stream Processing
SAS Decision Manager is used to automate tactical decision making by prescribing actions to take through the design, development, testing, and publishing of decision flows. SAS Decision Manager supports the business analyst, data scientist, data miner, and statistician in collaboratively developing tactical decision actions by building decision flows that combine analytical models with the business rules that drive operational business processes.
These same decision flows, authored in SAS Decision Manager and published into SAS Micro Analytic Services, can be executed on streaming data by including them in the event data flow defined in SAS® Event Stream Processing Studio.
BUSINESS PROBLEM
A fleet management company wants to minimize the time vehicles are out of service to reduce costs, minimize lost revenue, and maximize uptime. The trucks have sensors that transmit data monitoring location, vibration, RPM, temperature, speed, steering angle, pressures, and so on. The company needs to proactively prioritize maintenance before a vehicle is unexpectedly out of service. The somewhat obvious and immediate business benefit of analyzing vehicle sensors in the transportation industry is maximizing the efficient use of assets.
Streaming Analytics for IoT Actions: Two Aspects Driving One Outcome
The company has been collecting data on its fleet and processing it offline, identifying problems, and then generating notices sent to maintenance locations to try to improve maintenance scheduling and the overall efficiency of vehicle assets. This approach, however, is typically expensive, given that the vehicle is often inoperable – taken off the road before parts are available for the necessary maintenance or before mechanics can fit the additional work into their schedules. This can result in unplanned downtime, overtime servicing costs, expensive non-scheduled parts delivery, and more.
In this scenario, the company wants to use real-time data they’re collecting from vehicles to identify
issues sooner and increase the time available to address maintenance proactively. In fact, the truck
sensors can be analyzed in an onboard SAS Event Stream Processing engine, reducing the latency from
the time the data is collected to the time the issue is identified and addressed.
On-board processing of streaming data is one solution to better predict the likelihood of a vehicle issue. An even better solution is to drive a prescriptive action that instructs where to route the vehicle, notifies suppliers of the necessary parts for on-time delivery, and identifies mechanic schedules aligned to a service stop that minimizes the deleterious impact on the fleet’s transportation of goods to its customers.
Together, the in-stream analysis and the prescriptive actions enable alerts to be generated in real time to all stakeholders, empowering them to take the right action – minimizing costs, maximizing revenue, and drastically reducing unplanned downtime for the fleet.
records for those same trucks, and significant events were labeled as failures. We have joined the sensor readings with the maintenance events to create a predictive modeling table. The vehicle data represents real sensor readings and has been provided by Intel™ Corporation for public demonstrations.
Building the Model
The first thing that a data scientist will do is look at the data and run a basic variable distribution analysis. All variables in this sample, including failure, are numeric.
proc means data=trucks.failures n nmiss min max mean;
   output out=means;
   var _numeric_;
run;
We can see that three variables have missing values in a relatively small number of cases: 14, 158, and 791, out of a total of 8,395 cases. We can also see that failure occurred in 20% of the cases. Therefore, predicting failure has the potential to improve efficiency and reduce costs.
We need a tool that can predict failures for this sample. We need a tool with robust handling of missing values. We also need a tool that can provide root cause analysis and determine which sensor-equipped devices potentially contribute to failure, so that we can improve those devices and reduce the overall failure rate. Finally, and perhaps most importantly, we need a tool that can produce a failure scoring function that we can deploy to the real-time system. The scores generated by this system will be used to generate signals for the decision processing application. Higher scores will indicate that failure is possibly imminent and that the truck should be routed to a service center before a serious problem occurs.
The decision tree is a modern data mining tool that handles missing values and is useful for both model interpretation and scoring. A decision tree is sometimes referred to as a recursive partitioning algorithm, which works by selecting the variables that have the greatest power to divide cases based on the dependent target variable values. The result is a downward tree where each node contains fewer cases and the dependent target variable distribution is more skewed. The prediction value at each leaf of the tree is the leaf’s proportion of dependent target variable events. In our case, the target variable failure has two values, zero and one, where one represents an observed failure event. We want to create a decision tree model that predicts the event value (one) and identifies the independent input variables used in that prediction.
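For readers who want to see the mechanics, recursive partitioning can be sketched in a few dozen lines of Python. This is a toy illustration of the general algorithm (Gini impurity, exhaustive threshold search, proportion-valued leaves) under invented data, not the HPSPLIT implementation; it omits missing-value branches and validation partitions:

```python
def gini(rows):
    """Gini impurity of the binary target stored in the last column."""
    p = sum(r[-1] for r in rows) / len(rows)  # proportion of events (target = 1)
    return 2 * p * (1 - p)

def best_split(rows, n_features):
    """Pick the variable and threshold with the greatest power to divide
    the cases by target value (lowest weighted child impurity)."""
    best = None
    for j in range(n_features):
        for t in sorted({r[j] for r in rows}):
            left = [r for r in rows if r[j] <= t]
            right = [r for r in rows if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    return best

def grow(rows, n_features, depth=2):
    """Recursively partition until a node is pure or the depth limit hits.
    A leaf predicts the proportion of target events among its cases."""
    p = sum(r[-1] for r in rows) / len(rows)
    if depth == 0 or p in (0.0, 1.0):
        return p  # leaf: event proportion
    _, j, t, left, right = best_split(rows, n_features)
    return (j, t, grow(left, n_features, depth - 1),
            grow(right, n_features, depth - 1))

def score(tree, row):
    """Apply the tree to a new case: follow the splits down to a leaf."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

# Toy data: [sensor_reading, failure]; high readings correspond to failures.
data = [[1.0, 0], [2.0, 0], [3.0, 0], [8.0, 1], [9.0, 1]]
tree = grow(data, n_features=1)
```

With this toy data, the single best split lands between the low and high readings, and scoring a new case simply walks the splits to a leaf proportion.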
Therefore, we will proceed with building a model using the HPSPLIT procedure from the SAS Enterprise Miner™ distribution. Notably, we use the procedure option missing=branch to enable tree branches based on missing values in addition to real values. We also did not select the GPS variables, since we want our models to be based on truck sensor readings that can be applied in any location. For the complete procedure syntax, refer to the SAS documentation.
/* select list of input variables */
proc transpose data=means (drop= _TYPE_ _FREQ_ ) out=vars;
id _STAT_ ;
run ;
proc sql noprint ;
select _name_ into :vars separated by ' ' from vars
where _name_ ne 'failure' and
_name_ not contains "GPS";
quit ;
/* predict truck failure */
filename scrcode "&file.\score.sas" ;
proc hpsplit data= trucks.failures missing=branch ;
performance details ;
partition fraction (validate=0.3) ;
input &vars / level=int ;
target failure / level=nom order=ascending ;
score out= model ;
code file= scrcode ;
run ;
The results of the HPSPLIT procedure provide some clues about the failure analysis. First, we examine
the accuracy of the model. The confusion matrix shown below displays the number of cases that are
correctly and incorrectly predicted. In our sample, only six cases of failure were misidentified as
non-failures, so the accuracy of this model is very good.
Output 3. Variables Selected by the HPSPLIT Procedure
Finally, we can look at the structure of the decision tree to better understand the complexity of the model
and the order of importance of the input variables. The following tree graph shows the overall size of the
tree model in the overview box and the readable detail of the top portion of the tree. We can see that
Trip_Time_journey is the most important predictor, followed by Throttle_Pos_Manifold and Engine_RPM.
Now that we have confidence in this model, we can work on deploying it to the event stream processing
engine. Model scoring is simply the application of the model to new data to produce a score; in this case,
the model is the decision tree created in the model-building step. When we ran the HPSPLIT procedure,
we saved the scoring code to an external file named scrcode. It contains 159 lines of simple SAS DATA
step code that can be inserted between the SET statement and the RUN statement of a SAS program. A
small fragment of the score code is displayed below; the full file contains several similar fragments.
. . .
IF NOT MISSING(Trip_Time_journey) AND ((Trip_Time_journey >= 10026.2))
THEN DO;
IF NOT MISSING(Engine_RPM) AND ((Engine_RPM >= 1761.6))
THEN DO;
IF NOT MISSING(Mass_Air_Flow_Rate) AND ((Mass_Air_Flow_Rate < 51.6215))
THEN DO;
IF NOT MISSING(Accel_Pedal_Pos_D) AND ((Accel_Pedal_Pos_D >= 27.12156951))
THEN DO;
_Node_ = 18;
_Leaf_ = 8;
P_failure0 = 0;
P_failure1 = 1;
END;
ELSE DO;
_Node_ = 17;
_Leaf_ = 7;
P_failure0 = 1;
P_failure1 = 0;
END;
END;
ELSE DO;
_Node_ = 12;
_Leaf_ = 3;
P_failure0 = 1;
P_failure1 = 0;
END;
END;
. . .
The code requires the input variables listed by the model description and generates five new variables
that describe the model. _WARN_ is an indicator that the prediction function could not be computed from
the values of the input variables. _NODE_ and _LEAF_ are internal variables that identify the branches
taken for each case. P_Failure1 and P_Failure0 are the probabilities of truck failure and non-failure,
respectively. We are primarily interested in P_Failure1 and _WARN_. Higher values of P_Failure1
indicate that action should be taken to prevent truck failure. Non-empty values of _WARN_ might indicate
that one or more critical sensors have failed and action should be scheduled to repair those devices. The
score code file is stored as part of the sample code for this paper. Two additional variables for validation
data proportion have been omitted as they are not needed for this example. The DATA Step score code
is then adapted for the run-time environment.
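Because the score code is plain nested conditionals, it is straightforward to port to other runtimes. As an illustrative sketch only (Python, not the DATA step or DS2 code actually deployed), the fragment shown above translates roughly as follows; the thresholds and node numbers are taken from the fragment, and any case falling outside the displayed branch is returned as unknown:

```python
def score_fragment(event):
    """Score one event against the displayed branch of the decision tree.

    Returns (node, p_failure1), or None when the event falls outside the
    branch shown in the fragment. Missing values follow the fragment's
    IF NOT MISSING(...) logic: a missing sensor takes the ELSE branch.
    """
    t = event.get("Trip_Time_journey")
    rpm = event.get("Engine_RPM")
    maf = event.get("Mass_Air_Flow_Rate")
    ped = event.get("Accel_Pedal_Pos_D")
    if t is not None and t >= 10026.2:
        if rpm is not None and rpm >= 1761.6:
            if maf is not None and maf < 51.6215:
                if ped is not None and ped >= 27.12156951:
                    return (18, 1.0)   # leaf 8: predicted failure
                return (17, 0.0)       # leaf 7: predicted non-failure
            return (12, 0.0)           # leaf 3: predicted non-failure
    return None                        # branch not shown in the fragment

# A reading that falls deep in the failure branch
print(score_fragment({"Trip_Time_journey": 12000, "Engine_RPM": 1800,
                      "Mass_Air_Flow_Rate": 40, "Accel_Pedal_Pos_D": 30}))
# → (18, 1.0)
```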
Event Stream Processing
The score code is now deployed to the SAS Event Stream Processing engine. The SAS Event Stream
Processing engine requires code in the DS2 format. DS2 code is a modular and structured form of SAS
DATA Step that can be embedded in various run-time environments. The process for converting
acceptable DATA Step code to DS2 is fairly simple. The scorexml macro will detect the necessary input
variables and save them to an XML file. The DSTRANS procedure will convert the DATA Step code to
DS2 code and add the needed input variable declarations. We then want to test the scoring in a simple
DS2 procedure step and examine the output.
/* Create an XML file for Variable Info */
filename scrxml "&file.\score.xml";
%AAMODEL;
%scorexml(Codefile=scrcode,data=trucks.failures,XMLFile=scrxml);
proc dstrans ds_to_ds2 in="&file.\score.sas" out="&file.\score.ds2"
   EP nocomp xml=scrxml;
run;
quit;
/* Test Scoring */
libname SASEP "&file.";
proc delete data=sasep.out; run;
proc ds2 ;
%include "&file.\score.ds2";
run;
After we have validated the DS2 code and the scoring, an additional manual step is required to rename
the input and output destinations to ESP.IN and ESP.OUT, respectively.
The authors recommend using SAS Event Stream Processing Studio for creating the stream definition. In
that definition we need to define a procedural window based on DS2 code. The properties for this window
will include a text editor for DS2 code. The user needs to paste the DS2 code created in the previous
section into this procedural window. The event definition must include the input variables required by the
decision tree model and pass them to the procedural window. The window definition must add the
_WARN_ and P_Failure1 variables to the output event definition.
Display 1 shows a simple design for handling the sensor events. An on-board diagnostic device has
already aggregated the sensor data into a single data record. The SAS Event Stream Processing Studio
Data_Stream window can subscribe to the diagnostic data records, execute the predictive model scoring,
and then filter the events to the ones that have a high probability of failure. After the filter, we can add a
notification window that will post predictive failure events to a remote service such as a data center, or
store them locally until they can be retrieved by a physical location such as a maintenance shop.
Display 1. Simple Design for Handling Sensor Events in SAS Event Stream Processing Studio
By moving critical analytics to the local system, we can detect problems earlier and save huge amounts
of server processing, storage, and networking.
SAS Decision Manager can be used to construct and execute routine business decisions in both online
and batch processing environments. After an alert from a truck is received, the custom decision logic can
be used to determine the best action to take and the best way to execute that action. The core SAS
Decision Manager contains components that will help manage and monitor the predictive models, build
and manage business rules, and build and execute decision processes. These components operate in a
data center making routine business decisions and have business-friendly interfaces. They provide a
buffer between SAS Advanced Analytics and the business’s operational systems.
The complete system is depicted in Figure 4. The IoT systems are responsible for event processing and
pattern detection. Operational systems are responsible for delivering the business strategy and resources
to the IoT systems and managing the alerts and communications that are recommended. This view is
entirely representational. There are many ways to architect the system and many ways to connect the
components.
CONCLUSION
The following summarizes the four ways that processing streaming data plays a vital role in IoT
(Combaneyre, 2015):
• Detecting events of interest and triggering appropriate action
• Aggregating information for monitoring
• Cleansing and validating sensor data
• Enabling real-time predictive and optimized operations
Across industries, organizations are using SAS Event Stream Processing to send alerts and notification
triggers, monitor current situations, and improve sensor data quality in real time.
The autonomous actions of objects need to be founded in the well-established practices of humans. As
objects in the IoT are depended on to make decisions, the need to govern, manage, control, and secure
the conditional logic for each event scenario will become increasingly important. For these more
complex, real-time streaming decisions, the processing speed of SAS Event Stream Processing,
executing actions defined with the rigor of SAS Decision Management, will ensure that the real-time
actions of autonomous objects in the IoT are both right and relevant.
REFERENCES
Combaneyre, F. 2015. “Understanding data streams in IoT.” SAS White Paper. Available
at http://www.sas.com/en_us/whitepapers/understanding-data-streams-in-iot-107491.html.
Davenport, T. 2015. “#ThinkChat IoT and AoT with @ShawnRog and @Tdav” Information
Management. Available at http://en.community.dell.com/techcenter/information-
management/b/weblog/archive/2015/06/22/thinkchat-iot-and-aot-with-shawnrog-and-tdav.
Gantz, J.; Reinsel D. 2012. “The Digital Universe in 2020: Big Data, Bigger Digital Shadows,
and Biggest Growth in the Far East.” IDC IVIEW.
“IoT pushes limits of analytics”. IoTHub. February 29, 2016. Available
at http://www.iothub.com.au/news/iot-pushes-limits-of-analytics-415787.
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, D.,
Crespo, J-F., Dennison, D. 2015. “Hidden Technical Debt in Machine Learning Systems”
Proceedings of NIPS 2015. Montreal, PQ. Available at http://papers.nips.cc/paper/5656-
hidden-technical-debt-in-machine-learning-systems.pdf?imm_mid=0df22b&cmp=em-data-na-
na-newsltr_20160120.
ACKNOWLEDGMENTS
The authors would like to thank Kristen Aponte, Brad Klenz, and Dan Zaratsian for their assistance and
contributions to this paper.
RECOMMENDED READING
“Channeling Streams of Data for Competitive Advantage”, SAS White
Paper http://www.sas.com/en_us/whitepapers/channeling-data-streams-107736.html
“How Streaming Analytics Enables Real-Time Decisions”, SAS White
Paper http://www.sas.com/en_us/whitepapers/streaming-analytics-enables-real-time-decisions-
107716.html
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Fiona McNeill
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
Email: fiona.mcneill@sas.com
David Duling
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
Email: david.duling@sas.com
Steve Sparano
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
Email: steve.sparano@sas.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Paper SAS645-2017
Real-time Analytics at the Edge: Identifying Abnormal Equipment Behavior
and Filtering Data near the Edge for Internet of Things Applications
Ryan Gillespie and Saurabh Gupta, SAS Institute Inc.
ABSTRACT
This paper describes the use of a machine learning technique for anomaly detection and the SAS® Event
Stream Processing engine to analyze streaming sensor data and determine when performance of a
turbofan engine deviates from normal operating conditions. Turbofan engines are the most popular type
of propulsion engines used by modern airliners due to their high thrust and good fuel efficiency (National
Aeronautics and Space Administration 2015). For this paper, we intend to show how sensor readings
from the engines can be used to detect asset degradation and help with preventative maintenance
applications.
INTRODUCTION
The data set used is the 2008 Prognostics and Health Management (PHM08) Challenge Data Set on
turbofan engine degradation (Saxena and Goebel 2008). We use a single-class classification machine
learning technique, called Support Vector Data Description (SVDD), to detect anomalies within the data.
The technique shows how each engine degrades over its life cycle. This information can then be used in
practice to provide alerts or trigger maintenance for a particular asset on an as-needed basis. Once the
model was trained, we deployed the score code onto a thin client device running SAS® Event Stream
Processing to validate scoring the SVDD model on new observations and simulate how the SVDD model
might perform in Internet of Things (IoT) edge applications.
Data can be filtered and processed by an event streaming engine closer to the source. Then you can use
only the relevant data, in the proper format, both to train your model and to make your predictions or
alerts.
APPLICATION OF SVDD
To illustrate how SVDD can be applied to a predictive maintenance scenario, we used the algorithm on
the 2008 Prognostics and Health Management (PHM08) Challenge Data Set on turbofan engine
degradation (Saxena and Goebel 2008). The data set consists of examples of simulated turbofan engine
degradation that were used for a data challenge competition at the 1st international conference on
Prognostics and Health Management.
APPLYING SUPPORT VECTOR DATA DESCRIPTION TO THE PROBLEM
The Support Vector Data Description algorithm was applied to the problem to help determine when the
time series is beginning to deviate from normal operating conditions. The output measurement of the
algorithm provides a scored metric that can be used to assess the degradation of the engine and help put
in place preventative measures before the failure point.
To train the model, we sampled data from a small set of engines within the beginning of the time series
that we assumed to be operating under normal conditions. As previously noted, the SVDD algorithm is
constructed using the normal operating conditions for the equipment or system. It can also handle various
states of normal operating conditions. For example, a haul truck within a mine might have very different
sensor data readings when it is traveling on a flat road with no payload and when it is traveling up a hill
with ore. However, both readings represent normal operating conditions for the piece of equipment.
With this in mind, we randomly sampled 30 of the 218 engines in the data set to build the SVDD model.
For each of the 30 sampled engines, the first 25% of its measurements were used to train the model, on
the assumption that the data within this region reflected normal operating conditions. This resulted in a
training set of 1,512 observations out of the total of 45,918.
It should be noted that examination of the three operational setting variables indicated that there were six
different operational setting combinations within the data set. Given that the algorithm is flexible enough
to accommodate varying operating conditions, no additional indicator flags or pre-processing work was
performed on the data to model the different operating conditions.
The model was trained using the svddTrain action from the svdd action set within SAS Visual Data Mining
and Machine Learning. The ASTORE scoring code generated by the action was then saved to be used
to score new observations using SAS Event Stream Processing on a gateway device.
SAMPLE RESULTS
The scoring results from the hold-out data set illustrate the degradation in the engines captured by using
the SVDD model. Four random samples were taken from the 188 scored engines with their SVDD scored
distance plotted versus the number of cycles. This is shown in Figure 1, Sample SVDD Scoring Results.
As seen in the figure, each engine shows a relatively stable normal operating state for the first portion of
its useful life, followed by a sloped upward trend in the distance metric leading up to the failure point.
This upward trend in the data indicates that the observations are moving further and further from the
centroid of the normal hypersphere created by the SVDD model. As such, the engine operating
conditions are moving increasingly further from normal operating behavior.
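The actual SVDD model learns a hypersphere in a kernel feature space; as a much-simplified illustrative stand-in (a Python sketch, not the svddTrain action used in the paper), scoring each observation by its distance from the centroid of the normal training data conveys the same intuition: observations drifting from normal produce growing distances.

```python
import math

def fit_center(normal_rows):
    """Centroid of the normal-condition training data (a simplified
    stand-in for the SVDD hypersphere center, which in the real model
    lives in a kernel feature space)."""
    n = len(normal_rows)
    dims = len(normal_rows[0])
    return [sum(r[i] for r in normal_rows) / n for i in range(dims)]

def distance(center, row):
    """Scored distance of a new observation from the normal center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(center, row)))

# Invented two-sensor readings assumed to be normal operation
normal = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
c = fit_center(normal)          # centroid of normal readings
print(distance(c, [1.1, 2.0]))  # small distance: near normal
print(distance(c, [5.0, 9.0]))  # large distance: drifting from normal
```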
With increasing distance indicating potential degradation, an alert can be set to be triggered if the scored
distance begins to rise above a pre-determined threshold or if the moving average of the scored distance
deviates a certain percentage from the initial operating conditions of the asset. This can be tailored to the
specific application that the model is used to monitor.
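Such a rule can be sketched in a few lines (illustrative Python only; the window size, baseline length, and 50% deviation threshold are assumed values, not from the paper):

```python
from collections import deque

def alert_stream(distances, window=5, baseline_n=5, pct=0.5):
    """Return one flag per scored distance: True when the moving average
    exceeds the baseline (from initial operating conditions) by more
    than `pct`, e.g. 50%."""
    baseline = sum(distances[:baseline_n]) / baseline_n
    recent = deque(maxlen=window)
    flags = []
    for d in distances:
        recent.append(d)
        avg = sum(recent) / len(recent)
        flags.append(avg > baseline * (1 + pct))
    return flags

scores = [1.0, 1.1, 0.9, 1.0, 1.0,   # stable normal operation
          1.2, 1.6, 2.1, 2.8, 3.5]   # upward trend toward failure
print(alert_stream(scores))
```

Only the last few readings trip the alert, once the moving average has risen well above the initial baseline.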
CONCLUSION
Anomaly detection can be a useful tool to detect asset degradation and help with preventative
maintenance efforts. In this paper, we discuss how we applied a single-class classification technique
called Support Vector Data Description to monitor how turbofan engines degrade from normal operating
conditions. Given the potential use of real-time anomaly detection for Internet of Things applications, we
also tested scoring the model on a gateway-type device to mimic application in the field. The results of
the model on new data show visual trends indicating the degradation with the turbofan engines used in
the example.
REFERENCES
National Aeronautics and Space Administration. “Turbofan Engine.” Retrieved February 28, 2016,
from https://www.grc.nasa.gov/www/k-12/airplane/aturbf.html
A. Saxena and K. Goebel. 2008. "PHM08 Challenge Data Set." NASA Ames Prognostics Data Repository
(http://ti.arc.nasa.gov/project/prognostic-data-repository), NASA Ames Research Center, Moffett Field,
CA. Retrieved February 24, 2017.
Chaudhuri, Arin, Deovrat Kakde, Maria Jahja, Wei Xiao, Seunghyun Kong, Hansi Jiang, and Sergiy
Peredriy. 2016. “Sampling Method for Fast Training of Support Vector Data Description.” eprint
arXiv:1606.05382.
Cisco. “The Internet of Things.” Retrieved February 25th, 2017, from
http://www.cisco.com/c/dam/en_us/solutions/trends/iot/docs/iot-aag.pdf
GE and Accenture. “Industrial Internet Insights Report for 2015.” Retrieved February 25, 2017, from
https://www.ge.com/digital/sites/default/files/industrial-internet-insights-report.pdf
IDC. “Connecting the IoT: The Road to Success.” Retrieved February 25th, 2017, from
http://www.idc.com/infographics/IoT
Intel. “A Guide to the Internet of Things Infographic.” Retrieved February 25th, 2017, from
http://www.intel.com/content/www/us/en/internet-of-things/infographics/guide-to-iot.html
Telefonica. “Connected Car Industry Report 2014.” Retrieved February 25, 2017, from
https://iot.telefonica.com/multimedia-resources/connected-car-industry-report-2014-english
ACKNOWLEDGMENTS
Thanks to Seunghyun Kong, Dev Kakde, Allen Langlois, and Yiqing Huang whose code contributions and
help made this paper possible. And also, thanks to Robert Moreira for suggestions and input on the ideas
in the paper.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Ryan Gillespie
SAS Institute Inc.
Ryan.Gillespie@sas.com
Saurabh Gupta
SAS Institute Inc.
Saurabh.Gupta@sas.com
Paper SAS6431-2016
Modernizing Data Management with Event Streams
Evan Guarnaccia, Fiona McNeill, Steve Sparano, SAS Institute Inc.
ABSTRACT
Specialized access requirements to tap into event streams vary depending on the source of the events.
Open-source approaches from Spark, Kafka, Storm, and others can connect event streams to big data
lakes, like Hadoop and other common data management repositories. But a different approach is needed
to ensure that latency and throughput are not adversely affected when processing streaming data – that
is, that the system can scale. This talk distinguishes the advantages of adapters and connectors and
shows how SAS®
Event Stream Processing can leverage both Hadoop and YARN technologies to scale while still meeting
the needs of streaming data analysis and large, distributed data repositories.
INTRODUCTION
Organizations are tapping into event stream data as a new source of detailed, granular data. Event
streams provide new insights in real time, helping organizations improve current situational awareness,
respond to current situations with extremely low latency, and improve predictive estimates for proactive
intervention.
When an organization begins to use streaming data, or when it decides to enhance its real-time
capabilities and offerings, many things need to be taken into account to ensure that scaling to the high
throughput (hundreds of thousands of events per second and more) can be achieved.
An easy place to start would be the three Vs of big data: Volume, Variety, and Velocity. As the Internet of
Things continues to grow, a natural increase in the volume and variety of data follows. Sensors have
become smaller and cheaper than ever and it makes sense that businesses would want to take
advantage of this detailed data to monitor activity more closely and on shorter time scales, thereby
leading to an increase in the velocity of high volume streaming data. Customers are interacting with
businesses in ways they never have before, such as through apps, social media sites, online forums, and
more. This has led to unstructured text becoming an important and valuable source of data along with
more traditional types of data. Variety isn’t just associated with unstructured content in event streams but
also relates to the different formats of data emanating from sensors, applications, and machines –
transmitting event data at different intervals, formats, and levels of consistency.
SAS Event Stream Processing offers the flexibility, versatility, and speed to be able to tackle these issues
and adapt as the landscape of the Internet of Things (IoT) changes. Whether a single event stream or
multiple event streams are ingested for insights, scaling to this fast and big data will separate those
organizations that are successful in putting event streams to use for organizational advantage from those
that become swamped by the pollution from their overflowing data lakes.
compared to a short, retained history of live activity – that is, events that have just occurred in a limited
window of time (of course, based on live data, versus data that’s been stored). This means that standard
data manipulation and analysis tasks that require more than a single event value can readily be
calculated in live, streaming data using aggregate windows. Given this, a host of insights and tasks can
be accomplished in streaming event data – before it is even stored to disk – promoting scalable insights
to even the most complex, big, and high throughput streaming data sources.
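The idea of computing over a retained window of recent events can be sketched as follows (illustrative Python, not SAS Event Stream Processing syntax; a count-based window of an assumed size):

```python
from collections import deque

class AggregateWindow:
    """Toy count-based aggregate window: keeps only the last `size`
    events and emits a running mean over that retained history,
    mimicking in-stream aggregation without storing data to disk."""
    def __init__(self, size):
        self.events = deque(maxlen=size)

    def insert(self, value):
        self.events.append(value)        # oldest event falls out at maxlen
        return sum(self.events) / len(self.events)

w = AggregateWindow(size=3)
print([w.insert(v) for v in [2.0, 4.0, 6.0, 8.0]])  # → [2.0, 3.0, 4.0, 6.0]
```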
SCALING ANALYTICS
One goal of analytical event stream processing is current situational awareness of existing conditions –
for example, to ask: are current events outside of normal operating parameters? As such, continuous
queries often focus on changes in events, those that deviate from normal conditions. If all is normal, then
no further action is required. However, if events aren’t normal, then real-time alerts, notifications, and
actions are issued to further investigate or react to the abnormal activity. Within SAS Event Stream
Processing, you can detail the alert conditions, message, and recipient information directly in the Studio
interface, as illustrated in Figure 1.
Figure 1 shows alerts included in stream processing (right), alert channel and recipient details (upper),
and real-time condition details (lower).
When a pre-defined tolerance threshold is crossed is determined by the rules in the system. For
example, in statistical process control, the Western Electric Rules stipulate decisions as to whether a
process should be investigated for issues in a manufacturing or other controlled setting. These rules
signify when an event, or a calculation based on an event, is relevant. This relevance can be highlighted
in dashboards, used to trigger operational processes, or sent as alerts to other applications listening for
these events of interest. Any combination of rules and analytical algorithms is possible with SAS Event
Stream Processing, so you can devise and adjust your scenario definitions and tolerance thresholds,
defined as rules, directly in the Studio interface.
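As a concrete example, the first Western Electric rule flags any single point more than three standard deviations from the center line. A minimal sketch (illustrative Python, with made-up measurements):

```python
def we_rule1(values, center, sigma):
    """Western Electric Rule 1: flag any single point that falls more
    than three standard deviations from the process center line."""
    return [abs(v - center) > 3 * sigma for v in values]

# Invented process readings with center 10.0 and sigma 1.0
print(we_rule1([10.1, 9.8, 13.5, 10.0], center=10.0, sigma=1.0))
# → [False, False, True, False]
```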
SAS Event Stream Processing also allows historic data (called into memory) to be assessed in tandem
with live event stream processing, enabling evaluations based on summary conclusions from existing
knowledge bases, like a customer segmentation score. Such lookups, based on data stored in offline
repositories (also known as “data at rest”), open another suite of conditional assessments that can be
made from streaming event data, like last action taken, pre-existing risk assignment, or likelihood of
acceptance.
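Such a lookup can be pictured as an in-memory table that enriches each streaming event (an illustrative Python sketch; the customer IDs and segment values are invented):

```python
# Simplified stand-in for joining streaming events with "data at rest":
# an in-memory lookup table (e.g. customer segmentation scores computed
# offline) enriches each event as it flows through.
segments = {"C001": "gold", "C002": "standard"}   # hypothetical lookup table

def enrich(event, lookup):
    """Return a copy of the event with the at-rest segment attached."""
    out = dict(event)
    out["segment"] = lookup.get(event["customer_id"], "unknown")
    return out

print(enrich({"customer_id": "C001", "amount": 42.0}, segments))
```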
Figure 2 - Learning Algorithms that Score Streaming Events in SAS Event Stream Processing Studio
As an embeddable engine, SAS Event Stream Processing can also be pushed to extremely small
compute environments – out to the edge of the IoT, like compute sticks and gateways. At the edges of
the IoT, and even on individual sensors, event stream data is limited by the transmitting or gateway
device; as such, some analytical processing will make sense, while algorithms that require a variety of
inputs won’t. As you move away from the edge, more data is available for more complex decision
making. And out-of-stream analysis, based on stored data (for example, in Hadoop), will have the
extensive history upon which investigation and analytic development can happen. SAS Event Stream
Processing can be used throughout: pushing data cleansing, rules, and machine learning algorithms to
the edge, learning models in-stream, and including score code from algorithms developed from data
repositories, scaling to all phases of analytic need.
Wide Data
It is very common for streaming data sets to be wide, with many fields per record, especially when
multiple event streams are ingested into the same processing flow. Since the time it takes to process an
event scales linearly with the number of fields, it makes sense to eliminate unneeded fields as early as
possible. This can easily be done in a compute window, where the user can specify which fields from the
previous window will be used going forward. In addition to reducing the number of fields to be processed,
compute windows can also be used to change data types and key fields. Ideally, a compute window is
defined directly after the source window (the latter being the window that ingests event stream data), so
that minimal time is spent processing unneeded event fields.
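The effect of an early compute window can be sketched as a simple projection of each event onto the fields used downstream (illustrative Python only; the field names are invented):

```python
def compute_window(event, keep):
    """Project an event onto the fields used downstream, mimicking a
    compute window placed directly after the source window so that
    later windows never touch the unneeded fields."""
    return {k: event[k] for k in keep if k in event}

# A wide incoming event; GPS fields are dropped early
wide = {"id": 7, "temp": 98.2, "rpm": 1500, "gps_lat": 35.8, "gps_lon": -78.8}
print(compute_window(wide, keep=["id", "temp", "rpm"]))
# → {'id': 7, 'temp': 98.2, 'rpm': 1500}
```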
Reduce Complexity
SAS Event Stream Processing is able to perform complex operations on data, but sometimes this is at
the cost of latency. String parsing is an example of an operation that can increase latency. Such
processing can be done using the SAS® Expression Language or user-defined functions written in C.
The SAS Expression Language is simpler to use, but custom functions typically run faster.
Pattern Compression
The user defines a pattern of interest, which will most likely consist of multiple events. When an event
arrives at the pattern window, the window holds that event while it waits for the other events that
comprise the pattern of interest. As the number of these partially matched patterns increases, memory
usage can grow quite large. The impact of this can be offset by enabling the pattern compression feature.
By compressing unfinished patterns, memory usage can be reduced by up to 40% with the cost of a slight
increase in CPU usage.
To increase throughput, projects can be spread across network interfaces. One option is to connect SAS Event
Stream Processing projects in a star schema, where many projects are taking in data from the edge,
aggregating it down to desired elements and performing any preprocessing - all connected to a central
continuous query, which ingests the prepared data and performs the desired operations. It should be
noted that retaining state in source windows can affect throughput.
Events are grouped into event blocks, consisting of zero or more events, when they are first ingested
through a source window. Using larger event blocks helps increase throughput rates during publish and
subscribe actions. At times, however, event blocks should contain only one event – such as when
aggregate statistics are being joined with incoming events. If an event block contained an insert and
multiple updates, they would be collapsed to a single insert containing the most recent values, and the
aggregate statistics would not accurately reflect the stream of incoming events.
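The collapsing behavior described above can be sketched as follows (illustrative Python; the (opcode, fields) representation of an event block is an assumption for the example):

```python
def collapse_block(block):
    """Collapse an event block holding an insert followed by updates for
    the same key into a single insert carrying the most recent field
    values, so intermediate updates are no longer visible downstream."""
    merged = {}
    for _op, fields in block:   # opcodes are ignored; latest values win
        merged.update(fields)
    return [("insert", merged)]

block = [("insert", {"id": 1, "price": 10.0}),
         ("update", {"id": 1, "price": 10.5}),
         ("update", {"id": 1, "price": 11.0})]
print(collapse_block(block))   # → [('insert', {'id': 1, 'price': 11.0})]
```

The two intermediate prices vanish in the collapsed block, which is exactly why aggregates computed from collapsed blocks would misrepresent the original stream.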
Figure 3 –SAS Event Stream Processing Integration with YARN.
As shown in Figure 3, the ESP application has been started and is running in three YARN containers,
managed by the YARN Node Manager. This allows YARN to manage the ESP servers running on the
various nodes – controlling startup and shutdown for a seamless and scalable processing environment –
and the number of requested nodes can be increased when additional processing resources are needed.
The YARN plug-in supports commands through dfesp_yarn_joblauncher for requesting and launching
YARN cores and memory, as noted in Figure 4.
Using the SAS Event Stream Processing Application Master interface, shown in Figure 5, the SAS Event
Stream Processing XML factory server, “qsthdpc03”, is shown running in the YARN-managed Hadoop
environment, using the noted http-admin, pubsub, and http-pubsub ports, as well as the requested virtual
cores and memory.
Figure 5 - SAS Event Stream Processing Application Master Screen
By using the http-admin port defined for the running ESP server, “qsthdpc03”, commands are used to
load ESP project “test_pubsub_index” through dfesp_xml_client into the running ESP XML factory server,
depicted in Figure 6.
To deliver additional processing for higher throughput, a second SAS Event Stream Processing factory
server is started. This environment can be discovered and managed using the consul service to monitor
its performance characteristics. As shown in Figure 7, a second server, “qsthdpc02” is running and
available as an additional resource within YARN for event processing.
Figure 7 - SAS Event Stream Processing Application Master running an Additional Server for Higher
Throughput
The consul service view provides the health check information about the publish, subscribe, and HTTP
Admin interfaces for “qsthdpc02” as well as other SAS Event Stream Processing servers running and
managed by the YARN resource manager on Hadoop, as shown in Figure 8.
ENTERPRISE CONSIDERATIONS
With the approaches outlined above, we can see how SAS Event Stream Processing supports scaling to
meet the processing demands of the enterprise for latency as well as throughput without sacrificing either.
These important performance considerations are not the only factors relevant to scalability. In any
technology, there are other overriding factors needed to deliver reliability, productivity, flexibility,
governance, and security.
A complete ecosystem for managing and governing the code for a successful streaming analytics
environment is needed for enterprise applications. Many open-source tools provide components that
deliver aspects of streaming performance but tend to lack the capabilities needed for a complete
business solution. A complete business solution includes scalability, reliability, and governance –
aspects that ensure a solution supports the enterprise’s needs both today and tomorrow, scaling to new
problems and data volumes.
RELIABILITY
Today’s IT infrastructures require that event streams are processed in a reliable manner and are
protected against any loss of data or any reduction in IT’s service level agreements.
SAS Event Stream Processing provides a robust and fault-tolerant architecture that minimizes data
loss and delivers exceptionally reliable processing to maximize uptime. SAS does this by using proven
technologies, like message buses and solution orchestration frameworks, that ensure
message deliverability while avoiding performance hits on the SAS Event Stream Processing engine.
This translates to a solution that is reliable and that supports failover: the ability of a stand-by
component to take over processing when an active component fails.
Failover architectures are essential to any system that demands minimal data loss. SAS Event Stream
Processing has a patented approach, a 1+N-Way Failover architecture, as illustrated in Figure 9.
Figure 9 - SAS Event Stream Processing 1+N-Way Failover
As shown in Figure 9, the failover architecture allows the SAS Event Stream Processing engine,
subscribers, and publishers to be oblivious to failover and any occurrences thereof. The product pushes
the failover responsibilities and knowledge to the (prebuilt) publish and subscribe APIs and clients. The
APIs and clients are, in turn, complemented by third-party message bus technology (from Solace
Systems, Tervela, or RabbitMQ). This architecture has the benefit of flexibility, as it allows failover to be
introduced without requiring the publishers and subscribers to be changed or recompiled.
In this approach, “N” (in 1+N-Way Failover) refers to the ability to support more than one stand-by
engine at a time, so you can get as close to zero downtime as the deployment affords. All
event streams are published to both the active and stand-by SAS Event Stream Processing engines as
part of standard processing. Only the active SAS Event Stream Processing engine forwards subscribed
event streams to the message bus (for subscribers). If the message bus detects a dropped active
connection or a missed “I'm alive” signal, it appoints a stand-by engine as the new active engine and
tells it the event block IDs at which to begin for the subscribers. Because the new active engine has
kept a running queue of subscribed event streams, it uses that queue to resume forwarding events to the subscribers.
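A conceptual sketch can make these mechanics concrete. The Python toy below is written for illustration only; all class and method names are assumptions, not the product's API. It mimics the flow: every engine receives each published event block, only the active engine forwards to subscribers, and on a missed heartbeat the bus appoints a stand-by that resumes from its own running queue.

```python
class Engine:
    def __init__(self, name):
        self.name = name
        self.active = False
        self.queue = []                  # running queue of event block IDs

    def receive(self, block_id):
        self.queue.append(block_id)      # stand-bys also keep this queue

    def forward_from(self, start_id):
        # Only called on the active engine: forward queued blocks to
        # subscribers, starting at the requested event block ID.
        return [b for b in self.queue if b >= start_id]

class MessageBus:
    def __init__(self, engines):
        self.engines = engines
        self.engines[0].active = True
        self.last_forwarded = 0

    def publish(self, block_id):
        # Event streams go to both active and stand-by engines.
        for e in self.engines:
            e.receive(block_id)
        active = next(e for e in self.engines if e.active)
        self.last_forwarded = block_id
        return active.forward_from(block_id)

    def on_missed_heartbeat(self):
        # Appoint a stand-by as the new active engine; its running queue
        # lets it resume forwarding with no replay for subscribers.
        failed = next(e for e in self.engines if e.active)
        failed.active = False
        standby = next(e for e in self.engines if e is not failed)
        standby.active = True
        return standby.forward_from(self.last_forwarded + 1)
```

Because the stand-by has been receiving every block all along, the handover needs no replay of events, which is the property the paper highlights.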
SAS has another patent pending for guaranteed delivery. With this approach, a publisher callback is
notified when event blocks are received by one or more identified subscribers within a configured time
window. The same callback function is notified if this does not occur, so the publisher determines how to
handle that situation. This is done asynchronously and without persistence, so performance is not affected.
As a result, failover is instantaneous and automatic, with no loss or replay of events, no performance
degradation, and a reliable environment that meets IT's needs.
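The guaranteed-delivery callback can be illustrated with a small sketch. This is not the SAS publish/subscribe API: the names and the explicit `now` clock are assumptions made so the logic is easy to follow.

```python
# Illustrative sketch of a guaranteed-delivery callback: the same callback
# fires with success when all identified subscribers acknowledge a block
# within the time window, and with failure otherwise. No block content is
# persisted; only the deadline and outstanding acknowledgments are tracked.

class GuaranteedPublisher:
    def __init__(self, subscribers, window_secs, callback):
        self.subscribers = set(subscribers)
        self.window = window_secs
        self.callback = callback       # notified on success AND failure
        self.pending = {}              # block_id -> (deadline, outstanding)

    def publish(self, block_id, now):
        self.pending[block_id] = (now + self.window, set(self.subscribers))

    def ack(self, block_id, subscriber, now):
        deadline, outstanding = self.pending.get(block_id, (None, None))
        if outstanding is None:
            return
        outstanding.discard(subscriber)
        if not outstanding and now <= deadline:
            del self.pending[block_id]
            self.callback(block_id, True)   # all acknowledged in time

    def check_timeouts(self, now):
        for block_id, (deadline, outstanding) in list(self.pending.items()):
            if now > deadline:
                del self.pending[block_id]
                self.callback(block_id, False)  # same callback, failure case
```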
PRODUCTIVITY
Developing models to ingest, analyze, and emit events can be a complicated task when you consider the
various window types, the sophisticated analysis, testing the model, and reporting results. All of this is
required to ensure that the model produces the desired actions. SAS Event Stream Processing provides
tools to assist with the development of models for processing event streams.
Model Design
SAS Event Stream Processing Studio is a design-time environment that supports the development of
engines, projects, and continuous queries. These components form a hierarchy for the model building
environment, as illustrated in Figure 10. The studio is one of the three ways to build a model in SAS
Event Stream Processing.
SAS ESP has three modeling approaches that are 100% functionally equivalent, giving developers
the flexibility they need to develop, test, and implement streaming models. These approaches are:
• C++: a C++ library that can be used to build and execute ESP engines.
• XML: XML syntax to define ESP engines or projects via an XML editor.
• Graphical: SAS Event Stream Processing Studio, a browser-based development environment
using a drag-and-drop interface to define ESP models, either engines or projects.
Sophisticated event stream processing models are defined using the hierarchy depicted in Figure 10 -
SAS Event Stream Processing Model Hierarchy. The top level represents the engine, and within an engine, one or
more projects can be created, allowing flexibility in how the events are processed. Different projects
can be coordinated so that events are delivered from one project to another for processing.
Finally, within a project, the SAS Event Stream Processing Studio interface supports continuous queries,
where events are processed using the window types available from a menu (window types are shown in
Figure 11).
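The engine, project, continuous-query, and window hierarchy can be pictured as a simple nested data structure. This is an illustrative sketch only; the class and field names are assumptions, not the SAS object model.

```python
from dataclasses import dataclass, field

@dataclass
class Window:
    name: str
    kind: str        # e.g. "source", "filter", "aggregate", "procedural"

@dataclass
class ContinuousQuery:
    name: str
    windows: list = field(default_factory=list)

@dataclass
class Project:
    name: str
    queries: list = field(default_factory=list)

@dataclass
class Engine:
    name: str
    projects: list = field(default_factory=list)

# One engine can hold several projects; each project holds continuous
# queries, whose windows do the actual event processing.
engine = Engine("esp_engine", projects=[
    Project("test_project", queries=[
        ContinuousQuery("cq1", windows=[
            Window("position_in", "source"),
            Window("filter_speed", "filter"),
        ]),
    ]),
])
```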
Typically, SAS Event Stream Processing Studio is used to quickly build streaming models that include the
flow of data from source windows through to the processing windows for pattern matching, filtering,
aggregations, and analytics. The drag-and-drop interface supports rapid event stream model
development and doesn’t require any XML or C++ coding to deploy these models. Of particular note, the
Procedural window is how SAS analytical models are introduced into the event streaming data flow.
Once the design is complete, the user can test the models from within the interface.
Figure 11 - SAS Event Stream Processing Window Types
SAS Event Stream Processing Streamviewer allows for rapid model iterations by visualizing the trends in
the streaming data, and eliminates the need to build a custom visualization tool for testing.
FLEXIBILITY
Given the flexibility and power of the engine, continuous queries, and streaming window types, streaming
models can be complex designs that introduce branching, left and right joins, schema copies, pattern
matching, and analytics. All of these moving parts can be difficult to orchestrate, and as with any complex
design, can be difficult to communicate to other teams in the organization. This can introduce delays and
risk as teams struggle with describing stream processing designs to other groups in a way that ensures
all parties understand the solution as well as their specialized involvement. Additionally, when skilled
resources are scarce, and design logic expertise is in short supply, a visual representation of a model
achieves a common, clear, and effective means to communicate a complex model design to others. The
visual format of SAS Event Stream Processing Studio is often valued by teams for precisely this reason,
providing an easily consumable design specification and thus reducing such risks.
Adapters
Adapters are stand-alone executable programs that use the publish/subscribe API. Adapters can also be
networked to allow for coordination between different input and output streams of data.
The available adapters, by API language, are:
• C++: Database, Event Stream Processor, File and Socket, IBM WebSphere MQ, PI, Rabbit MQ,
SMTP Subscriber, Sniffer Publisher, Solace Systems, Teradata Subscriber, Tervela Data Fabric
• Java: HDAT Reader, HDFS (Hadoop Distributed File System), Java Message Service (JMS),
SAS LASR Analytic Server, REST Subscriber, SAS Data Set, Twitter Publisher
Table 1 - SAS Event Stream Processing Studio Adapters (C++ and Java)
Connectors
Similarly, SAS Event Stream Processing connectors can also be created and integrated using the Java
and C++ APIs. Connectors are “in process”, meaning that they are built into the model at design time.
In contrast, adapters can be started or stopped at any time, even remotely. Connectors use the SAS
Event Stream Processing publish/subscribe API to do one of the following:
• Publish event streams into source windows. Publish operations do the following, usually
continuously:
o read event data from a specified source
o inject that event data into a specific source window of a running event stream processor
• Subscribe to window event streams. Subscribe operations write output events from a window of a
running event stream processing engine to the specified target (usually continuously).
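The two connector roles above can be sketched in a few lines. This is a conceptual Python illustration, not the actual C++/Java publish/subscribe API; all names are assumptions made for the sketch.

```python
class SourceWindow:
    """Toy stand-in for a source window of a running engine."""
    def __init__(self):
        self.events = []

    def inject(self, event):
        self.events.append(event)

    def output_events(self):
        return list(self.events)

def publish_connector(read_source, window):
    # Publish side: read event data from a specified source and inject it
    # into a specific source window (usually a continuous loop).
    for event in read_source():
        window.inject(event)

def subscribe_connector(window, write_target):
    # Subscribe side: write output events from a window to the specified
    # target (usually a continuous loop).
    for event in window.output_events():
        write_target(event)
```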
The SAS Event Stream Processing connectors include:
• Database
• Project Publish (Inter-ESP Project)
• File and Socket
• IBM WebSphere MQ
• PI
• Rabbit MQ
• SMTP Subscribe
• Sniffer Publish
• Solace Systems
• Teradata Connector
• Tervela Data Fabric
• TIBCO Rendezvous (RV)
Each of these connectors supports various specific formats.
Taken as a whole, these streaming data connectors and adapters offer a robust set of
pre-built routines to ingest streaming data and deliver outputs, along with an extensible
framework for new event stream sources.
Given that each data management approach is different, SAS Event Stream Processing supports large
data stores subscribing to fast-moving streams. The data can be landed in Hadoop distributions such as
IBM BigInsights, MapR, Cloudera, and Hortonworks.
GOVERNANCE
Deployment of event streaming models to various targets requires version control, configuring publishing
targets, and updating models dynamically to minimize interruption of service to both event publishers and
subscribers.
SAS Event Stream Processing provides support to manage versions and to publish changes to models as
illustrated in Figure 14. Changes to models can be scripted using plan files that not only support changes
to the deployed streaming models, but also coordinate the loading of the updated model to a running SAS
Event Stream Processing engine. Plan files also orchestrate the adapters that inject events into the
updated model and validate that the model is syntactically correct.
This support is enabled by the configuration of XML plan files (see Figure 15) that manage these changes
at publish time. These plan files enable reuse for streamlined operations and governance while ensuring
flexibility for managing multiple models. This automation provides consistency and repeatability across
operational scenarios.
Figure 15 - SAS Event Stream Processing Example Plan File
In conjunction with the support for plan files to update and publish new models, developers can use the
SAS Event Stream Processing engine's dynamic service change feature to change models on the fly
without taking the SAS Event Stream Processing server down, ensuring constant uptime for always-on
streaming applications. Specifically, users can add, remove, or change windows as part of these dynamic
updates. Dynamically changing models on a running XML factory server, without bringing down the
project or significantly affecting service processing, improves business agility.
SAS Event Stream Processing manages such dynamic changes without losing existing state, where
possible, and can propagate retained events from parent windows into newly added windows. If the new
streaming model design changes a given window, then most likely the state is no longer meaningful and
will be dropped.
The implication is that new analytic score code can be dynamically updated into deployed streaming
models as the need arises, so that analytics are refreshed on an as-needed basis while governed in a
controlled manner.
SECURITY
Data streams can include sensitive data, requiring that data be secured in flight as well as during
processing. This, in turn, requires that the publishers (delivering data to SAS Event Stream
Processing), the subscribers (to the processed events), and the event data stored in memory during
processing all be secured. SAS protects in-memory event data from unauthorized access.
The prior release of SAS Event Stream Processing, version 3.1, provided encryption of data streams both
to and from the SAS Event Stream Processing engine (both publish and subscribe) using OpenSSL when
communicating between client and server. The OpenSSL option is available when using the SAS® Event
Stream Processing System Encryption and Authentication Overlay provided with the product.
SAS Event Stream Processing 3.2 introduced an optional authentication between the clients and servers
to ensure more secure access to the product’s network interfaces such as the XML server API and the
Java/C Pub/Sub APIs and adapters. This was also extended to SAS Event Stream Processing
Streamviewer.
CONCLUSION
The best utilization of Hadoop and other big data lakes for streaming data is achieved when a strategic
approach is adopted, one that doesn't pollute them with dirty or irrelevant noise. Direct integration with
YARN helps scale for higher throughput by using the distributed processing framework of Hadoop.
Scaling to examine streaming data once it is landed in a big data repository is, however, only one
consideration when scaling for big and fast data.
There is a balance to be struck between what is best done as part of event stream processing, before
event data is stored, and what is more appropriately done once it is landed in Hadoop. Many advanced
analytical models require a rich history to appropriately model the desired behavior. Hadoop, as a popular
big data environment, is ideal for such in-depth SAS analysis, and is often the appropriate place to build
and define SAS DS2 score code to be embedded in SAS Event Stream Processing.
SAS Event Stream Processing can ingest, cleanse, analyze, aggregate, and filter data while it’s still in
motion – helping channel only relevant big data to such ‘data at rest’ repositories for in-depth diagnostics
and investigation. Event streams can be assessed as they are sourced, filtering out irrelevant noise,
saving network transport loads, and focusing downstream efforts on what’s relevant.
Configuring a solution that addresses high-throughput volumes with low-latency response times, and that
can successfully ingest data streams and provide the answers the business needs, depends on both
the infrastructure environment and the event stream processing model itself. The SAS Event
Stream Processing integration with YARN enables a dynamic linkage of these two technologies. This
extends the service management power of YARN in Cloudera, Hortonworks, and other
Hadoop environments, while SAS reduces the data stream volume, generates immediate insights, and
balances resources for an optimized business solution.
For enterprise adoption, additional scaling considerations beyond in-memory environments, analytics,
throughput, and latency are also core to successful event stream processing deployments. In an
ever-changing business climate, event stream processing applications also need to be reliable,
productive, flexible, governed, and secure. SAS Event Stream Processing provides the agility needed to
scale, both for organizations extending their existing SAS knowledge by tapping into the new sources of
insight that event streams provide, and for those already tackling the IoT frontier.
ACKNOWLEDGMENTS
The authors would like to acknowledge the guidance and assistance of SAS colleagues Jerry Baulier,
Scott Kolodzieski, Fred Combaneyre, Vince Deters, Yiqing Huang, and Yin Dong for their direction and
support of this paper.
The authors would also like to acknowledge that Figure 3 of this paper was jointly crafted in partnership
with Hortonworks – initially defined to illustrate SAS Event Stream Processing YARN integration with the
Hortonworks Data Platform.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Evan Guarnaccia
SAS Inc.
Evan.Guarnaccia@sas.com
Fiona McNeill
SAS Inc.
Fiona.McNeill@sas.com
Steve Sparano
SAS Inc.
Steve.Sparano@sas.com
Paper SAS395-2017
Location Analytics: Minority Report Is Here—Real-Time Geofencing Using
SAS® Event Stream Processing
Frederic Combaneyre, SAS Institute Inc.
ABSTRACT
Geofencing is one of the most promising and exciting concepts that has developed with the advent of the
Internet of Things. Like John Anderton in the 2002 movie “Minority Report,” you can now enter a mall and
immediately receive commercial ads and offers based on your personal taste and past purchases.
Authorities can track vessels’ positions and detect when a ship is not in the area it should be, or they can
forecast and optimize harbor arrivals. When a truck driver breaks from the route, the dispatcher can be
alerted and can act immediately. And there are countless examples from manufacturing, industry,
security, or even households. All of these applications are based on the core concept of geofencing,
which consists of detecting whether a device’s position is within a defined geographical boundary.
Geofencing requires real-time processing in order to react appropriately. In this session, we explain how
to implement real-time geofencing on streaming data with SAS® Event Stream Processing and achieve
high-performance processing, in terms of millions of events per second, over hundreds of millions of
geofences.
INTRODUCTION
One of the most important underlying concepts behind all location-based applications is called
geofencing. Geofencing is a feature of an application that defines geographical boundaries. A geofence
is a virtual barrier. So, when a device enters (or exits) the defined boundaries, an action is immediately
triggered based on specific business needs.
One of the early commercial uses of geofencing was in the livestock industry: a handful of cattle in
a herd would be equipped with GPS units, and if the herd moved outside the geographic boundaries set by
the rancher, the rancher would receive an alert.
What applies to the flow of cattle can also be applied to:
• Fleet management: When a truck driver breaks from his route, the dispatcher can be alerted and
act immediately.
• Customs transport: Authorities can track vessels' positions and detect when a ship is not in the
area it should be, or forecast and optimize harbor arrivals.
• Public areas, like airports or train stations: The flow and density of people can be detected in real
time in order to remove bottlenecks and optimize queuing times, adapt path guidance, organize
staffing, or optimize flow path and procedures.
• Galleries and museums: Administrators can quantify the popularity of exhibits, identify under-
used spaces, and use visitor behavior to optimize future events.
• Shopping centers: Geofencing can show in real time how many people pass in front of a certain
store, shelf, information point, or door, and where these people are coming from. How many
people are watching a certain TV ad on a billboard? Where is the best place to position a
promotion based on foot traffic? Hence, geofencing allows optimizing store workflow (goods
supply, cart management…) and layout.
And there are countless examples from manufacturing, industry, security, or even households—like an
ankle bracelet alerting authorities if an individual under house arrest leaves the premises, or automatically
switching lights off when the whole family leaves the house.
An important paradigm that is inherent to all those applications is the immediacy of action.
In order to react appropriately, the position information has to be processed immediately, with low latency,
regardless of the volume of events to analyze. Taking too much time to react is not an option
in such cases, as the subject or device will already have moved to another location; a delayed
reaction is an obsolete one.
Of course, a timely reaction is just one part of the game. We also need to react appropriately. Deciding
the best action to apply often means detecting specific complex event patterns among the masses of
events and applying high-end analytics or machine learning algorithms to real-time data streams.
This is where SAS® software like SAS® Event Stream Processing comes into play,
providing high-performance, low-latency geofencing analysis and high-end streaming analytics, as
well as real-time predictive and optimization operations.
Released in early 2017, SAS Event Stream Processing 4.3 introduces a new Geofence window that
provides real-time geofencing analysis capabilities on streaming events. This Geofence window was
already available as a custom plug-in for SAS Event Stream Processing 4.1 and 4.2, and is now
fully integrated as a standard SAS Event Stream Processing window.
In an event stream processing XML model, the first window connected to the Geofence window is the
position window. When using the SAS Event Stream Processing Studio, each window’s role is defined in
the property panel.
Areas and locations of interest are defined as geometry shapes. The Geofence window supports two
types of geometries: polygons and circles.
Geometries are published as events, one event per geometry. The Geofence window supports insert,
update, and delete opcodes, allowing dynamic update of the geometries.
The Geofence window is designed to support any coordinate type or space, either Cartesian or
geographic. The only requirement is that all coordinates must be consistent and refer to the same space
or projection. For geographic coordinates, the coordinates must be specified in the (X,Y) Cartesian order
(longitude, latitude). All distances are defined and calculated in meters.
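Because all distances are calculated in meters over geographic (longitude, latitude) coordinates, a great-circle formula is implied. The haversine formula below is one standard way to compute such a distance; it is a sketch of the idea, not necessarily the exact formula the Geofence window uses internally.

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius_m=6_371_000.0):
    """Great-circle distance in meters between two (lon, lat) points,
    using a mean Earth radius of 6,371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))
```

For instance, one degree of longitude along the equator works out to roughly 111 km with this formula.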
Let's now cover how the Geofence window implements the two types of geometries: polygons and circles.
POLYGON GEOMETRIES
A polygon is a plane shape representing an area of interest. The Geofence window supports polygons,
multi-polygons, and polygons with holes or multiple rings.
Figure 2 below shows some sample polygon geofences.
A polygon is defined as a list of position coordinates representing the polygon’s rings. A ring is a closed
list of position coordinates. In order to be considered closed, the last point of the ring list must be the
same as the first one. So, for example, a ring that is geometrically defined with 4 points like a square
must declare 5 position coordinates, the last point being the same as the first one.
The input polygon window schema must have at least the following two mandatory fields:
• A single key field of type int64 or string. This field defines the ID of the geometry.
• A data field of type string. This field contains the list of the rings’ position coordinates. The
coordinates are defined as a list of numbers (double) separated by spaces in the X, Y order.
For polygons with multiple rings, the first ring defined must be the exterior ring and any others
must be interior rings or holes. For example, the following string represents a polygon made
of 4 points that includes a hole made of 7 points:
"5.281 9.455 3.607 7.112 6.268 6.181 8.414 7.705 5.281 9.455 5.671 8.316
6.572 8.033 7.087 7.695 6.444 7.469 5.929 7.215 5.285 7.384 5.199 7.949 5.671
8.316"
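The closure rule makes such a string unambiguous to split back into rings: a ring ends whenever a coordinate pair repeats that ring's first pair. The following is a hypothetical helper, not part of the product, sketching that logic against the example string.

```python
def parse_rings(data, sep=" "):
    nums = [float(v) for v in data.split(sep) if v]
    pairs = list(zip(nums[0::2], nums[1::2]))
    rings, current = [], []
    for pt in pairs:
        current.append(pt)
        if len(current) > 1 and pt == current[0]:
            rings.append(current)      # ring closed: first pair repeated
            current = []
    if current:
        raise ValueError("unclosed ring")
    return rings

# The polygon-with-a-hole string from the example above:
data = ("5.281 9.455 3.607 7.112 6.268 6.181 8.414 7.705 5.281 9.455 "
        "5.671 8.316 6.572 8.033 7.087 7.695 6.444 7.469 5.929 7.215 "
        "5.285 7.384 5.199 7.949 5.671 8.316")
```

Here `parse_rings(data)` yields two rings: the exterior (five pairs, the square plus its closing point) and the hole (eight pairs, seven points plus the closing point).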
If the polygon data is provided using the standard GeoJSON format, you can easily parse it and format it
using a functional window.
The schema can also have an optional description field that can be propagated with the Geofence
Window output event.
All other fields will be ignored.
When working with polygons, the Geofence window analyzes each event position coming from the
streaming window and returns the polygon this position is inside of. If there are multiple matching
geometries (in case of overlapping polygons) and if the option output-multiple-results is set to
true, multiple events are produced (one per geometry).
The Geofence window behaves like a lookup join, so its output schema is automatically defined and
includes all fields coming from the input position window appended with the following additional fields:
• A mandatory field of type int64 or string that will receive the ID of the geometry. If no geometries
are found, the value of this field will be null in the produced event. This field is defined by the
parameter geoid-fieldname.
• An optional field that will receive the description of the geometry if it exists in the geometry
window schema. This field is defined by the parameter geodesc-fieldname.
• An optional field of type double that will receive the distance from the position to the centroid of
the polygon. This field is defined by the parameter geodistance-fieldname.
• If output-multiple-results is set to true, a mandatory key field of type int64 that will
receive the event number of the matching geometry. This field is defined by the parameter
eventnumber-fieldname.
Below is a sample event stream processing XML model that implements a Geofence window using
polygons:
<project name="geofencedemo" pubsub="auto" threads="4" index="pi_EMPTY">
<contqueries>
<contquery name="cq1" trace="alerts">
<windows>
<window-source name="position_in" pubsub="true" insert-only="true">
<schema>
<fields>
<field name="pt_id" type="int64" key="true"/>
<field name="GPS_longitude" type="double"/>
<field name="GPS_latitude" type="double"/>
<field name="speed" type="double"/>
<field name="course" type="double"/>
<field name="time" type="stamp"/>
</fields>
</schema>
</window-source>
<window-source name="poly_in" pubsub="true" insert-only="true">
<schema>
<fields>
<field name="poly_id" type="int64" key="true"/>
<field name="poly_desc" type="string"/>
<field name="poly_data" type="string"/>
</fields>
</schema>
</window-source>
<window-geofence name="geofence_poly" index="pi_EMPTY">
<geofence
coordinate-type="geographic"
log-invalid-geometry="false"
output-multiple-results="false"
autosize-mesh="true"
max-meshcells-per-geometry="200"
/>
<geometry
data-fieldname="poly_data"
desc-fieldname="poly_desc"
data-separator=" "
/>
<position
x-fieldname="GPS_longitude"
y-fieldname="GPS_latitude"
/>
<output
geoid-fieldname="poly_id"
geodesc-fieldname="poly_desc"
geodistance-fieldname="poly_dist"
/>
</window-geofence>
</windows>
<edges>
<edge source="position_in" target="geofence_poly"/>
<edge source="poly_in" target="geofence_poly"/>
</edges>
</contquery>
</contqueries>
</project>
CIRCLE GEOMETRIES
A circle defines the position of a location of interest. It is defined as a pair of coordinates (X, Y) or
(longitude, latitude) representing the center of the circle, plus a radius distance around this position.
Figure 3 below illustrates some sample circle geofences.
Figure 3. Sample Circle Geofences
The input circle geometry window schema must have at least the following three fields:
• A single key field of type int64 or string. This field defines the ID of the circle geometry.
• Two coordinate fields of type double that contain the X and Y coordinates of the circle center.
The schema can also have the following optional fields:
• A radius field of type double, representing a circular area around the center point position. If this
field is not specified, the default distance defined by the parameter radius will be used.
• A description field that can be propagated with the Geofence Window output event.
All other fields will be ignored.
When working with circles, the Geofence window analyzes each event position coming from the
streaming window and returns the ID of the circle that matches the following criteria:
• If the position lookup distance is set to 0, then the position behaves like a simple point. It is either
in or out of the circle. If it is in the circle, we have a match.
• Similarly, if the circle radius is set to 0, then the circle behaves like a bare point, which need only
be within the position lookup distance area for a match.
• For any other value of the position lookup distance and the circle radius, then the position and the
circle’s center have to be within each other’s distance to have a match. It means that the position
is within the circle and the distance between the circle’s center and the position is lower than the
lookup distance. Figure 4 below illustrates the circle’s geometry lookup logic in such a case.
• And finally, if both the position lookup distance and the circle radius equal 0, then they have to be
the exact same point to have a match.
This position lookup distance is defined either by an additional event input field value or by the parameter
lookupdistance.
As with polygons, if output-multiple-results is set to true, the output schema includes a mandatory
key field of type int64 that receives the event number of the matching geometry. This field is defined by
the parameter eventnumber-fieldname.
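Taken together, the criteria above can be sketched as a single matching function. This is an illustration only, not product code, expressed over a precomputed distance `d` (in meters) between the event position and the circle's center.

```python
def circle_matches(d, radius, lookup):
    """Apply the circle-matching rules described above."""
    if lookup == 0 and radius == 0:
        return d == 0                      # must be the exact same point
    if lookup == 0:
        return d <= radius                 # plain point-in-circle test
    if radius == 0:
        return d <= lookup                 # bare point within lookup area
    return d <= radius and d <= lookup     # within each other's distance
```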
Below is a sample event stream processing XML model that implements a Geofence window using circle
geometries:
<project name="geofencedemo" pubsub="auto" threads="4" index="pi_EMPTY">
<contqueries>
<contquery name="cq1" trace="alerts">
<windows>
<window-source name="position_in" pubsub="true" insert-only="true">
<schema>
<fields>
<field name="pt_id" type="int64" key="true"/>
<field name="GPS_longitude" type="double"/>
<field name="GPS_latitude" type="double"/>
<field name="speed" type="double"/>
<field name="course" type="double"/>
<field name="time" type="stamp"/>
</fields>
</schema>
</window-source>
<window-source name="circles_in" pubsub="true" insert-only="true">
<schema>
<fields>
<field name="GEO_id" type="int64" key="true"/>
<field name="GEO_x" type="double"/>
<field name="GEO_y" type="double"/>
<field name="GEO_radius" type="double"/>
<field name="GEO_desc" type="string"/>
</fields>
</schema>
</window-source>
<window-geofence name="geofence_circle" index="pi_EMPTY">
<geofence
coordinate-type="geographic"
log-invalid-geometry="false"
output-multiple-results="true"
output-sorted-results="true"
max-meshcells-per-geometry="200"
autosize-mesh="true"
/>
<geometry
desc-fieldname="GEO_desc"
x-fieldname="GEO_x"
y-fieldname="GEO_y"
radius-fieldname="GEO_radius"
radius="0"
/>
<position
x-fieldname="GPS_longitude"
y-fieldname="GPS_latitude"
lookupdistance="110"
/>
<output
geoid-fieldname="GEO_id"
geodesc-fieldname="GEO_desc"
eventnumber-fieldname="event_nb"
geodistance-fieldname="GEO_dist"
/>
</window-geofence>
</windows>
<edges>
<edge source="position_in" target="geofence_circle"/>
<edge source="circles_in" target="geofence_circle"/>
</edges>
</contquery>
</contqueries>
</project>
HIGH PERFORMANCE CONSIDERATIONS
In order to provide fast, low-latency lookup processing, the Geofence window implements an
optimized mesh index algorithm: a spatial data structure that subdivides space into grid-shaped buckets
called cells. This mesh structure is totally independent of the coordinate system in use, so
any type of Cartesian, geographic, or projected coordinate space can be used seamlessly.
The mesh algorithm uses a parameter, called the mesh factor, that defines the scale of the space
subdivision. The mesh factor is an integer in the [-5, 5] range, representing a power of 10 of the
coordinate units in use. For example, the default factor of 0 generates one subdivision per coordinate
unit, a factor of 1 generates one subdivision per 10 units, and a factor of -1 generates 10 subdivisions per
unit. The factor can be set for both the X and Y axes together or independently for each axis.
For example, consider the following set of coordinates representing a square polygon (note the
repeated first point at the end, closing the polygon):
[(1001.12,9500.12) (1001.12,9510.12) (1010.12,9510.12) (1010.12,9500.12)
(1001.12,9500.12)]
• With a mesh factor of 1, the Geofence window divides the coordinates by 10^1, resulting in
[(100,950) (100,951) (101,950) (101,951)], and creates
(101-100+1)*(951-950+1) = 4 mesh cells for this geometry.
• Similarly, with a factor of 2, it creates (10-10+1)*(95-95+1) = 1 mesh cell.
• If the mesh factor is set to -1, then the window creates
(10101-10011+1)*(95101-95001+1) = 91*101 = 9191 mesh cells for this geometry,
resulting in an oversized mesh.
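The arithmetic in these examples can be reproduced with a few lines. This is a sketch of the bounding-box cell count only, not the actual SAS implementation; the square's fourth x coordinate is taken as 1010.12, which is what the published cell counts assume.

```python
import math

def mesh_cells(points, factor):
    # With mesh factor f, coordinates are divided by 10**f onto an integer
    # grid; the cell count is the size of the geometry's bounding box on
    # that grid.
    xs = [math.floor(x / 10 ** factor) for x, _ in points]
    ys = [math.floor(y / 10 ** factor) for _, y in points]
    return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)

# The closed square polygon from the example above.
square = [(1001.12, 9500.12), (1001.12, 9510.12),
          (1010.12, 9510.12), (1010.12, 9500.12),
          (1001.12, 9500.12)]
```

With this sketch, factors of 1, 2, and -1 give 4, 1, and 9191 cells respectively, matching the counts above.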
As a result, to get the best performance, you need to adapt the mesh factor to the spatial coverage and to the number of loaded geometries. Too many mesh cells per geometry slow down the ingestion of geometries and generate an oversized index. Too few mesh cells per geometry slow down the lookup process, which hurts stream throughput and latency.
In our experience, an appropriate and efficient factor subdivides the space so that there are between 0.5 and 10 geometries per cell; equivalently, each geometry should generate roughly 1 to 10 subdivision cells at most.
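If you tune the factor by hand, one simple heuristic in that spirit is to pick the smallest factor that keeps a representative geometry within about 10 cells. The helper below is hypothetical, not the window's built-in autosizing:

```python
import math

def suggest_mesh_factor(points, max_cells=10):
    """Smallest mesh factor in [-5, 5] whose index keeps this geometry
    within max_cells cells. A hand-rolled heuristic, not the Geofence
    window's own autosize-mesh algorithm."""
    def cells(f):
        s = 10.0 ** f
        xs = [math.floor(x / s) for x, _ in points]
        ys = [math.floor(y / s) for _, y in points]
        return (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    for f in range(-5, 6):          # mesh factor is restricted to [-5, 5]
        if cells(f) <= max_cells:
            return f
    return 5

square = [(1001.12, 9500.12), (1001.12, 9510.12),
          (1010.12, 9510.12), (1010.12, 9500.12), (1001.12, 9500.12)]
print(suggest_mesh_factor(square))  # 1 (4 cells), not -1 (9191) or 0 (110)
```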
A dedicated parameter, max-meshcells-per-geometry, sets the maximum number of mesh cells that can be created per geometry, guarding against an oversized mesh that would trigger needless intensive calculations. If a geometry exceeds this limit, it is rejected; in that case, consider setting a higher mesh factor or, if relevant, raising the maximum.
The Geofence window provides an internal algorithm that automatically computes and sets an appropriate mesh factor by analyzing the ingested geometries. If for some reason you want to define the mesh factors manually, set the autosize-mesh parameter to false.
With an appropriate mesh, this window delivers outstanding performance despite the number of calculations involved.
A test has been performed with a set of 21,569,300 polygons representing 625,451,932 points (~ 28
points per polygon).
With a stream of 10 million events representing 10 million different positions, the observed throughput
was ~200K events/second using 1 core.
This performance level is more than enough for most use cases, and higher throughput can easily be reached by adding another window and partitioning the stream.
CONCLUSION
The new Geofence window is an easy-to-use, fast, and flexible SAS® Event Stream Processing window that provides new capabilities for processing geolocation data in real time. It expands the reach of streaming analytics by analyzing the movements and locations of people and connected objects, opening the door to new Internet of Things applications across countless domains that demand an immediate, appropriate response.
RECOMMENDED READING
• SAS® Event Stream Processing User Guide
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Frederic Combaneyre
SAS Institute Inc.
frederic.combaneyre@sas.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Paper 4140-2016
Listening for the Right Signals –
Using Event Stream Processing for Enterprise Data
Tho Nguyen, Teradata Corporation
Fiona McNeill, SAS Institute Inc.
ABSTRACT
With the big data throughput generated by event streams, organizations can respond opportunistically and with low latency. Spreading identified patterns of interest throughout the enterprise requires deep integration between event stream processing and foundational enterprise data management applications. This paper describes the innovative ability, from SAS® and Teradata™, to consolidate real-time data ingestion with controlled and disciplined universal data access.
INTRODUCTION
In today’s big data world, there are great challenges and many opportunities. Organizations need the ability to make the right decisions with precision, accuracy, and speed in order to sustain competitive advantage in a global economy of constant change. The state of the business, shaped by dynamic conditions, therefore requires continuous monitoring and evaluation that separates the right signals from the noise. These events of interest matter only when they are understood and heard by the dependent parts of the organization, which requires event processing that flows through the organization into contextual, relevant, data-driven actions. The ability to ingest data and process streams of events effectively identifies important patterns and correlations, focusing organizational activity so that it can react to, and even proactively drive, the results it seeks, in real time. Instead of collecting, analyzing, and storing data in the traditional way, data can now be analyzed continuously, as it occurs, empowering organizations to adjust situational intelligence as new events transpire.
With the emergence of the Internet of Things (IoT), ingesting streams of data and analyzing events in real time become even more critical. The interconnectivity of IoT with web and mobile applications provides organizations with richer contextual data and far greater volumes to decipher in order to harness insights. These insights can uncover greater business value: a better understanding of customer habits and behavior, enhanced operational efficiency, and expanded product and service offerings. Capturing all of the internal and external data streams is the first step in listening for the important signals that customers emit through their event activity. When you hear what customers want from the data they generate, the right data-driven actions can happen, now more rapidly than ever, positively impacting bottom-line profitability.
Of course, one obvious challenge is deploying a reliable, scalable, and persistent streaming environment. This environment needs to provide the necessary self-service capabilities for data administrators, application developers, and data scientists alike, so they can rapidly configure new and different combinations of data streams and continuous queries. Some organizations have explored and implemented open-source technologies for real-time streaming. However, many have come to realize the inherent challenges of scaling across multiple event streams and of building a dynamic yet stable environment, one flexible enough to adapt to business dynamics while supporting enterprise goals, ongoing needs, and timelines.
As such, innovative organizations are moving beyond constructing enterprise environments that require
extensive manual coding from the ground up, to ones that take advantage of pre-built capabilities that are
readily available and integrated with existing organizational assets to drive automated, intelligent
streaming insights. Together, SAS® and Teradata provide an integrated pre-built environment for
exploiting enterprise data that listens for the right streaming signals – improving data-driven decisions for
the entire organization.
SAS® EVENT STREAM PROCESSING
Event stream processing (ESP) is designed to connect and analyze real-time, event-driven information, processing event streams to identify meaningful patterns and correlations as they occur. Going beyond pipeline transport, ESP enriches data by correlating events and identifying naturally occurring event clusters, hierarchies, probabilities, and other aspects such as contextual meaning, membership, and timing, delivering deep insight into real-time activity for a new, fast data infrastructure.
SAS® Event Stream Processing is a comprehensive technology that delivers fast data insights based on a
publish and subscribe framework that ingests event streams, executes continuous queries using a suite
of pre-built and interchangeable window types and operators and delivers insights and instructions for
automated actions to dependent systems, applications and big data warehouses. In the traditional data
infrastructure approach, data is amassed, stored and then analyzed. Instead of storing data and then
running queries against this data at rest, SAS Event Stream Processing stores queries to continuously
enrich streaming data while it is in motion. As such, event streams are examined as they are received, in
real-time, and can incrementally update with new intelligence as new events happen. Focusing on
enriching data while events are still in motion demands a highly scaled and optimized process to address
the hundreds of thousands of events per second common to event streams. SAS Event Stream Processing can enrich and filter events, differentiating and analyzing text and structured streaming data with embeddable analytics that instantly translate into real-time insights for event-driven actions. SAS® Event Stream Processing Studio is the visual data flow interface that simplifies the construction of event stream continuous queries, saving time for application developers, data scientists, and IT architects.
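The "store the query, stream the data" idea can be illustrated with a toy continuous query in Python. This is a conceptual sketch only, not the SAS Event Stream Processing API:

```python
def continuous_query(events, predicate, enrich):
    """Conceptual sketch: the query is fixed and the data flows through it.
    Each event is examined once, as it arrives, instead of being stored
    and queried later at rest."""
    for event in events:
        if predicate(event):          # filter step
            yield enrich(event)       # enrichment step

# Toy stream of sensor readings; flag the ones over a threshold.
stream = iter([{"id": 1, "temp": 61}, {"id": 2, "temp": 99}, {"id": 3, "temp": 72}])
alerts = continuous_query(stream,
                          predicate=lambda e: e["temp"] > 70,
                          enrich=lambda e: {**e, "alert": True})
print(list(alerts))  # events 2 and 3 pass the filter, enriched with alert=True
```

Because the query is a generator chain, results update incrementally as each new event arrives, mirroring how a continuous query enriches data in motion.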
Given that event stream data is never clean, even when generated by machine sensors, SAS Event Stream Processing includes pre-built data quality routines to aggregate, normalize, standardize, extract, correct, enrich, and filter event data before it is stored in a data platform. By eliminating data quality issues up front, countless resources and computing hours are saved, big data stores avoid unnecessary pollution, and IT staff and data scientists are more productive. Not only does productivity improve when traditional data cleansing happens on data in motion, it also takes care of the data preparation needed for successful in-stream analytics. Furthermore, filtering the stream down to cleansed, relevant data avoids storing irrelevant event noise and focuses all downstream activity on what matters.
The ability to listen for events and to ingest and consolidate streams of data is critical to real-time actions, ones that seize transitory event opportunities and avoid impending threats. Low latency response for real-time
actions, with millisecond and sub-millisecond response times, not only demands high performance
processing but also requires tightly integrated data communication access to event stream sources and
delivery to streaming insight consumers. SAS Event Stream Processing comes with a suite of prebuilt
connectors and adapters (such as Teradata) to consume structured and semi-structured data streams.
Connectors and adapters operate through the publish/subscribe layer (as illustrated in Figure 1), and can
also be custom built as APIs in C, Java, and Python. Supporting authentication and encryption, they
publish data from any source into the continuous query and publish data out to any subscribed source. In
addition, they include communication protocols across different streams for enterprise level use of
streaming insights from a range of messaging bus and data transport protocols. Creating a robust ecosystem with pre-built, editable, and open APIs to ingest, consolidate, and manage multiple event streams mitigates the risk of limiting insights and relieves the need for specialized programmers to write code for ongoing support and maintenance.
Continuous queries are at the heart of driving new, enriched insights from streaming data (depicted in
Figure 1). SAS Event Stream Processing applies a comprehensive suite of advanced analytics to event streams, like forecasting, data mining, and machine learning algorithms for governed streaming decisions (McNeill et al., 2016). Data governance is key: it addresses the dynamic nature of streaming data and ensures a fully documented, readily understood event stream processing application, empowering teams to make changes and understand their impact as the business shifts.
Customizable alerts, notifications and updates directly issued from SAS Event Stream Processing provide
precise and accurate situational awareness so that actions are relevant and informed as to what’s
happening and what’s likely to happen. These actions are fueled by continuous, accurate, and secured
event pattern detection from SAS Event Stream Processing’s patented 1+N-Way failover, guaranteed
delivery (without persistence), full access to event stream model metadata, live stream queries, dynamic
streaming model updates, along with deep analytic capabilities.
SAS Event Stream Processing captures true business value otherwise lost through information lag.
Businesses are able to analyze events as they happen and seize new opportunities through producing
data-driven actionable intelligence with minimal latency. It enables new analytical and processing models to
be developed and modified quickly to meet the changing needs of the business and the competitive
landscape.
Figure 2: ESP and Database processing
From SAS Event Stream Processing, the Teradata server subscribes to the following operations:
Stream – operates like a standard event stream processing database subscriber, but with improved performance derived from TPT (Teradata Parallel Transporter). Supports insert, update, and delete events. As events are received from the subscribed window, they are written to the pre-defined target table. If the (required) tdatainsertonly configuration parameter is set to “false”, serialization is automatically enabled in TPT to maintain correct ordering of row data over multiple sessions.
Update - Supports insert/update/delete events, but writes them to the target table in batch mode. The
batch period is a required configuration parameter. At the cost of higher latency, this operator
provides better throughput with longer batch periods (for example, minutes instead of seconds).
Load - Supports insert events. Requires an empty target table. Provides the most optimized
throughput. Staggers data through a pair of intermediate staging tables. These table names and
connectivity parameters are additional configuration parameter specification requirements. Writing
from a staging table to the ultimate target table uses the generic ODBC driver used by the database
connector. Thus, the associated connect string configuration and odbc.ini file specification are required.
The staging tables are automatically created by the connector. If the staging tables and related error
and log tables already exist when the connector starts, it automatically drops them at start-up.
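The latency-versus-throughput trade-off behind the Update operation's batch period can be sketched as follows. This is illustrative logic only, not the actual connector code; `write_batch` stands in for one bulk INSERT/UPDATE round trip:

```python
import time

class BatchSubscriber:
    """Sketch of batch-mode subscription: events accumulate for up to
    batch_period_s seconds, then are written in one round trip. Longer
    periods raise throughput at the cost of latency. Illustrative only."""

    def __init__(self, write_batch, batch_period_s):
        self.write_batch = write_batch        # e.g. one multi-row statement
        self.batch_period_s = batch_period_s
        self.pending = []
        self.last_flush = time.monotonic()

    def on_event(self, event):
        self.pending.append(event)
        if time.monotonic() - self.last_flush >= self.batch_period_s:
            self.flush()

    def flush(self):
        if self.pending:
            self.write_batch(self.pending)    # one round trip for many events
            self.pending = []
        self.last_flush = time.monotonic()
```

With a 60-second period, the target table sees one bulk write per minute instead of thousands of single-row statements, which is exactly the throughput-for-latency trade described above.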
Having integrated connectors is certainly a good start. New innovations expand upon this to facilitate even faster processing and reduced latency. The new Teradata Listener™ is an integrated offering that delivers a unified solution for handling the endless torrent of digital information streams. With the flood of digital information growing exponentially by all estimates, integrating streaming insights across the enterprise will become correspondingly more important and more complex. Integrating Teradata Listener with SAS Event Stream Processing opens a new frontier for analyzing all big data in a massively parallel processing environment, delivering new, timely, fact-based insights to everyone in the enterprise.
TERADATA LISTENER™ AND SAS® EVENT STREAM PROCESSING
Ingesting streams of data is the key design element of Teradata Listener, an intelligent, self-service software solution that ingests and distributes exceedingly fast-moving data streams throughout the enterprise analytical ecosystem. Listener™ collects data from multiple high-volume, real-time streams from sources such as social media feeds, web clickstreams, mobile events, and the IoT (server logs, sensors, and telematics). As mentioned, as a subscribed source, Listener can also ingest streaming analytic insights defined in SAS Event Stream Processing.
The key value of Listener is to let developers and data administrators build real-time processing capabilities. It handles large volumes of log and event data streams and reliably manages mission-critical streams, ensuring data delivery without loss. Teradata Listener offers a self-service capability to ingest streams of data without coding, and with no manual coding it accelerates time to deeper insights through a streamlined, traceable process. It simplifies the IT processes, maintenance, and cost of custom-built systems. It can act as a centralizing system that scales to the complete organization and operates with hundreds of applications built by siloed teams, all plugged into the same, consistent system.
The Teradata Listener ingestion services are invoked from RESTful interfaces over HTTP, a universally accepted transport protocol for modern-day applications. Any developer can easily invoke Listener’s ingestion services to send continuous data streams to a data warehouse, an analytical platform, Hadoop, or any other big data platform.
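Invoking a REST ingestion service from code might look like the following sketch. The URL path and payload shape here are placeholders, not Listener's documented contract; consult the product documentation for the real API:

```python
import json
import urllib.request

def build_ingest_request(base_url, source_key, record):
    """Build an HTTP POST carrying one JSON event to a REST ingestion
    endpoint. The path below is a hypothetical placeholder, not
    Teradata Listener's documented route."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/message/ingest/{source_key}",   # hypothetical path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ingest_request("https://listener.example.com", "my-source-key",
                           {"device": "pump-7", "temp_c": 81.4})
# urllib.request.urlopen(req) would send it; here we just inspect it.
print(req.get_method(), req.full_url)
```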
Additionally, APIs (such as those developed with SAS Event Stream Processing) give developers more flexibility to access the data flowing through Listener. When connected to SAS Event Stream Processing, Teradata Listener receives streams that have been vetted, cleansed, filtered, and enriched, improving the content of the streaming pipeline sourced by Listener, as per Figure 3. Output from Teradata Listener is used to inform existing reporting work streams, updating custom dashboards and feeding other processing engines for additional transformations. Moreover, as depicted in Figure 3, the Listener output can stream back into SAS applications, other data repositories (that is, data at rest), reporting systems, and even SAS Event Stream Processing itself.
Listener is agnostic to data variety, working effectively with both structured and semi-structured data. A
Teradata Listener cluster of servers scales horizontally to meet the growing demands of multiple data
streams in the enterprise.
DATA-DRIVEN INTELLIGENCE
Listener continuously and automatically monitors incoming data streams, surfacing critical information through its graphical user interface and dashboards to provide a deep understanding of the data. Various dashboard metrics help end users understand current activity both into and out of Listener’s ingest and distribution processes. Users can intuitively discover when a stream has stopped or when a target stops accepting the data output.
Teradata Listener’s microservices architecture decouples the ingestion of incoming data streams from the outgoing distribution processes. When target systems are full, Listener intelligently buffers the distribution output and resumes delivery when the target allows, all without manual intervention.
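That buffer-and-resume behavior can be illustrated with a toy distributor. This is a conceptual sketch of the decoupling idea, not Listener's implementation:

```python
from collections import deque

class BufferedDistributor:
    """Sketch of decoupled ingest/distribute: when the target refuses
    events they queue in a backlog; when it recovers, the backlog drains
    in arrival order. Conceptual illustration only."""

    def __init__(self, deliver, target_ready):
        self.deliver = deliver            # writes one event to the target
        self.target_ready = target_ready  # True when the target accepts data
        self.backlog = deque()

    def on_event(self, event):
        self.backlog.append(event)        # ingestion never blocks
        self.drain()

    def drain(self):
        while self.backlog and self.target_ready():
            self.deliver(self.backlog.popleft())
```

Ingestion keeps accepting events even while the target is down; distribution simply resumes from the backlog once `target_ready` reports the target is accepting data again.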
CONCLUSION
As business conditions evolve, the need to continuously monitor and measure streaming events of interest is imperative. Machine-driven, human-guided curation of event streams, enriched with analytic intelligence and focused on relevant events that are heard throughout the enterprise, is the unique value that SAS and Teradata provide. Instead of the traditional “stream, score and store” process, data can now be analyzed immediately as it is ingested, adjusting situational intelligence as new events happen, using Teradata Listener and SAS Event Stream Processing.
Applicable across a wide range of industries, the ability to process streaming data once, and persist to
stores and applications and other streams, across the enterprise is a foundational benefit for analytical
workloads. Efficient and well-managed processing is paramount to low latency, real-time responsiveness
– and when time matters, the ability to complete the full analytical lifecycle to drive better decisions
becomes critical. Whether the need is to re-optimize mobile dispatch units based on live location streams, to prevent hazardous events by prioritizing maintenance based on current weather predictions, or to recognize that new streams of data could improve projected operational effectiveness, listening for the right signals provides focus.
With SAS and Teradata, the combined and integrated technology offers a scalable and reliable solution to
ingest data and process streams of events, leveraging the embeddable streaming analytics of SAS so
that organizations can pro-actively respond to even the most complex issues.
REFERENCES
McNeill, F., D. Duling, and S. Sparano. 2016. “Streaming Decisions: How SAS Puts Streaming Data to Work.” Paper SAS6367. Proceedings of SAS Global Forum 2016, Las Vegas, NV.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Tho Nguyen
Teradata
tho.nguyen@teradata.com
Fiona McNeill
SAS
fiona.mcneill@sas.com
Paper 4120-2016
Prescriptive Analytics – Providing the Instruction to Do What’s Right
Tho Nguyen, Teradata Corporation
Fiona McNeill, SAS Institute Inc.
ABSTRACT
Automation of everyday activities holds the promise of consistency, accuracy, and relevancy. When applied to business operations, the additional benefits of governance, adaptability, and risk avoidance are realized. Prescriptive analytics empowers both systems and front-line workers to take the desired company action, each and every time. And with data streaming from transactional systems, the IoT, and any other source, doing the right thing with exceptional processing speed embodies the responsiveness that customers depend on. This talk describes how SAS® and Teradata® are enabling prescriptive analytics, in current business environments and in the emerging IoT.
INTRODUCTION
Being an analytically-driven organization means basing decisions and actions on data, rather than gut
instinct. As more organizations recognize the competitive advantages of using analytics, the impact can
wane as competitors build this same capability. To cross this innovation chasm and sustain the
competitive advances that come from analytical adoption, organizations continually test and expand data
sources, improve algorithms and evolve the application of analytics to every day activity.
Predictive algorithms describe a specific scenario and, using historical knowledge, increase awareness of what comes next. But knowing what is most likely to happen and knowing what to do about it are two different things. That’s where prescriptive analytics comes in: it answers the question of what to do, providing decision options even for predicted future scenarios.
Seldom (if ever) do events happen in isolation. It’s through their interconnections that we develop the
detailed understanding of what needs to be done to change future trajectories. The richness of this
understanding, in turn, also determines the usefulness of the predictive models (Pinheiro & McNeill,
2014). Just as the best medicine is prescribed based on thorough examination of patient history, existing
symptoms and alike – so are the best prescriptive actions, founded in well understood scenario context.
And just as some medicines can react with one another - with one medicine not be as effective when it’s
in the presence of another, so can decisions and corresponding actions taken from analytics – which in
turn can impact the outcome of future scenarios.
As you’d expect, under different scenarios – you’d have different predictions. When conditions change,
the associated prediction for that same data event can also change. When you apply one treatment, you
affect another, changing the scenario. Actions that are taken not only create a new basis for historical
context, but also create new data that may not have been considered by the original model specification.
In fact, the point of building predictive models is to understand future conditions in order to change them.
Once you modify the conditions and associated event behavior, you change the nature of the data. As a
result, models tend to degrade over time, requiring updates to ensure accuracy to the current data,
scenario, and new predicted future context.
Well-understood scenarios are fed by data. The more data you have to draw from to examine
dependencies and relationships that impact the event being predicted, the better the prediction will likely
be. This is where the value of big data comes in… as big data is more data with finer detail, and greater
context richness. Big data offers details not historically available that explain the conditions under which
events happen, or in other words, the context of events, activities and behaviors. Big data analytics
allows us, like never before, to assess context – from a variety of data, and in detail. And when that big
data is also fast data (on the order of thousands of events per second), it’s a stream of events. When we
bridge big data analytics with event streams, as generated in the IoT, we have the power to write more timely and relevant business prescriptions that are much harder for competitors to mimic.
Figure 1. Operational decisions are built by combining business rules (e.g. account_level = “COPPER”) with
analytical models (e.g. Bad_level_Default) using conditional logic in SAS Decision Manager.
The instruction, as defined in SAS Decision Manager’s decision logic, encapsulates the conditions under which a particular model is valid and when it should trigger to deliver results. Scoring is then reserved for when the appropriate conditions are met, that is, those specific to the model’s design scenario, avoiding unnecessary data processing.
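The combination shown in Figure 1, a business rule (account_level = "COPPER") joined to an analytical model by conditional logic, can be sketched as follows. The threshold, actions, and toy model are invented for illustration; only the COPPER rule and the idea of a default-risk model come from the paper:

```python
def credit_decision(customer, default_model):
    """Sketch of decision logic in the spirit of Figure 1: a business
    rule gates an analytical model, and conditional logic picks the
    action. Thresholds and action names are hypothetical."""
    if customer["account_level"] == "COPPER":      # business rule
        return "manual_review"                     # low-tier accounts go to a person
    p_default = default_model(customer)            # analytical model score
    return "decline" if p_default > 0.35 else "approve"   # conditional logic

# Toy stand-in for a default-risk model such as Bad_level_Default.
toy_model = lambda c: 0.9 if c["missed_payments"] > 2 else 0.1
print(credit_decision({"account_level": "GOLD", "missed_payments": 0}, toy_model))
print(credit_decision({"account_level": "COPPER", "missed_payments": 0}, toy_model))
```

Scoring only runs when the rule conditions permit it, which is the "avoid unnecessary data processing" point made above.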
Typically, business analysts are the decision designers. They are often tasked with working through the
logic of what actions need to be taken under different operational scenarios – whether they are product
related decisions, customer actions, service requirements or other types of day-to-day business activities.
These analysts draw upon the work of others, namely the analytical experts and data scientists who have built the models, as well as reaching into data, like that from the Teradata® Unified Data Architecture™, which has been vetted and validated by IT.
Building decisions therefore requires the foundation of models that have already been developed, tested, and validated using applications like SAS® Analytics for Teradata®. Decisions also require pre-determined business rules to be defined to the system. Management of both business rules and analytical models is necessary, particularly given the expanse of users who often benefit from a formalized decision management environment.
BUSINESS RULE MANAGEMENT
Business analysts themselves may have access to all the governing policies, regulatory rules, constraints, best practices, and other business logic necessary to define business rules. More often than not, however, the business knowledge needed is spread across different divisions of the organization, such as compliance, finance, sales, and marketing. A centralized, well-managed environment for defining business rules, business logic, and terminology therefore helps eliminate debates between divisions and promotes consistency in how business rules are used and applied to operations.
Within SAS® Decision Manager, SAS® Business Rules Manager provides a centralized, managed repository for rules. Individual rules are joined together using a wizard that defines the specific scenario conditions as rule flows (as illustrated in Figure 2). Rules can be defined, tested, validated against data, and even discovered using analytic methods, all from within the same environment. When rule flows are published for execution in operations, the published rule flow is automatically locked down to secure it from additional testing and modification. Authorizations and defined workflows ensure that changes are documented, approved, and authorized by the appropriate personnel.
Figure 2: Wizard edit environment for creating, editing and managing business rule flows.
The collection of terms used to build rules is foundational to the common language that communicates
the objectives and responsibilities of the business, appropriately described as a vocabulary. You can
import pre-existing vocabularies (from .CSV files), edit them, reuse ones extracted from physical tables
and share vocabularies across rule sets. SAS Business Rules Manager allows multiple authorized users to contribute to rule definition, facilitates change management control, retains audit details,
empowers validation by subject matter experts, and governs rule elements. When business rules are
designed in this environment, they are safe from the risk of undocumented tribal knowledge and become
a corporate asset.
ANALYTICAL MODEL MANAGEMENT
Just as business rules are the domain of experts who understand the business, analytical models are the domain of data scientists, statisticians, and data miners alike. SAS® Decision Manager includes SAS® Model Manager, which manages the inventory of models developed in SAS® Factory Miner, SAS/STAT®, SAS/ETS®, SAS® Enterprise Miner™, PMML, generic R models, code snippets from other code bases¹, as well as SAS® High-Performance Data Mining. Having forecasts, predictions, and other models registered as a comprehensive collection (as shown in Figure 3) allows organizations to monitor for signs of degradation as scenario context changes; manage versioning, authorship, workflow, and usage tracing; and gain detailed visibility into production quality.
Business analysts, who are focused on creating complete decisions, select the appropriate model as designated by the analytic expert. This takes the guesswork out of which model best fits a particular scenario and streamlines the often tedious tasks of understanding model definitions and data input needs. The business analyst readily has the business context of the model, explicitly defined and in a recognizable, intuitive format. Building complete decision flows therefore becomes an exercise of defining the rule flows in conjunction with the prescribed model, associating them through the appropriate conditional logic, all from the same, simplified interface (as was illustrated in Figure 1). Moreover, the logic, definitions, and ownership of each element of the decision flow are retained, so that when it comes to deploying models into production, IT has a complete perspective of who defined these decisions, what they contain, why and how they were defined, what testing was done, and how to apply them to business operations for prescriptive actions.
¹ Other code bases, such as C, C++, Java, Python, etc.
VALUE OF PRESCRIPTIVE ANALYTICS
Prescriptive analytics provides the instruction of what to do – and as importantly – what not to do when
analytical models are deployed into production environments. Defined as decisions, they are applied to
scenarios where there are too many options, variables, constraints and data for a person to evaluate
without assistance from technology. These prescriptive decisions are presented to the front-line worker –
providing the answer they seek, and accounting for the detailed aspects of the scenario that they find
themselves in. For example, call center personnel often rely on prescriptive analytics to know what options, what amounts, and under what conditions a prospective customer can be extended varying levels of credit.
Prescriptive analytics also provides organizations with the ability to automate actions based on these
codified decisions. Every organization has simple, day-to-day decisions that occur hundreds to
thousands of times (or more) and that don't require human intervention. For example, the
identification and placement of a targeted advertisement based on a web shopper's session activity is
popular in the retail industry. In such cases, prescriptive analytics is used to ingest scenario conditions,
define the options, and take the optimal action (for example, place the most relevant ad) based on those
conditions (in our example, what has been viewed and clicked on during the session). What is optimal, for
the purposes of this paper, is defined as an action that best meets the business rule definitions and
associated predicted likelihoods. What is optimal can also refer to a mathematically optimized solution, as
Duling (2015) has previously described.
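The rule-plus-likelihood selection described above can be sketched as a small example. This is a minimal Python illustration, not SAS Decision Manager output; the ad names, eligibility rules, and likelihood scores are all hypothetical.

```python
# Minimal sketch of rule-plus-likelihood action selection: candidate actions
# are filtered by business rules, then the eligible action with the highest
# predicted likelihood is taken. Ads, rules, and scores are hypothetical.

def choose_ad(session, candidates):
    """Return the eligible ad with the highest predicted click likelihood."""
    eligible = [ad for ad in candidates if ad["rule"](session)]
    if not eligible:
        return None  # no rule fires, so no ad is placed
    return max(eligible, key=lambda ad: ad["likelihood"])

CANDIDATES = [
    {"name": "running-shoes", "likelihood": 0.42,
     "rule": lambda s: "shoes" in s["viewed"]},
    {"name": "rain-jacket", "likelihood": 0.67,
     "rule": lambda s: "jackets" in s["viewed"]},
    {"name": "generic-sale", "likelihood": 0.10,
     "rule": lambda s: True},  # fallback rule: always eligible
]

session = {"viewed": ["shoes", "socks"], "clicked": ["shoes"]}
print(choose_ad(session, CANDIDATES)["name"])  # running-shoes
```

Note that the business rules gate which actions are even considered, while the model's predicted likelihood breaks ties among the eligible ones – the same division of labor between rules and models that the decision flow formalizes.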
Scoring data with a model typically involves IT. Via email, or some other notification, IT is presented
with an equation and the data inputs it needs. What is often lacking is the business rationale, the
context, and a translation of terminology into IT terms. As such, IT will ask all the necessary
questions, often recode the model, run tests and validate output, and then, after applying any specific
business policies and/or regulatory rules, will put the model into 'production' – that is, operationalize the
model so it can generate results.
While in some organizations these steps may not all be done by IT, they still happen. As illustrated in
Figure 4, each step – even after the model is developed – adds time to implementing the model and
cashing in on the business benefits. In many organizations the latency from model deployment to
business action is weeks, if not months. As a result, by the time a model is ready to generate results in
a production context, it's often too late – either the opportunity for impact is gone or conditions have
changed to the point where the model is no longer relevant.
Prescriptive analytics defined using SAS Decision Manager reduces this latency, streamlining the time
from when a model is developed to when actions are taken. Furthermore, the context of the model is
explicit, defined by the business rules – to the point that impact assessments across any point of the
decision flow are transparent (as illustrated in Figure 5). And because of this explicit decision definition,
changes and adjustments to new models, rules, conditions, data, or combinations of any of these
dynamics are readily made – tracked as part of version control and documented for the purview of
auditors and the like. Analytical model deployment and usage becomes part of a governed, managed
environment, reducing the risk associated with incorrect definitions, poor market timing, and regulatory
non-compliance.
Prescriptive analytics has the benefit of automating instructions and best suggested options that are
acted upon by a person. It can also be used to directly automate actions for more mundane tasks,
doing so consistently and accurately. In both cases, relevancy to the current scenario is
assured in this managed environment and is the product of the vetted, tested, and detailed decision flow
(as was illustrated in Figure 1). As data volume, variety, and velocity are only set to increase, and as
technology continues to develop to process more data, faster, the trend toward automating actions taken
from analytics will correspondingly rise.
The business need to automate prescriptive analytics stems from companies that demand real-time
responses from data-driven decisions. Every company will increasingly become inundated with data, and
that data needs to be analyzed. The reality is that organizations simply don't have enough people to
analyze all the data – even if people could comprehend all the scenario details and volumes in time to
make every decision. Prescriptive analytics defined in SAS Decision Manager has the benefit of being:
• Relevant, consistent, and accurate in its decisions
• Easily automated, for human instructions and downstream application/system actions
• Explicit about the business context
• Tested, vetted, and documented
• Adjustable to changing scenarios
• Timely in deploying actions
• Governed in a single environment, providing an unequivocal source of truth
• Managed as assets, encapsulating intellectual property and managing lifecycle degradation
HOW IT WORKS
SAS and Teradata are well integrated to deliver complete, data-driven decisions. The Teradata database
can be leveraged to handle the heavy processing of data analytics. Teradata offers a powerful and
scalable architecture that enables massively parallel processing (MPP). This MPP architecture is a
"shared nothing" environment that can disseminate large queries across nodes for simultaneous
processing. It is capable of high data consumption rates through parallelized data movement, completing
tasks in a fraction of the time. The end-to-end process can be executed inside the Teradata
platform to improve performance, economics, and governance, as illustrated in Figure 6.
Complete decisions, which include data definitions, business rules, and analytical models, are recognized
within SAS® Data Integration Studio. Treated as a single decision flow, SAS Data Integration Studio
generates SAS DS2 code that can be run within Teradata² to inherently leverage the highly scalable
environment for processing big data. This is enabled by an embeddable processing technology, the SAS
Threaded Kernel (TK), within the Teradata platform. This embedded process generates work using units
of parallelism scheduled on each AMP of the Teradata platform. Teradata's workload manager manages
the SAS embedded process as a standard Teradata workload.
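The unit-of-parallelism idea can be illustrated outside of Teradata with a small Python sketch: rows are split into partitions, each partition is scored independently with no shared state ("shared nothing"), and the results are combined. This is a conceptual analogy only – it does not reproduce the SAS embedded process or AMP scheduling, and the linear scoring function is a hypothetical stand-in for a published model.

```python
# Conceptual "shared nothing" sketch: rows are split into partitions and each
# partition is scored independently, mimicking units of parallelism that need
# no shared state. The linear scorer is a hypothetical stand-in for a model.
from concurrent.futures import ThreadPoolExecutor

def score_partition(rows):
    # Hypothetical model: score = 0.3*x1 + 0.7*x2
    return [0.3 * x1 + 0.7 * x2 for x1, x2 in rows]

def parallel_score(rows, n_partitions=4):
    # Round-robin split; each worker sees only its own partition.
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(score_partition, partitions)
    # Combine partition results (order differs from the input order).
    return [score for part in results for score in part]
```

A thread pool stands in here for simplicity; the point of an MPP platform is that the partitions live on separate nodes with their own storage and CPU, so the scoring scales out with the data rather than moving the data to the model.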
Analytic models can also be built in-database using SAS® Analytics Accelerator. The SAS Analytics
Accelerator for Teradata contains specialized vendor-defined functions for Teradata that enable in-
database processing for a collection of modeling and data mining algorithms. For model scoring, the SAS
Scoring Accelerator for Teradata transforms models created with SAS/STAT or SAS Enterprise Miner
for scoring inside the database using the SAS embedded process technology.
Decisions are deployed, and "published," to Teradata directly from SAS Data Integration Studio as
SAS macros; if only model scoring is desired, they can be published using SAS Model Manager. SAS
Decision Manager includes both SAS Data Integration Studio and SAS Model Manager, providing
options that optimize analytically based processing in-database with Teradata (as shown in Figure 7).
The metadata about models, rules, and logic is all encapsulated within decisions – helping organize
production deployment.
Figure 7: Publish models using SAS Model Manager (included with SAS Decision Manager) and Teradata
In some organizations, prescriptive decisions are deployed into operational data streams. There may
also be instances where only business rules, without analytic models, are needed for the appropriate action.
For example, internal organizational accounting often requires a distribution of revenue (aka revenue
attribution) across business divisions and functions, based on corporate policies or governance
measures. Business rules defined within SAS Business Rules Manager can be pushed down and directly
executed inside the database without any recoding or redefinition³. For in-database business rule
execution inside Teradata, the processing tasks are further streamlined and fully scalable without
data replication. This also has the advantage of being highly amenable to commonly required changes in
business rule definitions – with organizational and product changes, acquisitions, mergers, and business
policy dynamics.
² The SAS® Code Accelerator for Teradata and SAS® Analytics Accelerator for Teradata push SAS executable code to process directly inside the Teradata data warehouse.
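A revenue-attribution rule set like the one described can be sketched as follows. This is a hypothetical Python illustration of the rule logic, not SAS Business Rules Manager syntax; the product lines, divisions, and percentage shares are invented for illustration.

```python
# Hypothetical revenue-attribution rules: each transaction's revenue is split
# across divisions according to policy-defined percentages. The product lines,
# divisions, and shares are invented for illustration.
ATTRIBUTION_RULES = {
    # product_line -> {division: share of revenue}
    "bundled": {"hardware": 0.5, "software": 0.3, "services": 0.2},
    "support": {"services": 1.0},
}

def attribute(transaction):
    """Split one transaction's revenue across divisions per the rules."""
    shares = ATTRIBUTION_RULES.get(transaction["product_line"],
                                   {"unallocated": 1.0})
    return {division: round(transaction["revenue"] * share, 2)
            for division, share in shares.items()}

print(attribute({"product_line": "bundled", "revenue": 1000.0}))
# {'hardware': 500.0, 'software': 300.0, 'services': 200.0}
```

Because the percentages live in one rule table rather than being scattered through application code, the common changes the text mentions – reorganizations, acquisitions, policy updates – reduce to editing that table.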
DATA EXPLORATION
Once data sources are gathered in Teradata, you can begin to explore them using your preferred data
exploration tool, like SAS® Visual Analytics (which is also enabled on the Teradata Appliance for SAS).
Data exploration is a process that examines and explores data, often discovering or extracting new
knowledge. Typically performed by a business analyst, this exploration of what the data looks like, the
scenario at hand, and what variables are in the data set evaluates the relationships and patterns
necessary to understand decision conditions.
This initial exploration of the data helps explain common inquiries and is a productive way to become
more familiar and intimate with the data that defines the scenario. One best practice is to explore
all your data directly in the database, so the data is well understood before identifying the key factors for
conditional logic and rule definitions, while eliminating redundancy and removing irrelevant data. This
same exploration capability is also a powerful and flexible way to monitor business rule execution and,
retrospectively, decision actions – as dashboards or reports. And it's not just for the business analyst.
The ability to quickly extract knowledge from large, complex data sets also provides the data scientist,
statistician, and data miner alike with this same advantage of dynamically exploring data as part of the
model development process.
Prescriptive analytics requires the right process, skilled personnel, and scalable technology. With SAS and
Teradata, prescriptive analytics is streamlined, effective, and efficient – from the perspectives of both IT
and the business. These integrated technologies deliver data-driven decision options and even
automated actions, helping organizations take advantage of future opportunities and alleviate potential
risks each time a decision is made.
³ The SAS Code Accelerator for Teradata is used to execute SAS Business Rules Manager code in the Teradata data warehouse.
based on marketing campaign effectiveness is often associated with a targeted list of loyal customers. By
collecting web clicks in real time, along with past purchases, prescriptive analytics could indicate that a
shopper has a high likelihood of purchasing shoes with the pants they are viewing – prompting a pop-up
savings coupon for the pants.
Use cases leveraging prescriptive analytics in IoT applications abound – everything from social media
monitoring, collecting tweets, blogs, and posts to determine what consumers are recommending as a
service or product, to security and surveillance of login sessions and data access for data security
breaches, and everything in between.
CONCLUSION
In the eyes of customers, in both business-to-business and business-to-consumer industries, purchase
choices can be summarized as being dependent on product quality, service and support excellence, and
the ability to appropriately fulfill the purchase need. As such, ensuring product health, the responsiveness
of fulfillment, and understanding the full context of the purchase decision are paramount to being the
selected candidate. For day-to-day decisions, prescriptive analytics fulfills that need, giving organizations
the ability to accurately decipher the scenario context and to take the appropriate action in a manner
that's consistent and relevant. With SAS® In-Database Decision Management for Teradata®, you can:
• Be more responsive, proactive, and reliant on data-driven operational decisions for new opportunities.
• Improve performance and minimize time previously spent moving or duplicating data and code between systems.
• Increase security and compliance of data in one integrated, highly governed environment.
Taking prescriptive analytics to the data and running in-database extends the benefits of relevant, timely
instructions and actions without having to move data. Model and business rule deployment – as
complete, documented, and vetted decisions – becomes part of job processing, for even the biggest of big
data. With SAS and Teradata, the integrated portfolio of solutions enables you to explore all options,
determine the appropriate approach, execute the action, and evaluate and improve the business decision.
REFERENCES
Pinheiro, C., and F. McNeill. 2014. Heuristics in Analytics: A Practical Perspective of What Influences Our
Analytical World. Hoboken, NJ: John Wiley and Sons.
Duling, D. 2015. "Make Better Decisions with Optimization." Proceedings of SAS Global Forum 2015,
Paper SAS1785-2015, Dallas, TX. Available at:
http://support.sas.com/resources/papers/proceedings15/SAS1785-2015.pdf
SAS and Teradata Partnership
www.teradata.com/sas
SAS and Teradata In-Database Decision Management for Teradata
http://www.teradata.com/partners/SAS/SAS-In-Database-Decision-Management-for-Teradata-Advantage-
Program/
RECOMMENDED READING
IIA Research Brief, Prescriptive Analytics: Just What the Doctor Ordered, 2014
http://epictechpage.com/sms/sas/wp-content/uploads/2015/01/iia-prescriptive-analytics-Just-What-Dr-
Ordered.pdf
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Tho Nguyen
Teradata
tho.nguyen@teradata.com
Fiona McNeill
SAS
fiona.mcneill@sas.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Paper 334-2017
Analytics of Healthcare Things IS THE Next Generation Real World Data
Joy King, Teradata Corporation
ABSTRACT
As you know, Real World Data (RWD) provides highly valuable and practical insights. But as valuable as
RWD is, it still has limitations. It is encounter-based, and we are largely blind to what happens between
encounters in the healthcare system. The encounters generally occur in a clinical setting, which may not
reflect actual patient experience. Many of the encounters are subjective interviews, observations, or
self-reports rather than objective data. Information flow can be slow (even real time is not fast enough in
healthcare anymore). And some data that could be transformative cannot currently be captured.
Data from select IoT devices can fill the gaps in our current RWD for certain key conditions and provide
missing components that are key to conducting the Analytics of Healthcare Things (AoHT), such as:
• Direct, objective measurements
• Data collected in the "usual" patient setting rather than an artificial clinical setting
• Data collected continuously in the patient's setting
• Insights that carry greater weight in regulatory and payer decision-making
• Insights that lead to greater commercial value
Teradata has partnered with an IoT company whose technology generates unique data for conditions
impacted by mobility or activity. This data can fill important gaps and provide new insights that can help
distinguish your value in your marketplace.
Join us to hear details of successful pilots that have been conducted as well as ongoing case studies.
INTRODUCTION
As the Internet of Things (IoT) was gaining momentum in industries such as manufacturing, insurance,
travel and transportation, the healthcare and life science industries were still trying to figure out how to
leverage real world data (RWD) such as claims and electronic health records.
Now that RWD has been firmly embraced, it is time to explore the benefits of IoT to healthcare and life
science companies and ultimately to the patient, clinician and caregiver.
1. Measured directly and objectively, rather than through subjective interviews or self-reports
2. Collected in the "usual" patient setting, rather than an artificial clinical setting that may not reflect
accurate readings or patterns
3. Collected continuously, providing a richer source of data that reveals variability and patterns over time
4. Informing the provider as to what occurs between encounters
CONCLUSION
Integrating IoHT data and conducting robust, advanced analytics on the data can provide immediate
competitive advantage. The data, by itself, has no business value unless it provides decision-making
insights. That is why the Analytics of Healthcare Things (AoHT) provides a real differentiator for
companies leveraging Real World Data (RWD).
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Joy King
Teradata Corporation
(919) 696-6067
joy.king@teradata.com
www.teradata.com