
The correct bibliographic citation for this manual is as follows: Dull, Tamara. 2017. The Internet of Things with SAS®: Special Collection. Cary, NC: SAS Institute Inc.
The Internet of Things with SAS®: Special Collection
Copyright © 2017, SAS Institute Inc., Cary, NC, USA
All Rights Reserved. Produced in the United States of America.
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire
this publication.
The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and
punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted
materials. Your support of others’ rights is appreciated.
U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at
private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software
by the United States Government is subject
to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR
227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR
52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or
documentation. The Government’s rights in Software and documentation shall be only those set forth in this Agreement.
SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414
December 2017
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and
other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its
applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to
http://support.sas.com/thirdpartylicenses.
Table of Contents

Streaming Decisions: How SAS® Puts Streaming Data to Work
By Fiona McNeill, David Duling, and Stephen Sparano, SAS Institute Inc.

Real-time Analytics at the Edge: Identifying Abnormal Equipment Behavior and Filtering Data near the Edge for Internet of Things Applications
By Ryan Gillespie and Saurabh Gupta, SAS Institute Inc.

Modernizing Data Management with Event Streams
By Evan Guarnaccia, Fiona McNeill, and Steve Sparano, SAS Institute Inc.

Location Analytics: Minority Report Is Here—Real-Time Geofencing Using SAS® Event Stream Processing
By Frederic Combaneyre, SAS Institute Inc.

Listening for the Right Signals – Using Event Stream Processing for Enterprise Data
By Tho Nguyen, Teradata Corporation, and Fiona McNeill, SAS Institute Inc.

Prescriptive Analytics – Providing the Instructions to Do What's Right
By Tho Nguyen, Teradata Corporation, and Fiona McNeill, SAS Institute Inc.

Analytics of Healthcare Things Is THE Next-Generation Real-World Data
By Joy King, Teradata Corporation
Free SAS® e-Books:
Special Collection
In this series, we have carefully curated a collection of papers that introduces
and provides context to the various areas of analytics. Topics covered
illustrate the power of SAS solutions that are available as tools for
data analysis, highlighting a variety of commonly used techniques.

Discover more free SAS e-books!
support.sas.com/freesasebooks

Visit sas.com/books for additional books and resources.
About This Book

What Does This Collection Cover?


Defining the Internet of Things isn’t easy. From cars to factories to farms, many organizations are already collecting
information from the connected devices that send and receive data over the Internet of Things (IoT). While analysts expect
the IoT to soar to tens of billions of devices by 2020, no one knows how many or what new types of intelligent devices will
emerge. Whether these definitions and forecasts are accurate is really not important. What is important is that we understand
the context or frame of reference in which the Internet of Things is being discussed.

While you won’t find a canonical definition of IoT in this ebook, the papers included in this special collection demonstrate
how SAS is using its technology to address our customers’ IoT needs, including streaming data, edge computing, prescriptive
analytics, and much more.

The following papers are excerpts from the SAS Global Users Group Proceedings. For more SUGI and SAS Global Forum
Proceedings, visit the online versions of the Proceedings.
For many more helpful resources, please visit support.sas.com and sas.com/books.

We Want to Hear from You


SAS Press books are written by SAS users for SAS users. We welcome your participation in their development and your
feedback on SAS Press books that you are using. Please visit sas.com/books to

● Sign up to review a book


● Request information on how to become a SAS Press author
● Recommend a topic
● Provide feedback on a book

Do you have questions about a SAS Press book that you are reading? Contact the author through saspress@sas.com.
Foreword
Life was simple. And then the mouse showed up. Not the furry kind, mind you, but the clicky kind.

Some of you may remember when we only had keyboards to interact with our computer monitors. We had to use the Tab,
Shift-Tab, Enter, and arrow keys to move our cursor from field to field on the screen. If you were an end user back then, you
would probably describe the experience as controlled, somewhat tedious, and often slow – but, frankly, it was all we knew. And
if you were an application/database developer, like I was, creating these controlled digital experiences was admittedly mind-
numbing at times but necessary to satisfy the business requirements of our company or client.

Then, the mouse showed up. And it changed everything.

Developers, I would argue, were hit the hardest. They had to upgrade their keyboard-controlled, character-based applications
to keyboard- and mouse-controlled GUI (Graphical User Interface) apps. It was a painful transition. Long gone were the days
of systematically controlling a user’s every move with the keyboard. End users, by comparison, had it easy. All they had to
do was get used to offloading their navigation activity from the keyboard to this new clicky thing called a mouse.

The mouse in all its simplicity proved to provide more freedom, flexibility, and speed – and ultimately changed the way we
interacted with computers. It took a couple of years to transition fully to a mouse-driven world, but, once the transition was
made, there was no going back.

Today, we are in the midst of another significant digital transformation: the Internet of Things (IoT). On the surface, the IoT
is moving us towards a smarter, more connected world. However, at its core, the IoT is about data. Big data, IoT data, sensor
data – call it what you want – but it is data that’s fueling this transformational shift.

Just as the mouse changed how we interacted with computers, the Internet of Things is changing how we interact with data.
How we collect it. How we process it. How we store it. How we govern it. How we manage it. How we analyze it. And,
ultimately, how we make decisions with it. Not only do we want to make decisions based on data stored in our enterprise data
warehouse (that is, data at rest), we now need the ability to make decisions on-the-fly, in real or near real time (that is, data in
motion).

This is where SAS comes in. SAS has been in the data and analytics business for more than 40 years – well before the mouse
made its debut – helping companies analyze and understand all their data, at rest and in motion. The papers included in this
special collection demonstrate how SAS is using its technology to address our customers’ IoT needs, including streaming
data, edge computing, prescriptive analytics, and much more.

Streaming Decisions: How SAS® Puts Streaming Data to Work
By Fiona McNeill, David Duling, and Stephen Sparano, SAS Institute Inc.
Abstract: Sensors, devices, social conversation streams, web movement, and all things in the Internet of Things (IoT) are transmitting data at unprecedented volumes and rates. SAS® Event Stream Processing ingests thousands and even hundreds of millions of data events per second, assessing both the content and the value. The benefit to organizations comes from doing something with those results and eliminating the latencies associated with storing data before analysis. This paper bridges the gap. It describes how to use streaming data as a portable, lightweight micro-analytics service for consumption by other applications and systems.

Real-time Analytics at the Edge: Identifying Abnormal Equipment Behavior and Filtering Data near the Edge for Internet of Things Applications
By Ryan Gillespie and Saurabh Gupta, SAS Institute Inc.
Abstract: In IoT applications, processing at the edge, or edge computing, pushes the analytics from a central server to devices close to where the data is generated. As such, edge computing moves the decision-making capability of analytics from centralized nodes and brings it closer to the data source. This paper describes the use of a machine-learning technique for anomaly detection and the SAS® Event Stream Processing engine to analyze streaming sensor data and determine when performance of a turbofan engine deviates from normal operating conditions. The authors describe how sensor readings from the engines can detect asset degradation and help with preventative maintenance applications.

Modernizing Data Management with Event Streams
By Evan Guarnaccia, Fiona McNeill, and Steve Sparano, SAS Institute Inc.
Abstract: As the Internet of Things (IoT) continues to grow, a natural increase in the volume and variety of data follows. SAS® Event Stream Processing offers the flexibility, versatility, and speed to tackle these issues and adapt as the landscape of IoT changes. This paper distinguishes the advantages of adapters and connectors and shows how SAS® Event Stream Processing can leverage both Hadoop and YARN technologies to scale while still meeting the needs of streaming data analysis and large, distributed data repositories.

Location Analytics: Minority Report Is Here—Real-Time Geofencing Using SAS® Event Stream Processing
By Frederic Combaneyre, SAS Institute Inc.
Abstract: Geofencing is one of the most promising and exciting concepts that has developed with the advent of the Internet of Things (IoT). Examples include receiving commercial ads and offers based on your personal taste and past purchases when you enter a mall, tracking vessels to detect where a ship is located, and forecasting and optimizing ship harbor arrivals. This paper explains how to implement real-time geofencing on streaming data with SAS® Event Stream Processing and achieve high-performance processing in terms of millions of events per second over hundreds of millions of geofences.

Listening for the Right Signals – Using Event Stream Processing for Enterprise Data
By Tho Nguyen, Teradata Corporation, and Fiona McNeill, SAS Institute Inc.
Abstract: With the emergence of the Internet of Things (IoT), ingesting streams of data and analyzing events in real time become even more critical. The interconnectivity of IoT from web and mobile applications provides organizations with even richer contextual data and more profound volumes to decipher in order to harness insights. Capturing all of the internal and external data streams is the first step to enable listening for the important signals that customers are emitting, based on their event activity. Having the ability to permeate identified patterns of interest throughout the enterprise requires deep integration between event stream processing and foundational enterprise data management applications. This paper describes the innovative ability to consolidate real-time data ingestion with controlled and disciplined universal data access – from SAS® and Teradata™.

Prescriptive Analytics – Providing the Instructions to Do What's Right
By Tho Nguyen, Teradata Corporation, and Fiona McNeill, SAS Institute Inc.
Abstract: Automation of everyday activities holds the promise of consistency, accuracy, and relevancy. When applied to business operations, the additional benefits of governance, adaptability, and risk avoidance are realized. Prescriptive analytics empowers both systems and front-line workers to take the desired company action – each and every time. And with data streaming from transactional systems, from the Internet of Things (IoT), and from any other source – doing the right thing with exceptional processing speed embodies the responsive necessity that customers depend on. This paper describes how SAS® and Teradata are enabling prescriptive analytics – in current business environments and in the emerging IoT.

Analytics of Healthcare Things Is THE Next-Generation Real-World Data
By Joy King, Teradata Corporation
Abstract: As the Internet of Things (IoT) was gaining momentum in industries such as manufacturing, insurance, travel, and transportation, the healthcare and life science industries were still trying to figure out how to leverage real-world data (RWD) such as claims and electronic health records. RWD provides highly valuable and practical insights. But as valuable as RWD is, it still has limitations. Teradata has partnered with an IoT company whose technology generates unique data for conditions impacted by mobility or activity. This data can fill important gaps and provide new insights that can help distinguish your value in the marketplace. This paper describes successful pilots that have been conducted as well as ongoing case studies.

We hope these selections provide you with a useful overview of the many tools and techniques that are available to help you
as we shift from a data-at-rest to a data-in-motion world.

If this whets your appetite, check out The Non-Geek’s A-to-Z Guide to The Internet of Things, a white paper listing 101
common terms related to the Internet of Things. As IoT is evolving so quickly it’s not exhaustive but rather a quick go-to
resource for the technically savvy data professional who wants to get a handle on this vast IoT ecosystem explained sans
technical “geek speak.”

Tamara Dull
Director of Emerging Technologies
SAS Best Practices

Tamara Dull is the Director of Emerging Technologies for SAS Best Practices, a thought leadership
team at SAS Institute. Through key industry engagements, and provocative articles and
publications, she delivers a pragmatic perspective on big data, the Internet of Things, open source,
privacy, and cybersecurity. Tamara began her high-tech journey long before the internet was born,
and has held both technical and management positions for multiple technology vendors,
consultancies, and a non-profit. Tamara is listed in the IoT Institute's "25 Most Influential Women
in IoT" and Onalytica’s Big Data Top 100 Influencers and Brands lists for the last three years. She
is also an advisory board member for the Internet of Things Community.
Paper SAS6367-2016
Streaming Decisions: How SAS® Puts Streaming Data to Work
Fiona McNeill, David Duling, and Stephen Sparano, SAS Institute Inc.

ABSTRACT
Sensors, devices, social conversation streams, web movement, and all things in the Internet of Things
(IoT) are transmitting data at unprecedented volumes and rates. SAS® Event Stream Processing ingests
thousands and even hundreds of millions of data events per second, assessing both the content and the
value. The benefit to organizations comes from doing something with those results, and eliminating the
latencies associated with storing data before analysis happens. This paper bridges the gap. It describes
how to use streaming data as a portable, lightweight micro-analytics service for consumption by other
applications and systems.

INTRODUCTION
The Internet has created a culture of people conditioned to expect immediate access to information.
Mobile networks have created a society reliant on instant communication. The Internet of Things (IoT) is
forming a new era, blending these revolutionary technologies, and establishing information access and
communication between objects. This provides a seminal opportunity for organizations to realign their
services, products, and even identity to an operational environment that responds in real time.
Hopefully by now, the debate of “What is real time versus the right time” is over. For the purpose of this
paper, real time corresponds to latency that is so short that events are impacted as they occur. As such,
real-time activity is in contrast to the more traditional, offline use of data, where business intelligence and
analytics have been used to make both tactical and strategic decisions. Real time is a time-sensitive need
and is essential when the decision needs to occur to avoid impending and undesirable threats, or to take
advantage of fleeting opportunities.
In order for organizations to operate in real time, some fundamentals are required. Data input must be
emitted and received in real time, as it’s being generated, such as it is with sensors transmitting object
status and health. The data needs to be assessed in real time, extracting the inherent meaning from the
data elements as they are being ingested. Lastly, the data needs to provide decisions and the instructions
for low latency actions. These are characteristics associated with streaming data.
Unlike other types of data, streaming data is transferred at high-speed, on the order of hundreds,
thousands, and even millions of events per second – and at a consistent rate (save for interrupted
transmissions associated with network outages). Popular types of streaming data include streaming
television broadcasting and financial market data. Such data are continuous, dynamic events that flow
across a sufficient bandwidth and are so fast that there is no humanly perceived time lag between one
event and the next. Given the high volume and high velocity of streaming data, it’s not surprising that the
receipt, ingestion, and decisions made from this data are left to powerful computing technology, which
can scale to assure the high-volume, low-latency actions by objects connected in the IoT, enabling them to communicate and respond in real time.

UNDERSTANDING DATA, ANALYTICS, AND DECISIONS IN IOT


Streaming data is not only high-volume, high-velocity data; it is also highly varied data. It can be
generated by humans, machines, objects, sensors, and devices. With such different sources it should be
of no surprise that streaming data varies in data type, format, specification parameters, and
communication protocols.
EVENT STREAM DATA
For simplicity, we can broadly divide streaming data into two major types:
• Structured data, such as continuous readings from heart rate and blood pressure monitors, usage tracking from smart meters, binary readings of on/off status from machinery, RFID tags, sensor readings of temperature and pressure from oil drills, banking transaction systems, and more.
• Semi-structured or unstructured data, such as data generated by computer machine logs, social
media streams, weather alert bulletins, live camera feeds, and operational and ERP systems (free
form notes and comments are included with the structured records in most operational/ERP systems),
to name a few.
It’s often assumed that streaming data generated from sensors, devices, and machines is consistent and accurate, unlike human-generated content, which is known to be fraught with misspellings, stylistic differences, translation loss, and so on. However, sensor data and its cohorts also suffer from inconsistent and incorrect data: bad readings (a temperature sensor goes awry), missed readings (from interruptions in transmission), and the need to consolidate different readings, typically associated with multi-sensor assessments that have different specifications or protocols. For example, dialysis machines communicate using different languages, transmitting over USB, Ethernet, different serial interfaces (RS-232, RS-485, RS-422, and so on), and Wi-Fi®.
Streaming data, as with any other type of data, suffers from data quality issues that must be addressed in order to assess, analyze, and act on it. Big data repositories, like Hadoop®, provide a currently popular answer: capture and then cleanse streaming data for analysis. And while initially this might be viable, it’s only a short-term stop gap. With the expansion of IoT, and the corresponding explosion in streaming data on the horizon, even low-cost commodity storage for big data will soon be too expensive to economically address the needs of streaming data. Yet even if an unlimited budget existed (it doesn’t), when real-time answers are demanded, the latency associated with first storing streaming data, then cleansing, then analyzing adds incremental time to processing – delaying actions until they are no longer in real time.
You can reduce the transmission, storage, and assessment costs of streaming data by cleansing and analyzing streaming data near the source of data generation, pushing the required processing to the edges of the IoT. Aggregators, gateways, and controllers are natural levees to cleanse multiple sources of aggregated data, minimizing downstream pollution from dirty events. Embeddable technology, provided by SAS® Event Stream Processing, aggregates, cleanses, normalizes, and filters streaming data while it is in motion – and before it is stored. SAS Event Stream Processing is poised to even process data at the sensor processing chip itself.
Unlike traditional database management systems, which are designed for static data in conventional stores, and even big data repositories, with queries to file systems, streaming data management requires flexible query processing in which the query itself is not performed once or in batch, but is permanently installed to be executed continuously. SAS includes pre-built data quality routines in the SAS Event Stream Processing query definition. In this way, the necessary streaming data correction, cleansing, and filtering is applied to data in motion and, in turn, reduces polluting data lakes with bad and irrelevant data. Of equal, if not more, importance, including streaming data quality paves the way for streaming analytics, doing the required data preparation for analytically sound real-time actions.

STREAMING ANALYTICS

Tom Davenport, a thought leader in the field of analytics, has said that the “Analytics of Things is more
important than the Internet of Things” (Davenport, 2015). Arguable perhaps by those in the
communications industry, the point is that understanding the data upon which connected objects
communicate is critical to having successful conversations between the ‘things’. IoT provides an
opportunity to reconsider how we use analytics, and make it pervasive - to drive useful and effective
conversations between things.

In many cases, we can apply the same types of analytics to streaming data that we use in traditional
batch model execution. The difference is that unlike traditional analysis, which requires data to be stored
before it’s analyzed, event streams analyze data before it’s stored. The following types of analytics
are applicable to IoT data as part of the continuous query:

• Descriptive analytics identifies patterns in events as they occur

• Predictive analytics identifies future likelihoods of events that have not yet happened

• Prescriptive analytics provides the instructions for event actions

SAS Event Stream Processing provides procedural windows to include both descriptive and predictive
algorithms defined in SAS DATA step, SAS DS2, and other languages. As with traditional analysis, these
models are built in SAS® High-Performance Data Mining, SAS® Factory Miner, SAS® Contextual Analysis,
SAS® Forecast Server, and any other SAS® product or solution that generates SAS DATA step or SAS
DS2 code. For this broad selection of algorithms, the models are built upon an event history that has
been stored. As with traditional analysis, models are built, tested, and validated. The resulting model
code, however, is included into the SAS Event Stream Processing continuous query as a pre-defined,
procedural calculation. As part of the continuous query, the model scores individual events as they are
ingested. In other words, the analytics are performed on live data streams, synonymous with the term
‘streaming analytics’.
Taking advantage of machine learning and deep learning techniques, SAS Event Stream Processing
includes a growing suite of methods to build and score event data solely based on streaming data, and
without out-of-stream model development and event history. In this case, algorithms such as K-Means
clustering are both defined and applied to events in motion, learning with new events. This exciting field
of new techniques further expands the streaming analytics methods available to streaming data.
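As a purely generic illustration of how such in-stream learning can work (this sketches a common online k-means update, not necessarily the exact algorithm implemented in SAS Event Stream Processing): each arriving event x is assigned to its nearest centroid c, and that centroid is nudged toward the event, c = c + (x - c)/n, where n is the number of events the centroid has absorbed so far. The clusters therefore adapt continuously as events stream by, with no stored event history required.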

STREAMING DECISIONS
The focus of this paper is enabling prescriptive analytics in stream, a term we describe as ‘streaming
decisions’. Streaming decisions define the instructions for real-time actions based on live, streaming
events. They are of particular importance to actions taken by objects in the IoT. They combine descriptive
and predictive algorithms, with the business rules that trigger when the models are relevant to the current
streaming event data scenarios. In other words, they are the instructions needed by an IoT object to take
the right action, something of core importance to the adoption and successful proliferation of autonomous
IoT activity.

As we distribute analytics further out to the edges of the IoT, there is a classification that provides some
guiding principles that direct when one type of analysis, and the corresponding actions, is more applicable
than another. The following types of analytics are described on the IoTHub (2016):

• Edge Analytics is the analysis performed on the same device from which the data is streaming
• In Stream Analytics is the analysis that occurs as data streams from one device to another, or from
multiple sensors to an aggregation point
• At Rest Analytics is the analysis processed after the event has passed, based on saved historical event data and/or other stored information.

In general, the closer to the edge, the less event data there is to analyze. At the edge, there is just that
one object/sensor/device, with its limited supply of data. As mentioned, data quality issues are present at
the edge, and events can be aggregated in windows of time to correct and filter out the irrelevant noise
from the signal of interest. Analytical calculations are more limited due to the data restrictions, and
prescriptive analytics (say, instructions emanating from another object) are limited to real-time actions that
can be performed in isolation – like commands to turn up or down, turn on or off.

As more objects are related to each other in-stream, at aggregation points, the data are richer (emanating
from several sources) and correspondingly there are more data quality issues. The contextual
understanding of the scenario is also richer (with more event data over time, space, and so on) and, in
turn, more complex patterns of interest can be identified. Streaming decisions can thus be made that
relate to more objects, even becoming a series of inter-connected actions.

As is typical of any analysis, the decision to apply different types of models depends on the data as well as the business problem the analysis solves. More often than not, IoT analytic solutions require multiphase analytics: models defined in the traditional, stored-data paradigm and scored for new analytical insight, as well as in-stream model derivation/calculation and analytics applied at the edge. SAS® does this. With the same integrated code, and with over 150 (at last count) adapters and connectors linking streaming data, SAS Event Stream Processing is used to define the complete continuous query, which can be as simple or complex as the business problem itself. Moreover, built into SAS Event Stream Processing is the ability to automatically issue alerts and notifications for real-time situational awareness and understanding of event status.

When we consider the IoT, we are describing an analytically driven network of objects that communicate with each other. When we automate actions between objects, especially when there is no human intervention, the risks associated with rogue actions, as well as the technical debt that accumulates from both machine learning algorithms (Sculley et al., 2015) and from any unmanaged advanced analytics environment, will outweigh the advantages. As such, the IoT demands a governed, reliable, and secure environment for streaming analytics and the associated prescribed analytic actions. SAS® Decision Manager is a prescriptive analytics solution with fully traceable workflows, versioning, and audit trails to assure command and control over streaming analytics for real-time, reliable, and accurate IoT applications.

BUILDING STREAMING DECISIONS WITH SAS®


Decision Construction
When designing, building, and testing decisions, it’s best to begin with a description of what is included within a decision. For the discussion within this paper, we consider the types of decision-making that organizations use, that is, strategic and tactical decisions, and the ways that organizations leverage analytical models and their output to make decisions that meet business goals.

Strategic and Tactical Decisions


Organizations must address decisions at both the strategic and the tactical levels, since both are
required for a business to run effectively. Strategic decisions typically represent the less common
decisions that an organization makes such as creating new product lines or expanding into new territories
or merging with another firm. These decisions, while important, are not typically made on a frequent basis
and therefore businesses can take the time and effort needed to create specific processes that aren’t
required to be repeatable or automated.
Tactical decisions, on the other hand, which can include operational decisions, are made frequently and in high volume (often thousands of decisions in a single day, or even in minutes or seconds). Loan underwriting, fleet maintenance operations, point-of-sale operations, fraud detection, and remediation are examples of the decisions that process high volumes of rapidly moving information. Tactical decisions like these are numerous and require short timeframes, high rates of data ingestion, automation and analytics, and, of course, ways to prescribe an appropriate action based on the analytic model output.
Analytics, an important element for tactical decision making, has become pervasive within organizations
due to a couple of factors. First, the accessibility of analytics has increased due to the rise of tools that
assist and guide users through the analytical process to suggest relevant algorithms based on the
available data. This data-driven, guided approach includes better visualizations to identify patterns in the
data and recommendations as to which is the best model to use. Second, analytics is being applied to a
wider set of problems, ranging across industries from retail to manufacturing, to health care and drug
development. Organizations have come to recognize that the application of analytics can help with a vast
array of problems. The emergence of the data scientist and the citizen data scientist represents the wider
number of users that are using the new and powerful analytical tools, applying them in a variety of ways
to solve difficult and complex business problems within a single business.

SCALING TO DATA STREAMS

Prescribing Action from Analytics


Analytics has become pervasive in business and can result in large volumes of analytical output, which is applied to a business process in order to generate a decision. The output is usually a score, a numerical representation of the likelihood of an outcome. But how can businesses translate that numerical output into a decision that is repeatable, automated, and scalable?

Some scoring models deliver probabilities of an event occurring based on analysis of historical trends and events. In many cases, IF-THEN-ELSE (conditional) logic applied to the models is part of the probability determination, but the score doesn’t define the action to be taken. For example, a typical credit score for a customer applying for a loan doesn’t specify whether the loan officer (or underwriting system) should approve or deny the loan, what rate to offer the customer, or what product to cross-sell/upsell. Comparatively, the actions of a decision would provide the prescriptive instruction: suggesting approve over deny, defining the appropriate rate, or noting that the customer would likely be interested in personal insurance with that loan. These prescriptive instructions combine the output of the analytical model with business rules, suggesting or defining the action that is needed. Such prescriptive action drives the numerous tactical and operational decisions for a business, and it must be scalable to address the high-frequency data inherent to business operations.
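A minimal sketch of this idea, using the loan example above with entirely hypothetical variable names, thresholds, and offers (ordinary DATA step logic shown for illustration, not SAS Decision Manager syntax):

/* Hypothetical decision logic: score_default and the cutoffs below are   */
/* illustrative only and not taken from any real decision flow.           */
data work.loan_decisions;
   set work.scored_applications;
   length action $12 offer $32;

   /* Business rules layered on top of the model's probability of default */
   if score_default > 0.60 then do;
      action = 'DENY';
      offer  = ' ';
   end;
   else if score_default > 0.30 then do;
      action = 'APPROVE';
      offer  = 'standard rate';
   end;
   else do;
      action = 'APPROVE';
      offer  = 'preferred rate plus insurance';  /* cross-sell suggestion */
   end;
run;

The score alone says how likely default is; the rules around it turn that likelihood into the prescribed action and offer.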

Decisions in Action
Streaming data, as we have seen, represents data from sources such as customers’ web clicks, call data records, fleet vehicle GPS, point-of-sale systems, and now more commonly, sensors from corporate assets, such as machines on a manufacturing floor or sensors in an electric power grid.

Data from these systems has historically been extracted, transformed, or cleansed from the source, and then loaded to data warehouses for storage and later analyzed. But, as described above, soon, if not already, organizations will conclude that they can’t afford to store it all, and they certainly can’t afford the lag times of analyzing data after it has been stored. In business operations, the value of data diminishes the longer we wait to use it, so we need new ways to analyze it sooner – closer to where it originates – and that means we need new ways to tap into the value of data streams.

Tapping into data while it’s still in motion in data streams, before it’s stored, empowers actions to be applied sooner – before its value diminishes, and before an opportunity is missed or a threat goes unprevented. The diminishing value of data can be seen in Figure 1, which depicts the relationship between our ability to ingest and analyze the data for an action and the value of making that decision sooner rather than later.

Figure 1. Decision Decay and Diminishing Value

DRIVING INTERNET OF THINGS (IOT) ACTIONS
In this nascent, hyper-connected world of IoT, data is being generated rapidly, and businesses want effective approaches to leverage their analytical resources to not only analyze and gain insights from the rising tide of data but also take action from it, to obtain the most real-time value.

IDC research has found that only 0.5% of the data being generated through the IoT is being analyzed to derive value (see Figure 2). In other words, only 0.5% of the data from “things” was being analyzed at that time, leaving a rich set of opportunities to understand and take action untapped. And while this research is from 2012, and more organizations have since begun to examine IoT data for deriving business value, the vast majority of organizations have not yet used IoT data for business operations. This report also points out the amount of untagged data (often associated with unstructured text data) that would be more useful if it were tagged and analysis-ready. However, with the aforementioned traditional technique of storing first and then analyzing, tagging content for use is often prohibitive because of the consequential large storage costs – even if we assume that all potentially useful unstructured text could be stored (Gantz and Reinsel, 2012).

Figure 2. IDC: The Untapped Big Data Gap (Gantz and Reinsel, 2012)

With increasing numbers of objects, sensors, and devices joining the connected network of the IoT, the big data gap is growing. Processing data at the necessary scale and speed associated with streaming data generated from the ‘things’ requires new architectures that can analyze and make decisions on the streaming, in-motion data, filtering out the irrelevant from any downstream activity – including data storage. As such, there is a growing necessity to move analytics, the associated operational decisions, and the corresponding actions closer to the data – and in some cases, right into the data streams, near the point of data generation.
SAS Event Stream Processing and SAS® Decision Manager together enable organizations to analyze
data while it’s still in motion and also apply prescriptive actions sooner, deriving maximum value from live
events and before streaming data value diminishes. The communication between prescriptive instruction
from SAS Decision Manager and the real-time analytical determination embedded within SAS Event
Stream Processing is achieved using SAS® Micro Analytic Services 1. SAS Micro Analytic Services
provides the ability to quickly execute decisions based on the results of in-stream scoring.

1 SAS Micro Analytic Services are included within the SAS Decision Manager offering.

SAS Decision Manager and SAS Event Stream Processing
SAS Decision Manager is used to automate tactical decision making by prescribing actions to take through the design, development, testing, and publishing of decision flows. SAS Decision Manager supports the Business Analyst, Data Scientist, Data Miner, and Statistician, who collaboratively develop tactical decision actions by building decision flows that combine analytical models with the business rules that drive operational business processes.
These same decision flows, authored in SAS Decision Manager and published into SAS Micro Analytic Services, can be executed within streaming data by inclusion in the event data flow activity defined in SAS® Event Stream Processing Studio.

CASE STUDY: PRESCRIPTIVE ACTION FOR TRANSPORTATION


The following real-world case study illustrates the power of streaming analytics calculated close to event
data origination for real-time prescriptive actions. We describe a simple yet powerful approach with SAS
Event Stream Processing to build the streaming model and analyze streaming data, with SAS Decision
Manager acting on the real-time events using SAS Micro Analytic Services.

BUSINESS PROBLEM
A fleet management company wants to minimize the time vehicles are out of service to reduce costs, minimize lost revenue, and maximize uptime. The trucks have sensors that transmit data monitoring location, vibration, rpm, temperature, speed, steering angle, pressures, and so on. The company needs to proactively prioritize maintenance before a vehicle is unexpectedly out of service. The somewhat obvious and immediate business benefit of analyzing vehicle sensors in the transportation industry is to maximize assets efficiently.

Streaming Analytics for IoT Actions: Two Aspects Driving One Outcome
The company has collected data on its fleet and processed it offline, identifying problems and then generating notices sent to maintenance locations, to try to improve the maintenance scheduling and overall efficiency of vehicle assets. This approach, however, is typically expensive, given that the vehicle is often inoperable and taken off the road before parts are available for the necessary maintenance or mechanics can fit the additional work into their schedules. This can result in unplanned downtime, overtime servicing costs, expensive non-scheduled parts delivery, and more.
In this scenario, the company wants to use real-time data they’re collecting from vehicles to identify
issues sooner and increase the time available to address maintenance proactively. In fact, the truck
sensors can be analyzed in an onboard SAS Event Stream Processing engine, reducing the latency from
the time the data is collected to the time the issue is identified and addressed.
On-board processing of streaming data is one solution to better predict the likelihood of a vehicle issue.
An even better solution is to drive a prescriptive action that instructs where to route the vehicle, notifies suppliers of necessary parts for on-time delivery, and identifies mechanic schedules aligned to a service stop that minimizes the deleterious impact on the fleet’s transportation of goods to its customers.
Together, the in-stream analysis along with the prescriptive actions enables alerts to be generated in real time to all stakeholders, empowering them to take the right action, minimizing costs, maximizing revenue,
and drastically reducing unplanned down times for the fleet.

SAS® DECISION SPECIFICATION


Analytics can be found at the core of any quality decision making process. In this case, we want to build a
predictive model that will help us decide when trucks are likely to experience a failure that requires
downtime and maintenance. We need a history of truck sensor readings and a history of scheduled and
unscheduled maintenance events. We can then use the sensor readings to predict the unscheduled
events and use those predictions to make better decisions. To build the history data, we have
accumulated data from a small fleet of trucks equipped with sensors. These trucks were operated for a
period of time in different locations and conditions and their sensor readings stored in on board memory.
The saved readings were later downloaded to a database. In addition, we have data on the maintenance
records for those same trucks and significant events were labeled as failures. We have joined the sensor
readings with the maintenance events to create a predictive modeling table. The vehicle data represents
real sensor readings and has been provided by the Intel™ Corporation for public demonstrations.
Building the Model
The first thing that a data scientist will do is look at the data and run a basic variable distribution analysis. All variables in this sample, including failure, are numeric.
proc means data=trucks.failures n nmiss min max mean;
   output out=means;
   var _numeric_;
run;

Output 1. Output from the MEANS Procedure

We can see that three variables have missing values in a relatively small number of cases: 14, 158 and
791, out of a total of 8395 cases. We can also see that failure occurred in 20% of the cases. Therefore,
predicting failure has potential to improve efficiency and reduce costs.
We need a tool that can predict failures for this sample. We need a tool that has robust handling of missing values. We also need a tool that can provide root cause analysis and determine which devices with sensors potentially contribute to failure so that we can improve those devices and reduce the overall
function that we can deploy to the real-time system. The scores generated by this system will be used to
generate signals for the decision processing application. Higher scores will indicate that failure is possibly
imminent and the truck should be routed to a service center before a serious problem occurs.
The decision tree is a modern data mining tool that handles missing values and is useful for both model interpretation and scoring. A decision tree is sometimes referred to as a recursive partitioning algorithm, which works by selecting variables that have the greatest power to divide cases based on the dependent
target variable values. The result is a downward tree where each node contains fewer cases and the
dependent target variable distribution is more biased. The prediction value at each leaf of the tree is the
leaf’s proportion of dependent target variable events. In our case, the target variable failure has two
values, zero and one, where one represents an observed failure event. We want to create a decision tree
model that predicts the event value (one) and identifies the independent input variables that are used in
that prediction.
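For example, under the leaf-proportion definition above, a leaf that contains 50 training cases, 10 of which are observed failures, would assign a predicted probability of failure of 10/50 = 0.20 to any new observation routed to that leaf (the counts here are illustrative only).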
Therefore, we will proceed with building a model using the HPSPLIT procedure from the SAS Enterprise Miner™ distribution. Notably, we are using the procedure option missing=branch to enable tree branches based on missing values in addition to real values. We also did not select the GPS variables, since we want our models to be based on truck sensor readings that can be applied in any location. For the complete procedure syntax, refer to the SAS documentation.
/* select list of input variables */
proc transpose data=means (drop=_TYPE_ _FREQ_) out=vars;
   id _STAT_;
run;

proc sql noprint;
   select _name_ into :vars separated by ' '
   from vars
   where _name_ ne 'failure' and
         _name_ not contains "GPS";
quit;

/* predict truck failure */
filename scrcode "&file.\score.sas";
proc hpsplit data=trucks.failures missing=branch;
   performance details;
   partition fraction (validate=0.3);
   input &vars / level=int;
   target failure / level=nom order=ascending;
   score out=model;
   code file=scrcode;
run;

The results of the HPSPLIT procedure provide some clues about the failure analysis. First, we want to examine the accuracy of the procedure. The confusion matrix shown below displays the number of cases that are correctly and incorrectly predicted. In our sample, only six cases of failure were misidentified as non-failures. The accuracy of this model is very good.

Output 2. Output from the HPSPLIT Procedure


For interpretation of the model, we can look at the list of variables selected as important predictors. The
devices attached to these sensors should be examined for their possible contribution to truck failures.
These are the variables that will be required to execute the decision scoring function.

Output 3. Variables Selected by the HPSPLIT Procedure

Finally, we can look at the structure of the decision tree to better understand the complexity of the model
and the order of importance of the input variables. The following tree graph shows the overall size of the
tree model in the overview box and the readable detail of the top portion of the tree. We can see that
Trip_Time_journey is the most important predictor, followed by Throttle_Pos_Manifold and Engine_RPM.

Output 4. Decision Tree Structure from the HPSPLIT Procedure


Deploying the Model

Now that we have confidence in this model, we can work on deploying it to the event stream processing
engine. Model scoring is simply the application of the model to new data to produce a score. In this case
the model represents the decision tree created in the model building step. In the running of the HPSPLIT
procedure, we saved the scoring code to an external file named scrcode that contains simple DATA step
code that generates predictions. The score code contains 159 source code lines of SAS DATA step code
that can be inserted between the SET statement and the RUN statement in a SAS program. A small
fragment of the score code is displayed below. The score code contains several similar fragments.
. . .
IF NOT MISSING(Trip_Time_journey) AND ((Trip_Time_journey >= 10026.2)) THEN DO;
   IF NOT MISSING(Engine_RPM) AND ((Engine_RPM >= 1761.6)) THEN DO;
      IF NOT MISSING(Mass_Air_Flow_Rate) AND ((Mass_Air_Flow_Rate < 51.6215)) THEN DO;
         IF NOT MISSING(Accel_Pedal_Pos_D) AND ((Accel_Pedal_Pos_D >= 27.12156951)) THEN DO;
            _Node_ = 18;
            _Leaf_ = 8;
            P_failure0 = 0;
            P_failure1 = 1;
         END;
         ELSE DO;
            _Node_ = 17;
            _Leaf_ = 7;
            P_failure0 = 1;
            P_failure1 = 0;
         END;
      END;
      ELSE DO;
         _Node_ = 12;
         _Leaf_ = 3;
         P_failure0 = 1;
         P_failure1 = 0;
      END;
   END;
. . .
The code requires the input variables listed by the model description and generates five new variables
that describe the model. _WARN_ is an indicator that the prediction function could not be computed from
the values of the input variables. _NODE_ and _LEAF_ are internal variables that identify the branches
taken for each case. P_Failure1 and P_Failure0 are the probabilities of truck failure and non-failure,
respectively. We are primarily interested in P_Failure1 and _WARN_. Higher values of P_Failure1
indicate that action should be taken to prevent truck failure. Non-empty values of _WARN_ might indicate
that one or more critical sensors have failed and action should be scheduled to repair those devices. The
score code file is stored as part of the sample code for this paper. Two additional variables for validation
data proportion have been omitted as they are not needed for this example. The DATA Step score code
is then adapted for the run-time environment.
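For instance, a minimal batch test of the generated score code might look like the following, where the fileref scrcode and the variables P_failure1, P_failure0, and _WARN_ come from the steps above and the output data set name is arbitrary:

/* Wrap the HPSPLIT score code in a DATA step between the SET and RUN statements */
data work.scored_failures;
   set trucks.failures;
   %include scrcode;   /* fileref pointing to the generated score.sas */
run;

/* Spot-check the predictions */
proc print data=work.scored_failures (obs=5);
   var P_failure1 P_failure0 _WARN_;
run;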
Event Stream Processing
The score code is now deployed to the SAS Event Stream Processing engine. The SAS Event Stream
Processing engine requires code in the DS2 format. DS2 code is a modular and structured form of SAS
DATA Step that can be embedded in various run-time environments. The process for converting
acceptable DATA Step code to DS2 is fairly simple. The scorexml macro will detect the necessary input
variables and save them to an XML file. The DSTRANS procedure will convert the DATA Step code to
DS2 code and add the needed input variable declarations. We then want to test the scoring in a simple
DS2 procedure step and examine the output.
/* Create an XML file for Variable Info */
filename scrxml "&file.\score.xml";
%AAMODEL;
%scorexml(Codefile=scrcode,data=trucks.failures,XMLFile=scrxml);

/* Convert Code to DS2 */
proc dstrans ds_to_ds2 in="&file.\score.sas" out="&file.\score.ds2" EP nocomp
xml=scrxml; run; quit;
/* Test Scoring */
libname SASEP "&file.";
proc delete data=sasep.out; run;
proc ds2 ;
%include "&file.\score.ds2";
run;

After we have validated the DS2 code and the scoring, an additional manual step is required to rename
the input and output destinations to ESP.IN and ESP.OUT, respectively.
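The exact program structure produced by PROC DSTRANS can vary by release, but after the renaming the DS2 data program takes a shape roughly like the sketch below; treat this as an assumption-laden illustration rather than actual generated output:

data esp.out;
   dcl double P_failure0 P_failure1;   /* declarations derived from the XML variable info */
   method run();
      set esp.in;
      /* generated decision tree scoring logic from score.ds2 goes here */
   end;
enddata;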
The authors recommend using SAS Event Stream Processing Studio for creating the stream definition. In
that definition we need to define a procedural window based on DS2 code. The properties for this window
will include a text editor for DS2 code. The user needs to paste the DS2 code created in the previous
section into this procedural window. The event definition must include the input variables required by the
decision tree model and pass them to the procedural window. The window definition must add the
_WARN_ and P_Failure1 variables to the output event definition.
Display 1 shows a simple design for handling the sensor events. An on-board diagnostic device has
already aggregated the sensor data into a single data record. The SAS Event Stream Processing Studio
Data_Stream window can subscribe to the diagnostic data records, execute the predictive model scoring,
and then filter the events to the ones that have a high probability of failure. After the filter, we can add a
notification window that will post predictive failure events to a remote service such as a data center, or
store them locally until they can be retrieved by a physical location such as a maintenance shop.

Display 1. Simple Design for Handling Sensor Events in SAS Event Stream Processing Studio

The Complete System


The complete solution for monitoring devices on a fleet of trucks, predicting failure, and taking action is
very complex. This paper covers only a small part of that system. SAS provides key components for
adding advanced analytics to every step of that journey.
SAS Event Stream Processing can integrate high speed data management, pattern detection, and model
scoring in integrated devices. Local events can be handled and filtered before integration with
operational servers. We used a decision tree scoring function to predict possible failures. As a result, only
the important events that indicate failure need to be transmitted to the operational systems. By moving
critical analytics to the local system we can detect problems earlier and save huge amounts of server
processing, storage, and networking.
SAS Decision Manager can be used to construct and execute routine business decisions in both online
and batch processing environments. After an alert from a truck is received, the custom decision logic can
be used to determine the best action to take and the best way to execute that action. The core SAS
Decision Manager contains components that will help manage and monitor the predictive models, build
and manage business rules, and build and execute decision processes. These components operate in a
data center making routine business decisions and have business-friendly interfaces. They provide a
buffer between SAS Advanced Analytics and the business’s operational systems.
The complete system is depicted in Figure 3. The IoT systems are responsible for event processing and
pattern detection. Operational systems are responsible for delivering the business strategy and resources
to the IoT systems and managing the alerts and communications that are recommended. This view is
entirely representational. There are many ways to architect the system and many ways to connect the
components.

Figure 3. Architecture of IoT Systems and Operations Systems

RESULTS FROM STREAMING DECISIONS


Businesses are now required to adapt to the needs of real-time data analysis and decision making to deliver value in the form of lower costs, rapid response, and maximized revenue. With SAS Decision Manager and
SAS Event Stream Processing, they can now deliver that value to the business, minimize the additional
costs associated with missing opportunities, and reduce the inherent latencies of processing data after it
has been persisted in various data stores. Tangible results include reducing maintenance costs through a
more proactive approach and targeting specific maintenance tasks to keep their fleet moving as efficiently
as possible. And since time is money, the business can address problems before they manifest
themselves as out-of-service vehicles that have slowed delivery times, added costs, and negatively
impacted customer satisfaction.
With real-time analysis and decision making applied across their fleet, the business can look at other
ways to take advantage of the big data gap in their operations. By looking to new sources of event data,
along with those that already exist, and applying streaming analytics, the business could improve
inventory turnover (through integration of point of sale data with inventory management), or reduce stock
of maintenance components (by integrating live data into their forecasting mechanisms). The possibilities
are almost limitless when organizations look for ways to leverage streaming data and improve their
decision making.

CONCLUSION
The following summarizes the four ways that processing streaming data plays a vital role in IoT
(Combaneyre, 2015):
• Detect events of interest and trigger appropriate action
• Aggregate information for monitoring
• Cleanse and validate sensor data
• Enable real-time predictive and optimized operations
SAS Event Stream Processing is being used by organizations across industries to send alerts and notification triggers for current situational monitoring and to improve sensor data quality in real time.
The autonomous actions of objects need to be founded in the well-established practices of humans. As objects in the IoT are depended on to make decisions, the need to govern, manage, control, and secure the conditional logic for specific event scenarios will become increasingly important. For these more complex, real-time streaming decisions, the processing speed of SAS Event Stream Processing executing actions defined with the rigor of SAS Decision Manager will ensure that the real-time actions of autonomous objects in the IoT are both the right and the relevant ones.

REFERENCES
Combaneyre, F. 2015. “Understanding data streams in IoT.” SAS White Paper. Available
at http://www.sas.com/en_us/whitepapers/understanding-data-streams-in-iot-107491.html.
Davenport, T. 2015. “#ThinkChat IoT and AoT with @ShawnRog and @Tdav.” Information Management. Available at http://en.community.dell.com/techcenter/information-management/b/weblog/archive/2015/06/22/thinkchat-iot-and-aot-with-shawnrog-and-tdav.
Gantz, J., and D. Reinsel. 2012. “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East.” IDC IVIEW.
“IoT pushes limits of analytics”. IoTHub. February 29, 2016. Available
at http://www.iothub.com.au/news/iot-pushes-limits-of-analytics-415787.
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, D.,
Crespo, J-F., Dennison, D. 2015. “Hidden Technical Debt in Machine Learning Systems”
Proceedings of NIPS 2015. Montreal, PQ. Available at http://papers.nips.cc/paper/5656-
hidden-technical-debt-in-machine-learning-systems.pdf?imm_mid=0df22b&cmp=em-data-na-
na-newsltr_20160120.

ACKNOWLEDGMENTS
The authors would like to thank Kristen Aponte, Brad Klenz, and Dan Zaratsian for their assistance and
contributions to this paper.

RECOMMENDED READING
“Channeling Streams of Data for Competitive Advantage”, SAS White
Paper http://www.sas.com/en_us/whitepapers/channeling-data-streams-107736.html
“How Streaming Analytics Enables Real-Time Decisions”, SAS White
Paper http://www.sas.com/en_us/whitepapers/streaming-analytics-enables-real-time-decisions-
107716.html

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:

Fiona McNeill
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
Email: fiona.mcneill@sas.com

David Duling
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
Email: david.duling@sas.com

Steve Sparano
SAS Institute Inc.
100 SAS Campus Drive
Cary, NC 27513
Email: steve.sparano@sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Paper SAS645-2017
Real-time Analytics at the Edge: Identifying Abnormal Equipment Behavior
and Filtering Data near the Edge for Internet of Things Applications
Ryan Gillespie and Saurabh Gupta, SAS Institute Inc.

ABSTRACT
This paper describes the use of a machine learning technique for anomaly detection and the SAS® Event Stream Processing engine to analyze streaming sensor data and determine when performance of a
turbofan engine deviates from normal operating conditions. Turbofan engines are the most popular type
of propulsion engines used by modern airliners due to their high thrust and good fuel efficiency (National
Aeronautics and Space Administration 2015). For this paper, we intend to show how sensor readings
from the engines can be used to detect asset degradation and help with preventative maintenance
applications.

INTRODUCTION
The data set used is the 2008 Prognostics and Health Management (PHM08) Challenge Data Set on
turbofan engine degradation (Saxena and Goebel 2008). We use a single-class classification machine
learning technique, called Support Vector Data Description (SVDD), to detect anomalies within the data.
The technique shows how each engine degrades over its life cycle. This information can then be used in
practice to provide alerts or trigger maintenance for the particular asset on an as-needed basis. Once the model was trained, we deployed the score code onto a thin-client device running SAS® Event Stream
Processing to validate scoring the SVDD model on new observations and simulate how the SVDD model
might perform in Internet of Things (IoT) edge applications.

IOT PROCESSING AT THE EDGE


IoT processing at the edge, or edge computing, pushes the analytics from a central server to devices
close to where the data is generated. As such, edge computing moves the decision making capability of
analytics from centralized nodes and brings it closer to the source of the data. This can be important for
several reasons. It can help to reduce latency for applications where speed is critical. And it can also
reduce data transmission and storage costs through the use of intelligent data filtering at the edge device.
If you have a better sense of what data is valuable and what data is not at the edge of your network, you can be more selective about what you choose to transmit and store. With intelligent data filtering, a real-time event stream processing engine can reside at the edge of the network not just to score new data, but also to pre-process and filter the data before analysis or scoring.
In our use case, we are evaluating sensors from a fleet of turbofan engines to determine engine
degradation and future failure. To do this, we constructed a scoring model to be able to do real-time
detection of anomalies indicating degradation. In practice, it is easy to imagine an organization trying to monitor a large fleet of turbofan engines or other capital-intensive pieces of equipment. This equipment might not have access to a centralized cluster of computers for real-time monitoring. Or,
the latency associated with transmitting the data might be greater than the practical benefit of monitoring
the asset. This is an area where edge computing can help to resolve issues where the scale of data
transmitted is a concern.
The closer we can put the model to the sensors, or the source of the data, the less time it takes to make a
decision due to the decrease in data movement. The risk of the decision process being interrupted by an
unreliable or slow network connection is also decreased if a lightweight scoring engine resides at the
edge of the network.
Edge computing can also help reduce the costs associated with the analytics infrastructure. Training your
failure model happens less frequently than scoring the data, so you can reduce costs by using metered
computing to train your model and then deploying the decision or score code to many smaller nodes at
the edge. Most analytics use cases do not use all available data, because much of it is redundant, incomplete, or noisy. By leveraging edge computing, you can filter or pre-process the data with an event streaming engine closer to the source. Then you can use only the relevant data, in the proper format, both to train your model and to make your predictions or alerts.

THE POTENTIAL MARKET FOR EDGE COMPUTING


According to Cisco, there will be 50 billion connected devices by 2020 (Cisco 2014). Intel has an even
more bullish prediction at 200 billion connected devices by 2020 (Intel 2017). No matter who is right, the
opportunity for edge computing appears large. IDC has estimated that investment in IoT, which includes devices, connectivity solutions, and services, will reach $1.7 trillion (IDC 2017). Telefonica estimates that
90% of cars will be connected by 2020 (Telefonica 2014). And GE estimates that the “Industrial Internet”,
which is the Internet of connected industrial machinery such as turbofan engines, will add about $10 to
$15 trillion to the global GDP in the next 20 years (General Electric and Accenture 2014). This amount of
investment and value at stake will likely require analytics solutions that can be deployed at a centralized
location or on edge devices for a variety of prediction and real-time monitoring applications such as
anomaly detection.

SUPPORT VECTOR DATA DESCRIPTION (SVDD)


One potential solution to the issue of anomaly detection is a method called Support Vector Data
Description. It’s a machine learning technique that can be used for single-class classification. The model creates a minimum-radius hypersphere around the training data used to build the model. The hypersphere is made flexible through the use of kernel functions (Chaudhuri et al. 2016). As such, SVDD
is able to provide a flexible data description on a wide variety of data sets. The methodology also does
not require any assumptions regarding normality of the data, which can be a limitation with other anomaly
detection techniques associated with multivariate statistical process control.
If the data used to build the model represents normal conditions, then observations that lie outside of the
hypersphere can represent possible anomalies. These might be anomalies that have previously occurred
or new anomalies that would not have been found in historical data. Since the model is trained with data
that is considered normal, the model can flag any observation that is not considered normal, whether or not it has seen a similar example before.
Being able to detect new anomalies can be key to Internet of Things applications or detecting new threats
related to cyber-security or fraud. Given that the model also only requires one class of data for
construction, it can be beneficial for applications where there is a severe imbalance of observations
between normal and non-normal operating conditions.
The implementation of SVDD used for this paper can be found in SAS® Visual Data Mining and Machine
Learning.
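To give a feel for the single-class idea on synthetic data, the sketch below uses scikit-learn's OneClassSVM, a closely related one-class method with an RBF kernel. It is an analogue for intuition only, not the SAS SVDD implementation.

# Illustrative one-class anomaly detection (analogous to SVDD),
# trained only on "normal" synthetic data.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))    # normal sensor readings
drifted = rng.normal(loc=3.0, scale=1.0, size=(10, 5))     # degraded readings

model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(normal)

# Lower decision_function values indicate observations further from the
# learned "normal" region; the drifted readings should score noticeably lower.
print(model.decision_function(normal[:3]))
print(model.decision_function(drifted[:3]))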

APPLICATION OF SVDD
To illustrate how SVDD can be applied to a predictive maintenance scenario, we used the algorithm on
the 2008 Prognostics and Health Management (PHM08) Challenge Data Set on turbofan engine
degradation (Saxena and Goebel 2008). The data set consists of examples of simulated turbofan engine
degradation that were used for a data challenge competition at the 1st International Conference on Prognostics and Health Management.

DESCRIPTION OF DATA SET


The data set consists of multivariate time series information for a fleet of engines, with each engine operating normally at the start of the series and then degrading until failure is reached.
There are 26 variables within the data set. They correspond to the engine ID, the cycle number of the time series, three operational settings, and 21 sensor measurements. Within the training set, there are
218 different turbofan engines simulated to a point of failure, with the number of cycles to failure ranging
between 128 and 357 cycles with a mean failure point of 211 cycles. In total, there are 45,918
observations within the training data set.

APPLYING SUPPORT VECTOR DATA DESCRIPTION TO THE PROBLEM
The Support Vector Data Description algorithm was applied to the problem to help determine when the
time series is beginning to deviate from normal operating conditions. The output measurement of the
algorithm provides a scored metric that can be used to assess the degradation of the engine and help put
in place preventative measures before the failure point.
To train the model, we sampled data from a small set of engines within the beginning of the time series
that we assumed to be operating under normal conditions. As previously noted, the SVDD algorithm is
constructed using the normal operating conditions for the equipment or system. It can also handle various
states of normal operating conditions. For example, a haul truck within a mine might have very different
sensor data readings when it is traveling on a flat road with no payload and when it is traveling up a hill
with ore. However, both readings represent normal operating conditions for the piece of equipment.
With this in mind, we randomly sampled 30 of the 218 engines from the data set to be used to build the
SVDD model. Of the 30 engines that were sampled, the first 25% of each engine’s measurements were
then used to train the model, under the assumption that the data within this region represented normal operating conditions. This resulted in a training set of 1,512 observations out of the total 45,918 observations.
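As a rough illustration of this sampling logic, the pandas sketch below builds such a training set. The column names (unit, cycle) are assumed for the example and are not taken from the data set documentation.

# Sketch: sample 30 engines and keep the first 25% of each engine's cycles
# as assumed normal data. Column names 'unit' and 'cycle' are assumptions.
import pandas as pd

def build_training_set(df: pd.DataFrame, n_engines: int = 30,
                       early_fraction: float = 0.25, seed: int = 42) -> pd.DataFrame:
    sampled_units = df["unit"].drop_duplicates().sample(n=n_engines, random_state=seed)
    subset = df[df["unit"].isin(sampled_units)]

    def first_quarter(group: pd.DataFrame) -> pd.DataFrame:
        cutoff = group["cycle"].quantile(early_fraction)
        return group[group["cycle"] <= cutoff]

    return subset.groupby("unit", group_keys=False).apply(first_quarter)

# train_df = build_training_set(phm08_df)   # phm08_df: the full training data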
It should be noted that examination of the three operational setting variables indicated that there were six
different operational setting combinations within the data set. Given that the algorithm is flexible enough
to accommodate varying operating conditions, no additional indicator flags or pre-processing work was
performed on the data to model the different operating conditions.
The model was trained using the svddTrain action from the svdd action set within SAS Visual Data Mining
and Machine Learning. The ASTORE scoring code generated by the action was then saved to be used
to score new observations using SAS Event Stream Processing on a gateway device.

SCORING NEW OBSERVATIONS TO DETECT DEGRADATION


Because this example relates to a potential Internet of Things (IoT) use case, we decided to implement the scoring code within a SAS Event Stream Processing engine running on a gateway device.
This was done in order to validate implementing predictive maintenance scoring algorithms near the edge
for IoT use cases.
In the case of aircraft engines or other assets that can generate potentially large amounts of data, it is
beneficial to be able to bring the analytics to where the data resides. This can help in two ways. First, it can reduce latency in instances where an extremely quick decision is required (or help to make a decision if there is no network connection to a source that can score the new observation). Second, it can help to filter data at the edge. If a new observation is deemed to be outside
of the normal operating range, it can be sent back to a central storage system for data collection,
analysis, and future model building. Similarly, for all observations within the normal operating range, it is
possible to select a sample of them to be sent back for collection and future use. As such, the volume of
normal operating condition observations transmitted and stored can be greatly reduced.
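A simplified sketch of this edge filtering logic is shown below; the threshold and sampling rate are illustrative assumptions rather than values from the study.

# Edge-side filtering sketch: always forward potential anomalies, and only a
# small random sample of normal observations, to central storage.
import random

THRESHOLD = 0.5             # scored distance above which an observation is "abnormal"
NORMAL_SAMPLE_RATE = 0.01   # keep roughly 1% of normal observations

def should_transmit(scored_distance: float) -> bool:
    if scored_distance > THRESHOLD:
        return True                              # always send potential anomalies
    return random.random() < NORMAL_SAMPLE_RATE  # sample the normal data

scores = [0.02, 0.03, 0.71, 0.04]
print([s for s in scores if should_transmit(s)])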
A Dell Wyse 3290 was set up with Wind River Linux and SAS Event Stream Processing (ESP). An ESP
model was built to take the incoming observations, score them using the ASTORE code generated by the SAS Visual Data Mining and Machine Learning (VDMML) program, and return a scored distance metric for each observation. This metric could then be
used to monitor degradation and create a flag that could trigger an alert if above a specified threshold.
The remaining observations associated with the 188 engines that were not used for model training were
then loaded onto the gateway device and streamed into the processing engine to be scored. In an
application, data would be fed to the gateway device, scored and/or sampled, acted on or monitored if
necessary, and then sent to a central location for storage or further processing. For the purposes of the
validation, the scoring results were output to a comma-delimited file for analysis of how the model scored
the new observations.

SAMPLE RESULTS
The scoring results from the hold-out data set illustrate the degradation in the engines captured by using
the SVDD model. Four random samples were taken from the 188 scored engines with their SVDD scored
distance plotted versus the number of cycles. This is shown in Figure 1, Sample SVDD Scoring Results.
As seen in the figure, each engine shows a relatively stable normal operating state for the first portion of
its useful life, followed by an upward trend in the distance metric leading up to the failure point.
This upward trend in the data indicates that the observations are moving further and further from the
centroid of the normal hypersphere created by the SVDD model. As such, the engine operating
conditions are moving increasingly further from normal operating behavior.
With increasing distance indicating potential degradation, an alert can be set to be triggered if the scored
distance begins to rise above a pre-determined threshold or if the moving average of the scored distance
deviates a certain percentage from the initial operating conditions of the asset. This can be tailored to the
specific application that the model is used to monitor.
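The sketch below shows one way such an alert rule could be expressed, comparing a moving average of the scored distance to a baseline taken from the asset's initial operation; the window size and allowed deviation are illustrative assumptions.

# Alerting sketch: flag degradation when the moving average of the scored
# distance rises a set percentage above the asset's initial baseline.
from collections import deque

class DegradationAlert:
    def __init__(self, window_size: int = 20, max_increase: float = 0.50):
        self.window = deque(maxlen=window_size)
        self.baseline = None
        self.max_increase = max_increase     # e.g., 50% above baseline

    def update(self, scored_distance: float) -> bool:
        self.window.append(scored_distance)
        avg = sum(self.window) / len(self.window)
        if self.baseline is None:
            if len(self.window) == self.window.maxlen:
                self.baseline = avg          # first full window defines "normal"
            return False
        return avg > self.baseline * (1 + self.max_increase)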

Figure 1. Sample SVDD Scoring Results

CONCLUSION
Anomaly detection can be a useful tool to detect asset degradation and help with preventative
maintenance efforts. In this paper, we discuss how we applied a single-class classification technique
called Support Vector Data Description to monitor how turbofan engines degrade from normal operating
conditions. Given the potential use of real-time anomaly detection for Internet of Things applications, we also tested scoring the model on a gateway-type device to mimic application in the field. The results of the model on new data show visual trends indicating the degradation of the turbofan engines used in the example.

REFERENCES
National Aeronautics and Space Administration. “Turbofan Engine.” Retrieved February 28, 2016, from https://www.grc.nasa.gov/www/k-12/airplane/aturbf.html
Saxena, A., and K. Goebel. 2008. "PHM08 Challenge Data Set." NASA Ames Prognostics Data Repository (http://ti.arc.nasa.gov/project/prognostic-data-repository), NASA Ames Research Center, Moffett Field, CA. Retrieved February 24, 2017.
Chaudhuri, Arin, Deovrat Kakde, Maria Jahja, Wei Xiao, Seunghyun Kong, Hansi Jiang, and Sergiy Peredriy. 2016. “Sampling Method for Fast Training of Support Vector Data Description.” arXiv preprint arXiv:1606.05382.
Cisco. “The Internet of Things.” Retrieved February 25, 2017, from http://www.cisco.com/c/dam/en_us/solutions/trends/iot/docs/iot-aag.pdf
GE and Accenture. “Industrial Internet Insights Report for 2015.” Retrieved February 25, 2017, from https://www.ge.com/digital/sites/default/files/industrial-internet-insights-report.pdf
IDC. “Connecting the IoT: The Road to Success.” Retrieved February 25, 2017, from http://www.idc.com/infographics/IoT
Intel. “A Guide to the Internet of Things Infographic.” Retrieved February 25, 2017, from http://www.intel.com/content/www/us/en/internet-of-things/infographics/guide-to-iot.html
Telefonica. “Connected Car Industry Report 2014.” Retrieved February 25, 2017, from https://iot.telefonica.com/multimedia-resources/connected-car-industry-report-2014-english

ACKNOWLEDGMENTS
Thanks to Seunghyun Kong, Dev Kakde, Allen Langlois, and Yiqing Huang whose code contributions and
help made this paper possible. And also, thanks to Robert Moreira for suggestions and input on the ideas
in the paper.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Ryan Gillespie
SAS Institute Inc.
Ryan.Gillespie@sas.com

Saurabh Gupta
SAS Institute Inc.
Saurabh.Gupta@sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Paper SAS6431-2016
Modernizing Data Management with Event Streams
Evan Guarnaccia, Fiona McNeill, Steve Sparano, SAS Institute Inc.

ABSTRACT
Specialized access requirements to tap into event streams vary depending on the source of the events.
Open-source approaches from Spark, Kafka, Storm, and others can connect event streams to big data
lakes, like Hadoop and other common data management repositories. But a different approach is needed
to ensure that latency and throughput are not adversely affected when processing streaming data; that is, the system needs to scale. This talk distinguishes the advantages of adapters and connectors and shows how SAS®
Event Stream Processing can leverage both Hadoop and YARN technologies to scale while still meeting
the needs of streaming data analysis and large, distributed data repositories.

INTRODUCTION
Organizations are tapping into event stream data as a new source of detailed, granular data. Event streams provide new insights in real time, helping organizations improve current situational awareness, respond to current situations with extremely low latency, and improve predictive estimates for proactive intervention.
When an organization begins to use streaming data, or when it decides to enhance its real-time
capabilities and offerings, many things need to be taken into account to ensure that scaling to the high
throughput (hundreds of thousands of events per second and more) can be achieved.
An easy place to start would be the three Vs of big data: Volume, Variety, and Velocity. As the Internet of
Things continues to grow, a natural increase in the volume and variety of data follows. Sensors have
become smaller and cheaper than ever and it makes sense that businesses would want to take
advantage of this detailed data to monitor activity more closely and on shorter time scales, thereby
leading to an increase in the velocity of high volume streaming data. Customers are interacting with
businesses in ways they never have before, such as through apps, social media sites, online forums, and
more. This has led to unstructured text becoming an important and valuable source of data along with
more traditional types of data. Variety isn’t just associated with unstructured content in event streams but
also relates to the different formats of data emanating from sensors, applications, and machines –
transmitting event data at different intervals, formats, and levels of consistency.
SAS Event Stream Processing offers the flexibility, versatility, and speed to be able to tackle these issues
and adapt as the landscape of the Internet of Things (IoT) changes. Whether a single event stream or
multiple event streams are ingested for insights, scaling to this fast and big data will separate those
organizations that are successful in putting event streams to use for organizational advantage from those
that become swamped by the pollution from their overflowing data lakes.

WHAT INSIGHTS SCALE WITH EVENT STREAMS


With continuous data delivered in event streams, the ongoing evaluation and interpretation of conditions and activities become possible. Differences between what was and what is are compared. Outliers are apparent. Patterns are found. And trends, toward or away from normal conditions, are proactively assessed.
Comparisons to historic conditions and statistical measures (like the average, standard deviation, and the like) are defined relative to tolerance thresholds. This comparison can be done for unique events or for collections of events. For collections of events, live streamed data is aggregated and held in memory for comparative purposes. These aggregations form reference windows that compile and compare event summaries and their corresponding elements against new data that streams through the event stream continuous query.
The ability to retain event data, and calculate in-memory conclusions from it, can be conceptually
compared to a short, retained history of live activity – that is, events that have just occurred in a limited
window of time (of course, based on live data, versus data that’s been stored). This means that standard
data manipulation and analysis tasks that require more than a single event value can readily be
calculated in live, streaming data using aggregate windows. Given this, a host of insights and tasks can
be accomplished in streaming event data – before it is even stored to disk – promoting scalable insights
to even the most complex, big, and high throughput streaming data sources.
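As a simplified analogue of such an aggregate window, the sketch below retains a bounded set of recent event values in memory and compares each new value to the window's statistics; the retention size and tolerance are illustrative choices.

# Simplified analogue of an aggregate window: bounded in-memory retention
# plus a comparison of each new event to the window's mean and standard
# deviation (a 3-sigma tolerance is used here purely for illustration).
from collections import deque
import statistics

events = deque(maxlen=500)          # bounded retention of recent event values

def is_outlier(value: float, tolerance: float = 3.0) -> bool:
    if len(events) >= 30:           # require some history before judging
        mean = statistics.fmean(events)
        stdev = statistics.stdev(events)
        if stdev > 0 and abs(value - mean) > tolerance * stdev:
            return True             # outside tolerance: keep it out of the "normal" history
    events.append(value)
    return False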

SCALING DATA QUALITY


Treating event streams as a new source of data and simply storing it – to be examined later – pollutes
data lakes with irrelevant, incorrect, and unusable data. Initially, such pipeline processing might not be
considered much of a problem, given the low costs of commodity storage and technologies to directly
investigate data stored in Hadoop.
Open-source pipeline processing is a current data strategy used by organizations to ingest event stream data and store it in Hadoop or other big data repositories. However, with the accelerating volumes of event data generated by the IoT and other technologies, the volume will soon compound even the lowest incremental costs. It’s expected that IT will soon be called to task to reduce these growing incremental costs. A longer-term, more strategic approach is needed, one that won’t overflow and pollute data lakes, fill clouds with data waste, and cripple data centers with irrelevance.
Streaming data, even from the most regulated environments – like that of sensors – will contain data errors. Gaps in events caused by network interruptions, and erroneous measures that have no merit, are commonplace. Correcting data streams as they are ingested reduces downstream data manipulation needs and helps create data that’s ready for use.
SAS Event Stream Processing is used to filter out unwanted or unimportant event data and even correct it
while it’s still in motion, before it is stored. Determinations made on live data are invoked using data
quality routines available in SAS® Data Quality and Natural Language Processing (NLP) from SAS® Text
Analytics. These routines cleanse data, identify and extract entities, categorize unstructured data, and
help manage event records. When these routines are applied directly to streaming data, they have the
following capabilities:
• Create a normalized standard from an input that is context specific
• Correct nonstandard or duplicate records as well as identify unknown data types
• Separate values through natural language parsing
• Identify and resolve entities using logic types (phone number, name, address, location, and so on)
• Extract common entities from unstructured text
• Validate data against standard measures and customized business rules
• Categorize unstructured text
• Determine case and gender
• Generate unique match codes
• Identify sentiment, and more.
Applying such data quality and text analytics routines within the data stream, before it is stored, fills in the
gaps of missing data, corrects event data errors, and standardizes inputs and formats. The routines also
help to determine if events are meaningful and are of further use in analytical investigation. If they are not
of further use, no further processing is required. Unwanted data is filtered out, and the process saves the
incremental cost of storing that data. You don’t store what you’ll never need, and you can substantially
decrease further network transmission of event data as well as downstream processing costs.
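As a rough illustration of the kind of in-stream cleansing and filtering described above, the following sketch standardizes, validates, and filters a single event. The field names and rules are invented for the example and do not represent the SAS Data Quality or Text Analytics routines themselves.

# Illustration only: standardize, validate, extract, and filter an event
# before it is stored. Field names and business rules are made up.
import re
from typing import Optional

def cleanse(event: dict) -> Optional[dict]:
    """Return a cleansed event, or None to filter it out of the stream."""
    # create a normalized standard from a context-specific input
    event["device_id"] = event.get("device_id", "").strip().upper()
    # validate against a customized business rule (plausible sensor range)
    temp = event.get("temperature_c")
    if temp is None or not (-60.0 <= temp <= 200.0):
        return None                           # unusable reading: do not store
    # extract a simple entity from unstructured text
    match = re.search(r"ORD-\d{6}", event.get("comment", ""))
    event["order_id"] = match.group(0) if match else None
    return event

print(cleanse({"device_id": " pump-07 ", "temperature_c": 81.2,
               "comment": "checked after ORD-123456"}))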

SCALING ANALYTICS

One goal of analytical event stream processing is to have current situational awareness of existing conditions – for example, to ask: Are current events outside of normal operating parameters? As such, continuous queries often focus on changes in events, those that deviate from normal conditions. If all is normal, then no further action is required. However, if events aren’t normal, then real-time alerts, notifications, and actions are issued to further investigate or react to such abnormal activity. Within SAS Event Stream Processing, you can detail the alert conditions, message, and recipient information directly in the Studio interface, as illustrated in Figure 1.

Figure 1 - SAS Event Stream Processing Studio

Figure 1 shows alerts included in stream processing (right), alert channel and recipient details (upper), and real-time condition details (lower).

The appropriate conditions for when a pre-defined tolerance threshold is crossed are defined by the rules in the system. For example, in statistical process control, the Western Electric Rules1 stipulate decisions as to whether a process should be investigated for issues in a manufacturing or other controlled setting. These rules signify when an event, or a calculation based on an event, is relevant. This relevance can be highlighted in dashboards to trigger operational processes or alerts that are sent to other applications listening for these events of interest. Any combination of rules and analytical algorithms is possible with SAS Event Stream Processing, so you can devise and adjust your scenario definitions and tolerance threshold levels, defined as rules, directly in the Studio interface.
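For illustration, the sketch below checks two of the Western Electric rules against a window of recent values, assuming the process mean and standard deviation are known; this is not how the rules would be expressed inside SAS Event Stream Processing.

# Illustrative check of two Western Electric rules on recent measurements,
# given a known process mean and standard deviation.
from typing import List

def western_electric_flags(values: List[float], mean: float, sigma: float) -> List[str]:
    flags = []
    z = [(v - mean) / sigma for v in values]
    # Rule 1: the most recent point falls beyond 3 sigma
    if abs(z[-1]) > 3:
        flags.append("rule 1: point beyond 3 sigma")
    # Rule 4: eight consecutive points on the same side of the center line
    recent = z[-8:]
    if len(recent) == 8 and (all(v > 0 for v in recent) or all(v < 0 for v in recent)):
        flags.append("rule 4: eight consecutive points on one side")
    return flags

print(western_electric_flags([0.2, 0.5, 0.1, 3.4], mean=0.0, sigma=1.0))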
SAS Event Stream Processing also allows historic data (called into memory) to be assessed in tandem
with live event stream processing for evaluations based on summary conclusions that have been made
from existing knowledge bases, like a customer segmentation score. Such lookups, based on data that has been stored in offline repositories (also known as “data at rest”), open another suite of conditional assessments that can be made from streaming event data, like last action taken, pre-existing risk
assignment, or likelihood of acceptance.

SCALE TO ALL PHASES OF ANALYTICAL NEED


Event stream processing is also used to get an accurate projection of the future likelihood of events, to
answer questions such as “Will this machine part fail?” In doing so you can better plan, schedule, and
proactively adjust actions based on estimates of future events.
One differentiating feature of SAS Event Stream Processing is the ability to use the power of SAS
advanced analytics in stream. This is done using a procedural window, in which users can write their own
input handlers. Currently, input handlers can be written in C++, SAS DATA step, or SAS DS2. When SAS
DATA step is used, calls to Base SAS are made each time an event is processed. When SAS DS2 or C++ is used, no external calls are made, which expedites processing.
One method to scale analytics to streaming events is to create the predictive algorithm outside of the stream, built on a rich history from existing repositories (including historic events), and then to score events with that algorithm as part of a continuous query (via the procedural window). Many analytical questions require rich history in order to identify the best model for prediction.
Some data questions, however, don’t require extensive history to define an algorithm. This is the case
when models learn as new data arrives, and when learning does not require historical data. For these
types of data questions, the equation or algorithm can be defined in the event stream. SAS has
introduced K-Means clustering as the first iteration of advanced analytics learning in the data stream
(again, with no reliance on previously training the model using stored data). As illustrated in Figure 2,
sourced live events are first processed by a training stage, where events are examined by the algorithms,
and then events are scored using that algorithm, with no calls out to any dependent processing. As
stream events change, the model updates to new conditions, learning from the data. The streaming K-Means algorithm defines clusters of events, grouping events into homogeneous categories. Other such machine learning algorithms are planned for future releases of SAS Event Stream Processing.
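As an analogue of this kind of in-stream learning, the sketch below updates a clustering model incrementally as batches of events arrive and immediately assigns the same events to clusters. It uses scikit-learn's MiniBatchKMeans for illustration, not the SAS streaming K-Means implementation.

# Analogue of learning in the stream: incrementally update k-means on each
# arriving batch of events, then score (assign) those events right away.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, random_state=0)
rng = np.random.default_rng(1)

for _ in range(100):                     # simulate 100 arriving event blocks
    batch = rng.normal(size=(64, 4))     # 64 events, 4 numeric fields each
    model.partial_fit(batch)             # learn from the new events
    labels = model.predict(batch)        # score the same events immediately

print(model.cluster_centers_)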

1 More on Western Electric Rules here: https://en.wikipedia.org/wiki/Western_Electric_rules

Figure 2 - Learning Algorithms that Score Streaming Events in SAS Event Stream Processing Studio

As an embeddable engine, SAS Event Stream Processing can also be pushed to extremely small
compute environments – out to the edge of the IoT, like compute sticks and gateways. At the edges of the
IoT, and even at individual sensors, event stream data is limited by the transmitting or gateway device; as such, some analytical processing will make sense, while algorithms that require a variety of inputs won’t. As you move away from the edge, more data is available for more complex decision making. And
out-of-stream analysis, based on stored data, like Hadoop, will have extensive history upon which
investigation and analytic development can happen. SAS Event Stream Processing can be used
throughout, pushing data cleansing, rules and machine learning algorithms to the edge, learning models
in-stream and including score code from algorithms developed from data repositories, scaling to all
phases of analytic need.

HOW SCALING WORKS


Latency and throughput are major concerns that need to be addressed when a business decides to
implement real-time streaming data solutions. Latency is the amount of time it takes for an event to be
processed from start to finish. Throughput is measured by how many events can be ingested per second.
Scaling to reduce latency and to ingest increasing throughput is a balance that needs to be based on the
business requirements for the application at hand.
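As a simple illustration of the two measures, the sketch below times a stand-in processing function and reports average per-event latency and overall throughput.

# Latency: seconds from ingest to completion for one event.
# Throughput: events processed per second.
import time

def process(event):                      # stand-in for the streaming logic
    return event * 2

latencies = []
start = time.perf_counter()
for event in range(100_000):
    t0 = time.perf_counter()
    process(event)
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

print(f"throughput:  {len(latencies) / elapsed:,.0f} events/sec")
print(f"avg latency: {sum(latencies) / len(latencies) * 1e6:.1f} microseconds")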

SCALING FOR LOW LATENCY


While some sources of latency are common to all deployments of SAS Event Stream Processing, there
are many different ways to architect a real-time solution, and with each comes different sources of
latency. Common sources of latency include the following:
• processing wide records
• complex processing steps
• making calls out to external systems
• using stateful models.

Wide Data
It is very common for streaming data sets to be wide, with many fields per record, especially when multiple event streams are all ingested into the same processing flow. Since the time it takes to process an event scales linearly with the number of fields, it makes sense to eliminate, as early as possible, fields that serve no useful purpose. This can easily be done in a compute window, where the user can specify which fields from the previous window will be used going forward. In addition to reducing the number of fields to be processed, compute windows can also be used to change data types and key fields. Ideally, a compute window is defined directly after the source window (which ingests the event stream data), so that minimal time is spent processing unneeded event fields.
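A minimal analogue of this early projection is sketched below; the field names are hypothetical.

# Keep only the fields needed downstream as soon as events are ingested,
# so later steps process narrow records. Field names are hypothetical.
KEEP = ("device_id", "timestamp", "temperature_c", "vibration")

def project(event: dict) -> dict:
    return {field: event[field] for field in KEEP if field in event}

wide_event = {"device_id": "pump-07", "timestamp": 1700000000,
              "temperature_c": 81.2, "vibration": 0.03,
              "firmware": "1.4.2", "operator_note": "n/a"}   # unused fields
print(project(wide_event))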

Reduce Complexity
SAS Event Stream Processing is able to perform complex operations on data, but sometimes this is at
the cost of latency. String parsing is an example of an operation that can increase latency. Such
processing can be done using the SAS® Expression Language or using user-defined functions written in
C. The SAS Expression Language is simpler to use than a custom function, but custom functions typically run faster.

Managing State for Lower Latency


As an in-memory technology, SAS Event Stream Processing is dependent on available memory for
processing speed. Each window in a model can be assigned an index type, making it a stateless or a stateful window. A stateless window ingests an event, processes it, and passes it on without retaining the event data in memory. To ensure low latency, models should be kept stateless whenever possible.
Sometimes, however, state needs to be retained, such as when streaming data is joined to a lookup
table, when aggregate statistics are being calculated, or when pattern windows accumulate partially
matched patterns.
Very often, streaming data can be enriched by joining it to historical records, customer data, or product
data. At times, these lookup tables can be very large - to the point where it is undesirable or even
impossible to keep the entire table in memory. When this happens, the HLEVELDB index should be used
for the local index of the dimension side of a join. The HLEVELDB index type stores the lookup data on
disk and provides an optimized way to retrieve such records, effectively offsetting the latency that typically
comes with using data that isn’t stored in memory.
Another common scenario is to retain statistical information about the streaming data for additional
processing, using an aggregate window. This raises the issue of event retention. To ensure that memory
consumption is bounded, a retention policy must be implemented in which the user defines the number of
retained records or for how long records are retained. This is accomplished by preceding an aggregate
window with a copy window. The pattern window also needs to retain some state associated with events – best understood by describing how the pattern window works.

Pattern Compression
The user defines a pattern of interest, which will most likely consist of multiple events. When an event
arrives at the pattern window, the pattern window holds that event while it waits for other events that
comprise the pattern of interest. As the number of these partially matched patterns increases, memory
usage can grow quite large. The impact of this can be offset by enabling the pattern compression feature.
By compressing unfinished patterns, memory usage can be reduced by up to 40% with the cost of a slight
increase in CPU usage.

SCALING FOR HIGH THROUGHPUT


For an event stream processing solution to scale to increasing volumes of data, it must be designed in
such a way as to handle high throughput. There are many factors to consider because of the varied and
diverse use cases.

Managing Thread Pools


Each SAS Event Stream Processing engine contains one or more projects, each of which has a defined
thread pool size. This enables the project to use multiple processor cores, which allows for more efficient
parallel processing. Often, streaming data will have to travel over a network to reach a SAS Event Stream
Processing server. In that case, the throughput is limited by the speed of the network connection. A
typical 1GB/sec interface should be able to process about 600 MB/sec of event data. To achieve a higher
throughput, projects can be spread across network interfaces. One option is to connect SAS Event
Stream Processing projects in a star schema, where many projects are taking in data from the edge,
aggregating it down to desired elements and performing any preprocessing - all connected to a central
continuous query, which ingests the prepared data and performs the desired operations. It should be
noted that retaining state in source windows can affect throughput.
Events are grouped into event blocks consisting of zero or more events when they are first ingested using
a source window. Using larger event blocks helps increase throughput rates during publish and subscribe
actions. Event blocks can, at times, only contain one event, such as when aggregate statistics are being
joined with incoming events. If an event block contains an insert and multiple updates, they would be
collapsed to a single insert containing the most recent values. In this case, the aggregate statistics would
not accurately reflect the stream of incoming events.

SAS ESP and Hadoop Technologies


As the adoption of Hadoop continues to grow, it’s important for technologies like SAS Event Stream
Processing to deeply integrate with Hadoop – to make the best use of its features. It might be desirable to
architect a streaming solution with aspects of the event stream processing model residing on different
machines. In some instances, one might want to separate data preparation from pattern matching
because of memory constraints. The same holds true for parts of the model that join streaming data to
large tables of dimension data. In order to reduce latency and increase throughput, SAS Event Stream
Processing models need to be designed in such a way as to take full advantage of a distributed
environment.
Hadoop uses a resource management platform known as YARN to allocate resources to applications.
YARN uses containers, which represent a collection of physical resources, and SAS Event Stream
Processing servers can be run in these containers. With SAS, one can specify the memory and number
of cores to be used in each container. This means that more memory and parallel processing power can
be allocated to parts of the overall event processing model that are more computationally intensive.
Assuming that network connectivity is present, outside users can also connect to ports opened by the
SAS Event Stream Processing XML servers in the Hadoop cluster. From the perspective of the outside
user, then, the functional behavior will be the same as if SAS Event Stream Processing was being run
stand-alone.

SAS EVENT STREAM PROCESSING ADAPTER/CONNECTORS


SAS Event Stream Processing provides Hadoop adapters for integration with large data sources to both
read and write data (events) to these distributed data targets. But simply storing and writing efficiently to
these systems is only part of the story to deliver an adaptive and scalable system. SAS Event Stream
Processing provides necessary integration with the YARN resource manager on Hadoop to leverage the
resource management capabilities for higher throughput by leveraging the distributed processing
framework in Hadoop.
YARN, which was designed as a generic resource negotiator platform for a distributed system like SAS
Event Stream Processing, uses the same underlying daemons and APIs that are within the common
ecosystem of other Hadoop applications. SAS Event Stream Processing nodes run directly within the fully
managed YARN environment thereby leveraging the power of the YARN resource manager, as illustrated
in Figure 3. To leverage YARN’s resource management power and deliver maximum throughput, SAS
Event Stream Processing provides a YARN plug-in to communicate with the YARN environment and
submit the necessary requests for cores and memory needed for performance.

Figure 3 – SAS Event Stream Processing Integration with YARN

As shown in Figure 3, the ESP application has been started and is running in three YARN containers, managed by the YARN Node Manager. This allows YARN to manage the ESP servers running on the various nodes, controlling startup and shutdown for a seamless and scalable processing environment, and the number of requested nodes can be increased when additional processing resources are needed.
The YARN plug-in supports commands using dfesp_yarn_joblauncher to request and launch YARN cores and memory, as shown in Figure 4.

Figure 4 - Launch SAS Event Stream Processing on YARN

Using the SAS Event Stream Processing Application Master interface, shown in Figure 5, the SAS Event
Stream Processing XML factory server, “qsthdpc03”, is shown running in the YARN-managed Hadoop
environment and will be using the noted http-admin, pubsub, and http-pubsub ports, as well as the
requested virtual cores and memory.

Figure 5 - SAS Event Stream Processing Application Master Screen

By using the http-admin port defined for the running ESP server, “qsthdpc03”, commands are used to
load ESP project “test_pubsub_index” through dfesp_xml_client into the running ESP XML factory server,
depicted in Figure 6.

Figure 6 - SAS Event Stream Processing Project Load Example

To deliver additional processing for higher throughput, a second SAS Event Stream Processing factory
server is started. This environment can be discovered and managed using the consul service to monitor
its performance characteristics. As shown in Figure 7, a second server, “qsthdpc02” is running and
available as an additional resource within YARN for event processing.

Figure 7 - SAS Event Stream Processing Application Master running an Additional Server for Higher
Throughput

The consul service view provides the health check information about the publish, subscribe, and HTTP
Admin interfaces for “qsthdpc02” as well as other SAS Event Stream Processing servers running and
managed by the YARN resource manager on Hadoop, as shown in Figure 8.

Figure 8 - SAS Event Stream Processing Status within Consul Interface

ENTERPRISE CONSIDERATIONS
With the approaches outlined above, we can see how SAS Event Stream Processing supports scaling to
meet the processing demands of the enterprise for latency as well as throughput without sacrificing either.
These important performance considerations are not the only factors that need to be considered for
scalability. In any technology, there are other overriding factors that are needed to deliver reliability, productivity, flexibility, governance, and security.
A complete ecosystem for managing and governing the code for a successful streaming analytics
environment is needed for enterprise applications. Many open-source tools provide components for what
is needed to deliver aspects of streaming performance, but tend to lack the capabilities needed for a complete business solution. A complete business solution includes scalability, reliability, and governance
– aspects that ensure a solution supports the enterprise’s needs both today and tomorrow, scaling to new
problems and data volumes.

RELIABILITY
Today’s IT infrastructures require that event streams are processed in a reliable manner and are
protected against any loss of data or any reduction in IT’s service level agreements.
SAS Event Stream Processing provides a robust and fault-tolerant architecture to ensure minimal data
loss and exceptionally reliable processing to maximize up time. SAS does this by delivering reliability
using proven technologies - like message buses and solution orchestration frameworks – that ensure
message deliverability while eliminating performance hits on the SAS Event Stream Processing engine.
This translates to a solution that is reliable and which supports failover. One definition of failover is:

“switching to a redundant or standby computer server, system, hardware component or network upon the failure or abnormal termination of the previously active application, server, system, hardware component, or network.”
Reference: https://en.wikipedia.org/wiki/Failover

Failover architectures are essential to any system that demands minimal data loss. SAS Event Stream
Processing has a patented approach for a 1+N Way Failover architecture, as illustrated in Figure 9.

Figure 9 - SAS Event Stream Processing 1+N-Way Failover

As shown in Figure 9, the failover architecture allows the SAS Event Stream Processing engine,
subscribers, and publishers to be oblivious to failover and any occurrences thereof. The product pushes
the failover responsibilities and knowledge to the (prebuilt) publish and subscribe APIs and clients. The
APIs and clients are, in turn, complemented by third-party message bus technology (from Solace
Systems, Tervela, or RabbitMQ). This architecture has the benefit of flexibility, as it allows failover to be
introduced without requiring the publishers and subscribers to be changed or recompiled.
In this approach, “N” (in 1+N-Way Failover) refers to the ability to support more than one active failover at
a time, and as such, you can be assured of getting as close to zero downtime as can be afforded. All
event streams are published to both the active and stand-by SAS Event Stream Processing engines as
part of standard processing. Only the active SAS Event Stream Processing engine forwards subscribed
event streams to the message bus (for subscribers). If the message bus detects a dropped active
connection or missed “I’m alive” signal, the message bus then appoints a stand-by to be active with the
event block IDs to begin for the subscribers. This new active engine keeps a running queue for
subscribed event streams. The queue is then used to start forwarding events to the subscribers.
SAS has another patent pending for Guaranteed Delivery. This uses a publisher callback approach: the callback is informed when event blocks are received by one or more identified subscribers within a configured time window. The same callback function is notified if this does not occur, so the publisher determines how to handle
that situation. This is done asynchronously and without persistence so as not to impact performance. As
a result, failover is instantaneous and automatic with no loss, nor replay of events, no performance
degradation, and a reliable environment that meets IT’s needs.

PRODUCTIVITY
Developing models to ingest, analyze, and emit events can be a complicated task when you consider the
various window types, the sophisticated analysis, testing the model, and reporting results. All of this is
required to ensure that the model produces the desired actions. SAS Event Stream Processing provides
tools to assist with the development of models for processing event streams.

Model Design
SAS Event Stream Processing Studio is a design-time environment that supports the development of
engines, projects, and continuous queries. These components form a hierarchy for the model building
environment, as illustrated in Figure 10. The studio is one of the three ways to build a model in SAS
Event Stream Processing.
SAS ESP has three modeling approaches that are 100% functionally equivalent, providing developers the flexibility they need to develop, test, and implement streaming models. These approaches include:
• C++: a C++ library that can be used to build and execute ESP engines.
• XML: XML syntax to define ESP Engines or Projects via an XML editor.
• Graphical: SAS Event Stream Processing Studio is a browser-based development environment using a drag-and-drop interface to define ESP models, either engines or projects.

Figure 10 - SAS Event Stream Processing Model Hierarchy

Using the hierarchy depicted in Figure 10 - SAS Event Stream Processing Model Hierarchy, sophisticated event
stream processing models are defined. The top level represents the engine, and within an engine, one or
more projects can be created allowing for flexibility in how the events are processed. Different projects
can be coordinated to allow events to be delivered from one project to another project for processing.
Finally, within a project, the SAS Event Stream Processing Studio interface supports continuous queries,
where events are processed using the window types available from a menu (window types are shown in
Figure 11).
Typically, SAS Event Stream Processing Studio is used to quickly build streaming models that include the
flow of data from source windows through to the processing windows for pattern matching, filtering,
aggregations, and analytics. The drag-and-drop interface supports rapid event stream model
development and doesn’t require any XML or C++ coding to deploy these models. Of particular note, the
Procedural window is how SAS analytical models are introduced into the event streaming data flow.
Once the design is complete, the user can test the models from within the interface.

Figure 11 - SAS Event Stream Processing Window Types

Testing and Refinement


SAS Event Stream Processing Studio provides an interactive test mode, which can be used to load event
streams into the continuous query, and publish results from select windows to the screen once the event
streams are processed by the model. This functions as an easy-to-use diagnostic tool that aids the model builder during the development and testing phases.
SAS Event Stream Processing Streamviewer provides additional insights into the streaming model
behavior with overlays of intuitive graphics that visually depict trends in the rapidly moving data. This is
essential to understand model performance in high throughput data. As illustrated in Figure 12, the
interface provides a view into the live results of each window by simply subscribing to any window of
interest – so that testing can identify if the model matches expected results.

Figure 12 - SAS Event Stream Processing Streamviewer

SAS Event Stream Processing Streamviewer allows for rapid model iterations by visualizing the trends in
the streaming data, and eliminates the need to build a custom visualization tool for testing.

FLEXIBILITY
Given the flexibility and power of the engine, continuous queries, and streaming window types, streaming
models can be complex designs that introduce branching, left and right joins, schema copies, pattern
matching, and analytics. All of these moving parts can be difficult to orchestrate, and as with any complex
design, can be difficult to communicate to other teams in the organization. This can introduce delays and
risk as teams struggle with describing stream processing designs to other groups in a way that ensures
all parties understand the solution as well as their specialized involvement. Additionally, when skilled
resources are scarce, and design logic expertise is in short supply, a visual representation of a model
achieves a common, clear, and effective means to communicate a complex model design to others. The visual SAS Event Stream Processing Studio is often valued by teams, providing an easily consumable format for design specification, thus reducing such risks.

The Power of a Visual Interface


As shown in Figure 13, SAS Event Stream Processing Studio is a graphical event stream model design-
time environment, making it easier to share designs between stakeholders, using the export and import
feature. This allows models to be shared across design environments. This graphical design-time
environment also allows new staff to quickly understand the streaming model definition, reducing the risk
associated with only a few staff possessing the knowledge of how a particular event stream processing
flow operates.

Figure 13 - SAS Event Stream Processing Studio Graphical Editor

Evolving Stream Data


Adapters
Data types and streaming data sources are constantly evolving, so it is important to be able to adapt quickly, including the ability to add new streaming data sources and types. The SAS Event Stream Processing Publish and Subscribe API allows new adapters to be built using the same API used to develop the prebuilt adapters; these APIs include both Java and C APIs. A wide array of adapters is supplied out of the box (as listed in Table 1), and these can be further configured, or custom adapters can be built using the same APIs. Adapters can also be networked to allow for coordination between different input and output streams of data.

API Language   Adapter
C++            Database
               Event Stream Processor
               File and Socket
               IBM WebSphere MQ
               PI
               Rabbit MQ
               SMTP Subscriber
               Sniffer Publisher
               Solace Systems
               Teradata Subscriber
               Tervela Data Fabric
Java           HDAT Reader
               HDFS (Hadoop Distributed File System)
               Java Message Service (JMS)
               SAS LASR Analytic Server
               REST Subscriber
               SAS Data Set
               Twitter Publisher

Table 1 - SAS Event Stream Processing Studio Adapters (C++ and Java)

Connectors
Similarly, SAS Event Stream Processing connectors can also be created and integrated using the Java
and C++ APIs. Connectors are “in process”, meaning that they are built into the model during design time.
In contrast, adapters can be started or stopped at any time, even remotely. Connectors use the SAS
Event Stream Processing publish/subscribe API to do one of the following:
• publish event streams into source windows. Publish operations do the following, usually continuously:
  o read event data from a specified source
  o inject that event data into a specific source window of a running event stream processor
• subscribe to window event streams. Subscribe operations write output events from a window of a running event stream processing engine to the specified target (usually continuously).
The SAS Event Stream Processing Connectors include:
• Database
• Project Publish (Inter-ESP Project)
• File and Socket
• IBM WebSphere MQ
• PI
• Rabbit MQ
• SMTP Subscribe
• Sniffer Publish
• Solace Systems
• Teradata Connector
• Tervela Data Fabric
• TIBCO Rendezvous (RV)
Each of these connectors supports various specific formats.
Taken as a whole, this collection of streaming data connectors and adapters offers a robust set of pre-built routines to ingest streaming data and deliver outputs. The connectors also offer an extensible framework for using new event stream sources.
Given that each data management approach is different, SAS Event Stream Processing supports large data stores subscribing to fast-moving streams. The data can be landed in distributed file systems like IBM BigInsights, MapR, Cloudera, and Hortonworks.

GOVERNANCE
Deployment of event streaming models to various targets requires version control, configuring publishing
targets, and updating models dynamically to minimize interruption of service to both event publishers and
subscribers.

Figure 14 - SAS Event Stream Processing Studio Change Management

SAS Event Stream Processing provides support for managing versions and publishing changes to models,
as illustrated in Figure 14. Changes can be scripted using plan files that not only describe the changes
to the deployed streaming models, but also coordinate loading the updated model into a running SAS
Event Stream Processing engine, orchestrate the adapters that inject events into the updated model, and
validate that the model is syntactically correct.
This support is enabled by XML plan files (see Figure 15) that manage these changes at publish time.
Plan files can be reused, which streamlines operations and governance while retaining the flexibility to
manage multiple models. This automation provides repeatability and consistency across operational
scenarios.

Figure 15 - SAS Event Stream Processing Example Plan File

In conjunction with plan files for updating and publishing new models, developers can use the SAS Event
Stream Processing engine's dynamic service change feature to change models on the fly without taking
the SAS Event Stream Processing server down, ensuring constant up-time for always-on streaming
applications. Specifically, users can add, remove, or change windows as part of these dynamic updates.
Changing models on a running XML factory server without bringing down the project or significantly
degrading service processing improves business agility.
SAS Event Stream Processing manages such dynamic changes without losing existing state, where
possible, and can propagate retained events from parent windows into newly added windows. If the new
streaming model design changes a given window, then most likely the state is no longer meaningful and
will be dropped.
The implication is that new analytic score code can be pushed into deployed streaming models as the
need arises, so that analytics are refreshed on demand while remaining governed and controlled.

SECURITY
Data streams can include sensitive data, which must be secured both in flight and during processing.
This requires securing the publishers (delivering data to SAS Event Stream Processing), the subscribers
(to the processed events), and the event data held in memory during processing. SAS protects in-memory
event data from unauthorized access.
The prior release of SAS Event Stream Processing, version 3.1, provided encryption of data streams both
to and from the SAS Event Stream Processing engine (both publish and subscribe) using OpenSSL when
communicating between client and server. The OpenSSL option is available when using the SAS® Event
Stream Processing System Encryption and Authentication Overlay provided with the product.
SAS Event Stream Processing 3.2 introduced optional authentication between clients and servers to
provide more secure access to the product's network interfaces, such as the XML server API and the
Java/C publish/subscribe APIs and adapters. This support was also extended to SAS Event Stream
Processing Streamviewer.

CONCLUSION
The best use of Hadoop and other big data lakes for streaming data is achieved when a strategic
approach is adopted, one that doesn't pollute them with dirty or irrelevant noise. Direct integration with
YARN helps scale for higher throughput by using the distributed processing framework of Hadoop.
Scaling to examine streaming data once it is landed in a big data repository is, however, only one
consideration when scaling for big and fast data.
There is a balance to be struck between what is best done as part of event stream processing before
event data is stored, and what is more appropriately done once it is landed in Hadoop. Many advanced
analytical models require a rich history to appropriately model the desired behavior. Hadoop, as a popular
big data environment, is ideal for in-depth SAS analysis and is often the appropriate place to build and
define the SAS DS2 score code to be embedded in SAS Event Stream Processing.
SAS Event Stream Processing can ingest, cleanse, analyze, aggregate, and filter data while it’s still in
motion – helping channel only relevant big data to such ‘data at rest’ repositories for in-depth diagnostics
and investigation. Event streams can be assessed as they are sourced, filtering out irrelevant noise,
saving network transport loads, and focusing downstream efforts on what’s relevant.
Configuring a solution that addresses high throughput with low-latency response times, successfully
ingests data streams, and provides the answers the business needs depends on both the infrastructure
environment and the event stream processing model itself. The SAS Event Stream Processing integration
with YARN provides a dynamic linkage of these two technologies. It extends the service management of
YARN in Cloudera, Hortonworks, and other Hadoop environments, while SAS reduces the streaming data,
generates immediate insights, and balances resources for an optimized business solution.
For enterprise adoption, scaling considerations that go beyond in-memory environments, analytics,
throughput, and latency are also core to successful event stream processing deployments. In an
ever-changing business climate, event stream processing applications also need to be reliable,
productive, flexible, governed, and secure. SAS Event Stream Processing provides the agility needed to
scale, both for organizations extending their existing SAS knowledge to the new sources of insight that
event streams provide and for those already tackling the IoT frontier.

ACKNOWLEDGMENTS
The authors would like to acknowledge SAS colleagues Jerry Baulier, Scott Kolodzieski, Fred
Combaneyre, Vince Deters, Yiqing Huang, and Yin Dong for their guidance, direction, and support of
this paper.
The authors would also like to acknowledge that Figure 3 of this paper was jointly crafted in partnership
with Hortonworks – initially defined to illustrate SAS Event Stream Processing YARN integration with the
Hortonworks Data Platform.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Evan Guarnaccia
SAS Inc.
Evan.Guarnaccia@sas.com

Fiona McNeill
SAS Inc.
Fiona.McNeill@sas.com

Steve Sparano
SAS Inc.
Steve.Sparano@sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Paper SAS395-2017
Location Analytics: Minority Report Is Here—Real-Time Geofencing Using
SAS® Event Stream Processing
Frederic Combaneyre, SAS Institute Inc.

ABSTRACT
Geofencing is one of the most promising and exciting concepts that has developed with the advent of the
Internet of Things. Like John Anderton in the 2002 movie “Minority Report,” you can now enter a mall and
immediately receive commercial ads and offers based on your personal taste and past purchases.
Authorities can track vessels’ positions and detect when a ship is not in the area it should be, or they can
forecast and optimize harbor arrivals. When a truck driver breaks from the route, the dispatcher can be
alerted and can act immediately. And there are countless examples from manufacturing, industry,
security, or even households. All of these applications are based on the core concept of geofencing,
which consists of detecting whether a device’s position is within a defined geographical boundary.
Geofencing requires real-time processing in order to react appropriately. In this session, we explain how
to implement real-time geofencing on streaming data with SAS® Event Stream Processing and achieve
high-performance processing, in terms of millions of events per second, over hundreds of millions of
geofences.

INTRODUCTION
One of the most important underlying concepts behind all location-based applications is called
geofencing. Geofencing is a feature of an application that defines geographical boundaries. A geofence
is a virtual barrier. So, when a device enters (or exits) the defined boundaries, an action is immediately
triggered based on specific business needs.
One of the early commercial uses of geofencing was in the livestock industry, where a handful of cattle in
a herd would be equipped with GPS units and if the herd moved outside of geographic boundaries set by
the rancher, the rancher would receive an alert.
What applies to the flow of cattle can also be applied to:
• Fleet management: When a truck driver breaks from his route, the dispatcher can be alerted and
act immediately.
• Customs transport: Authorities can track vessels' positions and detect when a ship is not in the
area it should be, or forecast and optimize harbor arrivals.
• Public areas, like airports or train stations: The flow and density of people can be detected in real
time in order to remove bottlenecks and optimize queuing times, adapt path guidance, organize
staffing, or optimize flow path and procedures.
• Galleries and museums: Administrators can quantify the popularity of exhibits, identify under-
used spaces, and use visitor behavior to optimize future events.
• Shopping centers: Geofencing can show in real time how many people pass in front of a certain
store, shelf, information point, or door, and where these people are coming from. How many
people are watching a certain TV ad on a billboard? Where is the best place to position a
promotion based on foot traffic? Hence, geofencing allows optimizing store workflow (goods
supply, cart management…) and layout.
And there are countless examples from manufacturing, industry, security, or even households—like an
ankle bracelet alerting authorities if an individual under house arrest leaves the premises, or automatically
switching lights off when the whole family leaves the house.
An important paradigm that is inherent to all those applications is the immediacy of action.

In order to react appropriately, the position information has to be processed immediately, with low latency,
regardless of the volume of events to analyze. Taking too long to react is not an option in such cases,
as the subject or device will already have moved to another location; a delayed reaction is useless.
Of course, a timely reaction is just one part of the game. We also need to react appropriately. Deciding
the best action to apply often means being able to detect specific complex event patterns out of the
masses of events and to apply high-end analytics or machine learning algorithms to real-time data
streams. This is where SAS® software like SAS® Event Stream Processing comes into play, providing
high-performance, low-latency geofencing analysis and high-end streaming analytics, as well as real-time
predictive and optimization operations.
Released in early 2017, SAS Event Stream Processing 4.3 introduces a new Geofence window that
provides real-time geofencing analysis capabilities on streaming events. This Geofence window was
already available as a custom plug-in for SAS Event Stream Processing 4.1 and 4.2 and is now fully
integrated as a standard SAS Event Stream Processing window.

SAS EVENT STREAM PROCESSING GEOFENCE WINDOW


The SAS Event Stream Processing Geofence window determines whether a position coming from an
event stream is inside a defined area of interest or close to a defined location of interest, and it augments
the event with the details of that area or location.
The Geofence window behaves functionally like a lookup or outer join window, where the geofence
areas and locations are on the dimension side and the event position is on the streaming side. Hence, like a
Join window, the Geofence window requires 2 input windows: one for injecting the geofence areas or
locations, called geometries, and the other for the streaming events, called positions.
Figure 1 shows a sample Event Stream Processing streaming model implementing the Geofence window.

Figure 1. Event Stream Processing Model Using the Geofence Window

In an event stream processing XML model, the first window connected to the Geofence window is the
position window. When using the SAS Event Stream Processing Studio, each window’s role is defined in
the property panel.
Areas and locations of interest are defined as geometry shapes. The Geofence window supports 2 types
of geometries: polygons and circles.

Geometries are published as events, one event per geometry. The Geofence window supports insert,
update, and delete opcodes, allowing dynamic update of the geometries.
The Geofence window is designed to support any coordinate type or space, either Cartesian or
geographic. The only requirement is that all coordinates must be consistent and refer to the same space
or projection. For geographic coordinates, the coordinates must be specified in the (X,Y) Cartesian order
(longitude, latitude). All distances are defined and calculated in meters.
Let's now cover how the Geofence window implements the two types of geometries: polygons and circles.

POLYGON GEOMETRIES
A polygon is a plane shape representing an area of interest. The Geofence window supports polygons,
multi-polygons, and polygons with holes or multiple rings.
Figure 2 below shows some sample polygon geofences.

Figure 2. Sample Polygon Geofences

A polygon is defined as a list of position coordinates representing the polygon's rings. A ring is a closed
list of position coordinates: to be considered closed, the last point of the ring must be the same as the
first one. So, for example, a ring that geometrically defines a square with 4 points must declare 5 position
coordinates, with the last point repeating the first.
The input polygon window schema must have at least the following 2 mandatory fields:
• A single key field of type int64 or string. This field defines the ID of the geometry.

• A data field of type string. This field contains the list of the rings’ position coordinates. The
coordinates are defined as a list of numbers (double) separated by spaces in the X, Y order.
For polygons with multiple rings, the first ring defined must be the exterior ring and any others
must be interior rings or holes. For example, the following string represents a polygon made
of 4 points that includes a hole made of 7 points:
"5.281 9.455 3.607 7.112 6.268 6.181 8.414 7.705 5.281 9.455 5.671 8.316
6.572 8.033 7.087 7.695 6.444 7.469 5.929 7.215 5.285 7.384 5.199 7.949 5.671
8.316"
If the polygon data is provided in the standard GeoJSON format, you can easily parse and reformat it
using a functional window; the GeoJSON form of the polygon above is shown after this list for comparison.
The schema can also have an optional description field that can be propagated with the Geofence
Window output event.
All other fields will be ignored.
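For comparison, the same polygon (one exterior ring plus one interior hole) expressed in the standard GeoJSON structure would look as follows. This is purely illustrative, restating the coordinate string above in the form that a functional window would typically parse and flatten:

{
  "type": "Polygon",
  "coordinates": [
    [[5.281, 9.455], [3.607, 7.112], [6.268, 6.181], [8.414, 7.705], [5.281, 9.455]],
    [[5.671, 8.316], [6.572, 8.033], [7.087, 7.695], [6.444, 7.469],
     [5.929, 7.215], [5.285, 7.384], [5.199, 7.949], [5.671, 8.316]]
  ]
}

In both representations, coordinates are listed in (X, Y) order, the exterior ring comes first, and each ring repeats its first point at the end to close the shape.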
When working with polygons, the Geofence window analyzes each event position coming from the
streaming window and returns the polygon this position is inside of. If there are multiple matching
geometries (in case of overlapping polygons) and if the option output-multiple-results is set to
true, multiple events are produced (one per geometry).
The Geofence window behaves like a lookup join, so its output schema is automatically defined and
includes all fields coming from the input position window appended with the following additional fields:
• A mandatory field of type int64 or string that will receive the ID of the geometry. If no geometries
are found, the value of this field will be null in the produced event. This field is defined by the
parameter geoid-fieldname.
• An optional field that will receive the description of the geometry if it exists in the geometry
window schema. This field is defined by the parameter geodesc-fieldname.
• An optional field of type double that will receive the distance from the position to the centroid of
the polygon. This field is defined by the parameter geodistance-fieldname.

• If output-multiple-results is set to true, a mandatory key field of type int64 that will
receive the event number of the matching geometry. This field is defined by the parameter
eventnumber-fieldname.
Below is a sample event stream processing XML model that implements a Geofence window using
polygons:
<project name="geofencedemo" pubsub="auto" threads="4" index="pi_EMPTY">
<contqueries>
<contquery name="cq1" trace="alerts">
<windows>
<window-source name="position_in" pubsub="true" insert-only="true">
<schema>
<fields>
<field name="pt_id" type="int64" key="true"/>
<field name="GPS_longitude" type="double"/>
<field name="GPS_latitude" type="double"/>
<field name="speed" type="double"/>
<field name="course" type="double"/>
<field name="time" type="stamp"/>
</fields>
</schema>
</window-source>
<window-source name="poly_in" pubsub="true" insert-only="true">
<schema>

<fields>
<field name="poly_id" type="int64" key="true"/>
<field name="poly_desc" type="string"/>
<field name="poly_data" type="string"/>
</fields>
</schema>
</window-source>
<window-geofence name="geofence_poly" index="pi_EMPTY">
<geofence
coordinate-type="geographic"
log-invalid-geometry="false"
output-multiple-results="false"
autosize-mesh="true"
max-meshcells-per-geometry="200"
/>
<geometry
data-fieldname="poly_data"
desc-fieldname="poly_desc"
data-separator=" "
/>
<position
x-fieldname="GPS_longitude"
y-fieldname="GPS_latitude"
/>
<output
geoid-fieldname="poly_id"
geodesc-fieldname="poly_desc"
geodistance-fieldname="poly_dist"
/>
</window-geofence>
</windows>
<edges>
<edge source="position_in" target="geofence_poly"/>
<edge source="poly_in" target="geofence_poly"/>
</edges>
</contquery>
</contqueries>
</project>

CIRCLE GEOMETRIES
A circle defines the position of a location of interest. It is defined as a pair of coordinates, (X, Y) or
(longitude, latitude), representing the center of the circle, together with a radius distance around this position.
Figure 3 below illustrates some sample circle geofences.

Figure 3. Sample Circle Geofences

The input circle geometry window schema must have at least the following 3 fields:
• A single key field of type int64 or string. This field defines the ID of the circle geometry.
• 2 coordinate fields of type double that contain the X and Y coordinates of the circle center.
The schema can also have the following optional fields:
• A radius field of type double, representing a circular area around the center point position. If this
field is not specified, the default distance defined by the parameter radius will be used.
• A description field that can be propagated with the Geofence Window output event.
All other fields will be ignored.
When working with circles, the Geofence window analyzes each event position coming from the
streaming window and returns the ID of the circle that matches the following criteria:
• If the position lookup distance is set to 0, then the position behaves like a simple point: it is either
in or out of the circle. If it is in the circle, there is a match.
• Similarly, if the circle radius is set to 0, then the circle behaves like a bare point, and that point only
has to be within the position's lookup distance area for there to be a match.
• For any other values of the position lookup distance and the circle radius, the position and the
circle's center must be within each other's distance to have a match. That is, the position must be
inside the circle, and the distance between the circle's center and the position must be lower than the
lookup distance. For example, with a radius of 100 meters and a lookup distance of 110 meters, a
position 90 meters from the center matches, whereas a position 105 meters away does not, because it
falls outside the circle. Figure 4 below illustrates the circle geometry lookup logic in such a case.
• Finally, if both the position lookup distance and the circle radius equal 0, then they have to be
the exact same point to have a match.
The position lookup distance is defined either by an additional input event field value or by the parameter
lookupdistance.

Figure 4. Circle’s Geometry Matching Logic


If there are multiple matching geometries in the lookup distance and if the option output-multiple-results
is set to true, multiple events will be produced (one per geometry).
As when using polygons, the Geofence window behaves like a lookup join, so its output schema is
automatically defined and includes all fields coming from the input streaming window appended with the
following additional fields:
• A mandatory field of type int64 or string that will receive the ID of the geometry. If no geometries
are found, the value of this field will be null in the produced event. This field is defined by the
parameter geoid-fieldname.
• An optional field that will receive the description of the geometry if it exists in the geometry
window schema. This field is defined by the parameter geodesc-fieldname.
• An optional field of type double that will receive the distance from the position to the center of the
circle or to the centroid of the polygon. This field is defined by the parameter geodistance-fieldname.

• If output-multiple-results is set to true, a mandatory key field of type int64 that will
receive the event number of the matching geometry. This field is defined by the parameter
eventnumber-fieldname.
Below is a sample event stream processing XML model that implements a Geofence window using circle
geometries:
<project name="geofencedemo" pubsub="auto" threads="4" index="pi_EMPTY">
<contqueries>
<contquery name="cq1" trace="alerts">
<windows>
<window-source name="position_in" pubsub="true" insert-only="true">
<schema>

<fields>
<field name="pt_id" type="int64" key="true"/>
<field name="GPS_longitude" type="double"/>
<field name="GPS_latitude" type="double"/>
<field name="speed" type="double"/>
<field name="course" type="double"/>
<field name="time" type="stamp"/>
</fields>
</schema>
</window-source>
<window-source name="circles_in" pubsub="true" insert-only="true">
<schema>
<fields>
<field name="GEO_id" type="int64" key="true"/>
<field name="GEO_x" type="double"/>
<field name="GEO_y" type="double"/>
<field name="GEO_radius" type="double"/>
<field name="GEO_desc" type="string"/>
</fields>
</schema>
</window-source>
<window-geofence name="geofence_circle" index="pi_EMPTY">
<geofence
coordinate-type="geographic"
log-invalid-geometry="false"
output-multiple-results="true"
output-sorted-results="true"
max-meshcells-per-geometry="200"
autosize-mesh="true"
/>
<geometry
desc-fieldname="GEO_desc"
x-fieldname="GEO_x"
y-fieldname="GEO_y"
radius-fieldname="GEO_radius"
radius="0"
/>
<position
x-fieldname="GPS_longitude"
y-fieldname="GPS_latitude"
lookupdistance="110"
/>
<output
geoid-fieldname="GEO_id"
geodesc-fieldname="GEO_desc"
eventnumber-fieldname="event_nb"
geodistance-fieldname="GEO_dist"
/>
</window-geofence>
</windows>
<edges>
<edge source="position_in" target="geofence_circle"/>
<edge source="circles_in" target="geofence_circle"/>
</edges>
</contquery>
</contqueries>
</project>

HIGH PERFORMANCE CONSIDERATIONS
In order to provide fast, low-latency lookup processing, the Geofence window implements an
optimized mesh index algorithm that uses a spatial data structure to subdivide space into grid-shaped
buckets called cells. This mesh structure is completely independent of the coordinate system in use, so
any Cartesian, geographic, or projected coordinate space can be used seamlessly.
This mesh algorithm uses a parameter (called the mesh factor) that defines the scale of the space
subdivision. The mesh factor is an integer in the [-5, 5] range, representing a power of 10 of the
coordinate units in use. For example, the default factor of 0 generates 1 subdivision per coordinate
unit, a factor of 1 generates 1 subdivision per 10 units, and a factor of -1 generates 10 subdivisions per
unit. This factor can be set for both the X and Y axes together or independently for each axis.
For example, considering the following set of coordinates representing a square polygon (note the
repeated first point at the end, closing the polygon):
[(1001.12,9500.12) (1001.12,9510.12) (1010.12,9510.12) (1010.12,9500.12)
(1001.12,9500.12)]
• With a mesh factor of 1, the Geofence window divides the coordinates by 10^1 resulting in
[(100,950) (100,951) (101,950) (101,951)] and creates 4 mesh cells for this
geometry. (101-100+1)*(951-950+1) = 4
• Similarly, with a factor of 2, it creates (10-10+1)*(95-95+1) = 1 mesh cell.
• If the mesh factor is set to -1, then the window creates 9191 mesh cells for this geometry
resulting in an oversized mesh: (10101-10011+1)*(95101-95001+1)=91*101 = 9191
As a result, in order to get the best performance, you need to adapt the mesh factor to the spatial
coverage and to the number of loaded geometries. Too many mesh cells per geometry slows down the
ingestion of geometries and generates an oversized index. Too few mesh cells per geometry slows down
the lookup process, which impacts the stream performance and latency.
From our experience, an appropriate and efficient factor subdivides the space so that there are between
0.5 and 10 geometries per cell; in practice, this means that each geometry generates roughly 1 to 10
subdivision cells at most.
To avoid creating an oversized mesh that would generate needless, intensive calculations, you can cap
the number of mesh cells created per geometry using the dedicated parameter
max-meshcells-per-geometry. If a geometry exceeds this limit, it is rejected; in that case, consider setting
a higher mesh factor or, if relevant, a higher maximum number of mesh cells per geometry.
The Geofence window provides an internal algorithm that automatically computes and sets an
appropriate mesh factor by analyzing the ingested geometries. If, for some reason, you want to define the
mesh factors manually, set the parameter autosize-mesh to false.
With an appropriate mesh, this window has been designed to provide outstanding performance despite
the number of calculations involved.
A test has been performed with a set of 21,569,300 polygons representing 625,451,932 points (~ 28
points per polygon).
With a stream of 10 million events representing 10 million different positions, the observed throughput
was ~200K events/second using 1 core.
This level of performance is more than enough for most use cases, and higher throughput can easily be
reached by adding another Geofence window and partitioning the stream.

CONCLUSION
The new Geofence window is an easy-to-use, fast, and flexible SAS® Event Stream Processing window
that provides new capabilities for processing geolocation data in real time. It expands the application of
streaming analytics to the movements and locations of people and connected objects, opening the
horizon to new Internet of Things applications in countless domains that must react immediately and
appropriately.

RECOMMENDED READING
• SAS® Event Stream Processing User Guide

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Frederic Combaneyre
SAS Institute Inc.
frederic.combaneyre@sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Paper 4140-2016
Listening for the Right Signals –
Using Event Stream Processing for Enterprise Data
Tho Nguyen, Teradata Corporation
Fiona McNeill, SAS Institute Inc.

ABSTRACT
With the big data throughputs generated by event streams, organizations can opportunistically respond
with low-latency effectiveness. Having the ability to permeate identified patterns of interest throughout the
enterprise requires deep integration between event stream processing and foundational enterprise data
management applications. This paper describes the innovative ability to consolidate real-time data
®
ingestion with controlled and disciplined universal data access - from SAS and Teradata™.

INTRODUCTION
In today's big data world, there are great challenges and many opportunities. Organizations need the
ability to make the right decisions with precision, accuracy, and speed in order to sustain a competitive
advantage in a global economy of constant change. The state of the business, shaped by dynamic
conditions, requires continuous monitoring and evaluation that separates the right signals from the noise.
These events of interest matter only when they are understood and heard by the dependent parts of the
organization, which requires event processing that flows through the organization as contextually
relevant, data-driven actions. The ability to ingest data and process streams of events effectively
identifies patterns and correlations of importance, focusing the organization on reacting to, and even
proactively driving, the results it seeks in real time. Instead of collecting, analyzing, and storing data in
the traditional manner, data can now be analyzed constantly, as it occurs, empowering organizations to
adjust situational intelligence as new events transpire.
With the emergence of the Internet of Things (IoT), ingesting streams of data and analyzing events in
real time become even more critical. The interconnectivity of IoT devices with web and mobile applications
provides organizations with even richer contextual data and far greater volumes to decipher in order
to harness insights. These insights can uncover greater business value to better understand customer
habits and behavior, enhance operational efficiencies, and expand product and service offerings.
Capturing all of the internal and external data streams is the first step toward listening for the important
signals that customers emit through their event activity. When you hear what customers want from
the data they generate, the right data-driven actions can happen more rapidly than ever, positively
impacting bottom-line profitability.
Of course, one obvious challenge is deploying a reliable, scalable and persistent streaming environment.
This environment needs to provide the necessary self-service capabilities for data administrators,
application developers and data scientists alike, so they can rapidly configure new and different
combinations of data streams and continuous queries for insights. Some organizations have explored and
implemented open source technologies for real time streaming. However, many have already come to
realize the inherent challenges of scaling across multiple event streams, building a dynamic and yet
stable environment that is flexible for adaptation to business dynamics and one that is supportive of
enterprise goals, ongoing needs and timelines.
As such, innovative organizations are moving beyond constructing enterprise environments that require
extensive manual coding from the ground up, to ones that take advantage of pre-built capabilities that are
readily available and integrated with existing organizational assets to drive automated, intelligent
streaming insights. Together, SAS® and Teradata provide an integrated pre-built environment for
exploiting enterprise data that listens for the right streaming signals – improving data-driven decisions for
the entire organization.

SAS® EVENT STREAM PROCESSING
Event stream processing (ESP) is designed to connect and analyze real-time, event-driven information.
ESP processes event streams with the mission of identifying meaningful patterns and correlations as they
occur. Doing more than pipeline transport, event stream processing enriches data by correlating events
and identifying naturally occurring clusters of events, event hierarchies, event probabilities, and other
aspects such as contextual meaning, membership, and timing, delivering deep insight into real-time
activity for a new, fast data infrastructure.
SAS® Event Stream Processing is a comprehensive technology that delivers fast data insights based on a
publish-and-subscribe framework. It ingests event streams, executes continuous queries using a suite
of pre-built and interchangeable window types and operators, and delivers insights and instructions for
automated actions to dependent systems, applications, and big data warehouses. In the traditional data
infrastructure approach, data is amassed, stored, and then analyzed. Instead of storing data and then
running queries against this data at rest, SAS Event Stream Processing stores queries and continuously
enriches streaming data while it is in motion. As such, event streams are examined as they are received,
in real time, and can incrementally update with new intelligence as new events happen. Focusing on
enriching data while events are still in motion demands a highly scaled and optimized process to address
the hundreds of thousands of events per second common to event streams. SAS Event Stream
Processing has the ability to enrich and filter events, differentiating and analyzing text and structured
streaming data with embeddable analytics that instantly translate to real-time insights for event-driven
actions. SAS® Event Stream Processing Studio is the visual data flow interface, simplifying the
construction of continuous queries on event streams and saving time for application developers, data
scientists, and IT architects.

Given that event stream data is never clean, even when generated by machine sensors, SAS Event
Stream Processing includes pre-built data quality routines to aggregate, normalize, standardize, extract,
correct, enrich, and filter event data before it is stored in a data platform. By eliminating data quality
issues up front, countless resources and computing hours are saved, big data stores avoid unnecessary
pollution, and IT and data scientists are more productive. Not only does productivity improve with this
traditional data cleansing now happening on data in motion, it also takes care of the data preparation
needed for successful in-stream analytics. Furthermore, by filtering the data to what is cleansed and
relevant, organizations avoid unnecessary storage of irrelevant event noise and focus all other activity on
what matters.
The ability to listen for events and to ingest and consolidate streams of data is critical to real-time actions,
ones that capture transitory event opportunities and avoid impending threats. Low-latency response for
real-time actions, with millisecond and sub-millisecond response times, not only demands high-performance
processing but also requires tightly integrated data communication access to event stream sources and
delivery to streaming insight consumers. SAS Event Stream Processing comes with a suite of prebuilt
connectors and adapters (such as for Teradata) to consume structured and semi-structured data streams.
Connectors and adapters operate through the publish/subscribe layer (as illustrated in Figure 1) and can
also be custom built using APIs in C, Java, and Python. Supporting authentication and encryption, they
publish data from any source into the continuous query and publish data out to any subscribed target. In
addition, they support communication across different streams for enterprise-level use of streaming
insights over a range of messaging bus and data transport protocols. Creating a robust ecosystem with
pre-built, editable, and open APIs to ingest, consolidate, and manage multiple event streams mitigates
the risk of limiting insights and relieves specialized programmers of the need to write code for ongoing
support and maintenance.
Continuous queries are at the heart of driving new, enriched insights from streaming data (depicted in
Figure 1). SAS Event Stream Processing applies a comprehensive suite of advanced analytics to event
streams, such as forecasting, data mining, and machine learning algorithms, for governed streaming
decisions (McNeill et al., 2016). Data governance not only addresses the dynamic nature of streaming
data, it also ensures a fully documented and readily understood event stream processing application,
providing the agility to make changes necessitated by the dynamic nature of business and to understand
their impact.

Customizable alerts, notifications and updates directly issued from SAS Event Stream Processing provide
precise and accurate situational awareness so that actions are relevant and informed as to what’s
happening and what's likely to happen. These actions are fueled by continuous, accurate, and secure
event pattern detection, backed by SAS Event Stream Processing's patented 1+N-Way failover,
guaranteed delivery (without persistence), full access to event stream model metadata, live stream
queries, and dynamic streaming model updates, along with deep analytic capabilities.

Figure 1: SAS® Event Stream Processing conceptual architecture

SAS Event Stream Processing captures business value otherwise lost through information lag.
Businesses can analyze events as they happen and seize new opportunities by producing data-driven,
actionable intelligence with minimal latency. It enables new analyses and processing models to be
developed and modified quickly to meet the changing needs of the business and the competitive
landscape.

SAS EVENT STREAM PROCESSING WITH TERADATA


Traditionally, data has been stored in a database. Once the data is captured, it goes through a rigorous
ETL (Extraction, Transformation, Load) process to integrate it into the data warehouse. The ETL process
can take days or even weeks to complete, depending on the size of the data. Data analysts and business
analysts glean insights, and automated reports are produced, from queries that run against the trusted
and vetted data warehouse. However, this traditional processing paradigm isn't well suited to driving
insights from events that are happening in near real time. Figure 2 compares traditional relational
database (RDBMS) processing with event stream processing.
By integrating SAS Event Stream Processing with Teradata, organizations now have a new, modernized
approach to percolate current events and streams of data into existing reporting and insight-driven
applications. This is enabled by the SAS Event Stream Processing Teradata connector, which leverages
the Teradata Parallel Transporter (TPT) API to support subscribe operations against the Teradata server.

Figure 2: ESP and Database processing

From SAS Event Stream Processing, the Teradata connector subscribes to events and writes them to the
Teradata server using one of the following operators (a configuration sketch follows this list):

• Stream – operates similarly to a standard event stream processing database subscriber, but with
improved performance derived from TPT. It supports insert, update, and delete events. As events are
received from the subscribed window, it writes them to the pre-defined target table. If the (required)
tdatainsertonly configuration parameter is set to "false", serialization is automatically enabled in TPT
to maintain correct ordering of row data over multiple sessions.
• Update – supports insert, update, and delete events, but writes them to the target table in batch mode.
The batch period is a required configuration parameter. At the cost of higher latency, this operator
provides better throughput with longer batch periods (for example, minutes instead of seconds).
• Load – supports insert events only and requires an empty target table. It provides the most optimized
throughput by staging data through a pair of intermediate tables. These table names and connectivity
parameters must be specified as additional configuration parameters. Writing from a staging table to the
ultimate target table uses the generic ODBC driver used by the database connector, so the associated
connect string configuration and odbc.ini file specification are required. The staging tables are
automatically created by the connector; if the staging tables and related error and log tables already exist
when the connector starts, it automatically drops them at start-up.
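To make the configuration concrete, below is a minimal sketch of how a Teradata subscriber connector might be declared on a window of an event stream processing XML model, using the stream operator described above. The property names shown are indicative of the Teradata connector's configuration style, and the window, server, table, and credential values are hypothetical; property names and required parameters can differ by release, so the product documentation remains the authoritative reference.

<window-copy name="scored_events">
  <connectors>
    <!-- Subscriber connector that writes events from this window to a Teradata table via TPT -->
    <connector class="tdata" name="teradata_sub">
      <properties>
        <property name="type">sub</property>                      <!-- subscribe to this window -->
        <property name="snapshot">false</property>                <!-- stream ongoing events only -->
        <property name="tdatadriver">stream</property>            <!-- stream, update, or load operator -->
        <property name="tdatainsertonly">false</property>         <!-- false enables TPT serialization -->
        <property name="desttablename">scored_events</property>   <!-- hypothetical target table -->
        <property name="tdatatdpid">my_td_server</property>       <!-- hypothetical Teradata TDPID -->
        <property name="tdatausername">esp_user</property>        <!-- hypothetical credentials -->
        <property name="tdatauserpwd">esp_password</property>
      </properties>
    </connector>
  </connectors>
</window-copy>

In practice, the same subscription could instead be run out of process with the corresponding Teradata adapter, which is useful when the write target needs to be started, stopped, or relocated independently of the streaming model.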

Having integrated connectors is certainly a good start, and new innovations expand on this to facilitate
even faster processing and reduced latency. The new Teradata Listener™ is an integrated offering that
delivers a unified solution for handling the endless torrent of digital information streams. With digital
information growing exponentially by all estimates, integrating streaming insights across the enterprise
will correspondingly become both more important and more complex. Integrating Teradata Listener with
SAS Event Stream Processing provides a new frontier for analyzing all big data in a massively parallel
processing environment, delivering new, timely, fact-based insights to everyone in the enterprise.

TERADATA LISTENER™ AND SAS® EVENT STREAM PROCESSING
Ingesting streams of data is the key design element of Teradata Listener, an intelligent, self-service
software solution that ingests and distributes exceedingly fast-moving data streams throughout the
enterprise analytical ecosystem. Listener™ collects data from multiple high-volume, real-time streams
from sources such as social media feeds, web clickstreams, mobile events, and IoT (server logs, sensors,
and telematics). As mentioned, as a subscribed source, Listener can also ingest the streaming analytic
insights defined in SAS Event Stream Processing.

The key value of Listener is to allow developers and data administrators to build real-time processing
capabilities. It handles large volumes of log and event data streams, and it reliably handles mission-critical
data streams, ensuring data delivery without loss. Teradata Listener offers a self-service capability to
ingest streams of data without coding, and with no manual coding it accelerates time to deeper insights
as a streamlined and traceable process. It simplifies the IT processes, maintenance, and cost of
custom-built systems. It can act as a centralizing system that scales to the complete organization and
operates with hundreds of applications built by siloed teams, all of which can be plugged into the same,
consistent system.

STREAMING SIMPLIFIED WITH SELF SERVICE


Teradata Listener streamlines the data ingestion process through a self-service dashboard, which can be
accessed by multiple users (developers, administrators, and data scientists) throughout the enterprise.
The intuitive Listener dashboard makes configuring data sources and targets an easy task, eliminating the
need for programming. Technical users can easily add, remove, or edit sources and targets to create
streaming data pipelines. There is no need to request access or to open IT change tickets, and there is no
waiting for a programming team to develop and test yet another interface to a home-grown streaming
capture module.

Teradata Listener ingestion services are invoked from RESTful interfaces over the ubiquitous HTTP
transport protocol, a universally accepted protocol for modern-day applications. Any developer can easily
invoke Listener's ingestion services to send continuous data streams to a data warehouse, an analytical
platform, Hadoop, or any other big data platform.

Additionally, APIs (such as those developed with SAS Event Stream Processing) give developers more
flexibility to access the data flowing through Listener. And in the case of the connection with SAS Event
Stream Processing, Teradata Listener receives streams that have already been vetted, cleansed, filtered,
and enriched, improving the content of the streaming pipeline sourced by Listener, as shown in Figure 3.

Output from Teradata Listener is used to inform existing reporting work streams, update custom
dashboards, and feed other processing engines for additional transformations. Moreover, as depicted in
Figure 3, Listener output can stream back into SAS applications, other data repositories (that is, data at
rest), and reporting systems, and even back into SAS Event Stream Processing.

INGEST CONTINUOUS STREAMS


Sources of streaming event data proliferate, whether from web events, email, sensors, social media,
machine data, IoT, SAS Event Stream Processing output, or others, as shown in Figure 3. Teradata
Listener brings together the big data ingestion process by collecting multiple high-volume data streams
continuously from a variety of sources and storing them in one or more of the data stores that comprise
the enterprise data ecosystem. Listener can write to a variety of target stores: an integrated data
warehouse, an analytical platform, or Hadoop. Listener can also write results back into SAS Event Stream
Processing. Now enriched with more data, new insights, and even directions from end users, Listener
output can be analyzed further, and actions can be delivered as streaming decisions to devices and
objects at in-stream aggregation points and even at the edges of the IoT.

Listener is agnostic to data variety, working effectively with both structured and semi-structured data. A
Teradata Listener cluster of servers scales horizontally to meet the growing demands of multiple data
streams in the enterprise.

Figure 3: Teradata Listener and SAS Event Stream Processing

DATA-DRIVEN INTELLIGENCE
Listener continuously and automatically monitors incoming data streams, gathering critical information
that is surfaced through the graphical user interface and dashboards to provide a deep understanding of
the data. Various metrics on this dashboard help end users understand current activity both into and out
of Listener's ingest and distribution processes. Users can intuitively discover when a stream has stopped
or when a target stops accepting the data output.

Teradata Listener's microservices architecture decouples the ingestion of incoming data streams from the
outgoing distribution processes. Listener intelligently buffers the distribution output when target systems
are full and activates distribution later when the target system allows, all without any manual intervention.

CONCLUSION

As business conditions evolve, the need to continuously monitor and measure streaming events of
interest is imperative. Machine-driven, human-guided curation of event streams, enriched with analytic
intelligence and focused on relevant events that are heard throughout the enterprise, is the unique value
that SAS and Teradata provide. Instead of the traditional "stream, score, and store" process, data can
now be analyzed immediately as it is ingested or received, adjusting situational intelligence as new
events happen, using Teradata Listener and SAS Event Stream Processing.

Applicable across a wide range of industries, the ability to process streaming data once and persist it to
stores, applications, and other streams across the enterprise is a foundational benefit for analytical
workloads. Efficient, well-managed processing is paramount to low-latency, real-time responsiveness,
and when time matters, the ability to complete the full analytical lifecycle to drive better decisions
becomes critical. Whether the need is to re-optimize mobile dispatch units based on live location streams,
to prevent hazardous events by prioritizing maintenance based on current weather predictions, or to
recognize the need for new streams of data to improve projected operational effectiveness, listening for
the right signals provides focus.

With SAS and Teradata, the combined and integrated technology offers a scalable and reliable solution to
ingest data and process streams of events, leveraging the embeddable streaming analytics of SAS so
that organizations can proactively respond to even the most complex issues.

REFERENCES
McNeill, F., D. Duling, and S. Sparano. 2016. "Streaming Decisions: How SAS Puts Streaming Data to
Work." Paper SAS6367. Proceedings of SAS Global Forum 2016, Las Vegas, NV.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Tho Nguyen
Teradata
tho.nguyen@teradata.com

Fiona McNeill
SAS
fiona.mcneill@sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Paper 4120-2016
Prescriptive Analytics – Providing the Instruction to Do What’s Right
Tho Nguyen, Teradata Corporation
Fiona McNeill, SAS Institute Inc.

ABSTRACT
Automation of everyday activities holds the promise of consistency, accuracy, and relevancy. When
applied to business operations, the additional benefits of governance, adaptability, and risk avoidance are
realized. Prescriptive analytics empowers both systems and front-line workers to take the desired
company action, each and every time. And with data streaming from transactional systems, from the IoT,
and from any other source, doing the right thing with exceptional processing speed embodies the
responsiveness that customers depend on. This talk describes how SAS® and Teradata® are enabling
prescriptive analytics, in current business environments and in the emerging IoT.

INTRODUCTION
Being an analytically-driven organization means basing decisions and actions on data, rather than gut
instinct. As more organizations recognize the competitive advantages of using analytics, the impact can
wane as competitors build this same capability. To cross this innovation chasm and sustain the
competitive advances that come from analytical adoption, organizations continually test and expand data
sources, improve algorithms and evolve the application of analytics to every day activity.
Predictive algorithms describe a specific scenario and use historical knowledge to increase awareness of
what comes next. But knowing what is most likely to happen and knowing what needs to be done about it
are two different things. That's where prescriptive analytics comes in. Prescriptive analytics answers the
question of what to do, providing decision options even for predicted future scenarios.
Seldom (if ever) do events happen in isolation. It’s through their interconnections that we develop the
detailed understanding of what needs to be done to change future trajectories. The richness of this
understanding, in turn, also determines the usefulness of the predictive models (Pinheiro & McNeill,
2014). Just as the best medicine is prescribed based on a thorough examination of patient history,
existing symptoms, and the like, so are the best prescriptive actions founded in well-understood scenario
context. And just as some medicines can interact with one another, with one medicine being less effective
in the presence of another, so can decisions and corresponding actions taken from analytics, which in
turn can affect the outcome of future scenarios.
As you’d expect, under different scenarios – you’d have different predictions. When conditions change,
the associated prediction for that same data event can also change. When you apply one treatment, you
affect another, changing the scenario. Actions that are taken not only create a new basis for historical
context, but also create new data that may not have been considered by the original model specification.
In fact, the point of building predictive models is to understand future conditions in order to change them.
Once you modify the conditions and associated event behavior, you change the nature of the data. As a
result, models tend to degrade over time, requiring updates to ensure accuracy to the current data,
scenario, and new predicted future context.
Well-understood scenarios are fed by data. The more data you have to draw from to examine
dependencies and relationships that impact the event being predicted, the better the prediction will likely
be. This is where the value of big data comes in… as big data is more data with finer detail, and greater
context richness. Big data offers details not historically available that explain the conditions under which
events happen, or in other words, the context of events, activities and behaviors. Big data analytics
allows us, like never before, to assess context – from a variety of data, and in detail. And when that big
data is also fast data (on the order of thousands of events per second), it's a stream of events. When we
bridge big data analytics with event streams, as generated in the IoT, we have the power to write more
timely and relevant business prescriptions that are much harder for competitors to mimic.

BUILDING PRESCRIPTIVE ANALYTICS


Prescriptive analytics defines the instructions for actions based on both analytic models and the business
rules that trigger the models. Combined by Boolean logic, SAS® Decision Manager provides an intuitive
interface to build decision logic, associating the models and the business rules with the appropriate
conditions, as illustrated in Figure 1.

Figure 1. Operational decisions are built by combining business rules (e.g. account_level = “COPPER”) with
analytical models (e.g. Bad_level_Default) using conditional logic in SAS Decision Manager.

The instruction, as defined in SAS Decision Manager's decision logic, encapsulates the conditions under
which a particular model is valid and when it should trigger to deliver results. Scoring is then reserved for
when the appropriate conditions are met, that is, the conditions specific to the model's design scenario,
avoiding unnecessary data processing.
Typically, business analysts are the decision designers. They are often tasked with working through the
logic of what actions need to be taken under different operational scenarios, whether they are
product-related decisions, customer actions, service requirements, or other types of day-to-day business
activities. These analysts draw upon the work of others, namely the analytical experts and data scientists
who have built the models, and they reach into data, like that from the Teradata® Unified Data
Architecture™, which has been vetted and validated by IT.
Building decisions therefore requires the foundation of models that have already been developed, tested,
and validated using applications such as SAS® Analytics for Teradata®. Decisions also require
pre-determined business rules to be defined to the system. Management of both business rules and
analytical models is necessary, particularly given the expanse of users who often benefit from a
formalized decision management environment.

BUSINESS RULE MANAGEMENT
Business analysts themselves might have access to all the governing policies, regulatory rules,
constraints, best practices, and other business logic necessary to define business rules. More often
than not, however, the business knowledge that's needed is held across different divisions of the
organization, such as compliance, finance, sales, and marketing. A centralized and well-managed
environment for defining business rules, business logic, and terminology therefore helps eliminate
debates between different divisions of the organization. It also promotes consistency in the use
and application of business rules in operations.
Within SAS Decision Manager, SAS® Business Rules Manager provides a centralized and managed
repository for rules. Individual rules are joined together using a wizard, which defines the specific scenario
conditions as rule flows (as illustrated in Figure 2). Rules can be defined, tested, validated against data,
and even discovered using analytic methods, all from within the same environment. When rule flows are
published for execution in operations, the published rule flow is automatically locked down to secure it
from additional testing and modification. Authorizations and defined workflows ensure that changes are
documented, approved, and authorized by the appropriate personnel.

Figure 2: Wizard edit environment for creating, editing and managing business rule flows.

The collection of terms used to build rules is foundational to the common language that communicates
the objectives and responsibilities of the business, appropriately described as a vocabulary. You can
import pre-existing vocabularies (from .CSV files), edit them, reuse ones extracted from physical tables
and share vocabularies across rule sets. SAS Business Rules Manager allows multiple authorized
users to contribute to rule definitions, facilitates change management control, retains audit details,
empowers validation by subject matter experts, and governs rule elements. When business rules are
designed in this environment, they are safe from the risk of undocumented tribal knowledge and become
a corporate asset.

ANALYTICAL MODEL MANAGEMENT
Just as business rules are the domain of experts who understand the business, analytical models are the
domain of data scientists, statisticians, and data miners alike. SAS Decision Manager includes SAS®
Model Manager, which manages the inventory of models developed in SAS® Factory Miner, SAS/STAT®,
SAS/ETS®, SAS® Enterprise Miner™, PMML, generic R models, code snippets from other code bases¹,
as well as from SAS® High-Performance Data Mining. Having forecasts, predictions, and other models
registered as a comprehensive collection (as shown in Figure 3) allows organizations to monitor for signs
of degradation as the scenario context changes; manage versioning, authorship, workflow, and usage
tracing; and gain detailed visibility into production quality.

Figure 3: Collections of analytical algorithms are centralized in one, governed environment.

Business analysts, who are focused on creating complete decisions, select the appropriate model as
designated by the analytic expert. This takes the guesswork out of deciding which model is most
appropriate to a particular scenario and streamlines the often tedious tasks of understanding model
definitions and data input needs. The business analyst readily has the business context of the model,
explicitly defined and in a recognizable, intuitive format. Building complete decision flows therefore
becomes an exercise of defining the rule flows in conjunction with the prescribed model, associating
them together through the appropriate conditional logic, all from the same, simplified interface (as was
illustrated in Figure 1). Moreover, the logic used, the definitions, and the ownership of each element of
the decision flow are retained, so that when it comes to deploying models into production, IT has a
complete perspective of who, what, why, and how these decisions were defined, the testing that was
done, and how to apply them to business operations for prescriptive actions.
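Extending the earlier rule sketch, the fragment below illustrates, again as a hypothetical hand-coded SAS DS2 approximation, how a decision flow associates a model's output with rule conditions to arrive at an action. The table WORK.SCORED_CUSTOMERS, the predicted probability p_churn, and the action values are assumptions for illustration; in SAS Decision Manager the equivalent logic is assembled from the registered model and rule flows rather than written by hand.

   proc ds2;
      data work.decisions / overwrite=yes;
         dcl varchar(30) next_best_action;

         method run();
            set work.scored_customers;   /* hypothetical table holding the model score p_churn */

            /* Conditional logic that pairs the predicted likelihood with rule conditions */
            if p_churn >= 0.7 and tenure_months > 24 then
               next_best_action = 'Retention offer';
            else if p_churn >= 0.7 then
               next_best_action = 'Service call-back';
            else
               next_best_action = 'No action';
         end;
      enddata;
   run;
   quit;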

¹ Other code bases, such as C, C++, Java, and Python.

VALUE OF PRESCRIPTIVE ANALYTICS
Prescriptive analytics provides the instruction of what to do, and as importantly, what not to do when
analytical models are deployed into production environments. Defined as decisions, prescriptive analytics
is applied to scenarios where there are too many options, variables, constraints, and data for a person to
evaluate without assistance from technology. These prescriptive decisions are presented to the front-line
worker, providing the answer they seek and accounting for the detailed aspects of the scenario they find
themselves in. For example, call center personnel often rely on prescriptive analytics to know the
appropriate options, the amount, and the conditions under which a prospective customer can be
extended varying levels of credit.
Prescriptive analytics also provides organizations with the ability to automate actions based on these
codified decisions. Every organization has simple, day-to-day decisions that occur hundreds to
thousands of times (or more) and that do not require human intervention. For example, the identification
and placement of a targeted advertisement based on a web shopper's session activity is popular in the
retail industry. In such cases, prescriptive analytics is used to ingest the scenario conditions (in our
example, what has been viewed and clicked on during the session) and take the optimal action (for
example, place the most relevant ad). What is optimal, for the purposes of this paper, is defined as an
action that best meets the business rule definitions and associated predicted likelihoods. What is optimal
can also refer to a mathematically optimized solution, as Duling (2015) has previously described.
Scoring data with a model typically involves IT. Via an email or some other notification, IT is presented
with an equation and the data inputs it needs. What is often lacking is the business rationale, the context,
and a translation of terminology into IT terms. As such, IT asks all the necessary questions, often
recodes the model, runs tests and validates output, and then, after applying any specific business
policies and/or regulatory rules, puts the model into 'production' – that is, operationalizes the model so it
can generate results.
While in some organizations these steps may not all be done by IT, they still happen. As illustrated in
Figure 4, each step – even after the model is developed – adds time to implementing the model and
cashing in on the business benefits. In many organizations, the latency from model deployment to
business action is weeks, if not months. As a result, by the time a model is ready to generate results in a
production context, it is often too late: either the opportunity for impact is gone or conditions have
changed to the point where the model is no longer relevant.

Figure 4: The impact of time delays on the value of using analytics

Prescriptive analytics defined using SAS Decision Manager reduces this latency, streamlining the time
from when a model is developed to when actions are taken. Furthermore, the context of the model is
explicit, defined by the business rules, to the point that impact assessments across any point of the
decision flow are transparent (as illustrated in Figure 5). And because of this explicit decision definition,
changes and adjustments to new models, rules, conditions, data, or combinations of any of these
dynamics are readily made – tracked as part of version control and documented for the purview of
auditors and the like. Analytical model deployment and usage becomes part of a governed, managed
environment, reducing the risk associated with incorrect definitions, poor market timing, and regulatory
non-compliance.

Figure 5: Lineage across decisions can be examined as part of impact assessments

Prescriptive analytics has the benefit of automating instructions and best suggested options that are
acted upon by a person. Prescriptive analytics can also be used to directly automate actions for more
mundane tasks, doing so consistently and accurately. In both cases, relevancy to the current scenario is
assured in this managed environment and is the product of the vetted, tested, and detailed decision flow
(as was illustrated in Figure 1). As data volume, variety, and velocity are only set to increase, and as
technology continues to develop to process more data, faster, the trend toward automating actions taken
from analytics will correspondingly rise.
The business need to automate prescriptive analytics stems from companies that demand real-time
responses from data-driven decisions. It is clear that every company will increasingly become inundated
with data and that this data needs to be analyzed. The reality is that organizations simply do not have
enough people to analyze all the data – even if they could comprehend all the scenario details and
volumes – to make all decisions in a timely manner. Prescriptive analytics defined in SAS Decision
Manager has the benefit of being:
• Relevant, consistent, and accurate
• Easily automated for human instructions and downstream application/system actions
• Explicit about the business context
• Tested, vetted, and documented
• Adjustable to changing scenarios
• Deployed in a timely manner
• Governed in a single environment, providing an unequivocal source of truth
• An asset, encapsulating intellectual property and managing lifecycle degradation.

OPERATIONALIZING PRESCRIPTIVE ANALYTICS IN TERADATA


SAS Decision Manager is integrated with Teradata to further extend the benefits of prescriptive
analytics – by moving analytics deployment to the data, eliminating the burden on network resources and
reducing the latency of time to action. By applying prescriptive analytics where the data reside, the
process is significantly streamlined because data movement and redundancy are eliminated. In addition,
this improves data integrity since there is no copying of data or moving of data to a separate, siloed
server.

HOW IT WORKS
SAS and Teradata are well integrated to deliver complete, data-driven decisions. The Teradata database
can be leveraged to handle the heavy processing of data analytics. Teradata offers a powerful and
scalable architecture that enables massively parallel processing (MPP). This MPP architecture is a
"shared nothing" environment and can disseminate large queries across nodes for simultaneous
processing. It is capable of high data consumption rates through parallelized data movement, which
completes tasks in a fraction of the time. The end-to-end process can be executed inside the Teradata
platform to improve performance, economics, and governance, as illustrated in Figure 6.

Figure 6: End-to-end decision processing with SAS and Teradata

Complete decisions², which include data definitions, business rules, and analytical models, are recognized
within SAS® Data Integration Studio. Treated as a single decision flow, SAS Data Integration Studio
generates SAS DS2 code that can be run within Teradata to inherently leverage the highly scalable
environment for processing big data. This is enabled by an embeddable processing technology, the SAS
Threaded Kernel (TK), within the Teradata platform. The embedded process generates work using units
of parallelism scheduled on each AMP of the Teradata platform, and Teradata's workload manager
manages the SAS embedded process as a standard Teradata workload.
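As a rough illustration of the pattern, the sketch below runs a DS2 thread program through the SAS In-Database Code Accelerator so that the work is scheduled across the Teradata AMPs. The library TDLIB, the connection values, the tables, and the scoring expression are all placeholders, and the DS2ACCEL= option name can vary by release; the DS2 that SAS Data Integration Studio generates for a real decision flow will be considerably more involved.

   /* Hypothetical connection to the Teradata database */
   libname tdlib teradata server='tdprod' user=myuser password=XXXXXXXX
           database=analytics;

   proc ds2 ds2accel=yes;               /* request in-database execution (assumes the SAS
                                           In-Database Code Accelerator is licensed)      */
      thread score_th / overwrite=yes;
         dcl double risk_score;
         method run();
            set tdlib.transactions;     /* hypothetical source table stored in Teradata */
            /* simplified stand-in for the generated model and rule logic */
            risk_score = 0.8*amount_zscore + 0.2*velocity_zscore;
            output;
         end;
      endthread;

      data tdlib.scored_transactions / overwrite=yes;
         dcl thread score_th t;
         method run();
            set from t;                 /* rows are processed in parallel on the AMPs */
         end;
      enddata;
   run;
   quit;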
Analytic models can also be built in-database using the SAS® Analytics Accelerator. The SAS® Analytics
Accelerator for Teradata contains specialized vendor-defined functions for Teradata that enable in-
database processing for a collection of modeling and data mining algorithms. For model scoring, the SAS
Scoring Accelerator for Teradata transforms models created with SAS/STAT or SAS Enterprise Miner so
that they can be scored inside the database using the SAS embedded process technology.
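For model scoring specifically, publishing is typically done with the Scoring Accelerator's publishing macros. The outline below shows the general pattern of defining a connection and publishing score code exported from SAS Enterprise Miner or SAS Model Manager; the connection values, directory, and model name are placeholders, and exact macro parameters vary by release, so treat this as a sketch rather than a recipe.

   /* Hypothetical Teradata connection used by the publishing macros */
   %let indconn = server=tdprod user=myuser password=XXXXXXXX database=analytics;

   /* Publish exported score code (for example, score.sas and score.xml) to
      Teradata so that scoring runs in-database via the embedded process.   */
   %indtd_publish_model(
      dir=/models/churn_v3,     /* hypothetical directory holding the exported score files */
      modelname=churn_v3,       /* hypothetical model name */
      action=create);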
Decisions are deployed and "published" to Teradata directly from SAS Data Integration Studio as SAS
macros; if only model scoring is desired, models can be published using SAS Model Manager. SAS
Decision Manager includes both SAS Data Integration Studio and SAS Model Manager, providing
options that optimize analytically based processing in-database with Teradata (as shown in Figure 7).
The metadata about models, rules, and logic is all encapsulated within decisions, helping organize
production deployment.

Figure 7: Publish models using SAS Model Manager (included with SAS Decision Manager) and Teradata

In some organizations, prescriptive decisions are deployed into operational data streams. There may
also be instances where only business rules, without analytic models, are needed to determine the
appropriate action. For example, internal organizational accounting often requires a distribution of
revenue (also known as revenue attribution) across business divisions and functions, based on corporate
policies or governance measures. Business rules defined within SAS Business Rules Manager can be
pushed down and directly executed inside the database without any recoding or redefinition³. For
in-database business rule execution inside Teradata, the processing tasks are further streamlined and
fully scalable without data replication. This approach is also highly amenable to the commonly required
changes in business rule definitions that come with organizational and product changes, acquisitions,
mergers, and business policy dynamics.

² The SAS® Code Accelerator for Teradata and SAS® Analytics Accelerator for Teradata push SAS executable code to process directly inside the Teradata data warehouse.
³ The SAS® Code Accelerator for Teradata is used to execute SAS Business Rules Manager code in the Teradata data warehouse.

DATA EXPLORATION
Once data sources are gathered in Teradata, you can begin to explore them using your preferred data
exploration tool, like SAS® Visual Analytics (which is also enabled on the Teradata Appliance for SAS).
Data exploration is a process of examining data, often discovering or extracting new knowledge.
Typically performed by a business analyst, it looks at what the data contains, the scenario at hand, and
what variables are in the data set, and it evaluates the relationships and patterns necessary to
understand decision conditions.
This initial exploration of the data helps answer common inquiries and is a productive way to become
more familiar with the data that defines the scenario. A best practice is to explore all your data directly in
the database, so the data is well understood before identifying the key factors for conditional logic and
rule definitions, while eliminating redundancy and removing irrelevant data. This same exploration
capability is also a powerful and flexible way to monitor business rule execution and to retrospectively
review decision actions, as dashboards or reports. And it's not just for the business analyst: the ability to
quickly extract knowledge from large, complex data sets gives the data scientist, statistician, and data
miner alike the same advantage of dynamically exploring data as part of the model development process.
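As a simple illustration of exploring data where it resides, the Base SAS sketch below profiles a Teradata table through the SAS/ACCESS Interface to Teradata; where the query can be passed through, the summarization runs inside Teradata rather than pulling rows back to SAS. The library, table, and column names are hypothetical.

   /* Hypothetical connection to Teradata */
   libname tdlib teradata server='tdprod' user=myuser password=XXXXXXXX
           database=analytics;

   /* Quick profile of candidate decision variables; with implicit SQL
      pass-through, the aggregation can execute inside Teradata.       */
   proc sql;
      select region,
             count(*)          as n_customers,
             avg(credit_score) as avg_score
      from tdlib.customers
      group by region;
   quit;

   /* Frequency of a categorical rule input */
   proc freq data=tdlib.customers;
      tables churn_flag;
   run;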
Prescriptive analytics require the right process, skilled personnel and scalable technology. With SAS and
Teradata, prescriptive analytics is streamlined, effective and efficient – from the perspectives of both IT
and the business. These integrated technologies deliver data-driven decision options and even
automated actions, helping organizations take advantage of future opportunities and alleviating potential
risks each time a decision is made.

LEVERAGING INTERNET OF THINGS (IOT)


The Internet of Things (IoT) can mean different things to different people and works in conjunction with
big data. It is a system of physical objects (devices, vehicles, buildings, machines, and others) that are
embedded with electronics, software, sensors, and network connectivity so that these objects can
communicate through the exchange of data. IoT generates, and will continue to generate, a lot of data.
Data transmitted by objects provides entirely new opportunities to measure, collect, and act upon an
ever-increasing variety of event activity.
If we consider just sensor data, say in the transportation industry, use cases abound for identifying
potential equipment defects in planes, trains, and automobiles. Going beyond collecting data for
exploration, and even analysis, prescriptive analytics not only uncovers patterns in events as they occur,
but is also used to take automated actions that prevent unnecessary outages and costs. Whether by
sending alerts and notifications, updating situational war-room dashboards, or even providing instructive
action to other objects, the need for real-time action has never been greater.
Sensor data can take the form of structured and semi-structured data. It can be integrated with other
data sources, such as lookup tables, both while it is still in motion and after it has landed in a big data
repository or warehouse. And while many organizations are simply streaming sensor data and storing it
to be analyzed retrospectively, prescriptive analytics embedded within sensor (and other event) streams
holds the promise of consistent reaction and even proactive intervention. For example, retail sales
activity based on marketing campaign effectiveness is often associated with a targeted list of loyal
customers. By collecting web clicks in real time, together with past purchases, prescriptive analytics
could indicate that a shopper has a high likelihood of purchasing shoes with the pants they are viewing,
prompting a pop-up savings coupon for the pants.
Use cases leveraging prescriptive analytics in IoT applications abound: everything from social media
monitoring, collecting tweets, blogs, and posts to determine which products or services consumers are
recommending, to security and surveillance of login sessions and data access for potential data security
breaches, and all else in between.

CONCLUSION
In the eyes of customers, in both business-to-business and business-to-consumer industries, purchase
choices can be summarized as being dependent on product quality, service and support excellence, and
the ability to appropriately fulfill the purchase need. As such, ensuring product health, the responsiveness
of fulfillment, and understanding the full context of the purchase decision is paramount to being the
selected candidate. For day-to-day decisions, prescriptive analytics fulfills that need, giving organizations
the ability to accurately decipher the scenario context and to take the appropriate action in a manner
that's consistent and relevant. With SAS® In-Database Decision Management for Teradata®, you can:
• Be more responsive, proactive, and reliant on data-driven operational decisions for new opportunities.
• Improve performance and minimize time previously spent moving or duplicating data and code
between systems.
• Increase security and compliance of data in one integrated, highly governed environment.
Taking prescriptive analytics to the data and running in-database extends the benefits of relevant, timely
instructions and actions without having to move data. Model and business rule deployment – as
complete, documented, and vetted decisions – becomes part of job processing, for even the biggest of big
data. With SAS and Teradata, the integrated portfolio of solutions enables you to explore all options,
determine the appropriate approach, execute the action and evaluate/improve the business decision.

REFERENCES
Pinheiro, C., and F. McNeill. 2014. Heuristics in Analytics: A Practical Perspective of What Influences Our
Analytical World. NJ: John Wiley and Sons.
Duling, D. 2015. "Make Better Decisions with Optimization." Proceedings of SAS Global Forum 2015,
Paper SAS1785-2015, Dallas, TX. Available at:
http://support.sas.com/resources/papers/proceedings15/SAS1785-2015.pdf
SAS and Teradata Partnership: www.teradata.com/sas
SAS and Teradata In-Database Decision Management for Teradata:
http://www.teradata.com/partners/SAS/SAS-In-Database-Decision-Management-for-Teradata-Advantage-Program/

RECOMMENDED READING
IIA Research Brief. 2014. Prescriptive Analytics: Just What the Doctor Ordered. Available at:
http://epictechpage.com/sms/sas/wp-content/uploads/2015/01/iia-prescriptive-analytics-Just-What-Dr-Ordered.pdf

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Tho Nguyen
Teradata
tho.nguyen@teradata.com

Fiona McNeill
SAS
fiona.mcneill@sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

Paper 334-2017
Analytics of Healthcare Things IS THE Next Generation Real World Data
Joy King, Teradata Corporation

ABSTRACT
As you know, Real World Data (RWD) provides highly valuable and practical insights. But as valuable as
RWD is, it still has limitations. It is encounter-based and we are largely blind to what happens between
encounters in the Healthcare System. The encounters generally occur in a clinical setting which may not
reflect actual patient experience. Many of the encounters are subjective interviews, observations, or
self-reports rather than objective data. Information flow can be slow (even real-time is not fast enough in
healthcare anymore). And some data that could be transformative cannot be captured currently.

Data from select IoT devices can fill the gaps in our current RWD for certain key conditions and provide
missing components that are key to conducting the Analytics of Healthcare Things (AoHT), such as:
• Direct objective measurements
• Data collected in the "usual" patient setting rather than an artificial clinical setting
• Data collected continuously in the patient's setting
• Insights that carry greater weight in Regulatory and Payer decision-making
• Insights that lead to greater commercial value

Teradata has partnered with an IoT company whose technology generates unique data for conditions
impacted by mobility or activity. This data can fill important gaps and provide new insights that can help
distinguish your value in your marketplace.

Join us to hear details of successful pilots that have been conducted as well as ongoing case studies.

INTRODUCTION
As the Internet of Things (IoT) was gaining momentum in industries such as manufacturing, insurance,
travel and transportation, the healthcare and life science industries were still trying to figure out how to
leverage real world data (RWD) such as claims and electronic health records.
Now that RWD has been firmly embraced, it is time to explore the benefits of IoT to healthcare and life
science companies and ultimately to the patient, clinician and caregiver.

REAL WORLD DATA GAP


Real world data provides insight into a patient at a point in time and is based on provider/patient
encounters, such as a doctor visit or the filing of a claim. What it does not provide is information about
the patient in his or her normal life setting.
Claims show what assessments and treatments were actually performed, billed, and paid but they are not
always clear about the context and are not always accurate and interpretable. EHRs are a significant
evolution in healthcare data but they are only as good as the degree to which information is accurately
and reliably entered. There are fields that contain free text, like physician notes, that need to be
accurately captured and organized to be useful.
So the question remains: What happens to the patient between doctor visits?

ADVANTAGES OF USING DATA FROM IOHT


The Internet of Healthcare Things (IoHT) can be a valuable tool to differentiate medicines in the marketplace and drive greater commercial
value. It also carries greater weight in regulatory and payer decision-making. Advantageous features of
the data include:
1. Direct objective measurements rather than self-reported, physician-reported, or observational

2. Collected in the “usual” patient setting, rather than an artificial clinical setting which may not reflect
accurate readings or patterns
3. Collected continuously, which is a richer source of data to reveal variability and patterns over time
4. Informing the provider as to what occurs between encounters

CONCLUSION
Integrating IoHT data and conducting robust, advanced analytics on the data can provide immediate
competitive advantage. The data, by itself, has no business value unless it provides decision-making
insights. That is why the Analytics of Healthcare Things (AoHT) provides a real differentiator for
companies leveraging Real World Data (RWD).

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Joy King
Teradata Corporation
(919) 696-6067
joy.king@teradata.com
www.teradata.com

