
Building the Internet of Things

Early learnings from architecting solutions focused on predictive maintenance
Authors
Martijn Hoogendoorn, Architect, Applied Incubation, Microsoft
Mark Kottke, Architect, Applied Incubation, Microsoft
Intended audience
This white paper is aimed at technical decision makers, solution architects, and developers.
Abstract
This white paper provides a comprehensive overview of lessons learned from the
authors' experiences in implementing large scale customer projects that target
predictive maintenance as a space in IoT. It frames various elements and
considerations of importance within the Internet of Things, highlighting tradeoffs,
opportunities and grounding the implementation activities using a reference
architecture and an associated comprehensive cost model.
Acknowledgments
The authors would like to thank the following people, who contributed to, reviewed, and
helped improve this white paper.
Contributors
Marc Mercuri, Principal Program Manager, Azure Customer Advisory Team, Microsoft
Clemens Vasters, Principal Program Manager, Azure Application Platform, Microsoft
Reviewers
Arno Harteveld, Architect, Client Solutions, Microsoft
Carolina Piavis, Director Business Programs, Applied Incubation, Microsoft
Ray Stephenson, Director, Applied Incubation, Microsoft
Mani Subramanian, Senior SDET, Patterns & Practices, Microsoft
Version
1.2

The information contained in this document represents the current view of Microsoft Corporation on the issues
discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should
not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of
any information presented after the date of publication.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS
DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or
transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any
purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights
covering subject matter in this document. Except as expressly provided in any written license agreement from
Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights,
or other intellectual property.
The descriptions of other companies' products in this document, if any, are provided only as a convenience to you.
Any such references should not be considered an endorsement or support by Microsoft. Microsoft cannot guarantee
their accuracy, and the products may change over time. Also, the descriptions are intended as brief highlights to aid
understanding, rather than as thorough coverage. For authoritative descriptions of these products, please consult
their respective manufacturers.
© 2014 Microsoft Corporation. All rights reserved. Any use or distribution of these materials without express
authorization of Microsoft Corp. is strictly prohibited.
Microsoft and Windows are either registered trademarks of Microsoft Corporation in the United States and/or other
countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.


Table of contents
Executive summary
IoT and predictive maintenance
  The Internet of Things
    Business value
    Megatrends
    Technology enablers
    Standardization efforts
  Predictive maintenance
Predictive maintenance scenarios
  Healthcare
  Automotive
  Manufacturing
Architectural considerations
  Connectivity
    Interaction patterns
    Connectivity pathways
    Connectivity network types
  Protocol choices
    Transport-layer protocol choices
    Transport-layer protocol security
    Application-layer protocol choices
  Security
    Virtual Private Networks
    Compliance
  Device communication patterns
    NAT-based device network
    IPv6 direct-addressing device network
    NAT-based, PAN device network
    Generic concerns with direct addressing
    Service-assisted communication
  Designing for scale
    Communication and ingestion
    Data storage scalability
  Device registration


  Acquiring data
    Message size and format
    Message types
    Message priority
    Conditional messaging
    Contextual messaging
    Message batching
    Bandwidth and scale
  Storing information
    Storing data on the device
    Transforming data
    Location
    Longevity, format, and cost
  Processing information
    Alarm processing
    Complex-event processing
    Big Data analysis
    Machine learning
    Data enhancement
  Publishing insights
    Audience
    Publishing format
Cost modeling and estimation
  Common architecture overview
  Capacity modeling
  Cost estimation
    Ingress path cost
    Egress path cost
    Management cost
  System processing cost
  Cost estimate calculation
Strategic choices
  Buy, build, or hybrid
Important topics not yet covered
  Networks with automatic handover and fallbacks
  The need for the commoditization of devices
  The creation and use of information marketplaces


  Management solutions
  The redefinition of SLAs
  Integration simplicity
Conclusions
How Microsoft can help you succeed


Executive summary
For decades, technology experts have anticipated the Internet of Things (IoT): the
proliferation of tens of billions of connected devices that contain embedded microchips, and
the rise of machine-to-machine and service-to-service communications. IoT will make
inanimate objects, networks, and processes smart, from tiny components,
appliances, machines, homes, buildings, and factories to energy grids, transportation
networks, and logistics systems. It's a game-changing opportunity in IT. By analyzing the
vast new streams of data, and by harnessing the precise control that IoT provides, your
organization can reduce costs, create new revenue streams, increase customer satisfaction
and retention, spot trends faster, gain from opportunities more easily, and innovate with
agility. IoT will be especially beneficial in predictive maintenance: performing maintenance
at the right time to predict and prevent failures.
To take full advantage of IoT opportunities in predictive maintenance, you need to think
strategically about the many elements of IoT. For example, you should consider connectivity
pathways and types, transport-layer and application-layer protocol choices, device
interaction and communication patterns, and how to design for the vast scale of IoT. It is
especially critical to understand the complex issues of data security and regulatory
compliance, which can expose the enterprise to legal difficulties if they are not handled
properly. You also should think about how the enterprise's communications systems will
ingest data, including message types, sizes, formats, and priorities; conditional and
contextual messaging; message batching; bandwidth; and how to scale a messaging system.
Another pivotal set of questions relates to the data: where will data be stored, how
will it be distributed or potentially sold, how long must it be kept, in what format,
and at what cost? What is the most efficient way to analyze Big Data, and how can you best
take advantage of possibilities such as alarm processing, complex-event processing, Big Data
analysis, machine learning, and data enhancement? Because data that seems at first
uninteresting can be very valuable to the right audience, how do you find that audience to
monetize the insights gained from processing it?
The elements that are needed for security, communication, and scale in an IoT solution
make it very challenging to build one from scratch. Succeeding with any IoT solution will
very likely require implementing a reference architecture that can help accelerate
the use of massive data from millions or even billions of devices. Modeling the system's
capacity to scale, and calculating the costs of doing so for related aspects, such as ingress
(device to cloud) and egress (cloud to device, cloud to system) paths and system
processing, is paramount. Depending on the company's background, a classic "buy vs. build
vs. hybrid" decision should be made, based on what you are already using, what is available,
and what will be available in the near future at a price that is acceptable to your business.
This white paper introduces and describes all of these considerations and provides you with
the tools necessary to estimate the operational cost of an implemented reference
architecture in production.
With the Microsoft Azure platform, Microsoft offers a broad set of building blocks to help you
get an IoT solution up and running quickly.


IoT and predictive maintenance
At Microsoft, we hear constantly from customers who say that the Internet of Things (IoT) is
one of the most exciting trends in IT.[1] Many of our customers are interested in deploying
sensors and devices in every part of their businesses in order to capture information from
the physical world and act upon the knowledge gained from refining it. They will buy or build
systems that can deliver these capabilities in order to optimize their bottom line, keep
customers satisfied, and explore new revenue potential.
Predictive maintenance is an IoT scenario in which a device provides data that enables
insightful, proactive maintenance before a likely failure takes place. Predictive
maintenance offers a new revenue stream for device manufacturers, and it is very
interesting to their customers because it enables better business continuity, which usually
generates extra revenue. In this way, the cost of a new service from a device manufacturer
is justifiable to customers, given the cost and impact of unplanned device downtime.

The Internet of Things


An expert in radio-frequency identification named Kevin Ashton first used the term "the
Internet of Things" in 1999,[2] although the idea had been around for at least a decade
before that. As with many terms in technology, IoT is a loaded term that people interpret
differently depending on their viewpoint and purpose. For example, Gartner defines it as "The
network of physical objects that contain embedded technology to communicate and interact with
their internal states or the external environment."[3] Formulating this differently:
The Internet of Things is a metaphor for a set of systems in which direct human
intermediation is dramatically reduced by equipping distributed systems with sensors
that let us acquire information, make decisions, and control things in the physical
world.

[1] Microsoft, What Our Customers Are Saying: Top Enterprise Trends of 2014, Susan Houser
[2] Wikipedia, Internet of Things
[3] Gartner, IT Glossary, Internet of Things


Figure 1. Foundational activities, composable within and between devices and systems

Based on this definition, IoT consists of a set of four composable activities:

Acquiring data. Using sensors to record information about the physical world.
Examples include measuring location, humidity, temperature, light, heart rate, blood
pressure, brain waves, current, and gas detection.

Processing information. Taking action based on data captured and on contextual
information retrieved previously or sourced from other systems. This processing could
involve using actuators that can alter the state of the physical world, such as opening
valves, switching machines on and off, sounding alarms, controlling servos, closing
doors, and many other things.

Storing information. To enable trend analysis, forecasting, and insight-driven decision
making, historical information and context are needed. Storing the information retrieved in
its contextual form (for example, including the location where it was captured, the date
and time it was captured, the state of the system at the time it was captured, and so on)
is critical for this process.

Publishing insights. When embedded sensor data is combined with both internal and
external data from other systems, analyzing the combined data can yield additional
insight to act upon. Exposing that insight can also drive additional value for other
stakeholders outside the immediate needs of the current system, allowing for the
monetization of this knowledge.
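
The four activities above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not part of the reference architecture: the Reading type, the function names, the device and location identifiers, and the 80 °C limit are all hypothetical, and an in-memory list stands in for a real data store.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class Reading:
    """A sensor value kept together with the context it was captured in."""
    device_id: str
    location: str
    captured_at: datetime
    temperature_c: float


def acquire(device_id: str, location: str, temperature_c: float) -> Reading:
    # Acquiring data: on a real device the value would come from a sensor driver.
    return Reading(device_id, location, datetime.now(timezone.utc), temperature_c)


def process(reading: Reading, limit_c: float) -> str:
    # Processing information: decide on an action; an actuator call would go here.
    return "open_cooling_valve" if reading.temperature_c > limit_c else "none"


def publish(history: List[Reading]) -> Dict[str, float]:
    # Publishing insights: expose an aggregate rather than the raw telemetry.
    temps = [r.temperature_c for r in history]
    return {"samples": len(temps), "mean_temperature_c": sum(temps) / len(temps)}


history: List[Reading] = []  # Storing information: in-memory stand-in for a store
actions: List[str] = []
for value in (71.0, 74.5, 82.5):  # hypothetical sample values
    reading = acquire("pump-17", "plant-a/line-3", value)
    history.append(reading)
    actions.append(process(reading, limit_c=80.0))

insight = publish(history)
```

Only the third sample exceeds the assumed limit, so only it triggers an action, while the published insight summarizes all three stored readings.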

In addition to familiar devices, such as phones for input and presentation, a set of core
components is needed to support those activities, though business goals and technical
constraints will determine which of them are required. Core components may include:

Sensors: the components that translate a value from the physical world into bits.
Examples include sensors that measure pressure, humidity, heart rate, gas levels, and
acceleration.

Devices: networked, physical, special-purpose systems that emit telemetry data, accept
external information, request external information, and execute remotely-issued
commands. Examples include factory floor equipment, environmental pollution sensors,
and control modules in vehicles.


Bridges: systems that act as communication brokers between a device and a gateway,
typically by translating data traffic between different link protocols or methods, for
instance between short-range and long-range wireless protocols. A bridge can also be a
connectivity infrastructure that manages a nationwide or worldwide wireless network on
one side, and a bridge to a cloud system on the other. A bridge might also perform
intelligent preprocessing of data, or act as an autonomous local communications hub in
addition to its bridging function relative to a cloud system.[4] Bridges are often also
referred to as gateways, but we reserve the term "gateway" for a network-based service
with which a bridge communicates.

Gateways: network-based services that manage connectivity and connections with
devices either directly or through bridges. The service establishes a trusted
communication relationship with a device, deals with ingestion and routing of telemetry
data, and provides access to command and notification data destined for the device. On
top of these services, it provides data pipeline processing, possibly including
transformation, complex-event processing capabilities, data analytics components,
machine learning, and so on.

Machine learning: computational algorithms that can analyze large amounts of data and
extract patterns from it, helping a system learn from that data and drive more
intelligent system responses in the physical world.

Interconnections: different systems sharing learnings and data that in turn form
composite systems.
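
To make the division of roles between devices, bridges, and gateways more concrete, the following minimal sketch shows a bridge that preprocesses local readings and a gateway that ingests only messages from devices it trusts. The class names, message fields, and batch size are assumptions made for illustration; a plain set of device IDs stands in for a real trust relationship.

```python
from statistics import mean
from typing import Dict, List, Set


class Gateway:
    """Network-side service: accepts telemetry only from registered devices."""

    def __init__(self, registered_devices: Set[str]):
        self.registered_devices = registered_devices
        self.ingested: List[Dict] = []

    def ingest(self, message: Dict) -> bool:
        if message["device_id"] not in self.registered_devices:
            return False  # no trusted relationship with this device
        self.ingested.append(message)
        return True


class Bridge:
    """Local hub: collects short-range readings and forwards one summary per batch."""

    def __init__(self, gateway: Gateway, batch_size: int):
        self.gateway = gateway
        self.batch_size = batch_size
        self.buffer: Dict[str, List[float]] = {}

    def on_local_reading(self, device_id: str, value: float) -> None:
        values = self.buffer.setdefault(device_id, [])
        values.append(value)
        if len(values) >= self.batch_size:
            # Preprocess: forward the batch average instead of every sample.
            summary = {"device_id": device_id, "avg": mean(values), "n": len(values)}
            self.gateway.ingest(summary)
            self.buffer[device_id] = []


gateway = Gateway(registered_devices={"sensor-1"})
bridge = Bridge(gateway, batch_size=3)
for value in (20.0, 21.0, 22.0):  # hypothetical readings
    bridge.on_local_reading("sensor-1", value)
    bridge.on_local_reading("sensor-9", value)  # not registered with the gateway
```

In this sketch the bridge reduces six local readings to two forwarded summaries, and the gateway keeps only the one that comes from a registered device.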

We have read thought-provoking papers about IoT. Two that we found especially valuable in
providing context to the concepts and opportunities of IoT are:

Recommendations for the Strategic Initiative INDUSTRIE 4.0.[5]

Industrial Internet: Pushing the Boundaries of Minds and Machines, a European
Perspective.[6]

IoT enables you to build, enhance or extend a business model based on data-driven insights
from pervasive sensors that help you optimize resource use and reduce cost and
environmental impact. IoT also helps you maintain a closer relationship with customers
beyond the point of sale of physical products by enabling contextual, remote actions
automatically and intelligently. Examples include remote servicing, proactive sales, best-practices guidance, and more.

[4] Microsoft, How Microsoft tech is helping affordable housing tenants save money (section on Captain)
[5] Deutsche Akademie der Technikwissenschaften, Final report of the Industrie 4.0 Working Group
[6] General Electric, Industrial Internet: Pushing the Boundaries of Minds and Machines, a European Perspective


Business value
At least 26 billion devices will be connected to the Internet by 2020, and organizations in
every sector will use them.[7] Billions of connected devices will help businesses to:

Reduce cost. Businesses can use the increased insight into manufacturing and delivery
processes to optimize those processes and reduce cost. For example, they can reduce the
number of visits a technician must make by scheduling service visits based on
duty cycles and expected product lifespans informed by actual usage.

Create new revenue streams. The ability to sense from and actuate in the
physical world is giving rise to new business models. Businesses can capitalize on these new
opportunities and create new, innovative revenue streams. Some examples would be
monetizing newly collected datasets, offering APIs to create new business partnerships,
increasing service revenue by notifying and offering improved convenience to customers,
offering differentiating SKUs based on usage patterns, supplying optimized configuration
services, and so on.

Increase customer satisfaction and retention. Knowing how customers use
physical products creates opportunities to extend the customer experience into
scenarios of higher value, and to retain and extend the customer base. Capturing data on
how customers actually use products, and ensuring that they do not experience
frustrating service issues, helps companies retain customers.
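
As a worked example of the cost-reduction point above, a service visit can be scheduled from observed duty cycles rather than from a fixed calendar interval. The sketch below is a simplified illustration: the function name, the rated cycle count, the usage rates, and the 10 percent safety margin are all assumed values, not figures from a real product.

```python
def days_until_service(rated_cycles: int, cycles_used: int,
                       cycles_per_day: float, safety_margin: float = 0.1) -> float:
    """Days until a component reaches its service point at the observed usage rate."""
    # Service slightly before the rated lifespan is reached.
    service_point = rated_cycles * (1.0 - safety_margin)
    remaining_cycles = max(service_point - cycles_used, 0.0)
    return remaining_cycles / cycles_per_day


# Two identical machines with the same history but different current usage.
light_use = days_until_service(rated_cycles=100_000, cycles_used=30_000, cycles_per_day=200)
heavy_use = days_until_service(rated_cycles=100_000, cycles_used=30_000, cycles_per_day=1_200)
```

With these assumed numbers, the lightly used machine needs a visit in roughly 300 days and the heavily used one in roughly 50 days, so a single fixed interval would either waste visits on the first machine or come too late for the second.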

In the blog post "10 reasons businesses need a strategy for the Internet of Things now,"[8] the
author identified a concise set of benefits that a company can realize by adopting an IoT
strategy.

Megatrends
The world faces many challenges, such as changes in wealth distribution, resource scarcity,
and an aging population in developed countries. The authors of the book From
Machine-to-Machine to the Internet of Things: Introduction to a New Age of Intelligence
analyzed these megatrends and capabilities in detail.[9] They found that these megatrends
are driving a proliferation of embedded devices with sensors, which in turn require new
capabilities for new market scenarios, as the graphic below shows.

[7] Gartner, Gartner says the Internet of Things Installed Base Will Grow to 26 billion units by 2020, December 2013
[8] Microsoft, 10 reasons businesses need a strategy for the Internet of Things now
[9] From Machine-to-Machine to the Internet of Things: Introduction to a New Age of Intelligence, ISBN 978-0124076846


Figure 1. "Megatrends." From Machine-To-Machine to the Internet of Things: Introduction to
a New Age of Intelligence. Amsterdam, Netherlands, Elsevier, January 2014.
Of the megatrends listed in the previous figure, we want to explain how some relate to the
Internet of Things:

Natural resource constraints. The world population is growing at a high rate, with a
projected peak population of 9.22 billion in 2075.[10] Given this growth and the impact it
has on the growth of the worldwide economy, the world will increasingly have to do more
with less and optimize the way it produces goods. IoT can support the optimization of
production, loss reduction, and the efficiency of the necessary supply chain.

Economic shifts. Much like the shift in IT from packaged products to as-a-service
solutions, the global economy is moving from a product-oriented to a service-oriented
perspective.[11] For a viable service-oriented economy to come into existence, it needs to
be supported by a large set of devices that give the system context about the customer
environment, so that it can offer the right service, at the right price, and at the right time.

Changing demographics. As the world population ages, especially in more-developed
countries, smart solutions will be needed to help elderly people remain self-supporting.

[10] United Nations, Economic & Social Affairs, World Population to 2300
[11] Wikipedia, Service economy


Climate change. The impact of human activities on the environment, although debated
at length, is detrimental to the sustainability of the world. In recent years, there has been
a growing movement toward green technologies and services, ranging from electric cars to
corporate and government policy changes. IoT can be a supporting factor both in
providing insight into environmental footprints and in reducing them.

Technology enablers
The ever-decreasing cost and size of components, such as accelerometers, Wi-Fi radios,[12]
GPS, microcontrollers, and Bluetooth radios, are also enabling the Internet of Things (IoT).
This trend allows components and devices to be used in new settings, such as wearables,
on-person devices, and even smaller equipment.
As shown in Figure 1, IoT depends on several other major technologies and trends. Some of
these technology enablers as well as others warrant clarification:

Ubiquitous connectivity. Low-powered wireless networking enables devices to talk to
a gateway, among each other, or directly to the outside world. A foundation for IoT
implementations, connectivity must be managed carefully. To learn more, see the
Connectivity section in this paper.

Cloud computing. For systems that connect hundreds of millions of devices, cloud
computing is the technology that allows for vast scale and acceptable costs, providing
the ability to store large amounts of machine-generated data at low cost and perform Big
Data analytics and machine learning.

Small, low-power, low-cost microcontrollers. Microcontrollers today can perform
tasks at very low power and have a battery life of many years.[13] For example, the Texas
Instruments MSP430 runs at less than 100 µA/MHz and can operate on a single coin
battery for more than 20 years.[14] (Device battery life always depends on components
and application cycle use.) The memory embedded in this microcontroller is ferroelectric
random-access memory (FRAM), an improvement on flash memory that offers very high data
throughput at a power consumption three times lower than flash memory and 99 percent
lower than comparable dynamic random-access memory (DRAM).

Power supply and storage technologies. Given the tiny size of many new devices,
their deployment location, and the vast number of them that will be deployed, changing
batteries is often impractical or impossible. Besides optimizing hardware design for these
scenarios,[15] enhancing circuitry by limiting its quiescent current (Iq) will further
improve battery life. Also, with energy harvesting techniques, such as solar power
supplies, devices can recharge their built-in batteries as long as there is a minimal
charge left.

[12] For example, a network chip for less than $10 for 1,000 units. Texas Instruments, SimpleLink Wi-Fi Module CC3000
[13] maxEmbedded, What is a microcontroller? And how does it differ from a microprocessor?
[14] Texas Instruments, MSP430 documentation
[15] Texas Instruments, Using power solutions to extend battery life in MSP430 applications

Embedded operating system platforms. With the vast number of devices that will be
installed, cost and energy consumption per device become decisive. Engineers will
create devices that cost less and that are more energy-efficient, even if they have limited
processing capabilities and memory. CPU cycles spent and memory allocated will
become important factors for choosing operating system platforms, installed
components, and security configurations. There is a plethora of good general-purpose
operating systems, ranging from Windows Embedded and Embedded Linux to real-time
operating systems, such as FreeRTOS, ThreadX, Integrity, Nucleus, QNX, Atomthreads,
AVIX-RT, ChibiOS/RT, ERIKA Enterprise, TinyOS, Thingsquare Mist/Contiki, and others.[16]

In sum, IoT is gaining momentum because of growing customer and enterprise needs
meeting technology enablers at the right cost.
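
The multi-year battery life mentioned for low-power microcontrollers follows from simple arithmetic once duty cycling is taken into account. The sketch below is an idealized estimate under assumed values: a coin cell of about 225 mAh, a 1 MHz clock drawing 100 µA while active, a 1 µA sleep current, and a device that is awake 0.1 percent of the time. It ignores battery self-discharge and the consumption of radios and sensors, so real figures will be lower.

```python
HOURS_PER_YEAR = 24 * 365


def average_current_ua(active_ua: float, sleep_ua: float, duty_cycle: float) -> float:
    """Time-weighted average of active and sleep current, in microamperes."""
    return active_ua * duty_cycle + sleep_ua * (1.0 - duty_cycle)


def battery_life_years(capacity_mah: float, current_ua: float) -> float:
    """Idealized battery life: capacity divided by average current draw."""
    hours = capacity_mah * 1000.0 / current_ua  # mAh -> uAh, divided by uA
    return hours / HOURS_PER_YEAR


CAPACITY_MAH = 225.0  # assumed coin-cell capacity

# Never sleeping: 1 MHz at an assumed 100 uA/MHz.
always_on = battery_life_years(CAPACITY_MAH, 100.0)

# Awake 0.1% of the time, otherwise sleeping at an assumed 1 uA.
duty_cycled = battery_life_years(
    CAPACITY_MAH, average_current_ua(active_ua=100.0, sleep_ua=1.0, duty_cycle=0.001)
)
```

Under these assumptions the always-on device lasts about three months, while the duty-cycled one lasts a little over 23 years, which shows why sleep current and duty cycle matter more than active current for long-lived devices.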

Standardization efforts
Throughout the world, many organizations are working on the standardization of IoT, based
on specific technology or holistically on reference architectures. Examples of this work
include:

ITU-Telecom (ITU-T), Internet of Things Global Standards Initiative (IoT-GSI).[17]

European Union, Internet of Things Architecture (IoT-A).[18]

In addition to these efforts, there is a lot of work going on in depth in many different
technology areas, such as the standardization of protocols. Protocol choices, both at the
transport as well as the application layer, are discussed later in this document.

Predictive maintenance
This white paper focuses on a common scenario IoT enables that we call predictive
maintenance: performing maintenance with a focus on timeliness, acting exactly when
needed instead of at regular intervals, and predicting and preventing failures before they
happen, based on learning from historical data. Predictive maintenance, or just-in-time
maintenance, will massively transform how organizations and consumers manage
equipment as well as people. Predictive maintenance also informs more traditional
preventative maintenance patterns, optimizing routine maintenance activities.

16 For a comprehensive list, see http://en.wikipedia.org/wiki/List_of_realtime_operating_systems


17 ITU-T, Internet of Things Global Standards Initiative
18 European Union, Internet of Things Architecture


Predictive maintenance scenarios
The potential for useful applications in the Internet of Things (IoT) is endless. This section
focuses on scenarios that illustrate concrete benefits based on predictive maintenance,
where maintenance can be performed on both inanimate and living things. The scenarios
that follow provide examples of the enormous potential that IoT holds for
enterprises.

Healthcare
With the previously described change in world demographics, there is an
increasing need for remote patient management, allowing elderly
citizens to only come to the doctor or the hospital when the need arises,
based on telemetry captured by smart devices. Some early innovation in
this space, more geared toward health self-management and consumer
devices, can be seen in watches with sensors that collect a variety of data,
such as blood pressure and heart rate. When body temperature, oxygen levels, and CO2
levels are combined with the ability to display this data to the patient and physician in real
time, this alleviates the stress of full waiting rooms and reduces the cost per patient.19
Another example is an in-home glucose monitor that uploads a patient's vital signs to a
cloud-based health platform, where the data is analyzed and presented back to the patient
in an easy-to-understand format on a mobile device, and in a more complex format on a
touchscreen to the doctor. The doctor can review the patient's information and then use the
touchscreen to send feedback to the patient and write a prescription.20
Powerful, specialized, cloud-connected devices like these that enable doctors and patients to
work together to remotely monitor vital signs, exchange information, communicate, and
alert relatives, all in real time, are either becoming available or in development. By actively
monitoring patients at home21,22 or while they are mobile, healthcare professionals can
provide a higher level of care, reduce in-hospital waiting time and costs, and reduce stress
for everyone involved, which leads to better patient outcomes. Using technology to
accurately predict and signal medical staff about conditions that need attention enables
19 Samsung Simband aims to take a big step in wearable health,
www.cnet.com/products/samsung-simband/
20 Microsoft HealthVault Medical Intelligent System, www.youtube.com/watch?v=j8Y4ukdNM60
21 Medical Design Technology Magazine, The Internet of Things and Medical Device Product
Development: Practical Strategy Suggestions, March, 2014
22 YouTube, Medical Intelligent System, Proof of Concept


healthcare professionals to anticipate patient issues instead of reacting to them, and remedy
them before they become critical, all while maintaining the security and privacy of the data
collected from such technologies.23 As a positive side effect, the collected evidence of
provided care could also help alleviate the issue where doctors in the U.S. are sometimes
reluctant to provide prescriptions or diagnoses over the phone because of billing
restrictions,24 which forces patients to visit the office of the healthcare provider and, as a
result, wastes a lot of everyone's time on the treatment of common or recurring ailments.

Automotive
Vehicles generate telemetry about their operation, and about the service
activities and faults that occur on them. They travel through different
locations, different weather conditions, and different usage scenarios: a
four-wheel drive vehicle climbing trails, a sports car in the mountains, or a
family van loaded with children. Each of these factors can have an effect
on how the vehicle operates, as well as its reliability, comfort, safety, and
performance. If the vehicle manufacturer or a vendor-agnostic data aggregator/analyst can
collect this data and analyze it over time, trends can be identified to find new, timelier, and
more cost-effective and impactful actions to take. These can include maintaining or
reconfiguring the vehicle, which in turn can help to prevent recalls or, conversely, trigger
recalls to keep the vehicle safe, and make it more fun, useful, and cost-effective for everyone
involved, including the owner, the operator, and the passengers.

Manufacturing
A service technician is dispatched to analyze an elevator after someone
reports that its doors will not close. The building owner is hearing from
people who are unhappy that they have to walk up the stairs. It takes the
technician an hour to drive to the building and find the elevator. After
arriving, he works through a standard checklist for another hour, only to
conclude that the elevator works as expected. As so often happens, a
fleeting obstruction, such as a coffee cup between the doors of the elevator or accumulated
dust and dirt in the sliding rail might have caused the problem.
The service technician drives back to his office, having spent a total of three hours on a
phantom problem. At $150 (USD) per hour and with more than one million elevators in
service, incidents such as this one, where equipment is found to be operating normally upon
inspection, can have a big impact on the profitability of an elevator company,
depending on the type of maintenance contract.
Moving beyond this reactive maintenance illustration, capturing telemetry about the motors
that operate the elevator or the speed at which the elevator doors close allows the
technician to take a more predictive approach. For example, an increase in the consumption
of energy or a decrease in the door closing speed might signal a service request, and trigger
23 Deloitte, Networked medical device cybersecurity and patient safety: Perspectives of
health care information cybersecurity executives
24 Texas Medical Association, Coding for Telephone Consultations


a maintenance crew to provide the service before the elevator breaks down and customers
call support, thus saving money, reducing downtime, and increasing customer satisfaction.
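
The elevator example can be sketched as a simple check on door telemetry. This is a minimal illustration only: it swaps the learned model that a real predictive maintenance system would use for a fixed tolerance over an early baseline, and the function name, the choice of door closing time in seconds (a slower door takes longer to close), the baseline window, and the 20 percent tolerance are all assumptions made for the sake of the example.

```python
from statistics import mean


def needs_service(closing_times_s, baseline_count=5, tolerance=0.20):
    """Flag a service request when the latest door closing time exceeds
    the baseline mean by more than the given tolerance.

    closing_times_s: chronological door closing times, in seconds.
    baseline_count:  number of earliest readings that form the baseline.
    tolerance:       allowed relative increase over the baseline (0.20 = 20%).
    """
    if len(closing_times_s) <= baseline_count:
        return False  # not enough history to compare against a baseline
    baseline = mean(closing_times_s[:baseline_count])
    latest = closing_times_s[-1]
    return latest > baseline * (1.0 + tolerance)


healthy = [3.0, 3.1, 2.9, 3.0, 3.0, 3.1]
degrading = [3.0, 3.1, 2.9, 3.0, 3.0, 3.9]
```

With these hypothetical readings, `needs_service(healthy)` stays false while `needs_service(degrading)` becomes true, which is the point at which a maintenance crew could be scheduled before the door fails outright.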


Architectural considerations
Designing any system reveals concerns that transcend the individual components of the
system. In this section, we discuss various considerations and architectural approaches that
we have encountered while helping our customers design solutions in the realm of predictive
maintenance.

Connectivity

Figure 2. An overview of network layers and mapped logical protocols


A key technical enabler of the Internet of Things (IoT) is ubiquitous connectivity. Let's first
look at the Open Systems Interconnection (OSI) model.25 Even though the Internet model
uses a simplified abstraction, the models in the previous figure and the associated
well-known logical protocols are comparable.
Application-layer protocols are not concerned with the lower-level layers in the stack other
than being aware of the key attributes of those layers, such as IP addresses and ports. The
right side of the figure shows the logical protocol breakdown transposed over the OSI model
and the TCP/IP model.

Interaction patterns
Special-purpose devices differ not only in the depth of their relationship with back-end
services, but in the interaction patterns of these services when compared to
information-centric devices because of their role as peripherals. They are not the origin of
command-and-control gestures; instead, they typically contribute information to decisions, and receive
commands as a result of decisions. The decision-maker does not interface with them locally,
and the device acts as an immediate proxy; the decision-maker is remotely connected and
25 Wikipedia, OSI Model


might be a machine. We usually classify interaction patterns for special-purpose devices into
the four categories indicated in the following figure.

Figure 3. Device communication patterns

Telemetry is information flowing in one direction that a device volunteers to a collecting
service, either on a schedule or based on circumstances. That information represents the
current or temporally aggregated state of the device or the state of its environment,
such as readings from sensors that are associated with it.

Notifications are one-way, service-initiated messages that inform a device or a group of
devices about some environmental state that they would otherwise not be aware of. For
example, wind parks can be fed weather forecast information, and cities can broadcast
information about air pollution, suggesting that fossil-fueled systems throttle their CO2
output. Similarly, vehicles may want to show weather or news alerts or text messages to drivers.

Inquiries occur when a device solicits information about the state of the world beyond
its own reach based on its current needs; an inquiry can be a singular request, but it
might also ask a service to supply ongoing updates about a particular information scope.
For example, a vehicle might supply a set of geo-coordinates for a route, and then ask
for continuous traffic alert updates about a particular route until it arrives at the
destination.

Commands are service-initiated instructions sent to either a single device or a group of
devices. Commands can tell a device to provide information about its state, or to change
the state of the device, including activities with effects on the physical world. That
includes, for instance, sending a command from a smartphone app to unlock the doors of
your vehicle, whereby the command first flows to an intermediating service and then
from there is routed to the vehicle's onboard control system.
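
The four patterns described above can be captured in a small lookup that records which side initiates each one. This is a minimal sketch; the type and function names are hypothetical and exist only for illustration.

```python
from enum import Enum


class Initiator(Enum):
    DEVICE = "device"
    SERVICE = "service"


class Pattern(Enum):
    TELEMETRY = "telemetry"
    INQUIRY = "inquiry"
    COMMAND = "command"
    NOTIFICATION = "notification"


# Which side starts the exchange for each interaction pattern.
INITIATED_BY = {
    Pattern.TELEMETRY: Initiator.DEVICE,
    Pattern.INQUIRY: Initiator.DEVICE,
    Pattern.COMMAND: Initiator.SERVICE,
    Pattern.NOTIFICATION: Initiator.SERVICE,
}


def needs_path_to_device(pattern):
    """Service-initiated patterns require the service to be able to reach the device."""
    return INITIATED_BY[pattern] is Initiator.SERVICE
```

Making the initiator explicit in this way is useful when deciding which patterns force the solution to solve device addressing and reachability.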

Telemetry and inquiries are device-initiated, and their counterparts, commands and
notifications, are service-initiated. This means that there must be a network path for
messages to flow from the service to the device, which bubbles up a set of important
technical questions. How do you:


Address a device on a network when it is roaming or if it is power-constrained and duty
cycling the radio to conserve energy?26, 27

Send commands or notifications with acceptable latency for a given scenario?

Ensure that the device only accepts legitimate commands and trustworthy notifications?

Ensure that the device is not easily susceptible to denial-of-service (DoS) attacks that
render it inoperable?

Perform this with millions of devices attached to a telemetry-and-control system?

Connectivity pathways
In the architectures that we have worked on, there are four
common connectivity pathways:

Peer-to-peer: A method of communication between devices
of a system without the use of a centralized administrative
system. The peers in the network can exchange information
and communicate only the necessary information back to the
system. Besides providing the ability to create specific case
and self-organizing networks of devices, this method of
communication enhances the capabilities of the system:
nodes can work together to become smarter. The
disadvantages for smart systems in this type of inter-device
communication are the lack of centralized control and the
impact it has on the security of the system. It also requires a
higher level of logic (intelligence) for some or all peers to
use peer-to-peer communication.

Device-to-service: A device that communicates to a
supporting back-end in the system, often called the service.

Figure 5. Communication styles

Service-to-device: A service that communicates to a device; the opposite of the
previous connectivity pathway.

Service-to-service: Communication between two separate systems, exchanging data to
augment knowledge in the system.

From the work that the authors have done, we have learned that for predictive maintenance
implementations, a bi-directional communication pattern is key to a manageable solution.
The reason for this bi-directional communication ability is to ensure that the system can tell
devices to change the way that they capture telemetry, for example, the rate at which it is
captured or the fidelity of the readings. We have not come across a case where the
requirements were simply to capture data from devices in a one-way communication flow.
Because most systems will need a method of telling devices to capture data at differing

26 Wikipedia, Duty Cycle


27 Georgia State University, ActSee: Activity-Aware Radio Duty Cycling for Sensor Networks
in Smart Environments


frequency or with increased fidelity, we consider a one-way communication flow a subset of
the more common pattern.
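
The need for a return path can be sketched from the device side. In this minimal illustration, a device accepts commands that change how often it captures telemetry and how precisely it reports readings. The command names (`set_interval`, `set_precision`), the dictionary message shape, and the use of rounding as a stand-in for fidelity are assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class TelemetryConfig:
    interval_s: float = 60.0  # seconds between telemetry captures
    precision: int = 1        # decimal places reported; a stand-in for fidelity


class Device:
    def __init__(self):
        self.config = TelemetryConfig()

    def handle_command(self, command):
        """Apply a service-initiated command; return True if it was accepted."""
        name = command.get("name")
        value = command.get("value")
        if name == "set_interval" and isinstance(value, (int, float)) and value > 0:
            self.config.interval_s = float(value)
            return True
        if name == "set_precision" and isinstance(value, int) and 0 <= value <= 6:
            self.config.precision = value
            return True
        return False  # unknown or invalid commands are rejected

    def report(self, raw_reading):
        """Return the reading at the currently configured precision."""
        return round(raw_reading, self.config.precision)
```

A service that suspects a developing fault could, for example, send `{"name": "set_interval", "value": 5}` to capture data more frequently, and restore the default once the situation is understood.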

Connectivity network types


The connectivity type describes how a device and service communicate. The type of
connectivity chosen for a system has broad implications for its architecture. We commonly
see three types of connectivity with different implementations and implications:

Figure 4. The increasing geographical reach of varying network types

Wide area network (WAN). A good example of a WAN is a cellular network. This
network type is a wireless network that is distributed over land areas called cells, each
served by at least one fixed-location transceiver, known as a cell site or base station. In a
cellular network, each cell uses a different set of frequencies from neighboring cells, to
avoid interference and provide guaranteed bandwidth within each cell. When joined
together, these cells provide radio coverage over a wide geographic area. This enables a
large number of portable transceivers (for example, mobile phones and pagers)
to communicate with each other and with fixed transceivers and telephones anywhere in
the network, via base stations, even if some of the transceivers are moving through
more than one cell during transmission.28 The most common cellular network is the type
that cellphones use. Cellphones and many integrated components for devices support
network technologies, such as Global System for Mobile Communications (GSM),
Universal Mobile Telecommunications System (UMTS), and, as an evolution of these technologies,
Long Term Evolution (LTE), as well as others. As with many technologies in the ecosystem

28 Wikipedia, Cellular network


of IoT, there is still a large opportunity for optimization of resource usage and cost for
these technologies.29

Local area network/wireless local area network (LAN/WLAN). A LAN uses
networking technology to connect computers and devices in a limited area, such as a
home, school, computer laboratory, or office building. Unlike WANs, LANs cover a limited
geographic area, and do not include leased telecommunication lines. Ethernet over
twisted-pair cables and Wi-Fi are the two most common technologies used in LANs today.
Though Ethernet 10/100Base-T structured cabling is the basis for many commercial
LANs, fiber-optic cabling is increasingly used in commercial applications. However, cabling is often
inconvenient or impossible to use. With the increasing capability of WLAN devices that
use radio waves based on the Wi-Fi industry standard, WLAN is now the standard for
wireless connectivity. Wi-Fi has a maximum range of about 250 meters outdoors.30 The
connection between different LANs, often owned by a single entity and extending
its range inside a metropolitan area, is referred to as a Metropolitan Area Network (MAN).

Personal area network (PAN). One of the most interesting developments in
networking is the use of a PAN to transmit data among devices, such as computers,
telephones, and personal digital assistants. PANs can be used to communicate among
the devices themselves (intrapersonal communication), or to connect to the Internet.
Until recently, PAN devices could not communicate over IP, so they needed a bridge to
translate between their proprietary protocol and IP. With the introduction and adoption of
IPv6 over low-power wireless personal area networks (6LoWPAN), these devices, which
use developing standards, such as Bluetooth LE,31 will communicate via IP directly and
take a more active role in IoT.
A wireless personal area network (WPAN) is a PAN carried over wireless network
technologies such as the following:
Bluetooth and Bluetooth Low Energy (LE): A wireless technology standard for
exchanging data over short distances (using short-wavelength UHF radio waves in the
ISM band from 2.4 to 2.485 GHz) from fixed and mobile devices, and for building personal
area networks (PANs). Invented by telecom vendor Ericsson in 1994, it was originally
conceived as a wireless alternative to RS-232 data cables. It can connect several
devices, overcoming problems of synchronization. Bluetooth LE32 uses 5 to 10 times
less power than older Bluetooth,33 making it a good fit for certain IoT applications.

29 Ericsson Labs, 4G for IoT


30 Wikipedia, IEEE 802.11, Protocols
31 IEEE, Transmission of IPv6 Packets over BLUETOOTH Low Energy
32 Wikipedia, Bluetooth Low Energy
33 Bluetooth SIG, A look at the Basics of Bluetooth Technology


Z-Wave: A wireless communications protocol designed for home automation to
remotely control applications in residential and light commercial environments. Z-Wave
is licensed through the Z-Wave Alliance.34
ZigBee(-IP): Built on IEEE 802.15.4, the physical layer for low-rate WPANs, ZigBee 35 is
often used to transmit low-powered periodic or intermittent data or a single signal from
a sensor or input device, wireless light switch, electrical meter with in-home-display,
traffic management system, or other consumer and industrial equipment that requires
short-range wireless data transfer at relatively low rates. The new ZigBee IP,36 based
on 6LoWPAN, lets ZigBee devices communicate without a bridge.
For an interesting comparison of power consumption between ZigBee and Bluetooth LE,
see Power Consumption Analysis of Bluetooth Low Energy, ZigBee and ANT Sensor
Nodes in a Cyclic Sleep Scenario.37 Here are some of our observations from this study:
BLE would appear to have an intrinsic disadvantage in a cyclic sleep scenario because
the frequency hopping scheme it uses inherently takes longer to connect compared to
the fixed RF channel used in ZigBee and ANT.
BLE took longer for one connection (1.15 s), than ANT (0.93 s) and ZigBee (0.25 s).
This is because the BLE node was able to sleep for longer between individual RF
packets, improving its duty cycle significantly.
We found that BLE achieved the lowest power consumption, followed by ZigBee and
ANT. The parameters that dominated power consumption were not the active or sleep
currents but rather the time required to reconnect after a sleep cycle and to what
extent the RF module slept between individual RF packets.

Protocol choices
After you have chosen a connectivity type, you need to determine which protocols suit the
purpose of your IoT solution. As you can see in the overview of logical protocols, the term
protocol applies to different layers in the stack, and there are different protocols to choose
from for each layer.

Transport-layer protocol choices


The transport layer provides communication services in the layered architecture of a
network. In the Internet era, two such protocols have emerged as favorites: the
connection-oriented Transmission Control Protocol (TCP) and the connectionless User Datagram Protocol
(UDP). Depending on the environment that an IoT system must function in, the capabilities
of its devices, and how much it must guarantee message delivery, you can choose to
34 Z-Wave alliance, Z-Wave For Developers And OEMs: How To Get Started
35 ZigBee Alliance, ZigBee Specification Overview
36 ZigBee Alliance, ZigBee IP Specification Overview
37 Artem Dementyev, Steve Hodges, Stuart Taylor and Joshua Smith, Power Consumption
Analysis of Bluetooth Low Energy, ZigBee and ANT Sensor Nodes in a Cyclic Sleep Scenario


support either one of these protocols or both of them. The following figure and table provide
an overview of the packet structure and a lightweight comparison of TCP and UDP, as well as
factors to consider before choosing to use either protocol.

Figure 5. TCP and UDP basic packet structures

Table 1. Factors for using TCP vs. UDP


Capability                       TCP                    UDP
Connection type                  Connection-oriented    Connectionless
Reliability                      Full                   None
Protocol overhead                +++                    +
Resource usage                   ++                     +
Broadcast transmission support   no                     yes
Ordering of packets              active                 none
Header size                      20 bytes               8 bytes
Error checking                   Yes, retransmit        Yes, no recovery possible
Acknowledgement                  Yes                    No
Special features                 -                      Broadcasts, multicast

UDP is a good candidate to transmit data from constrained devices over constrained
networks in close proximity, such as LANs or PANs where congestion and packet loss can be
low. The following factors contribute to this:

UDP has very little overhead compared to TCP.

UDP is connectionless, so with no state to maintain, it uses less memory.

UDP transactions require only two datagrams, which reduces network pressure.

UDP has no retransmission delays.
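
The difference in overhead is easiest to see for the small payloads that are typical of sensor readings. The following sketch compares the share of each packet taken up by the transport header, using the header sizes listed in Table 1. It is a simplification: it ignores the IP header, TCP options, and the additional packets TCP needs for connection setup and acknowledgement. The 12-byte payload is a hypothetical example.

```python
TCP_HEADER_BYTES = 20  # minimum TCP header, without options
UDP_HEADER_BYTES = 8


def header_overhead(payload_bytes, header_bytes):
    """Fraction of a transport-layer packet that is header rather than payload."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must not be negative")
    return header_bytes / (header_bytes + payload_bytes)


# Hypothetical 12-byte sensor reading.
udp_share = header_overhead(12, UDP_HEADER_BYTES)
tcp_share = header_overhead(12, TCP_HEADER_BYTES)
```

For a payload this small, the header is 40 percent of a UDP datagram and 62.5 percent of a TCP segment; as payloads grow, the difference between the two becomes less significant.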

On networks with a higher probability of packet loss, TCP, being more reliable and secure, is
a viable candidate. The following factors contribute to this:

TCP supplies reliability, which is especially important in long-haul communications where
there is a high chance of packet loss.


Because TCP is connection oriented, a device that uses TCP can better defend itself
because it can ignore communications unrelated to current connections, whereas a
device that uses UDP must accept every packet it receives on the listening port.

In scenarios that use streaming video or audio, where high throughput is more important
than guaranteed packet delivery, and in telemetry solutions that can tolerate missing segments,
architects often choose UDP because packet loss is often a better tradeoff than experiencing
delays caused by TCP retransmission.38
There are also scenarios where the occasional lost packet on the telemetry channel would
be acceptable, but where commands require guaranteed delivery. This makes
the case for a composite model that uses both UDP and TCP.
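
One way to picture such a composite model is a thin routing layer that sends each kind of message over the transport that matches its delivery requirements. The sketch below is deliberately transport-agnostic: the two sender callables are placeholders that a real system would back with a UDP socket and a TCP connection, and the class name and message kinds are hypothetical.

```python
class CompositeSender:
    """Routes each message kind to a best-effort or a reliable sender."""

    def __init__(self, send_datagram, send_stream):
        self._send_datagram = send_datagram  # best effort, for example over UDP
        self._send_stream = send_stream      # reliable, for example over TCP

    def send(self, kind, payload):
        """Send payload over the channel for its kind; return the channel name."""
        if kind == "telemetry":
            self._send_datagram(payload)
            return "datagram"
        if kind == "command":
            self._send_stream(payload)
            return "stream"
        raise ValueError(f"unknown message kind: {kind!r}")
```

Keeping the routing decision in one place makes it straightforward to revisit later, for example if a class of telemetry turns out to need guaranteed delivery after all.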

Transport-layer protocol security


Another perspective on the choice between UDP and TCP is security. Because UDP is a
connectionless protocol, it lacks the header values that TCP uses for connection
management, such as keeping track of packet ordering (sequence numbers) and packet
delivery (acknowledgment number), shown in Figure 5. UDP is thus a more lightweight
protocol, but its lack of header values also lowers the barrier that an attacker has to
overcome to send false information to the system. The ability of an attacker to just spoof
the sender address on the IP layer instead of also accounting for altering connection
management information demonstrates this vulnerability. In addition to spoofing, UDP is
more susceptible to flooding attacks, where the attacker floods the system with requests,
because of missing flow control39 and subsequent throttling behavior. TCP is also vulnerable
to flooding attacks, but TCP systems can be fairly well secured by using SYN cookies.
In our work with customers, we have seen many who use devices with limited resources (for
example, a 120 MHz microcontroller with 256 KB of SRAM and 2 MB of flash) and successfully use TCP
as a transport protocol, although the stack in embedded systems often needs modification.40
These customers needed reliability for long-haul (direct) communication.

Application-layer protocol choices


From our experience, we have seen three dominant protocols on the rise in this space:

Advanced Message Queuing Protocol (AMQP). AMQP is an open protocol for
message-oriented middleware that JPMorgan Chase developed. There, the same problems of
connecting systems together would crop up regularly. Each time the same discussions
about which products to use would happen, and each time the architecture of some
system would be curtailed to allow for the fact that the chosen middleware was
reassuringly expensive.41

38 Wireshark, Packet loss


39 Wikipedia, Transmission Control Protocol, Flow control
40 Embedded, Reworking the TCP/IP stack for use on embedded IoT devices


The first implementation of AMQP was iMatix OpenAMQ,42 but others have emerged as
well, notably Apache Qpid,43 Microsoft Azure Service Bus,44 and RabbitMQ.45
AMQP is a binary wire protocol that supports programming languages such as C#, C,
Java, Perl, Python, Ruby, PHP, and Lisp.
Where many traditional queuing mechanisms have failed, AMQP seems to be
thriving and is currently used in many systems, such as:46
Aadhaar,47 a large-scale identity system with 1.2 billion identities and about 100 million
authentications per day.48
The National Science Foundation's Ocean Observatories Initiative, processing 3
petabytes per year.49
For more information, see the AMQP site to read the specifications on the protocol
or try a free implementation.

Constrained Application Protocol (CoAP).50 Targeted mostly at
resource-constrained sensors and actuators (devices) such as valves
and switches, this protocol fits the bill for specific purpose networks,
such as Wireless Sensor Networks (WSNs),51 with applications such as
forest fire detection,52 and structural health monitoring.53 CoAP is by
default bound to UDP and optionally Datagram Transport Layer
Security (DTLS), providing communications privacy. With the default
binding to UDP, CoAP supports multicast messaging, allowing for the

Figure 8. Multicast

41 Association for Computing Machinery, Toward a Commodity Enterprise Middleware


42 See http://www.openamq.org
43 Apache, Apache Qpid
44 Microsoft, AMQP 1.0 support in Service Bus
45 See http://www.rabbitmq.com/
46 Amqp.org, Products and success stories, Notable AMQP Users
47 Unique Identification Authority of India, Aadhaar technology
48 Slideshare, Big Data at Aadhaar (slide 9)
49 OOI, CIAD COI TV RabbitMQ
50 Wikipedia, Constrained Application Protocol


addressing of a group of destinations at once. CoAP over TCP transport is currently in
draft.

MQ Telemetry Transport (MQTT).54, 55 According to its documentation, MQTT is a Client
Server publish/subscribe messaging transport protocol. The protocol runs over TCP/IP, or
over other network protocols that provide ordered, lossless, bi-directional connections.
MQTT was developed for machine-to-machine (M2M) communications, was initially
created by IBM, and is currently undergoing standardization at OASIS.
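
The publish/subscribe model mentioned for MQTT can be illustrated with a small in-memory broker. This is a conceptual sketch only: it does not implement the MQTT wire protocol, topic wildcards, or quality-of-service levels, and the class name and topic strings are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Minimal in-memory publish/subscribe broker with exact topic matching."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register callback(topic, payload) for messages published to topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        """Deliver payload to every subscriber of topic; return how many received it."""
        callbacks = list(self._subscribers.get(topic, ()))
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)
```

The key property shown here is decoupling: the publisher does not know who, if anyone, is listening, and subscribers do not need to know which device produced a message.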

In projects that we have done where there was a green field for implementation,
AMQP has been the best fit because in addition to it being efficient, reliable, flexible,
and broker independent, AMQP is native to Microsoft Azure Service Bus, the key
technology component for all these customer projects.

51 Wikipedia, Wireless sensor network


52 Wikipedia, Wireless sensor network, Forest fire detection
53 Wikipedia, Structural health monitoring, Examples
54 See MQTT.org
55 OASIS, MQTT 3.1.1 draft 01 / public review draft 01


Security
With devices communicating sensitive information and acting on our behalf, we clearly need
to ensure that the system and the information it captures, processes, and stores, is secure.
With any system, security is a tradeoff with other requirements, such as user friendliness,
performance, cost, and so on. In this section, we cover some important security aspects we
have come across while working in this field.

"This is the weather forecast for the week of June 16, 2024, for Texas," the
weatherman says. "Last week was hot, but this week will be sizzling, with
temperatures reaching in excess of 110 degrees, with no rain expected." In
hot weather, irrigation is the key to crop and cattle survival. Because most of
the state's farmers are using a new irrigation system that depends on
thousands of sensors to determine the best time to irrigate, few of them
worry. What they don't know is that the system is sending faulty telemetry
information that indicates that it rained every day last week. This keeps the
system from irrigating, and now, crops and cattle start to die.
When distributed systems directly influence the physical world by turning valves, controlling
servos, and much more, there is a clear need to ensure that compromised systems do not
kill crops, cattle, and people, burn buildings, or crash cars. The security bar for commands
and data that make things move must be much higher than in e-commerce or finance.
Let's start with a short list of questions about security for the kinds of systems that we have
come across in our work on predictive maintenance: a list of factors to think about as you
architect an IoT system. On top of normal security precautions, you also need to know how
to:

Securely onboard new devices. You must ensure that only devices that the system
can register are allowed into the system.

Prevent devices from being duplicated or substituted. Because devices provide
data that the system will directly or indirectly act upon, you must be able to trust data
from devices. Peripherals that can be duplicated or substituted might allow a rogue
entity to flood the system with false but trusted data. Also, in the past, a pirated copy of
a device cost money only in terms of a lost sale. A pirated connected device can now
incur actual costs for the connectivity and cloud compute needed to support
and interact with the device.

Ensure that device data can be trusted. As devices communicate, you need to
ensure that the data that they transmit is received unaltered and from verified sources;
the data logged in the service by the device must be trustworthy, representing a
point-in-time observation. This requires integrity and authenticity of data in
information-security terms.56

Ensure the confidentiality of messages in transit and at rest. Because IoT
systems span multiple physical networks and transport information over public and

56 Wikipedia, Information Security, Authenticity


unknown networks through dynamic routes, information in transit must be secured
against observation by non-authorized third parties.

Prevent devices from denying service. In modern software architecture, the level of
interdependencies is high and increasing. Dependencies within the system, such as
devices measuring data potentially critical to effective decision-making, need to be
available and accessible.

Accept only authorized commands on devices. In any system that acts on external
commands and especially one that interacts with the physical world, it is imperative to
ensure that those commands are only acted on if they are properly authenticated and
authorized.

Remove rogue devices from the system. If you find a bad actor such as a
compromised device in the system, you must be able to remove it quickly.

Authenticate peers. If a system supports peer-to-peer communication among devices,
for example, to enrich information or to make intelligent edge decisions without service
intervention (autonomous system operation), you must have an authentication mechanism in
place to ensure that peers in the system are talking to trusted neighbors.

Ensure that devices are always connected to a particular service. A powerful part
of how modern communication works is by using hyperlinks to let clients dynamically
reroute traffic. Devices will blindly follow these hyperlink redirects without thinking twice
(or once, for that matter). Besides offering flexibility, redirects pose a substantial risk if
someone redirects the dataflow into an intermediate system to alter system behavior,
copy the data, or modify the data stream.

In combinatory devices, ensure fine-grained security is possible. When a
customer's component is embedded inside a larger system, such as smart brakes
inside a train or components inside machines, ensure each interested party has access
to the right information and commands, and that when a component is replaced, the
replaced component is no longer authorized to act as part of the larger device.

Virtual Private Networks


A common way to connect networks over an untrusted network is to use a virtual private
network (VPN)57. VPNs act as a virtual network card on both ends of the connection,
combining two networks as if they were a single entity.

Figure 6. VPN connecting two networks at the link-layer

The issue with this approach is that a VPN merely provides secure virtual network cables; it
is the two networks, and therefore everything in them, that are connected. After the
connection is established, the VPN provides access to all layers above the link-layer from
any device on either network.

57 Wikipedia, Virtual private network


A VPN does not help establish any notion of authentication and authorization beyond its
immediate scope. A network application that sits on the other end of a TCP socket, where a
portion of the route is facilitated by the VPN, is oblivious to the VPN's existence because it
acts on the transport and application layers of the network model. What matters for the
trustworthiness of the information that travels from the logic on the device to a remote
control system that does not reside on the same network, as well as for commands that
travel back up to the device, is solely a fully protected end-to-end communication path
spanning networks, where the identity of the parties is established at the application layer.
Protecting the route at the transport layer by signature and encryption is done as a service
for the application layer either after the application has given its permission (for example,
via certificate validation hooks) or just before the application layer performs an authorization
handshake, before entering into any conversations. Establishing end-to-end trust is the job
of application infrastructure and services, not of networks.

Compliance
For vertical sectors such as government and healthcare, compliance is a key consideration
as you architect an IoT solution. National and local governments and industry groups have
mandates that affect what a company can share and with whom. Conversely, some
regulations require the sharing of data among government entities or businesses that work
on government programs. The EU has model clause regulations that dictate the storage and
exposure of personal data.58 The U.S. has similar regulations, such as the Health Insurance
Portability and Accountability Act (HIPAA)59 and the Privacy Act.60 Other countries and
entities also have privacy mandates that consider the location of stored data, its origin, the
location and nationality of the users, and the location, nationality, and use of the data
consumers.
If ingested, processed, or published data offers no way to discern details about specific
people, it is less likely to be affected by regulation. But all data that is made available to the
public or even a controlled set of partners must be reviewed to adhere to all applicable
mandates because violations present high legal61 and reputational risks.62

Healthcare
The HIPAA and HITECH laws in the U.S. apply to healthcare and partner organizations that
have access to sensitive patient information, called electronic protected health information
(ePHI). Service providers that work with these entities usually must agree in writing to
adhere to security and privacy provisions set forth in HIPAA and the HITECH act. If an IoT
system that supports applications such as the one we described in the Healthcare scenario
captures ePHI, it must adhere to these laws. Microsoft provides a Business Associate
Agreement as a contract addendum to its cloud platform, Microsoft Azure.63 We also provide
information on some of the best practices for HIPAA-compliant applications, and we detail
Microsoft Azure provisions for handling security breaches.64

58 European Commission, Protection of Personal Data
59 US HHS, Health Information Privacy
60 US HHS, Privacy Act
61 TechRepublic, Data security laws and penalties: Pay IT now or pay out later
62 Experian, Reputation Impact of a Data Breach
63 U.S. Department of Health & Human Services, Health Information Privacy, Business
Associates
64 Microsoft, Azure HIPAA Implementation Guidance

Device communication patterns


Many current IoT communication approaches try to answer the basic addressing question
with traditional network techniques. That means that the device either gets a public network
address or it becomes part of a virtual network and then listens for incoming traffic using
that address, acting like a server. In this section, we document various architectural
approaches that we have seen, highlight their characteristics, and then propose an
alternative that is suitable for many IoT scenarios.

NAT-based device network


This architectural design approach uses network address translation (NAT)65 to expose
internal devices, which usually use private IP addresses, to the outside world by
reserving a port on the edge device and mapping this port to the private IP address. The
following diagram illustrates this approach.

Figure 7. NAT-based device network


The previous figure shows a device that uses an internal IPv4 IP address (192.168.1.112)
that is listening on port 8088. The device is exposed to the outside world on IP address
127.x.x.x, using port 721. The DNS entry associated with this IP address is
device.mynetwork.com. Clients accessing device.mynetwork.com on port 721 will be routed
directly to the internal device.
This approach has been used in many traditional networks, and depending on the scenario,
it can still work today. However, we have found this approach to be typically limited by the
number of devices that it can support (about 65,000) due to the number of available ports,
the need for devices to be statically located (not moving), and the fact that every exposed
device needs to act like a server (receiving, parsing, and answering arbitrary requests from
clients), which increases its attack surface for malicious abuse.

65 Wikipedia, Network Address Translation
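The port limit described above can be illustrated with a minimal sketch of a port-forwarding table for a single public IP address. The class and method names are hypothetical and do not correspond to any router API. The only point is that each exposed device consumes one external port, and port numbers are 16-bit values.

```python
class PortForwardTable:
    """Toy NAT port-forwarding table for one public IP address (illustrative only)."""

    def __init__(self, first_port=1, last_port=65535):
        # Port numbers are 16-bit values; port 0 is reserved.
        self.first_port = first_port
        self.last_port = last_port
        self._by_port = {}  # external port -> (private IP, private port)

    @property
    def capacity(self):
        """Maximum number of devices that can be exposed on this address."""
        return self.last_port - self.first_port + 1

    def expose(self, private_ip, private_port):
        """Reserve the next free external port and map it to an internal device."""
        for port in range(self.first_port, self.last_port + 1):
            if port not in self._by_port:
                self._by_port[port] = (private_ip, private_port)
                return port
        raise RuntimeError("no free external ports left on this address")

    def route(self, external_port):
        """Return the internal endpoint for an inbound connection, or None."""
        return self._by_port.get(external_port)
```

With the default range the table holds at most 65,535 mappings, which is where the figure of about 65,000 devices per public address comes from. Ports already used by other services reduce that number further.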

IPv6 direct-addressing device network


With the rollout of IPv6, it is natural to think about giving every device in an IoT solution its
own publicly routable IP address to let it connect to peers, services in the system, or other
systems. The following diagram conceptually depicts this model, which we have seen many
times.

Figure 8. IPv6 direct-addressing device network


We mentioned the drawbacks of this approach at the start of this section: the device either
gets a public network address or becomes part of a virtual network, and then listens for
incoming traffic using that address, acting like a server. As with NAT-based device networks,
and regardless of whether IPv4 or IPv6 is used, a device needs to act like a server, and with
the implicit direct-connectivity model, it must be stationary to avoid connection loss, or it
must employ application-layer measures that can handle this scenario.


NAT-based, PAN device network


For power-constrained PAN devices that are mostly wirelessly connected and often not
IP-based, a common approach to bridging the last few feet of connectivity is to use a hub
device wired to the main network that can bridge to the devices on the local network. The
following figure illustrates this approach.

Figure 9. NAT-based, PAN devices network


Even though a hub translates between IP and the various PAN protocols, the problem space
is the same as with other NAT-based device networks that we described.

Generic concerns with direct addressing


All previous architectures that provide direct addressability for devices share common
concerns. As each device is publicly addressable, it needs to handle inbound commands
itself, taking care of all application-layer responsibilities, such as hosting the server
accepting inbound connections, interpreting commands, queuing requests, and so on.
Many devices in large-scale deployments will have limited resources, which constrains
the number of socket connections that they can handle and leaves them open to simple
denial-of-service (DoS) attacks.66 In this approach, the devices would also have to handle the
authentication of users for command and control, using the already scarce sockets, memory,
and compute power to call out to a service or connect to a database and handle its
responses and I/O.

66 Wikipedia, Denial-of-service Attack

Service-assisted communication
Another approach to connecting a large number of devices to the central service within a
system is to have the device connect to a well-known service (called a gateway) and then
use that service to tunnel commands to the device. The goal of this approach is to establish
trustworthy and bi-directional communication paths between control systems and
special-purpose devices that are deployed in untrusted physical space. To that end, the
following principles are established:

Security trumps all other capabilities. If a capability cannot be implemented
securely, it must not be implemented. Threats are identified and either mitigated or
accepted.

Devices do not accept unsolicited network information. All connections and routes
are established in an outbound-only fashion.

Devices are peered with a gateway to only connect or establish routes to well-known
services. If devices need to feed information to or receive commands from a
multitude of services, they are peered with a gateway that takes care of routing
information downstream. This ensures that commands are only accepted from authorized
parties before routing them to the devices.

The communication path between device and service or device and gateway is
secured at the application protocol layer. This mutually authenticates the device to
the service or gateway and vice versa. Because the application does not normally
concern itself with lower-level layers in the network stack as we discussed earlier in
Connectivity, device applications do not trust the link-layer below.

System-level authorization and authentication must be based on per-device
identities. One device, one identity ensures that you have granular control over which
devices can access the system, provide data, and receive commands.

Access credentials and permissions must be revocable. In case of device abuse,
the system must be able to quickly respond by removing the device as an authorized
part of the system.

Bi-directional communication for devices may be facilitated by an intermediate
store. Devices that are connected sporadically due to power or connectivity concerns
can be served by holding commands and notifications for them in a
queue or mailbox structure until they can connect to retrieve them.

Application payload data may be separately secured. This is to protect transit
through gateways to any particular service.
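The mutual-authentication principle in the list above can be sketched with a simple challenge-response exchange over a previously shared key. This is an illustration only and not a vetted protocol; in practice we would expect TLS or TLS-PSK to perform this step. The function names and role labels are hypothetical, and both sides are simulated in one process.

```python
import hashlib
import hmac
import secrets


def prove(key: bytes, role: bytes, challenge: bytes) -> bytes:
    """Compute proof of key possession for a challenge, bound to the prover's role."""
    # Including the role prevents a proof from one side being reflected back.
    return hmac.new(key, role + b"|" + challenge, hashlib.sha256).digest()


def verify(key: bytes, role: bytes, challenge: bytes, proof: bytes) -> bool:
    """Check a peer's proof using a constant-time comparison."""
    return hmac.compare_digest(prove(key, role, challenge), proof)


def mutual_handshake(device_key: bytes, gateway_key: bytes) -> bool:
    """Simulate both directions of the handshake in one process.

    Returns True only if each side proves that it holds the key the other expects.
    """
    device_nonce = secrets.token_bytes(16)   # device challenges the gateway
    gateway_nonce = secrets.token_bytes(16)  # gateway challenges the device

    gateway_proof = prove(gateway_key, b"gateway", device_nonce)
    device_proof = prove(device_key, b"device", gateway_nonce)

    gateway_ok = verify(device_key, b"gateway", device_nonce, gateway_proof)
    device_ok = verify(gateway_key, b"device", gateway_nonce, device_proof)
    return gateway_ok and device_ok
```

Because each device holds its own key, deleting that key at the gateway is enough to make every later handshake from that device fail, which is one way to satisfy the revocation principle.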


Figure 10. Service-assisted communication pattern


From the previous illustration, we can derive the following set of attributes:

Device. The device acts like a client; it connects to the gateway and does not listen for
unsolicited traffic. The device connects to an external gateway by creating and
maintaining an outbound TCP socket across a NAT boundary or by establishing a
bi-directional UDP route, potentially using mechanisms such as Session Traversal Utilities
for NAT (STUN) or, with larger NATs, Traversal Using Relays around NAT (TURN). These
facilitate the detection of a NAT and the discovery of the public IP address of the network
for binding.

Connection. The connection is routed through the edge device, usually a router.
Because the connection is outbound, the port mapping is performed automatically. By
only relying on outbound connectivity, the NAT/Firewall device at the edge of the local
network will never have to be opened up for any unsolicited inbound traffic.
The outbound connection or route is maintained by either client or gateway in a fashion
that intermediaries such as NATs will not drop due to inactivity. That means that either
side might send some form of a keep-alive packet periodically, or send a payload packet
periodically that then doubles as a keep-alive packet. Under most circumstances it will
be preferable for the device to send keep-alive traffic as it is the originator of the
connection or route, and it can and should react to a failure by establishing a new one.
As TCP connections are endpoint concepts, a connection will only be declared dead if the
route is considered collapsed and the detection of this fact requires packet flow. A device
and its gateway may therefore sit idle for quite a while believing that the route and
connection is still intact before the lack of acknowledgement of the next packet confirms
that assumption is incorrect. This conflict in behavior calls for a tradeoff decision to be
made.
Carrier-grade NATs (CGNs) employed by mobile network operators permit very long
periods of connection inactivity and mobile devices that get direct IPv6 address
allocations are not forced through a NAT at all. The push notification mechanisms
employed by all popular smartphone platforms use this to dramatically reduce the power
consumption of the devices by maintaining the route very infrequently (every 20
minutes or more) so the devices can remain in sleep mode with most systems turned off
while idly waiting for payload traffic. The downside of infrequent keep-alive traffic is that
the time it takes to detect a bad route is, at worst, as long as the keep-alive interval.
Ultimately, it is a tradeoff between battery-power and traffic-volume cost (on metered
subscriptions) and acceptable latency for commands and notifications in case of failures.
The device can actively detect potential issues and abandon the connection and create a
new one when, for instance, it hops to a different network or when it recovers from signal
loss.
The connection from the device to the gateway is protected end-to-end and ignores any
underlying link-level protection measures. The gateway authenticates with the device
and the device authenticates with the gateway, so neither is anonymous to the other. In
the simplest case, this can be done by exchanging a previously shared key. As we see
quite often in more capable devices, it can also be done via an X.509 certificate exchange
as performed by Transport Layer Security (TLS), or a combination of a TLS handshake
with server authentication where the device later supplies credentials or an authorization
token at the application level. The privacy and integrity protection of the route is also
established end-to-end, ideally as a byproduct of the authentication handshake so that a
potential attacker cannot waste cryptographic resources on either side without producing
proof of authorization.
Today, TLS/DTLS and Secure Shell (SSH) dominate as application-level connection
security protocols. SSH is popular, but it lacks a standard session-resumption gesture.
TLS supports both the X.509 certificate-exchange model and a simplified model (TLS-PSK)
that uses previously shared keys. Removing support for X.509 certificate handling
and wire-level exchange reduces the footprint of the TLS library, and by reducing the
supported algorithms (for example, supporting only AES-256 and SHA-256), it is feasible
to use this protocol on compute- and memory-constrained devices while remaining
compatible with other application layer protocols that rely on TLS. The result of all this is
a secure peer connection between the device and a gateway that only the gateway can
feed.

Edge security. Because there are no ports open to listen on the edge device, the attack
surface on the local network and its devices is minimized.

Gateway. The connection is accepted by a hosted process called a gateway, a system
hosted in an environment that is defendable against external threats, either at the edge
of the internal network or based in the cloud. It provides a well-defined endpoint and API
for clients to connect to and communicate with, effectively acting as a proxy for the
device. Eventual peer-to-peer connections inside the network are acceptable, but only if
the gateway permits them and facilitates a secure handshake between the peers.
In case any authorized client wishes to send a command (or a reply to a previous
request) to a device, it can do so by sending the command to the gateway, which can
provide one or even several different APIs and protocol surfaces that are translated to the
primary bi-directional protocol used by the device. As the gateway is a layer of abstraction, it
provides the device with a stable address, location transparency and location hiding.
As this gateway forms an abstraction toward the device, the device could be limited to
speaking AMQP, MQTT, or some proprietary protocol, and yet have a full HTTP/REST interface
projection at the gateway, with the gateway taking care of the required translation and
also the enrichment where responses from the device can be augmented with reference
data. The device can connect from any context and it can even switch contexts, yet its
projection into the gateway and its address remain completely stable. The gateway can
also be federated with external identity and authorization services, so that only callers
acting on behalf of particular users or systems can invoke particular device functions.
The gateway therefore provides basic network defense, API virtualization, and
authorization services all combined into one. This approach gets even better when it
includes or is based on an intermediary messaging infrastructure that provides a scalable
queuing model for both ingress (device to cloud) and egress (cloud to device) traffic.
Without this intermediary infrastructure, this approach would still suffer from the
issue that devices must be online and available to receive commands and
notifications when the control system sends them. With a per-device queue or
per-device subscription on a publish/subscribe infrastructure, the control system
can drop a command at any time, and the device can pick it up whenever it is
online. If the queue provides time-to-live expiration alongside a dead-lettering
mechanism for such expired messages, the control system can also know
immediately when a message has not been picked up and processed by the
device in the allotted time.
The queue also ensures that the device can never be overtaxed with commands
or notifications. The device maintains one connection into the gateway and it
fetches commands and notifications on its own schedule. Any backlog forms in
the gateway and can be handled there accordingly. The gateway can start
rejecting commands on the device's behalf if the backlog grows beyond a
threshold or the cited expiration mechanism kicks in and the control system gets
notified that the command cannot be processed at this time.
On the ingress-side (from the gateway perspective) using a queue has the same
kind of advantages for the back-end systems. If devices are connected at scale
and input from the devices comes in bursts or has significant spikes around
certain hours of the day, such as with telematics systems in passenger cars
during rush-hour, having the gateway deal with the traffic spikes keeps the back-end
system robust. The ingestion queue also allows telemetry and other data to
be held temporarily when the back-end systems or their dependencies are taken
down for service or suffer from service degradation of any kind.
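The per-device queue behavior described for the gateway (time-to-live expiration, dead-lettering, and rejecting commands when the backlog grows too large) can be sketched in memory as follows. This is an illustration under stated assumptions, not the API of any messaging product: the class name, method names, and default backlog threshold are hypothetical.

```python
import time
from collections import deque


class DeviceMailbox:
    """Per-device command queue held at the gateway (illustrative only)."""

    def __init__(self, max_backlog=100, clock=time.monotonic):
        self.max_backlog = max_backlog
        self._clock = clock        # injectable so that expiry can be tested
        self._pending = deque()    # entries are (expires_at, command)
        self.dead_letters = []     # expired commands, visible to the control system

    def _expire(self):
        """Move commands whose time-to-live has passed to the dead-letter list."""
        now = self._clock()
        live = deque()
        for expires_at, command in self._pending:
            if expires_at <= now:
                self.dead_letters.append(command)
            else:
                live.append((expires_at, command))
        self._pending = live

    def send(self, command, ttl_seconds):
        """Called by the control system. Returns False if the backlog is full."""
        self._expire()
        if len(self._pending) >= self.max_backlog:
            return False  # the gateway rejects on the device's behalf
        self._pending.append((self._clock() + ttl_seconds, command))
        return True

    def fetch(self):
        """Called by the device on its own schedule. Returns a command or None."""
        self._expire()
        if not self._pending:
            return None
        _, command = self._pending.popleft()
        return command
```

The control system never talks to the device directly. It learns about undelivered commands from the dead-letter list, and the device is never handed more work than it asks for.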

Designing for scale


The opportunity for IoT is in the ubiquity of connected devices, the volume of data that those
devices will supply, the intelligence to be gained from that data, and the command/control
that we can exert on the devices. All of these aspects mean that the solution must be
designed to scale at all levels.
In many respects, designing an IoT solution to scale effectively raises the same concerns as
any large-scale solution. While IoT does not require a cloud-based deployment, in most
cases, taking advantage of the cloud makes sense because of usage-based pricing, a simple
model that scales, geographic availability, and infrastructure support provided by the cloud
vendor. Many documents and articles have been written about cloud application scalability
and availability. For a good overview of this topic, see Failsafe: Guidance for Resilient Cloud
Architectures.67 The Microsoft patterns & practices team also has a large body of work on
Cloud Development that provides guidance on building scalable cloud systems.68
There are specific scalability areas that come up more frequently in IoT scenarios that may
not appear in other IT solutions, however. One area is identity. For web properties, the
concept of identity federation has taken hold, and most modern consumer web properties
now allow a user to use their identity from other well-known identity stores, such as an
account registered with Microsoft, Facebook, Google, Yahoo, and so on. Additionally,
corporate accounts can be federated with platform as a service (PaaS) vendors and partners.
But with the addition of devices, there will often be identities associated with those devices,
relationships between those devices and human identities, and relationships between
multiple humans and devices. This potentially complex set of relationships should be
considered early in an IoT project, and the solution should strive to simplify these
relationships as much as possible.
In our project experience we have not yet seen a pattern that satisfies this level of
complexity with satisfactory results. The initial projects have used Azure Active Directory for
human identities, and external data stores for device identity and the associations with
Azure Active Directory users. Design, prototyping, and testing is an ongoing process to find
more scalable, resilient and feature complete solutions.

Communication and ingestion


Another area where scalability is tested is the communication path for ingestion. Most
solutions will require secure, authenticated communication between devices and the
collection point. Additionally, any implementation choice for messaging technology will have
scalability limits in certain properties, such as messages per unit (for example, per queue or
topic) and bandwidth per implementation unit (subscription, instance, and so
on). These parameters must be well understood, planned for from the beginning of the
project, and tested and verified as the architecture and solution progress.
Our projects have used the Microsoft Azure Service Bus69, with Azure Active Directory Access
Control (ACS)70 keys granted for each device. In a generic solution, some type of secure key
must be generated that will make a device unique, and one that only that device knows
about. The system it connects to must know about the device and its key, and then verify
that they match when messages arrive. The Service Bus and ACS provide these capabilities,
making them a good fit. The solutions use Topics and Subscriptions71, and they are designed
to take into account the scalability parameters of the Service Bus72, and use as many topics
as needed to comfortably support the number of devices in the system and scale to
additional topics if and when additional devices are added to the system.

67 MSDN, Failsafe: Guidance for Resilient Cloud Architectures
68 MSDN, Cloud Development
69 MSDN, Service Bus
70 MSDN, Access Control Services 2.0
71 MSDN, Service Bus Queues, Topics, and Subscriptions
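As a rough illustration of spreading devices across as many topics as needed, the following sketch computes a topic count from a per-topic limit and maps each device to a topic with a stable hash. The per-topic limit, the headroom factor, and the topic naming scheme are assumptions made for the example; they are not Service Bus quotas, and the code does not call the Service Bus API.

```python
import hashlib
import math


def topics_needed(device_count, devices_per_topic, headroom=0.25):
    """Number of topics so that each stays below its limit with spare capacity."""
    usable = devices_per_topic * (1.0 - headroom)
    return max(1, math.ceil(device_count / usable))


def topic_for_device(device_id, topic_count, prefix="telemetry"):
    """Map a device ID to a topic name using a stable hash of the ID."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % topic_count
    return f"{prefix}-{index:03d}"
```

One design caveat: a modulo mapping reassigns most devices when the topic count changes. If topics are added over time, storing the assigned topic with the device at registration, or using consistent hashing, may be preferable.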

Data storage scalability


Scalable data storage is another area that will be important in these projects. Because of the
expected volume of data, blob storage will normally be the preferred choice. The reason for
this is that blob storage is the lowest cost storage option, and Big Data analysis tools are
built to work against blob storage. Depending on the volume and the geographic dispersion
of the devices, the solution may need to use multiple storage accounts, and it may also
need to move data from collection data centers into a single data center in order to perform
analysis on that data. For additional guidance on managing the data, see Data
Management Patterns and Guidance73 on the Microsoft Developer Network.

Device registration
Registering devices is the critical first step to take to ensure that the system is secure and
remains secure, that it only allows data to be ingested from trusted endpoints, and that
devices only accept commands from trusted systems. A device must be uniquely identified,
the system must authenticate its identity, and the device must know that it is communicating
securely with only the correct collection endpoint.
Often a device will be created with the knowledge of the expected endpoints, or at least
have some influence over the collection point. An example of this is a vehicle whose
manufacturer is selling a connected vehicle experience. In this scenario, when the device is
manufactured, a unique key will be stored on the device. Either that key or a public key
associated with it will be stored in a database, and when the device is enabled, the service
can check the database and verify that the device is an approved device. These keys may
be service-generated, such as by Azure Active Directory Access Control Services (ACS), or
keys created to support the TLS-PSK pattern as described earlier in this paper, or keys
intended for service-specific authentication. Typically, even when the device carries a key
out of the factory, the device will become active in another step; for example, when a
customer purchases, installs, and configures the device. Configuration will associate a user
with the device, which transforms it to an active device. The device may be issued a new
key at this time.
In other cases, the set of potentially connected devices will not be known at manufacturing
time, so keys cannot be installed on the device prior to its release. In this case, device
registration must happen when the device is installed or activated. An example of this might
be a traffic service that will collect GPS and movement telemetry from a smartphone, and in
turn provide free traffic information for users who opt in to share data. In this case, there
would be a registration step where a user must identify the device to the service, the service
then sends a key to the device, and then that key is used to manage communication.

72 Microsoft, Service Bus Scalability
73 MSDN, Data Management Patterns and Guidance
Equally important to device registration is the ability to unregister the device, or disable it.
This is critical because even though the communication with the device is secure, the device
itself can become compromised. Being able to unregister the device and refuse
communication is a critical aspect of the system. With device-specific keys, the keys can be
revoked and the system can quickly stop accepting telemetry from the device.
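The register, verify, and unregister cycle described in this section can be sketched with an in-memory registry and per-device keys. This is a minimal illustration, assuming HMAC-signed payloads as the service-specific authentication; the class, method names, and device ID are hypothetical, and a real system would persist keys securely and add replay protection.

```python
import hashlib
import hmac
import secrets


class DeviceRegistry:
    """In-memory stand-in for the device database (illustrative only)."""

    def __init__(self):
        self._keys = {}  # device_id -> secret key

    def register(self, device_id):
        """Issue a fresh per-device key; the caller delivers it to the device."""
        key = secrets.token_bytes(32)
        self._keys[device_id] = key
        return key

    def unregister(self, device_id):
        """Revoke the device's key so that its messages are refused from now on."""
        self._keys.pop(device_id, None)

    def accept(self, device_id, payload, signature):
        """Return True only for a registered device with a valid signature."""
        key = self._keys.get(device_id)
        if key is None:
            return False
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        return hmac.compare_digest(expected, signature)


def sign(key, payload):
    """Device-side helper: sign a telemetry payload with the device key."""
    return hmac.new(key, payload, hashlib.sha256).digest()
```

Because every device has its own key, unregistering one compromised device does not affect any other device in the system.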

Acquiring data
IoT data acquisition is frequently referred to as data ingestion. In literature about Big Data,
the three Vs (volume, variety, and velocity) are often cited.74 There are other aspects to
consider as well. In our initial engagements in IoT, we have seen that device bandwidth,
connection speed, reliability, and cost have been major influencers in the solution choices
made. But each item in this section is important, and the relative importance of each will
vary depending on a project's requirements. The following sections discuss many aspects of
data ingestion.

Message size and format


Messages from devices are the lifeblood of IoT. In a world with no boundaries, we might
collect all telemetry data and analyze it extensively, or simply save it in case we need it
later. In the real world, we need to consider the size of the message, which will be affected
by its number of attributes, the data types, the message formats, the message overhead,
and the security overhead.
Many common message formats are in use today. Extensible Markup Language (XML) and
JavaScript Object Notation (JSON) are common. Binary JSON (BSON), Protocol Buffers, and
Avro are more compact formats that are often used when message size and bandwidth are
constrained. XML is supported by all development tools, and easy to understand, but its tags
can often cause message-size bloat. JSON is quickly becoming as ubiquitous as XML, and it is
more compact than XML, but JSON retains the readability of XML.
In IoT there is often a premium on memory, bandwidth, and connection cost, so compact
message formats can be useful. BSON is a binary encoded version of JSON. It allows you to
encode binary data in the message, and it enables storing data as raw bytes versus text.
Protocol Buffers define a method of serializing structured data. They were developed at
Google, and then given to the open source community. Protocol Buffers are compact, but not
self-describing like XML and JSON, so sender and receiver must understand the message
being transmitted. Avro is another option for compact formats. It differs from BSON and
Protocol Buffers in that it is not self-describing, but it is always accompanied by a schema, so
no code generation or prior knowledge of the schema is required for processing on the
message-receiving end. Ultimately, choosing one of these formats comes down to how to
balance development environment support, device support, the need for compactness, and
storage and processing requirements on the message-receiving side.

74 See The 3 Vs of BIG data
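To make the size tradeoff concrete, the following sketch encodes the same reading once as JSON and once in a fixed binary layout. The reading, its field names, and the layout are assumptions made for the example. The binary form is a plain fixed-layout encoding built with Python's struct module, not BSON, Protocol Buffers, or Avro, but it shows the same effect: field names and text formatting no longer travel with every message, at the cost of both sides having to agree on the layout.

```python
import json
import struct

# Hypothetical telemetry reading; field names and types are assumptions.
reading = {"device_id": 4711, "timestamp": 1400000000, "temperature": 21.5, "rpm": 1450}

# Self-describing text encoding: field names travel with every message.
json_bytes = json.dumps(reading, separators=(",", ":")).encode("utf-8")

# Schema-bound binary encoding: sender and receiver must agree on this layout
# in advance ("<" little-endian, I = uint32, f = float32, H = uint16).
LAYOUT = struct.Struct("<IIfH")
binary_bytes = LAYOUT.pack(
    reading["device_id"], reading["timestamp"], reading["temperature"], reading["rpm"]
)


def decode(blob):
    """Rebuild the reading from the fixed binary layout."""
    device_id, timestamp, temperature, rpm = LAYOUT.unpack(blob)
    return {"device_id": device_id, "timestamp": timestamp,
            "temperature": temperature, "rpm": rpm}
```

Here the binary message is 14 bytes, several times smaller than the JSON text, but a receiver without the layout cannot interpret it, and float32 limits the precision of the temperature value.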

Message types
Your system may require different message types that can differ in schema, data type, or
both of these. A real-world example of this is a connected vehicle system that predominantly
sends telemetry information for predictive maintenance. This system might also be used to
send audio or video clips for emergency management, accident recording, and so on. In
these cases, the media files are often enhanced with metadata related to the collection of
the media file. Additionally, the media messages may be of lower or higher priority and they
may require splitting, compression, resumption on error, and temporary local storage. If
different device types are involved, they may provide media files in formats or encoding
levels that are optimized or specific to those devices, which could require normalization at
the storage point.

Message priority
Different message types will often have different priorities in an IoT system. A message can
be a standard telemetry message that is intended specifically for cataloging, and used for
machine learning algorithms downstream. There can be other message types that are
considered events and alarms. An event could be an elevator door opening, a car starting, or
the temperature being increased in a home, whereas an alarm might be a broken window, a
car crash, or a full engine failure.
Message priority will be handled either by providing a separate endpoint for priority
messages, or by detecting attributes in the message itself to assign priority. Using a
separate endpoint for priority messages can reduce the chance of a high priority message
delivery being slowed by a flood of the standard flow messages. If the throughput of the
initial point of ingestion is considered adequate, then downstream detection is an option, for
instance creating a standard subscription and a high priority subscription on an Azure
Service Bus Topic.
There are also cases where device priority may be required. In a connected vehicle scenario,
there may be a premium service that has priority, or there may be sensors in a building with
relative priority, such as one that detects a broken window on the first floor that has higher
priority than one on the fifth floor of the building. In this case, the priority may be handled
similarly to message priority. Another approach is to use a separate service that handles the
higher priority devices.
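
The downstream-detection option can be sketched in a few lines. This is a minimal
illustration and not the Service Bus API: the `priority` attribute, the `ALARM_TYPES` set,
and the two in-memory queues that stand in for a standard and a high-priority subscription
are all assumptions made for the example.

```python
from collections import deque

# Hypothetical message types that are treated as alarms (assumption).
ALARM_TYPES = {"broken_window", "car_crash", "engine_failure"}

standard_queue = deque()       # stands in for the standard subscription
high_priority_queue = deque()  # stands in for the high-priority subscription


def is_high_priority(message: dict) -> bool:
    """Detect priority from attributes carried in the message itself."""
    return message.get("priority") == "high" or message.get("type") in ALARM_TYPES


def route(message: dict) -> str:
    """Place the message on the matching queue and return the queue name."""
    if is_high_priority(message):
        high_priority_queue.append(message)
        return "high"
    standard_queue.append(message)
    return "standard"


route({"device": "elevator-7", "type": "door_opened"})
route({"device": "building-2", "type": "broken_window", "floor": 1})
```

Device priority could be handled the same way by adding a device attribute to the check in
`is_high_priority`.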

Conditional messaging
In some of our projects, the solution required the message pattern to change based on
conditions. In one case, if a service technician received an alert that an elevator needed
attention, the technician could send a message to the device asking it to increase the
detail and frequency of messaging. This would continue for a configurable timeframe.
This type of requirement means that the solution must be scaled to handle the conditional
events. For instance, if the devices could automatically increase the size and frequency of
messages, they could cause a dramatic increase in traffic to the system. Safeguards and
throttling should be considered to protect against unplanned data floods in such situations.
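
One common throttling technique, which this paper does not prescribe, is a token bucket.
The sketch below assumes one bucket per device; the class name, the parameters, and the
injectable `clock` are illustrative choices rather than part of any Azure service.

```python
import time


class TokenBucket:
    """Allow at most `rate` messages per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one more message may pass right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A device that switches to detailed messaging can still burst up to `capacity` messages, but
a sustained flood is capped at `rate` messages per second.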

Contextual messaging
Similar to conditional messaging, there are use cases that require contextual messaging,
which can follow multiple patterns. There may be situations where the device includes
contextual information in the messages that it sends. The data may include GPS
coordinates, and a vehicle may need to send additional telemetry when it travels above a
certain altitude, or if the ambient temperature rises above a trigger level. The context may
require more data in messages, the collection of data from other sensors on the device, or it
may require more or less frequent message transmission.

Message batching
The natural inclination may be to send messages immediately when data is generated, but
there are several reasons why messages may be batched. A device may be power
constrained, so the connectivity may only be turned on for a limited amount of time. The
connection may be unreliable, so it could make sense to batch the collected messages for a
single transmission once connectivity is available. The device may move in and out of
connectivity, or connectivity may be congested or less expensive at certain times of the day.
If you allow batched messages, the message receiver must be designed to accept them as
well as single messages. In this case, a message envelope that can contain multiple
messages or a single message can simplify the solution.
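
A minimal sketch of such an envelope follows. The JSON layout and the field names
(`deviceId`, `count`, `messages`) are assumptions for illustration; the point is only that
the receiver runs the same code path for one message or many.

```python
import json


def make_envelope(device_id: str, messages: list) -> str:
    """Wrap one or more messages in a single envelope."""
    return json.dumps({"deviceId": device_id, "count": len(messages), "messages": messages})


def read_envelope(payload: str) -> list:
    """Return the contained messages, whether the envelope holds one or many."""
    envelope = json.loads(payload)
    return envelope["messages"]


single = make_envelope("sensor-1", [{"t": 1, "temp": 21.5}])
batch = make_envelope("sensor-1", [{"t": 2, "temp": 21.7}, {"t": 3, "temp": 21.9}])
```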

Bandwidth and scale


Previous topics in this paper discuss bandwidth from the device. The bandwidth and scale of
the collection points must also be considered. The size of the network pipe out of the device
environment may be constrained. For example, if the solution is collecting building
telemetry from devices that are connected to an internal network and send data to an
external collection point, the effect on the capacity of the building network should be
evaluated. The collection points will also have an upper bound. For example, Microsoft Azure
Storage and Service Bus have capacity targets. If your solution needs to extend beyond the
targets of the enabling technology, then a scale-out approach should be designed for the
project. This approach should include plenty of excess capacity for growth and unplanned
spikes. In our projects, we typically plan for no more than 50 percent capacity at steady
state.
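
As a rough illustration of the 50 percent rule, the following sketch computes how many
collection points a scale-out design would need. The function name, the parameter names,
and the example figures are assumptions for illustration; they are not Azure capacity
targets.

```python
import math


def collection_points_needed(steady_state_load: float,
                             capacity_per_point: float,
                             max_utilization: float = 0.5) -> int:
    """Number of collection points so that each runs at or below max_utilization."""
    usable_capacity = capacity_per_point * max_utilization
    return math.ceil(steady_state_load / usable_capacity)


# Hypothetical figures: 350 messages/second against points rated at 100 messages/second.
points = collection_points_needed(350, 100)
```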
If the connected devices are geographically distributed, consider scaling out the solution to
multiple data centers. This can introduce the complexity of directing device traffic to the
right collection points. In our projects, we have found success in assigning each device to a
data center in advance, so that no device has to discover where its data should go. If the device
moves geographically, then it may need to be reassigned. It is important to understand how
the data will be used, and if it needs to be aggregated before use or if the data can be used
autonomously in the data center where it was collected.

Storing information
In an IoT solution, there are also several aspects to
consider for data storage. The following sections discuss
many aspects of this topic.

Storing data on the device


The critical telemetry data is generated on the device, or
prior to getting to the device in the case of a gateway.
The data may be cached and preprocessed on the
device. The reasons for doing this include the desire to
optimize the amount of data sent, to remove noise from the data before analysis, to save on
storage costs at the central storage location, to minimize transmission time or cost, and to
account for unreliable connectivity. If data will be stored on the device either temporarily or
permanently, there are several local storage considerations, such as security,
reliability, and capacity. If data is stored on the device, the solution architect needs to
consider the implications of losing the data, whether the data will expire on the device if it cannot
be sent to external storage, and how the system will detect and recover from missing data,
should a local outage occur.
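
These considerations can be made concrete with a small store-and-forward sketch. It is one
possible design, not the authors' implementation: the bounded buffer, the expiry rule, and
the use of sequence numbers to detect missing data are all assumptions made for
illustration.

```python
from collections import deque


class DeviceBuffer:
    """Bounded local cache that numbers each reading and expires old ones."""

    def __init__(self, max_items: int, max_age_seconds: float):
        self.max_age = max_age_seconds
        self.items = deque(maxlen=max_items)  # oldest entries fall off when full
        self.next_seq = 0

    def add(self, reading: dict, now: float) -> None:
        self.items.append({"seq": self.next_seq, "ts": now, "data": reading})
        self.next_seq += 1

    def drain(self, now: float) -> list:
        """Return unexpired readings for upload and clear the buffer."""
        fresh = [m for m in self.items if now - m["ts"] <= self.max_age]
        self.items.clear()
        return fresh


def missing_sequences(received: list, last_seen_seq: int = -1) -> list:
    """Receiver side: gaps in the sequence numbers reveal lost or expired data."""
    seen = {m["seq"] for m in received}
    if not seen:
        return []
    return [s for s in range(last_seen_seq + 1, max(seen) + 1) if s not in seen]
```

The receiver would keep `last_seen_seq` per device, so that readings dropped at the start of
an upload are detected as well as gaps in the middle.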

Transforming data
Generally, the data will go through multiple transformation steps as it is
generated, sent, stored, and processed. As stated in the previous section, there
may be data transformation happening on the device itself, such as format conversion,
aggregation, and so on. This will rely on local processing capabilities. Other than the local
preprocessing, any other transformation would happen at the collection point.
For years, data processing has been thought of in terms of Extract, Transform, and Load
(ETL). With the advent of Big Data, much of the discussion has changed to Extract, Load,
and then Transform (ELT). The key concept in this transition is that your system is ingesting a
huge amount of data, and the transformation process costs significant compute power.
Additionally, while this transformation is happening, the data is at risk. If it has not yet been
serialized to storage and the server crashes, the data is lost. With ELT, the system ingests the
data and immediately stores it. This minimizes the exposure of the data during ingestion,
and provides new opportunities for data transformation and analysis. First, the data can be
transformed asynchronously from ingestion, which helps reduce compute demand. Second, the
data can be transformed multiple times, for multiple purposes. This process also
supports the idea of collecting all data for extended periods of time. This is often referred to
as a data lake,75 and this strategy suggests keeping all data for later analysis. The
rationale for this is that machine learning algorithms may find interesting patterns or trends
that would not be expected, and that these would warrant studying other seemingly
unneeded data.
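
The ELT idea can be sketched as follows. The in-memory list stands in for durable raw
storage, and the message fields are hypothetical; the sketch only shows that ingestion
stores the payload untouched and that any number of transformations can run over it later.

```python
import json

raw_store = []  # stands in for durable raw storage (assumption)


def ingest(raw_message: str) -> None:
    """Load step: keep the payload exactly as received, with no parsing."""
    raw_store.append(raw_message)


def transform(store: list, build_view) -> list:
    """Transform step: run later, and as often as needed, over the stored raw data."""
    return [build_view(json.loads(raw)) for raw in store]


ingest('{"device": "car-1", "temp_c": 90, "rpm": 3000}')
ingest('{"device": "car-2", "temp_c": 105, "rpm": 2500}')

# Two independent views derived from the same raw data.
temps_f = transform(raw_store, lambda m: (m["device"], m["temp_c"] * 9 / 5 + 32))
overheating = [d for d, t in transform(raw_store, lambda m: (m["device"], m["temp_c"])) if t > 100]
```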

75 Forbes, The Data Lake Dream

Location
Most IoT solutions will send data to a public or private cloud. If connected devices are
geographically distributed, there may be a case for storing the data across several locations
around the globe, in order to store the data closest to where it was generated. There may
also be government mandates that require an individual's data to remain in that person's
home country, or the data may only be interesting within the region in which it was
collected. However, in a large percentage of projects, the value is in the large body of data,
so data must be brought together into a single location for the most insightful analysis. In
this case, the considerations will center on the time constraints of the analysis (how often
are the algorithms run?), the physical limitations of the data centers, bandwidth, and the
cost of moving data.

Longevity, format, and cost


After the data reaches its long-term storage point, there are decisions to be made about how
to govern it. A data retention policy must be defined. The arguments for long data
retention periods are that cloud storage is inexpensive and getting cheaper all the time, and
that data scientists want data saved in case a new insight is discovered that warrants
looking at data that was previously uninteresting. Even with those benefits, the costs for
large volume data storage can add up, and the data could become unmanageable if you do
not have a basic plan for how to store, access, and retrieve it. The terms Data Temperature
and Hot and Cold storage 76 also come up in this context. The concept centers around how
frequently the data is accessed, and how quickly the users or systems expect to be able to
use the data. Hot data is frequently accessed and users expect good response time. Cold
data is less frequently accessed and expected response times can be slower.
Classifying data in this manner allows the architects to choose faster and potentially more
expensive storage for hot data and select lower cost options for cold data.
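
A classification rule can be as simple as the sketch below, which uses the time since the
last access as a proxy for access frequency. The 30-day threshold and the function name are
assumptions; a real policy would be derived from observed access patterns and storage
prices.

```python
from datetime import datetime, timedelta


def storage_tier(last_accessed: datetime, now: datetime, hot_window_days: int = 30) -> str:
    """Classify a dataset as hot or cold by how recently it was accessed."""
    return "hot" if now - last_accessed <= timedelta(days=hot_window_days) else "cold"


now = datetime(2014, 12, 1)
recent = storage_tier(datetime(2014, 11, 20), now)
stale = storage_tier(datetime(2014, 6, 1), now)
```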
The format for long-term data storage also needs to be carefully considered. Should it be
optimized for Hive queries, or should it be as compact as possible? Or should there be a
fresh data store with more recent data that is easy to access, process, and query, and an
archive that is compressed and stored in a way that minimizes cost, but that requires
overhead if and when it needs to be accessed? All of these considerations feed into
decisions on how to best store the data.

Processing information
After the data is ingested, it must be processed.
Processing types range from very simple to long-running and complex. The following sections
discuss common IoT data-processing types.

76 Teradata, Hot and Cold Running Data

Alarm processing
A common use case is to watch for specific data items on ingestion and then take action
based on that data. These could be alarms from devices, or any kind of simple event
processing. The characteristic of this type of processing is that there is a specific set of
values that are to be monitored on specific attributes of the incoming data that can trigger
predetermined responses. While this type of event processing is logically straightforward,
the implementation still requires consideration due to the expected high volume of data
being ingested, and the likelihood that the events that must be responded to are
important.
In alarm processing, the solution must also account for the potential of alarm floods. A
systemic failure can cause one. For instance, if home alarm systems send an alarm to the event
processing system when the power goes out, a widespread outage produces a flood of alarms. If
there is no battery backup, messages may instead be cached on the device, and when the power
returns, all the devices send their entire set of messages at once. To handle these situations,
the devices may be designed to have a random offset for message delays, or the message
receiving service can implement a circuit breaker pattern 77 to circumvent failure when an
abnormal event pattern happens.
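
Both mitigations can be sketched briefly. The code below is a simplified illustration under
stated assumptions: the device-side function only picks a random delay, and the
receiver-side class is a rate-triggered, circuit-breaker-style guard rather than a full
implementation of the pattern referenced above. All names and thresholds are hypothetical.

```python
import random


def jittered_delay(max_offset_seconds: float, rng=random) -> float:
    """Device side: wait a random offset before resending cached messages."""
    return rng.uniform(0, max_offset_seconds)


class CircuitBreaker:
    """Receiver side: open the circuit after too many events in one time window."""

    def __init__(self, max_events: int, window_seconds: float, cooldown_seconds: float):
        self.max_events = max_events
        self.window = window_seconds
        self.cooldown = cooldown_seconds
        self.window_start = 0.0
        self.count = 0
        self.open_until = None

    def accept(self, now: float) -> bool:
        """Return True if the event should be processed normally."""
        if self.open_until is not None:
            if now < self.open_until:
                return False          # open: divert or reject the event
            self.open_until = None    # cooldown elapsed: close the circuit
            self.window_start, self.count = now, 0
        if now - self.window_start >= self.window:
            self.window_start, self.count = now, 0
        self.count += 1
        if self.count > self.max_events:
            self.open_until = now + self.cooldown
            return False
        return True
```

While the circuit is open, the receiving service could park events in a holding queue
instead of dropping them, so that alarms are delayed rather than lost.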

Complex-event processing
Complex-event processing is used to detect conditions or states on data in motion that may
not be directly deduced from simple data evaluation. This might include the detection of a
certain set of events that arrive in a particular order or frequency, such as an event that is
innocuous if it appears once, but that indicates a problem if it occurs a certain number of
times in a certain timeframe, or if the same event is transmitted from a set of devices or
sensors. Imagine that your car sends telemetry to the manufacturer, and one of the items
that it reports is failed starts. By itself, this would mean very little to the manufacturer.
However, if the weather got very cold last night, and none of the SuperCar Model 8s in that
area started in the morning, that could tell the manufacturer that there is a systemic
problem with the car's battery or something related to the starting system.
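
The failed-start scenario can be sketched as a sliding-window check over distinct devices
per region. This is a toy illustration, not how a stream processing engine is implemented:
the class name, the thresholds, and the assumption that events arrive in timestamp order
are all made up for the example.

```python
from collections import defaultdict, deque


class FailedStartDetector:
    """Flag a region when enough distinct cars report a failed start within a time window."""

    def __init__(self, min_devices: int, window_seconds: float):
        self.min_devices = min_devices
        self.window = window_seconds
        self.events = defaultdict(deque)  # region -> deque of (timestamp, device_id)

    def on_event(self, region: str, device_id: str, timestamp: float) -> bool:
        recent = self.events[region]
        recent.append((timestamp, device_id))
        # Slide the window: discard events older than the window.
        while recent and timestamp - recent[0][0] > self.window:
            recent.popleft()
        distinct = {device for _, device in recent}
        return len(distinct) >= self.min_devices
```

A single car that fails to start repeatedly does not trigger the detector; several different
cars in the same region within the window do.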
The industry sees complex event processing as one of the keys to monetizing the vast
opportunity of IoT.78 When envisioning the solution, ensure that initial requirements are
discussed early in the project. This is an area where businesses will learn and improve over
time, but one which should be prototyped early in the process to prove out the concepts,
and to begin to develop the right mindset for capitalizing on the opportunities. This is a rich
area of development within Microsoft, our competitors, and the open source community.
Microsoft has developed StreamInsight,79 which can be deployed in the cloud. A popular
open source project for real-time stream processing is Apache Storm,80 and Amazon
offers Kinesis, which includes stream processing, for its cloud solutions.

77 MSDN, Circuit Breaker Pattern
78 Venture Beat, Without stream processing, there's no big data and no Internet of things
79 Microsoft, StreamInsight
80 Apache, Storm, distributed and fault-tolerant realtime computation

Big Data analysis


One of the main drivers for IoT is the ability to economically collect and store large amounts
of data. After the data is collected, it must be processed, aggregated, and analyzed to create
datasets that can be visualized and used for business analysis, to inform business
decisions and strategy, to feed back into product engineering to improve products, or to provide
views of the data that can be shared with partners for monetization or for adding value to the
business relationship.
The most common approach for this is to use the Map/Reduce81 pattern to batch process
collected data. Apache Hadoop is the predominant implementation of that pattern, and
Microsoft provides HDInsight, which is a cloud platform service implementation of Hadoop.
The approach may be as simple as aggregating and summarizing data for simpler reuse, or
it may be complex, multi-step processing that generates insights across the recently
collected and historical data. Hadoop includes many tools within its ecosystem that help
with searching, querying, and cataloging the data. In solutions today, Hadoop will frequently
be used to preprocess data, such that Hadoop jobs will run and create summarized datasets
that can be used for querying, reporting, and as input to machine learning activities, or as
reference datasets in Complex Event Processing solutions.
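
The pattern itself can be shown in miniature with plain Python. This sketch runs in a single
process and only illustrates the map, shuffle, and reduce phases; it does not use Hadoop or
HDInsight, and the record fields are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record: dict):
    """Emit (key, value) pairs; here one pair per telemetry record."""
    yield record["device"], record["temp"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Summarize all values for one key; here the maximum temperature."""
    return key, reduce(max, values)


records = [
    {"device": "a", "temp": 20.0},
    {"device": "b", "temp": 31.5},
    {"device": "a", "temp": 24.5},
]
pairs = [pair for record in records for pair in map_phase(record)]
summary = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The resulting `summary` is the kind of summarized dataset that could then feed querying,
reporting, or machine learning.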

Machine learning
Machine learning refers to the concept of studying data and deriving insights from the data.
The results will be a model that can be used to predict future outcomes from similar data
sets. The first step is to train the model. This is normally an iterative step performed by a
data scientist where a training set of data is used to infer a function, or model, from that
data. That model will be used to make decisions on incoming data. The model is typically
retrained periodically, so that the model can improve over time, learning from additional
new data and patterns.
Machine learning falls into two broad categories: supervised learning and unsupervised
learning. Supervised learning studies the data looking for a known set of desired outcomes.
For example, in the vehicle scenario, we may want to minimize the number of times that a
car needs its oil changed. We would run studies against the data looking for patterns that
give us information about the consequences of delaying oil changes, conditions, and so on.
In unsupervised learning, the goal is to find patterns and relationships of any
kind in the data, without a predefined outcome. After something interesting is observed, these
data points are investigated further until they are found to be either useful or not useful.
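
The train-then-predict cycle of supervised learning can be illustrated with the simplest
possible model, a straight line fitted by ordinary least squares. The data below is
invented for the example, and a real project would use one of the tools named in this
section rather than hand-written code.

```python
def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply the trained model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training set: thousands of km since the last oil change vs. a wear score.
km_since_change = [1, 2, 3, 4]
wear_score = [3, 5, 7, 9]
model = train(km_since_change, wear_score)
estimate = predict(model, 5)
```

Retraining amounts to calling `train` again on the extended data set and replacing the
model that scores incoming data.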
Common tools for machine learning include MATLAB,82 Mahout,83 and R.84 Microsoft
introduced its ML tooling in June 2014, called Azure ML.85 Azure ML is a machine learning
service that democratizes the practice of machine learning. It provides a visual experience
for constructing data experiments, and easy-to-use implementations of many commonly
used machine learning algorithms, relieving the data scientist of implementing them in a
programming language. Azure ML integrates easily with Azure Storage, HDInsight, and
Windows Azure SQL Database, and it can expose the models as web services so that they
are simple to integrate into the runtime data flow or applications.

81 Wikipedia, MapReduce
82 Wikipedia, MATLAB
83 Wikipedia, Apache Mahout
84 Wikipedia, R (programming language)

Data enhancement
Another core piece of the IoT architecture is data enhancement. The data collected from the
devices, the volume of it, and the hidden patterns within it provide tremendous value, but
often the device data must be combined with other data, either because that is critical for it
to make sense to the business, or because there is even more significant value to be gained by
analyzing other data sets together with the device data. Enterprise data may be used for
simple things, such as relating device data to customer data. Other areas of opportunity
include data markets that publish datasets that are either sold or available for free.
Microsoft offers the Azure DataMarket,86 which provides historical, environmental, and other
datasets from governments, research institutions, business organizations, and more. One of the
most frequent datasets that
gets combined with device data is weather. Devices often exist all over the globe in different
conditions, so predictive maintenance will frequently factor in weather data, which is
normally sourced from weather data providers as opposed to collecting it with the device
itself.
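
Combining device data with an external dataset is, at its core, a join on shared keys. The
sketch below joins readings with weather by city and date; the records, the field names, and
the choice of join key are all hypothetical.

```python
# Hypothetical device readings and a weather lookup keyed by (city, date).
device_readings = [
    {"device": "pump-1", "city": "Oslo", "date": "2014-11-03", "vibration": 0.8},
    {"device": "pump-2", "city": "Cairo", "date": "2014-11-03", "vibration": 0.3},
]
weather = {
    ("Oslo", "2014-11-03"): {"temp_c": -2.0},
    ("Cairo", "2014-11-03"): {"temp_c": 27.0},
}


def enrich(reading: dict, weather_by_city_date: dict) -> dict:
    """Return a copy of the reading with any matching weather fields added."""
    key = (reading["city"], reading["date"])
    return {**reading, **weather_by_city_date.get(key, {})}


enriched = [enrich(r, weather) for r in device_readings]
```

Readings without a matching weather record pass through unchanged, so missing external data
does not block processing.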

Publishing insights
After data stored in the system has been
processed into information of value to others, the
question becomes how to approach this exposure
in a secure and compatible manner that is easy to
discover and consume. Some organizations want
to make their data available to partners both up
and down the supply chain to realize efficiencies
that result in lower costs and improved margins. Others are realizing the data they have can
be directly monetized as services available for consumption by individuals, corporations, and
governments around the world. In addition to the stand-alone value of the data, it may also
be seen as valuable to augment other data services. Data that may seem uninteresting to
those within the organization could in reality be a key ingredient used in a number of
potential external applications or analytical recipes. For an in-depth discussion of
data-publishing considerations, see the paper Making Public Data Public from Microsoft.87 The
following sections discuss many aspects of this topic.

85 Techcrunch, Microsoft announces Azure ML, Cloud-based Machine Learning Platform That
Can Predict Future Events
86 See https://datamarket.azure.com/
87 Microsoft, Making Public Data Public

Audience
The target audience for the data will have a significant impact on how it is published. Will it
be used to enhance analysis of other data? Will it be used through data visualization tools,
such as PowerBI or Tableau? Will it be metered and have a price associated with it? Or will
there be different views and price points of the data for different partners?

Publishing format
The choice of publishing format will be influenced by the targeted audience and the type of
information being published. Similar to the discussion earlier in this paper about the
incoming message format, the most likely choices for publishing data are XML, JSON, and
AtomPub. OData88 is a standardized protocol for creating and consuming data APIs. OData
originated at Microsoft, but it has become well-accepted in the industry. OData supports both
JSON and AtomPub, so it is widely consumable by nearly all current tools and programming
languages.
There are tools that can help scale, secure, and normalize the data publishing task. The
Microsoft Azure DataMarket89 is a global marketplace for data and applications that provides
discoverability, interface normalization, and a monetization approach. Microsoft Azure API
Management90 is a service that facilitates publishing APIs. It includes features for API
translation, versioning, aggregation, discovery, authorization, caching, and quotas. Both
Azure DataMarket and Azure API Management can be part of the publishing strategy, using
DataMarket for the broad exposure of large datasets, and API Management to expose APIs
securely with usage metrics and management capabilities.

88 OData, OData Home page
89 Microsoft, Microsoft Azure Marketplace Publishing
90 Microsoft, Microsoft Azure API Management

Cost modeling and estimation
Determining the cost of an Internet of Things (IoT) solution focused on predictive
maintenance is generally a complex problem. This section outlines an initial approach that we
have used with our customers to estimate the cost of the architecture to support their
predictive maintenance solutions. As with any calculation, it is very specific to a scenario, and
this model will not be applicable to all situations or be totally complete.
Before we go into the specifics of determining the cost for a solution, we want to stress that
cost modeling, like capacity planning, is an iterative exercise. The process repeats itself, and
performance testing and other data gathered will change capacity distribution (for example,
different workloads could be combined in a single unit to save cost because these workloads
are compatible in load profile) and tune the model over time. In other words, the first cost
estimate will not be perfect, and it provides only an indicator of the cost of the solution.

A common architecture for IoT


Although you need to verify whether it satisfies your specific requirements, our work
with customers surfaced a reference architecture that helps in implementing the Service
Assisted Connectivity pattern by acting as the gateway mentioned earlier. This architecture is built
on top of Microsoft Azure Service Bus. Within Service Bus, it utilizes Event Hubs for the
ingress (device to cloud) of data and topics for sending Command & Control messages as
well as replies.

Event Hubs
Event Hubs is a new feature of Microsoft Azure Service Bus. It stands next to topics and
queues as a Service Bus entity, and provides a different type of queue, offering time based
retention, client-side cursors, publish subscribe support, and high scale stream ingestion.
Although it could be argued the use of topics could satisfy the technical requirement for
receiving data from devices, Event Hubs supports higher throughput and has an increased
horizontal capacity.

Architectural details
Starting at the logical architecture level, the main architectural components are depicted in
the following figure.

Figure 11. Reference architecture conceptual overview


The previous conceptual architecture figure includes four important components within the
system:
1. The provisioning service that takes in information on authorized devices, creates its
configuration, and stores access keys.
2. Devices that interact using either AMQP or HTTP towards Service Bus directly, or a
component called the Custom Protocol Gateway Host, which hosts adapters for other
protocols, such as MQTT and CoAP.
3. Telemetry requests that are distributed by the router, using adapters to communicate
with downstream storage and processing engines.
4. Commands sent to devices through the use of the notification/command router that is
internally surfaced through the Command API host.
To ensure the architecture is able to support a large number of devices, a partitioned model
where the device population is divided into manageable groups is used. This partition model
can be seen in the following figure.

Figure 12. Reference architecture details and partition overview


The figure details some important aspects of the reference architecture:

Master. Part of the requirements assumption for the architecture is that solutions built
on top of it will aim for a unified global or at least regional management model,
independent from technical scale limitations that might inform how large a particular
partition may grow.
This motivates an overarching architectural model with a common Master service,
shown on the far left of the figure, that takes care of shared management and
deployment tasks, as well as of device provisioning and placement, and several parallel
and independent deployments of Partition services that each take ownership of one or
more logical system partitions.

Partition. Instead of looking at a population of millions of connected devices as a whole,
the system divides the device population into smaller, more manageable partitions of
large numbers of devices each.
Each resource in the distributed system has a throughput- and storage-capacity ceiling,
limiting the number of devices associated with any single Service Bus ingress entity so
that the events sent by the devices will not exceed that entity's ingestion throughput
capacity, and any message backlog that might temporarily build up does not exceed the
entity's storage capacity.
In order to allocate appropriate compute resources and not overload the storage backend
with too many concurrent write operations, a relatively small set of resources with
reasonably well-known performance characteristics is bundled into an autonomous, and
mostly isolated scale-unit.
Each scale-unit supports a maximum and tested number of devices, which is also
important for limiting risks in a scalability ramp-up. The principle behind this is that a
production system can only be scaled up as much as it can be scaled up in testing on a
regular basis.

A benefit of introducing scale-units is that they significantly reduce the risk of full system
outages. If a system depends on a single data store and that store has availability issues,
the whole system is affected. However, if the system consists of 10 scale-units that each
maintain an independent store, issues in one store only affect 10 percent of the system.
The rationale for running all traffic ingestion through asynchronous Service Bus
messaging entities, instead of into a service edge that writes data straight to the
database, is that Service Bus already provides a scaled-out and secure network service
gateway for messaging, and it is specifically designed to deal with bad network
conditions, traffic bursts, and even sustained traffic peaks. A back-end datastore that is
the target of the ingested data should not have to be dimensioned to handle specific bursts,
such as vehicle telemetry during core European or U.S. East Coast rush hours.
The group called partition is a set of resources focused on handling data from a
well-defined and known device population that has been assigned to and configured into the
partition through provisioning. Cross-partition distribution of devices will be based on
your solution-specific logic, and allocation within the partition is handled by provisioning.
The partition group is the unit of scale. Through testing, the load specifications for the
partition have to be determined and a so-called scale-unit can be defined. A scale-unit is
a group of resources that can effectively support a well-known load profile for the
system, allowing replication of the scale-unit to provide support for an extrapolation of
this load profile. Within the partition group, there are two basic paths, ingestion
(sending data from the device to the cloud) and egress (sending data from the cloud to
the device). These paths accomplish the following:
Ingestion. Ingestion has a given device connect through its supported protocol,
delivering messages to its specific Event Hub, using its assigned credentials.
Egress. Egress routes messages (replies, Command & Control) to their device
destination.

Device Repo. The device repository contains configuration information about the
registered devices for a given partition.
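
The relationship between provisioning, partitions, and the tested scale-unit size can be
sketched as follows. This is not the provisioning service of the reference architecture:
the `group` key (for example a region), the `group-index` naming scheme, and the in-memory
dictionaries are assumptions that stand in for solution-specific logic and the device
repository.

```python
class Provisioner:
    """Assign each device to a partition without exceeding the tested scale-unit size."""

    def __init__(self, max_devices_per_partition: int):
        self.max_devices = max_devices_per_partition
        self.partitions = {}   # partition name -> set of device ids
        self.assignment = {}   # device id -> partition name

    def provision(self, device_id: str, group: str) -> str:
        """Return the partition for a device, creating a new partition when one is full."""
        if device_id in self.assignment:
            return self.assignment[device_id]
        index = 0
        while True:
            name = f"{group}-{index}"
            members = self.partitions.setdefault(name, set())
            if len(members) < self.max_devices:
                members.add(device_id)
                self.assignment[device_id] = name
                return name
            index += 1
```

Because the assignment is stored, a device always returns to the same partition, which
matches the idea that a device delivers messages to its specific Event Hub.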

Capacity modeling
Before cost can be modeled, the way that the system will scale needs to be considered and
the characteristics of the architecture need to be determined. Essentially, the attributes of
the previously mentioned scale-unit need to be defined.
There is a throughput ceiling for each of the components in the architecture, including each
of the Service Bus entities. The reason to be cautious when evaluating throughput is that
when dealing with distributed devices that send messages periodically, we cannot assume
perfect, random distribution of event submissions across any given period. There will be
bursts and we need to allow for ample capacity reserve to handle such bursts.
Assuming a scenario of a 10-minute event interval with one extra control interaction
feedback message per device per hour, seven messages per hour can be expected from each
device, so roughly 50,000 devices can be associated with each entity that has an average
throughput capacity of 100 messages per second.
Having covered the flow rate, we can conclude that storage throughput is of little concern.
However, storage capacity and the manageability of the event store are concerns. The
per-device event data at a resolution of one hour for 50,000 devices amounts to some 438
million event records per year. Even if these event records are limited in size to only 50
bytes, the payload data still amounts to 22 GB per year for each of the scale-units. This
underlines the need to keep an eye on the storage capacity and storage growth when
thinking about sizing scale-units.
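
The figures in this section can be checked with a few lines of arithmetic. The variable
names are illustrative, and the sketch assumes decimal gigabytes (10^9 bytes) and a 365-day
year.

```python
DEVICES = 50_000

# One event every 10 minutes plus one control message per hour.
messages_per_device_per_hour = 60 / 10 + 1
avg_messages_per_second = DEVICES * messages_per_device_per_hour / 3600

# One stored record per device per hour, 50 bytes each.
records_per_year = DEVICES * 24 * 365
payload_gb_per_year = records_per_year * 50 / 1e9
```

This gives roughly 97 messages per second per entity, 438 million records per year, and
about 21.9 GB of payload per year, which matches the rounded figures above.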
These considerations manifest in a capacity model in the deployment model, which informs
how many entities must be created in the Service Bus namespace backing a partition for a
given device population size like 50,000 devices and for a given load profile.
The load profile is currently informed by how many (telemetry) messages a device is
generally expected to send, how many commands or notifications the device is expected to
receive per hour, and what the average size of these messages is. The inputs should be
well-informed but generous estimates, because while changing the shape of a scale-unit layout
at a later time is possible, doing so may require re-provisioning the devices.
Determining partitions is not only motivated by capacity concerns, however. Because a
partition also forms a configuration scope, it provides a suitable mechanism to segregate
device populations by region, country, owner, operator, product, or other concerns. As an
example, one deployment can have up to 1,024 partitions.
Each partition corresponds to exactly one Service Bus namespace. Because there can only
be 50 namespaces per Azure subscription, and other dependent services have similar
quotas, a fully built-out architecture will therefore most likely span multiple subscriptions.
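
Using only the two figures stated above, the number of subscriptions for a fully built-out
deployment follows directly. The quota is the one quoted in the text and may change; the
function name is an assumption.

```python
import math

NAMESPACES_PER_SUBSCRIPTION = 50  # quota as stated in the text; subject to change


def subscriptions_needed(partitions: int) -> int:
    """One namespace per partition, so divide by the per-subscription quota and round up."""
    return math.ceil(partitions / NAMESPACES_PER_SUBSCRIPTION)


full_build_out = subscriptions_needed(1024)
```

For 1,024 partitions this yields 21 subscriptions, before considering the quotas of other
dependent services.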
In summary, the attributes that we have found to determine the capacity model are:

Number of devices. This is the number of sensors supplying telemetry
information to the scale-unit.

Average message interval ingress / egress. This represents the average
number of messages that a given device emits per hour (ingress) and
that the system emits per hour (egress).

Average message size ingress / egress. This is the average size of the
messages that a device emits (ingress) or the system sends (egress), in bytes.

Figure 13. Scale Units in the reference architecture

Cost estimation
When estimating the cost of a solution built on top of this architecture, there are many
factors to consider. We will work through the list from the ingress of device data to sending
commands. Cost is estimated based on the architectural design and the scale necessary for
success. As such, each cost estimation formula has variables for the scale that is needed.


Before we dig into the details, we want to underscore that cost modeling, like capacity
modeling, is an input for architectural decision making and business case modeling, and
that all inputs should always be considered together. As an example, you might find that
using HTTP for communications is somewhat less expensive from a cost modeling perspective.
However, choosing HTTP over AMQP will inherently impact performance.
For all pricing related information in the cost estimation formulas outlined in this section, it is
important to state that prices will vary over time and the examples are aimed only at
explaining the formula itself. The latest pricing information can always be found at
http://azure.microsoft.com/en-us/pricing/overview/.

Ingress path cost using Event Hubs


Figure 14. The ingress path of the reference architecture

As events consumed from an Event Hub, as well as management operations and control calls
such as checkpoints, are not counted as billable ingress events, the formula for estimating
cost for the architecture when using Event Hubs is a combination of:

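$$
\text{Cost}_{\text{monthly ingress}} = \text{Cost}_{\text{base charge}} + \text{Cost}_{\text{brokered connections}} + \text{Cost}_{\text{throughput units}} + \text{Cost}_{\text{messages}} + \text{Cost}_{\text{protocol gateway}} + \text{Cost}_{\text{telemetry pump}} + \text{Cost}_{\text{egress traffic}}
$$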
Which expands into a more detailed formula we can work with to fill in the appropriate
variables:

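$$
\text{Cost}_{\text{monthly ingress}} = \text{Cost}_{\text{base charge}}
+ \begin{cases}
a \cdot \$0.03 & a < 100\text{k} \\
a \cdot \$0.025 & 100\text{k} \le a \le 500\text{k} \\
a \cdot \$0.015 & a > 500\text{k}
\end{cases}
+ 744 \cdot N_{\text{throughput units}} \cdot \$0.03
+ \left( \frac{N_{\text{devices}} \cdot A_{\text{msg per month}} \cdot \lceil A^{\text{msg}} / 64 \rceil}{1{,}000{,}000} - 12.5 \right) \cdot \$0.028
+ \dots
$$

$$
\text{where } a = \frac{T_{\text{brokered connections}}}{744} - 1000
$$

Equation 1 - The cost estimation formula for the ingress path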

Note that this formula uses the Standard tier offering of Event Hubs91, which offers
additional brokered connections, filters, and additional storage capacity. The fixed
pricing elements in the formula use pricing from a point in time and are subject to change.
Also, the formula assumes a flat use of brokered connections, while actual billing is based
on peak use prorated per hour; the dynamics of your system will likely deviate.
The variables in this equation are:
Variable | Description
Cost monthly ingress | The cost of the ingestion of events, per month.
T brokered connections | The total number of connection hours to the system, summed over all simultaneous connections.
N throughput units | The number of throughput units92 needed to support the ingress of data into the system. A throughput unit is the combination of inbound bandwidth, temporary storage and outbound bandwidth, as described in the reference.
N devices | The number of deployed devices sending data to the system.
A msg per month | The average number of messages sent into the system, per device, per month.
A msg | The average size of each message sent into the system, in kilobytes.
N supported scale | The number of worker roles necessary to support the projected scale of the system. Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
Cost worker | The cost per worker role for the ingress path when using custom protocols and for the telemetry pump, per hour.
A egressGB | The average amount of egress traffic per month, in gigabytes.
Cost egressGB | The cost of egress traffic93, per gigabyte.
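To make the variables above concrete, here is a minimal Python sketch of an ingress estimate. It is not the authors' implementation. It assumes the point-in-time prices quoted in this section ($0.03 per throughput-unit hour, $0.028 per million events, and flat per-connection tiers of $0.03, $0.025 and $0.015), 1,000 included brokered connections, 12.5 million included events, a single worker-role term of 744 x N supported scale x Cost worker, and an egress term of A egressGB x Cost egressGB. All function and parameter names are hypothetical.

```python
import math

HOURS_PER_MONTH = 744  # 31-day month, as used in the formulas


def brokered_connection_price(a: float) -> float:
    """Flat price per billable connection for the tier that `a` falls into."""
    if a < 100_000:
        return 0.03
    if a <= 500_000:
        return 0.025
    return 0.015


def monthly_ingress_cost(base_charge, connection_hours, throughput_units, devices,
                         msgs_per_device_per_month, msg_size_kb, worker_roles,
                         cost_worker_per_hour, egress_gb=0.0, cost_egress_gb=0.0,
                         included_million_events=12.5):
    """Estimate monthly ingress cost in dollars under the assumptions stated above."""
    # Average simultaneous connections, minus the 1,000 assumed to be included.
    a = max(connection_hours / HOURS_PER_MONTH - 1000, 0)
    connections = a * brokered_connection_price(a)
    throughput = HOURS_PER_MONTH * throughput_units * 0.03
    # Events are billed per 64 KB block, in millions of events.
    million_events = (devices * msgs_per_device_per_month
                      * math.ceil(msg_size_kb / 64) / 1_000_000)
    messages = max(million_events - included_million_events, 0) * 0.028
    # Assumption: one worker-role term covers protocol gateway and telemetry pump.
    workers = HOURS_PER_MONTH * worker_roles * cost_worker_per_hour
    egress = egress_gb * cost_egress_gb
    return base_charge + connections + throughput + messages + workers + egress


if __name__ == "__main__":
    # 100,000 connections held for the whole month, 1,000,000 devices, 1 KB messages.
    print(monthly_ingress_cost(0.0, 100_000 * HOURS_PER_MONTH, 17, 1_000_000,
                               44_640, 1, 50, 0.08))
```

Keeping each term in its own variable makes it easy to see which component dominates for a given load profile, and to swap in current prices.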

Example calculation
An example calculation where 1,000,000 deployed devices send a message averaging 128
bytes every 60 seconds, having an average number of 100,000 simultaneously connected
devices during the entire month would yield the following results:
Variable | Value
T brokered connections | 100,000 (100,000 simultaneous connections for the full month).
N throughput units | 17 (44,640 messages per device, per month. 44,640,000,000 messages per month, equaling 16,666.66 per second. Given a single throughput unit supports up to 1,000 messages per second, rounding up 16,666.66 / 1,000 equals 17).
N devices | 1,000,000
A msg per month | 44,640 (744 hours * 3,600 equals 2,678,400 seconds per month. 1 message every 60 seconds equals 44,640 messages per month)
A msg | 1 KB (rounding up 128 bytes in KB (128 / 1,024 equals 0.125)).
N supported scale | 50 (assuming a rough estimate of 20,000 devices would be supported per worker role). Note again, this is not a capacity modeling exercise; these numbers should come from performance tests on your specific scenario.
Cost worker | $0.08 per hour (assuming A1 worker role size).
A egressGB | 0 (assuming all downstream processing happens inside the same region DC).
Cost egressGB | Not Applicable

91 See http://azure.microsoft.com/en-us/pricing/details/event-hubs/.
92 Microsoft, Microsoft Azure, Event Hubs pricing, FAQ What are throughput units and how
are they billed?
93 Microsoft, Microsoft Azure, Data Transfer Pricing Details
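The throughput-unit figure in the example can be reproduced with a few lines of Python. This sketch considers only the message-rate limit of 1,000 messages per second per throughput unit quoted in the example; it does not check bandwidth limits, and the function name is hypothetical.

```python
import math


def throughput_units_needed(devices: int, seconds_between_messages: float,
                            msgs_per_second_per_unit: int = 1000,
                            hours_per_month: int = 744) -> int:
    """Throughput units needed for a steady message rate, rounded up."""
    seconds_per_month = hours_per_month * 3600
    msgs_per_device_per_month = seconds_per_month / seconds_between_messages
    msgs_per_second = devices * msgs_per_device_per_month / seconds_per_month
    return math.ceil(msgs_per_second / msgs_per_second_per_unit)


if __name__ == "__main__":
    # 1,000,000 devices, one message every 60 seconds.
    print(throughput_units_needed(1_000_000, 60))
```

Because the result is rounded up, small changes in the message interval can move the estimate by a whole throughput unit.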

$$
\text{Cost}_{\text{monthly ingress}} = \text{Cost}_{\text{base charge}} + 99{,}000 \cdot \$0.03 + 744 \cdot 17 \cdot \$0.03 + \left( \frac{1{,}000{,}000 \cdot 44{,}640 \cdot \lceil 1/64 \rceil}{1{,}000{,}000} - 12.5 \right) \cdot \$0.028 + \dots
$$

Egress path cost


As with ingress, the egress path also has multiple components that incur cost. As sizes often
vary between ingestion data and command & control, the message size is not the same
value as used in the ingress path.
The components involved in egress are:

Command API Host. The process in charge of sending notifications and commands to devices
and groups of devices. It encapsulates the notification/command router, and routes egress
messages to the appropriate topic on Microsoft Azure Service Bus, depending on the type of
request. It is hosted inside a worker role.

Subscriptions. There are two different types of messages that the Command API supports:
notifications and commands. A command can yield either a single response message or
multiple response messages. Notifications and commands can also target groups of devices.
All of these messages incur cost. Response messages have not been accounted for in the
egress calculation and should be estimated here. Command replies are not routed through the
telemetry adapters.

Egress traffic. Each egress message will incur cost.

Figure 15. The egress path for the reference architecture

Given these components, the egress path cost can be calculated using the following
formula:

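$$
\text{Cost}_{\text{monthly egress}} = \text{Cost}_{\text{worker roles commandAPI}}
+ \text{Cost}\left\{ \begin{array}{l}
\text{notifications} \\
\text{single command messages} \\
\text{multi command messages} \\
\text{command response messages}
\end{array} \right\}
+ \text{Cost}_{\text{egress traffic}}
$$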

Which also expands into a more detailed formula we can work with to fill in the appropriate
variables:

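$$
\text{Cost}_{\text{monthly egress}} = 744 \cdot N_{\text{supported scale}} \cdot \text{Cost}_{\text{commandAPI}}
+ \begin{cases}
a \cdot \$0.80 & a < 100 \\
a \cdot \$0.50 & 100 \le a \le 2{,}500 \\
a \cdot \$0.20 & a > 2{,}500
\end{cases}
+ \frac{\bar{a} \cdot (A_{nm} + A_{csm} + A_{cmm})}{1{,}048{,}576} \cdot \text{Cost}_{\text{egressGB}}
$$

$$
\text{where } a = \left\lceil \frac{(A_{nm} + A_{csm} + A_{cmm} + A_{rm}) \cdot \lceil \bar{a} / 64 \rceil}{1{,}000{,}000} - 12.5 \right\rceil
$$

Equation 2 - The cost estimation formula for the reference architecture egress path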
This calculation combines both single device notifications and commands, as well as group
broadcast messaging. Determining the magnitude and distribution in order to figure out the
averages within the formula is left to the reader as part of the capacity modeling for the
system architecture.
The variables in this equation are:
Variable | Description
N supported scale | The number of roles necessary to support the projected scale of the system. Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
Cost commandAPI | The cost per worker role for the command API host, per hour.
A nm | The average number of notifications per month.
A csm | The average number of single response command messages per month.
A cmm | The average number of multiple response command messages per month.
A rm | The average number of response messages to commands, per month.
ā | The average response size, in kilobytes, averaged over all outbound message types.
Cost egressGB | The cost of egress traffic94, per gigabyte.
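The following Python sketch shows one way to turn these variables into an estimate. It is an illustration under stated assumptions, not the authors' implementation: each message is billed per 64 KB block, 12.5 million operations per month are included, the tier prices ($0.80, $0.50 and $0.20 per million operations) are the point-in-time values from this section, and only outbound messages count toward egress traffic. The function and parameter names are hypothetical.

```python
import math

HOURS_PER_MONTH = 744
KB_PER_GB = 1_048_576


def price_per_million_operations(a: int) -> float:
    """Flat price per million billable operations for the tier `a` falls into."""
    if a < 100:
        return 0.80
    if a <= 2_500:
        return 0.50
    return 0.20


def monthly_egress_cost(worker_roles, cost_command_api_per_hour, outbound, responses,
                        cost_egress_gb, included_million_ops=12.5):
    """`outbound` and `responses` are lists of (message_count, size_kb) pairs."""
    workers = HOURS_PER_MONTH * worker_roles * cost_command_api_per_hour
    # Every message, outbound or response, is billed per 64 KB block.
    operations = sum(count * math.ceil(size_kb / 64)
                     for count, size_kb in list(outbound) + list(responses))
    a = max(math.ceil(operations / 1_000_000 - included_million_ops), 0)
    messaging = a * price_per_million_operations(a)
    # Assumption: only messages leaving the data center count as egress traffic.
    egress_gb = sum(count * size_kb for count, size_kb in outbound) / KB_PER_GB
    return workers + messaging + egress_gb * cost_egress_gb


if __name__ == "__main__":
    outbound = [(100_000, 20), (130_000, 35), (20_000, 20)]
    responses = [(130_000, 80), (60_000, 70)]
    print(round(monthly_egress_cost(2, 0.08, outbound, responses, 0.138), 2))
```

With message volumes well below the included operations, the worker roles hosting the Command API dominate the estimate.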

Example calculation
An example calculation using 100,000 notifications per month of 20 KB each, 130,000
commands of 35 KB each with single replies of 80 KB each, and 20,000 commands of 20 KB
each with on average three (3) replies of 70 KB each would yield the following results:
Variable | Value
N supported scale | 2
Cost commandAPI | $0.08 (A1)
A nm | 100,000
A csm | 150,000
A cmm | 20,000
A rm | 190,000 (130,000 + 3 * 20,000 equals 190,000)
Cost egressGB | $0.138

$$
\text{Cost}_{\text{monthly egress}} = 744 \cdot 2 \cdot \$0.08 + \frac{100{,}000 \lceil 20/64 \rceil + 150{,}000 \lceil 35/64 \rceil + 20{,}000 \lceil \dots/64 \rceil + 190{,}000 \lceil \dots/64 \rceil}{1{,}000{,}000} \cdots
$$

94 Microsoft, Microsoft Azure, Data Transfer Pricing Details

Management cost
Figure 21 - The "master" component within the reference architecture

Besides the messaging-related components in the reference architecture, there is also the
concept of one or more masters for managing the system, as discussed previously in this
paper. The master is tasked with provisioning devices, creating appropriate queues and
topics, storing device information, provisioning security, and so on. The master contains
the following cost components:

Provisioning Runtime. The component called by tooling to provision a device or a set of
devices into the system, creating the necessary service bus, compute, and storage
artifacts. It is hosted inside a worker role.

Device Repo. The datastore collecting the registered devices per partition.

Partition Repo. The datastore collecting partition registration information.

Given these components, the management cost can be calculated using the following
formula:

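$$
\text{Cost}_{\text{monthly mgmt}} = 744 \cdot N_{\text{supported scale}} \cdot \text{Cost}_{\text{master}}
+ \left( \text{partition repoGB} + \text{device repoGB} \cdot N_{\text{partitions}} \right) \cdot \text{Cost}_{\text{tsGRS}}
+ \frac{di}{100{,}000} \cdot \text{Cost}_{tx}
$$

Equation 3 - The cost estimation formula for management of the reference architecture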

The variables in this equation are:


Variable | Description
Cost monthly mgmt | The cost of the management for the architecture, per month.
N supported scale | The number of roles necessary to support the projected scale of the system. Normally, at least two (2) are needed to fall within SLA support of Microsoft Azure.
Cost master | The cost per worker role for the management host, per hour.
partition repoGB | The number of gigabytes used in the partition repository for administrative purposes.
device repoGB | The number of gigabytes used in the device repository, per partition.
N partitions | The number of partitions to allow for appropriate scale.
Cost tsGRS | The cost for Geo Redundant Storage (GRS) table storage ($0.095 / GB at the time of writing).
di | The number of changes to device information per month. Any change to the device information stored in the system, and subsequently in a device repository inside a partition, will account for at least two operations on table storage.
Cost tx | The cost for storage transactions ($0.0036 / 100k transactions at the time of writing).

Example calculation
An example calculation using 10,000 changes to device registration per month (either new
devices, changes in activation, or removed devices) leading to a total partition repo
(assuming a single master instance is used) size of 256 MB and 128 MB device repository
per partition, using 10 partitions, would yield the following results:
Variable | Value
N supported scale | 2
Cost master | $0.16 (medium)
partition repoGB | 0.25
device repoGB | 0.125
N partitions | 10
Cost tsGRS | $0.095 / GB
di | 10,000
Cost tx | $0.0036 / 100k

$$
\text{Cost}_{\text{monthly mgmt}} = 744 \cdot 2 \cdot \$0.16 + (0.25 + 0.125 \cdot 10) \cdot \$0.095 + 0.1 \cdot \$0.0036 = \$238.08 + \$0.1425 + \$0.00036 \approx \$238.22
$$
As can be observed from the outcome of the formula, the cost of management for the
reference architecture is mostly dependent on the worker roles running to support it.
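The management estimate is simple enough to check in a few lines of Python. This sketch mirrors the three terms of the example (worker roles, GRS table storage, storage transactions) using the point-in-time prices quoted above; the function and parameter names are hypothetical.

```python
HOURS_PER_MONTH = 744


def monthly_management_cost(roles, cost_master_per_hour, partition_repo_gb,
                            device_repo_gb, partitions, cost_ts_grs_per_gb,
                            device_changes, cost_tx_per_100k):
    """Return the (compute, storage, transactions) cost terms in dollars."""
    compute = HOURS_PER_MONTH * roles * cost_master_per_hour
    # One partition repository plus one device repository per partition.
    storage = (partition_repo_gb + device_repo_gb * partitions) * cost_ts_grs_per_gb
    # Transactions are priced per 100,000.
    transactions = device_changes / 100_000 * cost_tx_per_100k
    return compute, storage, transactions


if __name__ == "__main__":
    compute, storage, transactions = monthly_management_cost(
        2, 0.16, 0.25, 0.125, 10, 0.095, 10_000, 0.0036)
    print(round(compute + storage + transactions, 2))
    # Share of the total that goes to worker roles.
    print(round(compute / (compute + storage + transactions), 4))
```

Returning the terms separately makes the observation above easy to verify: compute accounts for almost all of the monthly total.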

System processing cost


An IoT system with only the ability to ingest and offload data combined with the ability to
send commands is not complete. This is just the communication interface for connecting
devices to a central system.
Although it is not included in this example, in order to complete an IoT system, there is a
need to perform data analysis, either in flight by using an event processing engine, or at rest
by using solutions for machine learning. With a high degree of certainty, you will also need
components that take advantage of key parts of this underlying technology to surface
management and control mechanisms to users through the use of one or more portals,
expose the gathered knowledge from machine learning to other parties through web
services, and so on.

Cost estimate calculation


In the previous sections of this paper, we discussed the various components that make up
the cost for the data ingestion and communication platform inside the reference
architecture. When we combine these, we can calculate the total estimated cost for a
partition, and extrapolate the total estimated OPEX cost for the system based on the
number of needed partitions using the following formula:

$$
\text{Cost}_{\text{total per month}} = \left( \text{Cost}_{\text{ingress}} + \text{Cost}_{\text{egress}} \right) \cdot N_{\text{partitions}} + \text{Cost}_{\text{management}}
$$

Important topics not yet covered

In this paper, we have strived to capture many of our learnings from implementing
predictive maintenance solutions in the Internet of Things (IoT) space. However, in addition
to the topics discussed, there is both much detail to add and more things to think about
when architecting for IoT. This final section touches on some of these topics.

Networks with automatic handover and fallbacks
When we think about IoT scenarios, there seems to be an emerging need for networks
working together in a seamless manner in order to provide frequently roaming users with
the ability to perform command and control to either partially or fully closed IoT systems
that they can access. This capability would require working across vendors and standards to
ensure that the right connectivity type is available at the right time, and at the right price.

The need for the commoditization of devices


Many solutions today use their own proprietary hub for connecting their point solution to the
Internet. This approach needs to change, with vendors selling connectivity bridges that work
much like today's home Internet routers. In fact, such Internet routers could prove to be a
great point of integration with standardized PAN/LAN devices, and support autonomous
operations when connectivity is not available. Ideally, these bridges would support current
legacy non-IP PAN device protocols, such as Z-Wave, traditional ZigBee, and so on.

The creation and use of information marketplaces
As IoT systems evolve, especially those capturing telemetry for intelligent decision making,
there is a clear need for data augmentation to provide context for machine learning.
Information marketplaces, such as Microsoft Azure DataMarket, need to expand their
offerings, providing new opportunities for data providers.

Management solutions
There are standards put forth for managing devices95, such as OMA Device Management96 (of
which Microsoft implemented a subset, called Mobile Device Management 97), CPE WAN
Management Protocol98, Lightweight M2M99, and UPnP-DM100.
95 Blackberry, A Comparison of Protocols for Device Management and Software Updates
96 Wikipedia, OMA Device Management


As millions of devices become part of IoT systems, there is a clear need for IoT solutions that
can monitor and manage incidents in the systems, visualize information and effectively
control the environment, span the various connectivity options, and support legacy
systems.

The redefinition of SLAs


Although this is a very hard problem to solve, customers will ask for different types of
Service Level Agreements (SLAs) in this space. Where current SLAs provide a system
availability guarantee, this definition has to evolve to provide concrete answers to
questions such as how much bandwidth is available, what maximum and average latency to
expect, how many I/O operations per second (IOPS101) the storage system can provide, and
so on. Moving beyond those basic guarantees, customers will seek answers from SaaS
solutions for IoT based simply on the number of devices that they can support.

Integration simplicity
As IoT promises to extend vertical solutions across horizontal markets, and connect systems
in ways never seen before to add value to businesses and people's lives, the integration
between these systems, and the way they are secured, needs to be standardized. AMQP
provides an example of this in regard to transport-layer integration.

97 Microsoft, MS-MDM: Mobile Device Management Protocol


98 Wikipedia, TR-069
99 Ericsson, Lightweight M2M: Enabling Device Management and Applications for the
Internet of Things
100 See Introduction to UPnP Device Management
101 Wikipedia, IOPS


Conclusions
This paper has gone into great detail about the particulars of building IoT solutions, based on
our experience in working with enterprise customers. As you can see, IoT solutions can be
complex, but they also offer massive promise for increasing revenue, cutting cost, and
finding new business models based on innovative use of technology. An enterprise might
believe that its requirements are so unique that only a custom IoT solution can meet its
needs. But the unusual requirements of IoT solutions in security, communication, and scale
make them complex and expensive to build as custom solutions from the ground up.
The Microsoft Azure platform, on the other hand, has a comprehensive set of the building
blocks that you need to build an IoT solution relatively quickly and painlessly by using the
reference architecture described in this paper.


How Microsoft can help you succeed
Microsoft Services can help establish an effective strategy for your Predictive Maintenance
scenario and provide direction, implementation guidance, delivery, and support to help you
realize your Internet of Things strategy. We offer:
Customer value discovery and ideation workshops
Strategy workshops
Implementation guidance
Microsoft Services Subject Matter Expertise, both in your vertical industry and on the
topic of general IoT and Predictive Maintenance.
For more information about Consulting and Support solutions from Microsoft, contact your
Microsoft Services representative or visit www.microsoft.com/services.
