
Big Data and Information Complexity

Paul Groll - MS, CISSP, CISSO, CSM


Enterprise Information Architect
State of Michigan
Michigan Digital Government Summit
29 September 2015

Agenda
Objectives
Challenges
Complexity in IT
Complexity is emerging as Security Risk
Where to find it, how to tame it
Strange New World, Strange New Words

Objectives

Challenges
Data scientists to CEOs:
YOU CAN'T HANDLE THE TRUTH
- VentureBeat.com, 01 Aug 2015

- Growing fear of letting the data speak for itself
- Growing distrust of primary sources when the results fail to match the CXO's "intuition"

Source: about.com

Background

Moku O Loe
Source: http://

Source: hawaii.edu

Source: about.com

Complexity

Source: reefbuilders.com

Complexity
Questions of interest:

What factors control growth?


Is this something we can model?

Logistic Growth
(Verhulst, Lotka/Volterra, Kolmogorov)

$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$
where
K = the carrying capacity of the environment,
r = the intrinsic rate of growth
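
A minimal sketch of the logistic model, assuming N is the population size and using a simple forward Euler step. The function names, the starting population, and the values chosen for r and K are illustrative assumptions, not figures from the talk.

```python
def logistic_step(n, r, k, dt):
    """Advance N by one forward Euler step of dN/dt = r*N*(K - N)/K."""
    return n + dt * r * n * (k - n) / k


def simulate(n0, r, k, dt=0.1, steps=200):
    """Return the list of N values, starting at n0, after each Euler step."""
    trajectory = [n0]
    for _ in range(steps):
        trajectory.append(logistic_step(trajectory[-1], r, k, dt))
    return trajectory


if __name__ == "__main__":
    # Hypothetical values: start at 10 individuals, r = 0.5, K = 1000.
    traj = simulate(n0=10.0, r=0.5, k=1000.0)
    for i in range(0, len(traj), 40):
        print(f"t={i * 0.1:5.1f}  N={traj[i]:8.1f}")
```

Growth is nearly exponential while N is small relative to K, then flattens as N approaches the carrying capacity.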

Diversity and Complexity


As the ecosystem grows:
- What does the emerging community look like?
- What information does this community contain?
- How diverse, complex is it?
- Is this something we can model?

Complexity
Claude Shannon
From Petoskey, Michigan (yay)
Father of Information Theory
Office down the hall from Einstein
Looked at the information in a system
Developed a model of Information Diversity that works in scores of fields

Complexity
Shannon's Diversity Index
Let's assume we have a community with six types:

Type   Individuals
A      6
B      13
C      19
D      129
E      372
F      1187

How diverse is this system? How much information does it contain?

Complexity

Source: Shannon & Weaver, 1948

Complexity

Type   n(i)    p(i)     log(p(i))   p(i)[log(p(i))]
A         6    0.003    -2.46       -0.009
B        13    0.008    -2.12       -0.016
C        19    0.011    -1.96       -0.02
D       129    0.075    -1.13       -0.08
E       372    0.216    -0.67       -0.14
F      1187    0.688    -0.16       -0.11

Sum of the last column * -1 = H' = 0.39

(N = 1726)
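
The worked example can be checked with a short sketch. The log values on this slide match base-10 logarithms, so base 10 is assumed here. The function name `shannon_index` and the A to F labels attached to the counts are illustrative.

```python
import math


def shannon_index(counts, base=10):
    """Return H' = -sum(p_i * log(p_i)) over the non-empty types."""
    counts = list(counts)
    total = sum(counts)
    h = 0.0
    for n in counts:
        if n > 0:
            p = n / total
            h -= p * math.log(p, base)
    return h


# Counts from the six-type community in the example.
community = {"A": 6, "B": 13, "C": 19, "D": 129, "E": 372, "F": 1187}
h_prime = shannon_index(community.values())
print(f"N = {sum(community.values())}, H' = {h_prime:.2f}")
```

With natural logarithms, another common convention, the same counts give a larger number, so the base should be stated whenever H' values are compared.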

Complexity of the DATA system


We have populations:

Data Sources - Agencies, etc.
Data Formats - Hundreds
Data Fields - Thousands
Database Vendors - Many
Database Versions - Dozens
Data Models - Scores
Data Volume - Petabytes
Permissions - Privacy Limits, Data that need Special Handling
Confidentiality - Varied Encryption Requirements

Complexity of the DATA system


Each of these has some limit, K, as size grows:

STAFF!! - Security!! DBAs, Developers, other special skills
Hard Problem - conversion, mapping (ETL)
Server Capacity, Elasticity
Licensing restrictions
Raw storage
Limited Resources - FLASH Storage, High-Speed Computing
Planned and managed storage, backup, recovery
Specific security demands and practices
An ever-growing number of one-offs, exceptions

$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$

Complexity of the DATA system


[Figure: N at time (t), given Limit K, with the r Phase and K Phase marked. Source: memrise.com]

Complexity models
Monoculture (the trivial case): 100% A
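
Assuming these complexity models are scored with Shannon's H' from the earlier slides (the slide itself does not say so), the monoculture case sits at the bottom of the scale. The sketch below compares it with a hypothetical community spread evenly over six types. The function name and the counts are illustrative.

```python
import math


def shannon_index(counts, base=10):
    """Return H' = -sum(p_i * log(p_i)) over the non-empty types."""
    counts = list(counts)
    total = sum(counts)
    h = 0.0
    for n in counts:
        if n > 0:
            p = n / total
            h -= p * math.log(p, base)
    return h


monoculture = shannon_index([1726])   # 100% type A
even_six = shannon_index([100] * 6)   # hypothetical: six equally common types

print(f"monoculture H' = {monoculture:.3f}")
print(f"even six-type H' = {even_six:.3f}")
```

A single type carries no diversity, so H' is zero. An even spread over S types gives the largest value possible for S types, log(S).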

Complexity models
Haphazard, Unmanaged (real world)
CANNOT predict the overall state from one time (t) to the next
Expensive! Highest TCO we can have

Complexity models

Step-wise Consolidation
This is what we're after!
Reducing Complexity Reduces Costs!

The N-Squared Complexity Problem

A mere 8 objects will require 28 separate interfaces to fully realize ubiquitous communications
- 100? 4,950 interfaces!
Source: Journal of Integer Sequences, Vol. 1 (1998), Article 98.1.5

The N-Squared Complexity Problem


C(N) = 28,680

N = 240
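
The interface counts quoted on these slides are consistent with one point-to-point interface per pair of objects, that is N(N - 1)/2. A short sketch under that assumption (the function name is illustrative):

```python
def interfaces_needed(n):
    """Number of pairwise interfaces among n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (8, 100, 240):
    print(f"N = {n:>3}: {interfaces_needed(n):,} interfaces")
```

The count grows with the square of N, so doubling the number of objects roughly quadruples the interfaces to build and maintain.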

The FOUR Vs

Velocity
- It's coming fast

Volume
- There's a lot

Variety
- It's not all the same type, format, or size

Veracity
- How much can we trust the data?

Use Case Factory:


- Variability -
Audit Logs: Record Formats Vary
- Who has looked at Record #38,491?
- When? Why?
- Did that violate any laws or policies? Which?
- Repeat 1,000,000,000 times

Use Case Factory:


- Volume -
Health & Business Information Messaging
- 10k per message (on average)
- 20 million messages per month
- Must retain for 7 years
- 10k * 20mil * 84mos * 250 systems = ~ 4 petabytes
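
The volume estimate can be reproduced in a few lines. This sketch assumes "10k" means 10,000 bytes and that a petabyte is 10^15 bytes. The variable names are illustrative.

```python
BYTES_PER_MESSAGE = 10_000        # assumption: "10k" read as 10,000 bytes
MESSAGES_PER_MONTH = 20_000_000
RETENTION_MONTHS = 7 * 12         # 7 years = 84 months
SYSTEMS = 250

total_bytes = BYTES_PER_MESSAGE * MESSAGES_PER_MONTH * RETENTION_MONTHS * SYSTEMS
total_petabytes = total_bytes / 1e15  # decimal petabytes

print(f"{total_petabytes:.1f} PB retained")
```

Under binary units (10 KiB messages, pebibytes) the figure shifts slightly but stays at roughly 4.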

Use Case Factory:


- Velocity -
Vibration and Alerting Sensors
- Up to hundreds per structure
- Up to 15,000 structures
- Sending in real-time 5k messages
- 400 * 15,000 * 5k = ~ 30 Gb stream
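
A sketch of the same arithmetic. The slide does not state whether "5k" is bytes or bits, or the interval over which each sensor sends one message, so the result is left as raw "message units" per round of messages. The variable names are illustrative.

```python
SENSORS_PER_STRUCTURE = 400   # the figure used in the slide's multiplication
STRUCTURES = 15_000
MESSAGE_SIZE = 5_000          # "5k"; unit (bytes or bits) is not stated

total_sensors = SENSORS_PER_STRUCTURE * STRUCTURES
units_per_round = total_sensors * MESSAGE_SIZE  # one message from every sensor

print(f"{total_sensors:,} sensors")
print(f"{units_per_round / 1e9:.0f} billion units per round of messages")
```

Turning this into a sustained bandwidth figure needs the message rate per sensor, which would have to come from the sensor specification.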

Use Case Factory:


- Combines 3 Vs -
State Police video
- 3 8-hour shifts
- 2 HD cameras per unit (body, car)
- 10 GB/hour per camera*
- ~ 155 TB / day = Backhaul Challenge
- ~ 55 PB / year = Storage Challenge
* Depending on streaming bitrate and resolution, this could be 15-25 GB/hour
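
A rough consistency check on the video figures, reading the slide's units as gigabytes, terabytes, and petabytes (as its footnote does) and using decimal multiples. The slide does not give the number of units in the field. The value computed below is only what the daily total implies if every camera records around the clock, and is an inference rather than a figure from the talk.

```python
HOURS_PER_DAY = 3 * 8            # three 8-hour shifts
CAMERAS_PER_UNIT = 2             # body and car
GB_PER_CAMERA_HOUR = 10
DAILY_TOTAL_TB = 155             # daily figure quoted on the slide

gb_per_unit_day = HOURS_PER_DAY * CAMERAS_PER_UNIT * GB_PER_CAMERA_HOUR
yearly_total_pb = DAILY_TOTAL_TB * 365 / 1000
implied_units = DAILY_TOTAL_TB * 1000 / gb_per_unit_day  # inferred, not from the slide

print(f"{gb_per_unit_day} GB per unit per day")
print(f"{yearly_total_pb:.1f} PB per year")
print(f"about {implied_units:.0f} units implied")
```

The yearly result lands within a few percent of the slide's rounded storage figure.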

Thanks for listening


Keep in touch

Questions Welcome
PAUL GROLL
GROLLP @ Michigan.gov
517.373.9578