# Big Data and Information Complexity

## Paul Groll - MS, CISSP, CISSO, CSM

Enterprise Information Architect
State of Michigan
Michigan Digital Government Summit
29 September 2015

Strange New World, Strange New Words

Data scientists to CEOs:
YOU CAN'T HANDLE THE TRUTH
- VentureBeat.com, 01 Aug 2015

- Growing fear of letting the data speak for itself
- Growing distrust of primary sources when the results fail to match the CXO's "intuition"

Background

Moku O Loe
Source: http://

Source: hawaii.edu

Complexity

Source: reefbuilders.com

Complexity
Questions of interest:

- What factors control growth?
- Is this something we can model?

Logistic Growth
(Verhulst, Lotka/Volterra, Kolmogorov)

$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$

where
K = the carrying capacity of the environment,
r = the intrinsic rate of growth
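As a rough illustration of how this equation behaves, the sketch below steps it forward in time with a simple forward Euler update. The function name `logistic_growth` and the values for the starting population, `r`, `K`, and the step size are hypothetical, chosen only to show the S-shaped approach to the carrying capacity.

```python
def logistic_growth(n0, r, k, dt, steps):
    """Integrate dN/dt = r*N*(K - N)/K with forward Euler steps."""
    n = n0
    history = [n]
    for _ in range(steps):
        # Growth slows as N approaches the carrying capacity K.
        n += r * n * (k - n) / k * dt
        history.append(n)
    return history


# Hypothetical values, for illustration only.
trajectory = logistic_growth(n0=10.0, r=0.5, k=1000.0, dt=0.1, steps=400)
print(f"start = {trajectory[0]:.1f}, end = {trajectory[-1]:.1f}")
```

With these assumed values the population climbs quickly at first and then levels off just below K, which is the behavior the model is meant to capture.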

## Diversity and Complexity

As the ecosystem grows:

- What does the emerging community look like?
- What information does this community contain?
- How diverse and complex is it?
- Is this something we can model?

Complexity
Claude Shannon
- From Petoskey, Michigan (yay)
- Father of Information Theory
- Office down the hall from Einstein
- Looked at the information in a system
- Developed a model of Information Diversity that works in scores of fields

Complexity
Shannon's Diversity Index
Let's assume we have a community with six types:

| Type | Individuals |
|------|-------------|
| A | 6 |
| B | 13 |
| C | 19 |
| D | 129 |
| E | 372 |
| F | 1187 |

How diverse is this system? How much information does it contain?

Complexity

Source: Shannon & Weaver, 1948

Complexity

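| n(i) | p(i) | log(p(i)) | p(i)[log(p(i))] |
|------|-------|-----------|-----------------|
| 6 | 0.003 | -2.46 | -0.009 |
| 13 | 0.008 | -2.12 | -0.015 |
| 19 | 0.011 | -1.96 | -0.02 |
| 129 | 0.075 | -1.13 | -0.08 |
| 372 | 0.216 | -0.67 | -0.14 |
| 1187 | 0.688 | -0.16 | -0.11 |

(N = 1726)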
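The worked values above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the index is the negated sum of p(i) log(p(i)) and that the logarithm is base 10, which is what the tabulated log values appear to use (for example, log10(0.688) is about -0.16). The function name `shannon_index` is an illustrative choice, not something taken from the slides.

```python
import math


def shannon_index(counts, base=10):
    """Return H = -sum(p_i * log(p_i)) over the non-empty populations."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            h -= p * math.log(p, base)
    return h


# Population sizes for types A through F, as given on the slide.
counts = [6, 13, 19, 129, 372, 1187]
h = shannon_index(counts)
# Largest possible value for six types: all populations equal in size.
h_max = math.log(len(counts), 10)

print(f"N = {sum(counts)}")
print(f"H = {h:.3f}")
print(f"H_max = {h_max:.3f}")
```

Computed from unrounded proportions, H comes out at about 0.39, roughly half of the maximum possible for six types, because one type holds almost 70% of the individuals.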

## Complexity of the DATA system

We have populations:

- Data Sources - Agencies, etc.
- Data Formats - Hundreds
- Data Fields - Thousands
- Database Vendors - Many
- Database Versions - Dozens
- Data Models - Scores
- Data Volume - Petabytes
- Permissions - Privacy Limits, Data that need Special Handling
- Confidentiality - Varied Encryption Requirements

## Complexity of the DATA system

Each of these has some limit, K, as size grows:

- STAFF!! - Security!! DBAs, Developers, other special skills
- Hard Problem - conversion, mapping (ETL)
- Server Capacity, Elasticity
- Licensing restrictions
- Raw storage
- Limited Resources - FLASH Storage, High-Speed Computing
- Planned and managed storage, backup, recovery
- Specific security demands and practices
- An ever-growing number of one-offs, exceptions

$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$

## Complexity of the DATA system

[Figure: N at time (t), given limit K, with the r phase and the K phase labeled. Source: memrise.com]

Complexity models
Monoculture (the trivial case): 100% A

Complexity models
Haphazard, Unmanaged (real world)
- CANNOT predict the overall state from one time (t) to the next
- Expensive! Highest TCO we can have

Complexity models

Step-wise Consolidation
- This is what we're after!
- Reducing Complexity Reduces Costs!
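One way to picture step-wise consolidation is to take the six-type community from the Shannon example and fold the smallest population into the largest, one step at a time, recomputing the diversity index after each step. This is only a sketch of the idea: the merge rule in `consolidate_smallest` and the use of a base-10 Shannon index as the complexity measure are assumptions made for illustration, not a method stated in the slides.

```python
import math


def shannon_index(counts):
    """Base-10 Shannon index: H = -sum(p_i * log10(p_i))."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            h -= p * math.log10(p)
    return h


def consolidate_smallest(counts):
    """One consolidation step: fold the smallest population into the largest."""
    ordered = sorted(counts)
    smallest, rest = ordered[0], ordered[1:]
    rest[-1] += smallest
    return rest


# Population sizes for types A through F from the earlier example.
community = [6, 13, 19, 129, 372, 1187]
indices = [shannon_index(community)]
while len(community) > 1:
    community = consolidate_smallest(community)
    indices.append(shannon_index(community))

for step, h in enumerate(indices):
    print(f"after {step} consolidation steps: H = {h:.3f}")
```

Each merge lowers the index, and the final state is the monoculture case, where the index is zero.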

A mere 8 objects will require 28 separate interfaces to fully realize ubiquitous communications
- 100? 4,950 interfaces!
Source: Journal of Integer Sequences, Vol. 1 (1998), Article 98.1.5

## The N-Squared Complexity Problem

C(N) = N(N - 1) / 2

N = 240: C(N) = 28,680
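The interface counts quoted here (28 for 8 objects, 4,950 for 100, and 28,680 for 240) are consistent with counting one interface per pair of objects, which is N(N - 1) / 2. The short sketch below checks that reading; the function name `interfaces` is a hypothetical label.

```python
def interfaces(n):
    """Number of distinct pairs among n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (8, 100, 240):
    print(f"{n} objects -> {interfaces(n):,} interfaces")
```

Because the count grows with the square of N, doubling the number of systems roughly quadruples the number of point-to-point interfaces.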

The FOUR Vs

- Velocity - It's coming fast
- Volume - There's a lot
- Variety - It's not all the same type, format, or size
- Veracity - How much can we trust the data?

## Use Case Factory:

- Variability - Audit Logs, Record Formats Vary
  - Who has looked at Record #38,491?
  - When? Why?
  - Did that violate any laws or policies? Which?
  - Repeat 1,000,000,000 times

- Volume - Health & Business Information Messaging
  - 10k per message (on average)
  - 20 million messages per month
  - Must retain for 7 years
  - 10k * 20mil * 84mos * 250 systems = ~ 4 petabytes
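The retention estimate above can be checked with a few lines of arithmetic. The sketch assumes "10k" means 10,000 bytes per message and uses decimal petabytes (10^15 bytes); neither unit is spelled out on the slide, and the constant names are illustrative.

```python
# Assumption: "10k" is 10,000 bytes; the slide does not state the unit.
BYTES_PER_MESSAGE = 10_000
MESSAGES_PER_MONTH = 20_000_000
RETENTION_MONTHS = 7 * 12  # 7 years of retention = 84 months
SYSTEMS = 250

total_bytes = BYTES_PER_MESSAGE * MESSAGES_PER_MONTH * RETENTION_MONTHS * SYSTEMS
# Assumption: decimal petabytes, 1 PB = 10**15 bytes.
petabytes = total_bytes / 10**15

print(f"{total_bytes:,} bytes = {petabytes:.1f} PB")
```

Under these assumptions the total comes to 4.2 PB, in line with the "~ 4 petabytes" figure.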

## Use Case Factory:

- Velocity - Vibration and Alerting Sensors
  - Up to hundreds per structure
  - Up to 15,000 structures
  - Sending 5k messages in real time
  - 400 * 15,000 * 5k = ~ 30 Gb stream
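The same kind of check works for the sensor stream. The sketch assumes 400 sensors per structure (the figure used in the multiplication above), that "5k" means 5,000 bytes per message, and that the result is the volume produced when every sensor sends one message. The slide does not say how often messages are sent, so no rate per second is derived here.

```python
# Assumptions: 400 sensors per structure and "5k" = 5,000 bytes per message.
SENSORS_PER_STRUCTURE = 400
STRUCTURES = 15_000
BYTES_PER_MESSAGE = 5_000

total_sensors = SENSORS_PER_STRUCTURE * STRUCTURES
# Volume generated by one message from every sensor.
bytes_per_round = total_sensors * BYTES_PER_MESSAGE
gigabytes_per_round = bytes_per_round / 10**9

print(f"{total_sensors:,} sensors -> {gigabytes_per_round:.0f} GB per round of messages")
```

Read this way, the "~ 30 Gb" on the slide corresponds to about 30 gigabytes per round of messages across six million sensors.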

## Use Case Factory:

- Combines 3 Vs - State Police video
  - 3 8-hour shifts
  - 2 HD cameras per unit (body, car)
  - 10 Gb/hour per camera*
  - ~ 155 Tb / day = Backhaul Challenge
  - ~ 55 Pb / year = Storage Challenge

* Depending on streaming bitrate and resolution, this could be 15-25 GB/hour

