AN INTRODUCTION
BAGUS JATI SANTOSO
MARCH 7TH, 2018
WHAT IS BIG DATA?
• Big data refers to datasets too large to reasonably process or store with traditional
tooling or on a single computer.
DATA NOWADAYS
• Organizations nowadays are capturing additional data from their operational environments at an
increasingly fast speed. Some examples are:
• Web data: customer-level web behaviour data (page views, searches, reading, reviews, purchasing)
• Text data (email, news, Facebook feeds, documents, etc) is one of the biggest and most widely applicable types
of big data.
• Time and location data. GPS and network connectivity make time and location information a growing source of data. As
more individuals share their time and location data publicly, lots of interesting applications start to
emerge.
• Smart grid and sensor data. Sensor data is now collected from cars, oil pipes, and wind turbines, and it
is collected at extremely high frequency.
• Social network data. Within social network sites like Facebook, LinkedIn, Instagram, it is possible to do link
analysis to uncover the network of a given user.
CHARACTERISTICS OF BIG DATA (3V)
• Volume
• Velocity
• Variety
VOLUME
• The sheer scale of the information processed helps define big data systems.
• These datasets can be orders of magnitude larger than traditional datasets, which
demands more thought at each stage of the processing and storage life cycle.
• There exists a challenge of pooling, allocating, and coordinating resources from groups of
computers. Cluster management and algorithms capable of breaking tasks into smaller
pieces become increasingly important.
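The idea of breaking a task into smaller pieces and combining the results can be sketched as follows. This is a minimal illustration, not a cluster implementation: the function names are hypothetical, and threads on one machine stand in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Break one large task into smaller, independent pieces."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def process_chunk(chunk):
    """Work done by one worker: here, just a partial sum."""
    return sum(chunk)


def distributed_sum(items, chunk_size=1000, workers=4):
    """Sum a large list by summing chunks in parallel and combining them."""
    chunks = split_into_chunks(items, chunk_size)
    # Assumption: threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(process_chunk, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_sums)
```

For example, distributed_sum(list(range(10001))) gives the same result as summing the whole list at once; only the way the work is organized changes.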
VELOCITY
• Another way in which big data differs is the speed at which information moves through the
system.
• Data is frequently flowing into the system from multiple sources and is often expected to
be processed in real time to gain insights and update the current understanding of the
system.
• The focus on near instant feedback has driven many big data practitioners away from a
batch-oriented approach and closer to a real-time streaming system.
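The difference between a batch-oriented and a streaming approach can be sketched with a simple average. The names batch_mean and StreamingMean are hypothetical, and real streaming systems add concerns such as windowing and fault tolerance that this sketch leaves out.

```python
def batch_mean(readings):
    """Batch style: wait for the full dataset, then compute once."""
    return sum(readings) / len(readings)


class StreamingMean:
    """Streaming style: update the result as each value arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean: shift the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean
```

The streaming version never stores the full dataset, so an up-to-date answer is available after every new value.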
VARIETY
• The variety of sources and data types being generated expands as fast as new technology can
be created.
• Big data is unique because of the wide range of both the sources being processed and their
relative quality.
• Data can be ingested from internal systems like application and server logs, from social media
feeds and other external APIs, from physical device sensors, and from other providers.
• Big data seeks to handle potentially useful data regardless of where it's coming from by
consolidating all information into a single system.
SOME ADD MORE V'S
VERACITY, VARIABILITY, VALUE
• Veracity: The variety of sources and the complexity of the processing can lead to
challenges in evaluating the quality of the data (and consequently, the quality of the
resulting analysis)
• Variability: Variation in the data leads to wide variation in quality. Additional resources
may be needed to identify, process, or filter low quality data to make it more useful.
• Value: The ultimate challenge of big data is delivering value. Sometimes, the systems and
processes in place are complex enough that using the data and extracting actual value can
become difficult.
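Filtering low quality data, as mentioned under Variability, might look like the sketch below. The record fields (sensor_id, temperature_c) and the plausible temperature range are assumptions made up for illustration; real quality rules depend on the data source.

```python
def is_valid(record):
    """Hypothetical quality rule: sensor id present, temperature plausible."""
    temp = record.get("temperature_c")
    return (
        bool(record.get("sensor_id"))
        and isinstance(temp, (int, float))
        and -50 <= temp <= 150
    )


def split_by_quality(records):
    """Separate usable records from ones that need review or removal."""
    clean, rejected = [], []
    for record in records:
        (clean if is_valid(record) else rejected).append(record)
    return clean, rejected
```

Keeping the rejected records, rather than silently dropping them, makes it possible to measure how much of the incoming data is unusable.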
WHO’S GENERATING BIG DATA
Mobile devices
(tracking all objects all the time)
• Progress and innovation are no longer hindered by the ability to collect data
• Instead, they are limited by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion
THE MODEL HAS CHANGED…
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
WHAT’S DRIVING BIG DATA
• Big Data enables a shift from traditional “descriptive” insight to new “predictive” and
“prescriptive” insights
• Descriptive analytics: “What happened in the past?”
• What was our revenue last year? Which is our most profitable product?
[Figure: Enabled enhanced insight; Decision making; Automate the decision (Process Automation)]
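The descriptive questions on this slide (last year's revenue, the most profitable product) can be answered with simple aggregation. The sketch below uses a small, made-up sales table; the tuple layout and all figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: (year, product, revenue, cost)
sales = [
    (2016, "A", 100.0, 70.0),
    (2017, "A", 120.0, 80.0),
    (2017, "B", 200.0, 190.0),
    (2017, "C", 90.0, 30.0),
]


def revenue_for_year(rows, year):
    """Descriptive: what was the total revenue in a given year?"""
    return sum(revenue for y, _, revenue, _ in rows if y == year)


def most_profitable_product(rows):
    """Descriptive: which product has the highest total profit?"""
    profit = defaultdict(float)
    for _, product, revenue, cost in rows:
        profit[product] += revenue - cost
    return max(profit, key=profit.get)
```

Predictive and prescriptive analytics build on the same data but ask what is likely to happen next and what should be done about it.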
APPLICATION OF BIG DATA
• Churn prediction
• Attracting new customers is much more expensive than retaining existing ones
• Sentiment analysis
• Find opinions across a large number of people to provide information on what the market is saying, thinking, and feeling about an organization.
• Operational analytics
• Airlines automatically reroute customers when a flight is delayed in order to limit travel disruption and raise customer satisfaction.
• Computer clusters are a better fit than a single machine for the high storage and
computational needs of big data.
• The benefits are:
• Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU
and memory pooling is also extremely important.
• High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees
to prevent hardware or software failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the
group. This means the system can react to changes in resource requirements without expanding the
physical resources on a machine.
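One simple way to spread data across the machines in a cluster is to hash each record's key and use the result to pick a node. The sketch below is only an illustration: the node names are hypothetical, and production systems typically use more careful schemes (such as consistent hashing) so that adding a machine moves fewer records.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node by hashing the key (simple modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def place_records(keys, nodes):
    """Assign every key to exactly one node in the pool."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement
```

Scaling horizontally then means passing a longer list of nodes; no single machine has to grow.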
BIG DATA TECHNOLOGY
The World of Big Data Tools
Programming models: DAG Model, MapReduce Model, Graph Model, BSP/Collective Model
• Hadoop, MPI
• For Iterations/Learning: HaLoop, Twister, Giraph, Hama, GraphLab, Spark, GraphX, Harp, Stratosphere, Dryad/DryadLINQ, Reef
• For Query: Pig/PigLatin, Hive, Tez, Drill, Shark, MRQL
• For Streaming: S4, Storm, Samza, Spark Streaming
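As an illustration of the MapReduce model named above, here is a minimal single-machine word count with explicit map, shuffle, and reduce phases. It is a teaching sketch with hypothetical function names and does not use the Hadoop API; in a real framework each phase runs across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))
```

For example, word_count(["big data", "Big clusters"]) counts "big" twice and each of the other words once.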