For
Academic Progress
(1140372001)
on
“Big Data: Technical Issues & Security Challenges”
Submitted by:
Kebebe Abebe
Student ID.No: B20176111W
School of Software Engineering
Beijing University of Technology
Submitted to:
Prof. Jingsha He
School of Software Engineering
Beijing University of Technology
Date: 2017/11/13
ABSTRACT
This report provides a review for the academic progress course, with particular regard to Big
Data. Big Data refers to amounts of data too large to be processed using traditional data
processing methods. Due to the widespread use of computing devices such as smartphones,
laptops and wearable devices, billions of people are connected to the internet worldwide,
generating data at a rapid rate. The volume of data flowing over the internet has grown beyond
what modern computers can handle, and this growth gave rise to the term Big Data. In addition
to its volume, Big Data exhibits other unique characteristics, such as velocity, variety, value
and veracity. This large, rapidly growing and varied body of data is becoming a key basis of
competition, underpinning new waves of productivity growth, innovation and consumer surplus.
However, such rapid growth also raises numerous challenges, including data analysis, storage,
querying, inconsistency and incompleteness, scalability, timeliness, and security. Key industry
segments are heavily represented: financial services, where data is plentiful and data
investments are substantial, and life sciences, where data usage is rapidly emerging. This
report provides a brief introduction to Big Data technology and its importance in the
contemporary world, and addresses the concepts, characteristics, architecture, management,
technologies, challenges and applications of Big Data.
Contents
1. INTRODUCTION
4. BIG DATA MANAGEMENT
8. CONCLUSION
REFERENCES
List of Figures
Figure 1. Contents of Big Data
Figure 2. Types of data being used in big data
Figure 3. Five Vs Big Data Characteristics [3]
Figure 4. Velocity of Big Data [4]
Figure 5. Variety of Big Data [4]
Figure 6. The big data architecture
Figure 7. The variety of data sources
Figure 8. Components of data ingestion layer
Figure 9. NoSQL databases
Figure 10. Big data platform architecture
Figure 11. MapReduce tasks
Figure 12. Search engine conceptual architecture
Figure 13. Visualization conceptual architecture
Figure 14. Typical ETL process framework [5]
Figure 15. The architecture of Hadoop
Figure 16. MapReduce parallel programming
Figure 17. NoSQL database typical business scenarios
List of Tables
1. INTRODUCTION
Recent advances in technology have led to the generation of vast quantities of data from
diverse domains over the past 20 years. Big Data is a broad term for datasets so large or
complex that traditional data processing applications are inadequate [1]. Beyond sheer volume,
Big Data also possesses a number of unique characteristics that distinguish it from traditional
data. The term usually refers to large amounts of data that require new technologies and
architectures to extract value through capture and analysis, and only seldom to a particular
size of dataset. New sources of Big Data include location-specific data arising from traffic
management and from the tracking of personal devices such as smartphones, laptops and wearable
computing devices. Big Data is usually unstructured and requires more time for analysis and
processing. This development calls for new system architectures for data acquisition,
transmission, storage, and large-scale data processing mechanisms.
Big Data has emerged because we live in a society that makes increasing use of data-intensive
technologies. At such scale, effective analysis using existing traditional techniques becomes
very difficult. Since Big Data is a recent technology that can bring huge benefits to business
organizations, the various challenges and issues involved in adopting it need to be understood.
The Big Data concept describes datasets that keep growing until they become difficult to manage
with existing database management concepts and tools. The difficulties can relate to data
capture, storage, search, sharing, analysis, visualization, and so on.
Because of its characteristics of volume, velocity, variety, value and veracity, Big Data puts
forward many challenges. The challenges of managing large data include analysis, capture,
curation, search, sharing, storage, transfer, visualization, and information privacy, among
others. In addition to variations in the amount of data stored in different sectors, the types
of data generated and stored, i.e. encoded video, images, audio, or text/numeric information,
also differ markedly from industry to industry. The data is so enormous and generated so fast
that it does not fit the structures of regular database architectures; new, alternative ways
must be used to process and analyze it.
In this report, the next sections address the basic concepts, characteristics, architecture,
management, technologies, challenges and applications of Big Data.
1.1 Definition of Big Data
The term “Big Data” is used in a variety of contexts with a variety of characteristics.
The following are a few definitions of Big Data.
Gartner’s definition:
“Big Data is high-volume, high-velocity and/or high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight,
decision making, and process optimization.”
Working definition:
Big Data is a collection of large datasets that cannot be processed using traditional
computing technologies and techniques in order to extract value. It is not a single technique
or a tool; rather it involves many areas of business and technology.
Transport Data: Transport data includes model, capacity, distance and availability of a
vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
1. Structured data: it refers to data that is identifiable and organized in a structured way.
The most common form of structured data is a database, where specific information is
stored in columns and rows. Structured data is machine readable and also efficiently
organized for human readers. For example, an 'Employee' table in a database is a type of
structured data.
2. Semi-structured data: it refers to data that does not conform to a formal structure based
on standardized data models. However, semi-structured data may contain tags or other
metadata that organize it. For example, personal data stored in an XML file is considered
semi-structured data.
3. Unstructured data: it refers to any data that has no identifiable structure. For example,
images, videos, email, documents and text fall into the category of unstructured data.
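To make the distinction concrete, here is a minimal Python sketch (the employee rows, XML snippet, and note text are invented for illustration) showing that structured and semi-structured data can be addressed by named fields, while unstructured data cannot:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns, like the 'Employee' table example above.
employees_csv = "id,name,dept\n1,Alice,HR\n2,Bob,IT\n"
rows = list(csv.DictReader(io.StringIO(employees_csv)))

# Semi-structured: no rigid schema, but tags (metadata) organize the data,
# as in the XML example above.
person_xml = "<person><name>Alice</name><email>a@example.com</email></person>"
person = ET.fromstring(person_xml)

# Unstructured: free text with no identifiable structure; only generic
# operations such as word counting apply.
note = "Meeting moved to Friday; bring the quarterly figures."

print(rows[0]["name"])            # addressed by column name
print(person.find("name").text)   # addressed by tag
print(len(note.split()))          # no named fields to address
```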
Production of new gadgets that collect and transmit data, for example GPS
location information from mobile phones and capacity updates from ‘smart’ waste
bins (POSTnote 423).
Enhanced computing capabilities driving big data include:
Improved data storage at higher densities, for lower cost.
Greater computing power for faster and more complex calculations.
Cloud computing (remote access to shared computing resources via a device
connected to a network), facilitating cheaper access to data storage, computation,
software and other services.
Recent advances in statistical and computational techniques, which can be used to
analyze and extract meaning from big data.
Development of new tools such as Apache Hadoop (which enables large data sets
to be processed across clusters of computers) and extension of existing software,
such as Microsoft Excel.
Improving science and research: - Science and research are currently being transformed
by the new possibilities big data brings.
Optimizing machine and device performance: - Big data analytics help machines and
devices become smarter and more autonomous.
Improving security and law enforcement: - Big data is applied heavily in improving
security and enabling law enforcement.
Improving and optimizing cities and countries: - Big data is used to improve many
aspects of our cities and countries.
Financial Trading: - Big data algorithms are used to make trading decisions.
The characteristics of Big Data are usually described by what is referred to as a multi-V
model. The three main characteristics (volume, velocity and variety) of big data are well
captured in Gartner's definition. In this report, the 5V characteristics (Volume, Velocity,
Variety, Value and Veracity) of big data are described below.
2.1 Data Volume
Data volume measures the amount of data available to an organization, which does not
necessarily have to own all of it as long as it can access it. As data volume increases,
the value of individual data records decreases in proportion to age, type, richness, and
quantity, among other factors.
Considering data velocity [2], matters are complicated further because the arrival of data
and its processing or analysis happen at different speeds, as illustrated in Figure 4.
significant challenges that can lead to analytic sprawl.
A big data management architecture should be able to consume a myriad of data sources in a
fast and inexpensive manner. Figure 6 outlines the big data architecture and its components
in the big data tech stack. We can choose either open source frameworks or packaged licensed
products to take full advantage of the functionality of the various components in the stack.
Figure 6. The big data architecture
Figure 7. The variety of data sources
Industry Data
Traditionally, different industries designed their data-management architecture around the legacy
data sources listed in Table 1. The technologies, adapters, databases, and analytics tools were
selected to serve these legacy protocols and standards.
Some of the “new age” data sources that have seen an increase in volume, velocity, or variety are
illustrated in Table 2.
The building blocks of the ingestion layer should include components for the following:
Identification: - involves detection of the various known data formats or assignment of
default formats to unstructured data.
Filtration: - involves selection of inbound information relevant to the enterprise, based
on the Enterprise MDM repository.
Validation: - involves analysis of data continuously against new MDM metadata.
Noise Reduction: - involves cleansing data by removing the noise and minimizing
disturbances.
Transformation: - involves splitting, converging, de-normalizing or summarizing data.
Compression: - involves reducing the size of the data but not losing the relevance of the
data in the process. It should not affect the analysis results after compression.
Integration: - involves integrating the final massaged data set into the Hadoop storage
layer, that is, Hadoop distributed file system (HDFS) and NoSQL databases.
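The building blocks above can be sketched as a single pass over incoming records. This is only an illustrative stand-in: the known-format list, the relevant-source set (standing in for an Enterprise MDM check), and the record fields are all hypothetical.

```python
# Minimal sketch of the ingestion building blocks described above.
KNOWN_FORMATS = {"csv", "json"}          # identification
RELEVANT_SOURCES = {"sales", "billing"}  # filtration (stand-in for an MDM check)

def ingest(records):
    out = []
    for rec in records:
        # Identification: detect a known format or assign a default one.
        fmt = rec.get("format") if rec.get("format") in KNOWN_FORMATS else "unstructured"
        # Filtration: keep only records relevant to the enterprise.
        if rec.get("source") not in RELEVANT_SOURCES:
            continue
        # Validation: drop records with an empty payload.
        if not rec.get("payload"):
            continue
        # Noise reduction: cleanse the payload of stray whitespace.
        payload = rec["payload"].strip()
        # Transformation: summarize the cleansed data.
        out.append({"source": rec["source"], "format": fmt,
                    "summary": payload.upper()[:20]})
    # The result would then be compressed and integrated into HDFS/NoSQL.
    return out

batch = [
    {"source": "sales", "format": "csv", "payload": "  q3 revenue up  "},
    {"source": "spam", "format": "csv", "payload": "ignore me"},
    {"source": "billing", "format": "xml", "payload": ""},
]
print(ingest(batch))
```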
There are multiple ingestion patterns (data source-to-ingestion layer communication) that can
be implemented based on the performance, scalability, and availability requirements.
NoSQL databases are used to store the data types prevalent in the big data world; they include
key-value pair, document, graph, columnar, and geospatial databases.
Figure 9. NoSQL databases
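As a rough illustration of two of these categories, the snippet below mimics a key-value store and a document store with plain Python dictionaries; real NoSQL systems add persistence, indexing, and distribution on top of this idea (the keys and documents shown are invented):

```python
# Key-value store: flat keys mapped to opaque values.
kv_store = {}
kv_store["session:42"] = "alice"

# Document store: keys mapped to nested, schema-free documents.
doc_store = {}
doc_store["user:1"] = {"name": "Alice", "roles": ["admin"], "logins": 3}
doc_store["user:2"] = {"name": "Bob"}   # documents need not share fields

# Query across documents despite the absence of a fixed schema.
admins = [d["name"] for d in doc_store.values() if "admin" in d.get("roles", [])]
print(kv_store["session:42"], admins)
```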
Figure 10. Big data platform architecture
The Hadoop platform management layer accesses data, runs queries, and manages the lower
layers using scripting languages like Pig and Hive.
The key building blocks of the Hadoop platform management layer are Zookeeper, Pig, Hive,
Sqoop and MapReduce.
MapReduce simplifies the creation of processes that analyze large amounts of
unstructured and structured data in parallel. Here are the key facts associated with the
scenario in Figure 11.
Figure 11. MapReduce tasks
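The map-shuffle-reduce flow can be sketched in pure Python with the classic word-count example. This runs in a single process for clarity; Hadoop's value is in distributing the same three steps across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_task(line):
    # Map: emit a (key, value) pair for every word in the input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce: aggregate the grouped values for one key.
    return key, sum(values)

lines = ["Big Data is big", "data is everywhere"]
mapped = chain.from_iterable(map_task(line) for line in lines)
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(counts)
```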
Hive is a data-warehouse system for Hadoop that provides the capability to aggregate
large volumes of data. This SQL-like interface increases the compression of stored data
for improved storage-resource utilization without affecting access speed.
Pig is a scripting language that allows us to manipulate the data in the HDFS in parallel.
Sqoop is a command-line tool that enables importing individual tables, specific columns,
or entire database files straight to the distributed file system or data warehouse.
ZooKeeper is a coordinator for keeping the various Hadoop instances and nodes in sync
and protected from the failure of any of the nodes.
Logs the communication between nodes, and uses distributed logging mechanisms to
trace any anomalies across layers
Ensures all communication between nodes is secure, for example, by using Secure
Sockets Layer (SSL), TLS, and so forth.
There is a wide choice of tools and products that we can use to build our application
architecture end to end. Products commonly selected by enterprises to begin their big data
journey are shown in Table 3.
Purpose                                       Products/tools
Ingestion Layer                               Apache Flume, Storm
Hadoop Storage                                HDFS
NoSQL Databases                               HBase, Cassandra
Rules Engines                                 MapReduce jobs
NoSQL Data Warehouse                          Hive
Platform Management Query Tools               MapReduce, Pig, Hive
Search Engine                                 Solr
Platform Management Co-ordination Tools       ZooKeeper, Oozie
Analytics Engines                             R, Pentaho
Visualization Tools                           Tableau, QlikView, Spotfire
Big Data Analytics Appliances                 EMC Greenplum, IBM Netezza, IBM PureSystems, Oracle Exalytics
Monitoring                                    Ganglia, Nagios
Data Analyst IDE                              Talend, Pentaho
Hadoop Administration                         Cloudera, DataStax, Hortonworks, IBM BigInsights
Public Cloud-Based Virtual Infrastructure     Amazon AWS & S3, Rackspace
Increasing quantities of data are being collected and analyzed, producing new insights into how
people think and act, and how systems behave. This often requires innovative processing and
analysis known as ‘big data analytics’. Making use of any kind of data requires data collection,
processing, analysis and interpretation of results.
4.3 Data Analysis
Analytics are used to gain insight from data. They typically involve applying an algorithm (a
sequence of calculations) to data to find patterns, which can then be used to make predictions or
forecasts. Big data analytics encompass various inter-related techniques, including the following
examples.
Data mining - identifies patterns by sifting through data. It can be applied to user click
streams to understand how customers use web pages to inform web page design.
Machine learning - describes systems that learn from data. For example, a system that
compares documents in two different languages can infer translation rules; human
correction of any errors in the rules can result in the system learning how to improve the
software.
Simulation - can be used to model the behaviour of complex systems. For example,
building a trading simulation can help to assess the effectiveness of measures to reduce
insider trading.
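As a toy version of the click-stream example under data mining, the following snippet counts page-to-page transitions across sessions; the most frequent transitions are the kind of pattern that could inform web page design (the page names and session data are invented):

```python
from collections import Counter

# Each session is the ordered list of pages one visitor clicked through.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "products", "cart"],
    ["home", "search", "products", "cart"],
]

# Count every consecutive (from_page, to_page) transition.
transitions = Counter(
    (a, b)
    for session in sessions
    for a, b in zip(session, session[1:])
)
print(transitions.most_common(2))
```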
5. BIG DATA TECHNOLOGIES
Big data technologies are important in providing more accurate analysis, which may lead to more
concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
The technologies that handle big data can be examined as the following two complementary
classes, which are frequently deployed together.
1. Operational big data technology: - systems that provide operational capabilities for
real-time, interactive workloads where data is primarily captured and stored
(e.g. MongoDB and other NoSQL databases).
2. Analytical big data technology: - systems that provide analytical capabilities for
retrospective and complex analysis that may touch most or all of the data (e.g. MPP
databases, MapReduce).
Even though there are many technologies available for data management, one of the most widely
used technologies is Hadoop.
5.1 Hadoop
Hadoop is an Apache open source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework works in an environment that provides distributed storage and computation across
clusters of computers. Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
A. MapReduce
Figure 16. MapReduce parallel programming
HDFS is based on the Google File System (GFS) and provides a distributed file system that is
designed to run on commodity hardware. It has many similarities with existing distributed file
systems. It provides high throughput access to application data and is suitable for applications
having large datasets. It is not accessible as a logical data structure for easy data manipulation.
Alongside HDFS, the big data world also uses key-value pair, document, graph, columnar, and
geospatial databases, collectively referred to as NoSQL databases.
Figure 17. NoSQL database typical business scenarios
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules:
Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN: This is a framework for job scheduling and cluster resource
management.
high-end server. This is the first motivating factor behind using Hadoop: it runs across
clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128M or 64M (preferably 128M).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
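The first steps of that list, splitting files into fixed-size blocks and replicating each block across nodes, can be sketched as follows. The tiny block size, node names, and round-robin placement are simplifications for illustration; HDFS uses 128M blocks and rack-aware placement policies:

```python
BLOCK_SIZE = 8          # stand-in for HDFS's 128M block size
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]

def split_blocks(data, size=BLOCK_SIZE):
    # Divide the file's bytes into uniform-sized blocks.
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes=NODES, replicas=REPLICATION):
    # Assign each block to several nodes to handle hardware failure.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

data = b"hello big data world!"
blocks = split_blocks(data)
print(len(blocks), place_replicas(blocks)[0])
```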
6. BIG DATA CHALLENGES
Big data challenges are the real implementation hurdles that require immediate attention; any
implementation that does not handle these challenges may lead to failure of the technology and
unfavorable results [6]. Big data challenges can be classified into privacy & security, data
access & sharing of information, storage and processing issues, analytical challenges, human
resources & manpower, technical challenges, and future challenges.
huge amounts of unstructured, semi-structured and structured data, which requires considerable
technical skill.
Visualization- A main task of Big Data analysis is visualizing the results. Because the
data is so big, it is very difficult to produce user-friendly visualizations.
Hidden Big Data- Large quantities of useful data are being lost, since new data is
largely untagged, file-based and unstructured.
7. BIG DATA APPLICATIONS
Big Data is applied in many areas. Here are some examples of Big Data applications:
Smart Grid case: it is crucial to manage national electric power consumption in real
time and to monitor smart grid operations.
E-health: connected health platforms are already used to personalize health services
(e.g., the CISCO solution). Big Data is generated from heterogeneous sources (e.g.,
laboratory and clinical data, patients' symptoms uploaded from remote sensors, hospital
operations, and pharmaceutical data).
Internet of Things (IoT): IoT represents one of the main markets for big data
applications. Because of the high variety of objects, IoT applications are
continuously evolving. Nowadays, there are various Big Data applications supporting
logistics enterprises.
Public utilities: Utilities such as water supply organizations are placing sensors in
pipelines to monitor the flow of water in complex water supply networks.
Transportation and logistics: Many public road transport companies are using RFID
(Radiofrequency Identification) and GPS to track buses and explore interesting data to
improve their services.
Political services and government monitoring: Many governments, such as those of India
and the United States, are mining data to monitor political trends and analyze
population sentiment.
Big Data Analytics Applications (BDAs) are a new type of software application that
analyzes big data using massively parallel processing frameworks (e.g., Hadoop).
Data Mining: Decision trees automatically help users understand what combination of
data attributes results in a desired outcome. The structure of the decision tree
reflects structure that may be hidden in your data.
Banking: The use of customer data invariably raises privacy issues. By uncovering
hidden connections between seemingly unrelated pieces of data, big data analytics could
potentially reveal sensitive personal information.
Marketing: Marketers have begun to use facial recognition software to learn how well
their advertising succeeds or fails at stimulating interest in their products.
Telecom: Big data is now used in many different fields; in telecom, too, it plays an
important role.
8. CONCLUSION
This report covers some of the important concepts that organizations need to analyze when
estimating the significance of implementing Big Data technology, as well as some direct
challenges to the technology's infrastructure. The availability of Big Data, low-cost
commodity hardware, and new information management and analytic software has produced a
unique moment in the history of data analysis. The convergence of these trends means that we
have the capabilities required to analyze astonishing data sets quickly and cost-effectively for the
first time in history. These capabilities are neither theoretical nor trivial. They represent a
genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency,
productivity, revenue, and profitability. The age of Big Data is here, and these are truly
revolutionary times if both business and technology professionals continue to work together and
deliver on the promise.
REFERENCES:
[1] Wei Fan and Albert Bifet, "Mining big data: current status, and forecast to the future,"
ACM SIGKDD Explorations Newsletter, Volume 14, Issue 2, December 2012.
[2] "Social media data & real time analytics," HoC Sci. & Tech. Com., bit.ly/1eMJcEK.
[3] "Big Data and Five V's Characteristics," Ministry of Education, Islamic University
College.
[4] Rajkumar Buyya et al., "Big Data computing and clouds: Trends and future directions."
[5] Dr. M. Padmavalli, "Big Data: Emerging Challenges of Big Data and Techniques for
Handling," Nov.-Dec. 2016.
[6] Armour, F., Kaisler, S., Espinosa, J. A. and Money, W., 2013. Illustrated the issues and
challenges in big data.
[7] Lee, K. H., Choi, T. W., Ganguly, A., Wolinsky, D. I., Boykin, P. O. and Figueired, R.,
2011. Presents the parallel data processing with MapReduce.
[8] Marz, N. and Warren, J., 2013. Big Data: Principles and best practices of scalable
realtime data systems. Manning Publications.
[9] Feldman, D., Schmidt, M. and Sohler, C., 2013. Turning big data into tiny data:
Constant-size coresets for k-means, PCA and projective clustering. In SODA.
[10] Fan, W. and Bifet, A. Describe the big data mining current status and forecast to the
future.
[11] Nitin Sawant and Himanshu Shah, Big Data Application Architecture Q & A: A
Problem-Solution Approach, Apress.
[12] Mark A. Beyer and Douglas Laney, "The Importance of 'Big Data': A Definition," Gartner,