For
Academic Progress
(1140372001)
on
“Big Data: Technical Issues & Security Challenges”
Submitted by:
Kebebe Abebe
Student ID.No: B20176111W
School of Software Engineering
Beijing University of Technology
Submitted to:
Prof. Jingsha He
School of Software Engineering
Beijing University of Technology
Date: 2017/11/13
ABSTRACT
This report provides a review for the academic progress course, with particular regard to Big
Data. Big Data refers to amounts of data too large to be processed using traditional data
processing methods. Due to the widespread use of computing devices such as smartphones,
laptops and wearable devices, billions of people are connected to the internet worldwide,
generating data at a rapid rate. The volume of data flowing over the internet has grown beyond
what modern computers can handle, and this growth gave rise to the term Big Data. In addition
to its volume, Big Data exhibits other unique characteristics, such as velocity, variety, value
and veracity. This large, rapidly growing and varied body of data is becoming a key basis of
competition, underpinning new waves of productivity growth, innovation and consumer surplus.
However, such rapid growth also raises numerous challenges, including data analysis, storage,
querying, inconsistency and incompleteness, scalability, timeliness, and security. Key industry
segments are heavily represented: financial services, where data is plentiful and data
investments are substantial, and life sciences, where data usage is rapidly emerging. This
report provides a brief introduction to Big Data technology and its importance in the
contemporary world, and addresses the concepts, characteristics, architecture, management,
technologies, challenges and applications of Big Data.
Contents
1. INTRODUCTION
4. BIG DATA MANAGEMENT
8. CONCLUSION
REFERENCES
List of Figures
Figure 1. Contents of Big Data
Figure 2. Types of data being used in big data
Figure 3. Five Vs Big Data Characteristics [3]
Figure 4. Velocity of Big Data [4]
Figure 5. Variety of Big Data [4]
Figure 6. The big data architecture
Figure 7. The variety of data sources
Figure 8. Components of data ingestion layer
Figure 9. NoSQL databases
Figure 10. Big data platform architecture
Figure 11. MapReduce tasks
Figure 12. Search engine conceptual architecture
Figure 13. Visualization conceptual architecture
Figure 14. Typical ETL process framework [5]
Figure 15. The architecture of Hadoop
Figure 16. MapReduce parallel programming
Figure 17. NoSQL database typical business scenarios
List of Tables
1. INTRODUCTION
Recent advances in technology have led to the generation of vast quantities of data from
diverse domains over the past 20 years. Big Data is a broad term for datasets so large or
complex that traditional data processing applications are inadequate [1]. Beyond sheer volume,
Big Data also possesses a number of unique characteristics that distinguish it from traditional
data. The term usually refers to large amounts of data that require new technologies and
architectures to extract value through capture and analysis, and only seldom to a particular
size of dataset. New sources of Big Data include location-specific data arising from traffic
management and from the tracking of personal devices such as smartphones, laptops and wearable
computing devices. Big Data is usually unstructured and requires more time for analysis and
processing. This development calls for new system architectures for data acquisition,
transmission, storage, and large-scale data processing mechanisms.
Big Data has emerged because we live in a society that makes increasing use of data-intensive
technologies. At such scale, effective analysis using existing traditional techniques becomes
very difficult. Since Big Data is a recent technology that can bring huge benefits to business
organizations, the various challenges and issues involved in adopting it need to be understood.
The Big Data concept describes datasets that keep growing until they become difficult to manage
with existing database management concepts and tools. The difficulties can relate to data
capture, storage, search, sharing, analysis, visualization, and so on.
Because of its characteristics of volume, velocity, variety, value and veracity, Big Data puts
forward many challenges. The challenges of managing large data include analysis, capture,
curation, search, sharing, storage, transfer, visualization, and information privacy, among
others. In addition to variations in the amount of data stored in different sectors, the types
of data generated and stored, i.e. encoded video, images, audio, or text/numeric information,
also differ markedly from industry to industry. The data is so enormous and generated so fast
that it does not fit the structures of regular database architectures; new, alternative ways
must be used to process and analyze it.
In this report, the next sections address the basic concepts, characteristics, architecture,
management, technologies, challenges and applications of Big Data.
1.1 Definition of Big Data
The term “Big Data” is used in a variety of contexts with a variety of characteristics.
The following are a few definitions of Big Data.
Gartner’s definition:
“Big Data is high-volume, high-velocity and/or high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight,
decision making, and process optimization.”
Working definition:
Big Data is a collection of large datasets that cannot be processed using traditional
computing technologies and techniques in order to extract value. It is not a single technique
or a tool; rather it involves many areas of business and technology.
Transport Data: Transport data includes model, capacity, distance and availability of a
vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
1. Structured data: it refers to data that is identifiable and organized in a structured way.
The most common form of structured data is a database, where specific information is
stored in columns and rows. Structured data is machine readable and also efficiently
organized for human readers. For example, an 'Employee' table in a database is a type of
structured data.
2. Semi-structured data: it refers to data that does not conform to a formal structure based
on standardized data models. However, semi-structured data may contain tags or other
metadata that organize it. For example, personal data stored in an XML file is considered
semi-structured data.
3. Unstructured data: it refers to any data that has no identifiable structure. For example,
images, videos, email, documents and text fall into the category of unstructured data.
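To make the distinction concrete, here is a minimal Python sketch (the employee rows, XML snippet, and note text are invented for illustration) showing that structured and semi-structured data can be addressed by named fields, while unstructured data cannot:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns, like the 'Employee' table example above.
employees_csv = "id,name,dept\n1,Alice,HR\n2,Bob,IT\n"
rows = list(csv.DictReader(io.StringIO(employees_csv)))

# Semi-structured: no rigid schema, but tags (metadata) organize the data,
# as in the XML example above.
person_xml = "<person><name>Alice</name><email>a@example.com</email></person>"
person = ET.fromstring(person_xml)

# Unstructured: free text with no identifiable structure; only generic
# operations such as word counting apply.
note = "Meeting moved to Friday; bring the quarterly figures."

print(rows[0]["name"])            # addressed by column name
print(person.find("name").text)   # addressed by tag
print(len(note.split()))          # no named fields to address
```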
Production of new gadgets that collect and transmit data, for example GPS
location information from mobile phones and capacity updates from ‘smart’ waste
bins (POSTnote 423).
Enhanced computing capabilities driving big data include:
Improved data storage at higher densities, for lower cost.
Greater computing power for faster and more complex calculations.
Cloud computing (remote access to shared computing resources via a device
connected to a network), facilitating cheaper access to data storage, computation,
software and other services.
Recent advances in statistical and computational techniques, which can be used to
analyze and extract meaning from big data.
Development of new tools such as Apache Hadoop (which enables large data sets
to be processed across clusters of computers) and extension of existing software,
such as Microsoft Excel.
Improving science and research: - Science and research are currently being transformed
by the new possibilities big data brings.
Optimizing machine and device performance: - Big data analytics help machines and
devices become smarter and more autonomous.
Improving security and law enforcement: - Big data is applied heavily in improving
security and enabling law enforcement.
Improving and optimizing cities and countries: - Big data is used to improve many
aspects of our cities and countries.
Financial Trading: - Big data algorithms are used to make trading decisions.
The characteristics of Big Data are usually described by what is referred to as a multi-V
model. The three main characteristics (volume, velocity and variety) of big data are well
captured in Gartner's definition. In this report, the 5V characteristics (Volume, Velocity,
Variety, Value and Veracity) of big data are described below.
2.1 Data Volume
Data volume measures the amount of data available to an organization, which does not
necessarily have to own all of it as long as it can access it. As data volume increases,
the value of individual data records decreases in proportion to age, type, richness, and
quantity, among other factors.
Considering data velocity [2], matters are complicated further because the arrival of data
and its processing or analysis happen at different speeds, as illustrated in Figure 4.
significant challenges that can lead to analytic sprawl.
A big data management architecture should be able to consume a myriad of data sources in a
fast and inexpensive manner. Figure 6 outlines the big data architecture and its components
in the big data tech stack. We can choose either open source frameworks or packaged licensed
products to take full advantage of the functionality of the various components in the stack.
Figure 6. The big data architecture
Figure 7. The variety of data sources
Industry Data
Traditionally, different industries designed their data-management architecture around the legacy
data sources listed in Table 1. The technologies, adapters, databases, and analytics tools were
selected to serve these legacy protocols and standards.
Some of the “new age” data sources that have seen an increase in volume, velocity, or variety are
illustrated in Table 2.
The building blocks of the ingestion layer should include components for the following:
Identification: - involves detection of the various known data formats or assignment of
default formats to unstructured data.
Filtration: - involves selection of inbound information relevant to the enterprise, based
on the Enterprise MDM repository.
Validation: - involves analysis of data continuously against new MDM metadata.
Noise Reduction: - involves cleansing data by removing the noise and minimizing
disturbances.
Transformation: - involves splitting, converging, de-normalizing or summarizing data.
Compression: - involves reducing the size of the data but not losing the relevance of the
data in the process. It should not affect the analysis results after compression.
Integration: - involves integrating the final massaged data set into the Hadoop storage
layer, that is, Hadoop distributed file system (HDFS) and NoSQL databases.
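The building blocks above can be sketched as a single pass over incoming records. This is only an illustrative stand-in: the known-format list, the relevant-source set (standing in for an Enterprise MDM check), and the record fields are all hypothetical.

```python
# Minimal sketch of the ingestion building blocks described above.
KNOWN_FORMATS = {"csv", "json"}          # identification
RELEVANT_SOURCES = {"sales", "billing"}  # filtration (stand-in for an MDM check)

def ingest(records):
    out = []
    for rec in records:
        # Identification: detect a known format or assign a default one.
        fmt = rec.get("format") if rec.get("format") in KNOWN_FORMATS else "unstructured"
        # Filtration: keep only records relevant to the enterprise.
        if rec.get("source") not in RELEVANT_SOURCES:
            continue
        # Validation: drop records with an empty payload.
        if not rec.get("payload"):
            continue
        # Noise reduction: cleanse the payload of stray whitespace.
        payload = rec["payload"].strip()
        # Transformation: summarize the cleansed data.
        out.append({"source": rec["source"], "format": fmt,
                    "summary": payload.upper()[:20]})
    # The result would then be compressed and integrated into HDFS/NoSQL.
    return out

batch = [
    {"source": "sales", "format": "csv", "payload": "  q3 revenue up  "},
    {"source": "spam", "format": "csv", "payload": "ignore me"},
    {"source": "billing", "format": "xml", "payload": ""},
]
print(ingest(batch))
```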
There are multiple ingestion patterns (data source-to-ingestion layer communication) that can
be implemented based on the performance, scalability, and availability requirements.
NoSQL databases are used to store the data types prevalent in the big data world; they include
key-value pair, document, graph, columnar, and geospatial databases.
Figure 9. NoSQL databases
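As a rough illustration of two of these categories, the snippet below mimics a key-value store and a document store with plain Python dictionaries; real NoSQL systems add persistence, indexing, and distribution on top of this idea (the keys and documents shown are invented):

```python
# Key-value store: flat keys mapped to opaque values.
kv_store = {}
kv_store["session:42"] = "alice"

# Document store: keys mapped to nested, schema-free documents.
doc_store = {}
doc_store["user:1"] = {"name": "Alice", "roles": ["admin"], "logins": 3}
doc_store["user:2"] = {"name": "Bob"}   # documents need not share fields

# Query across documents despite the absence of a fixed schema.
admins = [d["name"] for d in doc_store.values() if "admin" in d.get("roles", [])]
print(kv_store["session:42"], admins)
```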
Figure 10. Big data platform architecture
The Hadoop platform management layer accesses data, runs queries, and manages the lower
layers using scripting languages like Pig and Hive.
The key building blocks of the Hadoop platform management layer are Zookeeper, Pig, Hive,
Sqoop and MapReduce.
MapReduce simplifies the creation of processes that analyze large amounts of
unstructured and structured data in parallel. Here are the key facts associated with the
scenario in Figure 11.
Figure 11. MapReduce tasks
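The map-shuffle-reduce flow can be sketched in pure Python with the classic word-count example. This runs in a single process for clarity; Hadoop's value is in distributing the same three steps across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_task(line):
    # Map: emit a (key, value) pair for every word in the input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce: aggregate the grouped values for one key.
    return key, sum(values)

lines = ["Big Data is big", "data is everywhere"]
mapped = chain.from_iterable(map_task(line) for line in lines)
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(counts)
```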
Hive is a data-warehouse system for Hadoop that provides the capability to aggregate
large volumes of data. This SQL-like interface increases the compression of stored data
for improved storage-resource utilization without affecting access speed.
Pig is a scripting language that allows us to manipulate the data in the HDFS in parallel.
Sqoop is a command-line tool that enables importing individual tables, specific columns,
or entire database files straight to the distributed file system or data warehouse.
ZooKeeper is a coordinator for keeping the various Hadoop instances and nodes in sync
and protected from the failure of any of the nodes.
Logs the communication between nodes, and uses distributed logging mechanisms to
trace any anomalies across layers
Ensures all communication between nodes is secure, for example, by using Secure
Sockets Layer (SSL), TLS, and so forth.
There is a wide choice of tools and products that we can use to build our application
architecture end to end. Products commonly selected by enterprises to begin their big data
journey are shown in Table 3.
Purpose                                       Products/tools
Ingestion Layer                               Apache Flume, Storm
Hadoop Storage                                HDFS
NoSQL Databases                               HBase, Cassandra
Rules Engines                                 MapReduce jobs
NoSQL Data Warehouse                          Hive
Platform Management Query Tools               MapReduce, Pig, Hive
Search Engine                                 Solr
Platform Management Co-ordination Tools       ZooKeeper, Oozie
Analytics Engines                             R, Pentaho
Visualization Tools                           Tableau, QlikView, Spotfire
Big Data Analytics Appliances                 EMC Greenplum, IBM Netezza, IBM PureSystems, Oracle Exalytics
Monitoring                                    Ganglia, Nagios
Data Analyst IDE                              Talend, Pentaho
Hadoop Administration                         Cloudera, DataStax, Hortonworks, IBM BigInsights
Public Cloud-Based Virtual Infrastructure     Amazon AWS & S3, Rackspace
Increasing quantities of data are being collected and analyzed, producing new insights into how
people think and act, and how systems behave. This often requires innovative processing and
analysis known as ‘big data analytics’. Making use of any kind of data requires data collection,
processing, analysis and interpretation of results.
4.3 Data Analysis
Analytics are used to gain insight from data. They typically involve applying an algorithm (a
sequence of calculations) to data to find patterns, which can then be used to make predictions or
forecasts. Big data analytics encompass various inter-related techniques, including the following
examples.
Data mining - identifies patterns by sifting through data. It can be applied to user click
streams to understand how customers use web pages to inform web page design.
Machine learning - describes systems that learn from data. For example, a system that
compares documents in two different languages can infer translation rules; human
correction of any errors in the rules can result in the system learning how to improve the
software.
Simulation - can be used to model the behaviour of complex systems. For example,
building a trading simulation can help to assess the effectiveness of measures to reduce
insider trading.
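As a toy version of the click-stream example under data mining, the following snippet counts page-to-page transitions across sessions; the most frequent transitions are the kind of pattern that could inform web page design (the page names and session data are invented):

```python
from collections import Counter

# Each session is the ordered list of pages one visitor clicked through.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "products", "cart"],
    ["home", "search", "products", "cart"],
]

# Count every consecutive (from_page, to_page) transition.
transitions = Counter(
    (a, b)
    for session in sessions
    for a, b in zip(session, session[1:])
)
print(transitions.most_common(2))
```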
5. BIG DATA TECHNOLOGIES
Big data technologies are important in providing more accurate analysis, which may lead to more
concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
The technologies that handle big data can be examined as the following two complementary
classes, which are frequently deployed together.
1. Operational big data technology: - systems that provide operational capabilities for
real-time, interactive workloads where data is primarily captured and stored
(e.g. MongoDB and other NoSQL databases).
2. Analytical big data technology: - systems that provide analytical capabilities for
retrospective and complex analysis that may touch most or all of the data (e.g. MPP
databases, MapReduce).
Even though there are many technologies available for data management, one of the most widely
used technologies is Hadoop.
5.1 Hadoop
Hadoop is an Apache open source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework works in an environment that provides distributed storage and computation across
clusters of computers. Hadoop is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.
A. MapReduce
Figure 16. MapReduce parallel programming
HDFS is based on the Google File System (GFS) and provides a distributed file system that is
designed to run on commodity hardware. It has many similarities with existing distributed file
systems. It provides high throughput access to application data and is suitable for applications
having large datasets. It is not accessible as a logical data structure for easy data manipulation.
Alongside HDFS, the big data world also uses key-value pair, document, graph, columnar, and
geospatial databases, collectively referred to as NoSQL databases.
Figure 17. NoSQL database typical business scenarios
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules:
Hadoop Common: These are Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN: This is a framework for job scheduling and cluster resource
management.
high-end server. This is the first motivating factor behind using Hadoop: it runs across
clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniform-sized
blocks of 128M or 64M (preferably 128M).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
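The first steps of that list, splitting files into fixed-size blocks and replicating each block across nodes, can be sketched as follows. The tiny block size, node names, and round-robin placement are simplifications for illustration; HDFS uses 128M blocks and rack-aware placement policies:

```python
BLOCK_SIZE = 8          # stand-in for HDFS's 128M block size
REPLICATION = 3         # HDFS's default replication factor
NODES = ["node1", "node2", "node3", "node4"]

def split_blocks(data, size=BLOCK_SIZE):
    # Divide the file's bytes into uniform-sized blocks.
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_replicas(blocks, nodes=NODES, replicas=REPLICATION):
    # Assign each block to several nodes to handle hardware failure.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

data = b"hello big data world!"
blocks = split_blocks(data)
print(len(blocks), place_replicas(blocks)[0])
```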
6. BIG DATA CHALLENGES
Big data challenges are the real implementation hurdles that require immediate attention; any
implementation that does not handle these challenges may lead to failure of the technology and
unfavorable results [6]. Big data challenges can be classified into privacy & security, data
access & sharing of information, storage and processing issues, analytical challenges, human
resources & manpower, technical challenges, and future challenges.
huge amounts of unstructured, semi-structured and structured data, which requires considerable
technical skill.
Visualization- A main task of Big Data analysis is visualizing the results. Because the
data is so big, it is very difficult to produce user-friendly visualizations.
Hidden Big Data- Large quantities of useful data are being lost, since new data is
largely untagged, file-based and unstructured.
7. BIG DATA APPLICATIONS
Big Data is applied in many areas. Here are some examples of Big Data applications:
Smart Grid case: it is crucial to manage national electric power consumption in real
time and to monitor smart grid operations.
E-health: connected health platforms are already used to personalize health services
(e.g., the CISCO solution). Big Data is generated from heterogeneous sources (e.g.,
laboratory and clinical data, patients' symptoms uploaded from remote sensors, hospital
operations, and pharmaceutical data).
Internet of Things (IoT): IoT represents one of the main markets for big data
applications. Because of the high variety of objects, IoT applications are
continuously evolving. Nowadays, there are various Big Data applications supporting
logistics enterprises.
Public utilities: Utilities such as water supply organizations are placing sensors in
pipelines to monitor the flow of water in complex water supply networks.
Transportation and logistics: Many public road transport companies are using RFID
(Radiofrequency Identification) and GPS to track buses and explore interesting data to
improve their services.
Political services and government monitoring: Many governments, such as those of India
and the United States, are mining data to monitor political trends and analyze
population sentiment.
Big Data Analytics Applications (BDAs) are a new type of software application that
analyzes big data using massively parallel processing frameworks (e.g., Hadoop).
Data Mining: Decision trees automatically help users understand what combination of
data attributes results in a desired outcome. The structure of the decision tree
reflects structure that may be hidden in your data.
Banking: The use of customer data invariably raises privacy issues. By uncovering
hidden connections between seemingly unrelated pieces of data, big data analytics could
potentially reveal sensitive personal information.
Marketing: Marketers have begun to use facial recognition software to learn how well
their advertising succeeds or fails at stimulating interest in their products.
Telecom: Big data is now used in many different fields; in telecom, too, it plays an
important role.
8. CONCLUSION
This report covers some of the important concepts that organizations need to analyze when
estimating the significance of implementing Big Data technology, as well as some direct
challenges to the technology's infrastructure. The availability of Big Data, low-cost
commodity hardware, and new information management and analytic software has produced a
unique moment in the history of data analysis. The convergence of these trends means that we
have the capabilities required to analyze astonishing data sets quickly and cost-effectively for the
first time in history. These capabilities are neither theoretical nor trivial. They represent a
genuine leap forward and a clear opportunity to realize enormous gains in terms of efficiency,
productivity, revenue, and profitability. The age of Big Data is here, and these are truly
revolutionary times if both business and technology professionals continue to work together and
deliver on the promise.
REFERENCES:
[1] Wei Fan and Albert Bifet, "Mining big data: current status, and forecast to the future,"
ACM SIGKDD Explorations Newsletter, Volume 14, Issue 2, December 2012.
[2] "Social media data & real time analytics," HoC Sci. & Tech. Com., bit.ly/1eMJcEK.
[3] "Big Data and Five V's Characteristics," Ministry of Education, Islamic University
College.
[4] Rajkumar Buyya et al., "Big Data computing and clouds: Trends and future directions."
[5] Dr. M. Padmavalli, "Big Data: Emerging Challenges of Big Data and Techniques for
Handling," Nov.-Dec. 2016.
[6] Armour, F., Kaisler, S., Espinosa, J. A. and Money, W., 2013. Illustrated the issues and
challenges in big data.
[7] Lee, K. H., Choi, T. W., Ganguly, A., Wolinsky, D. I., Boykin, P. O. and Figueired, R.,
2011. Presents the parallel data processing with MapReduce.
[8] Marz, N. and Warren, J., 2013. Big Data: Principles and best practices of scalable
realtime data systems. Manning Publications.
[9] Feldman, D., Schmidt, M. and Sohler, C., 2013. Turning big data into tiny data:
Constant-size coresets for k-means, PCA and projective clustering. In SODA.
[10] Fan, W. and Bifet, A. Describe the big data mining current status and forecast to the
future.
[11] Nitin Sawant and Himanshu Shah, Big Data Application Architecture Q & A: A
Problem-Solution Approach, Apress.
[12] Mark A. Beyer and Douglas Laney, "The Importance of 'Big Data': A Definition," Gartner,