Abstract:
In recent years, internet applications and communications have developed rapidly and gained a prominent place in the field of Information Technology. These applications and communications continually generate data of large size, great variety, and genuinely complex structure, known as big data. As a consequence, we are now in the era of massive automatic data collection, systematically obtaining many measurements without knowing which of them will be relevant to the phenomenon of interest. Traditional data storage techniques are not adequate to store and analyse such huge volumes of data. Hence the aim of this paper is to provide an overview of big data, its concepts, its challenges, and the techniques associated with it.
1. Introduction
Today, many organizations are collecting, storing, and analyzing massive amounts of data. This data
is commonly referred to as “big data” because of its volume, the velocity with which it arrives, and
the variety of forms it takes. Big data is creating a new generation of decision support data
management. Businesses are recognizing the potential value of this data and are putting the
technologies, people, and processes in place to capitalize on the opportunities. A key to deriving value
from big data is the use of analytics. Collecting and storing big data creates little value; it is only data infrastructure at this point. It must be analyzed, and the results used by decision makers and organizational processes, in order to generate value.
Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data-management tools can store or process it efficiently. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. Big data was originally associated with three key concepts: volume, variety, and velocity. Other concepts later attributed to it include veracity (the quality of the data) and value.
2. Concept:
Big data represents information assets characterized by such high volume, velocity, and variety as to require specific technology and analytical methods for their transformation into value. The three concepts of volume, variety, and velocity have since been expanded with complementary characteristics such as veracity. A further characteristic is the role of machine learning: big data analysis often does not ask why, but simply detects patterns.
Volume:
This refers to the quantity of generated and stored data. The size of the data determines its potential value and insight, and whether it can be considered big data at all.
Variety:
It refers to the type and nature of the data. This helps people who analyse it to effectively use the
resulting insight.
Velocity:
The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Veracity:
This is an extension of the original three concepts, referring to the quality and value of the data.
Everything we do leaves a digital trace, which can be used and analysed. Because of their size and complexity, these traces cannot be processed and analysed through traditional methods such as an RDBMS. Your technology is generating data whenever you use your smartphone, when you chat with your family and friends on Facebook, and when you shop. Anytime you go online, you are producing data and leaving a digital trail of information. All of this data is very complex, and there is so much of it, arriving from so many different sources. Businesses use all of this data to create customized and improved experiences for all of us.
There are billions of gigabytes of data being generated every single day by people and technologies. Businesses use this data to figure out, for example, what kind of new drink people will like, or where a good location for a new store would be. But businesses are not simply collecting all of the data that we generate. They are actually analyzing it and finding ways to improve their products and services, which in turn shapes our lives and the experiences that we have with the world around us.
That is what Big Data is: all of the data being generated in the digital era, from all of the different sources around us.
Data is always being generated by digital technologies, whether we are using apps on our phones, interacting on social media, or shopping for products. All of this information combines with other collected data to form Big Data.
Companies also combine Big Data with technologies like Machine Learning and Artificial
Intelligence to further improve their ability to enhance our daily lives with faster, more personalized
experiences.
3. Importance of big data:
The importance of big data lies in how you utilize the data you own. Data can be fetched from any source and analysed to enable, among other things:
1) Cost reductions
2) Time reductions
By combining big data with high-powered analytics, you can have a great impact on your business, for example by:
Finding the root cause of failures, issues, and defects in real-time operations.
Generating coupons at the point of sale based on the customer's buying habits.
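As a rough illustration of the second example, a point-of-sale coupon rule can be sketched in a few lines of Python; the product names, purchase threshold, and discount below are invented for illustration:

```python
# Hypothetical sketch: issue a coupon at the point of sale based on a
# customer's buying habits. Thresholds and items are invented.
from collections import Counter

def coupon_for(purchase_history, min_purchases=3, discount_pct=10):
    """Return a coupon for the customer's most-bought item, if any
    item was bought at least `min_purchases` times; else None."""
    if not purchase_history:
        return None
    item, n = Counter(purchase_history).most_common(1)[0]
    if n >= min_purchases:
        return f"{discount_pct}% off {item}"
    return None

history = ["coffee", "bread", "coffee", "coffee", "milk"]
offer = coupon_for(history)  # the customer buys coffee often
```

A real retailer would of course mine far richer purchase data, but the shape of the decision is the same: aggregate a habit, compare it to a threshold, act at the point of sale.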
4. Techniques:
There are many techniques that are widely used in big data to store and analyse the data. Some are listed below.
1. Apache Hadoop
Apache Hadoop is a Java-based free software framework that can effectively store large amounts of data in a cluster. The framework runs in parallel on a cluster and allows us to process data across all its nodes. The Hadoop Distributed File System (HDFS) is the storage system of Hadoop; it splits big data into blocks and distributes them across many nodes in the cluster. It also replicates the data within the cluster, thus providing fault tolerance and high availability.
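The split-and-replicate idea behind HDFS can be sketched as follows; the block size, node names, and round-robin placement are toy simplifications for illustration, not Hadoop's actual defaults or placement policy:

```python
# Simplified sketch of the HDFS idea: split data into fixed-size
# blocks, then store each block on several nodes for fault tolerance.
def split_into_blocks(data: bytes, block_size: int):
    """Cut `data` into consecutive chunks of at most `block_size` bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=3):
    """Assign each block index to `replication` distinct nodes, round-robin."""
    return {i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(b"0123456789" * 10, block_size=32)
layout = place_blocks(blocks, nodes=["node1", "node2", "node3", "node4"])
```

If any single node fails, every block it held still exists on two other nodes, which is exactly the availability property the replication buys.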
2. Microsoft HDInsight
It is a Big Data solution from Microsoft, powered by Apache Hadoop and available as a service in the cloud. HDInsight uses Windows Azure Blob storage as its default file system, which also provides high availability at low cost.
3. NoSQL
While traditional SQL can be effectively used to handle large amounts of structured data, NoSQL (Not Only SQL) is needed to handle unstructured data. NoSQL databases store unstructured data with no particular schema: each row can have its own set of column values. NoSQL gives better performance when storing massive amounts of such data. There are many open-source NoSQL databases available for this, such as MongoDB, Apache Cassandra, and HBase.
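The schema-less, row-by-row flexibility described above can be illustrated with plain Python dictionaries standing in for NoSQL documents; the field names are invented:

```python
# Sketch of the schema-less idea behind many NoSQL stores: each "row"
# (document) carries its own set of fields, so no fixed schema exists.
users = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ben", "phone": "555-0100", "city": "Pune"},
    {"id": 3, "name": "Chen"},          # sparse record: only two fields
]

# Query: everyone who has an email on record. Missing fields are
# simply absent, not NULL columns as in a relational table.
with_email = [u["name"] for u in users if "email" in u]
```

Contrast this with a relational table, where every row must carry every column; here the second and third records simply omit what they do not have.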
4. Hive
This is a distributed data-management layer for Hadoop. It supports an SQL-like query language, HiveQL (HQL), for accessing big data, and is primarily used for data-mining purposes. It runs on top of Hadoop.
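HiveQL is close in spirit to standard SQL. As a rough illustration of the query style (using Python's built-in sqlite3 in place of a real Hive cluster, with an invented page_views table), a typical aggregate query looks like this:

```python
# Illustration of the SQL-style querying Hive offers. A real HiveQL
# query would run against tables stored in Hadoop; here the standard
# library's sqlite3 stands in, since the SELECT syntax is very similar.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (url TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("/home", 120), ("/about", 30), ("/home", 80)])

# Aggregate views per URL, busiest pages first.
rows = conn.execute(
    "SELECT url, SUM(views) FROM page_views "
    "GROUP BY url ORDER BY SUM(views) DESC"
).fetchall()
```

The appeal of Hive is precisely this: analysts can express distributed computations over huge data sets in this familiar declarative form instead of writing low-level MapReduce jobs.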
5. Sqoop
This is a tool that connects Hadoop with various relational databases to transfer data. It can be used to import structured data from relational databases into HDFS and to export the results back.
6. PolyBase
PolyBase works on top of SQL Server Parallel Data Warehouse (PDW), a data-warehousing appliance built for processing any volume of relational data, and provides access to data stored in Hadoop.
7. Excel
As many people are comfortable doing analysis in Excel, a popular tool from Microsoft, you can also connect to data stored in Hadoop using Excel 2013. Hortonworks, which primarily works on providing Enterprise Apache Hadoop, provides an option to access big data stored in its Hadoop platform from Excel.
8. Presto
Facebook has developed and recently open-sourced its SQL-on-Hadoop query engine, named Presto, which is built to handle petabytes of data. Unlike Hive, Presto does not depend on the MapReduce technique and can quickly retrieve data.
Other techniques include A/B testing, Classification Tree Analysis, Cluster Analysis, Crowdsourcing, Data fusion and data integration, Data Mining, Ensemble learning, Machine Learning, etc.
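Several of the tools above either compile their queries down to Hadoop's MapReduce model or, like Presto, deliberately avoid it. The model itself can be sketched in-memory in a few lines of Python:

```python
# Minimal in-memory sketch of the MapReduce model: map each record to
# key/value pairs, then reduce all values that share a key. In Hadoop
# the pairs would be shuffled across the cluster between the phases.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))     # the shuffle/sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data", "big plans"]))
```

Because the map and reduce phases touch each record independently, the real framework can spread both phases across many nodes, which is what makes the model scale.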
5. Challenges:
The handling of big data is very complex. Some challenges faced during its integration include uncertainty of data management, the big data talent gap, getting data into a big data structure, syncing across data sources, getting useful information out of the big data, volume, skill availability, solution cost, etc.
Some of the most common of those big data challenges include the following:
HADOOP IS HARD
While Hadoop and its surrounding ecosystem of tools are lauded for their ability to handle massive
volumes of structured and unstructured data, the software isn't easy to manage or use. Since the
technology is relatively new, many data professionals aren’t familiar with how to manage Hadoop.
Add to that the fact that Hadoop frequently requires extensive internal resources to maintain, and
many companies are left devoting most of their resources to the technology rather than to the actual
big data problem they are trying to solve. In one industry survey, 73% of respondents said that understanding the big data platform was the most significant challenge of a big data project.
SCALABILITY
With big data, it’s crucial to be able to scale up and down on-demand. Many organizations fail to take
into account how quickly a big data project can grow and evolve. Constantly pausing a project to add
additional resources cuts into time for data analysis. Big data workloads also tend to be bursty,
making it difficult to predict where resources should be allocated. The extent of this big data
challenge varies by solution: a cloud-based solution will scale much more easily and quickly than an on-premises one.
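One way to picture on-demand scaling for bursty workloads is a simple rule that derives a node count from the current queue depth; the thresholds and capacities below are invented for illustration:

```python
# Toy sketch of on-demand scaling: choose how many worker nodes to run
# from the number of queued jobs. All numbers here are invented; a real
# autoscaler would also consider CPU, memory, and cost signals.
import math

def nodes_needed(queued_jobs, jobs_per_node=50, min_nodes=2, max_nodes=20):
    """Scale the cluster to the queue, clamped to a sane range."""
    wanted = math.ceil(queued_jobs / jobs_per_node)
    return max(min_nodes, min(max_nodes, wanted))
```

A cloud platform can apply a rule like this in minutes; an on-premises cluster must instead provision for the peak, which is exactly why the challenge weighs more heavily there.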
DATA QUALITY
Data quality is not a new concern, but the ability to store every piece of data a business produces in its original form compounds the problem. Dirty data is estimated to cost companies in the United States $600 billion every year. Common causes of dirty data that must be addressed include user input errors, duplicate data, and incorrect data linking. In addition to being meticulous about maintaining and cleaning data, big data teams typically also rely on dedicated data-quality tools.
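Two of the dirty-data causes named above, user input errors and duplicates, can be addressed with routine normalisation and deduplication. A minimal sketch, assuming records keyed by an invented email field:

```python
# Sketch of routine cleaning for two common dirty-data causes: user
# input errors (stray case/whitespace) and duplicate records. The
# email-keyed records are invented for illustration.
def clean(records):
    seen, out = set(), []
    for r in records:
        email = r["email"].strip().lower()   # normalise input errors
        if email not in seen:                # drop duplicate rows
            seen.add(email)
            out.append({**r, "email": email})
    return out

raw = [{"email": " Ann@Example.com "},
       {"email": "ann@example.com"},
       {"email": "bob@example.com"}]
cleaned = clean(raw)
```

Incorrect data linking is harder, since it requires matching records across sources rather than within one table, which is why it usually needs dedicated tooling.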
SECURITY
Keeping that vast lake of data secure is another big data challenge. Specific challenges include:
1. User authentication for every team and team member accessing the data.
2. Restricting access based on a user's actual need.
3. Recording data-access histories and meeting compliance regulations.
4. Proper use of encryption for data in transit and at rest.
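A minimal sketch of the first of those challenges, per-user authentication with salted password hashes, using only Python's standard library (a real deployment would rely on a vetted identity service rather than hand-rolled checks):

```python
# Toy per-user authentication: store a salted PBKDF2 hash per user and
# verify credentials with a constant-time comparison. Illustrative only.
import hashlib
import hmac
import os

def register(store, user, password):
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    store[user] = (salt, digest)

def authenticate(store, user, password):
    if user not in store:
        return False
    salt, digest = store[user]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

users = {}
register(users, "analyst1", "s3cret")
```

The same gatekeeping has to happen for every team member and every access path into the data lake, which is what makes it a challenge at big data scale.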
COST MANAGEMENT
It is difficult to project the cost of a big data project, and given how quickly such projects scale, they can quickly eat up resources. The challenge lies in taking into account all the costs of the project, from acquiring new hardware, to paying a cloud provider, to hiring additional personnel. Businesses pursuing on-premises projects must remember the costs of training, maintenance, and expansion. Big-data-in-the-cloud projects must carefully evaluate the service-level agreement with the provider to determine how usage will be billed as the project grows.
While the number of big data challenges can be overwhelming, they also present an opportunity. Those businesses that are able to identify the right infrastructure for their big data project and follow best practices for implementation will see a significant competitive advantage. Entrepreneurs have also taken notice, building tools and services that help companies meet these challenges.
6. Big data storage:
Big data storage is a compute-and-storage architecture that collects and manages large data sets and enables real-time data analytics. Companies apply big data analytics to get greater intelligence from metadata. In most cases, big data storage uses low-cost hard disk drives, although moderating prices for flash have opened the door to using flash in servers and storage systems as the foundation of big data storage. These systems can be all-flash, or hybrids mixing disk and flash storage. The data itself in big data is mostly unstructured, which points to file-based and object storage. Although a specific volume size or capacity is not formally defined, big data storage usually refers to volumes that grow to very large, often petabyte, scale.
7. Conclusion:
I hereby conclude that the main challenge now is to manage big data, and with innovations in the field such as Hadoop, the scope is only getting bigger. Hadoop is the supermodel of Big Data: being skilled in Hadoop can be the deciding factor between a springboard for your career and getting left behind. Big Data offers one of the most rewarding careers, with a large number of opportunities in the field. Organisations today are looking for data analysts, data engineers, and professionals with Big Data expertise in large numbers. The need for analytics professionals and big data architects is also increasing. Hence, there is a growing interest among professionals in making a career in this field.