
Big data: concepts, techniques, storage and challenges

Abstract:

In recent years, internet applications and communications have seen rapid development and growing popularity in the field of Information Technology. These internet applications and communications continually generate data of large size and great variety, with genuinely difficult, multifaceted structure, called big data. As a consequence, we are now in an era of massive automatic data collection, systematically obtaining many measurements without knowing which ones will be relevant to the phenomenon of interest. Traditional data storage techniques are not adequate to store and analyse such huge volumes of data. Hence the aim of this paper is to provide an overview of big data and of the concepts, challenges and techniques related to it.

1. Introduction

Today, many organizations are collecting, storing, and analyzing massive amounts of data. This data is commonly referred to as "big data" because of its volume, the velocity with which it arrives, and the variety of forms it takes. Big data is creating a new generation of decision-support data management. Businesses are recognizing the potential value of this data and are putting the technologies, people, and processes in place to capitalize on the opportunities. A key to deriving value from big data is the use of analytics: collecting and storing big data creates little value by itself; it is only data infrastructure at that point. The data must be analyzed, and the results used by decision makers and organizational processes, in order to generate value.

What is big data?

Big data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time.

Evolution of big data:

Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data sources. Big data was originally associated with three key concepts: volume, variety, and velocity. Other concepts later attributed to big data are veracity and value.

2. Concepts:

Big data represents information assets characterized by such high volume, velocity and variety that specific technologies and analytical methods are required to transform them into value. The three concepts of volume, variety and velocity have been further expanded with other complementary characteristics of big data:

• Machine learning: big data analysis often does not ask why and simply detects patterns.
• Digital footprint: big data is often a cost-free by-product of digital interaction.

Volume:

This refers to the quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can be considered big data at all.

Variety:

This refers to the type and nature of the data, which helps the people who analyse it to use the resulting insight effectively.

Velocity:

This refers to the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Veracity:

This is an extended characteristic of big data that refers to data quality and data value. Everything we do leaves a digital trace that can be used and analysed, but because of its size and complexity, such data cannot be processed and analysed through traditional methods such as an RDBMS.

How does it work?

Your technology is generating data whenever you use your smartphone, chat with family and friends on Facebook, or shop. Any time you go online, you are producing data and leaving a digital trail of information. All of this data is very complex: there is an enormous amount of it, it comes from many different sources, and it arrives quickly, often in real time. Billions of gigabytes of data are generated every single day by people and technologies all around the world.

Businesses use this data to create customized and improved experiences for all of us, for example to figure out what kind of new drink people will like, or where a good place to open a new store would be. They are not simply collecting all of the data we generate; they are analyzing it and finding ways to improve their products and services, which in turn shapes our lives and the experiences we have with the world around us.

That is what big data is: all of the data being generated in the digital era by all of the different types of technologies out there. Data is constantly being generated, whether we are using apps on our phones, interacting on social media, or shopping for products, and all of this information combines with other data sources to become big data. Companies also combine big data with technologies such as machine learning and artificial intelligence to further improve their ability to enhance our daily lives with faster, more personalized experiences.
3. Importance of big data:

The importance of big data lies in how you utilize the data you own. Data can be fetched from any source and analysed to enable:

1) Cost reductions
2) Time reductions
3) New product development and optimized offerings, and
4) Smart decision making.

By combining big data with high-powered analytics, you can have a great impact on your business strategy, such as:

• Finding the root causes of failures, issues and defects in real-time operations.
• Generating coupons at the point of sale based on the customer's buying habits.
• Recalculating entire risk portfolios in just minutes.
• Detecting fraudulent behavior before it affects your organization.

4. Techniques used in big data:

There are many techniques that are widely used in big data to store and analyse data. Some of them are discussed below.

1. Apache Hadoop

Apache Hadoop is a Java-based free software framework that can effectively store large amounts of data in a cluster. The framework runs in parallel on a cluster and allows data to be processed across all nodes. The Hadoop Distributed File System (HDFS) is Hadoop's storage system; it splits big data sets and distributes them across many nodes in a cluster. It also replicates data within the cluster, thus providing high availability.
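
To make the processing model concrete, below is a minimal sketch of the classic word-count job written for Hadoop Streaming, the Hadoop facility that lets plain scripts act as the map and reduce phases via standard input and output; the script names are illustrative.

#!/usr/bin/env python3
# mapper.py - reads raw text from stdin, emits "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - Hadoop sorts map output by key, so identical words
# arrive on consecutive lines; this sums their counts.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Hadoop runs many copies of these two scripts across the cluster, each mapper working on a different HDFS block of the input, which is how processing is spread over all nodes.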

2. Microsoft HDInsight

It is a big data solution from Microsoft powered by Apache Hadoop, available as a service in the cloud. HDInsight uses Windows Azure Blob storage as the default file system and also provides high availability at low cost.

3. NoSQL

While traditional SQL can be used effectively to handle large amounts of structured data, NoSQL (Not Only SQL) is needed to handle unstructured data. NoSQL databases store unstructured data with no fixed schema: each row can have its own set of column values. NoSQL gives better performance when storing massive amounts of data, and there are many open-source NoSQL databases available for analysing big data.
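
As a minimal sketch of what "each row can have its own set of column values" means in practice, the snippet below stores two differently shaped records in MongoDB, a popular open-source NoSQL database, using the pymongo driver; the connection string, collection and field names are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["shop"]["events"]

# No schema is declared up front; each document carries its own fields.
events.insert_one({"user": "alice", "action": "view", "product_id": 42})
events.insert_one({"user": "bob", "action": "search", "query": "headphones"})

for doc in events.find({"user": "alice"}):
    print(doc)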

4. Hive

Hive is a distributed data warehouse system for Hadoop. It supports an SQL-like query language, HiveQL (HQL), to access big data. It is primarily used for data mining purposes and runs on top of Hadoop.
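
The sketch below shows how a HiveQL query can be issued from Python, assuming a reachable HiveServer2 instance and the PyHive package; the host, port, table and column names are illustrative.

from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# HiveQL looks like SQL, but Hive compiles it into jobs that run on Hadoop.
cursor.execute(
    "SELECT product_id, COUNT(*) AS views "
    "FROM page_views GROUP BY product_id LIMIT 10"
)
for row in cursor.fetchall():
    print(row)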

5. Sqoop

Sqoop is a tool that connects Hadoop with various relational databases to transfer data. It can be used effectively to transfer structured data to Hadoop or Hive.
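
As an illustration, the sketch below launches a typical Sqoop import from a Python wrapper, assuming the sqoop command-line tool is installed and on the PATH; the JDBC URL, credentials, table and HDFS directory are illustrative.

import subprocess

# Copy the relational table "orders" into HDFS using four parallel mappers.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # source database
    "--username", "etl_user",
    "--table", "orders",             # relational table to transfer
    "--target-dir", "/data/orders",  # destination directory in HDFS
    "--num-mappers", "4",
], check=True)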

6. PolyBase
PolyBase works with SQL Server Parallel Data Warehouse (PDW), a data warehousing appliance built for processing any volume of relational data, and provides integration with Hadoop, allowing non-relational data to be accessed as well.

7. Big data in Excel

As many people are comfortable doing analysis in Excel, a popular tool from Microsoft, you can also connect to data stored in Hadoop from Excel 2013. Hortonworks, whose primary business is providing enterprise Apache Hadoop, offers an option to access big data stored in its Hadoop platform using Excel 2013.

8. Presto

Facebook developed and recently open-sourced its SQL-on-Hadoop query engine, Presto, which is built to handle petabytes of data. Unlike Hive, Presto does not depend on the MapReduce technique and can retrieve data quickly.
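
A minimal sketch of querying Presto from Python follows, assuming a running Presto coordinator and the PyHive package; the host, port, catalog and table names are illustrative.

from pyhive import presto

conn = presto.connect(host="presto.example.com", port=8080)
cursor = conn.cursor()

# Presto executes SQL interactively instead of launching MapReduce jobs.
cursor.execute("SELECT COUNT(*) FROM hive.web.page_views")
print(cursor.fetchone())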

Other techniques include A/B testing, classification tree analysis, cluster analysis, crowdsourcing, data fusion and data integration, data mining, ensemble learning, machine learning, etc.

5. Challenges in big data:

Handling big data is very complex. Challenges faced during its integration include uncertainty of data management, the big data talent gap, getting data into a big data structure, syncing across data sources, getting useful information out of the big data, volume, skill availability, solution cost, etc.

Some of the most common big data challenges include the following:
HADOOP IS HARD

While Hadoop and its surrounding ecosystem of tools are lauded for their ability to handle massive volumes of structured and unstructured data, the software isn't easy to manage or use. Since the technology is relatively new, many data professionals aren't familiar with how to manage Hadoop. Add to that the fact that Hadoop frequently requires extensive internal resources to maintain, and many companies are left devoting most of their resources to the technology rather than to the actual big data problem they are trying to solve. In one industry survey, 73% of respondents said that understanding the big data platform was the most significant challenge of a big data project.

SCALABILITY

With big data, it's crucial to be able to scale up and down on demand. Many organizations fail to take into account how quickly a big data project can grow and evolve, and constantly pausing a project to add resources cuts into time for data analysis. Big data workloads also tend to be bursty, making it difficult to predict where resources should be allocated. The extent of this challenge varies by solution: a cloud solution will scale much more easily and quickly than an on-premises one.

DATA QUALITY

Data quality is not a new concern, but the ability to store every piece of data a business produces in its original form compounds the problem. Dirty data costs companies in the United States $600 billion every year. Common causes of dirty data that must be addressed include user input errors, duplicate data and incorrect data linking. In addition to meticulous maintenance and cleaning of data, big data algorithms can also be used to help clean it.
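
As a minimal sketch of such cleaning, the snippet below applies routine data-quality fixes with the pandas library; the file and column names are illustrative.

import pandas as pd

df = pd.read_csv("customers.csv")

# Remove exact duplicate rows, e.g. from double-submitted forms.
df = df.drop_duplicates()

# Normalize common user-input errors: stray whitespace and mixed casing.
df["email"] = df["email"].str.strip().str.lower()

# Surface rows with a missing key rather than silently keeping them.
missing = df[df["customer_id"].isna()]
print(f"{len(missing)} rows lack a customer_id")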

SECURITY

Keeping a vast lake of data secure is another big data challenge. Specific challenges include:

1. Authenticating every team and team member accessing the data.
2. Restricting access based on a user's needs.
3. Recording data-access histories and meeting other compliance regulations.
4. Proper use of encryption for data in transit and at rest (a minimal sketch follows below).
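
For the last point, the snippet below is a minimal sketch of encrypting a record at rest using the Fernet recipe from Python's cryptography package; in a real deployment the key would live in a key-management service, and the record shown is illustrative.

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # store securely, never alongside the data
f = Fernet(key)

record = b'{"user": "alice", "balance": 1200}'
token = f.encrypt(record)    # ciphertext that is safe to persist to disk
print(f.decrypt(token))      # only holders of the key can recover the record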

COST MANAGEMENT

It’s difficult to project the cost of a big data project, and given how quickly they scale, can quickly eat

up resources. The challenge lies in taking into account all costs of the project from acquiring new

hardware, to paying a cloud provider, to hiring additional personnel. Businesses pursuing on-premises

projects must remember the cost of training, maintenance and expansion. Big data in the cloud

projects must carefully evaluate the service-level agreement with the provider to determine how usage

will be billed and if there will be any additional fees.

BIG DATA OPPORTUNITY

While the number of big data challenges can be overwhelming, big data also presents an opportunity. Businesses that are able to identify the right infrastructure for their big data project and follow best practices for implementation will see a significant competitive advantage. Entrepreneurs have also capitalized on big data technology to create new products and services.

6. Big data storage

Big data storage is a compute-and-storage architecture that collects and manages large data sets and enables real-time data analytics. Companies apply big data analytics to extract greater intelligence from metadata. In most cases, big data storage uses low-cost hard disk drives, although moderating prices for flash appear to have opened the door to using flash in servers and storage systems as the foundation of big data storage. These systems can be all-flash, or hybrids mixing disk and flash storage. The data itself is mostly unstructured, which favours file-based and object storage. Although a specific volume size or capacity is not formally defined, big data storage usually refers to volumes that grow exponentially to terabyte or petabyte scale.


7. Conclusion

In conclusion, the challenge now is to manage big data, and with innovations in the field such as Hadoop, the scope is only getting bigger. Hadoop is the supermodel of big data: being skilled in it can be the deciding factor between gaining a springboard for your career and getting left behind. Big data offers one of the most rewarding careers, with a large number of opportunities in the field. Organisations today are looking for data analysts, data engineers, and professionals with big data expertise in large numbers, and the need for analytics professionals and big data architects is also increasing. Hence, there is growing interest among professionals in building a big data career.
