
JAMES COOK UNIVERSITY SINGAPORE

The Challenges of Using Big Data in Business Decision Making
LB5235 Practical or Research Project

Bui Quang Huy (13055527)


Jesslyn Khou (13207944)
KyungMin Lee (13193809)
Pan Na (13156996)

6/9/2016
Table of Contents

1. Introduction
1.1 Aims
1.2 Research Aims and Questions
1.3 Methods
2. Definitions
2.1 Big Data
2.2 Big Data Analytics
3. Analysis
3.1 Statistical issues related to big data analysis
3.1.1 Data quality issues
3.1.1.1 Data sources
3.1.1.2 Unstructured and semi-structured data
3.1.1.3 The combining of sources, data repurposing and data transfer
3.1.2 Data analysis process
3.2 Limited talents for big data analysis
3.3 Issues of inequalities related to big data
3.3.1 Big data infrastructure
3.3.2 Widened digital divide
3.4 Big data privacy and security challenges
3.4.1 Big data privacy
3.4.2 Big data security
3.5 Big data's potential to be misleading
3.5.1 Lack of Attention to Data Quality and Processing
3.5.2 Misleading Assumptions related to big data
3.5.3 Effects on Consumers
4. Implications
5. Conclusions
6. Suggestions
7. References

1. Introduction
1.1. Aims
The aim of this research is to understand the challenges of using big data in business. Big
data has the potential to give businesses and organizations more accurate and representative
information about their environment, both internal and external. It also has the potential to
allow for more accurate and reliable analysis, particularly in terms of predictive analytics.
It is hoped that through the use of big data, organizations will be able to reduce costs,
increase efficiency, make better decisions, and provide customers with new products and
services (SAS, 2016). Companies that have successfully used big data to augment their
marketing activities are better at identifying, reaching and engaging the right target market.
They are also able to pinpoint consumer behavior more accurately, resulting in increased
sales. Considering these potential benefits and companies' eagerness to use big data, the aim
of this literature review is to provide a counterpoint and to examine the various pitfalls of
big data in its current state of development. Big data has numerous issues that could make
its uninformed use detrimental for some businesses, yet less attention is paid to the
challenges and drawbacks of using, or trying to use, a vast amount of data (Clarke, 2016). By
searching for, compiling and categorizing the challenges associated with the use of big data,
we hope to give readers a much clearer view of big data and thereby put them in a position to
make an informed decision on its use in their business.

1.2 Research Aims and Questions


Specifically, this review aims to meet the following objectives:
1. Determine the various challenges businesses may face when using big data
2. Categorize the various challenges associated with big data and big data analytics
3. Determine the impact of these challenges on businesses
To achieve the aims stated above, this review will answer the following research questions:
1. What is big data?
2. What is big data analytics?

3. What are the sources of big data?
4. What are the dangers or shortcomings of using big data?
5. What are the possible negative impacts of using big data?

It is hoped that the answers to the research questions will help businesses make better
decisions regarding the use of big data. The answers should also provide insight to big data
analysts, making them aware of the pitfalls of big data. Overall, a better understanding of
big data should benefit all who intend to use it.

1.3. Methods
A systematic review of the literature was conducted to identify the challenges and potential
negative impacts of big data. This involved systematic searching and a consistent,
best-evidence approach to the selection of the literature. A range of sources was used,
including empirical research as well as secondary sources such as books and good practice
examples published from 2001 onwards. One Search (the university-provided library research
system, which searches across many of the library's information resources) and Google Scholar
were used to find the articles for review. Google Scholar supplemented One Search where One
Search did not have access to particular academic databases and peer-reviewed journal
articles, and Google was used to find additional resources such as articles and news. The
keyword terms included combinations of: "big data", "what is big data", "big data
challenges", "big data drawbacks", "big data negative impacts", "big data issues", "poor
quality data", "big data security", "big data intellectual property law", "big data bias",
"predictive analytics", "google flu", and "big data statistic". Table 1 shows the selection
criteria for the literature used in this review.

Table 1: Selection criteria for the inclusion of literature

Publication date: Work published from the year 2001
Language: Published in English
Study type: Journal articles (peer reviewed); empirical research; good practice examples;
company published literature.

The publication date was restricted to 2001 onwards because big data was first introduced in
the early 2000s and has been popularized from around 2010 until today.
We read roughly 114 articles during the entire research process. After initial readings, we
formulated 4 categories into which we could group the issues that may pose challenges to
business decision making. During this process, some articles were found to focus less on our
area of interest and were therefore dropped from the review list. For example, we searched
for "big data challenges", but most of the resulting articles focused on the positive
potential of big data and mentioned the challenges only in passing. In such cases, we noted
down the ideas relevant to our interest and used another source to supplement our points, and
the original source was discarded. Some articles also required technical knowledge beyond our
abilities and were therefore excluded from the review. The final number of articles used to
write this literature review is 51, a list that combines academic journals, electronic
publications, news articles, electronic magazines and various other publications.

2. Definitions
2.1 Big Data
There are several definitions of big data. The most well-known definition was coined by Doug
Laney in 2001. He defined big data as having 3 main characteristics: volume, variety and
velocity. Volume refers to the depth and breadth of information gathered on a particular
subject, which increases the number of data points and the storage space required to hold the
information. Variety refers to the diversity of data, categorized by its source and type.
Velocity refers to the speed at which data is gathered and used to support interactions, for
example real-time inventory and product tracking (Laney, 2001). While some still categorize
big data according to this definition, others argue that it is no longer sufficient. Dr.
Satyam Priyadarshy added 4 more characteristics of big data: veracity, visualization,
variability, and value (Priyadarshy, 2015). Veracity refers to how accurate, complete, and
free from bias the data is. Visualization refers to how big data and the results of its
analysis are represented in a concise and easily understood manner. Variability refers to the
meaning and context of data, which are important for accurate interpretation. Value refers to
the costs and the potential returns of using big data. Other definitions of big data are
combinations of these 7 characteristics. Some sources list additional characteristics, such
as validity (the accuracy of data for its intended use) and volatility (how long data remains
valid and, consequently, how long it should be stored) (Normandeau, 2013). These additional
characteristics define what is required for big data to be useful rather than what
differentiates big data from an ordinary, if large, data set. As such, for the purpose of
this literature review, the first definition of big data will be used, as it provides a
clearer differentiation between big data and standard data sets.

2.2 Big Data Analytics


Prajapati defines big data analytics as the process of examining large amounts of data of a
variety of types to uncover hidden patterns, unknown correlations, and other useful
information (Prajapati, 2013). Big data analytics in this literature review will conform to
this definition.

3. Analysis
3.1 Statistical Issues Related to Big Data Analysis
The characteristics of big data mentioned in the definition mean that big data cannot be
processed in the same way as standard data sets. Big data has several collection points (or
sources) and, due to its size, requires huge storage space when the data needs to be
consolidated. Due to its size and speed, it also needs a reliable high-speed connection to
transmit it to the processing point (Lawal, 2016). This limits the processing of big data to
two methods: process the data in place where it is gathered and transmit the resulting
information, or transmit only the data that is critical for downstream analysis (Lawal,
2016). Additionally, due to these characteristics, it is impossible for researchers to
oversee big data collection and processing with the same meticulousness that they may apply
to traditional data analysis. The combination of these process limitations and the
characteristics of big data allows for serious data quality issues, which affect the quality
and reliability of big data analysis results. This section discusses the factors that may
affect data quality and how they affect big data analysis. It also discusses how big data
characteristics encourage certain processing methods, which may in turn affect the
reliability of analysis results.

3.1.1 Data Quality Issues


Data quality issues in big data analytics are caused mostly by the variety and volume of big
data. Big data's variety comes from its sources and its types. The sources used in big data
include archives, documents, media (images, podcasts, videos, audio recordings, flash and
live streams), business applications, social media, the public web, data storage, machinery
logs, and sensor data. These sources may be divided according to the type of data they
produce: structured, semi-structured and unstructured. Data from sensors, machinery logs,
data storage, and business applications is considered structured. The rest of the sources
mentioned above contain a mixture of semi-structured and unstructured data. Structured data
is organized in such a way that its inclusion in a relational database is seamless and
straightforward, allowing it to be readily searched using simple algorithms (BrightPlanet,
2012). Semi-structured data has some structure, but it is not readily transferable into a
relational database because its organization cannot be constrained by a schema, and it thus
requires additional processing to organize it into a database for further analysis (Buneman,
1997). According to David and Julia Jary, unstructured data is data which has been collected
without a reference on how it might eventually be coded (Jary & Jary, 2006).
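
The additional processing that semi-structured data needs before it fits a relational layout
can be illustrated with a minimal Python sketch. The records, field names and the `flatten`
and `to_table` helpers are hypothetical and are not taken from any of the reviewed sources;
the sketch only assumes that the semi-structured input arrives as JSON text whose fields
differ from record to record.

```python
import json

# Hypothetical semi-structured records: the fields vary from record to record.
raw_records = [
    '{"id": 1, "name": "Ana", "tags": ["vip", "mobile"]}',
    '{"id": 2, "name": "Ben", "contact": {"email": "ben@example.com"}}',
    '{"id": 3}',
]


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, using dotted keys for nesting."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # A list does not fit a single column; joining it as text is a lossy choice.
            flat[name] = ",".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def to_table(json_lines):
    """Return (columns, rows); a cell is None where a record lacks that column."""
    flat_records = [flatten(json.loads(line)) for line in json_lines]
    columns = sorted({key for rec in flat_records for key in rec})
    rows = [[rec.get(col) for col in columns] for rec in flat_records]
    return columns, rows


columns, rows = to_table(raw_records)
print(columns)
for row in rows:
    print(row)
```

Even in this small case the conversion forces decisions (how to store a list, what to do with
missing fields) that a structured source would not require, and each decision can discard
information.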

Data quality is of utmost importance in any sort of data analysis, including big data
analysis. Poor data quality eventually leads analytics to the wrong conclusions. If
businesses decide their strategies based on the results of big data analytics while ignoring
the fact that the underlying data is of inadequate quality, they are likely to suffer losses.

3.1.1.1 Data Sources


Information within big data comes from several different sources. Big data aims to contain
every piece of information that a population may have and thus eliminate the need for
sampling. In reality, because obtaining every possible data point for a whole population is
impossible, the process of collecting and selecting big data sources is still a sampling
activity. A large data size does not automatically mean that the sample is a complete
representation of the actual population (Kaplan, Chambers, & Glasgow, 2014). Due to variety,
volume and limited processing capability, researchers are required to choose sources to match
their research purpose (see the two methods of data processing in Section 3.1). Each of these
sources may have its own characteristics and may inadvertently introduce various sampling
parameters. Disregarding this fact may expose researchers to selection bias. Selection bias
is the selection of data for analysis in such a way that proper randomization is not
achieved, resulting in a sample that is not representative of the actual situation.
Furthermore, big data is highly dependent on technology, so the intensity of this selection
bias is dictated by the intensity of the technology divide. Additionally, considering the
size of big data, special care must be taken with the sampling model, as a large sample size
magnifies bias (Kaplan, Chambers, & Glasgow, 2014). Some sources argue that big data can
replace random sampling, giving mobile phone records as an example: they can be used to
predict the socioeconomic level of a region, given that mobile phones have high global
penetration and are accessible even to those with very low income (Hilbert, 2016). However,
it can be argued that in this case, for the purpose of the study, mobile phone users make up
a rather representative sample, given mobile phones' penetration and usage rate (75% even
among those with very low income). This only shows that the data source must correspond with
the purpose of the study for big data analytics to reach a reliable conclusion.
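
The point that a large but selectively collected data set can be less representative than a
small random sample can be illustrated with a short simulation. This is a minimal sketch
under stated assumptions, not an analysis from the reviewed literature: the population, the
attribute being measured and the inclusion rule in `in_big_source` are all invented for
illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: one numeric attribute per person (e.g. an income index).
population = [rng.gauss(50, 10) for _ in range(50_000)]


def in_big_source(value):
    """Assumed inclusion rule: people above 50 are three times as likely to be captured."""
    return rng.random() < (0.9 if value > 50 else 0.3)


# The "big" source captures most of the population, but not at random.
big_sample = [value for value in population if in_big_source(value)]
# The small sample is drawn at random from the whole population.
random_sample = rng.sample(population, 1_000)

true_mean = statistics.mean(population)
big_error = abs(statistics.mean(big_sample) - true_mean)
random_error = abs(statistics.mean(random_sample) - true_mean)

print(f"big source:    n={len(big_sample)}, error of mean={big_error:.2f}")
print(f"random sample: n={len(random_sample)}, error of mean={random_error:.2f}")
```

Under these assumptions the much larger source estimates the population mean less accurately
than the small random sample, because adding more records from a skewed source does not
remove the skew.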

Self-selection bias may also affect big data's representativeness. Some data sources,
particularly those on the Web and social media (which are used to gain insight into human
behavior), may be riddled with this bias, simply because information is provided by Web users
voluntarily, without supervision from researchers (Tufekci, 2014).

The lack of supervision during data collection also means that problems such as entry
mistakes, measurement errors, and missing or inaccurate information may be present in the
data (Liu, Li, Li, & Wu, 2016). Data may also be incomplete. Data records of events, for
example, are almost always only partial (Price & Ball, 2014). Since big data relies on
technology, poorly designed data mining tools and practices can produce fragmented
information (Pilkington, 2015). Incomplete data can cause endogeneity. Endogeneity occurs
when a causal relationship is established between the wrong factors because the researcher
does not know that the actual causal factor is not recorded in the research sample.
Endogeneity introduces a causal fallacy into the data analysis process, because the wrong
cause is identified. This means that incomplete data may eventually render the whole analysis
result unusable.
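
The situation described above, in which an unrecorded factor drives two recorded ones, can be
sketched with simulated data. The variables `hidden`, `x` and `y` and the helper functions
are hypothetical; the sketch assumes, purely for illustration, that `x` has no effect on `y`
at all and that both depend on the hidden factor.

```python
import math
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, zs):
    """Remove the linear effect of zs from ys using a simple least-squares fit."""
    n = len(ys)
    mean_y, mean_z = sum(ys) / n, sum(zs) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]


n = 5_000
hidden = [rng.gauss(0, 1) for _ in range(n)]   # the true cause, assumed not recorded
x = [h + rng.gauss(0, 0.5) for h in hidden]    # recorded; does not influence y
y = [h + rng.gauss(0, 0.5) for h in hidden]    # recorded

naive_r = pearson(x, y)                        # what an analyst sees without `hidden`
adjusted_r = pearson(residuals(x, hidden), residuals(y, hidden))

print(f"correlation of x and y:                 {naive_r:.2f}")
print(f"correlation after removing hidden part: {adjusted_r:.2f}")
```

An analyst who only has `x` and `y` sees a strong relationship and may read it as causal; the
relationship largely disappears once the hidden factor is accounted for, which is only
possible if that factor was recorded in the first place.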

3.1.1.2 Unstructured and Semi-Structured Data


According to Cai and Zhu, unstructured data amounts to about 80% of all data in existence
(Cai & Zhu, 2015). While sources disagree on the exact percentage, it is understood that the
amount of unstructured and semi-structured data is quite significant (Grimes, 2008).

Structured data poses fewer problems to researchers during data processing. Since
conventional relational databases have a hard time dealing with unstructured data, such data
must first be processed into a form that can be easily understood by the analytical systems.
Machine analysis algorithms expect homogeneous data and cannot understand nuance. In
consequence, data must be carefully structured as a first step in (or prior to) data analysis
(Lawal, 2016). Unstructured data comprises several different file types: images, sound
recordings, text files, and videos. These forms cannot be fitted into a relational database.
Any data in a form that cannot be fitted into relational columns must be processed into
metadata (Blumberg & Atre, 2003) before it can be analyzed. During this conversion, whatever
information is picked out of the whole can lose its context.

In relation to its sources and the big data gathering process, unstructured data may also
suffer from problems stemming from the lack of supervision during data gathering (see Section
3.1.1.1). These errors may not be detected by researchers, particularly if the data has
already gone through processing and researchers do not have access to the raw data.

3.1.1.3 The Combining of Sources, Data Repurposing and Data Transfer


In big data, repurposing data has been considered a way to gain more value from data sets.
This may also present a problem. In traditional data analysis, data is gathered with a
purpose in mind, and the quality and dimensions of the data are matched to this purpose. With
data repurposing, it is likely that some parts of the data required for the new purpose may
be missing, since the situational constraints are different. Representativeness of the data
is therefore a problem when data used in research is repurposed (Liu, Li, Li, & Wu, 2016).

It is also possible for researchers to integrate different sources of data to create a
complete picture, in order to solve the problem of incomplete data mentioned above. Different
data sources may cause conflicts between data points, and sometimes inconsistent or
contradictory phenomena appear among data from different sources (Cai & Zhu, 2015). There is
also the risk of overwriting data points during the consolidation process if the system used
to perform this task deems them similar enough. Combining different data sources also
increases the chance of measurement errors being present in the data and may cause incidental
endogeneity (Fan, Han, & Liu, 2013).
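
The risk of silently overwriting data points during consolidation can be illustrated with a
small sketch. The two sources, their customer keys and the `merge_sources` helper are
hypothetical; the sketch shows one possible policy (flag a conflict and leave the value
unresolved) and is not a method prescribed by the cited authors.

```python
def merge_sources(source_a, source_b):
    """Merge two keyed record sets, reporting conflicts instead of overwriting values."""
    merged = {}
    conflicts = []
    for key in sorted(set(source_a) | set(source_b)):
        record_a = source_a.get(key, {})
        record_b = source_b.get(key, {})
        combined = dict(record_a)
        for field, value in record_b.items():
            if field in record_a and record_a[field] != value:
                # The sources disagree: record the conflict and leave the field unresolved.
                conflicts.append((key, field, record_a[field], value))
                combined[field] = None
            else:
                combined[field] = value
        merged[key] = combined
    return merged, conflicts


# Hypothetical customer records held by two different systems.
crm = {"c1": {"city": "Singapore", "age": 34}, "c2": {"city": "Hanoi"}}
web = {"c1": {"city": "Singapore", "age": 43}, "c3": {"city": "Seoul"}}

merged, conflicts = merge_sources(crm, web)
print(merged)
print(conflicts)
```

A merge that simply let the second source win would have replaced the age of customer `c1`
without leaving any trace; surfacing the conflict at least makes the inconsistency visible to
the analyst.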

The process of repurposing, combining sources and reformatting data may cause data to lose
its original context. The complexity of big data may also encourage researchers to focus on
the data itself and not on the context of the data points, even when context is very
important for drawing valuable conclusions. This is what happened to Microsoft when it used
big data to design the layout of Windows 8, which its customers disliked. Microsoft followed
data analysis that showed trends in customer behavior and designed the layout so that
customers would have an easier time using it, basing its decisions on the data trends without
taking customers' motivation (the context of their behavior) into account. Using big data
without understanding its context can lead to costly mistakes.

Data transfer is inevitable when dealing with any type of data. However, it is more of a
problem with big data due to its size and speed. Data transfer is very dependent on the
communication system used. During transfer, data quality may be compromised if the
communication system is unreliable. Given a large enough volume of data being transferred,
the loss of parts of the data may become inevitable. This loss may eventually contribute to
the incomplete data problem mentioned earlier in this paper.

3.1.2 Data Analysis Process


Big data analytical processes tend to focus on correlation rather than causation. This is
because there are more methods for finding correlation than for finding causation.
Correlations are also faster and cheaper to produce, while establishing causal relations
requires a more laborious process (Cowls & Ralph, 2015). Since big data depends highly on
technology, it becomes impossible for humans to keep up with the data stream. The approach
researchers use to deal with this data is therefore different. Sometimes in big data
processing, statistical algorithms are used to find patterns without prior theories or
hypotheses. This is done to detect trends that researchers would otherwise not expect.
Relying on this method means relying on correlation. According to Mike Cafarella, the
practice is risky: the size of big data means that there are possibly billions of
relationships, and therefore a higher chance that spurious correlations occur, potentially
affecting the validity of the findings (Liu, Li, Li, & Wu, 2016).
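
The link between the number of relationships examined and the chance of spurious
correlations can be illustrated with purely random data. This is a minimal sketch: the number
of variables, the number of observations and the 0.5 threshold are arbitrary assumptions
chosen for illustration, and every variable is generated independently, so any correlation
found is spurious by construction.

```python
import itertools
import math
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


n_variables, n_observations = 100, 20

# Every variable is independent random noise: no real relationship exists.
variables = [[rng.gauss(0, 1) for _ in range(n_observations)]
             for _ in range(n_variables)]

pairs = list(itertools.combinations(range(n_variables), 2))
strong = [(i, j) for i, j in pairs
          if abs(pearson(variables[i], variables[j])) > 0.5]

print(f"pairs examined: {len(pairs)}")
print(f"pairs with |r| > 0.5 despite no real relationship: {len(strong)}")
```

With only 100 variables there are already 4,950 pairs to compare, and some of them appear
strongly correlated by chance alone. A search over billions of relationships without a prior
hypothesis faces the same problem at a much larger scale.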

3.2 Limited Talents for Big Data Analysis


In order to process and analyze big data efficiently, certain skills and related talents are
required. However, qualified experts who possess both the skills and the talents are rare.
Three main groups of skills are essential for working with big data: IT skills, statistical
skills, and other skills that include creative problem solving and the ability to deal with
data governance and ethics (UN Global Working Group on Big Data for Official Statistics,
2015).

IT skills include the capability to use IT tools such as SQL (Structured Query Language)
databases, NoSQL databases and Hadoop. SQL is a special-purpose programming language designed
to manage data held in relational database management systems, while NoSQL databases store
data in non-relational forms (Minelli, Chambers & Dhiraj, 2012). Hadoop is an open-source
software framework for storing, processing and analyzing data. It has been revolutionary in
the area of computer science and is one of the popular technologies in the big data world
(O'Driscoll, Daugelaite & Sleator, 2013). These tools are designed for specialist users who
can deal with the large volumes of data gathered for processing and analysis. Along with IT
skills, statistical skills are also important for using big data effectively. Statistical
skills are described as the capability to deal with the methodology and standards for
processing big data and data mining. They include statistical algorithms, machine learning
methods, linear algebra, signal processing, data mining, text mining, graph mining, video
mining, and visual analysis (Gobble, 2013).

Along with these skills, certain talents need to be considered to achieve efficient use of
big data. Big data talent is divided into deep analytical talent, data-savvy managers and
analysts, and supporting technology personnel (Manyika, Brown, Bughin, Dobbs, Roxburgh, &
Byers, 2011). Deep analytical talent refers to people with advanced training in statistics
and machine learning who are capable of dealing with large volumes of data to derive business
insights. Data-savvy managers and analysts are those who have basic knowledge of statistics
and machine learning. They can define the key questions that data can answer: they raise the
proper questions for analysis, interpret and challenge the results, and make suitable
decisions. Supporting technology personnel are those who develop, implement and maintain the
hardware and software tools needed to use big data, including databases and analytic
programs.

However, finding people with these skills and talents is difficult in practice because they
are in short supply. To use big data efficiently, new methods for data collection,
integration and analysis must be developed. Experts versed in these new methods are required,
yet the aim is for big data methods and procedures to be usable not only by a few experts but
by a wide range of users. Current big data technologies, however, require considerable
knowledge of systems programming or parallel programming, which restricts the user group to
experts. Data analysis therefore remains limited to an expert group who possess programming
knowledge as well as skills in data analysis and machine learning (Schermann, Hemsen,
Buchmüller, Bitter, Krcmar, Markl, & Hoeren, 2014). To overcome the difficulty of hiring
so-called data scientists, who are qualified to process big data and extract value from the
gathered data, many firms instead train existing staff, hire short-term consultants, or
cooperate with universities and research institutes. In addition, participating in big data
conferences and workshops, inviting big data experts for in-house seminars, and self-learning
are used in practice to improve the big data skills of existing staff. These are discrete
efforts to implement big data in business; however, none of them can ensure that big data is
used successfully and rationally.

To make sure that data processing gives reliable results, companies need a person who
possesses all of the previously mentioned skills and talents. While there are many
individuals who have some of these skills, there are fewer people who know the entire picture
of processing big data and extracting value from it. This means that talent for big data
processing is very rare and that developing it is necessary to overcome the shortage.

3.3 Big Data Issues Related to Inequalities


The big data phenomenon creates new infrastructure needs along with economic costs. Three
main elements of big data infrastructure are necessary to benefit from big data, and as a
result only certain geographies can actually take advantage of it.

3.3.1. Big data infrastructure


A robust physical infrastructure is central to big data operation and scalability. It is
based on a distributed model, where data can be stored in different places and integrated
through networks. Three elements are required to achieve a big data infrastructure.

Firstly, telecommunication quality is an essential condition for opening a gateway to big
data. It allows the information capacity accumulated in storage and computational devices to
be turned into an equivalent accessible information capacity, and it depends on the social
ownership of telecommunication access (Hilbert, 2016). Over the past two decades,
telecommunication access has increased and diversified significantly. In the analogue era,
telecom subscriptions were fixed-line phones with similar levels of performance. Today,
however, there are countless different telecommunication subscriptions with a diverse
performance range. Along with high-quality telecommunications, a special server architecture
is necessary, and together both elements provide a potential gateway to the big data cloud. A
special server architecture makes it possible to perform the intensive analytical tasks that
big data requires. It comprises thousands of nodes with multiple processors and disks,
connected by a high-speed network and working in a distributed way (Luna, Mayan, Almerares &
Househ, 2014). Well-known internet companies including Google, Microsoft, Yahoo and Amazon
use this architecture to provide their services, with centers distributed throughout the
world. Besides the hardware infrastructure, an additional component, software, is required to
implement big data effectively. As mentioned in the discussion of IT skills for big data
operation, several software tools can be used; however, most of them require significant
investment in purchasing and licensing, as well as a properly trained workforce.

Fig.1 Spending and employees of software and computer services across 42 countries (as % of
respective total)

Fig. 1 shows the share of software and computer service spending in total ICT (Information
and Communications Technologies) spending, and the share of software and computer service
employees in total employees, for 42 countries. The size of each bubble represents total ICT
spending per capita. Larger bubbles are associated with both more software experts and more
software spending. This shows that countries that are behind in ICT spending in absolute
terms (including hardware infrastructure) also have less capacity in software and computer
services in relative terms (Hilbert, 2016). For example, Finland has a relatively larger
national workforce specialized in software and computer services than Mexico. This implies
that many developing countries lack the physical communication infrastructure and software
needed to organize, integrate and analyze the amount of information they generate.

3.3.2 Widened digital divide


Different economies and regions show different characteristics, from the amount of data they
generate to the maturity of their ICT infrastructure (MGI, 2010). This implies that some
geographies might be able to create value from big data more quickly than others. North
America and Europe are the regions where the majority of new data is stored, which suggests
that they will be the leading geographies for creating value through the use of big data.
Access to big data is a key prerequisite for capturing value, and most developed economies
can access big data more easily than emerging markets. In other words, the majority of the
data that can be accessed is generated in those specific geographies, which brings the risk
that the information the general public receives is biased.

Fig.2 Amount of new data stored varies across geography

Fig. 2 shows the amount of new data stored by geography. Compared with Europe and North
America, developing economies including China, India and other areas still show a relatively
low amount of new data stored. Eventually, this widens the digital divide between developed
and emerging economies.

3.4 Big data privacy and security challenges


3.4.1 Big data privacy
With the development of technology leading to a proliferation of devices connected to the
Internet and to each other, the volume of data collected, stored, and processed is increasing
every day, which also brings new challenges in terms of information security and privacy for
big data users (Lafuente, 2015). Because big data is such an important and complex topic, it
is almost natural that immense security and privacy challenges will arise (Michael & Miller,
2013).
The Cloud Security Alliance (CSA), a non-profit organization, has categorized the different
security and privacy challenges into four aspects of the Big Data ecosystem (Moura & Serrão,
2015): Infrastructure Security, Data Privacy, Data Management and Integrity, and Reactive
Security. According to CSA, each of these aspects faces its own security challenges.
Additionally, these security and privacy challenges cover the entire spectrum of the Big Data
lifecycle (Figure A): sources of data production (devices), the data itself, data processing,
data storage, data transport and data usage on different devices (Cloud Security Alliance,
2013).

Figure A: Security and Privacy challenges in Big Data ecosystem

Due to the large amount of data being stored in databases and the opportunities available to
those who exploit it, big data carries significant security, privacy, and transfer risks that
are real and will continue to escalate. There is also the potential for a larger amount of
personal information, especially sensitive data, to be leaked once a breach happens (Bell,
Rotman, & VanDenBerg, 2014). With respect to big data, privacy is the elephant in the room.
Companies that effectively leverage big data often face a great deal of media scrutiny, PR
issues, and vocal individual detractors.

Consumers (who provide the data) become most irritated when their data is collected by
stealth, since they have no idea how it is being used and, more importantly, how they benefit
from it (Bertolucci, 2013). Take Google as an example. In May 2012, Google itself ran into
trouble. Although the company's Street View software had been successful in the few years
since its introduction, it was later discovered by the US government to have been furtively
collecting data on open Wi-Fi networks. The software in Google's Street View mapping cars was
designed to collect Wi-Fi payload data, which was then transferred to a storage facility in
Oregon. In a separate matter, in August 2012 the Federal Trade Commission (FTC) fined Google
$22.5 million to settle charges that it had bypassed privacy settings in Apple's Safari
browser in order to track users of the browser and show them advertisements, and that it had
thereby violated an earlier privacy settlement with the agency. At the time, the fine
represented the largest civil penalty ever levied by the commission (Womack & Shields, 2012).

Unlike the Google Wi-Fi sniffing issue described earlier, sometimes privacy controversies stem
not from the direct actions of platform companies, but from the actions of their partners,
external developers, and third-party apps (Simon, 2013). For instance, Apple iPhone customers
can download more than 600,000 apps in the App Store as of this writing (and growing). There is
no easy way for the company to prevent violations of its terms of service. One can only imagine
the privacy breaches taking place on much more open platforms such as Android. No one can
completely answer the question, "What are our partners doing with the data they collect?"
(Simon, 2013).

Therefore, it is fair to say that limited disclosure of personal data usage, or no disclosure at
all, usually creates distrust among customers, and that is something big data advocates should
strive to avoid. Moreover, consumer trust is based upon reputation, and a firm cannot
maintain its brand reputation if it is losing data. The goal of data usage should be to deliver
benefit to both the collector of information (typically a business or government) and the
provider (Bertolucci, 2013).

3.4.2 Big data security


Organizations nowadays have to worry about data pirates because Big Data attracts big interest
from hackers, especially when that data is so complete and personal. The bigger the volume of
the data, the bigger the target it becomes for hackers. A stark example is Zappos, an
online shoe and clothing retailer owned by Amazon.com. In January 2012, the company
informed its 24 million customers that its site had been hacked and that some customers'
personal details and account information had been stolen; however, the company believed that no
credit or debit card information had been accessed by the attackers. Five months later, a hacker
stole and posted 8 million usernames and passwords from LinkedIn, prompting a $5 million
lawsuit. With nearly 200 million registered users as of this writing, LinkedIn counts among the
many companies that use and generate a boatload of valuable data (McGarry, 2012).

Big data contains large amounts of personally identifiable information (PII), so the privacy of
users is a huge concern. Protecting data providers is also the biggest security challenge for
data collectors. This is because a big data security breach will potentially affect a much
larger number of people, with consequences not only for reputation but also in the form of
enormous legal repercussions.

3.5 Big Data's Potential to be Misleading


With the rise of blogs and social networking, a new way to gather information has appeared.
Combined with cloud computing, networking and other technologies, data is growing at an
unprecedented rate, with constant growth and accumulation. This volume of big data, however,
is not accompanied by an in-depth understanding of the data. The sheer amount of data
gathered makes data analysis more challenging, and thus there is a risk of excessive speculation
that may mislead users of big data analytics.

3.5.1 Lack of Attention to Data Quality and Processing
Companies such as Google, Facebook and Amazon are constantly trying to understand people's
lifestyles and behavior through the data that people produce on the internet. Edward
Snowden exposed the size and scope of the US government's data monitoring; it is clear that the
security sector is also obsessed with our daily data and wants to dig something out of it (BBC,
2014). In 2011, the McKinsey Global Institute calculated that if government could better
integrate and analyze data ranging from clinical trials to medical insurance records in smart
databases, the US health insurance system could save 300 billion dollars per year, or about
$1,000 per American (Peter, Basel, David, & Steve, 2013). However, at present, data analysis
does not always produce reliable results. For example, Google Flu Trends is an application
based on real-world data; it was presented as fast, accurate, low cost, and free of any need
for theory (Naughton, 2014). Google engineers took search keywords such as "flu symptoms" or
"pharmacies near me", then let the algorithm make the selection. This method did not produce
reliably accurate results all the time, for example when facing a serious flu outbreak. When
the predicted results were compared with the actual situation, it was found that the analysis
exaggerated the situation. The root of the problem is that when Google analysed the data, it
merely found statistical patterns in the data (Ross, 2011). It was more concerned with the
correlation itself than with the underlying causes. Results from this kind of pure correlation
analysis are inevitably fragile and can mislead. One further explanation of why Google Flu
Trends misled people is that after seeing reports generated from big data, even healthy people
search the internet, boosting the hits for related terms in the search engine. Another
explanation is that Google's own search algorithm recommends search phrases, thereby affecting
users' searches and browsing behavior.
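The fragility of pure correlation analysis can be illustrated with a small simulation. The sketch below is not Google's method; it is a minimal illustration under invented assumptions, in which both the "flu" series and every "search term" series are random noise. The function names, the number of candidate terms and the sample sizes are all hypothetical. Choosing the best-correlated term out of many candidates produces an impressive in-sample correlation that largely vanishes on new data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def run_selection_demo(n_terms=2000, n_train=20, n_test=200, seed=0):
    """Pick the best-correlated 'search term' on a short training window,
    then check how well it tracks the target on later, unseen weeks."""
    rng = random.Random(seed)
    total = n_train + n_test
    # Hypothetical weekly flu signal: pure noise by construction.
    flu = [rng.gauss(0, 1) for _ in range(total)]
    # Hypothetical search-term volumes: also pure noise, unrelated to flu.
    terms = [[rng.gauss(0, 1) for _ in range(total)] for _ in range(n_terms)]

    # Selection step: keep whichever term correlates best in the training window.
    best = max(terms, key=lambda t: pearson(t[:n_train], flu[:n_train]))

    train_r = pearson(best[:n_train], flu[:n_train])
    test_r = pearson(best[n_train:], flu[n_train:])
    return train_r, test_r


if __name__ == "__main__":
    train_r, test_r = run_selection_demo()
    print(f"correlation on the training weeks: {train_r:.2f}")
    print(f"correlation on the later weeks:    {test_r:.2f}")
```

Because nothing in the simulated data is causally linked, any strong correlation found by the selection step is an artefact of searching through many candidates, which is one reason a model built on correlation alone can break down when conditions change.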

The issues exemplified by Google's case are faced by any company that uses big data for
analysis. Reliance on technology only assists data processing and quality control to a certain
extent. Statisticians have spent the past 200 years cataloguing the various traps in the
process of understanding data. Nowadays, data is larger, updated faster, and cheaper to acquire.
But we cannot pretend that these traps have been filled in; the fact is that they are still there.

Although big data looks promising in the eyes of entrepreneurs and governments, if they
ignore the established lessons of statistics, big data may be destined to mislead people.
Professor Spiegelhalter said: "There are a large number of small data issues in big data. These
problems do not disappear as the amount of data increases; they only become more
prominent." (Spiegelhalter, 2014).

3.5.2 Misleading Assumptions Related to Big Data


One of the most famous definitions of big data is "N = all", proposed by Viktor
Mayer-Schönberger, co-author of Big Data. The argument is that sampling is no longer
needed, because we are capable of gathering the data for the entire population
(Mayer-Schönberger & Cukier, 2013). There is no longer a need to select a few representatives
as a sample to estimate results for any purpose, such as elections, because the system is
able to record all information. When N = all, there is no sampling bias problem,
because the sample already contains everyone. However, this formula ignores the fact that it is
almost impossible to gather all data (Wolfe, 2015). As an example, theoretically we can store
and analyse every record on Twitter and use the results to reach conclusions about public
emotions and behavior. However, even if we can read every tweet, Twitter users
themselves cannot represent everyone in the world. According to the results of the Pew
Research Internet Project in 2014, Twitter users are predominantly aged between 18 and 49
(PewResearchCenter, 2014).

Source: (PewResearchCenter, 2013)
Figure 1: Data was limited by users' age and scope.

"N = all" is an assumption about big data, rather than a reality. Likewise, there are growing
questions about the not-so-minor matter of the representativeness of much Big Data (Esmar, 2014).
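The representativeness problem can be made concrete with a short simulation. The sketch below is illustrative only: the population size, the share of young people, the opinion rates and the platform uptake rates are all invented assumptions, not figures from the Pew data. It compares an estimate built from every user of a hypothetical platform that skews young with an estimate from a much smaller random sample of the whole population.

```python
import random


def simulate(seed=1, population_size=100_000, random_sample_size=1_000):
    """Compare 'all platform users' with a small random sample as estimators
    of how common an opinion is in the whole population."""
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        young = rng.random() < 0.5
        # Assumed opinion rates: 70% of young people, 30% of older people.
        holds_view = rng.random() < (0.7 if young else 0.3)
        # Assumed platform uptake: 60% of young people, 10% of older people.
        on_platform = rng.random() < (0.6 if young else 0.1)
        population.append((holds_view, on_platform))

    true_rate = sum(h for h, _ in population) / population_size

    # "N = all" in the platform sense: every user, but only users.
    platform_users = [h for h, on in population if on]
    platform_rate = sum(platform_users) / len(platform_users)

    # A far smaller sample, drawn at random from the whole population.
    sample = rng.sample(population, random_sample_size)
    sample_rate = sum(h for h, _ in sample) / random_sample_size

    return true_rate, platform_rate, len(platform_users), sample_rate


if __name__ == "__main__":
    true_rate, platform_rate, n_users, sample_rate = simulate()
    print(f"true rate in population:            {true_rate:.3f}")
    print(f"estimate from {n_users} platform users: {platform_rate:.3f}")
    print(f"estimate from 1000 random people:   {sample_rate:.3f}")
```

Under these assumptions the platform data set is tens of times larger than the random sample, yet its estimate is further from the truth, because size does not correct for who is missing from the data.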

Professor David Hand said, "We have some new data sources nowadays, but no one wants data;
what people want is an answer." (Harford, 2014). Everyone wants to use every possible tool to
find something valuable in big data. Big data has a lot of potential, but at the same time it
is riddled with various problems because of its characteristics. However, people easily
overestimate the accuracy of the results produced by the algorithms used to process it, because
they are so excited at the prospect of gaining any insight. This leads them to wrong
conclusions and misguided decisions.

Nowadays, people in sales and marketing who do not know data analysis are considered
outdated. Many business leaders demand to be shown data before they make decisions. Data
analysis has therefore become one of the most promising careers of the coming decade.

However, data can be used to bluff and mislead people. For example, during the war between
the United States and Spain, the death rate in the US Navy was nine per thousand. Over the same
period, the death rate among civilians in New York was sixteen per thousand, from various
causes. Navy recruiters later used these figures, comparing one with the other, as proof that
joining the Navy was relatively safe. Of course the conclusion is not correct; the two figures
are simply not comparable. The sailors were able-bodied young men, while the New York mortality
rate included sick and elderly residents. For a like-for-like comparison, the Navy's rate
should be compared with that of residents of the same age. In this example the misdirection was
intentional, because the recruiters were aware of the differences, but big data may
cause the same effect if not processed carefully. Misleading big data results will cause wasted
resources and decision-making errors.
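The like-for-like comparison can be worked through with a few lines of arithmetic. In the sketch below, every population size and age-group split is invented for illustration; only the idea of comparing a crude rate with an age-matched rate comes from the example above.

```python
def rate_per_thousand(population, deaths):
    """Deaths per thousand people in a group."""
    return 1000 * deaths / population


# Hypothetical city, split by age group: (population, deaths in the period).
CITY_GROUPS = {
    "children": (300_000, 3_000),
    "working-age adults": (400_000, 2_000),
    "elderly": (300_000, 11_000),
}
# Hypothetical navy made up of healthy working-age adults only.
NAVY = (10_000, 90)

city_population = sum(p for p, _ in CITY_GROUPS.values())
city_deaths = sum(d for _, d in CITY_GROUPS.values())

# Crude rate: the whole city, including children and the elderly.
crude_city_rate = rate_per_thousand(city_population, city_deaths)
# Navy rate: one narrow, healthy age band.
navy_rate = rate_per_thousand(*NAVY)
# Matched rate: only the city residents comparable to the sailors.
matched_city_rate = rate_per_thousand(*CITY_GROUPS["working-age adults"])

if __name__ == "__main__":
    print(f"city, all ages:          {crude_city_rate:.1f} per thousand")
    print(f"navy:                    {navy_rate:.1f} per thousand")
    print(f"city, working-age only:  {matched_city_rate:.1f} per thousand")
```

With these invented numbers the navy looks safer than the city as a whole, but less safe than civilians of the same age, which reverses the recruiters' conclusion. The same reversal can occur whenever two large data sets with different compositions are compared on a single summary figure.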

3.5.3 Effects on Consumers


If a company decides what to produce based on big data analysis, and the analysis
tells it that the most profitable product is A, it might stop production of B to save
costs. Consumers may still like product B, even though the company doesn't deem it
profitable enough to continue making it. Consumers would have no say in the matter
and so would lose the power to choose.

Companies collect big data on consumer behavior to design and produce products. However,
the resulting products often still fail to satisfy customers. The reason is that there are
too many factors that could change people's preferences, such as place and time. These factors
are context, and they are not captured by big data. Big data merely summarizes reality.
Trends are determined by the inherent relationships among developments, but big data
records cannot detect these relationships. Not all of the intrinsic factors behind events are
available in digital format. At the same time, people, as one of those factors, are themselves
full of variables. In other words, big data represents knowledge, not wisdom. As such, it is
not wise to base decisions entirely on big data results. Businesses must take care to take
the whole situation into account, and not only the things that are in the data.

4. Implications
Ignoring the statistical issues that big data has, particularly regarding data quality, would
expose businesses to using the wrong analytics. The quality of data gathered for analysis
depends highly on the systems used to process, categorize and standardize the data. However,
as can be seen from the Google example above, the systems used may very well be
imperfect and therefore an inadequate replacement for the process that expert data analysts go
through. Basing decisions on these results will cause a loss of investment, at the very least
of the amount invested in the analytical process itself. Should the decisions result in
disadvantaging customers in some way (e.g., removing a less profitable but well-liked product
line), businesses will have to deal with the additional trouble of trying to regain their
customers' favor. Depending on trends discovered using big data to guide decisions may also
expose businesses to , which apart from being unethical, also inadvertently causes businesses
to lose their target market.

Since big data processes may also be used to partially lighten the workload of certain company
processes, such as hiring employees, the limitations of big data mentioned above also affect
these areas. Depending on search systems to process prospective employees' profiles, although
faster, may cause employers to miss prospective employees who have the required capability
but do not have the specific words required by the search system in their profiles.
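A toy screening filter shows how this can happen. The sketch below is a deliberately simple, hypothetical keyword matcher, not a description of any real recruitment system; the candidate profiles and required keywords are invented.

```python
def keyword_screen(profiles, required_keywords):
    """Return the names whose profile text contains every required keyword."""
    selected = []
    for name, text in profiles.items():
        lowered = text.lower()
        if all(keyword.lower() in lowered for keyword in required_keywords):
            selected.append(name)
    return selected


# Hypothetical candidate profiles.
PROFILES = {
    "Candidate A": "Data scientist with Hadoop and machine learning experience",
    "Candidate B": "Statistician who built large-scale predictive models "
                   "on distributed clusters",
}

shortlist = keyword_screen(PROFILES, ["hadoop", "machine learning"])

if __name__ == "__main__":
    print(shortlist)
```

Candidate B describes comparable skills in different words and is dropped by the exact-match filter, so the speed gained from automated screening comes at the cost of overlooking capable applicants.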

Additionally, given the security and privacy issues regarding big data mentioned
above, businesses have to pay attention to these issues regardless of their position (whether as
a data source, for example social media sites, or as a data user). Since companies need to
transfer data for their analytics purposes, they need to consider the security of the data supply
chain in their decisions on utilizing big data. Considering that there is demand for big data,
this may also turn data into a commodity to be traded and sold. Considering that
the data collection process used by some companies may be morally grey, it is important that
businesses start paying attention to this issue as well, since growing awareness of the
matter will no doubt affect how businesses can acquire and process their data.

5. Conclusions
While big data has a lot of potential, it also introduces a range of challenges that businesses
must first solve before they will be able to utilize big data effectively. Businesses must keep in
mind that the utilization of big data is highly dependent on technology due to its
characteristics (volume, variety and velocity), and this affects the analytical processing and
collection of big data, as well as its storage and transfer.

The digital divide, infrastructure inequality, and the reliability of data storage and transfer
systems ultimately affect the representativeness of samples, as well as the completeness of the
data gathered. This affects the quality of data used in big data analytics and the usability
of the results. Ignoring these issues, or putting too much trust in the imperfect data analytics
systems currently in place, would lead businesses to costly mistakes and eventually
disadvantage all stakeholders involved. Additionally, this dependence on technology also
introduces privacy and security issues for businesses to consider, much more so than before,
especially considering the potential richness of the data involved.

Many of the challenges mentioned draw attention to the need for big data talent. Ensuring
that businesses possess adequate technology and talent is important, because
only then may businesses be able to address the challenges listed in this paper. These big data
talents would be able to oversee and improve methods for big data processing, and with their
expertise, it can be hoped that the results of the analytics would be more reliable.

6. Suggestions
Considering the challenges already mentioned and their possible implications, it is
necessary to pay more attention to the following factors.

Attention to data processing and training of data scientists


Since the usefulness of the resulting analysis depends on data quality, it is important that
researchers and companies pay more attention to their data processing. Since the quality and
results of data processing depend highly on the ability and talent a company possesses,
companies should put effort into training data scientists who have the integrated capability to
deal with the various aspects of big data processing and analysis.

Regulation of data trade and brokering
In addition, to protect the privacy of consumers and prevent misuse of the personal data
collected, regulations should be put in place in a timely manner to prevent illegal data trade
and brokering.

7. References
Analysis of Big Data Survey 2015 on Skills, Training and ... (2015). Retrieved from
http://unstats.un.org/unsd/trade/events/2015/abudhabi/presentations/day2/02/Analysis_of_Big_Data_Survey_2015_on_Skills_Training_and_Capacity_Buildingv10.pdf
A Review: Issues and Challenges in Big Data from Analytics and Storage Perspective. (2016,
March). International Journal of Engineering and Computer Science, 5(3).
BBC. (2014, January 17). Edward Snowden: Leaks that exposed US spy programme. Retrieved
2016, from BBC NEWS: http://www.bbc.com/news/world-us-canada-23123964
Bell, G., Rotman, D., & VanDenBerg, M. (2014). Navigating Big Data's Privacy and Security
Challenges. USA: KPMG.

Bertolucci, J. (2013, June 18). InformationWeek. Retrieved May 22, 2016, from Privacy
Concerns: Big Data's Biggest Barrier?: http://www.informationweek.com/big-data/big-
data-analytics/privacy-concerns-big-datas-biggest-barrier/d/d-id/1110408?
Blumberg, R., & Atre, S. (2003, February). The Problem with Unstructured Data. DM Review, 42-
46.
BrightPlanet. (2012, June 28). Structured vs. Unstructured data . Retrieved May 8, 2016, from
BrightPlanet: https://brightplanet.com/2012/06/structured-vs-unstructured-data/
Buneman, P. (1997, May 1). Semistructured Data. PODS '97 Proceedings of the sixteenth ACM
SIGACT-SIGMOD-SIGART symposium on Principles of database systems, 117-121. doi:
10.1145/263661.263675
Cai, L., & Zhu, Y. (2015). The Challenges of Data Quality and Data Quality Assessment in the Big
Data Era. Data Science Journal, 14(2), 1-10.
Cowls, J., & Ralph, S. (2015). Causation, Correlation, and Big Data in Social Science Research.
Policy and Internet, 7(4), 447-472.

Cloud Security Alliance. (2013). Expanded Top Ten Big Data Security and Privacy Challenges. Big
Data Working Group.
Esmar. (2014). Big Data means market research needs to do some soul searching. Colin Strong
Thoughts and articles, 1-3.

Fan, J., Han, F., & Liu, H. (2013, August). Challenges of Big Data analysis. National Science
Review, 1(2), 293-314. doi: 10.1093/nsr/nwt032

Fourcade, M., & Healy, K. (2016). Seeing Like a Market. slam, 1-27.

Forbes Insights. (2013). THE BIG POTENTIAL OF BIG DATA: A FIELD GUIDE FOR CMOs. Retrieved
May 4, 2016, from Forbes:
http://images.forbes.com/forbesinsights/StudyPDFs/RocketFuel_BigData_REPORT.pdf.
Gobble, M. M. (2013). Big data: The next big thing in innovation. Research-Technology
Management, 56(1), 64. doi:10.5437/08956308X5601005

Grimes, S. (2008, August 1). Unstructured Data and the 80 Percent Rule. Retrieved May 10,
2016, from Breakthrough Analysis:
https://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-
rule/

Harford, T. (2014, March 28). Big data: are we making a big mistake? Retrieved from FT
magazine: http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-
00144feabdc0.html
Hilbert, M. (2016, January). Big Data for Development: A Review of Promises and Challenges.
Development Policy Review, 34(1), 135-174. doi: 10.1111/dpr.12142
Jary, D., & Jary, J. (2006). Collins Dictionary of Sociology. London, United Kingdom: HarperCollins
Publishers. Retrieved from
https://elibrary.jcu.edu.au/login?url=http://search.credoreference.com/content/entry/
collinssoc/unstructured_data/0
Kaplan, R. M., Chambers, D. A., & Glasgow, R. E. (2014, August). Big Data and Large Sample Size:
A Cautionary Note on the Potential for Bias. Clinical and Translational Science, 7(4),
342-346. doi: 10.1111/cts.12178

Lafuente, G. (2015). The big data security challenge. Network Security, 12-14.

Laney, D. (2001, February 6). Gartner Blog Network. Retrieved May 10, 2016, from Gartner:
http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-
Controlling-Data-Volume-Velocity-and-Variety.pdf
Lawal, Z. K. (2016, March 3). A review: Issues and Challenges in Big Data from Analytic and
Storage perspectives. International Journal of Engineering and Computer Science, 5(3),
15947-15961. doi: 10.18535/ijecs/v5i3.12
Liu, J., Li, J., Li, W., & Wu, J. (2016, May). Rethinking big data: A review on the data quality and
usage issues. ISPRS Journal of Photogrammetry and Remote Sensing, 115, 134-142.
doi:10.1016/j.isprsjprs.2015.11.006

Luna, D., Mayan, J. C., García, M. J., Almerares, A. A., & Househ, M. (2014). Challenges and
potential solutions for big data implementations in developing countries. Yearbook of
Medical Informatics, 9, 36.
Manyika, J., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big data: The next
frontier for innovation, competition, and productivity. Retrieved from
http://www.mckinsey.com/business-functions/business-technology/our-insights/big-
data-the-next-frontier-for-innovation
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution That Will Transform
How We Live, Work and Think. New York: Financial Times.

McGarry, C. (2012). Zappos records hacked. Las Vegas Review - Journal.

Michael, K., & Miller, K. W. (2013). Big Data: New Opportunities and New Challenges. GUEST
EDITORS' INTRODUCTION, 22-24.
Minelli, M., Chambers, M., & Dhiraj, A. (2012). What is big data and why is it important? (pp. 1-
18). Hoboken, NJ, USA: John Wiley & Sons, Inc. doi:10.1002/9781118562260.ch1
Moura, J., & Serrão, C. (2015). Security and Privacy Issues of Big Data. Portugal: Instituto
Universitário de Lisboa.
Naughton, J. (2014, April 5). Google and the flu: how big data will help us make gigantic
mistakes. Retrieved 2016, from theguardian:
https://www.theguardian.com/technology/2014/apr/05/google-flu-big-data-help-
make-gigantic-mistakes
Normandeau, K. (2013, September 12). Beyond Volume, Variety and Velocity is the Issue of Big
Data Veracity. Retrieved May 8, 2016, from insideBigData:
http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-
veracity/
O'Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). 'big data', hadoop and cloud computing in
genomics. Journal of Biomedical Informatics, 46(5), 774-781.
doi:10.1016/j.jbi.2013.07.001

Peter, G., Basel, K., David, K., & Steve, V. K. (2013). The "big data" revolution in healthcare.
McKinsey&Company, 1-17.
PewResearchCenter. (2013, December 27). Social Networking Fact Sheet. Retrieved from
PewResearchCenter: http://www.pewinternet.org/fact-sheets/social-networking-fact-
sheet/
Pilkington, J. (2015, February 12). Incomplete data: What it is, what to do about it. Retrieved
May 15, 2016, from Datawatch: http://www.datawatch.com/2015/02/12/incomplete-
data-what-it-is-what-to-do-about-it/
Prajapati, V. (2013, November). Big Data Analytics with R and Hadoop (1st ed.). Packt
Publishing Ltd .
Price, M., & Ball, P. (2014). Big Data, Selection Bias, and the Statistical Patterns of Mortality in
Conflict. SAIS Review of International Affairs, 34(1), 9-20. doi: 10.1353/sais.2014.0010
Priyadarshy, S. (2015, January). The 7 Pillars of Big Data. Petroleum Review, 34-42.

Ross, R. J. (2011). Microelectronics Failure Analysis: Desk Reference. United States: EDFAS Desk
Reference Committee.
SAS. (2016). Big Data Analytics. Retrieved March 21, 2016, from SAS:
http://www.sas.com/en_us/insights/analytics/big-data-analytics.html

Schermann, M., Hemsen, H., Buchmüller, C., Bitter, T., Krcmar, H., Markl, V., & Hoeren, T.
(2014). Big data: An interdisciplinary opportunity for Information systems research.
Business & Information Systems Engineering, 6(5), 261-266. doi:10.1007/s12599-014-
0345-1

Simon, P. (2013). Too Big to Ignore : The Business Case for Big Data. New Jersey: John Wiley &
Sons, Inc.
Spiegelhalter, P. D. (2014, 1 4). how to use big data in your advantage. (KF, Interviewer)

Wolfe, P. P. (2015). Understanding the Behavior of Large Networks. London: UCL Big Data
Institute.
Tufekci, Z. (2014, March). Big Questions for Social Media Big Data: Representativeness, Validity
and Other Methodological Pitfalls. ICWSM '14: Proceedings of the 8th International AAAI
Conference on Weblogs and Social Media.
Womack, B., & Shields, T. (2012, April 17). Google Gets Maximum Fine After Impeding Privacy
Probe. Retrieved May 22, 2016, from Bloomberg:
http://www.bloomberg.com/news/articles/2012-04-16/google-gets-maximum-fine-
after-impeding-privacy-probe
