
IDL - International Digital Library of Technology & Research
Volume 1, Issue 6, June 2017. Available at: www.dbpublications.org
International e-Journal For Technology And Research - 2017

Two-Phase TDS Approach for Data Anonymization
for Preserving Big Data Privacy

1. Ambika M Patil, M.Tech Computer Science Engineering, Center for P G Studies, Jnana Sangama, VTU Belagavi, Belagavi, INDIA, Ambika702@gmail.com
2. Assistant Prof. Ranjana B Nadagoudar, Computer Science Engineering Department, Center for P G Studies, Jnana Sangama, VTU Belagavi, Belagavi, INDIA
3. Dhananjay A Potdar, Dhananjay.potdar@gmail.com

ABSTRACT - While Big Data has gradually become a hot topic of research and business and is widely used in many industries, Big Data security and privacy have attracted increasing concern. There is an obvious tension between Big Data security and privacy and the widespread use of Big Data. Various privacy-preserving mechanisms have been developed to protect privacy at different stages of the big data life cycle (e.g., data generation, data storage, data processing). The goal of this paper is to provide a complete overview of privacy preservation mechanisms in big data and to present the challenges faced by existing mechanisms; we also illustrate the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms at each stage of the big data life cycle. This paper focuses on the anonymization process, which significantly improves the scalability and efficiency of TDS (top-down specialization) for data anonymization over existing approaches. We also discuss the challenges and future research directions related to preserving privacy in big data.

KEYWORDS - Big data, privacy, big data storage, big data processing, data anonymization, top-down specialization, MapReduce, cloud, privacy preservation.

I. INTRODUCTION

As a result of recent technological development, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other sources is increasing significantly day by day. The term Big Data reflects the trend and salient features of the data being produced from various sources. Basically, Big Data can be described by the 3Vs: Volume, Velocity and Variety. Volume denotes the huge amount of data being produced from multiple sources. Velocity concerns both how fast data are produced and collected and how fast some of the collected data change. Variety denotes their highly distributed and varied nature. The data generation rate is growing so rapidly that it is becoming very difficult to handle it using traditional methods or systems [1]. In the 3Vs model, Variety indicates the various types of data, including structured, semi-structured and unstructured data; Volume means the data scale is large; and Velocity indicates that all processes of Big Data must be quick and timely in order to maximize the value of Big Data, as shown in Fig. 1. These features, namely that Big Data handles huge amounts of data and uses various types of data, including unstructured data and attributes that were never used in the past, distinguish Big Data from traditional data mining.

In 2011, IDC defined big data as follows: "big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis" [2]. In this definition, the features of big data may be abridged as 4Vs, i.e., Variety, Velocity, Volume and Value, where the implications of Variety, Velocity and Volume are the same as in the 3Vs model and Value refers to the great social value of big data. The 4Vs model has been widely recognized because it highlights the most critical problem: how to discover value from enormous, varied, and rapidly generated datasets in big data.



FIGURE 1. Illustration of the 3 Vs of big data.

Although big data can be used effectively to better understand the world and to innovate in various aspects of human activity, the exploding amount of data has increased the potential for privacy breaches of individuals. For example, Amazon and Google can learn our shopping preferences and browsing habits. Social networking sites such as Facebook store all the information about our personal lives and social relationships. Popular video sharing websites such as YouTube recommend videos based on our search history. With all the power brought by big data, the gathering, storing and reusing of our personal information for the purpose of commercial profit has become a threat to our privacy and security. In 2006, AOL released 20 million search queries of 650,000 users, with AOL ids and IP addresses removed, for research purposes; however, it took researchers only a couple of days to re-identify the users. Users' privacy may be breached under the following circumstances [3]:

I. Personal information, when combined with external datasets, may lead to the inference of new facts about the users. Those details may be confidential and not supposed to be exposed to others.
II. Personal facts are sometimes collected and used to add value to a business. For example, an individual's shopping habits may disclose a lot of personal information.
III. The sensitive data are stored and processed in a location that is not properly secured, and data leakage may occur during the storage and processing phases.

In order to safeguard big data privacy, numerous mechanisms have been developed in recent years. These mechanisms can be grouped based on the stages of the big data life cycle, i.e., data generation, data storage, and data processing. In the data generation phase, access restriction and data falsification techniques are used to protect privacy: while access restriction techniques try to limit access to individuals' private data, data falsification techniques alter the original data before they are released to a non-trusted party. The approaches to privacy protection in the data storage phase are mainly based on encryption, which can be further divided into attribute-based encryption (ABE), identity-based encryption (IBE), and storage path encryption. In addition, to protect sensitive information, hybrid clouds are used, where sensitive data are stored in a private cloud. The data processing phase includes privacy-preserving data publishing (PPDP) and knowledge extraction from the data. In PPDP, anonymization techniques such as generalization and suppression are used to protect the privacy of data; ensuring the utility of the data while preserving privacy is a great challenge in PPDP. In the knowledge extraction process, several mechanisms exist to extract useful information from large-scale and complex data. These mechanisms can be further divided into clustering, classification and association rule mining based techniques: while clustering and classification split the input data into different groups, association rule mining based techniques find useful relationships and trends in the input data.
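To make the generalization and suppression techniques just mentioned concrete, here is a minimal sketch (our own illustration, not code from this paper or any PPDP library; the age/ZIP attributes and the coarsening rules are assumptions): it generalizes an exact age into a decade range, suppresses the tail of a ZIP code, and then checks k-anonymity by counting how often each quasi-identifier combination occurs.

```java
import java.util.*;

// Minimal PPDP sketch: generalize quasi-identifiers, then verify k-anonymity.
// The record layout (age, zip) and the coarsening rules are illustrative only.
public class KAnonymityDemo {

    // Generalization: replace an exact age with a coarser interval.
    static String generalizeAge(int age) {
        int lo = (age / 10) * 10;                 // e.g. 37 -> "30-39"
        return lo + "-" + (lo + 9);
    }

    // Suppression: hide the trailing digits of a ZIP code, e.g. "56001" -> "560**".
    static String suppressZip(String zip) {
        return zip.substring(0, 3) + "**";
    }

    // A table satisfies k-anonymity if every quasi-identifier combination
    // is shared by at least k records.
    static boolean isKAnonymous(List<String> quasiIds, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String q : quasiIds) counts.merge(q, 1, Integer::sum);
        return counts.values().stream().allMatch(c -> c >= k);
    }

    public static void main(String[] args) {
        int[] ages = {34, 37, 31, 62, 68};
        String[] zips = {"56001", "56004", "56008", "56101", "56107"};
        List<String> anonymized = new ArrayList<>();
        for (int i = 0; i < ages.length; i++)
            anonymized.add(generalizeAge(ages[i]) + "/" + suppressZip(zips[i]));
        System.out.println(anonymized + " 2-anonymous? " + isKAnonymous(anonymized, 2));
    }
}
```

Running it, the five generalized records fall into two equivalence groups of sizes 3 and 2, so the table is 2-anonymous.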
FIGURE 2. Illustration of big data life cycle.

Protecting privacy in big data is a fast-growing research area. Although some related papers have been published, only a few of them are survey/review papers [4], [5]. Moreover, while these papers introduced the basic concepts of privacy protection in big data, they failed to cover several important aspects of this area. For example, neither [4] nor [5] provides a detailed discussion of big data privacy with respect to cloud computing. Besides, none of the papers discussed future challenges in detail.

In this paper, we give a comprehensive overview of the state-of-the-art technologies for preserving the privacy of big data at each stage of the big data life cycle. This paper focuses on the anonymization process, which significantly improves the scalability and efficiency of TDS (top-down specialization) for data anonymization over existing approaches.

The major contributions of our research are threefold. First, we creatively apply MapReduce on cloud to TDS for data anonymization and deliberately design a group of innovative MapReduce jobs to concretely accomplish the specializations in a highly scalable fashion. Second, we propose a two-phase TDS approach to gain high scalability by allowing specializations to be conducted on multiple data
partitions in parallel during the first phase. Third, implementation results show that our approach can significantly improve the scalability and efficiency of TDS for data anonymization over existing approaches.

The remainder of this paper is organized as follows. Section II describes the infrastructure of big data and the issues and challenges related to big data privacy arising from the underlying structure of cloud computing. Section III reviews traditional data privacy preservation methods. Section IV presents privacy preservation for big data, formulates the two-phase TDS approach, and elaborates the algorithmic details of the MapReduce jobs. The implementation of our approach is shown in Section V. Finally, we conclude this paper and discuss future work in Section VI.

II. INFRASTRUCTURE OF BIG DATA

To handle the different dimensions of big data in terms of volume, velocity, and variety, we need to design efficient and effective systems to process large amounts of data arriving at very high speed from different sources. Big data has to go through multiple phases during its life cycle, as shown in Figure 2. Data are distributed nowadays, and new technologies are being developed to store and process large repositories of data. For example, cloud computing technologies, such as Hadoop MapReduce, are explored for big data storage and processing. In this section we explain the life cycle of big data. In addition, we also discuss how big data leverage cloud computing technologies and the challenges that arise when cloud computing is used for the storage and processing of big data.

A. LIFE CYCLE OF BIG DATA

Data generation: Data can be generated from many distributed sources. The amount of data generated by humans and machines has blown up in the past few years. For example, 2.5 quintillion bytes of data are generated on the web every day, and 90 percent of the data in the world was generated in the past few years. Facebook, a social networking site, alone generates 25 TB of new data every day. Usually, the data generated are large, diverse and complex, so it is hard for traditional systems to handle them. The data generated are normally related to a specific domain such as business, Internet, research, etc.

Data storage: This phase refers to storing and managing large-scale data sets. A data storage system consists of two parts, i.e., hardware infrastructure and data management [6]. Hardware infrastructure refers to using information and communications technology (ICT) resources for several tasks (such as distributed storage). Data management refers to the set of software positioned on top of the hardware infrastructure to manage and query large-scale data sets. It should also provide many interfaces to interact with and analyze stored data.

Data processing: The data processing phase basically covers data collection, data transmission, pre-processing, and the extraction of useful information. Data collection is needed because data may come from diverse sources, i.e., sites that contain text, images and videos. In the data collection phase, data are acquired from a specific data production environment using dedicated data collection technology. In the data transmission phase, after collecting raw data from a specific data production environment, we need a high-speed transmission mechanism to transmit the data into proper storage for various types of analytic applications. Finally, the pre-processing phase aims at removing meaningless and redundant parts of the data so that more storage space can be saved.

The massive data and domain-specific analytical methods are used by various applications to derive significant information. Although different fields in data analytics require different data characteristics, a few of these fields may leverage similar underlying technology to inspect, transform and model data to extract value from it. Emerging data analytics research can be categorized into the following six technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics [6].

B. CHALLENGES OF BIG DATA

The application of Big Data is leading to a set of new challenges, since the data sets of Big Data are so large and complex that they are difficult to acquire, store, manage and analyze. The main challenges are listed as follows [7], [8]:
1. Data preparation. An important basis of big data analysis and management is the availability of high-quality, precise, and trustworthy data. Data preparation is paramount for increasing the value of big data.
2. Efficient distributed storage and search. Timeliness of data collection is fundamental to offering fast analysis of big data. Therefore, there is an increasing need to provide efficient distributed storage with faster memories and to enhance search algorithms.
3. Effective online data analysis. Online analysis of multidimensional data is becoming a must and a potential source of information for decision making. This would require adapting existing OLAP approaches to big data.
4. Effective machine learning techniques for big data mining. Machine learning and data mining should be adapted to big data to unleash the full potential of collected data.
5. Efficient handling of big data streams. Some specific scenarios (e.g., stock exchanges) require the analysis of data in the form of streams. Fast and optimized solutions should be developed to make inferences on big data streams.
6. Semantic lifting techniques. The semantics of collected big data represent an important aspect for the future development of big data applications. Future approaches to big data analysis should be able to deal with their semantics.
7. Programming models. Many programming models for big data infrastructures are available; examples include MapReduce and Hadoop. We should consider different approaches for storing and managing data.
8. Social analytics. The ability to distinguish data that can be trusted and that comply with users' needs and preferences is as important as it is difficult to achieve. Social analytics should address this problem by providing correct and sound approaches to social data analysis.
9. Security and privacy. Big data are a priceless source of information. However, they often contain sensitive information that needs to be protected from unauthorized access and release.

III. TRADITIONAL DATA PRIVACY PRESERVATION METHODS

Cryptography refers to a set of techniques and algorithms for protecting data. In cryptography, plaintext is transformed into ciphertext using various encryption schemes. There are numerous methods based on this scheme, such as public key cryptography, digital signatures, etc.

Cryptography alone cannot enforce the privacy demanded by common cloud computing and big data services [9]. This is because big data differ from traditional large data sets on the basis of the three Vs (velocity, variety, volume) [10], [11]. It is these features of big data that make big data architectures different from traditional information architectures. These changes in architecture and its complex nature make cryptography and traditional encryption schemes unable to scale up to the privacy needs of big data.

Another challenge with cryptography is the all-or-nothing retrieval policy of encrypted data [12]. Less sensitive data that could be useful in big data analytics are also encrypted, and users are not allowed to access them; the data are unreachable to those who do not have the decryption key. Privacy may also be breached if data are stolen before encryption or if cryptographic keys are misused.

Attribute-based encryption can also be used for big data privacy [13], [14]. This method of securing big data is based on the relationships among the attributes present in big data. The attributes that need to be protected are identified based on the type of big data and company policies.

In a nutshell, encryption or cryptography alone cannot serve as a big data privacy preservation method. It can help with data anonymization but cannot be used directly for big data privacy.

IV. PRIVACY PRESERVATION FOR BIG DATA

TWO-PHASE TOP-DOWN SPECIALIZATION (TPTDS)

A sketch of the TPTDS approach is shown in Figure 3. The TPTDS approach has three components, namely data partition, anonymization level merging, and data specialization.

Figure 3. Execution framework overview of MRTDS.

We design the TPTDS approach to conduct the computation required in TDS in a highly scalable and efficient fashion. The two phases of our approach are based on the two levels of parallelization provided by MapReduce on cloud. Essentially, MapReduce on cloud has two levels of parallelization, i.e., the job level and the task level. Job-level parallelization means that multiple MapReduce jobs can be executed simultaneously to make full use of cloud infrastructure resources. Combined with cloud, MapReduce becomes more powerful and elastic, as cloud can offer infrastructure resources on demand; for example, the Amazon Elastic MapReduce service. Task-level parallelization means that multiple mapper/reducer tasks in a MapReduce job are executed simultaneously over data splits.
To achieve high scalability, we parallelize multiple jobs on data partitions in the first phase, but the resultant anonymization levels are not identical. To obtain
finally consistent anonymous data sets, the second phase is essential to integrate the intermediate results and further anonymize the entire data set. In the first phase, we run a subroutine over each of the partitioned data sets in parallel to make full use of the job-level parallelization of MapReduce. The subroutine is a MapReduce version of centralized TDS (MRTDS), which concretely conducts the computation required in TPTDS. MRTDS anonymizes data partitions to generate intermediate anonymization levels. An intermediate anonymization level means that further specialization can be performed without violating k-anonymity. MRTDS leverages only the task-level parallelization of MapReduce.
ALGORITHM 1. SKETCH OF TWO-PHASE TDS (TPTDS).

Input: Data set D, anonymity parameters k, k^I and the number of partitions p.
Output: Anonymous data set D*.
1: Partition D into Di, 1 ≤ i ≤ p.
2: Execute MRTDS(Di, k^I, AL^0) → AL^0_i, 1 ≤ i ≤ p, in parallel as multiple MapReduce jobs.
3: Merge all intermediate anonymization levels into one, Merge(AL^0_1, AL^0_2, ..., AL^0_p) → AL^I.
4: Execute MRTDS(D, k, AL^I) → AL* to achieve k-anonymity.
5: Specialize D according to AL*, and output D*.
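Read as code, Algorithm 1 reduces to the skeleton below. This is purely a structural sketch under our own naming: runMRTDS, merge, and specialize stand for the MapReduce jobs and the merging procedure described in this paper and are left as stubs, and the Dataset and AnonymizationLevel types are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Structural sketch of Algorithm 1 (TPTDS). The helper methods are stubs
// standing in for the MapReduce jobs described in the text.
public class TwoPhaseTds {

    static AnonymizationLevel runMRTDS(Dataset d, int k, AnonymizationLevel al) { return al; } // MRTDS job(s)
    static AnonymizationLevel merge(List<AnonymizationLevel> als) { return als.get(0); }       // pick more general cuts
    static Dataset specialize(Dataset d, AnonymizationLevel al) { return d; }                  // apply final cuts

    static Dataset anonymize(Dataset d, int k, int kI, int p) {
        List<Dataset> parts = d.partition(p);                        // step 1: split D into D_1..D_p
        List<AnonymizationLevel> intermediate = parts.parallelStream() // step 2: p MRTDS jobs in parallel,
                .map(di -> runMRTDS(di, kI, AnonymizationLevel.top())) //         each to the looser level k^I
                .collect(Collectors.toList());
        AnonymizationLevel alI = merge(intermediate);                // step 3: merge into one level AL^I
        AnonymizationLevel alStar = runMRTDS(d, k, alI);             // step 4: anonymize the whole D to k
        return specialize(d, alStar);                                // step 5: emit anonymous D*
    }

    // Placeholder types standing in for the paper's data set and anonymization level.
    static class Dataset { List<Dataset> partition(int p) { return new ArrayList<>(); } }
    static class AnonymizationLevel { static AnonymizationLevel top() { return new AnonymizationLevel(); } }
}
```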

Modules Description:

Data Partition:
o In this module, the data partition is performed on the cloud.
o Here we collect a large number of data sets.
o We split the large data set into small data sets.
o We then assign a random number to each record, so that the records drawing the same number form one partition (a minimal sketch follows this list).
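A minimal sketch of that random partitioning step (our own illustration; records are plain strings here and the partition count p and seed are arbitrary):

```java
import java.util.*;

// Random partitioning sketch: each record draws a number in [0, p) and all
// records with the same number form one partition, so every partition is an
// unbiased sample of the original data set.
public class RandomPartition {
    static List<List<String>> partition(List<String> records, int p, long seed) {
        Random rnd = new Random(seed);
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < p; i++) parts.add(new ArrayList<>());
        for (String record : records)
            parts.get(rnd.nextInt(p)).add(record);   // assign record to a random partition
        return parts;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("r1", "r2", "r3", "r4", "r5", "r6");
        System.out.println(partition(data, 3, 42L)); // e.g. 3 partitions of ~2 records
    }
}
```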
Anonymization:
o After obtaining the individual data sets, we apply anonymization, where anonymization means hiding or removing the sensitive fields in the data sets.
o We then obtain an intermediate result for each small data set; these intermediate results are used in the specialization process.
o All intermediate anonymization levels are merged into one in the second phase. The merging of anonymization levels is completed by merging cuts. To ensure that the merged intermediate anonymization level AL^I never violates the privacy requirements, the more general one is selected as the merged one (see the sketch after this list).

Merging:
o The intermediate results of the many small data sets are merged here.
o The MRTDS driver is used to organize the small intermediate results for merging; the merged data sets are kept on the cloud.
o Anonymization is applied again to the merging result; this is called specialization.
Specialization:
o After the intermediate results are obtained, they are merged into one.
o We then apply anonymization again on the merged data; this is called specialization.
o Here we use two kinds of jobs, IGPL Initialization and IGPL Update.
o The jobs are coordinated by the driver.

OBS:
OBS stands for optimized balancing scheduling.
o Here we focus on two kinds of scheduling, by time and by size.
o Data sets are split into the specified size, and anonymization is applied at the specified time.
o The OBS approach aims to deliver high capability in handling large data sets.
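To illustrate how "the more general one is selected" when anonymization levels are merged, the sketch below walks two values of the same attribute up a taxonomy tree and keeps the lowest node that covers both. The tiny Education taxonomy is an invented example, not one of the paper's attribute hierarchies.

```java
import java.util.*;

// Merging two anonymization levels: for each attribute, keep the more general
// of the two domain values, i.e. their lowest common ancestor in the taxonomy.
public class MergeCuts {
    static final Map<String, String> PARENT = Map.of(
            "Bachelors", "University", "Masters", "University",
            "University", "ANY", "HighSchool", "Secondary", "Secondary", "ANY");

    static List<String> pathToRoot(String v) {
        List<String> path = new ArrayList<>();
        for (String cur = v; cur != null; cur = PARENT.get(cur)) path.add(cur);
        return path;                                  // e.g. Masters -> University -> ANY
    }

    // The merged value is the first node on v1's root path that also covers v2.
    static String moreGeneral(String v1, String v2) {
        Set<String> covers2 = new HashSet<>(pathToRoot(v2));
        for (String cand : pathToRoot(v1))
            if (covers2.contains(cand)) return cand;
        return "ANY";                                 // disjoint branches fall back to the root
    }

    public static void main(String[] args) {
        System.out.println(moreGeneral("Masters", "University"));  // -> University
        System.out.println(moreGeneral("Masters", "HighSchool"));  // -> ANY
    }
}
```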
V. IMPLEMENTATION AND IMPROVEMENT

To elaborate how data sets are processed in MRTDS, the execution framework based on standard MapReduce is depicted in Figure 3. The solid arrow lines represent the data flows in the canonical MapReduce framework. From Figure 3, we can see that the iteration of MapReduce jobs is controlled by the anonymization level AL in the Driver. The data flows for handling iterations are represented by dotted arrow lines. AL is sent from the Driver to all workers, including Mappers and Reducers, via the distributed cache mechanism, and the value of AL is modified in the Driver according to the output of the IGPL Initialization or IGPL Update jobs. Because the amount of such data is extremely small compared with the data sets to be anonymized, it can be efficiently transmitted between the Driver and the workers. We adopt Hadoop, an open-source implementation of MapReduce, to implement MRTDS. Since most Map and Reduce functions need to access the current anonymization level AL, we use the distributed cache mechanism to pass the content of AL to every Mapper or Reducer node, as shown in Figure 3.
Also, Hadoop provides a mechanism to set simple global variables for Mappers and Reducers; the best specialization is passed into the Map function of the IGPL Update job in this way. The partition hash function in the shuffle phase is modified, because the two jobs require that key-value pairs with the same p field of the key, rather than the entire key, go to the same Reducer. To reduce communication traffic, MRTDS exploits the combiner mechanism, which aggregates the key-value pairs with the same key into a single pair on the nodes running Map functions.
approach for data anonymization for Preserving Bigdata
Privacy.



VI. CONCLUSION AND FUTURE RESEARCH CHALLENGES

In this paper, we have examined the scalability problem of large-scale data anonymization by TDS and proposed a highly scalable two-phase TDS approach using MapReduce on cloud. Data sets are partitioned and anonymized in parallel in the first phase, producing intermediate results. Then, the intermediate results are merged and further anonymized to produce consistent k-anonymous data sets in the second phase. We have creatively applied MapReduce on cloud to data anonymization and deliberately designed a group of innovative MapReduce jobs to concretely achieve the specialization computation in a highly scalable way. Experimental results on real-world data sets have demonstrated that, with our approach, the scalability and efficiency of TDS are improved significantly over existing approaches. In cloud environments, privacy preservation for data analysis, sharing and mining remains a challenging research issue due to the increasingly large volumes of data sets, and therefore requires intensive investigation. We will investigate the adoption of our approach to bottom-up generalization algorithms for data anonymization. Based on the contributions herein, we plan to further explore the next step on scalable privacy-preservation-aware analysis and scheduling of large-scale data sets. Optimized balanced scheduling strategies are expected to be developed towards overall scalable privacy-preservation-aware data set scheduling.

REFERENCES

[1] J. Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity. Zürich, Switzerland: McKinsey Global Inst., Jun. 2011, pp. 1-137.
[2] J. Gantz and D. Reinsel, "Extracting value from chaos," IDC iView, pp. 1-12, 2011.
[3] A. Katal, M. Wazid, and R. H. Goudar, "Big data: Issues, challenges, tools and good practices," in Proc. IEEE Int. Conf. Contemp. Comput., Aug. 2013, pp. 404-409.
[4] B. Matturdi, X. Zhou, S. Li, and F. Lin, "Big data security and privacy: A review," China Commun., vol. 11, no. 14, pp. 135-145, Apr. 2014.
[5] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, "Information security in big data: Privacy and data mining," IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.
[6] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable systems for big data analytics: A technology tutorial," IEEE Access, vol. 2, pp. 652-687, Jul. 2014.
[7] C. A. Ardagna and E. Damiani, "Business intelligence meets big data: An overview on security and privacy."
[8] A. Labrinidis and H. V. Jagadish, "Challenges and opportunities with big data," Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
[9] M. van Dijk and A. Juels, "On the impossibility of cryptography alone for privacy-preserving cloud computing," in Proc. 5th USENIX Conf. Hot Topics in Security, Aug. 2010, pp. 1-8.
[10] S. Sagiroglu and D. Sinanc, "Big data: A review," in Proc. Int. Conf. Collaboration Technologies and Systems, 2013, pp. 42-47.
[11] Y. Demchenko, P. Grosso, C. de Laat, and P. Membrey, "Addressing big data issues in scientific data infrastructure," in Proc. Int. Conf. Collaboration Technologies and Systems, 2013, pp. 48-55.
[12] Cloud Security Alliance, Top Ten Big Data Security and Privacy Challenges, Technical Report, Nov. 2012.
[13] S. H. Kim, N. U. Kim, and T. M. Chung, "Attribute relationship evaluation methodology for big data security," in Proc. Int. Conf. IT Convergence and Security (ICITCS), 2013, pp. 1-4.
[14] S. H. Kim, J. H. Eom, and T. M. Chung, "Big data security hardening methodology using attributes relationship," in Proc. Int. Conf. Information Science and Applications (ICISA), 2013, pp. 1-2.
[15] H. Takabi, J. B. D. Joshi, and G. Ahn, "Security and privacy challenges in cloud computing environments," IEEE Security and Privacy, vol. 8, no. 6, pp. 24-31, Nov. 2010.
[16] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient full-domain k-anonymity," in Proc. ACM SIGMOD Int. Conf. Management of Data (SIGMOD '05), 2005, pp. 49-60.
[17] A. Mehmood, I. Natgunanathan, Y. Xiang, G. Hua, and S. Guo, "Protection of big data privacy," IEEE Access, 2016.

