International Journal on Recent and Innovation Trends in Computing and Communication ISSN: 2321-8169

Volume: 5 Issue: 6, pp. 206-208


_______________________________________________________________________________________________
Big Data Harmonization Challenges and Applications

Prof. Jigna Ashish Patel
Assistant Professor, CE Dept, Institute of Technology, Nirma University
Ahmedabad, India
Jignas.patel@nirmauni.ac.in

Dr. Priyanka Sharma
Professor, Rakshashakti University, Meghaninagar
Ahmedabad, India
pspriyanka@yahoo.co.in

Abstract: As data grow, the need for big data solutions increases day by day. The concept of data harmonization has
existed for two decades. Data are collected from various heterogeneous sources, and data harmonization techniques bring
them into a single format in one place; this is also called a data warehouse. Many advances have been made in analyzing
historical data using data warehousing. Innovations continually uncover the challenges and problems faced by data
warehousing. When the volume and variety of data increase exponentially, existing tools might not support OLAP
operations under the traditional warehouse approach. In this paper we focus on the research being done in the field of
big data warehousing, category by category. Research issues and proposed approaches for various kinds of datasets are
presented. The challenges and advantages of using a data warehouse before data mining tasks are also explained.

Keywords: Data warehouse, big data

__________________________________________________*****_________________________________________________

I. INTRODUCTION

Big data is usually characterized as data with large volume, wide variety and high velocity. In this digital world
everyone generates data, and collectively it becomes huge. We must deal with these data in order to provide factual
data analytics. A real-time data warehouse has to be updated to deal with these three Vs: volume, variety and
velocity [3]. In this paradigm shift we cannot ignore big data for business intelligence. Plenty of innovations have
been made in BI and data analytics, but only a few of them remain feasible for big data, and only if they are updated
or modified. It is not sufficient for a solution to support the three Vs alone; challenges like scalability, work
distribution and integration are equally important. Big data technology has become very popular in every field, and
it may be used to optimize businesses. Today, almost all types of companies and users who apply business intelligence
utilities use the raw data generated by the company to develop intelligence for decision making. Huge amounts of user
profile data are processed and analyzed for advertising.

OLAP and data warehousing are typical fields of data science that have been discussed for decades by the data
warehousing and database research groups. A modern data warehouse deals with every type of data, in contrast with
traditional data warehouses. In the context of big data research, computing OLAP data cubes over big data has become
one of the most motivating challenges in the research community [5].

Traditional data warehouses mostly process SQL-type (relational) datasets. They follow steps such as data
preprocessing, data modelling, data normalization/denormalization, OLAP operations, and data visualization. Relational
OLAP (ROLAP) faces many problems in the big data era as data volumes increase exponentially. Scalability, table join
operations and unstructured data are the major weaknesses of ROLAP [6].

ROLAP commonly uses the star model and the snowflake model, which store dimensions and measures in relational tables
that refer to one another through foreign keys. In the big data era the performance of ROLAP is not acceptable, mainly
because of costly join operations. Much research has tried to increase the performance of ROLAP through techniques
like indexing and hashing. On the other hand, Multidimensional OLAP (MOLAP) is a better-suited solution for big data,
as it provides fast response times for join operations, specifically when the data volume is high. A MOLAP system
offers robust performance, but it needs additional storage to maintain the mappings between dimensions and measures.

Similarly, the basic problem of computing OLAP data cubes varies across the wide variety of data types, such as
classical relational data sets, graph data sets, XML data sets, and social network data sets. Unfortunately,
straightforward solutions are not helpful for computing OLAP data cubes over big data. The reason behind this
complexity is the increasing data rate and the dimensionality of the model. When OLAP is merged with data mining, the
technique is known as OLAM (Online Analytical Mining), which is very popular for its high performance but is not
suitable for semi-structured or complex data. Many solutions (theoretical and practical) have been proposed for
efficient computation and retrieval of data from the data warehouse and for analyzing various heterogeneous data.
Scalability and efficiency have been issues in social media since its emergence. The data warehouse is the key phase
of the whole procedure from user query to faster response. Data warehouse is always in the
research for achieving Business Intelligence (BI). Though massive raw data is generated on every social media website,
the target is achieved by using unique and intelligent BI tools and algorithms. Twitter provides a tremendous amount of
research data with which to test a proposed algorithm. Since Twitter is the most popular microblogging website, we
discuss here OLAP and the tools for the Twitter data set.

II. CHALLENGES IN BIG DATA WAREHOUSE

Considering data warehouses and big data together, we found the following challenges:

1. Data Quality

The biggest and most common challenge in dealing with data is ensuring data quality. About 75% of the effort of
building a data warehouse goes into tasks such as readying the data and transporting it into the data warehouse. When
data come from heterogeneous sources in different formats, it is a real challenge to provide a single platform for all
the different kinds of data. To ready the data, various data quality tools can be used to maintain data quality. Much
research has been done to address data problems and to compare the available tools [5].

2. Scalability

Any type of data warehouse should deal with an increasing data rate. The storage capacity of the data warehouse should
be flexible with respect to the real data size; that is, it should support dynamic scaling. In the era of cloud
computing, dynamic scaling is the ideal solution for a big data warehouse. We may choose horizontal scaling or vertical
scaling for our purpose. Various platforms are available, such as Hadoop for horizontal scaling and GPUs (Graphics
Processing Units) for vertical scaling.

3. Efficiency

The efficiency of a data warehouse relates both to its construction and to its operation. Big data mining techniques
are applied either via the data warehouse or directly on the data warehouse, depending on convenience. Whether the data
warehouse can respond quickly to millions of queries is a major efficiency concern [6].

4. Heterogeneity

Data coming from various heterogeneous sources result in a variety of data: structured, semi-structured and
unstructured. Some sources follow the RDBMS model and some follow NoSQL databases. Every type of dataset must be
provided with a unique layer of data integration. The data warehouse should be flexible enough to deal with
heterogeneous datasets so that it does not suffer the cost of reconstruction [6].

III. APPLICATION AREA FOR OLAP AND BIG DATA

Social media

Today social media is synonymous with Facebook, Twitter, Quora, Instagram, YouTube, WhatsApp, blogs and more. Every
social media site and blog is popular in its own environment. Twitter, for example, has more than 500 million users and
more than 340 million tweets per day. Through efficient use of data streaming algorithms and efficient tools and
technology, tweets can be favorited, embedded, unliked, replied to and shared. With these millions of data items and
social networks, Twitter performs analytics too. By taking this metadata and applying OLAP operations over it in
combination with data mining tasks, a great deal of information and knowledge, such as user behavior and emerging
trends and issues, can be extracted and analyzed [3].

To extend the functionality and overcome the limitations of OLAP, research continues in the area of social media.
Techniques and functionalities are modified in order to get good accuracy. Opinion mining and recommender systems use
semi-structured data warehouses to extract knowledge from unstructured as well as semi-structured datasets. OLAP in
social media should be extended to discover underlying measures for unstructured datasets [4].

Text data

To make businesses wiser, people rely on user reviews, advertisements, recommendation systems and more. Each of these
methods uses textual data. To provide a platform for OLAP processing of textual data, document warehousing is popularly
used. Document warehousing is a solution for storing multidimensional documents and analyzing them for efficient text
mining. With the document warehousing approach, various heterogeneous document data can be integrated in a well-formed
infrastructure. In the big data context, challenges like scaling, performance and security also arise for document
warehousing [17].

From a research point of view, document warehousing is a thrust area for contributions to OLAP. The cited work
emphasizes giving an improved solution for data warehousing in the big data era. Its methodology mainly consists of
three stages: documentation, aggregation and data loading. The documentation stage extracts the data from the data
sources and writes it to simple text files. The aggregation stage uses a MapReduce process to complete the ETL over the
data files received from the first stage. In this phase all the generated results are transformed into JSON objects.
With this approach better parallelism can be achieved and the big data problem can be solved [15].
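
The three stages described above (documentation, aggregation and data loading) can be sketched in a few lines. The following is a minimal single-process illustration in Python, not the implementation from [15]: the function names, the tab-separated text format and the sample records are assumptions made for this sketch, and the map and reduce steps are plain functions rather than Hadoop jobs.

```python
import json
from collections import defaultdict


def documentation_stage(sources):
    """Flatten every source into plain 'key<TAB>value' text lines."""
    lines = []
    for records in sources:
        for key, value in records:
            lines.append(f"{key}\t{value}")
    return lines


def map_phase(lines):
    """Map step: parse each text line into a (key, numeric value) pair."""
    for line in lines:
        key, value = line.split("\t")
        yield key, float(value)


def reduce_phase(pairs):
    """Reduce step: group the pairs by key, then compute count and total."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: {"count": len(v), "total": sum(v)} for k, v in groups.items()}


def loading_stage(aggregates):
    """Serialise each aggregate as one JSON object, ready to be loaded."""
    return [
        json.dumps({"key": k, **agg}, sort_keys=True)
        for k, agg in sorted(aggregates.items())
    ]


# Hypothetical input: two heterogeneous sources already read into memory.
crm_rows = [("ahmedabad", 120.0), ("mumbai", 80.0)]
web_logs = [("ahmedabad", 30.0)]

text_lines = documentation_stage([crm_rows, web_logs])
json_rows = loading_stage(reduce_phase(map_phase(text_lines)))
```

In a real deployment the map and reduce functions would run in parallel over partitions of the text files; here they run sequentially so that the data flow between the stages stays visible.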

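As a further illustration of OLAP-style analysis over documents, the sketch below treats each document as a fact with two dimensions and uses term counts as the measure. It is a toy in-memory example under assumed names (documents, build_cube, slice_cube) and assumed dimensions (source, year); it is not taken from the document warehousing work cited here.

```python
from collections import Counter, defaultdict

# Hypothetical document facts: two dimensions (source, year) plus raw text.
documents = [
    {"source": "review", "year": 2016, "text": "fast delivery good price"},
    {"source": "review", "year": 2017, "text": "good support slow delivery"},
    {"source": "blog", "year": 2017, "text": "price comparison good value"},
]


def build_cube(docs, dimensions):
    """Aggregate term counts for each combination of the chosen dimensions."""
    cube = defaultdict(Counter)
    for doc in docs:
        cell = tuple(doc[d] for d in dimensions)
        cube[cell].update(doc["text"].split())
    return cube


def slice_cube(cube, position, value):
    """Keep only the cells whose coordinate at `position` equals `value`."""
    return {cell: terms for cell, terms in cube.items() if cell[position] == value}


base = build_cube(documents, ["source", "year"])  # finest granularity
by_source = build_cube(documents, ["source"])     # roll-up: year dimension dropped
only_2017 = slice_cube(base, 1, 2017)             # slice: year fixed to 2017
```

The roll-up is recomputed from the raw facts for simplicity; a MOLAP-style system would instead keep such aggregates precomputed, which is where the additional storage cost mentioned in the introduction comes from.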
Spatial Data

For the analytics of remote sensing data, spatial online analytical processing (SOLAP) is used. SOLAP is a well-suited
solution for decision support systems that explore multidimensional perspectives of spatial data. It can be used in
spatio-temporal analytics for weather and environment monitoring systems. Because the data are generated from earth
observation, they are very challenging to manage in terms of scale and aggregation. The SOLAP cube uses the concept of
MapReduce in order to get higher parallelism. A newer approach is implemented on the Hadoop framework using traditional
operations like roll-up/drill-down and slice/dice on an optimized ROLAP/MOLAP/HOLAP cube [1].

Web data

Extensive use of the internet and the web generates massive web datasets. Through the concept of web warehousing, the
critical aspects of a decision support system can be built. Advantages like improved productivity and cost savings can
be achieved by applying web warehousing. Web warehousing is the approach of building the OLAP cube and warehouse on web
information in the form of semi-structured data, graphics, text, sound, images, multimedia objects, videos and more. In
simple language, web warehousing is the combination of data warehouse and web technology. Research in this area aims to
show how much more efficient a web warehouse is than a data warehouse by applying web data to a traditional warehouse.
For big data, MapReduce procedures are again used to obtain high parallelism. Using the Hadoop framework and HBase
gives improved results [18].

References

[1] J. Li, L. Meng, F. Z. Wang, W. Zhang, and Y. Cai, "A Map-Reduce-enabled SOLAP cube for large-scale remotely sensed
data aggregation," Computers & Geosciences, vol. 70, pp. 110-119, Sep. 2014.
[2] C. Blanco, I. García-Rodríguez de Guzmán, E. Fernández-Medina, and J. Trujillo, "An architecture for automatically
developing secure OLAP applications from models," Information and Software Technology, vol. 59, pp. 1-16, Mar. 2015.
[3] N. U. Rehman, S. Mansmann, A. Weiler, and M. H. Scholl, "Building a Data Warehouse for Twitter Stream Exploration,"
2012, pp. 1341-1348.
[4] L. Petrazickis, M. Butuc, and B. Steinfeld, "Crunching big data with Hadoop and BigInsights in the cloud," in
Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, 2012, pp. 241-242.
[5] A. Nandi, C. Yu, P. Bohannon, and R. Ramakrishnan, "Data Cube Materialization and Mining over MapReduce," IEEE
Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1747-1759, Oct. 2012.
[6] A. Cuzzocrea, L. Bellatreche, and I.-Y. Song, "Data warehousing and OLAP over big data: current challenges and
future research directions," in Proceedings of the sixteenth international workshop on Data warehousing and OLAP, 2013,
pp. 67-70.
[7] S. Mansmann, N. Ur Rehman, A. Weiler, and M. H. Scholl, "Discovering OLAP dimensions in semi-structured data,"
Information Systems, vol. 44, pp. 120-133, Aug. 2014.
[8] J. Song, C. Guo, Z. Wang, Y. Zhang, G. Yu, and J.-M. Pierson, "HaoLap: A Hadoop based OLAP system for big data,"
Journal of Systems and Software, vol. 102, pp. 167-181, Apr. 2015.
[9] D.-H. Shin and M. J. Choi, "Ecological views of big data: Perspectives and issues," Telematics and Informatics,
vol. 32, no. 2, pp. 311-320, May 2015.
[10] J. Dittrich and J.-A. Quiané-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB
Endowment, vol. 5, no. 12, pp. 2014-2015, 2012.
[11] S. Lee, S. Jo, and J. Kim, "MRDataCube: Data cube computation using MapReduce," in Big Data and Smart Computing
(BigComp), 2015 International Conference on, 2015, pp. 95-102.
[12] I. Triguero, D. Peralta, J. Bacardit, S. García, and F. Herrera, "MRPR: A MapReduce solution for prototype
reduction in big data classification," Neurocomputing, vol. 150, pp. 331-345, Feb. 2015.
[13] T. Niemi, J. Nummenmaa, and P. Thanisch, "Normalising OLAP cubes for controlling sparsity," Data & Knowledge
Engineering, vol. 46, no. 3, pp. 317-343, Sep. 2003.
[14] N. U. Rehman, A. Weiler, and M. H. Scholl, "OLAPing social media: the case of Twitter," in Proceedings of the 2013
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2013, pp. 1139-1146.
[15] M. Ben Kraiem, J. Feki, K. Khrouf, F. Ravat, and O. Teste, "OLAP of the tweets: From modeling toward
exploitation," in Research Challenges in Information Science (RCIS), 2014 IEEE Eighth International Conference on,
2014, pp. 1-10.
[16] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao, "Text Cube: Computing IR Measures for Multidimensional Text
Database Analysis," 2008, pp. 905-910.
[17] F. S. C. Tseng and A. Y. H. Chou, "The concept of document warehousing for multi-dimensional modeling of
textual-based business intelligence," Decision Support Systems, vol. 42, no. 2, pp. 727-744, Nov. 2006.
[18] X. Tan, D. C. Yen, and X. Fang, "Web warehousing: Web technology meets data warehousing," Technology in Society,
vol. 25, no. 1, pp. 131-148, Jan. 2003.
