
Advanced Review

Advances in data stream mining


Mohamed Medhat Gaber

Mining data streams has been a focal point of research interest over the past decade. Hardware and software advances have contributed to the significance of this area of research by enabling data to be generated faster than ever. Such rapidly generated data is termed a data stream. Credit card transactions, Google searches, phone calls in a city, and many others are typical data streams. In many important applications, this streaming data must be analyzed in real time. Traditional data mining techniques have fallen short in addressing the needs of data stream mining. Randomization, approximation, and adaptation have been used extensively in developing new techniques or adapting existing ones to enable them to operate in a streaming environment. This paper reviews key milestones and the state of the art in the data stream mining area. Future insights are also presented. © 2011 Wiley Periodicals, Inc.

How to cite this article:
WIREs Data Mining Knowl Discov 2012, 2: 79-85. doi: 10.1002/widm.52

Correspondence to: mohamed.gaber@port.ac.uk
School of Computing, University of Portsmouth, Portsmouth, Hampshire, UK
DOI: 10.1002/widm.52

INTRODUCTION

Data streams are defined as instances of data generated at such high speed that they challenge our computational systems to store, process, and reason about them.1,2 However, streaming data, if analyzed, is an important source of knowledge that enables us to take extremely important decisions in real time. Over the last decade, the area has attracted the attention of the data mining community, which has developed new techniques or adapted existing ones to realize the many important applications of data stream mining. Business, scientific, and security applications have been discussed extensively in the literature.3,4

The last decade has witnessed active research in data stream mining. Hundreds of techniques have been proposed to address the research issues of analyzing rapidly arriving data streams in real time. Out of this large body of literature, we can identify four categories that have contributed to shaping this area of research, listed below. Other categories of techniques can be identified. For example, a large body of one-pass techniques exists in the data stream mining literature.1 However, the impact of the following four categories has been widely recognized. The first three categories represent approaches to building learning algorithms. The last category, on the other hand, contributes a generic controller that could be used on top of any stream mining algorithm:

1. Two-phase techniques
2. Hoeffding bound-based techniques
3. Symbolic approximation-based techniques
4. Granularity-based techniques

This paper discusses the main principle behind each of the above categories and how this principle has been applied to different techniques. This discussion is followed by a presentation of new directions in the area. Finally, future insights are given.

NOTABLE TECHNIQUES IN DATA STREAM MINING

This section discusses the four categories of data stream mining techniques identified in the introductory section.

Two-Phase Techniques

The two-phase techniques were introduced by Aggarwal et al.5 The general idea for this category of techniques is to maintain an online summary of the data using what has been termed microclusters. Microclustering extends the data structure that Zhang et al.6 proposed in developing balanced iterative reducing and clustering using hierarchies (BIRCH).
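As a rough illustration of such an online summary, the sketch below keeps a BIRCH-style cluster feature (count, linear sum, and sum of squares) for one microcluster. The class name `MicroCluster` and its method names are illustrative assumptions rather than the data structure of any particular system; CluStream additionally tracks timestamp statistics, which are omitted here.

```python
import math


class MicroCluster:
    """Illustrative summary of a group of points: count, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # per-dimension linear sum
        self.ss = [0.0] * dim   # per-dimension sum of squares

    def insert(self, point):
        """Absorb one incoming record without storing it."""
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x

    def _combine(self, other, sign):
        self.n += sign * other.n
        for d in range(len(self.ls)):
            self.ls[d] += sign * other.ls[d]
            self.ss[d] += sign * other.ss[d]

    def merge(self, other):
        """Summaries are additive, so merging is element-wise addition."""
        self._combine(other, +1)

    def subtract(self, older):
        """Remove an older snapshot, leaving a summary of the later points."""
        self._combine(older, -1)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square deviation of the summarized points from the centroid."""
        var = sum(self.ss[d] / self.n - (self.ls[d] / self.n) ** 2
                  for d in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))


old, new = MicroCluster(2), MicroCluster(2)
old.insert((1.0, 2.0))                       # snapshot taken after one point
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    new.insert(p)                            # later snapshot of the same cluster
new.subtract(old)                            # keep only the two later points
print(new.n, new.centroid())                 # 2 [4.0, 5.0]
```

Because only sums are stored, the memory cost per microcluster is independent of the number of records it has absorbed, which is what makes the summary suitable for an online phase.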

Volume 2, January/February 2012
© 2011 John Wiley & Sons, Inc.
wires.wiley.com/widm

The maintenance of the online microclusters is followed by a second phase that is done offline. This second phase differs from one technique to another according to whether the ultimate objective is to run a supervised or an unsupervised technique.

On the basis of the two-phase strategy, a framework for clustering data streams termed CluStream has been proposed.5 The technique divides the clustering process into two components: online and offline. The online component stores summary statistics about the data streams, and the offline one performs clustering on the summarized data according to a number of user preferences such as the time frame and the number of clusters. In an important milestone for the two-phase techniques, Aggarwal et al.7 proposed an extension to CluStream termed HPStream, a projected clustering technique for high-dimensional data streams. HPStream has outperformed CluStream in a number of case studies. The main motivation behind the development of HPStream is that CluStream did not perform effectively on high-dimensional streaming information.

Aggarwal et al.8 adopted the idea of microclusters introduced in CluStream5 in on-demand classification. CluStream, as described earlier, divides the clustering process into two components: offline and online.

On-demand classification8,9 uses clustering results to classify data using statistics of class distribution. The main motivation behind the technique is that the classification model should be used over a time period according to the application. The technique uses microclustering for each class in the data stream. This initialization is followed by a nearest neighbor classification of the unlabeled data. The key to the proposed technique is the subtractive property of the microclusters. This property enables the extraction of the needed microclusters over the required time period.

Hoeffding Bound-Based Techniques

Domingos and Hulten10,11 have proposed a generic strategy for scaling up machine learning algorithms termed very fast machine learning (VFML). This strategy depends on determining an upper bound for the learner's accuracy loss as a function of the number of examples/data records in each step of the algorithm. The Hoeffding bound12 has been the key to the development of the VFML techniques. Hence, we have coined the term Hoeffding bound-based techniques for this group. The bound states that with probability $1 - \delta$, the true mean of a random variable $r$ is at least $\bar{r} - \epsilon$, where $\bar{r}$ is the estimated mean value and $\epsilon$ is expressed as:

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}},$$

where $R$ is the range of the estimated number and $n$ is the number of points. This generic method has been applied to an extension of the traditional K-means clustering algorithm, VFKM, and to decision tree classification, in very fast decision trees (VFDT).

Unlike K-median, which has been used extensively in data stream clustering, the K-means algorithm computes the cluster centers by using the mean values of the data records assigned to the cluster under examination. VFKM13 uses the Hoeffding bound to determine the number of examples needed in each step of the K-means algorithm. VFKM runs as a sequence of K-means executions, with each run using more data records than the previous one until the calculated statistical Hoeffding bound is satisfied.

Domingos and Hulten10,11 have developed VFDT, which is a decision tree learning system based on Hoeffding trees. It splits the tree using the current best attribute, provided that the number of examples/records used satisfies the Hoeffding bound. VFDT is an extended version of the Hoeffding tree algorithm that addresses the research issues of data streams. These research issues are as follows:

• Ties of attributes: occur when two or more attributes have close values of the splitting criterion, such as information gain.
• High-speed nature of data streams: represents an inherent feature of data streams.
• Bounded memory: the tree can grow until the algorithm runs out of memory.
• Accuracy of the output: is an issue in all data stream mining algorithms.

The extension of Hoeffding trees in VFDT has been done using the following techniques:

• Ties of attributes have been overcome using a user-specified threshold of acceptable error measure for the output. That way, the running time of the algorithm is reduced and the risk of an infinite running time is avoided.
• The high-speed nature of the streaming information has been addressed using batch processing. The computation of the splitting criteria is done in batches rather than online. This significantly reduces the time of recalculating the criteria for all the attributes with each incoming record of the stream.
• Bounded memory has been addressed by deactivating the least promising leaves and ignoring the poor attributes. The poor attributes are identified through the difference between the splitting criteria of the highest and lowest attributes. If the difference is greater than a prespecified value, the attribute with the lowest splitting measure is removed from memory, saving the memory of the data stream computing environment.
• The accuracy of the output has been taken into consideration by using multiple scans over the data streams in the case of low data rates, and by using a different, more accurate technique to build an initial decision tree.

All of the above improvements have been tested using synthetic data sets. The experiments have demonstrated the efficiency of these improvements. VFDT has been extended to address the problem of concept drift in evolving data streams by Hulten et al.11 The new framework has been termed CVFDT. It mainly runs VFDT over fixed sliding windows in order to have the most up-to-date classifier. A change is considered to occur when the splitting criteria change significantly across the input attributes. It is worth pointing out here that the work by Masud et al.14 tackles concept drift with emerging classes.

SAX-Based Techniques

Symbolic ApproXimation (SAX) is a time series representation that has been introduced by Keogh and his colleagues.15 SAX has proved to be the state-of-the-art technique in time series representation. Time series data is a typical streaming source with a temporal dimension.

In addition to being used in traditional data mining techniques such as clustering, classification, and indexing, SAX has achieved important breakthroughs in finding the most different subsequence in a time series, termed a discord,16 and the most frequent subsequence in a time series, termed a motif.17 Numerous applications have used the SAX representation with notable success. Some examples can be recalled here. It has been reported16 that a premature ventricular contraction could be accurately identified using discord detection in the time series of an electrocardiogram (ECG). Li and Nallela18 have used motif discovery with the SAX representation to successfully find patterns of water level.

SAX follows three major steps in converting a time series from its numerical form to its symbolic form. The first step is Piecewise Aggregate Approximation (PAA). This is done by converting a time series $C$ of size $n$ to one of an arbitrary size $w$ using the following equation:

$$\bar{C}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} C_j,$$

where $\bar{C}_i$ is the $i$th time point in the approximated time series.

The second step is symbolic discretization. This is done by dividing the area under the curve of the Gaussian distribution into equal parts and setting the respective breakpoints. Each breakpoint represents a step from one letter to another when replacing the approximated values produced by the PAA process with their symbolic values. The final step uses a distance measure between each pair of characters, stored in a lookup table, to find the accumulated distance between any two subsequences of time series.

Granularity-Based Techniques

The granularity-based approach has been introduced by Gaber et al.19-21 Having noted that stream mining techniques may fall short when running on resource-constrained devices such as smart phones and sensor nodes, the granularity-based approach adapts the mining techniques to change their resource consumption patterns over time according to the availability of resources.

Resource consumption patterns represent the change in resource consumption over a period of time, which is termed the time frame. The algorithm granularity settings are the input, output, and processing settings of a mining algorithm that can vary over time to cope with the availability of resources and the current data stream arrival rate. The following are definitions of each of these settings:

• Algorithm input granularity (AIG): AIG represents the process of changing the data stream arrival rates that feed the algorithm. Examples of techniques that could be used include sampling, load shedding, and creating data synopses. Sampling has been the choice used in developing the granularity-based data mining techniques.
• Algorithm output granularity (AOG): AOG is the process of changing the output size of the algorithm in order to preserve the limited memory space. We refer to this output as the number of knowledge structures, for example, the number of clusters or rules.
• Algorithm processing granularity (APG): APG is the process of changing the algorithm parameters in order to consume less processing power. Randomization and approximation techniques represent the strategies of APG.

It should be noted that there is a collective interaction among the above three settings. AIG mainly affects the data rate and is associated with bandwidth consumption and battery. On the other hand, AOG is associated with memory and APG is associated with processing power. However, a change in any of them affects the other resources. The process of enabling resource awareness should be very lightweight in order to be feasible in a streaming environment characterized by its scarcity of resources. Accordingly, the algorithm granularity settings only consider direct interactions.

The algorithm granularity approach requires continuous monitoring of the computational resources. This is done over fixed time intervals/frames that we denote as TF. According to this periodic resource monitoring, the mining algorithm changes its parameters to cope with the current consumption patterns of resources. These parameters are the AIG, APG, and AOG settings discussed above. It has to be noted that TF is a critical parameter for the success of the running technique. The higher the TF is, the lower the adaptation overhead will be, but at the expense of risking a high consumption of resources during the long time frame.

The use of algorithm granularity as a general approach for mining data streams requires some formal definitions and notation. The following are the definitions used in our discussion:

R: set of computational resources, R = {r_1, r_2, ..., r_n};
TF: time interval for resource monitoring and adaptation;
ALT: application lifetime;
ALT': time left to last the application lifetime;
NoF(r_i): number of time frames to consume the resource r_i, assuming that the consumption pattern of r_i will follow the same pattern of the last time frame;
AGP(r_i): algorithm granularity parameter that affects the resource r_i.

According to the above, the main rule of the algorithm granularity approach is as follows:

IF ALT'/TF > NoF(r_i)
THEN SET AGP(r_i)−
ELSE SET AGP(r_i)+

where AGP(r_i)+ achieves higher accuracy at the expense of higher consumption of the resource r_i, and AGP(r_i)− achieves lower accuracy with the advantage of lower consumption of the resource r_i.

This simplified rule could take different forms according to the monitored resource and the algorithm granularity parameter applied to control the consumption of this resource. Interested readers are referred to Ref 19 for an application of the above rule in controlling a data stream clustering algorithm termed RA-Cluster.

Interested practitioners can use the following procedure for enabling resource awareness and adaptation in their data stream mining algorithms:

1. Identify the set of resources (R) to which the mining algorithm will adapt;
2. Set the application lifetime (ALT) and time interval/frame (TF);
3. Define AGP(r_i)+ and AGP(r_i)− for every r_i ∈ R;
4. Run the algorithm for TF;
5. Monitor the resource consumption for every r_i ∈ R;
6. Apply AGP(r_i)+ or AGP(r_i)− to every r_i ∈ R according to the ratio ALT'/TF : NoF(r_i) and the rule given;
7. Repeat the last three steps.

Applying the above procedure is all that is needed to enable resource awareness and adaptation, using the algorithm granularity approach, in stream mining algorithms.

On the basis of the granularity-based approach, a number of data stream mining algorithms have been developed. For a complete list of techniques, the reader is advised to review the recent tutorial by Gama et al.22

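Returning to the Hoeffding bound-based techniques covered earlier in this section, the bound is simple enough to compute directly. The sketch below evaluates epsilon for a given range, confidence, and sample size, and inverts the formula to obtain the number of examples needed for a target epsilon. The function names are our own. The split test in `should_split` (compare the gap between the two best attributes with epsilon, and fall back to a tie threshold) follows how Hoeffding trees are commonly described and is an assumption here, since the text above does not spell it out.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound is no larger than epsilon."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))


def should_split(best_gain, second_gain, value_range, delta, n, tie_threshold):
    """Assumed split test: a clear winner, or a tie that is no longer worth waiting on."""
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


print(examples_needed(1.0, 0.05, 0.1))   # 150
```

With a range of 1 and delta = 0.05, about 150 examples are enough to pin the mean down to within 0.1. Halving epsilon roughly quadruples the number of examples required, because n grows with 1/epsilon^2.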
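The first two SAX steps described in this section, PAA followed by symbolic discretization, can also be illustrated briefly. This is a hedged sketch, not the reference SAX implementation: it assumes that w divides n, it z-normalizes the series before discretization (a common preprocessing step that the description above does not mention), and it omits the lookup-table distance of the third step. All function names are illustrative.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes that w divides n")
    seg = n // w
    return [mean(series[i * seg:(i + 1) * seg]) for i in range(w)]


def breakpoints(alphabet_size):
    """Cut points that split the standard Gaussian into equal-area regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax(series, w, alphabet="abcd"):
    """Z-normalize (an assumption), reduce with PAA, then map values to letters."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    cuts = breakpoints(len(alphabet))
    return "".join(alphabet[bisect_right(cuts, v)] for v in paa(z, w))


print(sax([1, 2, 3, 4, 5, 6, 7, 8], w=4))   # abcd
```

A steadily rising series of eight points is reduced to four segment means, each of which falls into a different equal-probability region of the Gaussian, giving the word "abcd".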

NEW DIRECTIONS IN DATA STREAM MINING

Data stream mining has evolved as a new form of online data analysis that has also challenged the computational capabilities of our state-of-the-art data processing facilities. However, advances in the computational power of small computational devices, including personal digital assistants (PDAs), smart phones, and sensor nodes, have created an unprecedented opportunity to perform ubiquitous data stream mining. We can broadly categorize this area into mining sensor data streams and mobile data mining. Recent achievements in these areas are discussed in the following subsections.

Mining Sensor Data Streams

Many important applications, coupled with the increase in the computational power of wirelessly connected sensor nodes, have given birth to this new research direction in the data stream mining area.

Mining data streams originating from sensor nodes has witnessed notable success in the last few years. Research issues associated with this area have been detailed in Ref 3. Differences between data stream mining in sensor networks and on other platforms, as detailed in Ref 3, are as follows:

• Duplication of data in dense deployments of wireless sensor networks introduces a new challenge.
• Multilevel data mining is important in wireless sensor networks given that individual sensors can generate local models that need to be integrated.
• Real-time data cleansing is needed given that sensory streaming data is likely to be noisy.
• Adaptation to availability of resources is inevitable given the limited resources that each sensor node has. It is worth mentioning the success of the granularity-based approach in developing stream mining techniques that are able to operate in wireless sensor networks.20,23

The field is concerned with benefiting from the large deployment of small computational devices that are able to communicate wirelessly and have increasing sensing capabilities. This rich source of streaming data is a key to the success of many important security, scientific, and industrial applications. Examples of these applications can be found in Refs 3,4,24.

Mobile Data Stream Mining

The number of mobile users is continuously increasing. Mobile data mining users are no exception. Academic prototypes such as Open Mobile Miner (OMM)25 and commercial products such as MineFleet26 have already found their way to users.

We can date the start of the area of mobile data mining back to the MobiMine system developed by Kargupta et al.27 Although the system targets mobile brokers in the stock exchange area, the data mining process is performed on a server, conserving the scarce resources of the mobile device, a PDA in this case. A few years later, Kargupta et al.28 developed the VEDAS system for distributed data stream mining of a fleet of vehicles, analyzing both the drivers' behavior and the vehicles' health. The system used mobile devices running different data stream mining techniques. This has been a result of the advances in the computational capabilities of our mobile devices.

Mobility of the user, connectivity problems, and availability of computational resources are the major research issues in this promising area of research. The granularity-based approach has proved to be a successful solution when running stream mining techniques on mobile devices with limited resources.21 The OMM tool25 has adopted the granularity-based approach.

Future Insights

We can state the following future directions and insights in this growing area of research:

• Online medical, scientific, and biological data stream mining using data generated from medical and biological instruments and the various tools employed in scientific laboratories;
• Hardware solutions for small devices emitting or receiving data streams in order to enable high-performance computation on small devices;
• Developing software architectures that serve data streaming applications;
• Situation-aware data stream mining that recalls the models built in similar situations rather than building a new model;
• Online text mining for opinion discovery with the notable use of Web 2.0 technologies.

Conclusion

This review paper has highlighted the major strategies and techniques used in data stream mining. We have identified four categories of techniques: (1) two-phase techniques, (2) Hoeffding bound-based techniques, (3) symbolic approximation-based techniques, and (4) granularity-based techniques. Details of each category have been discussed.

New directions and future insights in this growing area of research have been presented. Two research directions have been discussed. The first concerns mining data originating from sensor networks. Mobile data stream mining represents the second area. Finally, future insights by the author have been enumerated, giving the reader some potential directions for research.

REFERENCES

1. Gaber MM, Zaslavsky A, Krishnaswamy S. Mining data streams: a review. ACM SIGMOD Rec 2005, 34:18-26.
2. Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: Proceedings of PODS. 2002.
3. Gama J, Gaber MM, eds. Learning from Data Streams: Processing Techniques in Sensor Networks. Springer Verlag; 2007.
4. Ganguly A, Gama J, Omitaomu O, Gaber MM, Vatsavai RR, eds. Knowledge Discovery from Sensor Data. Berlin, Germany: CRC Press; 2008.
5. Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: Proceedings of the 29th VLDB Conference. Berlin; 2003, 81-92.
6. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec. New York: ACM Press; 1996, 25:103-114.
7. Aggarwal CC, Han J, Wang J, Yu P. A framework for high dimensional projected clustering of data streams. In: Proceedings of the VLDB Conference. 2004.
8. Aggarwal CC, Han J, Wang J, Yu P. On demand classification of data streams. In: Proceedings of the ACM KDD Conference. Seattle, WA; 2004, 503-508.
9. Gaber MM, Zaslavsky A, Krishnaswamy S. A survey of classification methods in data streams. In: Aggarwal C, ed. Data Streams: Models and Algorithms. Springer Verlag; 2007, 39-59.
10. Domingos P, Hulten G. Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press; 2000, 71-80.
11. Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press; 2001, 97-106.
12. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc 1963, 58:13-30.
13. Domingos P, Hulten G. A general method for scaling up machine learning algorithms and its application to clustering. In: Proceedings of the 18th International Conference on Machine Learning. Williams College, Williamstown, MA, USA, 2001, 106-113.
14. Masud MM, Gao J, Khan L, Han J, Thuraisingham BM. Integrating novel class detection with classification for concept-drifting data streams. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Bled, Slovenia, 2009, 79-94.
15. Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA; 2003, 2-11.
16. Keogh E, Lin J, Fu A. HOT SAX: efficiently finding the most unusual time series subsequence. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005). Houston, TX; 2005, 226-233.
17. Chiu B, Keogh E, Lonardi S. Probabilistic discovery of time series motifs. In: Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington D.C.; 2003, 493-498.
18. Li L, Nallela S. Probabilistic discovery of motifs in water level. In: IEEE International Conference on Information Reuse and Integration. Las Vegas, NV; 2009, 388-393.
19. Gaber MM, Yu PS. A holistic approach for resource-aware adaptive data stream mining. J New Gen Comput 2006, 25:95-115.
20. Phung ND, Gaber MM, Rohm U. Resource-aware online data mining in wireless sensor networks. In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining. IEEE Symposium Series on Computational Intelligence. Honolulu, HI; 2007, 139-146.
21. Gaber MM. Data stream mining using granularity-based approach. In: Abraham A, Hassanien A, Carvalho A, Snase V, eds. Foundations of Computational Intelligence. Vol. 6. Berlin/Heidelberg: Springer; 2009, 47-66.
22. Gama J, Gaber MM, Krishnaswamy S. Data stream mining: from theory to applications and from stationary to mobile. In: Twenty-Fifth Symposium On Applied Computing. Sierre, Switzerland. Available at: http://www.csse.monash.edu.au/shonali/ACM-SAC10-DS-Tutorial/Tutorial-SAC10-Final.pdf (Accessed October 20, 2011.)
23. Gaber MM, Shiddiqi AM. Distributed data stream classification for wireless sensor networks. In: Proceedings of the 2010 ACM Symposium on Applied Computing (SAC). Sierre, Switzerland: ACM Press; 2010, 1629-1630.
24. Gaber MM, Vatsavai R, Omitaomu O, Gama J, Chawla N, Ganguly A, eds. Knowledge Discovery from Sensor Data. Lecture Notes in Computer Science. Vol. 5840. Las Vegas, Berlin, Germany, NV: Springer; 2010.
25. Krishnaswamy S, Gaber MM, Harbach M, Hugues C, Sinha A, Gillick B, Haghighi PD, Zaslavsky A. Open Mobile Miner: a toolkit for mobile data stream mining. ACM Knowl Discov Databases 2009.
26. Agnik. MineFleet description. Available at: http://www.agnik.com/minefleet.html (Accessed October 17, 2011.)
27. Kargupta H, Park B, Pittie S, Liu L, Kushraj D, Sarkar K. MobiMine: monitoring the stock market from a PDA. ACM SIGKDD Explor 2002, 3:37-46.
28. Kargupta H, Bhargava R, Liu K, Powers M, Blair P, Bushra S, Dull J, Sarkar K, Klein M, Vasa M, Handy D. VEDAS: a mobile and distributed data stream mining system for real-time vehicle monitoring. In: Proceedings of the SIAM International Data Mining Conference. Orlando, FL; 2004, 300-311.
