Mining data streams has been a focal point of research interest over the past
decade. Advances in hardware and software have raised the significance of this
area of research by enabling data to be generated faster than ever. Such rapidly
generated data are termed data streams. Credit card transactions, Google
searches, phone calls in a city, and many others are typical data streams. In many
important applications, this streaming data must be analyzed in real time.
Traditional data mining techniques have fallen short in addressing the needs of
data stream mining. Randomization, approximation, and adaptation have been
used extensively in developing new techniques or adapting existing ones to enable
them to operate in a streaming environment. This paper reviews key milestones
and the state of the art in the data stream mining area. Future insights are also
presented. © 2011 Wiley Periodicals, Inc.
Correspondence to: mohamed.gaber@port.ac.uk
School of Computing, University of Portsmouth, Portsmouth, Hampshire, UK
DOI: 10.1002/widm.52

about.1,2 However, streaming data, if analyzed, is an important source of knowledge that enables us to make extremely important decisions in real time. The area has attracted the attention of the data mining community over the last decade, with the aim of developing new techniques or adapting existing ones to realize the many important applications of data stream mining. Business, scientific, and security applications have been discussed extensively in the literature.3,4

The last decade has witnessed active research in data stream mining. Hundreds of techniques have been proposed to address the research issues of analyzing rapidly arriving data streams in real time. Out of this large body of literature, we can identify four categories that have contributed to shaping this area of research, as follows. Other categories of techniques can be identified. For example, a large body of one-pass techniques exists in the data stream mining literature.1 However, the impact of the following four categories has been widely recognized. The first three categories represent approaches to building learning algorithms. On the other hand,

1. Two-phase techniques
2. Hoeffding bound-based techniques
3. Symbolic approximation-based techniques
4. Granularity-based techniques

This paper will discuss the main principle behind each of the above categories and how this principle has been applied to different techniques. This discussion will be followed by a presentation of new directions in the area. Finally, future insights will be given.

NOTABLE TECHNIQUES IN DATA STREAM MINING
This section discusses the four identified categories of data stream mining techniques listed in the introductory section.

Two-Phase Techniques
The two-phase techniques were introduced by Aggarwal et al.5 The general idea for this category of techniques is to maintain an online summary of the data using what have been termed microclusters. Microclustering extends the data structure proposed by Zhang et al.6 in developing balanced iterative reducing and clustering using hierarchies (BIRCH).
The maintenance of the online microclusters is followed by a second phase that is done offline. This second phase differs from one technique to another according to whether the ultimate objective is running a supervised or an unsupervised technique.

On the basis of the two-phase strategy, a framework for clustering data streams termed CluStream has been proposed.5 The proposed technique divides the clustering process into two components: online and offline. The online component stores summary statistics about the data streams, and the offline one performs clustering on the summarized data according to a number of user preferences such as the time frame and the number of clusters. In an important milestone for the two-phase techniques, Aggarwal et al.7 have proposed an extension to CluStream termed HPStream, a projected clustering technique for high-dimensional data streams. HPStream has outperformed CluStream in a number of case studies. The main motivation behind the development of HPStream is that CluStream has not performed effectively with high-dimensionality streaming information.

Aggarwal et al.8 have adopted the idea of microclusters introduced in CluStream5 in on-demand classification. CluStream, as described earlier, divides the clustering process into two components: offline and online.

On-demand classification8,9 uses clustering results to classify data using statistics of class distribution. The main motivation behind the technique is that the classification model should be used over a time period according to the application. The technique uses microclustering for each class in the data stream. This initialization is followed by a nearest neighbor classification of the unlabeled data. The key feature of the microclusters in the proposed technique is their subtractive property. This property enables the extraction of the needed microclusters over the required time period.

Hoeffding Bound-Based Techniques
… estimated mean value expressed as:

$$\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}},$$

where R is the range of the estimated number, n is the number of points, and δ is the allowed probability that the error exceeds ε. This generic method has been applied to an extension of the traditional K-means clustering algorithm, VFKM, and to decision tree classification, very fast decision trees (VFDT).

Unlike K-median, which has been used extensively in data stream clustering, the K-means algorithm computes the cluster centers by using the mean values of the data records assigned to the cluster under examination. VFKM13 uses the Hoeffding bound to determine the number of examples needed in each step of the K-means algorithm. VFKM runs as a sequence of K-means executions, with each run using more data records than the previous one, until the calculated statistical Hoeffding bound is satisfied.

Domingos and Hulten10,11 have developed VFDT, which is a decision tree learning system based on Hoeffding trees. It splits the tree using the current best attribute, taking into consideration that the number of examples/records used satisfies the Hoeffding bound. VFDT is an extended version of the Hoeffding tree algorithm that addresses the research issues of data streams. These research issues are as follows:

Ties of attributes: occur when two or more attributes have close values of the splitting criteria such as information gain.
High-speed nature of data streams: represents an inherent feature of data streams.
Bounded memory: the tree can grow until the algorithm runs out of memory.
Accuracy of the output: is an issue in all data stream mining algorithms.

The extension of Hoeffding trees in VFDT has been done using the following techniques:
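Before turning to those techniques, the Hoeffding bound itself can be made concrete. The following is a minimal sketch, not the VFKM or VFDT implementation: it only evaluates ε = sqrt(R² ln(1/δ) / (2n)) and inverts it to estimate how many examples are needed for a target ε. The function names `hoeffding_epsilon` and `examples_needed` are hypothetical and chosen for illustration.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range is R (range of the estimated quantity), n is the number of
    points seen so far, and delta is the allowed failure probability.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the bound is at most epsilon (the bound solved for n)."""
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # The bound shrinks as more stream records are observed.
    for n in (100, 1_000, 10_000):
        print(n, round(hoeffding_epsilon(1.0, 0.05, n), 4))
    print(examples_needed(1.0, 0.05, 0.01))
```

Because ε falls with the square root of n, halving the error requires roughly four times as many records, which is why techniques in this category keep consuming the stream until the bound is satisfied.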
© 2011 John Wiley & Sons, Inc. Volume 2, January/February 2012
WIREs Data Mining and Knowledge Discovery Advances in data stream mining
the attributes with each incoming record of the stream.
Bounded memory has been addressed by deactivating the least promising leaves and ignoring the poor attributes. The calculation of these poor attributes is done through the difference between the splitting criteria of the highest and lowest attributes. If the difference is greater than a prespecified value, the attribute with the lowest splitting measure will be removed from memory, saving the memory of the data stream computing environment.
The accuracy of the output has been taken into consideration using multiple scans over the data streams in the case of low data rates, and by using a different, more accurate technique to build an initial decision tree.

All of the above improvements have been tested using synthetic data sets. The experiments have demonstrated the efficiency of these improvements. VFDT has been extended by Hulten et al.11 to address the problem of concept drift in evolving data streams. The new framework has been termed CVFDT. It mainly runs VFDT over fixed sliding windows in order to have the most updated classifier. The change occurs when the splitting criteria change significantly across the input attributes. It is worth pointing out here that the work by Masud et al.14 tackles concept drift with emerging classes.

SAX-Based Techniques
Symbolic Aggregate approXimation (SAX) is a time series representation that has been introduced by Keogh and his colleagues.15 SAX has proved to be the state-of-the-art technique in time series representation. Time series data is a typical streaming source with a temporal dimension.

In addition to being used in traditional data mining techniques such as clustering, classification, and indexing, it has achieved important breakthroughs in finding the most different subsequence in a time series, termed discord,16 and the most frequent subsequence in a time series, termed motif.17 Numerous applications have used the SAX representation with notable success. Some examples can be recalled here. It has been reported16 that a premature ventricular contraction could be accurately identified using discord detection in the time series of an electrocardiogram (ECG). Li and Nallela18 have used motif discovery with the SAX representation to successfully find patterns of water level.

SAX follows three major steps in converting a time series from its numerical form to its symbolic form. The first step is Piecewise Aggregate Approximation (PAA). This is done by converting a time series of size n to an arbitrary size w using the following equation:

$$\bar{C}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w}i} C_j,$$

where $\bar{C}_i$ is the ith time point in the approximated time series.

The second step is symbolic discretization. This is done by producing equal areas under the curve of the Gaussian distribution and setting the respective breakpoints. Each breakpoint represents a step from one letter to another when replacing the approximated values produced by the PAA process with their symbolic values. The final step uses a distance measure between each pair of characters, stored in a lookup table, to find the accumulated distance between any two subsequences of a time series.

Granularity-Based Techniques
The granularity-based approach has been introduced by Gaber et al.19–21 Having noted that stream mining techniques may fall short when running on resource-constrained devices such as smart phones and sensor nodes, the granularity-based approach works on adapting the mining techniques to change their resource consumption patterns over time according to the availability of resources.

Resource consumption patterns represent the change in resource consumption over a period of time, which is termed the time frame. The algorithm granularity settings are the input, output, and processing settings of a mining algorithm that can vary over time to cope with the availability of resources and the current data stream arrival rate. The following are definitions of each of these settings:

Algorithm input granularity (AIG): AIG represents the process of changing the data stream arrival rates that feed the algorithm. Examples of techniques that could be used include sampling, load shedding, and creating data synopses. Sampling has been the choice used in developing the granularity-based data mining techniques.
Algorithm output granularity (AOG): AOG is the process of changing the output size of the algorithm in order to preserve the limited
memory space. We refer to this output as the number of knowledge structures. For example, we may refer to the number of clusters or rules.
Algorithm processing granularity (APG): APG is the process of changing the algorithm parameters in order to consume less processing power. Randomization and approximation techniques represent the strategies of APG.

It should be noted that there is a collective interaction among the above three settings. AIG mainly affects the data rate and it is associated with bandwidth consumption and battery. On the other hand, AOG is associated with memory and APG is associated with processing power. However, the change in any of them affects the other resources. The process of enabling resource awareness should be very lightweight in order to be feasible in a streaming environment characterized by its scarcity of resources. Accordingly, the algorithm granularity settings only consider direct interactions.

The algorithm granularity requires continuous monitoring of the computational resources. This is done over fixed time intervals/frames that we denote as TF. According to this periodic resource monitoring, the mining algorithm changes its parameters to cope with the current consumption patterns of resources. These parameters are the AIG, APG, and AOG settings discussed briefly in the previous section. It has to be noted that the value of TF is a critical parameter for the success of the running technique. The higher the TF is, the lower the adaptation overhead will be, but at the expense of risking a high consumption of resources during the long time frame.

The use of algorithm granularity as a general approach for mining data streams will require us to provide some formal definitions and notations. The following are definitions that we will use in our discussion:

R: set of computational resources R = {r1, r2, . . ., rn};
TF: time interval for resource monitoring and adaptation;
ALT: application lifetime;
ALT′: time left to last the application lifetime;
NoF(ri): number of time frames to consume the resource ri, assuming that the consumption pattern of ri will follow the same pattern of the last time frame;
AGP(ri): algorithm granularity parameter that affects the resource ri.

According to the above, the main rule for applying the algorithm granularity approach is as follows:

IF ALT′/TF > NoF(ri)
THEN SET AGP(ri)−
ELSE SET AGP(ri)+

where AGP(ri)+ achieves higher accuracy at the expense of higher consumption of the resource ri, and AGP(ri)− achieves lower accuracy with the advantage of lower consumption of the resource ri.

This simplified rule could take different forms according to the monitored resource and the algorithm granularity parameter applied to control the consumption of this resource. Interested readers are referred to Ref 19 for applying the above rule in controlling a data stream clustering algorithm termed RA-Cluster.

Interested practitioners can use the following procedure for enabling resource awareness and adaptation for their data stream mining algorithms. The procedure consists of the following steps:

1. Identify the set of resources that the mining algorithm will adapt to (R);
2. Set the application lifetime (ALT) and time interval/frame (TF);
3. Define AGP(ri)+ and AGP(ri)− for every ri ∈ R;
4. Run the algorithm for TF;
5. Monitor the resource consumption for every ri ∈ R;
6. Apply AGP(ri)+ or AGP(ri)− to every ri ∈ R according to the ratio ALT′/TF : NoF(ri) and the rule given;
7. Repeat the last three steps.

Applying the above procedure is all that is needed to enable resource awareness and adaptation, using the algorithm granularity approach, in stream mining algorithms.

On the basis of the granularity-based approach, a number of data stream mining algorithms have been developed. For a complete list of techniques, the reader is advised to review the recent tutorial by Gama et al.22
techniques, (2) Hoeffding bound-based techniques, (3) symbolic approximation-based techniques, and (4) granularity-based techniques. Details of each category have been discussed.

New directions and future insights in this growing area of research have been presented. Two research directions have been discussed. The first concerns mining data originating from sensor networks. Mobile data stream mining represents the second area. Finally, future insights by the author have been enumerated, giving the reader some potential directions for research.
REFERENCES
1. Gaber MM, Zaslavsky A, Krishnaswamy S. Mining data streams: a review. ACM SIGMOD Rec 2005, 34:18–26.
2. Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In: Proceedings of PODS. 2002.
3. Gama J, Gaber MM, eds. Learning from Data Streams: Processing Techniques in Sensor Networks. Springer Verlag; 2007.
4. Ganguly A, Gama J, Omitaomu O, Gaber MM, Vatsavai RR, eds. Knowledge Discovery from Sensor Data. Berlin, Germany: CRC Press; 2008.
5. Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: Proceedings of the 29th VLDB Conference. Berlin; 2003, 81–92.
6. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec. New York: ACM Press; 1996, 25:103–114.
7. Aggarwal CC, Han J, Wang J, Yu P. A framework for high dimensional projected clustering of data streams. In: Proceedings of the VLDB Conference. 2004.
8. Aggarwal CC, Han J, Wang J, Yu P. On demand classification of data streams. In: Proceedings of the ACM KDD Conference. Seattle, WA; 2004, 503–508.
9. Gaber MM, Zaslavsky A, Krishnaswamy S. A survey of classification methods in data streams. In: Aggarwal C, ed. Data Streams: Models and Algorithms. Springer Verlag; 2007, 39–59.
10. Domingos P, Hulten G. Mining high-speed data streams. In: Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press; 2000, 71–80.
11. Hulten G, Spencer L, Domingos P. Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press; 2001, 97–106.
12. Hoeffding W. Probability inequalities for sums of bounded random variables. J Am Stat Assoc 1963, 58:13–30.
13. Domingos P, Hulten G. A general method for scaling up machine learning algorithms and its application to clustering. In: Proceedings of the 18th International Conference on Machine Learning. Williams College, Williamstown, MA, USA, 2001, 106–113.
14. Masud MM, Gao J, Khan L, Han J, Thuraisingham BM. Integrating novel class detection with classification for concept-drifting data streams. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Bled, Slovenia, 2009, 79–94.
15. Lin J, Keogh E, Lonardi S, Chiu B. A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery. San Diego, CA; 2003, 2–11.
16. Keogh E, Lin J, Fu A. HOT SAX: efficiently finding the most unusual time series subsequence. In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005). Houston, TX; 2005, 226–233.
17. Chiu B, Keogh E, Lonardi S. Probabilistic discovery of time series motifs. In: Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington D.C.; 2003, 493–498.
18. Li L, Nallela S. Probabilistic discovery of motifs in water level. In: IEEE International Conference on Information Reuse and Integration. Las Vegas, NV; 2009, 388–393.
19. Gaber MM, Yu PS. A holistic approach for resource-aware adaptive data stream mining. J New Gen Comput 2006, 25:95–115.
20. Phung ND, Gaber MM, Rohm U. Resource-aware online data mining in wireless sensor networks. In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining. IEEE Symposium Series on Computational Intelligence. Honolulu, HI; 2007, 139–146.
21. Gaber MM. Data stream mining using granularity-based approach. In: Abraham A, Hassanien A, Carvalho A, Snase V, eds. Foundations of Computational Intelligence. Vol. 6. Berlin/Heidelberg: Springer; 2009, 47–66.
22. Gama J, Gaber MM, Krishnaswamy S. Data stream mining: from theory to applications and from stationary to mobile. In: Twenty-Fifth Symposium On Applied Computing. Sierre, Switzerland. Available at: http://www.csse.monash.edu.au/shonali/ACM-SAC10-DS-Tutorial/Tutorial-SAC10-Final.pdf (Accessed October 20, 2011.)
23. Gaber MM, Shiddiqi AM. Distributed data stream classification for wireless sensor networks. In: Proceedings of the 2010 ACM Symposium on Applied Computing (SAC). Sierre, Switzerland: ACM Press; 2010, 1629–1630.
24. Gaber MM, Vatsavai R, Omitaomu O, Gama J, Chawla N, Ganguly A, eds. Knowledge Discovery from Sensor Data. Lecture Notes in Computer Science. Vol. 5840. Las Vegas, Berlin, Germany, NV: Springer; 2010.
25. Krishnaswamy S, Gaber MM, Harbach M, Hugues C, Sinha A, Gillick B, Haghighi PD, Zaslavsky A. Open Mobile Miner: a toolkit for mobile data stream mining. ACM Knowl Discov Databases 2009.
26. Agnik. MineFleet description. Available at: http://www.agnik.com/minefleet.html (Accessed October 17, 2011.)
27. Kargupta H, Park B, Pittie S, Liu L, Kushraj D, Sarkar K. MobiMine: monitoring the stock market from a PDA. ACM SIGKDD Explor 2002, 3:37–46.
28. Kargupta H, Bhargava R, Liu K, Powers M, Blair P, Bushra S, Dull J, Sarkar K, Klein M, Vasa M, Handy D. VEDAS: a mobile and distributed data stream mining system for real-time vehicle monitoring. In: Proceedings of the SIAM International Data Mining Conference. Orlando, FL; 2004, 300–311.