data mining
Matjaz Kukar
University of Ljubljana
matjaz.kukar@fri.uni-lj.si
Abstract

Most clinical tasks are regularly documented and stored on electronic media, causing large amounts of data to be collected over time. Stored data may include either an explicit (date and time) or an implicit (order) time stamp at which the particular datum was valid. Most statistical, machine learning and data mining algorithms assume that the data they use is a random sample drawn from a stationary distribution; unfortunately, many of the databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them may have changed in the meantime. In medicine, where patients' data are regularly stored in a central computer database, similar situations may occur. Expert physicians may easily, even unconsciously, adapt to the changed environment, whereas machine learning and data mining tools cannot, unless they are explicitly designed to detect and adapt to the changed situation. In the paper we present a brief overview of methods for explicit and implicit handling of temporal data, and review techniques for dealing with concept drift in machine learning and data mining.
1 Introduction
Most clinical tasks are nowadays being documented and stored on electronic media, so that large amounts of data accumulate over time. However, huge amounts of data present entirely new problems, both for the data analysts and the end-users (physicians). Physicians who have to make diagnostic or therapeutic decisions based on these data may be overwhelmed by the sheer amount of data if their ability to reason with the data does not scale up to the computer's capabilities. In addition, most stored data include either an explicit (date and time) or an implicit (order) time stamp at which the particular datum was valid. Such time series may consist of thousands of numbers that describe only short time periods and are therefore especially difficult to interpret and reason about.

Without appropriate support, physicians facing such data may be left helpless and cannot effectively exploit all the available information. We can extract information from data only if we have appropriate analysis tools and enough time to apply them. Since humans usually have only limited time available, the only ways to increase the amount of information we can obtain from the data are either (1) to transform the data into a more abstract, comprehensible form, or (2) to use automated tools for data analysis:

1. Temporal abstractions may be used to summarize the course of the data; for a physician, a trend observed over a stretch of time has much more significance than an isolated finding.

2. Machine learning, data mining, and other data analysis tools may be used to detect, expose and utilize regularities from stored data, to provide physicians with generalized knowledge, and to apply it to solve new cases. As their insight in the problem is different from that of the physicians, they are a valuable source of alternative hypotheses.

In the paper we will mainly focus on the analysis of explicit and implicit temporal components in medical data. Typical approaches to explicit temporal data include temporal abstractions [29, 28], or transforming the temporal data into a series of entities (states, events and relations among them) in order to use them for efficient and comprehensible reasoning. We will briefly review some approaches for handling of explicit temporal (time-stamped) data and show that in many cases the temporal component is implicitly included in the data, yet often ignored. The aim of the paper is not in dealing with explicit temporal data, but in dealing with data where temporal information is only implicit. When such data are processed with temporally ignorant machine learning and data mining methods, anomalous results may occur because of this ignorance. We will review some simple, yet highly effective statistical and machine learning techniques for detecting and handling such temporal problems.

The paper is organized as follows. In Sec. 2 we review methods for handling temporal data, the problem of concept drift, and proposed solutions for dealing with drifting concepts. In Sec. 3 we describe the datasets we are using for demonstration and case study. In Sec. 4 we present the experimental results, and in Sec. 5 we conclude with a discussion.
2 Methods
Medical diagnostics is a complex process of data gathering and reasoning that should simultaneously integrate information from the medical history, clinical and laboratory trials and, most importantly, diagnostic test results. In most clinical institutions the patients' data are regularly stored in a central computer database. With time, more and more records that include confirmed diagnoses appear in the database. Such databases enable retrospective studies, where cases in which the outcome has already occurred are selected and analyzed, thus looking backward to assess potential risk factors and diagnostic principles. Retrospective studies naturally fit into Machine Learning and Data Mining application frameworks, which are becoming increasingly popular as a support tool in medical decision making. All clinical data are collected over (shorter or longer) time spans. In most cases the clinicians are aware of the temporal nature of their data, yet this awareness is rarely built into the analysis tools.

In several clinical tests or trials the patient's state is monitored continuously and the findings are time stamped and managed accordingly (e.g. ECG, EEG, long-term repetitive tests, ...). Such data are treated as time series and are dealt with by time series analysis methods. Frequently, several different time series must be examined over the same period of time in order to understand the patient's overall situation. This rather complex task has traditionally been the domain of dedicated experts.
Temporal abstractions are methods that can be used to obtain abstract descriptions of the course of (possibly multivariate) time series by extracting their most relevant features [19]. They are able to summarize the time course of multivariate data through abstracted episodes which are valid over a certain time period. Temporal abstractions can also be viewed as a compression mechanism, able to summarize the data through some sufficient statistics, such as mean and variance, and to describe the course of the series at an abstract level. Temporal abstractions are usually used as the first step in the process of automated reasoning, as well as for data preprocessing and data revision [3].
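To illustrate the idea of abstracted episodes, the following sketch derives qualitative states from a univariate series. The three states and the slope threshold `tol` are our own illustrative assumptions, not the abstraction mechanisms of [19]:

```python
def abstract_episodes(values, tol=0.5):
    """Summarize a univariate time series as (state, start, end) episodes,
    where the state ('increasing', 'steady' or 'decreasing') is derived
    from the difference between consecutive samples, thresholded by tol."""
    def state(delta):
        if delta > tol:
            return "increasing"
        if delta < -tol:
            return "decreasing"
        return "steady"

    episodes, start = [], 0
    current = state(values[1] - values[0])
    for i in range(1, len(values) - 1):
        s = state(values[i + 1] - values[i])
        if s != current:                      # a state change closes an episode
            episodes.append((current, start, i))
            start, current = i, s
    episodes.append((current, start, len(values) - 1))
    return episodes
```

Each episode is valid over a time interval, so a long series of raw samples can collapse into a handful of qualitative statements.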
To make it clear at the beginning: every data collection has an implicit temporal component, as the data are collected over a certain time period. Having collected a set of patient descriptions with confirmed diagnoses, the task of a Machine Learning algorithm is to build a model that characterizes the patients with respect to the correct diagnosis. A set of possible diagnoses is used as a target for the classification process. The generated model can subsequently be used for risk assessment or classification of new patients. The conditions under which clinical data are collected are usually relatively well controlled. However, one must be aware that even in the most strictly controlled environments, unexpected changes may happen. For instance, a crucial piece of equipment may start to fail and later gets replaced, personnel changes may happen, or new procedures may be introduced. Although the effects of such changes may not be visible immediately, it is necessary to act as soon as they are discovered. While humans can with relative ease gradually adapt to a changed situation, it is not the same with machines, not even with learning ones. Most machine learning algorithms employ mechanisms for dealing with noise in the training data (such as pruning of decision trees [11] and rules [7]), weight elimination [32], .... However, it is definitely not desirable that perfectly valid new examples, generated under changed conditions, are considered as noisy and therefore excluded from training. Consequently, generated models do not reflect the changed conditions until enough new examples are collected, and during this transition period their performance may degrade considerably.
Classical data mining in time series has several similarities with temporal abstractions. Namely, several important time series data mining problems basically reduce to locating certain patterns (shapes, trends, etc., generalized as motifs) in a longer time series, where motifs may or may not be known in advance. If the user can properly define problem-dependent motifs in advance, they can be used to qualitatively describe the whole time series. While there exists a vast body of work on efficiently locating known patterns (motifs) in time series [1, 18], the problem of discovering motifs without any prior knowledge about the regularities of the data under study has received far less attention [5]. Such an algorithm would potentially allow a user to find surprising patterns without specifying in advance what a surprising pattern looks like. We are interested in looking for surprising patterns, i.e., combinations of data points whose structure and frequency somehow defy our expectations. The problem is referred to under various names in the literature, including novelty detection [8], anomaly detection [33], and structural change detection [26]. There exist efficient probabilistic algorithms [5] and statistical tests [26] for this task.
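The simplest way to locate a known motif is brute force: slide it over the series and compare z-normalized windows with Euclidean distance. The sketch below (function names are ours) shows the idea; practical systems use far more efficient indexing schemes [1, 18]:

```python
import math

def znorm(x):
    """Z-normalize a window so matching is offset- and scale-invariant."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd if sd > 0 else v - mu for v in x]

def find_motif(series, motif):
    """Return (start_index, distance) of the best-matching window for a
    known motif, under Euclidean distance of z-normalized subsequences."""
    m, q = len(motif), znorm(motif)
    best_i, best = -1, float("inf")
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, q)))
        if d < best:
            best_i, best = i, d
    return best_i, best
```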
When the patient's state is monitored continuously and the data are time stamped, we deal with a time series. A time series often contains thousands or millions of observations, which makes it difficult to handle; it is therefore often necessary to transform the original time series into a new, more meaningful and computationally manageable dataset [3]. The new dataset should contain all information from the original time series, eliminate its computational problems, and be generally more useful. For this purpose, temporal abstractions are frequently used. Temporal abstraction can be stated as the following task: given time-stamped data, external events, and abstraction goals, produce abstractions of the data that interpret past and present states and trends and that are relevant for the given set of goals.
The approach that was introduced by Shahar [29, 28] employs an inference structure and related required knowledge that are specific to the task of abstracting time-oriented data, but independent of any particular domain. The theory underlying this method specifies the relevant temporal entities and the data-interpretation contexts that these entities create, by a knowledge-based temporal abstraction ontology: a theory of what entities, relations, and properties exist in any particular domain from the point of view of the temporal abstraction task. The five temporal abstraction subtasks [28] are temporal context restriction, vertical temporal inference, horizontal temporal inference, temporal interpolation, and temporal pattern matching.
Dojat [10, 9] distinguishes three types of temporal objects: states (that introduce the notion of duration), events (that have temporal dimensions and whose occurrence changes the state of the world), and chronicles (that are defined as an ordered collection of temporal objects which represent the real history of the world as it is perceived by the system). He also proposes three mechanisms that are used to dynamically modify the length and the location of a mobile temporal window that brings to light a set of temporal information useful to the current reasoning process.

Forgetting is crucial for all artificial or natural systems with memory. There are many forms of forgetting. Dojat [10, 9] describes two simple types: active forgetting, where information is explicitly erased during the reasoning process, and passive forgetting, where infrequently used information vanishes with time. When the amount of information to process tends to exceed the available resources, forgetting becomes essential. This is especially true for intelligent patient monitoring
systems. They assist the clinical staff in medical environments such as operating rooms or intensive care units, where decisions need to be taken quickly. If the information flood overloads the operators' sensory inputs, false positive alarms may become common and life-threatening situations may be overlooked. To aid the operators in real time, intelligent patient monitoring systems [9] should reason about the incoming information selectively and support a timely response.
Most data analysis methods assume that all data was generated by a single concept and is basically a random sample drawn from a stationary distribution [17]. In many cases, however, it is more accurate to assume that data was generated by a series of concepts that change over time. Traditional machine learning systems learn incorrect models when they erroneously assume that the underlying concept is stationary [17]. For classification systems, which attempt to learn a discrete function given examples of its inputs and outputs, this problem takes the form of changes in the target function over time and is known as concept drift [16, 20, 31, 34]. In this case, models learned from old examples may become invalid.

Recently, several systems have been developed that employ Machine Learning methods in real-life applications. They learn real-life concepts that tend to change over time [20, 31, 34]. An illustrative example comes from Text Mining, where users' interests in particular topics drift over time.

Concept drift, whether abrupt or gradual [15, 16], occurs over time. The evidence for changes in a concept is represented by the training examples, which are distributed over time. Hence old observations can become irrelevant to the current time period, and thus the learned knowledge can be outdated. Several methods have been suggested to cope with this problem, either by forgetting outdated induced knowledge, or by forgetting outdated training examples [13, 15, 21, 34]. Special techniques are applied when concepts can be expected to recur [15], either cyclically or associated with irregular phenomena. In both cases the approach is to identify stable concepts and the associated context-specific, locally stable concepts, and to store them for potential reuse.

The remainder of the paper reviews some relatively simple statistical and machine learning techniques devised to detect and cope with drifting concepts. Most of them are based on some form of forgetting (partial memory), which is, according to Dojat [10], crucial for artificial or natural systems with memory. We will apply the so-called passive forgetting [10], as this approach fits most naturally in the machine learning framework.
Partial memory learners are systems that select and maintain a portion of the past training examples, which they use together with new examples in subsequent training episodes. Such systems can learn by memorizing selected new facts, or by using selected facts to improve the current concept descriptions or to derive new ones. By carefully selecting the maintained examples, they can be less susceptible to overtraining when learning concepts that change or drift, as compared to learners that use other memory models [27, 34]. The key issues for partial memory learning systems are how they select the most relevant examples from the input stream, maintain them, and use them in future learning episodes. These decisions affect the system's classification accuracy, memory requirements, and ability to cope with changing concepts. A selection policy might keep each training example that arrives, while the maintenance policy forgets examples after a certain period of time. Such policies more or less bias the learner toward recent events, and, as a consequence, the system may forget about important but rarely occurring events. On the other hand, a learner that is strongly anchored to the past may perform poorly on changing concepts.

In the simplest case, examples that are irrelevant according to some time criterion (e.g. examples that are outdated) are deleted from the partial memory [27]. Hence, these instances are totally forgotten, and the examples that remain in the partial memory are equally important for the learning algorithm. Another possibility is to use gradual forgetting [21]. It can be implemented with a time-based forgetting function, which provides each example with a weight according to its time of occurrence; the importance of an example diminishes with time. The drawback of this approach is that machine learning algorithms need to implement techniques for dealing with unequally important examples.
Windowing. A popular partial memory approach is to keep a window of the most recent w examples; as new examples arrive they are inserted into the beginning of the window, a corresponding number of examples is removed from the end of the window, and the learner is reapplied [34]. As long as w is small relative to the rate of concept drift, this procedure assures availability of a model suited to the current concept generating the data. If the window is too small, however, this may result in insufficient examples to satisfactorily learn the concept. Further, the computational cost of re-learning after every new example may be considerable.
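The windowing procedure can be sketched as follows, with `fit` standing for an arbitrary batch learner (a hypothetical stand-in, not a specific algorithm from [34]):

```python
from collections import deque

def windowed_learning(stream, w, fit):
    """Maintain a sliding window of the last w training examples and
    re-fit the learner whenever a new example arrives."""
    window = deque(maxlen=w)   # oldest examples fall off the end automatically
    models = []
    for example in stream:
        window.append(example)
        models.append(fit(list(window)))  # re-learn on the current window only
    return models
```

With w small relative to the drift rate, the latest model reflects only post-drift examples.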
Gradual forgetting. The principal idea behind gradual forgetting is that natural forgetting is a gradual process. This means that newer training examples should be more important than older ones, and their importance should decrease with time. The importance of an example is given by its weight w = f(t). The calculated weights must lie in an interval that is suitable for the applied learning algorithms.
Assuming that training examples arrive at equal time steps, Koychev [21] suggests using a linear gradual forgetting function, defined as follows:

    w_i = -(2k / (n - 1)) i + 1 + k,    (2)

where i is a counter of observations starting from the most recent one (i = 0) and going back over time up to i = n - 1, n is the length of the observed training sequence, and k is a parameter that determines by what fraction the weight of the oldest observation is decreased (and, consequently, the weight of the most recent one increased) in comparison to the average weight of 1. By varying the parameter k, the slope of the forgetting function can be adjusted.

Within the same framework, a kernel function for example weighting can also be used (Eq. 3):

    w_i = (1 / (√(2π) k)) e^(-d² / (2k²)).    (3)

Here d = i/n is the relative time distance to the training example from the past, and k is a real-valued kernel parameter. Both forgetting functions (Eq. 2 and Eq. 3) were used in our experiments.
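Both weighting schemes are straightforward to compute. A sketch, with i = 0 denoting the most recent example as in Eq. 2:

```python
import math

def linear_weights(n, k):
    """Linear gradual forgetting (Eq. 2): weights decrease linearly from
    1 + k (most recent example) down to 1 - k (oldest), averaging 1."""
    return [-2.0 * k / (n - 1) * i + 1.0 + k for i in range(n)]

def kernel_weights(n, k):
    """Gaussian kernel forgetting (Eq. 3) with relative time distance d = i / n."""
    return [math.exp(-((i / n) ** 2) / (2 * k * k)) / (math.sqrt(2 * math.pi) * k)
            for i in range(n)]
```

Note that the linear weights average exactly 1, so the total "mass" of the training set is preserved; the kernel weights decrease monotonically with the distance from the present.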
Setting the parameters. While we have quite a few options for dealing with drifting concepts, they all require parameter adjustment (window size, slope of the linear function, kernel parameter). Because we cannot detect drift until it has happened, these parameters cannot be optimally set in advance, unless we know the exact extent of the drift. Therefore we always start with a certain amount of drifted data that can be used for optimizing the parameters [20], such as the window size, the slope of the linear forgetting function, or the kernel parameter.
2.5 Structural change tests

The assumption in data mining is that the data is randomly drawn from a stationary distribution. When the underlying distribution changes (e.g. over time), this change can be detected with structural change tests at a certain confidence level α. Such tests have been studied extensively in statistical and econometric research. The most important classes of tests on structural change are the tests from the generalized fluctuation test framework (CUSUM and MOSUM tests) [22] on one side, and tests based on F statistics [6, 2] on the other. A topic that has gained increasing attention is the monitoring of structural change, i.e., to start after a history phase (without structural changes) to analyze new observations, and to be able to detect a structural change as soon after its occurrence as possible.

The tests assume the standard linear regression model

    y_i = x_i^⊤ β_i + u_i,    (i = 1, ..., n),    (4)

where, at time i, y_i is the observation of the dependent variable, x_i = (1, x_{i2}, ..., x_{ik})^⊤ is a k × 1 vector of observations of the independent variables, u_i are iid(0, σ²) disturbances, and β_i is the k × 1 vector of regression coefficients. Structural change tests assess the null hypothesis that the regression coefficients remain constant,

    H₀: β_i = β₀,    (i = 1, ..., n),    (5)

against the alternative that the coefficient vector varies over time, with certain tests being more or less suitable (i.e., having good or poor power) for certain patterns of deviation from the null hypothesis. The regression coefficients β_i are estimated with the ordinary least squares (OLS) method.
2.5.1 F tests

The most flexible approach to investigate whether the null hypothesis of no structural change holds is to use F test statistics. F tests are designed to test against a single-shift alternative. Thus, the alternative can be formulated on the basis of the model (Eq. 4) as

    β_i = β_A    (1 ≤ i ≤ i₀),
    β_i = β_B    (i₀ < i ≤ n),    (6)

where i₀ is some change point in the interval (k, n − k). Chow [6] was the first to suggest such a test on structural change for the case where the (potential) change point i₀ is known. He proposed to fit two separate regressions for the two subsamples defined by i₀ and to reject whenever the test statistic

    F_{i₀} = (û^⊤ û − ê^⊤ ê) / (ê^⊤ ê / (n − 2k))    (7)

exceeds some critical value. Here ê = (ê_A, ê_B)^⊤ are the residuals from the full model, where the coefficients in the two subsamples are estimated separately, and û are the residuals from the restricted model, where the parameters are fitted once for all observations. The test statistic F_{i₀} has an asymptotic χ² distribution with k degrees of freedom and, under the assumption of normality, F_{i₀}/k has an exact F distribution with k and n − 2k degrees of freedom. The major drawback of the Chow test is that the change point has to be known in advance. A natural extension of the Chow test is to calculate the F statistics for all potential change points, or for all potential change points in a given interval, and to reject the null hypothesis when their supremum is too large.
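Eq. 7 can be computed directly from two OLS fits. A sketch on synthetic data with a known mean shift (the intercept-only design and the data are illustrative assumptions):

```python
import numpy as np

def chow_f(y, X, i0):
    """Chow test statistic (Eq. 7) for a given potential change point i0:
    u are residuals of the restricted model (one fit for all observations),
    e are residuals of the full model (separate fits on the two subsamples)."""
    def rss(yy, XX):
        beta, *_ = np.linalg.lstsq(XX, yy, rcond=None)
        r = yy - XX @ beta
        return r @ r

    n, k = X.shape
    u = rss(y, X)                                   # restricted model
    e = rss(y[:i0], X[:i0]) + rss(y[i0:], X[i0:])   # full model
    return (u - e) / (e / (n - 2 * k))

# Intercept-only model (k = 1) with a mean shift at observation 50.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])
X = np.ones((100, 1))
```

Scanning `chow_f` over all admissible change points and taking the supremum gives the extension mentioned above.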
So far we have dealt with the retrospective detection of structural changes in given data sets. Several structural change tests have been extended to monitoring of linear regression models where new data arrive over time [26]. Such forward-looking tests are closely related to sequential tests. When new observations arrive, estimates are computed sequentially from all available data (the historical sample plus the newly arrived data) and compared to the estimate based only on the historical sample. As in the retrospective case, the model is

    y_i = x_i^⊤ β_i + u_i,    (i = 1, ..., n, n + 1, ...),    (8)

i.e., we expect new observations to arrive after time n, when the monitoring begins.
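A drastically simplified sketch of the monitoring idea for an intercept-only model follows. The constant decision boundary `crit` is an illustrative assumption; the actual monitoring procedures of [26] use boundary functions that control the overall significance level:

```python
import math

def monitor(history, stream, crit=3.0):
    """Sequential monitoring sketch for an intercept-only model: after the
    history phase, accumulate deviations of new observations from the
    historical mean, standardized by the historical variance. Returns the
    index (within the stream) at which a change is signalled, or None."""
    n = len(history)
    mu = sum(history) / n
    sigma2 = sum((x - mu) ** 2 for x in history) / (n - 1)
    s, m = 0.0, 0
    for j, x in enumerate(stream):
        s += x - mu
        m += 1
        z = s / math.sqrt(sigma2 * m)   # standardized cumulative deviation
        if abs(z) > crit:
            return j
    return None
```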
3 Materials
For an illustration of how to use structural change tests, we take from Goldberger et al. [12] two interesting time series: series 1 (Fig. 1) and series 2 (Fig. 3). Each series contains evenly spaced measurements of the instantaneous heart rate from a single subject. The two subjects were engaged in comparable activities for the duration of each series. The measurements (in units of beats per minute) occur at 0.5 second intervals, so that the length of each series is exactly 15 minutes. The rapid oscillations visible in series 1 are caused by respiratory sinus arrhythmia, a modulation of heart rate that is greatest in young subjects and gradually declines with age. On the other hand, series 2 belongs to the class of congestive heart failure, where circulatory delays interfere with the regulation of carbon dioxide and oxygen in the blood. Both time series contain anomalies near the beginning and the end of the observation period.
For an illustration of implicit temporal data mining methods and the simultaneous use of structural change statistical tests, we focused on the Nuclear dataset [14, 25] for three reasons:

- we have been working on this dataset for quite some time and therefore know it pretty well,
- we have close relations with the physician who collected the data and who also provided the original diagnoses,
- it was possible to order the patients by the date of their examination, which makes the dataset suitable for studying temporal phenomena.

The Nuclear dataset describes the problem of diagnosing coronary artery disease (CAD), which is caused by a diminished blood flow through the coronary arteries due to stenosis or occlusion. CAD produces impaired function of the heart muscle.

In our study we used a dataset of 327 patients (250 males, 77 females) who underwent clinical and laboratory examinations, exercise ECG, myocardial scintigraphy and coronary angiography because of suspected CAD. The features from the performed examinations were used to describe each patient. In 228 cases the disease was angiographically confirmed and in 99 cases it was excluded. 162 patients had suffered from a recent myocardial infarction. The patients were selected from a population of approximately 4000 patients who were examined at the Nuclear Medicine Department between 1991 and 1994. We selected only the patients with complete diagnostic procedures (all four levels) [25]. Results of the fourth level (coronary angiography) were taken as the gold standard.
4 Results
In Figs. 1 to 4 we depict the original time series (Figs. 1 and 3) and the respective critical values of the F tests (Figs. 2 and 4) at the significance level α = 0.05. In series 1 (respiratory sinus arrhythmia), several structurally changed intervals are detected: besides the true anomalies, the test detects another interval with raised heart rate in the middle of the time series. In series 2 (congestive heart failure) two structurally changed intervals are detected; they are both associated with true anomalies.

As we can see, F tests are quite good at detecting structural changes. Note that such a periodic time series is a more difficult problem than a monotonic one. It is therefore conceivable that structural change tests may be of use in data mining of medical time series in general.
In this section we demonstrate how we can easily and efficiently detect and (at least partially) compensate for a concept drift hidden in implicit temporal data. Our case study is the diagnostics of coronary artery disease [14, 25]. This is a two-class problem, diagnosing whether the patients suffer from coronary artery disease (CAD) or not. The data were collected in the years between 1991 and 1994. After performing a leave-one-out test on the whole dataset and ordering the results by the date of examination, we observed that the classification accuracy of both physicians and the naive Bayesian classifier in the last observed year (1994) was considerably lower than in the previous years. Such a degradation could be caused either by a changed class distribution or by a change of the underlying concept. The former seems not to be the case in our problem, since the class distribution (see class prevalence in Fig. 5) does not change significantly over the observed time interval. This leads us to the question of what has happened and how we can deal with it.
[Figure 5 about here.]
It is therefore important that practical applications where Machine Learning tools are being used employ techniques for detecting and dealing with time-changing concepts. While concept drift may not happen very often, it may have serious consequences when it does. Besides compensating for changed conditions, it is also important for a Machine Learning system to decide when to rebuild a model to account for newly arrived training examples, and what examples to use for rebuilding.
For testing different methods for dealing with concept drift we applied the following methodology. All examples were ordered by the time of the patient's examination. When we needed to start with some initial training set, we fixed for this purpose the first 100 out of 327 examples. Performance on this set was evaluated with leave-one-out testing. Further testing was done in single steps, where the potential training set consisted of the first n examples, 100 ≤ n ≤ 326, and the testing set consisted of the (n + 1)-st example. From the training examples, either only the last w (the window size) were used for training, or all of them, weighted by a gradual forgetting function.
Our experimental Machine Learning tool of choice was the naive Bayesian classifier. It was suitable for our purpose because of its very fast, incremental learning, and because it can easily be modified for dealing with unequally important (weighted) training examples.
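Such a modification replaces counts with weighted counts. The sketch below is our own minimal implementation over discrete features with Laplace-style smoothing, not the exact variant used in the experiments:

```python
from collections import defaultdict

class WeightedNaiveBayes:
    """Naive Bayesian classifier over discrete features that accepts a
    weight per training example, so gradual forgetting weights can be
    plugged in directly."""

    def fit(self, X, y, weights):
        self.classes = sorted(set(y))
        self.class_w = defaultdict(float)
        self.count = defaultdict(float)   # (class, feature, value) -> weight
        for xi, yi, wi in zip(X, y, weights):
            self.class_w[yi] += wi
            for f, v in enumerate(xi):
                self.count[(yi, f, v)] += wi
        self.total_w = sum(self.class_w.values())
        return self

    def predict(self, x):
        best, best_p = None, -1.0
        for c in self.classes:
            p = (self.class_w[c] + 1.0) / (self.total_w + len(self.classes))
            for f, v in enumerate(x):
                # smoothing denominator +2 assumes binary feature values
                p *= (self.count[(c, f, v)] + 1.0) / (self.class_w[c] + 2.0)
            if p > best_p:
                best, best_p = c, p
        return best
```

With weights biased toward recent examples, the classifier follows the new concept even when the raw counts still favor the old one.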
Detecting the concept drift. The first thing was to find out how a concept drift can be detected. If we want to detect it, it must have already happened, or at least have started to happen; thus we can detect it only with some delay. In diagnostic problems, where sooner or later the diagnoses are confirmed, our task is relatively easy. When we collect enough drifted examples, the drift is reflected in a significantly decreased average classification accuracy in the recent past, compared with the average classification accuracy achieved in the distant past (see the average classification accuracy in Fig. 6). In prognostic problems the situation is more difficult, since the true outcomes (and thus the correctness of prognoses) may not be known for a long time, if ever. In such cases it may be more useful to use a measure of reliability estimation [23, 24] that assigns a kind of confidence value to every prediction (see the reliability estimation in Fig. 6). Although the actual outcomes may not be known, one can still observe a significant decrease in the average reliability of predictions. In Fig. 6 it can clearly be observed that the drift has been happening since the beginning of the last observed year (1994), that is, from example 261 on.
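The accuracy-based criterion sketched above can be written down directly. The window length and drop threshold below are illustrative choices, not the values used in the study:

```python
def accuracy_drop(correct, recent=30, drop=0.15):
    """Signal concept drift when the average classification accuracy over
    the most recent examples falls `drop` below the average accuracy on
    all earlier examples. `correct` is a 0/1 sequence of per-example
    test outcomes (1 = correctly classified)."""
    if len(correct) <= recent:
        return False
    past = correct[:-recent]
    past_acc = sum(past) / len(past)
    recent_acc = sum(correct[-recent:]) / recent
    return past_acc - recent_acc > drop
```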
Detecting the concept drift with a structural change test. For a quantitative evaluation of the concept drift, structural change statistical tests were used. Results of the F test performed on the original non-smoothed data (the raw 0/1 data of Fig. 5) are shown in Fig. 7. At the significance level α = 0.05, we can detect structural changes in the first 30 patients in year 1991 as well as in the patients from 261 on in year 1994. The first structural change was admittedly caused by a biased data collection, while the second one truly represents a concept drift.
We wanted to check whether this concept drift is also reflected in the structure of the data. For this purpose we plotted the most important principal component for each patient (Fig. 8) and tested it for a structural change (Fig. 9). On average, the most important principal component explained about 35% of the total data variance. As one can see in both figures, there are no significant trends or regularities in Fig. 8, although the components are slightly decreasing with time. Also, the test statistic values of the structural change F test are far below the critical value at the significance level α = 0.05 (Fig. 9), so the structure of the data shows no significant change.

We also applied monitoring of structural change. We started after a history phase without structural changes (we left the first 30 patients out) and incrementally analyzed and tested new observations for a structural change.
We found that the F test is quite responsive in this matter: it correctly found the location (patient 261) of the significant structural change 23 patients after the actual start of the drift. The 23 patients correspond to approximately three months in real time.

Fig. 10 shows the situation 20 patients after the beginning of the drift, when it was not detected yet. In contrast, Fig. 11 depicts the situation 25 patients after the beginning of the drift, when the presence of the structural change and its location in time are already clearly detected. If the data were monitored continuously in clinical practice, the drift could have been detected a few months after its beginning, and not ten years later.
Dealing with the concept drift. For dealing with the concept drift we applied windowing as well as linear and kernel-based gradual forgetting (Sec. 2.4.3). We used the first 261 non-drifted examples for parameter optimization (window size, slope and kernel parameter). In order to evaluate the quality of the obtained parameters, we also optimized the parameters on the drifted set; for comparison we selected the best achieved results. The obtained parameter values were then tested on the last 66 drifted examples.
As we can see, differences in accuracy between the optimized and the actual best parameter values exist, but they are small. By using the optimized parameter values, the average performance on the whole dataset was 94-95% for all three methods, and none of them performed significantly better than the other ones. However, we can see an improvement in the overall accuracy of the naive Bayesian classifier by 4%. This is no small achievement, since it actually reduces the error rate by 44% (from 9% to 5%). Even more important is that the performance on the drifted examples (the last 66) is much higher (by 20%) for the naive Bayesian classifier (from 64% to 83-85%). This means that we can almost level the performance on this problematic subset with the overall performance, and it should equal it when a few more training examples arrive.
We also compared the performance of the ordinary naive Bayesian classifier with the windowing and gradual forgetting methods for different parameter values. For training the ordinary naive Bayesian classifier, the non-drifted examples (the first 261) were used and leave-one-out testing was performed; the other methods were evaluated incrementally, as described above.
5 Discussion
In the paper we have focused on, and briefly reviewed, some approaches for handling of explicit and implicit temporal data in medicine.

Explicit temporal data (time series) are frequently used in medical and other studies. We briefly reviewed numerical (time series analysis) and symbolic (temporal abstractions) methods for dealing with abundant temporal data. They are well established for explicit temporal data.

On the other hand, in implicit temporal data the temporal component is hidden, yet always present, and determines the ordering of the examples, as they were collected over certain periods of time. More often than not, the hidden temporal components of collected datasets are ignored. This may cause unfortunate anomalies and worsen the results, as our case study shows. We reviewed statistical tests for detecting structural changes (concept
drift), and three different machine learning methods for dealing with changing concepts. We found that in our case study of coronary artery disease diagnostics all of them perform reasonably well. While they all require setting certain parameters, these can be automatically tuned on the training set [20], and nearly optimal results can be expected. In windowing, at most n (the size of the training set) re-runs of the training algorithm are required for window size selection, whereas for gradual forgetting the slope or kernel parameter can be optimized in a similar manner.
595
some data available. It can be detected by using classification accuracy (in diag-
596
597
598
confidence level. Our experiments with monitoring for a structural change tests
599
show that they are able to precisely and quickly locate the starting point of concept
600
drift. Namely, one should rebuild (re-learn) the model only when absolutely neces-
601
sary and adjust suitable parameters (e.q. window size, slope or kernel parameter)
602
to compensate for the drift. This is especially important for practical applications,
603
where rebuilding a model is not performed every time when a new training exam-
604
ple arrives. Model rebuilding may require a presence of a machine learning expert,
25
605
606
stored and used independently of the learner (e.g. in a handheld device, or even
607
printed on a paper). In such cases a model should be rebuilt and deployed only
608
609
Adapting to the drift clearly pays off in our case study.
It actually reduces the (already low) overall error rate by 44% (from 9% to 5%), and
the improvement is even larger on the drifted
examples, where it was about 20% (from 64% to 83-85%). This means that the
performance on the drifted examples almost reaches the overall performance.
A very encouraging result is also its correct and quick pinpointing of the location
(patient 261) of the significant structural change, only 23 patients after the actual
start of the drift. The 23 patients correspond to approximately three months in real
time. If the data were monitored and tested continuously, the drift could have been
detected a few months, and not ten years, after its beginning. We argue that any
(online) learning system that is used in practice should apply similar techniques to
monitor incoming data for concept drift.
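As an illustration of such a monitoring test, a minimal Chow-type F test [6] for a single mean shift can be sketched as follows. This is a toy reconstruction on synthetic data under our own assumptions (intercept-only model, fixed threshold standing in for the proper F-distribution quantile), not the fluctuation-test machinery used in the paper.

```python
# Chow-type F statistic: compare the residual sum of squares of one pooled
# mean model against two separate means split at a candidate break point.
def rss(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def chow_f(xs, k):
    """F statistic for a mean shift at position k (intercept-only model)."""
    n = len(xs)
    rss_pooled = rss(xs)
    rss_split = rss(xs[:k]) + rss(xs[k:])
    p = 1  # one fitted parameter (the mean) per segment
    return ((rss_pooled - rss_split) / p) / (rss_split / (n - 2 * p))

# Synthetic series: level shift from 0 to 3 after observation 30,
# plus small deterministic "noise".
series = [0.0] * 30 + [3.0] * 10
series = [x + 0.1 * ((i * 37) % 11 - 5) for i, x in enumerate(series)]

# Scan candidate break points and keep the one maximizing the F statistic.
k_best = max(range(5, len(series) - 5), key=lambda k: chow_f(series, k))
f_best = chow_f(series, k_best)
print("most likely break point:", k_best, "F =", round(f_best, 1))
```

In a real monitoring setting one would compute such statistics sequentially as new examples arrive and compare them against a critical value at the chosen confidence level, rather than scanning a finished series.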
There are several things that can be done to further develop the described meth-
ods. Most notably, statistical tests should be integrated with machine learning and
data mining methods more thoroughly, in order to continuously check for the pos-
sibility of a structural change (concept drift) and to guide the re-learning of the
models. Also, a weighting scheme for gradual forgetting should be devised that
does not need to be recalculated every time a new training example arrives. This
would enable true incremental learning; however, it would require from the learner
the ability to handle weighted examples incrementally.
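The gradual-forgetting weights themselves are cheap to compute; the expense lies in re-weighting, and re-training on, all past examples whenever a new one arrives. The sketch below shows one plausible form of the two weighting schemes; the exact formulas and the interpretation of the slope and kernel-size parameters are our assumptions, not taken from the paper.

```python
# Illustrative gradual-forgetting weights: a linear scheme controlled by a
# slope k, and a kernel (exponential decay) scheme controlled by a kernel
# size k. Recent examples get weight near 1, older examples progressively less.
import math

def linear_weights(n, slope):
    """Weight decreases linearly with age; slope=0 means no forgetting."""
    # age 0 = newest example, age n-1 = oldest example
    return [max(0.0, 1.0 - slope * age / (n - 1)) for age in range(n)]

def kernel_weights(n, size):
    """Exponential decay; a smaller 'size' forgets faster."""
    return [math.exp(-age / (size * n)) for age in range(n)]

n = 261  # number of pre-drift training examples in the case study
lin = linear_weights(n, slope=0.9)   # slope value from Table 2
ker = kernel_weights(n, size=0.25)   # kernel size value from Table 2
print(round(lin[0], 2), round(lin[-1], 2))   # newest vs. oldest linear weight
print(round(ker[0], 2), round(ker[-1], 2))   # newest vs. oldest kernel weight
```

For a learner such as naive Bayes, these weights would enter the frequency counts (each example contributing its weight instead of 1); an incremental scheme would need weights that can be updated by a single multiplicative factor per time step, which the exponential form permits but the linear form does not.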
There is one question left unanswered, and that is how and why in our case
study a concept drift occurred in the first place. Since the reasons are rather
delicate and personal in their nature, it will suffice to say that there was a serious
change in the clinical setting at that time.
Acknowledgements
We thank the physicians of the University Medical Centre
Ljubljana for collecting the data, and Michel Dojat, Unite mixte INSERM-UJF,
for sending me his most interesting work. This work was supported by the Slovenian Ministry
of Science.
References
[5] B. Chiu, E. Keogh, J. Lin, and S. Lonardi. Efficient discovery of unusual patterns in time series. In Proc. KDD02, pages 550-556, 2002.
[6] G. C. Chow. Tests of equality between sets of coefficients in two linear regressions. Econometrica, 28:591-605, 1960.
[7] W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russel, editors, Proc. 12th Intl. Conf. on Machine Learning ICML95, pages 115-123, San Francisco, 1995. Morgan Kaufmann.
[8] D. Dasgupta and S. Forrest. Novelty detection in time series data using ideas from immunology. In Proc. 5th International Conference on Intelligent Systems, 1996.
[10] M. Dojat and C. Sayettat. A realistic model for temporal reasoning in real-time patient monitoring. Applied Artificial Intelligence, 10:121-143, 1996.
[11] F. Esposito, D. Malerba, and G. Semeraro. Simplifying decision trees by pruning and grafting: new results. In N. Lavrac and S. Wrobel, editors, Proc. Europ. Conf. on Machine Learning ECML-95, 1995. Springer.
York, 1991.
[13] I. Grabtree and S. Soltysiak. Identifying and tracking changing interests. International Journal of Digital Libraries, 2:38-53, 1998.
[15] M. B. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning, 32:101-126, 1998.
[16] D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14:27-45, 1994.
[17] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the 7th ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining, pages 97-106, San Francisco, CA, 2001. ACM Press.
[18] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, 3(3):263-286, 2001.
fanelli, and J. Wyatt, editors, Lecture Notes in Medical Informatics, pages 67-90.
[20] R. Klinkenberg and T. Joachims. Detecting concept drift with support vector machines. In Proc. 17th International Conference on Machine Learning, pages 487-494, Stanford, US, 2000. Morgan Kaufmann.
[22] C. M. Kuan and K. Hornik. The generalized fluctuation test: A unifying view. Econometric Reviews, 14:135-161, 1995.
[23] M. Kukar. Making reliable diagnoses with machine learning: A case study. In Silvana Quaglini, Pedro Barahona, and Steen Andreassen, editors, Proc. Artificial Intelligence in Medicine Europe, AIME 2001, pages 88-96, Cascais, Portugal, 2001. Springer.
[24] M. Kukar and I. Kononenko. Reliable classifications with machine learning. In Proc. 13th European Conference on Machine Learning, ECML 2002, pages 219-231, 2002. Springer.
[25] M. Kukar, I. Kononenko, C. Groselj, K. Kralj, and J. Fettich. Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artificial Intelligence in Medicine, 16(1):25-50, 1999.
[26] F. Leisch, K. Hornik, and C. M. Kuan. Monitoring structural changes with the generalized fluctuation test. Econometric Theory, 16:835-854, 2000.
[27] M. A. Maloof and R. S. Michalski. Selecting examples for partial memory learning. Machine Learning, 41(1):27-52, 2000.
[28] Y. Shahar. A framework for knowledge-based temporal abstraction. Artificial Intelligence, 90(1-2):79-133, 1997.
[29] Y. Shahar and M. A. Musen. Knowledge-based temporal abstraction in clinical domains. Artificial Intelligence in Medicine, 8(3):267-298, 1996.
[30] C. E. Shannon and W. Weaver. The mathematical theory of communication. The University of Illinois Press, Urbana, IL, 1949.
[31] N. A. Syed, H. Liu, and K. K. Sung. Handling concept drifts in incremental learning with support vector machines. In Knowledge Discovery and Data Mining, pages 317-321, 1999.
Systems, volume 3, pages 875-882, Denver, CO, USA, 1991. Morgan Kaufman.
[34] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101, 1996.
[35] A. Zeileis, F. Leisch, K. Hornik, and C. Kleiber. strucchange: An R package for testing for structural change in linear regression models. Journal of Statistical Software, 7(2):1-38, 2002.
Table 1: Diagnostic attributes at different diagnostic levels.

Diagnostic level               Nominal  Numeric  Total
Signs, symptoms and history         23        7     30
Exercise ECG                         7        9     16
Myocardial scintigraphy             22        9     31
Coronary angiography                 1        0      1
Total attributes                    53       25     78

Disease prevalence: 70% positive, 30% negative
Entropy of classes: 0.89 bit
Table 2: Performance of the naive Bayesian classifier with windowing and gradual forgetting, compared to physicians and the ordinary naive Bayesian classifier.

                                          Optimized          Best achieved     Overall
Method                 Parameter          Value   Accuracy   Value  Accuracy   Accuracy
Naive Bayes, windowed  Window size        100     85%        70     88%        95%
Naive Bayes, linear    Slope (k)          0.90    83%        0.9    83%        94%
Naive Bayes, kernel    Kernel size (k)    0.25    83%        0.17   86%        95%
Physicians             -                  -       70%        -      70%        85%
Ordinary Naive Bayes   -                  -       64%        -      64%        91%
[Plot content for Figures 1-5 was lost in extraction; the plots show attribute V1 over time, F statistics for a structural change over time, and reliability/classification-accuracy estimates over the years 1991-1995.]
Figure 6: Detecting concept drift with classification accuracy and reliability in the
Nuclear dataset.
Figure 7: F test for a structural change in data from Fig. 5 (α = 0.05).
Figure 8: Most important principal components, their running median (middle line)
and its range (top and bottom line).
Figure 9: F test on the principal components from Fig. 8. All test statistics are far below the critical value (α = 0.05).
Figure 10: Monitoring F test for concept drift (20 drifted examples). Concept drift
is not detected yet.
Figure 11: Monitoring F test for concept drift (25 drifted examples). Concept drift is determined as significant (α = 0.05).
Figure 12: Parameter tuning: performance on the drifted examples. The kernel size and slope (k) parameters are represented as real values. Window size is represented as a share of the whole training set (all 261 training examples = 1.0).
Figure 13: Catching up with the drift with windowing in the Nuclear dataset. Notice the negative effect of a too small window size (w = 50).
Figure 14: Catching up with the drift with gradual forgetting in the Nuclear dataset.
Differences between linear and kernel-based forgetting are almost negligible.