Aiim Special Issue

1
Explicit and implicit handling of time in clinical
data mining
Matjaz Kukar
University of Ljubljana
Faculty of Computer and Information Science
Trzaska 25, SI-1001 Ljubljana, Slovenia
matjaz.kukar@fri.uni-lj.si
10
Abstract
11
Most clinical tasks are regularly documented and stored on electronic media,
12
this causing large amounts of data to be collected over time. Stored data may
13
include either an explicit (date and time) or an implicit (order) time stamp
14
in which the particular datum was valid. Most statistical, machine learning
15
and data mining algorithms assume that the data they use is a random sample
16
drawn from a stationary distribution. Unfortunately, many of the databases
17
available for mining today violate this assumption. They were gathered over
18
months or years, and the underlying processes generating them may have
19
changed during this time, sometimes radically (the phenomenon known in
20
machine learning as a concept drift). In clinical institutions, where the pa-
21
tients data are regularly stored in a central computer databases, similar situa-
22
tions may occur. Expert physicians may easily, even unconsciously, adapt to
23
the changed environment, whereas machine learning and data mining tools
24
may fail due to their underlaying assumptions. It is therefore important to
25
detect and adapt to the changed situation. In the paper we present a brief
26
overview of methods for explicit and implicit handling of temporal data, and
27
a more thorough review of selected statistical and machine learning tech-
28
niques for dealing with concept drift in machine learning and data mining
29
frameworks. We evaluate their possible use in clinical studies with a case
30
study of learning and monitoring data in coronary artery disease diagnostics.
31
32
33
Keywords: clinical studies, temporal learning, drifting concepts, partial mem-
34
ory learning, forgetting, machine learning, data mining.
35
1 Introduction
36
Most clinical tasks are nowadays being documented and stored on electronic me-
37
dia. This allows researchers to access previously unimaginable amounts of clinical
38
data. However, huge amounts of data present entirely new problems, both for the
39
data analysts and end-users (physicians). Physicians who have to make diagnos-
40
tic or therapeutic decisions based on these data may be overwhelmed by the sheer
41
amount of data if their ability to reason with the data does not scale up to the com-
42
puters capabilities. In addition to this, most stored data include either an explicit
43
(date and time) or an implicit (order) time stamp in which the particular datum was
44
valid. Such time series may consist of thousands of numbers that describe only
45
short time periods and are because of this especially difficult for interpretation and
46
subsequent reasoning.
47
A physician confronted with tens of megabytes of clinical data may be rather
48
helpless and cannot effectively exploit all the available information. This situation
49
is actually quite a common manifestation of the well-known phenomenon from
50
information theory [30]:
Information = f (Data, Cognitive structure, Time)
(1)
51
This tells us that we can extract information from data only if we have appropriate
52
cognitive (processing) capabilities and enough time to process the data. As we
53
humans usually only have limited time available, the only ways to increase the
54
amount of information we can obtain from the data are either (1) to transform the
55
data in a shape, more appropriate to our cognitive structure, or (2) to virtually
56
improve our cognitive structure by utilizing intelligent applications.
57
1. In a large dataset, a regularity, or (in temporal data) an emerging pattern
58
over a stretch of time has much more significance than an isolated finding
59
or even a set of findings. Experienced physicians are able to combine sev-
60
eral significant findings, to abstract such findings into clinically meaningful
61
higher-level concepts in a context-sensitive manner, and to detect significant
62
regularities in both low-level data and abstract concepts. It is thus desir-
63
able to provide short, informative, context-sensitive summaries clinical data
64
stored on electronic media, and to be able to answer queries about abstract
65
concepts that summarize the data.
66
2. Machine learning, data mining, and other data analysis tools may be used to
67
detect, expose and utilize regularities from stored data, and to provide physi-
68
cians an all-encompassing view on the problem at hand. Intelligent tools can
69
synthesize knowledge and emerging patterns from previously solved cases,
70
and apply it to solve new cases. As their insight in the problem is different
71
from that of the physicians, they are a valuable source of alternative hy-
72
potheses. When used in temporal problems, they are used in conjunction
73
with high-level abstractions and summarizations, as they usually cannot ex3
74
plicitly handle temporal data.
75
In the paper we will mainly focus on the analysis of explicit and implicit
76
temporal (time-stamped) data. Temporal data analysis presents several impor-
77
tant challenges that include time-stamped data preprocessing (transformation into
78
information-preserving and more useful form) by using temporal abstractions [3,
79
29, 28], or transforming the temporal data in a series of entities (states, events
80
and relations among them) in order to use them for efficient and comprehensible
81
reasoning [10, 9].
82
We will briefly review some approaches for handling of explicit temporal (time-
83
stamped) data and show that in many cases the temporal component is implicitly
84
included in the data, yet often ignored. The aim of the paper is not in dealing
85
with explicit temporal data, but in dealing with data where temporal information is
86
included in an implicit way. In many cases, especially in conjunction with time-
87
ignorant machine learning and data mining methods, anomalous results may occur
88
because of this ignorance. We will review some simple, yet highly effective statisti-
89
cal and machine learning techniques for detecting and handling temporal problems
90
and show their effect in a case study.
91
The paper is organized as follows. In Sec. 2 we review some related work
92
and proposed solutions for dealing with drifting concepts. In Sec. 3 we describe
93
the datasets we are using for demonstration and case study. In Sec. 4 we present
94
experimental results. Finally, in Sec. 5 we present some conclusions and directions
95
for future work.
96
2 Methods
97
2.1 Clinical decision-making and data mining
98
Clinical decision-making is a complicated process based on experience, judgement,
99
and reasoning that should simultaneously integrate information from the medical
100
literature and a variety of other sources, including quantitative results of clinical
101
trials and, most importantly, diagnostic test results. In most clinical institutions the
102
patients data are regularly stored in a central computer database. With time, more
103
and more records that include confirmed diagnoses appear in the database. Such
104
databases are frequently subject of retrospective studies. The patients in whom
105
the outcome has already occurred are selected and analyzed, thus looking back-
106
ward to assess potential risk factors and diagnostic principles. Retrospective stud-
107
ies naturally fit into Machine Learning and Data Mining application frameworks,
108
which are, due to ever-increasing amounts of data, becoming increasingly popu-
109
lar as a support tool in medical decision making. All clinical data are collected
110
over (shorter or longer) time spans. In most cases the clinicians are aware of their
111
temporal nature (e.g. for signal monitoring), but not allways.
112
2.1.1 Explicit temporal data
113
In several clinical tests or trials the patients state is monitored continuously and
114
the findings are time stamped and managed accordingly (e.g. ECG, EEG, long-
115
term repetitive tests, . . . ). Such data are treated as time series and are dealt with
116
accordingly. The analysis of multivariate time series is a difficult and frequent
117
problem in science in general, and in medicine in particular. It represents a cru-
118
cial challenge in clinical applications such as monitoring, where several parameters
119
must be examined over the same period of time in order to understand the patients
120
overall situation. This rather complex task has been traditionally the domain of de-
121
scriptive and inferential statistical techniques [4]. Recently, a methodology known
122
as temporal abstractions, based on artificial intelligence, has been proposed and
123
successfully exploited in several application domains [28, 3]. Temporal abstrac-
124
tions are methods that can be used to obtain abstract descriptions of the course of
125
(possibly multivariate) time series by extracting their most relevant features [19].
126
They are able to summarize the time course of multivariate data through abstracted
127
episodes which are valid over a certain time period. Temporal abstractions can be
128
viewed as the artificial intelligence alternatives to descriptive statistics. They are
129
able to summarize the data through some sufficient statistics, such as mean and
130
standard deviation in normal distributed observations. The number, duration and
131
type of temporal abstractions episodes can be considered a summary of the time
132
series at an abstract level. Temporal abstractions are usually used as the first step
133
in the process of automated reasoning, as well as for data preprocessing and data
134
revision [3].
135
2.1.2 Non-temporal (implicit temporal) data
136
To make it clear at the beginning: every data collection has an implicit temporal
137
component, as the data are collected over a certain time period. Having collected a
138
set of patient descriptions with confirmed diagnoses, the task of a Machine Learn-
139
ing algorithm is to automatically generate a model (a description) of the given data
140
with respect to the correct diagnosis. A set of possible diagnoses is used as a target
141
for classification process. The generated model can subsequently be used for risk
142
factors assessment and decision-making support.
143
In clinical trials, the experimental setup is supposed to be fixed and strictly
144
controlled. However, one must be aware that even in most strictly controlled envi-
145
ronments, unexpected changes may happen. For instance, a crucial piece of equip-
146
ment may start to fail and later gets replaced, personnel changes may happen, new
147
scientific discoveries may be absorbed in practice. While changes in the process
148
may not be visible immediately, it is necessary to act as soon as they are discov-
149
ered. While humans can with relative ease gradually adapt to changed situation, it
150
is not the same with machines, not even with learning ones. Most machine learning
151
algorithms already provide techniques for handling strange occurrences (noise) in
152
training data (such as pruning of decision trees [11] and rules [7]), weight elim-
153
ination [32], . . . . However, it is definitely not desirable that perfectly valid new
154
examples, generated under changed conditions, are considered as noisy and there-
155
fore excluded from training. Therefore, generated models do not reflect changed
156
conditions until enough new examples are collected. During this transition the
157
model performance on new examples would be poor.
158
2.2 Data mining in time series
159
Classical data mining in time series has several similarities with temporal abstrac-
160
tions. Namely, several important time series data mining problems basically reduce
161
to the core task of finding approximately repeated subsequences (patterns, shapes,
162
trends, etc., generalized as motifs) in a longer time series, where motifs may or
163
may not be known in advance. If the user can properly define problem-dependent
164
motifs in advance they can be used to qualitatively describe the whole time series.
165
While there exists a vast body of work on efficiently locating known patterns
166
(motifs) in time series [1, 18] the problem of discovering motifs without any prior
167
knowledge about the regularities of the data under study has received far less at-
168
tention [5]. Such an algorithm would potentially allow a user to find surprising
169
patterns in a massive database without having to specify in advance what a sur-
170
prising pattern looks like. We are interested in looking for surprising patterns,
171
i.e., combinations of data points whose structure and frequency somehow defies
172
our expectations. The problem is referred to under various names in the literature,
173
including novelty detection [8], anomaly detection [33], and structural change de-
174
tection [26]. There exist efficient probabilistic algorithms [5] and statistical tests
175
[35] for this purpose.
176
2.3 Explicit handling of time in clinical studies
177
When the patients state is monitored continuously and the data are time stamped
178
we deal with a time series. A time series often contains thousands or millions of
179
single measurement, especially if it is multivariate. In many cases it is very useful
180
to transform the original time series into a new, more meaningful and computation-
181
ally manageable dataset [3]. The new dataset should contain all information from
182
the original time series, eliminate its computational problems, and be generally
183
more useful. For this purpose, temporal abstractions are frequently used.
184
Temporal abstractions. The task of performing a temporal abstraction can be
185
viewed informally as a generic interpretation task: given a set of time-stamped
186
data, external events, and abstraction goals, produce abstractions of the data that
187
interpret past and present states and trends and that are relevant for the given set of
188
goals.
189
The approach that was introduced by Shahar [29, 28] employs an inference
190
structure and related required knowledge that are specific to the task of abstracting
191
higher-level concepts from time-stamped data in knowledge-based systems, but are
192
independent of any particular domain. The theory underlying this method is speci-
193
fied in a general, domain-independent way by a model of time, events, parameters,
194
and the data-interpretation contexts that these entities create, by a knowledge based
195
temporal abstraction theory.
196
The temporal abstraction mechanisms assume a task-specific temporal abstrac-
197
tion ontology a theory of what entities, relations, and properties exist in any
198
particular domain from the point of view of the temporal abstraction task and, in
199
particular, of the knowledge-based temporal abstraction method The five temporal
200
abstraction subtasks [28] are temporal context restriction, vertical temporal infer-
201
ence, horizontal temporal inference, temporal interpolation, and temporal pattern
202
matching, Given the domains temporal abstraction ontology, temporal abstraction
203
subtasks are performed by the five temporal abstraction mechanisms.
204
Temporal aggregation and forgetting. In an approach somewhat related to the
205
above temporal abstractions, Dojat [10] defines a temporal ontology consisting of
206
states (that introduce the notion of duration), events (that have temporal dimensions
207
and their occurrence modifies changes the state of the world), and chronicles
208
(that are defined as an ordered collection of temporal objects which represent the
209
real history of the world as it is perceived by the system). He also proposes three
210
higher-level forms of temporal abstractions: event-state relations that describe the
211
a cause-effect relationships between events and states, aggregation and forgetting
212
that are used to dynamically modify the length and the location of a mobile tempo-
213
ral window that brings to light a set of temporal information useful to the current
214
reasoning process.
215
Forgetting is crucial for all artificial or natural systems with memory. There
216
are many forms of forgetting. Dojat [10, 9] describes two simple types: active for-
217
getting where according to particular deductions some information are deliberately
218
erased during the reasoning process, and passive forgetting where infrequently used
219
information vanishes with time. When processing information tend to exceed the
220
capacity of the working memory (mental overload), forgetting mechanism are a
221
natural way to clean the outdated information.
222
These three abstractions are especially useful in modelling temporal aspects of
223
reasoning for real-time interpretation of clinical data such as real-time monitoring
224
systems. They assist the clinical staff in medical environments such as operating
225
rooms or intensive care units, where decisions need to be taken quickly. If the in-
226
formation flood is overloading operators sensory inputs, false positive alarms may
227
be common and causes for overlooking of life-threatening situations. To aid the op-
228
erators in real time, intelligent patient monitoring systems [9] should reason about
229
complex situations under constraints such as resource limitations and guarantee of
230
timely response.
231
2.4 Implicit handling of time in clinical studies
232
Most data analysis methods assume that all data was generated by a single con-
233
cept and is basically a random sample drawn from a stationary distribution [17].
234
In many cases, however, it is more accurate to assume that data was generated by
235
a series of concepts, or by a concept function with time-varying parameters. Tra-
236
ditional machine learning systems learn incorrect models when they erroneously
237
assume that the underlying concept is stationary if in fact it is changing or drifting
238
[17]. For classification systems, which attempt to learn a discrete function given
239
examples of its inputs and outputs, this problem takes the form of changes in the
240
target function over time that is known as a concept drift [16, 20, 31, 34]. In this
241
section we review some methods for dealing with a concept drift.
242
2.4.1 Drifting concepts
243
Recently, several systems have been developed that employ Machine Learning
244
methods in real life applications. They learn real-life concepts that tend to change
245
over time [20, 31, 34]. An illustrative example comes from Text Mining when
246
learning shifting human interests [13].
247
The concept drift, whether abrupt or gradual [15, 16], occurs over time. The
248
evidences for changes in a concept are represented by the training examples, which
10
249
are distributed over time. Hence the old observation can become irrelevant to the
250
current time period and thus the learned knowledge can be outdated. Several meth-
251
ods have been suggested to cope with this problem, either to forget outdated in-
252
duced knowledge, or to forget outdated training examples [13, 15, 21, 34].
253
Special techniques are applied when concepts can be expected to recur [15].
254
Recurring (oscilating) concepts may be due to cyclic phenomena or may be asso-
255
ciated with irregular phenomena. In both cases the approach is to to identify stable
256
concepts and the associated context specific, locally stable concepts, and store them
257
to be reused when appropriate.
258
The remainder of the paper aims to review some relatively simple statistical
259
and machine learning techniques devised to detect and cope with drifting concepts.
260
We will focus on forgetting of outdated training examples (learning with partial
261
memory), that is, according to Dojat [10], crucial for artificial or natural systems
262
with memory. We will apply the so-called passive forgetting [10] as this approach
263
is general and does not require significant changes in training algorithms.
264
2.4.2 Partial memory learning
265
Partial memory learners are systems that select and maintain a portion of the past
266
training examples, which they use together with new examples in subsequent train-
267
ing episodes. Such systems can learn by memorizing selected new facts, or by
268
using selected facts to improve the current concept descriptions or to derive new
269
concept descriptions. Researchers have developed partial memory systems because
270
they can be less susceptible to overtraining when learning concepts that change or
271
drift, as compared to learners that use other memory models [27, 34]. The key
272
issues for partial memory learning systems are how they select the most relevant
273
examples from the input stream, maintain them, and use them in future learning
274
episodes. These decisions affect the systems classification accuracy, memory re-
11
275
quirements, and ability to cope with changing concepts. A selection policy might
276
keep each training example that arrives, while the maintenance policy forgets ex-
277
amples after a fixed period of time.
278
These policies more or less bias the learner toward recent events, and, as a
279
consequence, the system may forget about important but rarely occurring events.
280
On the other hand, the learner that is strongly anchored to the past may perform
281
poorly if concepts change or drift.
282
2.4.3 Learning to forget
283
Most frequently, forgetting is implemented in an abrupt manner. That means the
284
examples that are irrelevant according to some time criteria (e.g. examples that
285
are outdated) are deleted from the partial memory [27]. Hence, these instances are
286
totally forgotten. The examples that remain in the partial memory are equally im-
287
portant for the learning algorithm. Another possibility is to use gradual forgetting
288
[21]. It can be implemented with a time based forgetting function, which provides
289
each example with a weight according to its occurring time. The importance of
290
an example diminishes with time. The drawback of this approach is that machine
291
learning algorithms need to implement techniques for dealing with unequally im-
292
portant examples.
293
Abrupt forgetting (windowing). A common approach to learning from time-
294
changing data is to repeatedly apply a traditional learner to a sliding window of
295
w examples; as new examples arrive they are inserted into the beginning of the
296
window, a corresponding number of examples is removed from the end of the win-
297
dow, and the learner is reapplied [34]. As long as w is small relative to the rate of
298
concept drift, this procedure assures availability of a model for the current concept
299
generating the data. If the window is too small, however, this may result in insuffi-
12
300
cient examples to satisfactorily learn the concept. Further, the computational cost
301
of reapplying a learner may be prohibitively high, especially if examples arrive at
302
a rapid rate and the concept changes quickly.
303
Gradual forgetting. The principal idea behind gradual forgetting is that natural
304
forgetting is a gradual process. This means that newer training examples should be
305
more important than older ones and their importance should decrease with time.
306
The importance of example is given with its weight w = f (t). The calculated
307
weights must be in an interval that is suitable for the applied learning algorithms.
308
309
Assuming that training examples arrive on equal time steps, Koychev [21] suggests using a linear gradual forgetting function, defined as follows:
wi =
2k
i+k+1
n1
(2)
310
where i is a counter of observations starting from the most recent one and it goes
311
back over time i = 0 . . . n 1 where n is the length of the observed training se-
312
quence, and k is a parameter that represents the percent of decreasing the weight of
313
the first observation and consequently the percent of increasing the weight of the
314
last one in comparison to the average. By varying the parameter k, the slope of the
315
forgetting function can be adjusted.
316
317
Within the same framework, a kernel function for example weighing can also
be used (Eq. 3).
2
1
d
wi =
e 2k2
2 k
(3)
318
Here d = i/n is a relative time distance to the training example from the past, and k
319
is a real-valued kernel parameter. Both forgetting functions (Eq. 2 and Eq. 3) were
320
utilized in experiments described in Sec. 4.
13
321
Setting the parameters. While we have quite a few options for dealing with
322
drifting concepts, they all require parameter adjustment (window size, slope of
323
linear function, kernel parameter). Because we cannot detect drift until it has hap-
324
pened, these parameters cannot be optimally set in advance, unless we know the
325
exact extent of the drift. Therefore we always start with certain amount of drifted
326
data, that can be used for parameter optimization [20], such as window size, slope
327
of linear function, kernel parameter.
328
2.5 A statistical view to implicit temporal data
329
The assumption in data mining is that the data is randomly drawn from a stationary
330
distribution. When the underlying distribution changes (e.g. over time), this change
331
can be detected with structural change tests for a certain confidence level .
332
The problem of detecting structural changes in time series (modelled by linear
333
regression relationships) is an important topic in statistical and econometric re-
334
search. The most important classes of tests on structural change are the tests from
335
the generalized fluctuation test framework (CUSUM and MOSUM tests) [22] on
336
one side, and tests based on F statistics [6, 2] on the other. A topic that has gained
337
a lot of interest recently, especially in conjunction with F tests, is to monitor for a
338
structural change, i.e., to start after a history phase (without structural changes) to
339
analyze new observations and to be able to detect a structural change as soon after
340
its occurrence as possible [26].
341
Let us consider the standard linear regression model

yi = xi| i + ui
(i = 1, ..., n),
(4)
342
where at time i, yi is the observation of the dependent variable, xi = (1, xi2 , ..., xik ) is
343
a k 1 vector of observations of the independent variables, with the first component
14
344
equal to unity, ui are identically independently distributed iid(0, 2 ), and i is the
345
k 1 vector of regression coefficients. Tests on structural change are concerned
346
with testing the null hypothesis of no structural change

H0 : i = 0
(i = 1, ..., n)
(5)
347
against the alternative that the coefficient vector varies over time, with certain tests
348
being more or less suitable (i.e., having good or poor power) for certain patterns of
349
deviation from the null hypothesis.
350
Regression coefficients i are estimated with the ordinary least squares (OLS)
351
estimate ( (i, j) ) based on the observations i + 1, . . . , i + j, and (i) = (0, i) is the
352
OLS estimate based on all observations up to i. Hence (n) is a common OLS
353
estimate in the linear regression model.
354
2.5.1 F tests
355
The most flexible approach to investigate whether the null hypothesis of no struc-
356
tural change holds, is to use F test statistics. F tests are designed to test against a
357
single shift alternative. Thus, the alternative can be formulated on the basis of the
358
model (Eq. 4)
i =
(1 i i0 )
(i0 < i n)
(6)
359
where i0 is some change point in the interval (k, n k). Chow [6] was the first to
360
suggest such a test on structural change for the case where the (potential) change
361
point i0 is known. He proposed to fit two separate regressions for the two subsam-
362
ples defined by i0 and to reject whenever the test statistics Fi0 exceeds some critical
363
value,
Fi0 =
u> u e> e
.
e> e/(n
2k)
15
(7)
364
where e = (uA , uB )> are the residuals from the full model (consisting of two regres-
365
sions), where the coefficients in the subsamples are estimated separately, and u are
366
the residuals from the restricted model, where the parameters are just fitted once
367
for all observations. The test statistic Fi0 has an asymptotic 2 distribution with
368
k degrees of freedom and (under the assumption of normality) Fi0 /k has an exact
369
F distribution with k and n 2k degrees of freedom. The major drawback of this
370
Chow test is that the change point has to be known in advance. A natural idea to
371
extend the ideas from the Chow test is to calculate the F statistics for all potential
372
change points or for all potential change points in an interval [i, ] and to reject the
373
null hypothesis if any of those statistics get too large.
374
Continuous monitoring of the data stream. Up to this point we were concerned
375
with the retrospective detection of structural changes in given data sets. Several
376
structural change tests have been extended to monitoring of linear regression mod-
377
els where new data arrive over time [26]. Such forward looking tests are closely
378
related to sequential tests. When new observations arrive, estimates are computed
379
sequentially from all available data (historical sample plus newly arrived data) and
380
compared to the estimate based only on the historical sample. As in the retro-
381
spective case, the hypothesis of no structural change is rejected if the difference
382
between these two estimates gets too large.
383
For monitoring, the standard linear regression model (4) is generalized to

yi = xi> i + ui
384
(i = 1, . . . , n, n + 1, . . .),
(8)
i.e., we expect new observations to arrive after time n, when the monitoring begins.
16
385
3 Materials
386
3.1 Heart rate time series
387
For illustration on how to use structural change tests we present from Goldberger
388
et al. [12] two interesting time series, series 1 (Fig. 1), and series 2 (Fig. 3)
389
Each series contains 1800 evenly-spaced measurements of instantaneous heart
390
rate from a single subject. The two subjects were engaged in comparable activities
391
for the duration of each series. The measurements (in units of beats per minute)
392
occur at 0.5 second intervals, so that the length of each series is exactly 15 minutes.
393
The rapid oscillations visible in series 1 are caused by respiratory sinus arrhyth-
394
mia, a modulation of heart rate that is greatest in young subjects, and gradually
395
decreases in amplitude with increasing age.
396
On the other hand, series 2 belongs to the class of congestive heart failure,
397
where circulatory delays interfere with regulation of carbon dioxide and oxygen in
398
the blood, leading to slow oscillations of heart rate.
399
Both time series contain anomalies near the beginning and the end of observa-
400
tion period.
401
3.2 Clinical diagnostics of coronary artery disease (Nuclear dataset)
402
For illustration of implicit temporal data mining methods and simultaneous use of
403
structural change statistical tests, we focused on the Nuclear dataset [14, 25] for
404
three reasons:
405
406
407
408
we have been working on this dataset for quite some time and therefore know
it pretty well,
we have close relations with physician who has collected the data, and who
has provided also the original diagnoses,
17
409
it was possible to order the patients by the date of their examination, which
410
is rare in publically available datasets (such as the UCI repository). This
411
is because mostly existing temporal information is not compiled by experts
412
preparing the data for analysis.
413
Coronary artery disease (CAD) is the most important cause of mortality in
414
all developed countries. It is caused by diminished blood flow through coronary
415
arteries due to stenosis or occlusion. CAD produces impaired function of the heart
416
and finally the necrosis of the myocardium myocardial infarction.
417
In our study we used a dataset of 327 patients (250 males, 77 females) with
418
performed clinical and laboratory examinations, exercise ECG, myocardial scintig-
419
raphy and coronary angiography because of suspected CAD. The features from the
420
ECG an scintigraphy data were extracted manually by the clinicians.

[Table 1 about here.]
421
422
In 228 cases the disease was angiographically confirmed and in 99 cases it was
423
excluded. 162 patients had suffered from recent myocardial infarction. The pa-
424
tients were selected from a population of approximately 4000 patients who were
425
examined at the Nuclear Medicine Department between 1991 and 1994. We se-
426
lected only the patients with complete diagnostic procedures (all four levels) [25].
427
Results of the fourth level (coronary angiography) were taken as a gold standard.
428
4 Results
429
4.1 Detecting the structural change in time series
430
In Figs. 1 to 4 there are depicted the original time series (Figs. 1 and 3) and the
431
respective critical values of F tests (Figs. 2 and 4) at the significance level = 0.05.
18
432
In series 1 (respiratory sinus arrhythmia, a modulation of heart) three struc-
433
turally changed intervals are detected: besides the true anomalies, the test detects
434
another interval with raised heart rate in the middle of the time series.
435
436
In series 2 (congestive heart failure) two structurally changed intervals are detected; they are both associated with true anomalies.
437
As we can see, F tests are quite good at detecting structural changes. Note that
438
such a periodic time series is a more difficult problem than a monotonic one. It is
439
therefore conceivable that structural change tests may be of use in data mining of
440
implicit temporal data.
441
[Figure 1 about here.]
442
443
444
445
4.2 Case study: nuclear diagnostics of coronary artery disease
446
In this section we demonstrate how we can easily and efficiently detect and (at least
447
partially) account for implicit temporal properties of collected data by building a
448
series of static data models.
449
Our case study is in the diagnostics of coronary artery disease [14, 25]. This is
450
a two-class problem, diagnosing whether the patients suffer from coronary artery
451
disease (CAD), or not. The data were collected in years between 1991 and 1994.
452
After performing a leave-one-out testing on the whole dataset and ordering the
453
results by the date of final examination, we obtained surprising classification ac-
454
curacy graphs as depicted in Fig. 5. Decreasing classification accuracy for both
455
physicians and naive Bayesian classifier in the last observed year (1994) could be
19
456
either a result of significantly changed class distribution or of a concept change.
457
The former seems not to be the case in our problem, since the class distribution
458
(see class prevalence in Fig. 5) does not change significantly over the observed
459
time interval. This leads us to a question what has happened and how can we deal
460
with it.
461
462
It is important that retrospective studies as well as ongoing (online) studies
463
where Machine Learning tools are being used employ techniques for detecting an
464
dealing with time-changing concepts. While this may not happen very often, it may
465
seriously skew results of otherwise perfectly valid studies. In order to compensate
466
for changed conditions it is also important for a Machine Learning system to decide
467
on when to rebuild a model to account for newly arrived training examples and what
468
extent of historical training data to use for learning.
469
4.2.1 Experimental setup
470
For testing different methods for dealing with concept drift we applied the follow-
471
ing methodology. All examples were ordered by the time of patients examination.
472
When we needed to start with some initial training set, we fixed for this purpose
473
the first 100 out of 327 examples. Performance on this set was evaluated with
474
a leave-one-out testing process. On the remaining examples we applied different
475
techniques, described in Sec. 2. For windowing as well as for gradual forgetting,
476
testing was done in single steps, where the potential training set consisted from the
477
first n examples, 100 n 326, and the testing set consisted of the (n + 1) -st ex-
478
ample. From training examples either last w (window size) was used for training,
479
or they were all assigned different weights (gradual forgetting).
480
Our experimental Machine Learning tool of choice was naive Bayesian classi-
481
fier. It was suitable for our purpose because of its very fast, incremental learning,
20
482
because it usually performs well in medical diagnostic problems and because it
483
can easily be modified for dealing with unequally important (weighted) training
484
examples.
485
4.2.2 Detecting the concept drift
486
The first thing was to find out how can concept drift be detected. If we want to
487
detect it, it must have already happened or has started to happen. Thus we always
488
have certain amount of data available for experimenting. In diagnostic problems,
489
where sooner or later the diagnoses are confirmed, our task is easy. When we col-
490
lect enough drifted examples, the drift reflects in significantly decreased average
491
classification accuracy achieved in recent past in comparison with average classifi-
492
cation accuracy achieved in distant past (see average classification accuracy in Fig.
493
6, recent past is last 50 examples).
494
However in prognostic problems the situation is more difficult, since actual
495
outcomes (and thus correctness of prognoses) may not be known for a long time, if
496
ever. In such cases it may be more useful to use a measure of reliability estimation
497
[23, 24] that assigns a kind of confidence value to every prediction (see reliability
498
estimation in Fig. 6). Although actual predictions may not be known, one could
499
detect the drift by significantly decreased average reliability estimations in recent
500
past in comparison with average reliability estimations calculated in distant past
501
(see average reliability estimation in Fig. 6, recent past is last 50 examples).
502
503
In Fig. 6 it can clearly be observed, that the drift has been happening since the
504
beginning of the last observed year (1994), that is, from the example 261 on. So
505
we selected the last 66 examples as a drifted set of our interest.
21
506
Detecting the concept drift with a structural change test. For quantitative
507
evaluation of the concept drift structural change statistical tests has been used. Re-
508
sults of the F test performed on the original non-smoothed data (the raw 0/1 data of
509
Fig. 5) are shown in Fig. 7. On the significance level of = 0.05, we can detects
510
structural changes in first 30 patients in year 1991 as well as in the patients from
511
261 on in year 1994. The first structural change was admittedly caused by a biased
512
data collection, while the second one truly represents a concept drift (actually a
513
changed context in data acquisition).

514
515
We wanted to check if this concept drift is also reflected in the structure of the
516
data. For this purpose we plotted the most important principal components for each
517
patient (Fig. 8) and tested it for a structural change (Fig. 9). On average, the most
518
important principal component explained about 35% of total data variance. As one
519
can see in both figures, there are no significant trends or regularities in Fig. 8,
520
although the components are slightly decreasing with time. Also, the test statistic
521
values of the structural change F test are far below the critical value on signifi-
522
cance level = 0.05. Clearly, unsupervised testing of principal components is not
523
very promising. A much better unsupervised approach seems to be transductive
524
reliability estimation (Fig. 6).
525
526
527
We also experimented with real-time monitoring of the data in order to detect
528
the structural change as soon as possible. As suggested by Leisch [26], we started
529
after a history phase without structural changes (we left the first 30 patients out)
530
and incrementally analyzed and tested new observations for a structural change.
22
531
We found out that the F test is quite responsive in this matter, it correctly found
532
the location (patient 261) of the significant structural change 23 patients after the
533
actual start of the drift. The 23 patients correspond to approximately three months
534
in real time.
535
In Fig. 10 there is a situation 20 patients after the beginning of the drift, when
536
it wasnt detected yet. In contrast, Fig. 11 depicts a situation 25 patients after the
537
beginning of the drift, when the presence of structural change and its location in
538
time is already located ( = 0.05).
539
540
541
If the data were monitored continuously in clinical practice, the drift could have
542
been detected a few months after its beginning, and not ten years later.
543
4.2.3 Dealing with the concept drift
544
For dealing with the concept drift we applied windowing as well as linear and
545
kernel-based gradual forgetting (Sec. 2.4.3). We used the first 261 non-drifted
546
examples for parameter optimization (window size, slope and kernel parameter).
547
In order to evaluate the quality of obtained parameters, we also optimized the pa-
548
rameters on the drifted set. For comparison we selected the best achieved results.
549
The obtained parameter values were tested on the last 66 drifted examples, and
550
the results are compared in Tab. 2. Fig. 12 depicts variations of classification
551
accuracy with different parameter settings.
552
[Table 2 about here.]
553
23
554
As we can see, differences in accuracy between optimized and actual best pa-
555
rameter values exist, but they are small. By using the optimized parameter values
556
the average performance on the whole dataset was 94-95% for all three methods.
557
We cannot say that any of them (windowing, linear or kernel-based forgetting) is
558
significantly better than the other ones. However, we can see an improvement in
559
overall accuracy for naive Bayesian classifier by 4%. This is no small achievement,
560
since it actually reduces the error rate by 44% (from 9% to 5%). But more than
561
this it is important that the performance on the drifted examples (last 66) is much
562
higher (by 20%) for naive Bayesian classifier (from 64% to 83-85%). This means
563
that we can almost level the performance on this problematic subset with overall
564
performance and should equal to it when a few more training examples arrive.
565
In Figs. 13 and 14 we compare classification accuracy of the ordinary naive
566
Bayesian classifier with windowing and gradual forgetting methods for different
567
parameter values. For training ordinary naive Bayesian classifier, non-drifted ex-
568
amples (first 261) were used, and leave-one-out testing was performed, the others
569
were added in training set and tested incrementally.
570
571
572
5 Discussion
573
In the paper we have focused on, and briefly reviewed some approaches for han-
574
dling of both explicit temporal (time-stamped) and implicit temporal data.
575
Explicit temporal data (time series) are frequently used in medical and other
576
studies. We briefly reviewed numerical (time series analysis) and symbolic (tem-
577
poral abstractions) methods for dealing with abundant temporal data. They are
24
578
especially useful for qualitative description, summarization, and reasoning on tem-
579
poral data.
580
On the other hand, in implicit temporal data the temporal component is hidden,
581
yet always present, and determines their ordering, as were collected over certain
582
periods of time. More often than not, the hidden temporal components of collected
583
datasets are ignored. This may cause unfortunate anomalies and worsen results, as
584
shown in our case study.
585
We reviewed statistical methods for detecting the structural change (concept
586
drift), and three different machine learning methods for dealing with changing
587
(drifting) concepts: windowing and gradual (linear or kernel-based) forgetting. We
588
found out that in our case study of coronary artery disease diagnostics all perform
589
reasonably well. While they all require setting certain parameters, they can be
590
automatically tuned on the training set [20], and nearly optimal results can be ex-
591
pected. In windowing, at most n (size of the training set) re-runs of the training
592
algorithm are required for window size selection, whereas for gradual forgetting
593
linear optimization with desired precision for respective parameter is sufficient.
594
We can detect concept drift only after it has happened, so we always have
595
some data available. It can be detected by using classification accuracy (in diag-
596
nostic problems) or reliability estimation (in prognostic problems), quantitatively
597
evaluated by statistical structural change tests (such as F test) with a prescribed
598
confidence level. Our experiments with monitoring for a structural change tests
599
show that they are able to precisely and quickly locate the starting point of concept
600
drift. Namely, one should rebuild (re-learn) the model only when absolutely neces-
601
sary and adjust suitable parameters (e.q. window size, slope or kernel parameter)
602
to compensate for the drift. This is especially important for practical applications,
603
where rebuilding a model is not performed every time when a new training exam-
604
ple arrives. Model rebuilding may require a presence of a machine learning expert,
25
605
especially if learning parameters need to be changed. Often a generated model is
606
stored and used independently of the learner (e.g. in a handheld device, or even
607
printed on a paper). In such cases a model should be rebuilt and deployed only
608
when it is really necessary.
609
In our case study of coronary artery disease diagnostics we managed to achieve
610
overall improvement of 4% compared to ordinary naive Bayesian classifiers result.
611
Since it actually reduces the (already low) error rate by 44% (from 9% to 5%), it
612
is no small achievement. The improvement was most notable in last 66 drifted
613
examples, where it was about 20% (from 64% to 83-85%). This means that the
614
performance on this problematic subset is almost levelled with overall performance
615
and should equal to it with a few more training examples.
616
A very encouraging result is also it correct and quick pinpointing of the location
617
(patient 261) of the significant structural change only 23 patients after the actual
618
start of the drift. The 23 patients correspond to approximately three months in real
619
time. If the data were monitored and tested continuously, the drift could have been
620
detected a few months and not ten years after its beginning. We argue that any
621
(online) learning system that is used in practice should use similar techniques at
622
least to detect and possibly deal with drifting concepts.
623
There are several things that can be done to further develop the described meth-
624
ods. Most notably, statistical tests should be integrated with machine learning and
625
data mining methods more thoroughly in order to continuously check for the pos-
626
sibility of a structural change (concept drift) and to guide the re-learning of the
627
models. Also, a weighting scheme for gradual forgetting should be devised, that
628
does not need to be recalculated every time a new training example arrives. This
629
would enable true incremental learning, however it would require from the learner
630
to cope with increasingly large, theoretically unlimited example weights.
631
There is one question left unanswered, and that is how and why in our case
26
632
study a concept drift has occurred in the first place. Since the reasons are rather
633
delicate and personal in their nature, it will suffice to say that there was a serious
634
yet unconscious human error in data acquisition (interpretation of scintigraphic
635
images) caused by professional traumatic experience.
636
Acknowledgements
637
I thank Ciril Groselj, Nuclear Medicine Department, University Medical Centre
638
Ljubljana, for collecting the data, and Michel Dojat, Unite mixte INSERM-UJF
639
U594 Neuroimagerie Fonctionnelle & Metabolique, Grenoble, for pointing to
640
me his most interesting work. This work was supported by the Slovenian Ministry
641
of Education and Science.
642
References
643
[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence
644
databases. In Proc. 4th Intl Conference on Foundations of Data Organization and
645
Algorithms, pages 6984, 1993.
646
647
648
649
[2] D. W. K. Andrews and W. Ploberger. Optimal tests when a nuisance parameter is

present only under the alternative. Econometrica, 62:13831414, 1994.
[3] R. Bellazzi, C. Larizza, and Alberto Riva. Temporal abstractions for diabetic patients
management. In Proc. AIME97, pages 319330, 1997.
650
[4] R. Bellazzi, P. Magni, C. Larizza, G. De Nicolao, A. Riva, and M. Stefanelli. Mining
651
biomedical time series by combining structural analysis and temporal abstractions.
652
JAMIA, Symposium supplement 1998, pages 160164, 1998.
653
654
[5] B. Chiu, E. Keogh, J. Lin, and S. Lonardi. Efficient discovery of unusual patterns in
time series. In Proc. KDD02, pages 550556, 2002.
27
655
656
[6] G. C. Chow. Tests of equality between sets of coefficients in two linear regressions.
Econometrica, 28:591605, 1960.
657
[7] W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russel, editors,
658
Proc. 12th Intl. Conf. on Machine Learning ICML95, pages 115123, San Francisco,
659
California, USA, 1995. Morgan Kaufmann.
660
[8] D. Dasgupta and S. Forrest. Novelty detection in time series data using ideas from
661
immunology. In Proc. of The International Conference on Intelligent Systems, 1999.
662
[9] M. Dojat, F. Pachet, Z. Guessoum, D. Touchard, A. Harf, and L. Brochard.
663
Neoganesh: A working system for the automated control of assisted ventilation in
664
icus. Artificial Intelligence in Medicine, 11:97117, 1997.
665
666
[10] M. Dojat and C. Sayettat. A realistic model for temporal reasoning in real-time
patient monitoring. Applied Artificial Intelligence, 10:121143, 1996.
667
[11] F. Esposito, D. Malerba, and G. Semeraro. Simplifying decision trees by pruning and
668
grafting: new results. In N. Lavrac and S. Wrobel, editors, Proc. Europ. Conf. on
669
Machine Learning ECML95, pages 287290. Springer Verlag, 1995.
670
[12] A. L. Goldberger and D. R. Rigney. Nonlinear dynamics at the bedside. In L. Glass,
671
P. Hunter, and A. McCulloch, editors, Theory of Heart: Biomechanics, Biophysics,
672
and Nonlinear Dynamics of Cardiac Function, pages 583605. Springer-Verlag, New
673
York, 1991.
674
675
[13] I. Grabtree and S. Soltysiak. Identifying and tracking changing interests. International Journal of Digital Libraries, 2:3853, 1998.
676
[14] C. Groselj, M. Kukar, J. Fettich, and I. Kononenko. Machine learning improves
677
the accuracy of coronary artery disease diagnostic methods. In Proc. Computers in
678
Cardiology, volume 24, pages 5760, Lund, Sweden, 1997.
679
680
[15] M. B. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning, 32:101126, 1998.
28
681
682
[16] D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14:2745, 1994.
683
[17] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In
684
Proceedings of the 17th ACM SIGKDD Inter. Conf. on Knowledge Discovery and
685
Data Mining, pages 97106, San Francisco, CA, 2001. ACM Press.
686
[18] E. Keogh, K. Chakrabarti, M. Pazzani, , and S. Mehrotra. Dimensionality reduction
687
for fast similarity search in large time series databases. Journal of Knowledge and
688
Information Systems, 3(3):263286, 2000.
689
[19] E. Keravnou. Modelling medical concepts as time objects. In P. Barahona, M. Ste-
690
fanelli, and J. Wyatt, editors, Lecture Notes in Medical Informatics, pages 6790.
691
Springer-Verlag, Berlin, 1995.
692
[20] R. Klinkenberg and T. Joachims. Detecting concept drift with support vector ma-
693
chines. In P. Langley, editor, Proceedings of ICML-00, 17th International Confer-
694
ence on Machine Learning, pages 487494, Stanford, US, 2000. Morgan Kaufmann
695
Publishers, San Francisco, US.
696
[21] I. Koychev. Gradual forgetting for adaptation to concept drift. In Proceedings of
697
ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning, pages 101106,
698
Berlin, Germany, 2000.
699
700
[22] C. M. Kuan and K. Hornik. The generalized fluctuation test: A unifying view. Econometric Reviews, 14:135161, 1995.
701
[23] M. Kukar. Making reliable diagnoses with machine learning: A case study. In Silvana
702
Quaglini, Pedro Barahona, and Steen Andreassen, editors, Proceedings of Artificial
703
Intelligence in Medicine Europe, AIME 2001, pages 8896, Cascais, Portugal, 2001.
704
Springer.
705
[24] M. Kukar and I. Kononenko. Reliable classifications with Machine Learning. In
706
Proceedings of 13th European Conference on Machine Learning, ECML 2002, pages
707
219231, 2002.
29
708
[25] M. Kukar, I. Kononenko, C. Groselj, K. Kralj, and J. Fettich. Analysing and im-
709
proving the diagnosis of ischaemic heart disease with machine learning. Artificial
710
Intelligence in Medicine, 16 (1):2550, 1999.
711
712
713
714
715
716
717
718
719
720
[26] F. Leisch, K. Hornik, and C. M. Kuan. Monitoring structural changes with the generalized fluctuation test. Econometric Theory, 16:835854, 2000.
[27] M. A. Maloof and R. S. Michalski. Selecting examples for partial memory learning.
Machine Learning, 41(1):2752, 2000.
[28] Y. Shahar. A framework for knowledge-based temporal abstraction. Artificial Intelligence, 90(1-2):79133, 1997.
[29] Y. Shahar and M. A. Musen. Knowledge-based temporal abstraction in clinical domains. Artificial Intelligence in Medicine, 8(3):267298, 1996.
[30] C.E. Shannon and W. Weaver. The mathematical theory of communications. The
University of Illinois Press, Urbana, IL, 1949.
721
[31] N. A. Syed, H. Liu, and K. K. Sung. Handling concept drifts in incremental learning
722
with support vector machines. In Knowledge Discovery and Data Mining, pages
723
317321, 1999.
724
[32] A. S. Weigend, D. E. Rumelhart, and A. H. Bernardo. Generalization by weight elimi-
725
nation with application to forecasting. In Advances in Neural Information Processing
726
Systems, volume 3, pages 875882, Denver, CO, USA, 1991. Morgan Kaufman.
727
[33] B. Whitehead and W. A. Hoyt. A function approximation approach to anomaly de-
728
tection in propulsion system test data. In Proc. AIAA/SAE/ASME/ASEE 29th Joint
729
Propulsion Conference, 1993.
730
731
[34] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden
contexts. Machine Learning, 23(1):69101, 1996.
732
[35] A. Zeileis, F. Leisch, K. Hornik, and C. Kleiber. strucchange: An r package
733
for testing for structural change in linear regression models. Journal of Statistical
734
Software, 7(2):138, 2002.
30
735
736
737
List of Tables
1
2
CAD data for different diagnostic levels. . . . . . . . . . . . . . .

Experimental results on the drifted data. . . . . . . . . . . . . . .
31
32
33
Diagnostic level
Signs, symptoms and history
Exercise ECG
Myocardial scintigraphy
Coronary angiography
Total attributes
Disease prevalence
Entropy of classes
Diagnostic attributes
Nominal Numeric Total
23
7
30
7
9
16
22
9
31
1
1
53
25
78
70% positive 30% negative
0.89 bit
Table 1: CAD data for different diagnostic levels.
32
Naive
Bayes
Windowed
Linear
Kernel
Physicians
Ordinary Naive Bayes
Parameter
Window size
Slope (k)
Kernel size (k)
Optimized
Value Accuracy
100
85%
0.90
83%
0.25
83%
70%
64%
Best achieved
Value Accuracy
70
88%
0.9
83%
0.17
86%
70%
64%
Table 2: Experimental results on the drifted data.
33
Overall
Accuracy
95%
94%
95%
85%
91%
738
739
740
List of Figures
1
2
741
742
743
744
3
4
745
746
747
748
5
6
749
750
751
7
8
752
753
754
755
10
756
757
11
758
759
12
760
761
762
763
13
764
765
766
767
14
Time series 1 (respiratory sinus arrhythmia). . . . . . . . . . . . .

Time series 1 (respiratory sinus arrhythmia). Results of structural
change tests. Exceeded critical value signal significant ( = 0.05)
structural changes. . . . . . . . . . . . . . . . . . . . . . . . . .
Time series 2 (congestive heart failure). . . . . . . . . . . . . . .
Time series 2 (congestive heart failure). Results of structural change
tests. Exceeded critical value signal significant ( = 0.05) structural changes. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Time-based variation of classification accuracy in the Nuclear dataset.
Detecting concept drift with classification accuracy and reliability
in the Nuclear dataset. . . . . . . . . . . . . . . . . . . . . . . .
F test for a structural change in data from Fig. 5 (al pha = 0.05). .
Most important principal components, their running median (middle line) and its range (top and bottom line). . . . . . . . . . . . .
F test on the principal components from Fig. 8. All test statistic
are far below critical value ( = 0.05. . . . . . . . . . . . . . . .
Monitoring F test for concept drift (20 drifted examples). Concept
drift is not detected yet. . . . . . . . . . . . . . . . . . . . . . . .
Monitoring F test for concept drift (25 drifted examples). Concept
drift is determined as significant ( = 0.05). . . . . . . . . . . . .
Parameter tuning: performance on the drifted examples. Kernel
and k parameter are represented as a real-value. Window size is
represented as a share of the whole training set (all training examples = 261 = 1.0). . . . . . . . . . . . . . . . . . . . . . . . . . .
Catching up with the drift with windowing in the Nuclear dataset.
Notice the negative effect of too small window size (w=50). . . . .
Catching up with the drift with gradual forgetting in the Nuclear
dataset. Differences between linear and kernel-based forgetting are
almost negligible. . . . . . . . . . . . . . . . . . . . . . . . . . .
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
100
80
60
V1
40
20
0
0
500
1000
1500
Time
Figure 1: Time series 1 (respiratory sinus arrhythmia).
35
80
60
40
F statistics
20
0
400
600
800
1000
1200
1400
Time
Figure 2: Time series 1 (respiratory sinus arrhythmia). Results of structural change

tests. Exceeded critical value signal significant ( = 0.05) structural changes.
36
100
80
60
V1
40
20
0
0
500
1000
1500
Time
Figure 3: Time series 2 (congestive heart failure).
37
80
60
40
F statistics
20
0
400
600
800
1000
1200
1400
Time
Figure 4: Time series 2 (congestive heart failure). Results of structural change

tests. Exceeded critical value signal significant ( = 0.05) structural changes.
38
Figure 5: Time-based variation of classification accuracy in the Nuclear dataset.
39
100%
80%
60%
40%
Reliability estimation
20%
Averaged reliability
estimation
Averaged classification
accuracy
0%
1991
1992
1993
1994
1995
Figure 6: Detecting concept drift with classification accuracy and reliability in the
Nuclear dataset.
40
12
10
8
6
F statistics
4
2
0
1991
1992
1993
1994
1995
Time
Figure 7: F test for a structural change in data from Fig. 5 (al pha = 0.05).
41
100
50
0
PC 1
50
100
1991
1992
1993
1994
1995
Time
Figure 8: Most important principal components, their running median (middle line)
and its range (top and bottom line).
42
8
6
4
F statistics
2
0
1991
1992
1993
1994
1995
Time
Figure 9: F test on the principal components from Fig. 8. All test statistic are far
below critical value ( = 0.05.
43
12
10
8
6
F statistics
4
2
0
1991
1992
1993
1994
Time
Figure 10: Monitoring F test for concept drift (20 drifted examples). Concept drift
is not detected yet.
44
12
10
8
6
F statistics
4
2
0
1991
1992
1993
1994
Time
Figure 11: Monitoring F test for concept drift (25 drifted examples). Concept drift
is determined as significant ( = 0.05).
45
100%
90%
80%
70%
60%
50%
40%
30%
Windowing
20%
Linear
Kernel
10%
0%
0.0
0.2
0.4
0.6
0.8
1.0
Figure 12: Parameter tuning: performance on the drifted examples. Kernel and
k parameter are represented as a real-value. Window size is represented as a share
of the whole training set (all training examples = 261 = 1.0).
46
Figure 13: Catching up with the drift with windowing in the Nuclear dataset. Notice the negative effect of too small window size (w=50).
47
Figure 14: Catching up with the drift with gradual forgetting in the Nuclear dataset.
Differences between linear and kernel-based forgetting are almost negligible.
48

Aiim Special Issue

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Aiim Special Issue

Uploaded by

Copyright:

Available Formats

1

Explicit and implicit handling of time in clinical

Faculty of Computer and Information Science

Trzaska 25, SI-1001 Ljubljana, Slovenia

drawn from a stationary distribution. Unfortunately, many of the databases

changed during this time, sometimes radically (the phenomenon known in

machine learning as a concept drift). In clinical institutions, where the pa-

may fail due to their underlaying assumptions. It is therefore important to

a more thorough review of selected statistical and machine learning tech-

frameworks. We evaluate their possible use in clinical studies with a case

study of learning and monitoring data in coronary artery disease diagnostics.

Keywords: clinical studies, temporal learning, drifting concepts, partial mem-

ory learning, forgetting, machine learning, data mining.

dia. This allows researchers to access previously unimaginable amounts of clinical

A physician confronted with tens of megabytes of clinical data may be rather

is actually quite a common manifestation of the well-known phenomenon from

information theory [30]:

Information = f (Data, Cognitive structure, Time)

cognitive (processing) capabilities and enough time to process the data. As we

data in a shape, more appropriate to our cognitive structure, or (2) to virtually

improve our cognitive structure by utilizing intelligent applications.

1. In a large dataset, a regularity, or (in temporal data) an emerging pattern

or even a set of findings. Experienced physicians are able to combine sev-

eral significant findings, to abstract such findings into clinically meaningful

higher-level concepts in a context-sensitive manner, and to detect significant

regularities in both low-level data and abstract concepts. It is thus desir-

able to provide short, informative, context-sensitive summaries clinical data

stored on electronic media, and to be able to answer queries about abstract

concepts that summarize the data.

cians an all-encompassing view on the problem at hand. Intelligent tools can

synthesize knowledge and emerging patterns from previously solved cases,

potheses. When used in temporal problems, they are used in conjunction

with high-level abstractions and summarizations, as they usually cannot ex3

plicitly handle temporal data.

temporal (time-stamped) data. Temporal data analysis presents several impor-

tant challenges that include time-stamped data preprocessing (transformation into

information-preserving and more useful form) by using temporal abstractions [3,

reasoning [10, 9].

included in an implicit way. In many cases, especially in conjunction with time-

and show their effect in a case study.

The paper is organized as follows. In Sec. 2 we review some related work

experimental results. Finally, in Sec. 5 we present some conclusions and directions

for future work.

2.1 Clinical decision-making and data mining

Clinical decision-making is a complicated process based on experience, judgement,

literature and a variety of other sources, including quantitative results of clinical

databases are frequently subject of retrospective studies. The patients in whom

which are, due to ever-increasing amounts of data, becoming increasingly popu-

temporal nature (e.g. for signal monitoring), but not allways.

2.1.1 Explicit temporal data

accordingly. The analysis of multivariate time series is a difficult and frequent

problem in science in general, and in medicine in particular. It represents a cru-

cial challenge in clinical applications such as monitoring, where several parameters

scriptive and inferential statistical techniques [4]. Recently, a methodology known

as temporal abstractions, based on artificial intelligence, has been proposed and

successfully exploited in several application domains [28, 3]. Temporal abstrac-

viewed as the artificial intelligence alternatives to descriptive statistics. They are

standard deviation in normal distributed observations. The number, duration and

type of temporal abstractions episodes can be considered a summary of the time

2.1.2 Non-temporal (implicit temporal) data

ing algorithm is to automatically generate a model (a description) of the given data

factors assessment and decision-making support.

In clinical trials, the experimental setup is supposed to be fixed and strictly

scientific discoveries may be absorbed in practice. While changes in the process