Explicit and implicit handling of time in clinical data mining

Matjaz Kukar

University of Ljubljana
Faculty of Computer and Information Science
Trzaska 25, SI-1001 Ljubljana, Slovenia
matjaz.kukar@fri.uni-lj.si

Abstract

Most clinical tasks are regularly documented and stored on electronic media, thus causing large amounts of data to be collected over time. Stored data may include either an explicit (date and time) or an implicit (order) time stamp indicating when the particular datum was valid. Most statistical, machine learning and data mining algorithms assume that the data they use is a random sample drawn from a stationary distribution. Unfortunately, many of the databases available for mining today violate this assumption. They were gathered over months or years, and the underlying processes generating them may have changed during this time, sometimes radically (the phenomenon known in machine learning as concept drift). In clinical institutions, where patients' data are regularly stored in central computer databases, similar situations may occur. Expert physicians may easily, even unconsciously, adapt to the changed environment, whereas machine learning and data mining tools may fail due to their underlying assumptions. It is therefore important to detect and adapt to the changed situation. In the paper we present a brief overview of methods for explicit and implicit handling of temporal data, and a more thorough review of selected statistical and machine learning techniques for dealing with concept drift in machine learning and data mining frameworks. We evaluate their possible use in clinical studies with a case study of learning and monitoring data in coronary artery disease diagnostics.

Keywords: clinical studies, temporal learning, drifting concepts, partial memory learning, forgetting, machine learning, data mining.

1 Introduction

Most clinical tasks are nowadays documented and stored on electronic media. This allows researchers to access previously unimaginable amounts of clinical data. However, huge amounts of data present entirely new problems, both for data analysts and for end-users (physicians). Physicians who have to make diagnostic or therapeutic decisions based on these data may be overwhelmed by the sheer amount of data if their ability to reason with the data does not scale up to the computer's capabilities. In addition, most stored data include either an explicit (date and time) or an implicit (order) time stamp indicating when the particular datum was valid. Such time series may consist of thousands of numbers that describe only short time periods, and are therefore especially difficult to interpret and to reason about.

A physician confronted with tens of megabytes of clinical data may be rather helpless and cannot effectively exploit all the available information. This situation is actually quite a common manifestation of the well-known phenomenon from information theory [30]:

    Information = f(Data, Cognitive structure, Time)    (1)

This tells us that we can extract information from data only if we have appropriate cognitive (processing) capabilities and enough time to process the data. As we humans usually have only limited time available, the only ways to increase the amount of information we can obtain from the data are either (1) to transform the data into a shape more appropriate to our cognitive structure, or (2) to virtually improve our cognitive structure by utilizing intelligent applications.

1. In a large dataset, a regularity, or (in temporal data) an emerging pattern over a stretch of time, has much more significance than an isolated finding or even a set of findings. Experienced physicians are able to combine several significant findings, to abstract such findings into clinically meaningful higher-level concepts in a context-sensitive manner, and to detect significant regularities in both low-level data and abstract concepts. It is thus desirable to provide short, informative, context-sensitive summaries of clinical data stored on electronic media, and to be able to answer queries about abstract concepts that summarize the data.

2. Machine learning, data mining, and other data analysis tools may be used to detect, expose and utilize regularities in stored data, and to provide physicians with an all-encompassing view of the problem at hand. Intelligent tools can synthesize knowledge and emerging patterns from previously solved cases, and apply it to solve new cases. As their insight into the problem is different from that of the physicians, they are a valuable source of alternative hypotheses. When used in temporal problems, they are used in conjunction with high-level abstractions and summarizations, as they usually cannot explicitly handle temporal data.

In the paper we will mainly focus on the analysis of explicit and implicit temporal (time-stamped) data. Temporal data analysis presents several important challenges, which include preprocessing time-stamped data (transformation into an information-preserving and more useful form) by using temporal abstractions [3, 29, 28], or transforming the temporal data into a series of entities (states, events and relations among them) in order to use them for efficient and comprehensible reasoning [10, 9].

We will briefly review some approaches for handling explicit temporal (time-stamped) data and show that in many cases the temporal component is implicitly included in the data, yet often ignored. The aim of the paper is not dealing with explicit temporal data, but dealing with data where temporal information is included in an implicit way. In many cases, especially in conjunction with time-ignorant machine learning and data mining methods, anomalous results may occur because of this ignorance. We will review some simple, yet highly effective statistical and machine learning techniques for detecting and handling temporal problems, and show their effect in a case study.

The paper is organized as follows. In Sec. 2 we review some related work and proposed solutions for dealing with drifting concepts. In Sec. 3 we describe the datasets we use for demonstration and for the case study. In Sec. 4 we present experimental results. Finally, in Sec. 5 we present some conclusions and directions for future work.

2 Methods

2.1 Clinical decision-making and data mining

Clinical decision-making is a complicated process based on experience, judgement, and reasoning that should simultaneously integrate information from the medical literature and a variety of other sources, including quantitative results of clinical trials and, most importantly, diagnostic test results. In most clinical institutions the patients' data are regularly stored in a central computer database. With time, more and more records that include confirmed diagnoses appear in the database. Such databases are frequently the subject of retrospective studies: the patients in whom the outcome has already occurred are selected and analyzed, thus looking backward to assess potential risk factors and diagnostic principles. Retrospective studies naturally fit into Machine Learning and Data Mining application frameworks, which are, due to ever-increasing amounts of data, becoming increasingly popular as a support tool in medical decision making. All clinical data are collected over (shorter or longer) time spans. In most cases the clinicians are aware of their temporal nature (e.g. for signal monitoring), but not always.

2.1.1 Explicit temporal data

In several clinical tests or trials the patient's state is monitored continuously, and the findings are time stamped and managed accordingly (e.g. ECG, EEG, long-term repetitive tests, ...). Such data are treated as time series and are dealt with accordingly. The analysis of multivariate time series is a difficult and frequent problem in science in general, and in medicine in particular. It represents a crucial challenge in clinical applications such as monitoring, where several parameters must be examined over the same period of time in order to understand the patient's overall situation. This rather complex task has traditionally been the domain of descriptive and inferential statistical techniques [4]. Recently, a methodology known as temporal abstractions, based on artificial intelligence, has been proposed and successfully exploited in several application domains [28, 3]. Temporal abstractions are methods that can be used to obtain abstract descriptions of the course of (possibly multivariate) time series by extracting their most relevant features [19]. They are able to summarize the time course of multivariate data through abstracted episodes which are valid over a certain time period. Temporal abstractions can be viewed as the artificial intelligence alternative to descriptive statistics, which summarize the data through some sufficient statistics, such as the mean and standard deviation of normally distributed observations. The number, duration and type of temporal abstraction episodes can be considered a summary of the time series at an abstract level. Temporal abstractions are usually used as the first step in the process of automated reasoning, as well as for data preprocessing and data revision [3].

2.1.2 Non-temporal (implicit temporal) data

To make it clear from the beginning: every data collection has an implicit temporal component, as the data are collected over a certain time period. Having collected a set of patient descriptions with confirmed diagnoses, the task of a Machine Learning algorithm is to automatically generate a model (a description) of the given data with respect to the correct diagnosis. A set of possible diagnoses is used as a target for the classification process. The generated model can subsequently be used for risk factor assessment and decision-making support.

In clinical trials, the experimental setup is supposed to be fixed and strictly controlled. However, one must be aware that even in the most strictly controlled environments, unexpected changes may happen. For instance, a crucial piece of equipment may start to fail and later get replaced, personnel changes may happen, and new scientific discoveries may be absorbed into practice. While changes in the process may not be visible immediately, it is necessary to act as soon as they are discovered. While humans can with relative ease gradually adapt to a changed situation, it is not the same with machines, not even with learning ones. Most machine learning algorithms already provide techniques for handling strange occurrences (noise) in training data, such as pruning of decision trees [11] and rules [7], or weight elimination [32]. However, it is definitely not desirable that perfectly valid new examples, generated under changed conditions, are considered noisy and therefore excluded from training. In that case, generated models do not reflect the changed conditions until enough new examples are collected, and during this transition the model performance on new examples would be poor.

2.2 Data mining in time series

Classical data mining in time series has several similarities with temporal abstractions. Namely, several important time series data mining problems basically reduce to the core task of finding approximately repeated subsequences (patterns, shapes, trends, etc., generalized as motifs) in a longer time series, where the motifs may or may not be known in advance. If the user can properly define problem-dependent motifs in advance, they can be used to qualitatively describe the whole time series.

While there exists a vast body of work on efficiently locating known patterns (motifs) in time series [1, 18], the problem of discovering motifs without any prior knowledge about the regularities of the data under study has received far less attention [5]. Such an algorithm would potentially allow a user to find surprising patterns in a massive database without having to specify in advance what a surprising pattern looks like. We are interested in looking for surprising patterns, i.e., combinations of data points whose structure and frequency somehow defy our expectations. The problem is referred to under various names in the literature, including novelty detection [8], anomaly detection [33], and structural change detection [26]. There exist efficient probabilistic algorithms [5] and statistical tests [35] for this purpose.

2.3 Explicit handling of time in clinical studies

When the patient's state is monitored continuously and the data are time stamped, we deal with a time series. A time series often contains thousands or millions of single measurements, especially if it is multivariate. In many cases it is very useful to transform the original time series into a new, more meaningful and computationally manageable dataset [3]. The new dataset should retain all information from the original time series, eliminate its computational problems, and be generally more useful. For this purpose, temporal abstractions are frequently used.

Temporal abstractions. The task of performing a temporal abstraction can be viewed informally as a generic interpretation task: given a set of time-stamped data, external events, and abstraction goals, produce abstractions of the data that interpret past and present states and trends, and that are relevant for the given set of goals.

The approach introduced by Shahar [29, 28] employs an inference structure and related required knowledge that are specific to the task of abstracting higher-level concepts from time-stamped data in knowledge-based systems, but are independent of any particular domain. The theory underlying this method is specified in a general, domain-independent way by a model of time, events, parameters, and the data-interpretation contexts that these entities create: a knowledge-based temporal abstraction theory.

The temporal abstraction mechanisms assume a task-specific temporal abstraction ontology: a theory of what entities, relations, and properties exist in any particular domain from the point of view of the temporal abstraction task and, in particular, of the knowledge-based temporal abstraction method. The five temporal abstraction subtasks [28] are temporal context restriction, vertical temporal inference, horizontal temporal inference, temporal interpolation, and temporal pattern matching. Given the domain's temporal abstraction ontology, these subtasks are performed by the five temporal abstraction mechanisms.

Temporal aggregation and forgetting. In an approach somewhat related to the above temporal abstractions, Dojat [10] defines a temporal ontology consisting of states (that introduce the notion of duration), events (that have temporal dimensions and whose occurrence changes the state of the world), and chronicles (defined as an ordered collection of temporal objects which represents the real history of the world as it is perceived by the system). He also proposes three higher-level forms of temporal abstraction: event-state relations, which describe the cause-effect relationships between events and states, and aggregation and forgetting, which are used to dynamically modify the length and the location of a mobile temporal window that brings to light the set of temporal information useful to the current reasoning process.

Forgetting is crucial for all artificial or natural systems with memory. There are many forms of forgetting. Dojat [10, 9] describes two simple types: active forgetting, where according to particular deductions some information is deliberately erased during the reasoning process, and passive forgetting, where infrequently used information vanishes with time. When the information to be processed tends to exceed the capacity of working memory (mental overload), forgetting mechanisms are a natural way to clear out outdated information.

These three abstractions are especially useful in modelling temporal aspects of reasoning for real-time interpretation of clinical data, for example in real-time monitoring systems. They assist the clinical staff in medical environments such as operating rooms or intensive care units, where decisions need to be taken quickly. If the information flood overloads the operators' sensory inputs, false positive alarms may be common and life-threatening situations may be overlooked. To aid the operators in real time, intelligent patient monitoring systems [9] should reason about complex situations under constraints such as resource limitations and the guarantee of timely response.

2.4 Implicit handling of time in clinical studies

Most data analysis methods assume that all data was generated by a single concept and is basically a random sample drawn from a stationary distribution [17]. In many cases, however, it is more accurate to assume that data was generated by a series of concepts, or by a concept function with time-varying parameters. Traditional machine learning systems learn incorrect models when they erroneously assume that the underlying concept is stationary when in fact it is changing or drifting [17]. For classification systems, which attempt to learn a discrete function given examples of its inputs and outputs, this problem takes the form of changes in the target function over time, known as concept drift [16, 20, 31, 34]. In this section we review some methods for dealing with concept drift.

2.4.1 Drifting concepts

Recently, several systems have been developed that employ Machine Learning methods in real life applications. They learn real-life concepts that tend to change over time [20, 31, 34]. An illustrative example comes from Text Mining, when learning shifting human interests [13].

Concept drift, whether abrupt or gradual [15, 16], occurs over time. The evidence for changes in a concept is represented by the training examples, which are distributed over time. Hence old observations can become irrelevant to the current time period, and the learned knowledge can become outdated. Several methods have been suggested to cope with this problem, either by forgetting outdated induced knowledge, or by forgetting outdated training examples [13, 15, 21, 34].

Special techniques are applied when concepts can be expected to recur [15]. Recurring (oscillating) concepts may be due to cyclic phenomena or may be associated with irregular phenomena. In both cases the approach is to identify stable concepts and the associated context-specific, locally stable concepts, and store them to be reused when appropriate.

The remainder of the paper aims to review some relatively simple statistical and machine learning techniques devised to detect and cope with drifting concepts. We will focus on forgetting of outdated training examples (learning with partial memory), which is, according to Dojat [10], crucial for artificial or natural systems with memory. We will apply the so-called passive forgetting [10], as this approach is general and does not require significant changes in training algorithms.

2.4.2 Partial memory learning

Partial memory learners are systems that select and maintain a portion of the past training examples, which they use together with new examples in subsequent training episodes. Such systems can learn by memorizing selected new facts, or by using selected facts to improve the current concept descriptions or to derive new concept descriptions. Researchers have developed partial memory systems because they can be less susceptible to overtraining when learning concepts that change or drift, as compared to learners that use other memory models [27, 34]. The key issues for partial memory learning systems are how they select the most relevant examples from the input stream, maintain them, and use them in future learning episodes. These decisions affect the system's classification accuracy, memory requirements, and ability to cope with changing concepts. A selection policy might keep each training example that arrives, while the maintenance policy forgets examples after a fixed period of time.

These policies more or less bias the learner toward recent events and, as a consequence, the system may forget important but rarely occurring events. On the other hand, a learner that is strongly anchored to the past may perform poorly if concepts change or drift.

2.4.3 Learning to forget

Most frequently, forgetting is implemented in an abrupt manner. This means that the examples that are irrelevant according to some time criterion (e.g. examples that are outdated) are deleted from the partial memory [27]. Hence, these instances are totally forgotten. The examples that remain in the partial memory are equally important for the learning algorithm. Another possibility is to use gradual forgetting [21]. It can be implemented with a time-based forgetting function, which provides each example with a weight according to the time of its occurrence. The importance of an example diminishes with time. The drawback of this approach is that machine learning algorithms need to implement techniques for dealing with unequally important examples.

Abrupt forgetting (windowing). A common approach to learning from time-changing data is to repeatedly apply a traditional learner to a sliding window of w examples: as new examples arrive they are inserted into the beginning of the window, a corresponding number of examples is removed from the end of the window, and the learner is reapplied [34]. As long as w is small relative to the rate of concept drift, this procedure assures the availability of a model for the current concept generating the data. If the window is too small, however, there may be insufficient examples to satisfactorily learn the concept. Further, the computational cost of reapplying a learner may be prohibitively high, especially if examples arrive at a rapid rate and the concept changes quickly.
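To make the procedure concrete, the following minimal sketch (ours, not part of the original study) implements such a sliding-window learner in a test-then-train loop; it assumes numeric attributes and scikit-learn's GaussianNB, and all names are illustrative:

```python
# A minimal sketch of abrupt forgetting (windowing); assumes numeric
# attributes and a scikit-learn-style learner.
from collections import deque
from sklearn.naive_bayes import GaussianNB

def sliding_window_learner(stream, window_size):
    """Predict each arriving example with a model trained only on the
    last `window_size` examples, then add the example to the window."""
    window = deque(maxlen=window_size)  # old examples fall off the end
    predictions = []
    for x, y in stream:                 # stream of (attributes, diagnosis)
        if len(window) > 1:
            X_train = [xi for xi, _ in window]
            y_train = [yi for _, yi in window]
            model = GaussianNB().fit(X_train, y_train)
            predictions.append(model.predict([x])[0])
        window.append((x, y))           # forgetting happens implicitly here
    return predictions
```

Refitting on every step is exactly the computational cost mentioned above; with a truly incremental learner such as naive Bayes, the counts of the dropped example could instead be subtracted, avoiding the refit.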

Gradual forgetting. The principal idea behind gradual forgetting is that natural forgetting is a gradual process. This means that newer training examples should be more important than older ones, and their importance should decrease with time. The importance of an example is given by its weight w = f(t). The calculated weights must lie in an interval that is suitable for the applied learning algorithms.

Assuming that training examples arrive at equal time steps, Koychev [21] suggests using a linear gradual forgetting function, defined as follows:

    w_i = -(2k / (n - 1)) i + 1 + k    (2)

where i is a counter of observations starting from the most recent one and going back over time (i = 0, ..., n - 1), n is the length of the observed training sequence, and k is a parameter that represents the percentage by which the weight of the oldest observation is decreased, and consequently the weight of the most recent one increased, in comparison to the average. By varying the parameter k, the slope of the forgetting function can be adjusted.
Within the same framework, a kernel function for example weighting can also be used (Eq. 3):

    w_i = (1 / (√(2π) k)) exp(-d² / (2k²))    (3)

Here d = i/n is the relative time distance to the training example from the past, and k is a real-valued kernel parameter. Both forgetting functions (Eq. 2 and Eq. 3) were utilized in the experiments described in Sec. 4.
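Both weighting schemes are easy to state in code. The sketch below (ours; it follows Eqs. 2 and 3 as reconstructed above, with i = 0 denoting the most recent example) computes the two weight vectors:

```python
# Sketch of the two forgetting functions; i = 0 is the most recent example.
import numpy as np

def linear_weights(n, k):
    """Linear gradual forgetting (Eq. 2): weights average to 1; the most
    recent example gets weight 1 + k, the oldest 1 - k."""
    i = np.arange(n)
    return -(2.0 * k / (n - 1)) * i + 1.0 + k

def kernel_weights(n, k):
    """Gaussian-kernel forgetting (Eq. 3) with relative distance d = i/n."""
    d = np.arange(n) / n
    return np.exp(-d ** 2 / (2.0 * k ** 2)) / (np.sqrt(2.0 * np.pi) * k)

# For example, linear_weights(5, 0.9) yields [1.9, 1.45, 1.0, 0.55, 0.1]:
# the newest example counts almost twenty times as much as the oldest.
```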

Setting the parameters. While we have quite a few options for dealing with drifting concepts, they all require parameter adjustment (window size, slope of the linear function, kernel parameter). Because we cannot detect drift until it has happened, these parameters cannot be optimally set in advance, unless we know the exact extent of the drift. Therefore we always start with a certain amount of drifted data that can be used for parameter optimization [20], i.e., for choosing the window size, the slope of the linear function, or the kernel parameter.
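In practice this optimization can be a plain grid search over the data collected so far. A hedged sketch (in the spirit of [20], reusing the `sliding_window_learner` sketched earlier; the candidate sizes are illustrative):

```python
# Choose the window size maximizing test-then-train accuracy on the
# chronologically ordered examples available so far (a sketch only).
def tune_window_size(examples, candidate_sizes=(25, 50, 100, 200)):
    labels = [y for _, y in examples]
    best_size, best_acc = None, -1.0
    for w in candidate_sizes:
        preds = sliding_window_learner(examples, window_size=w)
        truth = labels[-len(preds):]   # predictions start after warm-up
        acc = sum(p == t for p, t in zip(preds, truth)) / len(preds)
        if acc > best_acc:
            best_size, best_acc = w, acc
    return best_size, best_acc
```

The slope and kernel parameters can be tuned the same way, by a one-dimensional search to the desired precision.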

2.5 A statistical view of implicit temporal data

The assumption in data mining is that the data is randomly drawn from a stationary distribution. When the underlying distribution changes (e.g. over time), this change can be detected with structural change tests at a certain confidence level α.

The problem of detecting structural changes in time series (modelled by linear regression relationships) is an important topic in statistical and econometric research. The most important classes of tests on structural change are the tests from the generalized fluctuation test framework (CUSUM and MOSUM tests) [22] on one side, and tests based on F statistics [6, 2] on the other. A topic that has gained a lot of interest recently, especially in conjunction with F tests, is monitoring for a structural change, i.e., starting after a history phase (without structural changes) to analyze new observations, so as to detect a structural change as soon after its occurrence as possible [26].
Let us consider the standard linear regression model

    y_i = x_i^T β_i + u_i    (i = 1, ..., n),    (4)

where at time i, y_i is the observation of the dependent variable, x_i = (1, x_i2, ..., x_ik)^T is a k × 1 vector of observations of the independent variables, with the first component equal to unity, the u_i are independently and identically distributed, iid(0, σ²), and β_i is the k × 1 vector of regression coefficients. Tests on structural change are concerned with testing the null hypothesis of no structural change,

    H_0 : β_i = β_0    (i = 1, ..., n),    (5)

against the alternative that the coefficient vector varies over time, with certain tests being more or less suitable (i.e., having good or poor power) for certain patterns of deviation from the null hypothesis.

Regression coefficients β_i are estimated with the ordinary least squares (OLS) estimate β̂^(i,j), based on the observations i+1, ..., i+j; β̂^(i) = β̂^(0,i) is the OLS estimate based on all observations up to i. Hence β̂^(n) is the common OLS estimate in the linear regression model.

2.5.1 F tests

The most flexible approach to investigate whether the null hypothesis of no structural change holds is to use F test statistics. F tests are designed to test against a single-shift alternative. Thus, the alternative can be formulated on the basis of the model (Eq. 4) as

    β_i = β_A  (1 ≤ i ≤ i_0),    β_i = β_B  (i_0 < i ≤ n),    (6)

where i_0 is some change point in the interval (k, n - k). Chow [6] was the first to suggest such a test on structural change for the case where the (potential) change point i_0 is known. He proposed to fit two separate regressions for the two subsamples defined by i_0 and to reject whenever the test statistic F_{i_0} exceeds some critical value, where

    F_{i_0} = (û^T û - ê^T ê) / (ê^T ê / (n - 2k)).    (7)

Here ê = (û_A, û_B)^T are the residuals from the full model (consisting of two regressions), where the coefficients in the subsamples are estimated separately, and û are the residuals from the restricted model, where the parameters are fitted just once for all observations. The test statistic F_{i_0} has an asymptotic χ² distribution with k degrees of freedom, and (under the assumption of normality) F_{i_0}/k has an exact F distribution with k and n - 2k degrees of freedom. The major drawback of this Chow test is that the change point has to be known in advance. A natural extension of the Chow test is to calculate the F statistic for all potential change points, or for all potential change points in a given interval, and to reject the null hypothesis if any of those statistics gets too large.
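A direct implementation of this sweep is straightforward; the sketch below (ours, using NumPy least squares) computes F_{i_0} from Eq. 7 for every admissible change point. Note that the maximum of these statistics follows a supremum-F distribution rather than a pointwise F distribution; the R package strucchange [35] provides appropriate critical values, so any fixed threshold used with this sketch is only a placeholder.

```python
import numpy as np

def chow_f_statistics(X, y):
    """F statistic of Eq. 7 for every potential change point i0 in (k, n-k).
    X is the n x k design matrix (first column all ones), y the response."""
    n, k = X.shape
    def rss(Xp, yp):                      # residual sum of squares of a fit
        beta, *_ = np.linalg.lstsq(Xp, yp, rcond=None)
        r = yp - Xp @ beta
        return float(r @ r)
    rss_restricted = rss(X, y)            # one regression over all data
    stats = {}
    for i0 in range(k + 1, n - k):
        rss_full = rss(X[:i0], y[:i0]) + rss(X[i0:], y[i0:])  # two fits
        stats[i0] = (rss_restricted - rss_full) / (rss_full / (n - 2 * k))
    return stats
```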

Continuous monitoring of the data stream. Up to this point we were concerned with the retrospective detection of structural changes in given data sets. Several structural change tests have been extended to the monitoring of linear regression models where new data arrive over time [26]. Such forward-looking tests are closely related to sequential tests. When new observations arrive, estimates are computed sequentially from all available data (the historical sample plus the newly arrived data) and compared to the estimate based only on the historical sample. As in the retrospective case, the hypothesis of no structural change is rejected if the difference between these two estimates gets too large.

For monitoring, the standard linear regression model (Eq. 4) is generalized to

    y_i = x_i^T β_i + u_i    (i = 1, ..., n, n+1, ...),    (8)

i.e., we expect new observations to arrive after time n, when the monitoring begins.
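A simplified monitoring loop can reuse the statistic above by keeping the end of the history phase as the single potential change point and re-testing whenever a new observation arrives. This is only a sketch of the idea: a faithful implementation would use the growing monitoring boundaries derived in [26] instead of the fixed placeholder threshold assumed here.

```python
def monitor_stream(X_hist, y_hist, new_points, threshold):
    """Sequentially test for a structural change at the end of the history
    phase (i0 = len(y_hist)); `threshold` is a placeholder boundary."""
    X = [list(row) for row in X_hist]
    y = list(y_hist)
    i0 = len(y)                          # fixed potential break point
    for x_new, y_new in new_points:
        X.append(list(x_new))
        y.append(y_new)
        n, k = len(y), len(X[0])
        if n - i0 > k:                   # enough points to fit both models
            f = chow_f_statistics(np.array(X, float), np.array(y, float))
            if f.get(i0, 0.0) > threshold:
                return n                 # change signalled after n examples
    return None                          # no structural change detected
```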

3 Materials

3.1 Heart rate time series

To illustrate the use of structural change tests, we present two interesting time series from Goldberger et al. [12]: series 1 (Fig. 1) and series 2 (Fig. 3).

Each series contains 1800 evenly-spaced measurements of instantaneous heart rate from a single subject. The two subjects were engaged in comparable activities for the duration of each series. The measurements (in units of beats per minute) occur at 0.5 second intervals, so that the length of each series is exactly 15 minutes.

The rapid oscillations visible in series 1 are caused by respiratory sinus arrhythmia, a modulation of heart rate that is greatest in young subjects and gradually decreases in amplitude with increasing age.

On the other hand, series 2 belongs to the class of congestive heart failure, where circulatory delays interfere with the regulation of carbon dioxide and oxygen in the blood, leading to slow oscillations of heart rate.

Both time series contain anomalies near the beginning and the end of the observation period.

3.2 Clinical diagnostics of coronary artery disease (Nuclear dataset)

To illustrate implicit temporal data mining methods and the simultaneous use of structural change statistical tests, we focused on the Nuclear dataset [14, 25] for three reasons:

- we have been working on this dataset for quite some time and therefore know it well,
- we have close relations with the physician who collected the data, and who also provided the original diagnoses,
- it was possible to order the patients by the date of their examination, which is rare in publicly available datasets (such as the UCI repository), mostly because existing temporal information is not compiled by the experts preparing the data for analysis.

Coronary artery disease (CAD) is the most important cause of mortality in all developed countries. It is caused by diminished blood flow through the coronary arteries due to stenosis or occlusion. CAD produces impaired function of the heart and finally the necrosis of the myocardium: myocardial infarction.

In our study we used a dataset of 327 patients (250 males, 77 females) who underwent clinical and laboratory examinations, exercise ECG, myocardial scintigraphy and coronary angiography because of suspected CAD. The features from the ECG and scintigraphy data were extracted manually by the clinicians.

[Table 1 about here.]

In 228 cases the disease was angiographically confirmed, and in 99 cases it was excluded. 162 patients had suffered a recent myocardial infarction. The patients were selected from a population of approximately 4000 patients who were examined at the Nuclear Medicine Department between 1991 and 1994. We selected only the patients with complete diagnostic procedures (all four levels) [25]. Results of the fourth level (coronary angiography) were taken as the gold standard.

4 Results

4.1 Detecting the structural change in time series

Figs. 1 to 4 depict the original time series (Figs. 1 and 3) and the respective critical values of F tests (Figs. 2 and 4) at the significance level α = 0.05.

In series 1 (respiratory sinus arrhythmia) three structurally changed intervals are detected: besides the true anomalies, the test detects another interval with raised heart rate in the middle of the time series.

In series 2 (congestive heart failure) two structurally changed intervals are detected; they are both associated with true anomalies.

As we can see, F tests are quite good at detecting structural changes. Note that such a periodic time series is a more difficult problem than a monotonic one. It is therefore conceivable that structural change tests may be of use in data mining of implicit temporal data.

[Figure 1 about here.]

[Figure 2 about here.]

[Figure 3 about here.]

[Figure 4 about here.]
4.2 Case study: nuclear diagnostics of coronary artery disease

In this section we demonstrate how we can easily and efficiently detect and (at least partially) account for implicit temporal properties of collected data by building a series of static data models.

Our case study is in the diagnostics of coronary artery disease [14, 25]. This is a two-class problem: diagnosing whether or not a patient suffers from coronary artery disease (CAD). The data were collected in the years between 1991 and 1994. After performing leave-one-out testing on the whole dataset and ordering the results by the date of the final examination, we obtained the surprising classification accuracy graphs depicted in Fig. 5. The decreasing classification accuracy of both physicians and the naive Bayesian classifier in the last observed year (1994) could be either a result of a significantly changed class distribution or of a concept change. The former seems not to be the case in our problem, since the class distribution (see class prevalence in Fig. 5) does not change significantly over the observed time interval. This leads us to the question of what has happened and how we can deal with it.

[Figure 5 about here.]
It is important that retrospective studies, as well as ongoing (online) studies where Machine Learning tools are being used, employ techniques for detecting and dealing with time-changing concepts. While this may not happen very often, it may seriously skew the results of otherwise perfectly valid studies. In order to compensate for changed conditions, it is also important for a Machine Learning system to decide when to rebuild a model to account for newly arrived training examples, and what extent of historical training data to use for learning.

4.2.1 Experimental setup

For testing different methods for dealing with concept drift we applied the following methodology. All examples were ordered by the time of the patient's examination. When we needed to start with some initial training set, we fixed for this purpose the first 100 out of 327 examples. Performance on this set was evaluated with a leave-one-out testing process. On the remaining examples we applied the different techniques described in Sec. 2. For windowing as well as for gradual forgetting, testing was done in single steps, where the potential training set consisted of the first n examples, 100 ≤ n ≤ 326, and the testing set consisted of the (n+1)-st example. Of the training examples, either only the last w (the window size) were used for training, or they were all assigned different weights (gradual forgetting).

Our experimental Machine Learning tool of choice was the naive Bayesian classifier. It was suitable for our purpose because of its very fast, incremental learning, because it usually performs well in medical diagnostic problems, and because it can easily be modified to deal with unequally important (weighted) training examples.
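For completeness, here is a minimal weighted naive Bayes for discrete attributes (our sketch, not the exact implementation used in the study): the usual frequency counts are simply replaced by sums of example weights, with Laplace smoothing.

```python
import math
from collections import defaultdict

class WeightedNaiveBayes:
    """Naive Bayes over discrete attributes with per-example weights."""

    def fit(self, X, y, weights):
        self.classes = sorted(set(y))
        self.class_weight = defaultdict(float)   # sum of weights per class
        self.counts = defaultdict(float)         # (class, attr, value) sums
        self.values = defaultdict(set)           # observed values per attr
        for xi, yi, wi in zip(X, y, weights):
            self.class_weight[yi] += wi
            for a, v in enumerate(xi):
                self.counts[(yi, a, v)] += wi
                self.values[a].add(v)
        self.total = sum(self.class_weight.values())
        return self

    def predict(self, x):
        def log_score(c):
            s = math.log((self.class_weight[c] + 1.0)
                         / (self.total + len(self.classes)))
            for a, v in enumerate(x):
                s += math.log((self.counts[(c, a, v)] + 1.0)     # Laplace
                              / (self.class_weight[c] + len(self.values[a])))
            return s
        return max(self.classes, key=log_score)
```

With abrupt forgetting the weights are simply 0 or 1 (outside or inside the window); with gradual forgetting they come from Eq. 2 or Eq. 3, reversed, since those equations index examples from the most recent one backwards.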

4.2.2 Detecting the concept drift

The first task was to find out how concept drift can be detected. If we want to detect it, it must have already happened or have started to happen. Thus we always have a certain amount of data available for experimenting. In diagnostic problems, where sooner or later the diagnoses are confirmed, our task is easy. When we collect enough drifted examples, the drift is reflected in a significantly decreased average classification accuracy achieved in the recent past, in comparison with the average classification accuracy achieved in the distant past (see average classification accuracy in Fig. 6; the recent past is the last 50 examples).

However, in prognostic problems the situation is more difficult, since actual outcomes (and thus the correctness of prognoses) may not be known for a long time, if ever. In such cases it may be more useful to use a measure of reliability estimation [23, 24] that assigns a kind of confidence value to every prediction (see reliability estimation in Fig. 6). Although actual outcomes may not be known, one could detect the drift by significantly decreased average reliability estimations in the recent past, in comparison with the average reliability estimations calculated in the distant past (see average reliability estimation in Fig. 6; the recent past is the last 50 examples).
[Figure 6 about here.]
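The comparison described above is easy to operationalize; a hedged sketch follows (the 50-example window matches the text, while the 0.10 decision threshold is an illustrative assumption of ours):

```python
import numpy as np

def drift_suspected(scores, recent=50, min_drop=0.10):
    """`scores` are per-example 0/1 correctness values (or reliability
    estimates), ordered by examination date; `min_drop` is illustrative."""
    if len(scores) <= recent:
        return False
    distant_past = np.mean(scores[:-recent])   # average in the distant past
    recent_past = np.mean(scores[-recent:])    # average in the recent past
    return distant_past - recent_past >= min_drop
```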

In Fig. 6 it can clearly be observed that the drift has been happening since the beginning of the last observed year (1994), that is, from example 261 on. We therefore selected the last 66 examples as the drifted set of our interest.
Detecting the concept drift with a structural change test. For a quantitative evaluation of the concept drift, structural change statistical tests were used. Results of the F test performed on the original non-smoothed data (the raw 0/1 data of Fig. 5) are shown in Fig. 7. At the significance level α = 0.05, we can detect structural changes in the first 30 patients in year 1991, as well as in the patients from 261 on in year 1994. The first structural change was admittedly caused by biased data collection, while the second one truly represents a concept drift (actually a changed context in data acquisition).

[Figure 7 about here.]
We wanted to check whether this concept drift is also reflected in the structure of the data. For this purpose we plotted each patient's score on the most important principal component (Fig. 8) and tested the series for a structural change (Fig. 9). On average, the most important principal component explained about 35% of the total data variance. As one can see in both figures, there are no significant trends or regularities in Fig. 8, although the components are slightly decreasing with time. Also, the test statistic values of the structural change F test are far below the critical value at significance level α = 0.05. Clearly, unsupervised testing of principal components is not very promising. A much better unsupervised approach seems to be transductive reliability estimation (Fig. 6).
[Figure 8 about here.]

[Figure 9 about here.]
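The principal-component check can be reproduced along the following lines (a sketch assuming numeric, standardized attributes and scikit-learn, which the study does not prescribe); the returned score series can then be fed to the same F test as above, with an intercept-only design matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def first_pc_scores(X):
    """Score of every patient on the first principal component."""
    Xs = StandardScaler().fit_transform(X)   # zero mean, unit variance
    pca = PCA(n_components=1).fit(Xs)
    return pca.transform(Xs).ravel(), pca.explained_variance_ratio_[0]

# scores, ratio = first_pc_scores(X)
# stats = chow_f_statistics(np.ones((len(scores), 1)), scores)  # mean shift
```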

We also experimented with real-time monitoring of the data in order to detect the structural change as soon as possible. As suggested by Leisch [26], we started after a history phase without structural changes (we left the first 30 patients out) and incrementally analyzed and tested new observations for a structural change. We found that the F test is quite responsive in this matter: it correctly found the location (patient 261) of the significant structural change 23 patients after the actual start of the drift. These 23 patients correspond to approximately three months in real time.
535

In Fig. 10 there is a situation 20 patients after the beginning of the drift, when

536

it wasnt detected yet. In contrast, Fig. 11 depicts a situation 25 patients after the

537

beginning of the drift, when the presence of structural change and its location in

538

time is already located ( = 0.05).

539

[Figure 10 about here.]

540

[Figure 11 about here.]

541

If the data were monitored continuously in clinical practice, the drift could have

542

been detected a few months after its beginning, and not ten years later.

543

4.2.3 Dealing with the concept drift

544

For dealing with the concept drift we applied windowing as well as linear and

545

kernel-based gradual forgetting (Sec. 2.4.3). We used the first 261 non-drifted

546

examples for parameter optimization (window size, slope and kernel parameter).

547

In order to evaluate the quality of obtained parameters, we also optimized the pa-

548

rameters on the drifted set. For comparison we selected the best achieved results.

549

The obtained parameter values were tested on the last 66 drifted examples, and

550

the results are compared in Tab. 2. Fig. 12 depicts variations of classification

551

accuracy with different parameter settings.

552

[Table 2 about here.]

553

[Figure 12 about here.]

23

554

As we can see, differences in accuracy between optimized and actual best pa-

555

rameter values exist, but they are small. By using the optimized parameter values

556

the average performance on the whole dataset was 94-95% for all three methods.

557

We cannot say that any of them (windowing, linear or kernel-based forgetting) is

558

significantly better than the other ones. However, we can see an improvement in

559

overall accuracy for naive Bayesian classifier by 4%. This is no small achievement,

560

since it actually reduces the error rate by 44% (from 9% to 5%). But more than

561

this it is important that the performance on the drifted examples (last 66) is much

562

higher (by 20%) for naive Bayesian classifier (from 64% to 83-85%). This means

563

that we can almost level the performance on this problematic subset with overall

564

performance and should equal to it when a few more training examples arrive.

565

In Figs. 13 and 14 we compare classification accuracy of the ordinary naive

566

Bayesian classifier with windowing and gradual forgetting methods for different

567

parameter values. For training ordinary naive Bayesian classifier, non-drifted ex-

568

amples (first 261) were used, and leave-one-out testing was performed, the others

569

were added in training set and tested incrementally.

570

[Figure 13 about here.]

571

[Figure 14 about here.]

572

5 Discussion

573

In the paper we have focused on, and briefly reviewed some approaches for han-

574

dling of both explicit temporal (time-stamped) and implicit temporal data.

575

Explicit temporal data (time series) are frequently used in medical and other

576

studies. We briefly reviewed numerical (time series analysis) and symbolic (tem-

577

poral abstractions) methods for dealing with abundant temporal data. They are

24

578

especially useful for qualitative description, summarization, and reasoning on tem-

579

poral data.

580

On the other hand, in implicit temporal data the temporal component is hidden,

581

yet always present, and determines their ordering, as were collected over certain

582

periods of time. More often than not, the hidden temporal components of collected

583

datasets are ignored. This may cause unfortunate anomalies and worsen results, as

584

shown in our case study.

585

We reviewed statistical methods for detecting the structural change (concept

586

drift), and three different machine learning methods for dealing with changing

587

(drifting) concepts: windowing and gradual (linear or kernel-based) forgetting. We

588

found out that in our case study of coronary artery disease diagnostics all perform

589

reasonably well. While they all require setting certain parameters, they can be

590

automatically tuned on the training set [20], and nearly optimal results can be ex-

591

pected. In windowing, at most n (size of the training set) re-runs of the training

592

algorithm are required for window size selection, whereas for gradual forgetting

593

linear optimization with desired precision for respective parameter is sufficient.

594

We can detect concept drift only after it has happened, so we always have

595

some data available. It can be detected by using classification accuracy (in diag-

596

nostic problems) or reliability estimation (in prognostic problems), quantitatively

597

evaluated by statistical structural change tests (such as F test) with a prescribed

598

confidence level. Our experiments with monitoring for a structural change tests

599

show that they are able to precisely and quickly locate the starting point of concept

600

drift. Namely, one should rebuild (re-learn) the model only when absolutely neces-

601

sary and adjust suitable parameters (e.q. window size, slope or kernel parameter)

602

to compensate for the drift. This is especially important for practical applications,

603

where rebuilding a model is not performed every time when a new training exam-

604

ple arrives. Model rebuilding may require a presence of a machine learning expert,

25

605

especially if learning parameters need to be changed. Often a generated model is

606

stored and used independently of the learner (e.g. in a handheld device, or even

607

printed on a paper). In such cases a model should be rebuilt and deployed only

608

when it is really necessary.

609

In our case study of coronary artery disease diagnostics we managed to achieve

610

overall improvement of 4% compared to ordinary naive Bayesian classifiers result.

611

Since it actually reduces the (already low) error rate by 44% (from 9% to 5%), it

612

is no small achievement. The improvement was most notable in last 66 drifted

613

examples, where it was about 20% (from 64% to 83-85%). This means that the

614

performance on this problematic subset is almost levelled with overall performance

615

and should equal to it with a few more training examples.

616

A very encouraging result is also it correct and quick pinpointing of the location

617

(patient 261) of the significant structural change only 23 patients after the actual

618

start of the drift. The 23 patients correspond to approximately three months in real

619

time. If the data were monitored and tested continuously, the drift could have been

620

detected a few months and not ten years after its beginning. We argue that any

621

(online) learning system that is used in practice should use similar techniques at

622

least to detect and possibly deal with drifting concepts.

623

There are several things that can be done to further develop the described meth-

624

ods. Most notably, statistical tests should be integrated with machine learning and

625

data mining methods more thoroughly in order to continuously check for the pos-

626

sibility of a structural change (concept drift) and to guide the re-learning of the

627

models. Also, a weighting scheme for gradual forgetting should be devised, that

628

does not need to be recalculated every time a new training example arrives. This

629

would enable true incremental learning, however it would require from the learner

630

to cope with increasingly large, theoretically unlimited example weights.

631

There is one question left unanswered, and that is how and why in our case

26

632

study a concept drift has occurred in the first place. Since the reasons are rather

633

delicate and personal in their nature, it will suffice to say that there was a serious

634

yet unconscious human error in data acquisition (interpretation of scintigraphic

635

images) caused by professional traumatic experience.

636

Acknowledgements

637

I thank Ciril Groselj, Nuclear Medicine Department, University Medical Centre

638

Ljubljana, for collecting the data, and Michel Dojat, Unite mixte INSERM-UJF

639

U594 Neuroimagerie Fonctionnelle & Metabolique, Grenoble, for pointing to

640

me his most interesting work. This work was supported by the Slovenian Ministry

641

of Education and Science.

642

References

643

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence

644

databases. In Proc. 4th Intl Conference on Foundations of Data Organization and

645

Algorithms, pages 6984, 1993.

646
647

648
649

[2] D. W. K. Andrews and W. Ploberger. Optimal tests when a nuisance parameter is


present only under the alternative. Econometrica, 62:13831414, 1994.
[3] R. Bellazzi, C. Larizza, and Alberto Riva. Temporal abstractions for diabetic patients
management. In Proc. AIME97, pages 319330, 1997.

650

[4] R. Bellazzi, P. Magni, C. Larizza, G. De Nicolao, A. Riva, and M. Stefanelli. Mining

651

biomedical time series by combining structural analysis and temporal abstractions.

652

JAMIA, Symposium supplement 1998, pages 160164, 1998.

653
654

[5] B. Chiu, E. Keogh, J. Lin, and S. Lonardi. Efficient discovery of unusual patterns in
time series. In Proc. KDD02, pages 550556, 2002.

27

655
656

[6] G. C. Chow. Tests of equality between sets of coefficients in two linear regressions.
Econometrica, 28:591605, 1960.

657

[7] W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russel, editors,

658

Proc. 12th Intl. Conf. on Machine Learning ICML95, pages 115123, San Francisco,

659

California, USA, 1995. Morgan Kaufmann.

660

[8] D. Dasgupta and S. Forrest. Novelty detection in time series data using ideas from

661

immunology. In Proc. of The International Conference on Intelligent Systems, 1999.

662

[9] M. Dojat, F. Pachet, Z. Guessoum, D. Touchard, A. Harf, and L. Brochard.

663

Neoganesh: A working system for the automated control of assisted ventilation in

664

icus. Artificial Intelligence in Medicine, 11:97117, 1997.

665
666

[10] M. Dojat and C. Sayettat. A realistic model for temporal reasoning in real-time
patient monitoring. Applied Artificial Intelligence, 10:121143, 1996.

667

[11] F. Esposito, D. Malerba, and G. Semeraro. Simplifying decision trees by pruning and

668

grafting: new results. In N. Lavrac and S. Wrobel, editors, Proc. Europ. Conf. on

669

Machine Learning ECML95, pages 287290. Springer Verlag, 1995.

670

[12] A. L. Goldberger and D. R. Rigney. Nonlinear dynamics at the bedside. In L. Glass,

671

P. Hunter, and A. McCulloch, editors, Theory of Heart: Biomechanics, Biophysics,

672

and Nonlinear Dynamics of Cardiac Function, pages 583605. Springer-Verlag, New

673

York, 1991.

674
675

[13] I. Grabtree and S. Soltysiak. Identifying and tracking changing interests. International Journal of Digital Libraries, 2:3853, 1998.

676

[14] C. Groselj, M. Kukar, J. Fettich, and I. Kononenko. Machine learning improves

677

the accuracy of coronary artery disease diagnostic methods. In Proc. Computers in

678

Cardiology, volume 24, pages 5760, Lund, Sweden, 1997.

679
680

[15] M. B. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning, 32:101126, 1998.

28

681
682

[16] D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14:2745, 1994.

683

[17] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In

684

Proceedings of the 17th ACM SIGKDD Inter. Conf. on Knowledge Discovery and

685

Data Mining, pages 97106, San Francisco, CA, 2001. ACM Press.

686

[18] E. Keogh, K. Chakrabarti, M. Pazzani, , and S. Mehrotra. Dimensionality reduction

687

for fast similarity search in large time series databases. Journal of Knowledge and

688

Information Systems, 3(3):263286, 2000.

689

[19] E. Keravnou. Modelling medical concepts as time objects. In P. Barahona, M. Ste-

690

fanelli, and J. Wyatt, editors, Lecture Notes in Medical Informatics, pages 6790.

691

Springer-Verlag, Berlin, 1995.

692

[20] R. Klinkenberg and T. Joachims. Detecting concept drift with support vector ma-

693

chines. In P. Langley, editor, Proceedings of ICML-00, 17th International Confer-

694

ence on Machine Learning, pages 487494, Stanford, US, 2000. Morgan Kaufmann

695

Publishers, San Francisco, US.

696

[21] I. Koychev. Gradual forgetting for adaptation to concept drift. In Proceedings of

697

ECAI 2000 Workshop Current Issues in Spatio-Temporal Reasoning, pages 101106,

698

Berlin, Germany, 2000.

699
700

[22] C. M. Kuan and K. Hornik. The generalized fluctuation test: A unifying view. Econometric Reviews, 14:135161, 1995.

701

[23] M. Kukar. Making reliable diagnoses with machine learning: A case study. In Silvana

702

Quaglini, Pedro Barahona, and Steen Andreassen, editors, Proceedings of Artificial

703

Intelligence in Medicine Europe, AIME 2001, pages 8896, Cascais, Portugal, 2001.

704

Springer.

705

[24] M. Kukar and I. Kononenko. Reliable classifications with Machine Learning. In

706

Proceedings of 13th European Conference on Machine Learning, ECML 2002, pages

707

219231, 2002.

29

708

[25] M. Kukar, I. Kononenko, C. Groselj, K. Kralj, and J. Fettich. Analysing and im-

709

proving the diagnosis of ischaemic heart disease with machine learning. Artificial

710

Intelligence in Medicine, 16 (1):2550, 1999.

711
712

713
714

715
716

717
718

719
720

[26] F. Leisch, K. Hornik, and C. M. Kuan. Monitoring structural changes with the generalized fluctuation test. Econometric Theory, 16:835854, 2000.
[27] M. A. Maloof and R. S. Michalski. Selecting examples for partial memory learning.
Machine Learning, 41(1):2752, 2000.
[28] Y. Shahar. A framework for knowledge-based temporal abstraction. Artificial Intelligence, 90(1-2):79133, 1997.
[29] Y. Shahar and M. A. Musen. Knowledge-based temporal abstraction in clinical domains. Artificial Intelligence in Medicine, 8(3):267298, 1996.
[30] C.E. Shannon and W. Weaver. The mathematical theory of communications. The
University of Illinois Press, Urbana, IL, 1949.

721

[31] N. A. Syed, H. Liu, and K. K. Sung. Handling concept drifts in incremental learning

722

with support vector machines. In Knowledge Discovery and Data Mining, pages

723

317321, 1999.

724

[32] A. S. Weigend, D. E. Rumelhart, and A. H. Bernardo. Generalization by weight elimi-

725

nation with application to forecasting. In Advances in Neural Information Processing

726

Systems, volume 3, pages 875882, Denver, CO, USA, 1991. Morgan Kaufman.

727

[33] B. Whitehead and W. A. Hoyt. A function approximation approach to anomaly de-

728

tection in propulsion system test data. In Proc. AIAA/SAE/ASME/ASEE 29th Joint

729

Propulsion Conference, 1993.

730
731

[34] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden
contexts. Machine Learning, 23(1):69101, 1996.

732

[35] A. Zeileis, F. Leisch, K. Hornik, and C. Kleiber. strucchange: An r package

733

for testing for structural change in linear regression models. Journal of Statistical

734

Software, 7(2):138, 2002.

30

List of Tables

1  CAD data for different diagnostic levels.
2  Experimental results on the drifted data.

                                     Diagnostic attributes
Diagnostic level                  Nominal   Numeric   Total
Signs, symptoms and history            23         7      30
Exercise ECG                            7         9      16
Myocardial scintigraphy                22         9      31
Coronary angiography                    1         0       1
Total attributes                       53        25      78

Disease prevalence: 70% positive, 30% negative
Entropy of classes: 0.89 bit

Table 1: CAD data for different diagnostic levels.

                                          Optimized         Best achieved     Overall
Method                 Parameter          Value  Accuracy   Value  Accuracy   accuracy
Naive Bayes, windowed  Window size        100    85%        70     88%        95%
Naive Bayes, linear    Slope (k)          0.90   83%        0.9    83%        94%
Naive Bayes, kernel    Kernel size (k)    0.25   83%        0.17   86%        95%
Physicians                                       70%               70%        85%
Ordinary Naive Bayes                             64%               64%        91%

Table 2: Experimental results on the drifted data.

List of Figures

1   Time series 1 (respiratory sinus arrhythmia).
2   Time series 1 (respiratory sinus arrhythmia). Results of structural change tests. Exceeded critical values signal significant (α = 0.05) structural changes.
3   Time series 2 (congestive heart failure).
4   Time series 2 (congestive heart failure). Results of structural change tests. Exceeded critical values signal significant (α = 0.05) structural changes.
5   Time-based variation of classification accuracy in the Nuclear dataset.
6   Detecting concept drift with classification accuracy and reliability in the Nuclear dataset.
7   F test for a structural change in the data from Fig. 5 (α = 0.05).
8   Most important principal components, their running median (middle line) and its range (top and bottom lines).
9   F test on the principal components from Fig. 8. All test statistics are far below the critical value (α = 0.05).
10  Monitoring F test for concept drift (20 drifted examples). Concept drift is not detected yet.
11  Monitoring F test for concept drift (25 drifted examples). Concept drift is determined as significant (α = 0.05).
12  Parameter tuning: performance on the drifted examples. The kernel and k parameters are represented as real values. The window size is represented as a share of the whole training set (all 261 training examples = 1.0).
13  Catching up with the drift with windowing in the Nuclear dataset. Notice the negative effect of too small a window size (w = 50).
14  Catching up with the drift with gradual forgetting in the Nuclear dataset. Differences between linear and kernel-based forgetting are almost negligible.

Figure 1: Time series 1 (respiratory sinus arrhythmia).

Figure 2: Time series 1 (respiratory sinus arrhythmia). Results of structural change tests. Exceeded critical values signal significant (α = 0.05) structural changes.

Figure 3: Time series 2 (congestive heart failure).

Figure 4: Time series 2 (congestive heart failure). Results of structural change tests. Exceeded critical values signal significant (α = 0.05) structural changes.

Figure 5: Time-based variation of classification accuracy in the Nuclear dataset.

Figure 6: Detecting concept drift with classification accuracy and reliability in the Nuclear dataset.

Figure 7: F test for a structural change in the data from Fig. 5 (α = 0.05).

Figure 8: Most important principal components, their running median (middle line) and its range (top and bottom lines).

Figure 9: F test on the principal components from Fig. 8. All test statistics are far below the critical value (α = 0.05).

Figure 10: Monitoring F test for concept drift (20 drifted examples). Concept drift is not detected yet.

Figure 11: Monitoring F test for concept drift (25 drifted examples). Concept drift is determined as significant (α = 0.05).

Figure 12: Parameter tuning: performance on the drifted examples. The kernel and k parameters are represented as real values. The window size is represented as a share of the whole training set (all 261 training examples = 1.0).

Figure 13: Catching up with the drift with windowing in the Nuclear dataset. Notice the negative effect of too small a window size (w = 50).

Figure 14: Catching up with the drift with gradual forgetting in the Nuclear dataset. Differences between linear and kernel-based forgetting are almost negligible.
