Instrumentation
Chris Thornton
Cognitive and Computing Sciences
University of Sussex
Brighton
BN1 9QH
UK
Email: Christopher.Thornton@renet.uk.com
Tel: (44)1273 678856
May 21, 2003
Abstract
Indirect sensing is a new sensing methodology which aims to eliminate
the need for special-purpose transducers in certain domains.
Rather than the sensor signal being derived from a directly connected
transducer, it is derived by data-mining representations of ambient
phenomena (sound, light, temperature etc.). As an information-fusion task,
this involves the fusion of a knowledge source (i.e., the bias characteristics
of the utilised data-mining method) with a data source. The paper
explores an application involving sensing of engine temperature in a light
aircraft. A number of data-mining methods are compared on the task and
conclusions are drawn regarding the bias characteristics which are key for
this application.
1 Introduction
Conventional sensing technologies are typically implemented in a direct fashion
[1, 2]. That is to say, a transducer is brought into physical contact with the
target property with the aim of deriving a clear and unambiguous signal. For
example, a direct-sensing approach to the problem of establishing the current
position of the undercarriage on a light aircraft is to attach pressure sensors to
the bracings which support the undercarriage in the lowered position. When
these sensors generate a signal, the undercarriage is confirmed as lowered.
An alternative is indirect sensing [3]. In this approach, there is no attempt
to bring a transducer into direct contact with the target itself. Rather the
aim is to infer changes in it by taking account of its more widespread effects,
i.e., ambient effects involving sound, light, temperature etc. An indirect sensing
approach to the problem of sensing undercarriage-position might involve the use
of a vibration sensor attached to the frame of the aircraft. The undercarriage
would then be confirmed as lowered when the airframe registers a particular
pattern of vibrations.
The indirect approach to sensing offers several advantages and enables a
greater variety of engineering solutions. It also allows sensing to be carried out
using general-purpose and therefore inexpensive hardware. The drawback is
that it typically entails a much greater degree of downstream signal-processing.
In order to derive an accurate reading of target states, it will be necessary to
interpret the signatures that the target renders in the ambient array. Because of
the huge variety of sources that contribute to a typical pattern of ambient energy
release, there is considerable interference among these signatures. As a result,
they tend to form complex and ambiguous representations of the contributing
factors.
The attempt to hand-craft processing routines for this task is a daunting
challenge and one which offers no real promise of success. The present work
explores a different approach which involves fusing information from different
sources. The essential idea is to bring together a knowledge source, in the form
of certain data-mining technologies, with a data source. The latter is made up
from the raw data representing ambient phenomena but fashioned in the form
of training examples. The goal is to derive the desired processing automatically
without any need for hand-coding.
The approach places a heavy burden on the data-mining method used. However,
there is every reason to believe that state-of-the-art methods are able to
bear this sort of load. In the last half century, there has been considerable
progress on the data-mining task, with innovations accumulating rapidly in the
last decade. The tangible result is that we now have an extensive `tool box' of
data-mining technologies which may be applied to tasks such as this [4].
1.1 Domain choice
The indirect-sensing approach promises to pay dividends in areas where the
implementation of direct sensors is problematic or costly. One area which fits
the bill perfectly is that of general aviation, i.e., non-commercial, light aircraft.
The costs of implementing instrumentation in conventional light aircraft, using
direct-sensing technologies, may form more than half of the total manufacturing
cost [5]. So the potential for cost-reduction through use of indirect-sensing
methodologies in this area is extremely good.
In addition, there is the possibility that indirect-sensing might enable relatively
sophisticated sensing functionality to `trickle down' to the less sophisticated
end of the aviation spectrum. A case in point involves the phenomenon of
carburettor icing, a problem which plagues the current generation of light aircraft.
Direct sensing of carburettor icing is feasible but not generally implemented
on light aircraft for reasons of cost. Pilots of aircraft such as the common Cessna
152 and 172 models must learn to sense intuitively the moment when carburettor
heat should be applied so as to eliminate any build-up of carburettor ice.
Using an indirect approach, there is the possibility of implementing a reliable
but low-cost sensory mechanism without the need for any direct engineering
intervention, i.e., without the need for any modifications to the fundamental
carburettor design.
Taken to its logical extreme, the indirect sensing approach offers the possibility
of assembling an entire, `offline' instrumentation system for a light aircraft,
without introducing any engineering changes whatsoever. A portable computer
equipped with an array of general-purpose ambient sensors might provide virtual
back-up for all the main instrumentation systems. Since instrumentation
failure is implicated in a large proportion of general aviation accidents [5], the
net effect might be a significant improvement in safety and reliability.
The work described below attempts to probe the practicality of using indirect-
sensing in this context. In particular, it focuses on engine-temperature sensing
in a Cessna-152 light aircraft. An empirical study is described involving the ap-
plication of data-mining methods to a real-world dataset. The aim of the study
was to determine whether data-mining of ambient audio data could provide the
means of accurately measuring engine temperature (i.e., whether data-mining
methods are able to determine engine temperature from engine noise).
a particular datum. This involves identifying commonalities among data asso-
ciated with the same prediction. In general, these commonalities are statistical
properties of the relevant data.
A review of such methods might begin with the nearest-neighbours method
[9, 10], from the field of statistics. In this method, learning involves nothing
more than storing away the examples in memory. For any given datum d, a
prediction is then generated simply by averaging over the predictions associated
with those examples showing the highest level of similarity to d. Surprisingly,
this approach produces very respectable prediction results, although usage may
be limited in practice due to the costs involved in iterating over the entire
example set each time a new prediction is required [4].
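The nearest-neighbours procedure can be sketched in a few lines. This is a minimal illustration only: the function name `knn_predict`, the use of Euclidean distance as the similarity measure and the toy examples are all assumptions made here, not details of the method as applied in the experiment.

```python
import math

def knn_predict(examples, d, k=3):
    """Predict a value for datum d by averaging over the k most similar examples.

    examples: list of (input_vector, prediction) pairs. 'Learning' amounts to
    nothing more than keeping this list in memory.
    """
    # Every prediction iterates over the entire example set (the cost noted above).
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], d))
    nearest = by_distance[:k]
    return sum(pred for _, pred in nearest) / len(nearest)

# Toy, made-up examples: two inputs per datum, one numeric prediction.
examples = [((0.0, 0.0), 10.0), ((0.1, 0.0), 12.0),
            ((1.0, 1.0), 50.0), ((0.9, 1.0), 54.0)]
estimate = knn_predict(examples, (0.05, 0.0), k=2)
```

Here `estimate` is the average of the predictions attached to the two stored examples closest to the query.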
Also originating in statistics is the method of predicting classification labels
using a linear-discriminant function (i.e., a linear hyperplane separating differently
classified examples in the data space) [10]. In this method, learning usually
involves some variation on the theme of least-mean-squared (LMS) regression.
Essentially, the method uses observations of misclassifications to incrementally
`nudge' a separating hyperplane into position. Over time, LMS regression has
come to be seen as a major foundation for classification methods applied to
numerical data [12].
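The incremental `nudging' of a hyperplane can be sketched with the classic Widrow-Hoff form of the LMS update. The function names, the learning rate, the number of epochs and the four toy examples are assumptions chosen for illustration; the text does not specify which LMS variant is intended.

```python
def lms_train(examples, rate=0.1, epochs=200):
    """Fit weights w and offset b so that sign(w.x + b) separates two classes.

    examples: list of (input_vector, target) pairs with target in {-1.0, +1.0}.
    """
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, target in examples:
            output = sum(wi * xi for wi, xi in zip(w, x)) + b
            error = target - output
            # LMS step: shift the hyperplane slightly so as to reduce the error.
            w = [wi + rate * error * xi for wi, xi in zip(w, x)]
            b += rate * error
    return w, b

def classify(w, b, x):
    """Report which side of the hyperplane the datum x falls on."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1.0

# Toy data: the class depends only on the first input.
data = [((0.0, 0.0), -1.0), ((0.0, 1.0), -1.0),
        ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
w, b = lms_train(data)
predictions = [classify(w, b, x) for x, _ in data]
```

On linearly separable data of this kind the hyperplane settles between the two classes after a modest number of passes.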
Data mining also has roots outside of statistics. The decision tree method,
for example, has a history which embraces Quinlan's ID3 method [13, 14], a program
developed within the machine learning (artificial intelligence) community.
Learning here involves producing a decision tree for predictions which effectively
minimises the number of tests which have to be made in order to generate a
prediction consistent with the training examples.
Lodged somewhere between the realm of machine learning and that of statistics
is the field of neural networks [15, 16]. This gained considerable territory
in the 1980s on the basis of the noted successes of the backpropagation method
of network training [17]. Backpropagation enables networks containing intermediate
(or hidden) units to be trained, paving the way for the derivation of
complex partitionings of the data via the introduction of superpositions of linear
boundaries.
is also a greater chance that the hypothesis will be inconsistent with unseen
cases. The likely result is that a hypothesis will be selected which produces
poor prediction. If the bias is low, there is a greater variance among consistent
hypotheses and therefore a greater chance of poor prediction and generalisation.
If the bias is high, there is lower variance but also the risk that no consistent
hypothesis will be discovered.
A useful construct for understanding this tradeoff is the VC dimension
(Vapnik-Chervonenkis dimension). This is an (inverse) measure of the bias
inherent in a particular hypothesis language. A dataset is said to be `shattered'
by a hypothesis language if the language can express every possible division of
the data into classes. Thus, the VC dimension of a hypothesis language is effectively
a measure of how large a dataset can get before the efficacy of the language
starts to drop below the theoretical maximum (see footnote 1). Higher VC dimension
implies lower bias and vice versa.
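The notion of shattering can be made concrete with a very small hypothesis language. The sketch below assumes, purely for illustration, a language of one-dimensional threshold rules of the form `x >= t`; it enumerates the labellings such rules can produce and checks whether all possible labellings are covered.

```python
from itertools import product

def threshold_labellings(points):
    """All 0/1 labellings of the points realisable by rules of the form 'x >= t'."""
    cuts = sorted(points)
    # One threshold below all points, one at each point, one above all points.
    thresholds = [cuts[0] - 1.0] + cuts + [cuts[-1] + 1.0]
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def is_shattered(points):
    """True if every possible 0/1 labelling of the points can be realised."""
    every_labelling = set(product((0, 1), repeat=len(points)))
    return threshold_labellings(points) == every_labelling

one_point = is_shattered([0.4])
two_points = is_shattered([0.4, 0.7])
```

A single point is shattered, but no pair of points is, since the smaller value can never be labelled 1 while the larger is labelled 0. Under these assumptions the language has a VC dimension of 1, i.e., a very strong bias.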
Using the concept of VC dimension, we can reformulate the bias-variance
issue in a clearer way. The aim in learning must be to find the lowest VC dimension
which will still yield successful performance. In other words, the bias
must be weak enough to admit at least one consistent hypothesis. But it must
be strong enough to constrain the number of poor-quality (i.e., poorly predicting)
hypotheses. In cases where a learner uses a fixed hypothesis language,
achieving this aim involves ensuring that the hypothesis language has the right
level of bias.
In some cases the strength of the bias is a modifiable property of the learning
method. A case in point is the feedforward neural network [19]. Here the bias
is partly a function of the number of hidden units in the network: with more
hidden units the network is able to define increasingly complex partitions. With
fewer hidden units, the network must capture the relevant regularities in terms
of simpler partitions. By varying the number of hidden units, the bias may be
set to any desired level.
But, again, there is the problem of calculating what the level should be. With
neural nets, a practical solution involves a procedure known as early-stopping
[20]. It is a natural consequence of the learning regime that true regularities
tend to be modelled before noise. So if learning is stopped at the moment when
cross-validation tests suggest that the network is starting to overfit the data,
the network's excess capacity (low bias) will never be utilised. The bias is then
effectively `strengthened' to an appropriate level.
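The early-stopping procedure can be sketched as a generic training loop. Everything named here is hypothetical: `train_one_epoch` and `validation_error` stand in for a real network's update and cross-validation routines, and the `patience` parameter is one common way of deciding that overfitting has begun, not necessarily the criterion used in [20].

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=1000, patience=5):
    """Run training epochs until the validation error stops improving.

    train_one_epoch: callable performing one pass of weight updates.
    validation_error: callable returning the current error on held-out data.
    Returns (best_epoch, best_error).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: overfitting suspected
    return best_epoch, best_error

# Stand-in for a real network: validation error falls until epoch 20, then rises.
state = {"epoch": 0}

def fake_epoch():
    state["epoch"] += 1

def fake_validation_error():
    return 0.1 + abs(state["epoch"] - 20) * 0.01

best_epoch, best_error = train_with_early_stopping(fake_epoch, fake_validation_error)
```

A practical version would also save the network weights at the best epoch and restore them once training halts.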
3.1 Support Vector Machines
The technique of early-stopping makes bias-setting a semi-automatic part of the
learning. But, as Vapnik has shown, bias-setting can be made a fully automatic,
i.e., a seamless part of the learning process. Vapnik's solution involves a learning
method called support vector machines [21, 22]. The novel feature of these is
the way in which they divide the learning problem into two subproblems: (a) the
problem of finding a way of re-representing the data in a higher-dimensional
space and (b) the problem of finding a simple partitioning of the data in that
space.
1 This is like rating a car's power in terms of the maximum speed at which the engine still
yields maximum energy efficiency.
SVMs are provided with a kernel function. This implicitly maps the examples
into a high-dimensional feature space. (In fact, kernel functions map
pairs of examples to their inner products in the feature space; but this
less costly operation is sufficient for application of the required optimisation
techniques.) SVMs then use quadratic programming (QP) techniques to discover a
subset of the examples (the so-called `support vectors') which are separated
by a hyperplane in the feature space. In effect, the SVM uses QP methods to
discover a representation of the data which allows the desired partition to be
formed as a simple hyperplane.
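The parenthetical point about kernels can be checked numerically. The sketch below assumes a homogeneous polynomial kernel of degree 2 on two-dimensional inputs (the degree and the explicit feature map `phi` are illustrative choices, not details of the experiment) and shows that the kernel value equals the inner product of the explicitly mapped examples.

```python
import math

def poly_kernel(x, y, degree=2):
    """Homogeneous polynomial kernel: (x . y) ** degree."""
    return sum(a * b for a, b in zip(x, y)) ** degree

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (3-D feature space)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (0.3, 0.8), (0.5, 0.1)
implicit = poly_kernel(x, y)                             # never leaves the input space
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))    # inner product in feature space
```

The two quantities agree up to rounding error, which is why the optimisation can work with kernel values alone and never needs to construct the feature space.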
An attractive aspect of the SVM is the way in which it automatically handles
the bias-variance tradeoff. In essence, the SVM seeks a solution embodying the
highest bias by maximising the support margin, i.e., the distance between the
optimal separating hyperplane and its closest support vectors.
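In the standard hard-margin formulation given in the SVM literature [21, 22], this margin maximisation is written as a constrained optimisation. The notation used here (weight vector w, offset b, feature map \phi and class labels y_i in {-1, +1}) is introduced for illustration and does not appear elsewhere in this paper.

```latex
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \bigl( \mathbf{w} \cdot \phi(\mathbf{x}_i) + b \bigr) \ge 1
\quad \text{for all } i,
\qquad
\text{margin} = \frac{1}{\lVert \mathbf{w} \rVert}
```

Minimising the norm of w is thus equivalent to maximising the distance between the hyperplane and its closest examples, which are the support vectors.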
In effect, then, the SVM carries out three tasks. First, there is the
re-representational task in which the data are mapped into a higher-dimensional
space. Second, there is the task of identifying a separating hyperplane in the
feature space. Finally, there is the task of maximising the size of the margin. The
first two tasks we might view as `learning' in the traditional sense. The third we
might view as bias maximisation (i.e., the discovery of a solution with optimal
predictive performance). SVMs effectively wrap all three tasks into a single
optimisation task and then solve it using quadratic programming techniques.
By making the number of free parameters used dependent on the size of the
margin associated with the separating hyperplane, SVMs effectively introduce
a feedback loop which ensures that decreases in bias are only introduced as
necessary.
SVMs have a firm theoretical foundation in Vapnik's statistical learning
theory and, in addition, automatically handle the bias-selection problem. It is
sometimes argued that, in practice, other approaches such as neural networks
will be preferable on grounds of prediction performance or training cost. In the
experiment described below, however, the results show SVM to be one of the
strongest performers. In this experiment, then, there is a degree of support for
the thesis that the automatic bias selection facet of the SVM yields important,
practical advantages.
conduit) and the signal from this is used to drive an indicator via an electrical
coil.
The experiment aimed to discover whether it would be possible to implement
this sensing task in an indirect, inductive manner, using ambient sound energy
produced by the engine. (This is considerable in the case of the Cessna 152.) A
dataset was constructed by sampling the ambient sound energy while the engine
was subjected to a typical pattern of usage including engine warm-up, take-off
and circuit flying. Training examples were then constructed by associating
small sequences of sound data with observed temperatures. These were then
formulated as a dataset of training examples and presented to a range of data-
mining methods.
5 Data collection
Figure 2: Cessna 152 instrument panel.
dataset into a training set of 160 examples and a testing set of 80 examples. The
examples were randomly selected in each case but with the restriction that there
should be no overlap between the two sets. Prediction accuracies derived when
the data-mining methods were tested on the testing data were then treated as
measures of the degree to which signal processing of ambient sound energy could
be used to measure engine temperature.
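The random, non-overlapping division into training and testing sets can be sketched as follows. The function name `split_dataset` and the placeholder examples are hypothetical, and the total of 240 examples is an assumption obtained by adding the two set sizes stated above.

```python
import random

def split_dataset(examples, n_train=160, n_test=80, seed=None):
    """Randomly divide examples into disjoint training and testing sets."""
    rng = random.Random(seed)
    # Sampling every example without replacement gives a random ordering,
    # so the two slices below cannot share an example.
    shuffled = rng.sample(examples, len(examples))
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Placeholder examples: (example id, temperature label not filled in here).
dataset = [(i, None) for i in range(240)]
train, test = split_dataset(dataset, seed=0)
```

Calling the function with a different seed on each run yields a fresh random division of the same data.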
Figure 3: Typical MLP network following training on temperature data.
resulted from training is shown in Figure 3. Here positive values are represented
by red colours, negative values by blue. The intensity of the colour represents
the absolute magnitude of the value in all cases. Lines represent connections
and circles represent units in the usual way, while unit biases are represented
by small half-ovals protruding from the right edge of each unit's circle.
7 Results
Method       C4.5    kNN     MLP     SVM     NBC
Error rate   0.274   0.169   0.174   0.219   0.970
Accuracy     73%     83%     83%     82%     3%
The results of the experiment are shown in tabular form above and as a
histogram in Figure 4. The figures shown are all averages taken over 100 test
runs using randomly selected training/testing sets.
Clearly, the best results were produced by kNN, the MLP and by SVM,
which all achieved a prediction accuracy of at least 82%. (A typical pattern of
error reduction achieved by the MLP on these data is shown in Figure 5.) The
performance produced by the C4.5 method is slightly below this figure while the
performance of the NBC method is extremely poor, suggesting some extreme
form of mismatch to the data.
Figure 4: Error rates for temperature generalisation.
The Naive Bayes Classifier works by determining the frequencies with which
particular input values are associated with particular output values within the
training examples. Predictions for a specific datum are then generated by determining
the most probable output for that datum, treating the frequencies
associated with the datum's input values as probabilities. (This involves application
of Bayes' theorem for calculating a posteriori from a priori probabilities.)
In normal cases this procedure works well. However, the assumption is made
that all input values seen in a test datum will also be frequently represented
within the training data (this is, of course, a key element of the iid assumption).
If they are not, there will be no relevant frequencies from which
to derive class probabilities and therefore no basis upon which to generate predictions.
This is precisely the problem which occurs in the temperature-sensing
experiment. Input values appearing in the data are real numbers in the range 0.0-1.0
representing mean amplitude values. The vast majority of these are unique
within the data. Thus, the NBC is put in the position of having to generate
predictions on the basis of irrelevant frequencies.
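The failure mode can be reproduced with a frequency-table classifier of the kind described. This is a simplified sketch: the function names, the `low`/`high` labels and the three training examples are invented for illustration and are not taken from the experimental dataset.

```python
from collections import Counter, defaultdict

def train_counts(examples):
    """Count how often each (position, input value) pair occurs with each label."""
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)
    for inputs, label in examples:
        for position, value in enumerate(inputs):
            value_counts[(position, value)][label] += 1
    return label_counts, value_counts

def score(label_counts, value_counts, inputs, label):
    """Unnormalised naive Bayes score: P(label) times each P(value | label)."""
    total = sum(label_counts.values())
    result = label_counts[label] / total
    for position, value in enumerate(inputs):
        result *= value_counts[(position, value)][label] / label_counts[label]
    return result

# Invented mean-amplitude inputs in the range 0.0-1.0.
training = [((0.412, 0.388), "low"), ((0.537, 0.541), "high"),
            ((0.409, 0.395), "low")]
label_counts, value_counts = train_counts(training)
seen = score(label_counts, value_counts, (0.412, 0.388), "low")
unseen = score(label_counts, value_counts, (0.411, 0.390), "low")
```

The second query differs from a training example only in the third decimal place, yet its score is zero for every label because none of its exact values has been observed.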
Figure 5: Typical error curves for MLP training.
The presence of real-valued input values is also the likely explanation for
the below-par performance of C4.5. This method relies on the recursive derivation
of a minimal decision tree for predictions. In the case of symbolic input
variables, decision tree branches are generated by `splitting' the relevant data
upon the possible values of the variable. However, with real-valued data, the
method's only recourse is to explore possible threshold values, i.e., possible ways
of splitting the range of observed values into two or more ranges. In the implementation
of C4.5 utilised for this experiment, binary splits were sought for
any real-valued variables. The resulting constraint is that effective performance
may only be achieved in the case where salient data groupings can be achieved
through the introduction of binary divisions in each variable (or dimension) of
the data space. A reasonable conclusion in the present case is that this requirement
is not satisfied by the training data.
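The search for a binary threshold on a real-valued variable can be sketched as below. The sketch scores candidate thresholds by weighted entropy, which is a simplification of the gain-ratio criterion used by C4.5 itself, and it considers a single split on a single variable; the function names and toy values are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_entropy) for the best 'value <= threshold' test."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        cost = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if cost < best[1]:
            best = (threshold, cost)
    return best

values = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
separable = best_binary_split(values, ["a", "a", "a", "b", "b", "b"])
banded = best_binary_split(values, ["a", "a", "b", "b", "a", "a"])
```

When the classes occupy two contiguous ranges, one threshold separates them perfectly. When one class occupies a band in the middle of the range, no single threshold does so, and the best available split leaves considerable impurity.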
With respect to the three top-performing data-mining methods (MLP, kNN
and SVM), firm conclusions are less easily drawn. As a general rule, we expect
data-mining methods to produce similar levels of performance on the same
dataset. In the case where they do not, the explanation is generally related to the
inappropriateness of the bias introduced, i.e., the degree to which the particular
tendencies of the data-mining method are `out of tune' with the characteristics
of the data.
But though we may be able to explain the differences between the three
top-performing methods in terms of bias, there is still the problem of explaining
why none of the methods produces perfect performance. This is a key issue if
the work is to achieve any real applications potential. But here we run into the
fundamental problem in this area, namely the lack of any generic theory of data
mining. Data-mining methods attempt to produce a `solution' to a `problem'.
But the solution is not defined analytically. In fact, given the present state of
theoretical understanding, it is but vaguely characterised.
However, we can begin to make some guesses about where the problems
might lie. There appear to be several possible explanations for the sub-optimal
levels of prediction performance among these three methods. First, there is the
possibility that the methods were applied in an inappropriate way. Where a
method involves the setting of user parameters this is an ever-present risk. This
applies to kNN but particularly to SVM and MLP. In the case of the latter
method, the user must choose the number of hidden units to use, the activation
function, the learning rate and momentum as well as the number of epochs over
which training should be applied. Although the MLP method is surprisingly
robust, it is clear that the selection of an inappropriate value for one of these
variables might well affect the performance in an adverse manner.
Another possible explanation relates to the nature of the data-acquisition
process. The rather crude way in which audio data have been presented in the
form of training examples may have worked to conceal the regularities that
would enable accurate prediction. Although the original source for the training
data was a high-resolution audio file, for purposes of representing the data as a
training set, the data were reformulated in the form of mean-amplitude values
averaged over a one second interval. Thus, the information available to the
data-mining methods was in fact a low resolution representation of the underlying
sound source. It is possible that this reduction in resolution had an adverse
effect on the ultimate performance of all the data-mining methods.
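The reduction in resolution can be illustrated with a short sketch of the averaging step. The function name, the use of absolute sample values and the toy sample rate are assumptions; the text states only that mean-amplitude values were taken over one-second intervals.

```python
def mean_amplitudes(samples, sample_rate):
    """Collapse raw audio samples into one mean absolute amplitude per second."""
    features = []
    # Step through the signal one whole second at a time, ignoring any remainder.
    for start in range(0, len(samples) - sample_rate + 1, sample_rate):
        window = samples[start:start + sample_rate]
        features.append(sum(abs(s) for s in window) / sample_rate)
    return features

# Toy signal: three 'seconds' at a hypothetical rate of 4 samples per second.
signal = [0.5, -0.5, 0.5, -0.5,
          0.25, -0.25, 0.25, -0.25,
          1.0, -1.0, 1.0, -1.0]
features = mean_amplitudes(signal, sample_rate=4)
```

Each second of audio, which in a high-resolution recording comprises many thousands of samples, is reduced to a single number, so any structure within the interval is lost.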
Another, more interesting possibility is that there is a fundamental mismatch
between the biases of the data-mining methods used and the significant
characteristics of the data. Recall that in this experiment, the aim is to infer
changes in an underlying target (engine temperature) from the signature that
those changes produce in an ambient energy array (engine noise). In effect, the
aim is to use signal processing of the ambient data to obtain a virtual sensor
for the underlying target.
As noted, the fundamental problem with this idea is the fact that the way in
which the target impacts the ambient data will be very complex. Because many
different properties will contribute to changes in the same ambient variables,
there will be high levels of interference between different signatures and the
correspondences between variables of the target and variables of the ambient
data will be 1-to-many. We should therefore not expect to see any simple correspondences
between absolute values of ambient variables and absolute values of
the underlying variables. At best, the signatures to be processed will comprise
relational (i.e., non-absolute) patterns. If the indirect-sensing approach is to
succeed, then, the data-mining methods utilised must have the ability to detect
and exploit patterns of relationships.
This observation may point towards the beginnings of a more theoretical
explanation for the below-par performance of the data-mining methods used
on this problem. The degree to which these characteristically empirical data-mining
methods are able to exploit relationships is a matter of debate. However,
as has been argued elsewhere [23], it is rather clear that
simpler approaches such as nearest-neighbours and the decision-tree method are
not sensitive in any degree to relational properties of the data. They are solely
sensitive to absolute properties.
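The parity problem [25] gives a compact illustration of this insensitivity, since the parity of a bit string depends on a relationship among its inputs and not on the absolute value of any one of them. The sketch below is an illustration only, assuming a 1-nearest-neighbour learner with Hamming distance and leave-one-out testing over all 4-bit strings; it is not part of the experiment.

```python
from itertools import product

def parity(bits):
    """1 if the number of set bits is odd, otherwise 0."""
    return sum(bits) % 2

def hamming(a, b):
    """Number of positions at which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def leave_one_out_accuracy(n_bits):
    """Predict each case's parity from its nearest neighbour among all other cases."""
    cases = list(product((0, 1), repeat=n_bits))
    correct = 0
    for case in cases:
        others = [c for c in cases if c != case]
        nearest = min(others, key=lambda c: hamming(c, case))
        correct += parity(nearest) == parity(case)
    return correct / len(cases)

accuracy = leave_one_out_accuracy(4)
```

Every nearest neighbour differs from the test case in exactly one bit and therefore has the opposite parity, so the method is wrong on every case even though it has seen all the other examples.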
With respect to the neural network method (MLP), the judgement on this
issue is less certain. Early advocates of the method tended towards the view
that the MLP method is capable of exploiting relational properties of the data,
with its well-known capabilities with respect to the XOR problem being offered
as a `proof' [19]. However, this demonstration is open to re-interpretation [24]
and a conservative judgement on this issue would now rate the abilities of the
MLP with respect to relational problems as still under evaluation [25].
Similar remarks might be made with respect to the SVM, at least in the
configuration used for this experiment (i.e., using the standard polynomial kernel
function). However, the fact that the performance achieved by the SVM on
these data is close to that achieved by the ineluctably non-relational nearest-neighbours
method suggests that the performance deficit shown by both methods
may well be put down to the same cause.
More work is clearly needed to bring clarity to these issues. The long-term
aim must be to characterise more accurately the properties of the
data signatures which are key in this application. Theoretical development in
the area of computational learning theory may well start to fill in the gaps
with respect to our understanding of the interaction between bias and data.
With this information to hand, better judgements may be possible regarding
the bias-appropriateness of different data-mining methods for indirect-sensing
applications.
8 Concluding comments
The paper has explored the practicality of using indirect-sensing methodologies
for general-aviation instrumentation solutions. The approach described relies on
fusing information inherent in the bias characteristics of a data-mining method,
with exemplar-formatted data. Real-world data were collected and utilised in a
comparative study to identify which methods would produce best performance
on the data-mining aspect of the problem. The study showed that good results
were obtained using several different methods.
The lack of any firm theoretical framework for understanding bias/data interaction
was noted to be a significant problem for this investigation. It profoundly
limits the progress that can be made with respect to method selection
and/or customisation. Informal analysis suggests that the key aspect of bias in
this application is the ability to detect and exploit relational data effects, since
the 1-to-many character of the forces affecting ambient variables means that
absolute patterns are most unlikely to be of any significance.
While the performance levels obtained from the best-performers in this study
are inadequate for realistic applications work, they are nevertheless considerably
above the level of chance. Given the early stage of work, there is every reason
to hope that more research will yield practicable levels of performance in due
course.
References
[1] Gibson, P. and Power, C. (2000). Introductory Remote Sensing: Application
and Digital Image Processing. London: Routledge.
[2] Gibson, P. (2000). Introductory Remote Sensing: Principles and Concepts.
Routledge Publishers.
[3] Thornton, C. (2003). Indirect sensing through abstractive learning. Intelli-
gent Data Analysis, 7, No. 3/4.
[4] Michalski, R., Bratko, I. and Kubat, M. (1998). Machine Learning and
Data Mining: Methods and Applications. New York: Wiley.
[5] Platt, J. (2000). Human Factors and Flight Safety. Manchester: Airplan
Flight Equipment.
[6] Holsheimer, M. and Siebes, A. (1994). Data mining: the search for knowl-
edge in databases. Technical report CS-R9406, CWI.
[7] Weiss, S. and Indurkhya, N. (1998). Predictive Data Mining: A Practical
Guide. San Francisco: Morgan Kaufmann Publishers Inc.
[8] Edelstein, H. (1999). Introduction to data mining and knowledge discovery
(3rd ed). Potomac, MD: Two Crows Corp.
[9] Duda, R. and Hart, P. (1973). Pattern Classification and Scene Analysis.
New York: Wiley.
[10] Nilsson, N. (1990). The Mathematical Foundations of Learning Machines.
San Mateo, California: Morgan Kaufmann.
[11] Mitchell, T. (1997). Machine Learning. McGraw-Hill.
[12] Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence,
40 (pp. 185-234).
[13] Quinlan, J. (1983). Learning efficient classification procedures and their
application to chess end games. In R. Michalski, J. Carbonell and T.
Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach.
Palo Alto: Tioga.
[14] Quinlan, J. (1993). C4.5: Programs for Machine Learning. San Mateo,
California: Morgan Kaufmann.
[15] Rumelhart, D., McClelland, J. and the PDP Research Group, (Eds.) (1986).
Parallel Distributed Processing: Explorations in the Microstructures of
Cognition. Vols I and II. Cambridge, Mass.: MIT Press.
[16] Michie, D., Spiegelhalter, D. and Taylor, C. (Eds.) (1994). Machine Learning,
Neural and Statistical Classification. Ellis Horwood.
[17] Rumelhart, D., Hinton, G. and Williams, R. (1986). Learning representa-
tions by back-propagating errors. Nature, 323 (pp. 533-6).
[18] Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York:
Springer.
[19] Rumelhart, D., Hinton, G. and Williams, R. (1986). Learning internal rep-
resentations by error propagation. In D. Rumelhart, J. McClelland and
the PDP Research Group (Eds.), Parallel Distributed Processing: Explo-
rations in the Microstructures of Cognition. Vols I and II (pp. 318-362).
Cambridge, Mass.: MIT Press.
[20] Prechelt, L. (1996). Automatic early stopping using cross validation: quan-
tifying the criteria. Neural Networks, 9 (pp. 457-462). 3.
[21] Burges, C. (1998). A tutorial on support vector machines for pattern
recognition. Data Mining and Knowledge Discovery, 2, No. 2 (pp. 121-
167).
[22] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support
Vector Machines. Cambridge: Cambridge University Press.
[23] Thornton, C. (2000). Truth from Trash: How Learning Makes Sense. MIT
Press.
[24] Clark, A. and Thornton, C. (1997). Trading spaces: computation, representation
and the limits of uninformed learning. Behavioral and Brain
Sciences, 20 (pp. 57-90). Cambridge University Press.
[25] Thornton, C. (1996). Parity: the problem that won't go away. In G.
McCalla (Ed.), Proceedings of AI-96 (Toronto, Canada) (pp. 362-374).
Springer.