
DATA MINING USING NEURAL NETWORKS

INDEX

1. Data Mining
   Introduction
   What is Data Mining?
   Knowledge Discovery in Databases
   Other Related Areas
   Data Mining Techniques
2. Neural Networks
   2.1 Introduction
   2.2 Structure and Function of a Single Neuron
   2.3 A Neural Net
   2.4 Training the Neural Net
3. Neural Networks Based Data Mining
   3.1 Introduction
   3.2 Suitability of Neural Networks for Data Mining
   3.3 Challenges Involved
   3.4 Advantages
   3.5 Extraction Methods
   3.6 The TREPAN Algorithm
4. Conclusion
References
1. DATA MINING

Introduction
The past two decades have seen a dramatic increase in the amount of information,
or data, being stored in electronic format. This accumulation of data has taken place at
an explosive rate. It has been estimated that the amount of information in the world
doubles every 20 months, and that the size and number of databases are increasing even
faster. The increasing use of electronic data-gathering devices, such as point-of-sale
terminals and remote sensing devices, has contributed to this explosion of available data.
Effectively utilizing these massive volumes of data is becoming a major problem for
all enterprises.
Data storage became easier as large amounts of computing power became available
at low cost; with the cost of processing power and storage falling, holding data became
cheap. New machine learning methods for knowledge representation, based on logic
programming and similar formalisms, were also introduced in addition to traditional
statistical analysis of data. These new methods tend to be computationally intensive and
hence create a demand for yet more processing power.
It was recognized that information is at the heart of business operations and that
decision-makers could make use of the data stored to gain valuable insight into the
business. Database Management systems gave access to the data stored but this was
only a small part of what could be gained from the data. Traditional on-line transaction
processing (OLTP) systems are good at putting data into databases quickly, safely and
efficiently, but are not good at delivering meaningful analysis in return. Analyzing data
can provide further knowledge about a business by going beyond the data explicitly
stored to derive knowledge about the business. This is where Data Mining has obvious
benefits for any enterprise.




What is Data Mining?
Definition
Researchers William J. Frawley, Gregory Piatetsky-Shapiro and Christopher J.
Matheus have defined Data Mining as:
Data mining is the search for relationships and global patterns that exist in
large databases but are `hidden' among the vast amount of data, such as a
relationship between patient data and their medical diagnosis. These
relationships represent valuable knowledge about the database and the objects
in the database and, if the database is a faithful mirror, of the real world
registered by the database.
The analogy with the mining process is described as:
Data mining refers to "using a variety of techniques to identify nuggets of
information or decision-making knowledge in bodies of data, and extracting
these in such a way that they can be put to use in the areas such as decision
support, prediction, forecasting and estimation. The data is often voluminous,
but as it stands of low value as no direct use can be made of it; it is the hidden
information in the data that is useful", Clementine User Guide, a data mining
toolkit.

Explanation
Data mining, the extraction of hidden predictive information from large
databases, is a powerful new technology with great potential to help companies focus
on the most important information in their data warehouses. Data mining tools predict
future trends and behaviours, allowing businesses to make proactive, knowledge-driven
decisions. The automated, prospective analysis offered by data mining moves beyond the
analysis of past events provided by retrospective tools typical of decision support
systems. Data mining tools can answer business questions that traditionally were too
time consuming to resolve. They scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations.
The data mining process consists of three basic stages: exploration, model
building and pattern definition. Fig. 1.1 shows a simple data mining structure.















Fig. 1.1 Data Mining Structure (the figure relates data mining to predictive modeling,
outcome prediction, forecasting of trends and variations, affinities and associations,
conditional logic discovery, deviation detection, link analysis and forensic analysis)

Basically data mining is concerned with the analysis of data and the use of
software techniques for finding patterns and regularities in sets of data. It is the
computer which is responsible for finding the patterns by identifying the underlying
rules and features in the data. The idea is that it is possible to strike gold in unexpected
places, as the data mining software extracts patterns that were not previously discernible,
or were so obvious that no one had noticed them before.
Data mining analysis tends to work from the data up and the best techniques are
those developed with an orientation towards large volumes of data, making use of as
much of the collected data as possible to arrive at reliable conclusions and decisions.
The analysis process starts with a set of data and uses a methodology to develop an optimal
representation of the structure of the data, during which time knowledge is acquired.
Once knowledge has been acquired, it can be extended to larger sets of data, working
on the assumption that the larger data set has a structure similar to the sample data.
Again this is analogous to a mining operation where large amounts of low grade
materials are sifted through in order to find something of value.

Example
A home finance loan actually has an average life span of only 7 to 10 years, due
to prepayment. Prepayment means, the loan is paid off early, rather than at the end of,
say 25 years. People prepay loans when they refinance or when they sell their home.
The financial return that a home-finance institution derives from a loan depends on its life span.
Therefore it is necessary for the financial institutions to be able to predict the life spans
of their loans. Rule discovery techniques are used to accurately predict the aggregate
number of loan payments in a given quarter (or in a year), as a function of prevailing
interest rates, borrower characteristics, and account data. This information can be used
to fine-tune loan parameters such as interest rates, points and fees, in order to maximize
profits.
Knowledge Discovery in Databases (KDD)
KDD and Data Mining
The term Knowledge Discovery in Databases (KDD) was formalized in 1989 to refer
to the broad, high-level process of seeking knowledge from data. The term data mining
was coined soon after; this application technique is used to analyze data and present the
results to decision-makers.
Data mining is only one of the many steps involved in knowledge discovery in
databases. The KDD process tends to be highly iterative and interactive. Data mining
analysis tends to work up from the data and the best techniques are developed with an
orientation towards large volumes of data, making use of as much data as possible to
arrive at reliable conclusions and decisions. The analysis process starts with a set of
data, and uses a methodology to develop an optimal representation of the structure of
data, during which knowledge is acquired. Once knowledge is acquired, this can be
extended to large sets of data on the assumption that the large data set has a structure
similar to the sample data set.
Fayyad distinguishes between KDD and data mining by giving the following
definitions:
Knowledge discovery in databases is the process of identifying a valid,
potentially useful and ultimately understandable structure in data.
Data mining is a step in the KDD process concerned with the algorithmic means
by which patterns or structures are enumerated from the data under acceptable
computational efficiency limits.
The structures that are the outcome of the data mining process must meet certain
conditions so that these can be considered as knowledge. These conditions are: validity,
understandability, utility, novelty and interestingness.
Stages of KDD
The stages of KDD, starting with the raw data and finishing with the extracted
knowledge, are given below.
















Fig. 1.2 Stages of KDD (the figure shows the flow: Data -> Selection -> Target Data ->
Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data ->
Data Mining -> Patterns -> Interpretation & Evaluation -> Knowledge)

Selection: This stage is concerned with selecting or segmenting the data that are
relevant to some criteria. For example, for credit card customer profiling, we extract the
types of transactions for each type of customer, and we may not be interested in the
details of the shop where a transaction takes place.
Preprocessing: Preprocessing is the data cleaning stage, where unnecessary information
is removed. For example, it is unnecessary to note the sex of a patient when studying
pregnancy. This stage also reconfigures the data to ensure a consistent format, since the
data may have been gathered in inconsistent formats.
Transformation: The data is not merely transferred across, but transformed in order to
be suitable for the task of data mining. In this stage, the data is made usable and
navigable.
Data Mining: This stage is concerned with the extraction of patterns from the data.
Interpretation and Evaluation: The patterns obtained in the data mining stage are
converted into knowledge, which in turn, is used to support decision-making.

Other Related Areas
Data Mining has drawn on a number of other fields, some of which are listed
below.
Statistics
Statistics is a theory-rich approach to data analysis, which can generate results that are
overwhelming and difficult to interpret. Notwithstanding this, statistics is one of the
foundations on which data mining technology is built. Statistical analysis systems are
used by analysts to detect unusual patterns and to explain patterns using statistical models.
Statistics has an important role to play, and data mining will not replace such analyses;
rather, statistics can be applied to more directed analyses based on the results of data
mining.

Machine Learning
Machine learning is the automation of a learning process, and learning is tantamount to
the construction of rules based on observations. This is a broad field, which includes not
only learning from examples, but also reinforcement learning, learning with a teacher,
etc. A learning algorithm takes the data set and its accompanying information as input
and returns a statement, e.g. a concept representing the results of learning, as output.

Inductive Learning
Induction is the inference of information from data, and inductive learning is the model-
building process in which the environment, i.e. the database, is analyzed with a view to
finding patterns. Similar objects are grouped in classes and rules are formulated whereby
it is possible to predict the class of unseen objects. This process of classification identifies
classes such that each class has a unique pattern of values, which forms the class
description. The nature of the environment is dynamic, hence the model must be
adaptive, i.e. it should be able to learn. Inductive learning, where the system infers
knowledge itself by observing its environment, has two main strategies: Supervised
Learning and Unsupervised Learning.

Supervised Learning
This is learning from examples where a teacher helps the system construct a model by
defining classes and supplying examples of each class.

Unsupervised Learning
This is learning from observation and discovery.

Mathematical Programming
Most of the major data mining tasks can be equivalently formulated as problems in
mathematical programming for which efficient algorithms are available. It provides a
new insight into the problems of data mining.




Data Mining Techniques
Researchers identify two fundamental goals of data mining: prediction and
description. Prediction makes use of existing variables in the database in order to
predict unknown or future values of interest, and description focuses on finding patterns
describing the data and their subsequent presentation for user interpretation. The relative
emphasis on prediction and description differs with the underlying application and
technique.
There are several data mining techniques fulfilling these objectives. Some of
these are associations, classifications, sequential patterns and clustering.
Another approach to the study of data mining techniques is to classify them as
either user-guided (verification-driven) data mining or discovery-driven (automatic
discovery of rules). Most data mining techniques have elements of both models.

Data Mining Models
Verification Model:
The verification model takes a hypothesis from the user and tests its validity
against the data. The emphasis is with the user, who is responsible for formulating the
hypothesis and issuing the query on the data to affirm or negate the hypothesis.
In a marketing division, for example, with a limited budget for a mailing
campaign to launch a new product it is important to identify the section of the
population most likely to buy the new product. The user formulates a hypothesis to
identify potential customers and the characteristics they share. Historical data about
customer purchase and demographic information can then be queried to reveal
comparable purchases and the characteristics shared by those purchasers. The whole
operation can be repeated by successive refinements of the hypothesis until the required
limit is reached.
Discovery Model:
The discovery model differs in its emphasis in that it is the system automatically
discovering important information hidden in the data. The data is sifted in search of
frequently occurring patterns, trends and generalizations about the data without
intervention or guidance from the user. The discovery or data mining tools aim to reveal
a large number of facts about the data in as short a time as possible.
An example of such a model is a supermarket database, which is mined to
discover the particular groups of customers to target for a mailing campaign. The data is
searched with no hypothesis in mind other than for the system to group the customers
according to the common characteristics found.

Discovery Driven Tasks
The typical discovery driven tasks are:
Association rules:
An association rule is an expression of the form X => Y, where X and Y are
sets of items. The intuitive meaning of such a rule is that a transaction of the database
that contains X tends also to contain Y. Given a database, the goal is to discover all the
rules that have support and confidence greater than or equal to the minimum support
and confidence, respectively.
Let L = {l1, l2, ..., lm} be a set of literals, called items. Let the database D be
a set of transactions, where each transaction T is a set of items. T supports an item x if
x is in T. T is said to support a subset of items X if T supports each item x in X. X => Y
holds with confidence c if c% of the transactions in D that support X also support Y.
The rule X => Y has support s in the transaction set D if s% of the transactions in D
support X U Y. Support measures how often X and Y occur together as a percentage of the
total transactions. Confidence measures how much a particular item is dependent on
another. Patterns with a combination of intermediate values of confidence and support
provide the user with interesting and previously unknown information.
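As a concrete illustration, the following Python sketch computes support and confidence
for a candidate rule over a small, made-up transaction database (the transactions and the
rule X => Y are illustrative assumptions, not taken from the text):

    # Sketch: support and confidence of an association rule X => Y
    transactions = [
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"butter", "milk"},
        {"bread", "butter"},
        {"bread", "butter", "milk", "eggs"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions containing every item in itemset."""
        count = sum(1 for t in transactions if itemset <= t)
        return count / len(transactions)

    def confidence(x, y, transactions):
        """Fraction of transactions supporting X that also support Y."""
        return support(x | y, transactions) / support(x, transactions)

    x, y = {"bread"}, {"milk"}
    print("support(X U Y)   =", support(x | y, transactions))    # 0.6
    print("confidence(X=>Y) =", confidence(x, y, transactions))  # 0.75

A rule-discovery procedure would enumerate many such candidate itemsets and keep only
those rules whose support and confidence exceed the user-specified minima.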
Clustering:
Clustering is a method of grouping data into different groups, so that the data in
each group share similar trends and patterns. The algorithm automatically partitions the
data space into a set of regions or clusters, to which the examples in the tables are
assigned, either deterministically or probabilistically. The goal of the process
is to identify all sets of similar examples in the data, in some optimal fashion.
Clustering according to similarity is a concept which appears in many
disciplines. If a measure of similarity is available, then there are a number of techniques
for forming clusters. Another approach is to build set functions that measure some
particular property of groups. This latter approach achieves what is known as optimal
partitioning.
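A common concrete instance of clustering by similarity is the k-means procedure. The
short Python sketch below is a generic illustration (not an algorithm described in the
text); it partitions points into k clusters by repeatedly assigning each point to its nearest
centre and recomputing the centres:

    import random

    def k_means(points, k, iterations=20):
        """Very small k-means sketch for lists of numeric tuples."""
        centres = random.sample(points, k)
        for _ in range(iterations):
            # assign each point to the nearest centre
            clusters = [[] for _ in range(k)]
            for p in points:
                dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
                clusters[dists.index(min(dists))].append(p)
            # recompute each centre as the mean of its cluster
            for i, cluster in enumerate(clusters):
                if cluster:
                    centres[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
        return centres, clusters

    points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3), (0.9, 1.1)]
    centres, clusters = k_means(points, k=2)
    print(centres)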
Classification Rules:
Classification involves finding rules that partition the data into disjoint groups.
The input for the classification data set is the training data set, whose class labels are
already known. Classification analyses the training data set and constructs a model
based on the class label, and aims to assign class labels to future unlabelled records.
Since the class field is known, this type of classification is known as supervised
learning.
There are several classification discovery models. They are: the decision tree,
neural networks, genetic algorithms and some statistical models.
Frequent Episodes:
Frequent episodes are sequences of events that occur frequently and close to each
other, and are extracted from time sequences. How close the events must be in order to
count as frequent is domain dependent; this is given by the user as input, and the output
is a set of prediction rules for the time sequences.
Deviation Detection:
Deviation detection aims to identify outlying points in a particular data set and to
explain whether they are due to noise or other impurities in the data, or to other,
possibly trivial, reasons. It is usually applied in conjunction with database segmentation,
and is often the source of true discovery, since the outliers express deviation from some
previously known expectation and norm. The deviations can be obtained by calculating
measures on the current data and comparing them with previous data as well as with
normative data.




Data Mining Methods
Various data mining methods are:
Neural Networks
Genetic Algorithms
Rough Sets Techniques
Support Vector Machines
Cluster Analysis
Induction
OLAP
Data Visualization
2. NEURAL NETWORKS

Introduction
Anyone can see that the human brain is superior to a digital computer at many
tasks. A good example is the processing of visual information: a one-year-old baby is
much better and faster at recognizing objects, faces, and so on than even the most
advanced AI system running on the fastest supercomputer. The brain has many other
features that would be desirable in artificial systems.
This is the real motivation for studying neural computation. It is an alternative
paradigm to the usual one (based on a programmed instruction sequence), which was
introduced by von Neumann and has been used as the basis of almost all machine
computation to date. It is inspired by the knowledge from neuroscience, though it does
not try to be biologically realistic in detail.
Neural networks are an approach to computing that involves developing
mathematical structures with the ability to learn. The methods are the result of academic
investigations to model nervous system learning. Neural networks have the remarkable
ability to derive meaning from complicated or imprecise data and can be used to extract
patterns and detect trends that are too complex to be noticed by either humans or other
computer techniques. A trained neural network can be thought of as an "expert" in the
category of information it has been given to analyze. This expert can then be used to
provide projections given new situations of interest and answer "what if" questions.
Neural networks use a set of processing elements (or nodes) analogous to
neurons in the brain. These processing elements are interconnected in a network that
can then identify patterns in data once it is exposed to the data, i.e. the network learns
from experience just as people do. This distinguishes neural networks from traditional
computing programs, which simply follow instructions in a fixed sequential order.




Structure and Function of a Single Neuron
McCulloch and Pitts (in 1943) proposed a simple model of a neuron as a binary
threshold unit. Specifically, the model neuron computes a weighted sum of its inputs
from other units, and outputs a one or a zero according to whether this sum is above or
below a certain threshold.
The figure below shows the structure of a typical artificial neuron:







Fig. 2.1 Structure of a Single Neuron (weighted inputs pass through a linear combiner
and then a thresholded activation function to produce the output)

Explanation:
The neuron has a set of nodes that connects it to inputs, output, or other neurons, also
called synapses (connections / links). A Linear Combiner is a function that takes all
inputs and produces a single value. A simple way of doing it is by adding together the
weighted inputs. Thus, the linear combiner will produce: (wi1*x1 + wi2*x2 + ... + win*xn).
The Activation Function is a non-linear function, which takes any input from minus
infinity to plus infinity and squeezes it into the -1 to 1 or into 0 to 1 interval.
This simple model of a neuron makes the following assumptions:
1. The position on the neuron (node) of the incoming synapse (connection) is
irrelevant.
2. Each node has a single output value, distributed to other nodes via outgoing links,
irrespective of their positions.
3. All inputs come in at the same time or remain activated at the same level long
enough for computation to occur. (An alternative is to postulate the existence of
buffers to store weighted inputs inside nodes).
The threshold is calculated using the Heaviside function as shown below:
    ni(t+1) = Θ( Σj wij nj(t) - θi )
Here ni is either 1 or 0, and represents the state of neuron i as firing or not firing
respectively. Time t is taken as discrete, with one time unit elapsing per processing step.
Θ(x) is the unit step function, or Heaviside function:
    Θ(x) = 1, if x >= 0
         = 0, otherwise.
The weight wij represents the strength of the synapse connecting neuron j to neuron i. It
can be positive or negative, corresponding to an excitatory or inhibitory synapse
respectively, and is zero if there is no synapse between i and j. The cell-specific
parameter θi is the threshold value for unit i; the weighted sum of inputs must reach or
exceed this threshold for the neuron to fire.
A simple generalization of the above equation, which incorporates the activation
function, is:
    ni = g( Σj wij nj - θi )
The number ni is now continuous-valued and is called the state or activation of unit i. The
function g(x) is the activation function.
Rather than writing the times t and t+1 explicitly, we now simply give a rule for updating
ni whenever an update occurs. Units are often updated asynchronously, in random order, at
random times.
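The following short Python sketch (an illustrative toy with made-up weights, not an
example from the text) implements this model of a single unit, first with the Heaviside
step function and then with a sigmoid activation g:

    import math

    def heaviside(x):
        """Unit step function: 1 if x >= 0, else 0."""
        return 1 if x >= 0 else 0

    def sigmoid(x):
        """A common continuous activation function, squashing into (0, 1)."""
        return 1.0 / (1.0 + math.exp(-x))

    def unit_output(weights, inputs, threshold, g):
        """State of unit i: g( sum_j w_ij * n_j - theta_i )."""
        net = sum(w * x for w, x in zip(weights, inputs)) - threshold
        return g(net)

    weights = [0.5, -0.3, 0.8]      # made-up synaptic weights w_ij
    inputs = [1, 0, 1]              # states of the presynaptic units n_j
    print(unit_output(weights, inputs, threshold=0.2, g=heaviside))  # 1
    print(unit_output(weights, inputs, threshold=0.2, g=sigmoid))    # about 0.75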

A Neural Net
A single neuron is insufficient for many practical problems, and networks with a
large number of nodes are frequently used. The way the nodes are connected determines
how computation proceeds, and constitutes an important early design decision by a
neural network developer.

Fully Connected Networks
In this architecture, every node is connected to every node, and these
connections may be either excitatory (positive weights), inhibitory (negative weights),
or irrelevant (almost zero weights).











Fig 2.2 Fully Connected Asymmetric Network        Fig 2.3 Fully Connected Symmetric Network

In a fully connected asymmetric network, the connection from one node to
another may carry a different weight than the connection from the second node to the
first. In a symmetric network, the weight on the connection from one node to another is
equal to the weight on the reverse connection.
Hidden nodes are nodes whose interaction with the external environment is indirect.

Layered Networks
These are networks in which nodes are partitioned into subsets called layers,
with no connections that lead from layer j to layer k if j>k.
Each node of the input layer, or layer 0, receives a single input and distributes it
to other nodes; no other computation occurs at nodes in layer 0, and there are
no intra-layer connections among nodes in this layer. Connections with arbitrary
weights may exist from any node in layer i to any node in layer j for j >= i; intra-layer
connections may also exist.












Fig 2.4 A Layered Network (Layer 0 is the input layer, Layers 1 and 2 are hidden layers,
Layer 3 is the output layer)

Acyclic Networks
This is a subclass of layered networks in which there are no intra-layer
connections, as shown in fig. 2.5. A connection may exist between any node in layer
i and any node in layer j for i < j, but a connection is not allowed for i = j. Networks that
are not acyclic are referred to as recurrent networks.

















Fig 2.5 An Acyclic Network (Layer 0 is the input layer, Layers 1 and 2 are hidden layers,
Layer 3 is the output layer)

Feedforward Networks









Fig 2.6 A Feedforward 3-2-3-2 Network (Layer 0 is the input layer, Layers 1 and 2 are
hidden layers, Layer 3 is the output layer)

This is a subclass of acyclic networks in which a connection is allowed from a
node in layer i only to nodes in layer i+1 as shown in fig. 2.6. These networks are
succinctly described by a sequence of numbers indicating the number of nodes in each
layer.
These networks, generally with no more than 4 such layers, are among the most
common neural nets in use. Conceptually, nodes in successively higher layers abstract
successively higher-level features from preceding layers.

Modular Neural Networks
Many problems are best solved using neural networks whose architecture consists of
several modules, with sparse interconnections between modules. Modularity allows the
neural network developer to solve smaller tasks separately using small (neural network)
modules and then combine these modules in a logical manner. Modules can be
organized in several different ways, some of which are: hierarchical organization,
successive refinement and input modularity.

Training the Neural Net
In order to fit a particular artificial neural network (ANN) to a particular
problem, it must be trained to generate a correct response for a given set of inputs.
Unsupervised training may be used when a clear link between data sets and target
output values does not exist. Supervised training involves providing an ANN with
specified input and output values, and allowing it to iteratively reach a solution.

Perceptron Learning Rule
This is the first learning scheme of neural computing. The weights are changed
by an amount proportional to the difference between the desired output and the actual
output. If W is the weight vector and ΔWi is the change in the ith weight, then ΔWi is
proportional to a term which is the error times the input. A learning rate parameter η
decides the magnitude of the change; if the learning rate is high, the change in the
weight is bigger at every step.
    ΔWi = η (D - Y) Ii
where η is the learning rate, D is the desired output, Y is the actual output, and Ii is
the ith input.
If the classes are visualized geometrically in n-dimensional space, then the
perceptron generates descriptions of the classes in terms of a set of hyperplanes that
separate these classes. When the classes are actually not linearly separable, then the
perceptron (single layer) is not effective in properly classifying such cases.
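A minimal Python sketch of the perceptron learning rule on a made-up, linearly
separable data set (the AND-style data, learning rate and epoch count are illustrative
assumptions, not taken from the text):

    # Perceptron learning rule: delta_Wi = eta * (D - Y) * Ii
    def predict(weights, bias, inputs):
        s = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1 if s >= 0 else 0

    def train_perceptron(data, eta=0.1, epochs=50):
        n = len(data[0][0])
        weights, bias = [0.0] * n, 0.0
        for _ in range(epochs):
            for inputs, desired in data:
                error = desired - predict(weights, bias, inputs)
                weights = [w + eta * error * x for w, x in zip(weights, inputs)]
                bias += eta * error
        return weights, bias

    # toy, linearly separable data set (logical AND)
    data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
    weights, bias = train_perceptron(data)
    print([predict(weights, bias, x) for x, _ in data])   # [0, 0, 0, 1] once converged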

Training in Multi-Layer Perceptron
The MLP overcomes the above shortcoming of the single layer perceptron. The
idea is to carry out the computation layer-wise, moving in the forward direction.
Similarly, the weight adjustment can be done layer-wise, by moving in a backward
direction. For the nodes in the output layer, it is easy to compute the error, as we know
the actual outcome and the desired result. For the nodes in the hidden layers, since we
do not know the desired result, we propagate the error computed in the last layer
backward. This process gives a change in the weight for the edges layer-wise. This
standard method used in training MLPs is called the back propagation algorithm.
Formally, the training steps consist of:
Forward Pass: The outputs and the error at the output units are calculated.
Backward Pass: The output unit error is used to alter the weights on the output units. Then
the error at the hidden nodes is calculated, and the weights on the hidden nodes are altered
using these values.
For each training example, a forward pass and a backward pass are performed. This is
repeated over and over again, until the error is at an acceptably low level.
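To make the two passes concrete, the following Python/NumPy sketch trains a small
2-4-1 sigmoid MLP on the XOR problem with back propagation. The architecture,
learning rate and number of passes are illustrative choices, not prescriptions from the
text:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
    T = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs (XOR)

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden weights and biases
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights and biases
    eta = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for _ in range(20000):
        # forward pass: compute hidden and output activations layer-wise
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        # backward pass: error at the output units, then propagated to the hidden units
        delta_out = (Y - T) * Y * (1 - Y)
        delta_hid = (delta_out @ W2.T) * H * (1 - H)
        # adjust weights layer-wise in the direction that reduces the error
        W2 -= eta * H.T @ delta_out
        b2 -= eta * delta_out.sum(axis=0)
        W1 -= eta * X.T @ delta_hid
        b1 -= eta * delta_hid.sum(axis=0)

    # for most random initializations the outputs approach the XOR targets [0, 1, 1, 0]
    print(np.round(Y, 2))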

Training RBF networks
The design of an RBF network involves deciding on the centres and the sharpness
(standard deviations) of its Gaussians. Generally, the centres and SDs (standard
deviations) are decided by examining the vectors in the training data. RBF networks are
trained in a similar way to MLPs; the output layer weights are trained using the delta rule.

Competitive Learning
Competitive learning, or winner-takes-all learning, is regarded as the basis of a number of
unsupervised learning strategies. It consists of k units with weight vectors wk, of equal
dimension to the input data. During the learning process, the unit whose weight vector is
closest to the input vector x is adapted in such a way that the weight vector becomes even
closer to the input vector after adaptation. The unit with the closest vector is termed the
winner of the selection process. This learning strategy is generally implemented by
gradually reducing the difference between the weight vector and the input vector. The
actual amount of reduction at each learning step is governed by the learning rate, η.
During the learning process, the weight vectors converge towards the mean of the set of
input data.
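A minimal sketch of the winner-takes-all update step in Python (the number of units,
learning rate and random data are illustrative assumptions):

    import numpy as np

    def competitive_update(weights, x, eta=0.1):
        """One winner-takes-all step: move the closest weight vector towards x."""
        distances = np.linalg.norm(weights - x, axis=1)
        winner = int(np.argmin(distances))
        weights[winner] += eta * (x - weights[winner])
        return winner

    rng = np.random.default_rng(0)
    weights = rng.random((3, 2))      # k = 3 units, 2-dimensional inputs
    data = rng.random((100, 2))
    for x in data:                     # repeated presentations of the input vectors
        competitive_update(weights, x)
    print(weights)                     # weight vectors drift towards the input clusters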

Kohonen's SOM
The self-organising map (SOM) is a neural network model developed by
Teuvo Kohonen during 1979-82. SOM is one of the most widely used unsupervised NN
models and employs competitive learning. It consists of a layer of input units, each of
which is fully connected to a set of output units. These output units are arranged in some
topology, the most common choice being a 2-D grid. The input units, after receiving an
input pattern X, propagate it as it is onto the output units. Each of the output
units k is assigned a weight vector wk. During the learning step, the unit c
corresponding to the highest activity level w.r.t. a randomly selected input pattern X, is
adapted in such a way that it exhibits an even higher activity level at a future
presentation of X. This is accomplished by competitive learning.
The similarity metric is chosen to be the Euclidean distance. During the learning
steps of SOM, a set of units around the winner is tuned towards the currently presented
input pattern enabling a spatial arrangement of the input patterns, such that similar
inputs are mapped onto regions close to each other in the grid of output units. Thus, the
training process of SOM results in a topological organization of the input patterns.
Thus, SOM takes a high-dimensional input and clusters it, while still retaining some
topological ordering of the output. After training, an input will cause some of the output
units in some area to become active. Such clustering (and dimensionality reduction) is
very useful as a preprocessing stage, whether for further neural network data processing
or for other techniques.
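The following Python/NumPy sketch shows one plausible form of the SOM training step
described above; the grid size, learning rate and Gaussian neighbourhood are illustrative
choices, not details given in the text:

    import numpy as np

    rng = np.random.default_rng(0)
    grid_h, grid_w, dim = 5, 5, 3
    weights = rng.random((grid_h, grid_w, dim))   # one weight vector w_k per output unit
    grid_coords = np.array([[(r, c) for c in range(grid_w)]
                            for r in range(grid_h)], dtype=float)

    def som_step(x, eta=0.3, radius=1.5):
        # winner = unit whose weight vector is closest (Euclidean distance) to x
        dists = np.linalg.norm(weights - x, axis=2)
        r, c = np.unravel_index(np.argmin(dists), dists.shape)
        # tune the winner and its grid neighbours towards the presented input pattern
        grid_dist = np.linalg.norm(grid_coords - np.array([r, c], dtype=float), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        weights[:] += eta * influence[..., None] * (x - weights)

    data = rng.random((200, dim))                  # high-dimensional input patterns X
    for x in data:
        som_step(x)

Because neighbouring grid units are pulled towards similar inputs, similar patterns end up
mapped onto nearby regions of the output grid, which is the topological ordering described
above.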















Fig. 2.7 SOM Architecture (a high-dimensional input X is fully connected to a 2-D array
of output units, each with a weight vector wk)
3. NEURAL NETWORK BASED DATA MINING

Introduction
There is no general theory that specifies the type of neural network, number of
layers, number of nodes (at various layers), or learning algorithm for a given problem. As
such, today's network builder must experiment with a large number of neural networks
before converging upon the appropriate one for the problem at hand.

The suitability of Neural Networks for Data Mining
For some problems, neural networks provide a more suitable inductive bias for
data mining than competing algorithms.
Inductive Bias: Given a fixed set of training examples, there are infinitely many
models that could account for the data, and every learning algorithm has an inductive bias
that determines the model that it is likely to return. There are two aspects to the
inductive bias of an algorithm: the restricted hypothesis bias and the preference bias.
The hypothesis bias refers to the constraints that the algorithm places on the hypotheses
that it is to construct. For example, the hypothesis space of a perceptron is limited to the
linear discriminant functions. The preference bias of an algorithm is the preference
ordering that it places on the models within its hypothesis space. For example, most
algorithms try to fit a simple hypothesis to a given training set and then progressively
explore more complex hypotheses until they find an acceptable fit.
In some cases, neural networks have a more restricted hypothesis-space bias
than other learning algorithms. For example, sequential and temporal prediction
tasks represent a class of problems for which neural networks provide a more
appropriate hypothesis space. Recurrent networks, which are applied to most of these
problems, are able to maintain state information from one time step to the next. This
means that recurrent networks use their hidden units to learn derived features relevant
to the task at hand, and they can make use of this derived information at one instant
to help make predictions for the next instant.
Although neural networks have an appropriate inductive bias for a wide class of
problems, they are not commonly used for data mining tasks. There are two reasons:
trained neural networks are usually not comprehensible and many neural network learning
methods are slow, making them impractical for very large data sets.

Challenges Involved
The hypothesis represented by a trained neural network is defined by:
(a) The topology of the network
(b) The transfer functions used for hidden and output units and
(c) The real-valued parameters associated with the network connections (i.e., the
weights) and the units (i.e., the biases of sigmoid units).
Such hypotheses are difficult to comprehend for several reasons. First, typical
systems have hundreds or thousands of real-valued parameters. These parameters encode
the relationships between the input features and target values. Although single-parameter
encodings are usually not hard to understand, the sheer number of parameters in a typical
network can make the task of understanding them quite difficult. Second, in multi-layer
networks, these parameters may represent non-linear, non-monotonic relationships
between the input features and the target values. Thus, it is usually not possible to
determine, in isolation, the effect of a given feature on the target value, because this effect
may be mediated by the values of other features.
These non-linear, non-monotonic relationships are represented by hidden units,
which combine the inputs of multiple features, thus allowing the model to take advantage
of dependencies among the features. Hidden units can be thought of, as representing
higher-level, derived features. Understanding of hidden units is usually difficult because
they learn distributed representations. In a distributed representation, the individual units
do not correspond to well understood features in a problem domain. Instead, features,
which are meaningful in the context of the problem domain, are often encoded by patterns
of activation across many hidden units. Similarly, each hidden unit may play a part in
representing many derived features.
Consider the issue of the learning time required for neural networks. The process of
learning in most neural network methods involves using some gradient-based
optimization method to adjust the network's parameters. Such optimization iteratively
executes two basic steps: calculating the gradient of the error function (with respect to the
network's adjustable parameters) and adjusting the network's parameters in the direction
suggested by the gradient. Learning can be quite slow in such methods, because the
optimization may involve a large number of small steps, and the cost of calculating the
gradient at each step may be quite high.

Advantages
One appealing aspect of many neural-network learning methods is that they are
online algorithms, meaning that they update their hypothesis after every example is
presented. Because they update their parameters frequently, online neural-network
learning algorithms can converge much faster than batch algorithms. This is especially the
case for large data sets: often a reasonably good solution can be found in one pass through
a large training set. For this reason, we argue that the training-time performance of neural-
network algorithms may often be acceptable for data mining tasks, especially given the
availability of high-performance desktop computers.

Extraction Methods
One approach to understanding the hypothesis represented by a trained neural
network is to translate the hypotheses into a more comprehensible language. Various
strategies using this approach have been investigated under the rubric of rule extraction.
Some keywords:
Representation Language: It is the language used by the extraction method to describe
the neural network's learned model. The languages that have been used include
conjunctive inference rules, fuzzy rules, m-of-n rules, decision trees and finite state
automata.
Extraction Strategy: It is the strategy used by the extraction method to map the model
represented by the trained neural network into a model in the new representation
language; specifically, how the method explores the candidate descriptions and what level
of description it uses to characterize the neural network. That is, do the rules extracted by
the method describe the behaviour of the network as a whole (global methods), the
behaviour of individual units in the network (local methods), or something in between
these two cases?
Network Requirements: The architectural and training requirements that the extraction
method imposes on neural networks. In other words, the range of networks to which the
method is applicable.

Rule Extraction Task
Consider the following example:

Fig. 3.1 A Simple Network (a single Boolean output unit y connected to five Boolean
inputs x1, x2, x3, x4 and x5 with weights 6, 4, 4, 0 and -4 respectively)


Fig 3.1 illustrates the task of rule extraction with a simple network. This one-
layer network has five Boolean inputs and one Boolean output. The extracted symbolic
rules specify conditions on the input features that, when satisfied, guarantee a particular
output state.
The output unit uses a threshold function to compute its activation:
    a_y = 1, if ( Σi wi ai + θ ) > 0
        = 0, otherwise
where the bias θ = -9.

The extracted rules are:
    y <- x1 ∧ x2 ∧ x3
    y <- x1 ∧ x2 ∧ ¬x5
    y <- x1 ∧ x3 ∧ ¬x5
Whenever a neural network is used for a classification problem, there is always
an implicit decision procedure that is used to decide which class is predicted by the given
network. In the simple example above, the decision procedure is simply to predict y =
true when the activation of the output unit is 1 and y = false when it is 0.
In general, an extracted rule gives a set of conditions under which the network,
coupled with its decision procedure, predicts a given class.
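As a sanity check (not part of the original text), the following Python sketch enumerates
all 32 input combinations for the network of Fig. 3.1 and confirms that the three extracted
rules agree exactly with the network's predictions, using the weights and bias given above:

    from itertools import product

    weights = [6, 4, 4, 0, -4]
    theta = -9                        # bias of the output unit

    def network(x):
        """Network plus decision procedure: predict true iff sum(w_i a_i) + theta > 0."""
        return sum(w * a for w, a in zip(weights, x)) + theta > 0

    def rules(x):
        """Disjunction of the three extracted rules."""
        x1, x2, x3, x4, x5 = x
        return (x1 and x2 and x3) or (x1 and x2 and not x5) or (x1 and x3 and not x5)

    # the rules should fire exactly when the network predicts the positive class
    print(all(network(x) == bool(rules(x)) for x in product([0, 1], repeat=5)))  # True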
One of the dimensions along which the rule-extraction methods can be
characterized is their level of description. One approach is to extract a set of global rules
that characterize the output classes directly in terms of the inputs. An alternative approach
is to extract a set of local rules, by decomposing the multilayer network into a
collection of single-layer networks.

3.6 The TREPAN Algorithm
3.6.1 Introduction
The Trepan algorithm is used for extracting comprehensible, symbolic
representations from trained neural networks. The algorithm uses queries to induce a
decision tree that approximates the concept represented by a given network. Experiments
demonstrate that Trepan is able to produce decision trees that maintain a high level of
fidelity to their respective networks while being comprehensible and accurate. Unlike
previous work in this area, the algorithm is general in its applicability and scales well to
large networks and problems with high-dimensional input spaces.




3.6.2 Extracting Decision Trees
Our approach views the task of extracting a comprehensible concept
description from a trained network as an inductive learning problem. In this learning task,
the target concept is the function represented by the network, and the concept description
produced by our learning algorithm is a decision tree that approximates the network.
However, unlike most inductive learning problems, we have available an oracle that is
able to answer queries during the learning process. Since the target function is simply the
concept represented by the network, the oracle uses the network to answer queries. The
advantage of learning with queries, as opposed to ordinary training examples, is that they
can be used to garner information precisely where it is needed during the learning process.

Membership Queries and The Oracle: The role of the oracle is to determine the class
(as predicted by the network) of each instance that is presented as a query. Queries to the
oracle, however, do not have to be complete instances, but instead can specify constraints
on the values that the features can take. In the latter case, the oracle generates a complete
instance by randomly selecting values for each feature, while ensuring that the constraints
are satisfied. In order to generate these random values, Trepan uses the training data to
model each feature's marginal distribution. Trepan uses frequency counts to model the
distributions of discrete-valued features, and a kernel density estimation method
(Silverman, 1986) to model continuous features. The oracle is used for three different
purposes: (i) to determine the class labels for the network's training examples; (ii) to
select splits for each of the tree's internal nodes; and (iii) to determine if a node covers
instances of only one class.
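To make the sampling idea concrete, here is a rough Python sketch of how an oracle of
this kind might draw a complete instance under constraints by sampling each
unconstrained discrete feature from its observed values. This approximates the
frequency-count model for discrete features; the constraint format, names and the
kernel-density handling of continuous features are simplifications and assumptions, not
Trepan's actual implementation:

    import random

    def make_oracle(network_predict, training_instances):
        """training_instances: list of dicts mapping feature name -> value."""
        features = list(training_instances[0].keys())

        def oracle(constraints=None):
            constraints = constraints or {}        # e.g. {"colour": "red"} (hypothetical)
            instance = {}
            for f in features:
                if f in constraints:
                    instance[f] = constraints[f]   # respect the constraint
                else:
                    # draw from the feature's empirical marginal distribution
                    instance[f] = random.choice([t[f] for t in training_instances])
            return instance, network_predict(instance)

        return oracle

    # usage sketch (network_predict would be the trained network's decision procedure):
    # oracle = make_oracle(trained_net.predict, training_data)
    # instance, label = oracle({"colour": "red"})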

Tree Expansion. Unlike most decision-tree algorithms, which grow trees in a depth-first
manner, Trepan grows trees using a best-first expansion.
Split Types. The role of internal nodes in a decision tree is to partition the input space in
order to increase the separation of instances of different classes. This algorithm forms
trees that use m-of-n expressions for its splits. An m-of-n expression is a Boolean
expression that is specified by an integer threshold, m, and a set of n Boolean conditions.
An m-of-n expression is satisfied when at least m of its n conditions are satisfied. For
example, suppose we have three Boolean features, a, b, and c; the m-of-n expression
2-of-{a, ¬b, c} is logically equivalent to (a ∧ ¬b) ∨ (a ∧ c) ∨ (¬b ∧ c).
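A small Python sketch of evaluating an m-of-n split (the feature values are a made-up
example):

    def m_of_n(m, conditions):
        """True if at least m of the Boolean conditions hold."""
        return sum(bool(c) for c in conditions) >= m

    a, b, c = True, True, False
    # the 2-of-{a, not b, c} expression from the text
    print(m_of_n(2, [a, not b, c]))                        # False
    print((a and not b) or (a and c) or (not b and c))     # False, the equivalent form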
Split Selection. Split selection involves deciding how to partition the input space at a
given internal node in the tree. A limitation of conventional tree-induction algorithms is
that the amount of training data used to select splits decreases with the depth of the tree.
Thus splits near the bottom of a tree are often poorly chosen because these decisions are
based on few training examples. In contrast, because Trepan has an oracle available, it is
able to use as many instances as desired to select each split.
Stopping Criteria. Trepan uses two separate criteria to decide when to stop growing an
extracted decision tree. First, a given node becomes a leaf in the tree if, with high
probability, the node covers only instances of a single class. To make this decision,
Trepan determines the proportion of examples that fall into the most common class at a
given node, and then calculates a confidence interval around this proportion.
Trepan also accepts a parameter that specifies a limit on the number of internal
nodes in an extracted tree. This parameter can be used to control the comprehensibility of
extracted trees, since in some domains, it may require very large trees to describe
networks to a high level of fidelity.

3.6.3 Algorithm
Input: Oracle(), training set S, feature set F, min_sample
Initialize the root of the tree, R

/* get a sample of instances */
use S to construct a model M_R of the distribution of instances covered by node R
q := max(0, min_sample - |S|)
query_instances_R := a set of q instances generated using model M_R

/* use the network to label all instances */
for each example x in (S U query_instances_R)
    class label for x := Oracle(x)

/* do a best-first expansion of the tree */
initialize queue with tuple (R, S, query_instances_R, {})
while queue not empty and global stopping criteria not satisfied

    /* make the node at the head of the queue into an internal node */
    remove (node N, S_N, query_instances_N, constraints_N) from head of queue
    use F, S_N, and query_instances_N to construct a splitting test T at node N

    /* make children nodes */
    for each outcome t of test T
        make C, a new child node of N
        constraints_C := constraints_N U {T = t}

        /* get a sample of instances for the node C */
        S_C := members of S_N with outcome t on test T
        construct a model M_C of the distribution of instances covered by node C
        q := max(0, min_sample - |S_C|)
        query_instances_C := a set of q instances generated using model M_C and constraints_C
        for each example x in query_instances_C
            class label for x := Oracle(x)

        /* make node C a leaf node for now */
        use S_C and query_instances_C to determine a class label for C

        /* determine if node C should be expanded */
        if local stopping criteria not satisfied then
            put (C, S_C, query_instances_C, constraints_C) into queue

Return: tree with root R


4. CONCLUSION
The advent of Data Mining is only the latest step in the extension of quantitative,
"scientific" methods to business. It empowers every nonstatistician - that is 99.9% of us
all - to study, understand and improve our operational or decisional processes, whether in
science, business or society.
For the first time, thanks to the increased power of computers, new methods
replace the skill of the statistical artisan with massive-computational methods, obtaining
equal or better results in far less time without the need for any specialised knowledge.
Data Mining is probably the most useful way to take advantage of the massive
processing power available on many desktop computers, and definitely the most
promising and exciting research field in Advanced Informatics.
Neural network algorithms are among the most popular data mining and machine
learning techniques used today. As computers become faster, the neural net methodology
is replacing many traditional tools in the field of knowledge discovery and some related
fields.
A significant limitation of neural networks is that their concept representations are
usually not amenable to human understanding. We have presented an algorithm that is
able to produce comprehensible descriptions of trained networks by extracting decision
trees that accurately describe the networks' concept representations. We believe that our
algorithm, which takes advantage of the fact that a trained network can be queried,
represents a promising advance towards the goal of general methods for understanding the
solutions encoded by trained networks.
One of the principal strengths of Trepan is its generality. In contrast to most rule
extraction algorithms, it makes few assumptions about the architecture of a given
network, and it does not require a special training method for the network. Moreover,
Trepan is able to handle tasks that involve both discrete-valued and continuous-valued
features.
REFERENCES

[1] IEEE Transactions on Neural Networks; Data Mining in a Soft Computing Framework: A
Survey. Authors: Sushmita Mitra, Sankar K. Pal and Pabitra Mitra. (January 2002, Vol. 13,
No. 1)

[2] Using Neural Networks for Data Mining: Mark W. Craven, Jude W. Shavlik

[3] Data Mining Techniques: Arjun K. Pujari

[4] Introduction to the Theory of Neural Computation: John Hertz, Anders Krogh, Richard G.
Palmer

[5] Elements of Artificial Neural Networks: Kishan Mehrotra, Chilukuri K. Mohan, Sanjay
Ranka

[6] Artificial Neural Networks: Galgotia Publications

[7] Neural Networks based Data Mining and Knowledge Discovery in Inventory Applications:
Kanti Bansal, Sanjeev Vadhavkar, Amar Gupta

[8] Data Mining, An Introduction: Ruth Dilly, Parallel Computer Centre, Queen's University
Belfast: http://www.pcc.qub.ac.uk/tec/courses/datamining/stu_notes/dm_book_1.html

[9] Introduction to Backpropagation Neural Networks: http://cortex.snowseed.com/index.html

[10] Data Mining Techniques: Electronic Textbook, StatSoft:
http://www.statsoftinc.com/textbook/stdatmin.html#neural

