
Pattern recognition in stock market

Introduction

motivation
Our time is limited; better not to waste it working
Lifestyle costs money
Create someone else to do the job for you

MetaTrader
Online trading platform
Lets you trade foreign currency, stocks and indexes
MetaQuotes Language (MQL), similar to C, lets you program buy and sell orders
Can be linked with dynamic-link libraries (DLLs)

Pattern recognition
Pattern recognition aims to classify data
(patterns) based either on a priori knowledge or
on statistical information extracted from the
patterns. The patterns to be classified are usually
groups of measurements or observations,
defining points in an appropriate
multidimensional space.

To understand is to perceive patterns

SVM

Linear Support Vector Machines

A direct marketing company wants to sell a new book: The Art History of Florence
(Nissan Levin and Jacob Zahavi, in Lattin, Carroll and Green (2003)).

Problem: how to identify buyers and non-buyers using two variables:
Months since last purchase
Number of art books purchased

[Figure: scatter plot of buyers and non-buyers; x-axis: months since last purchase, y-axis: number of art books purchased]

Linear SVM: Separable Case


Main idea of SVM: separate the groups by a line.

However, there are infinitely many lines that have zero training error.

Which line shall we choose?

[Figure: buyers and non-buyers with several candidate separating lines; x-axis: months since last purchase, y-axis: number of art books purchased]

Linear SVM: Separable Case

SVMs use the idea of a margin around the separating line.

The thinner the margin, the more complex the model.

The best line is the one with the largest margin.

[Figure: buyers and non-buyers separated by a line with its margin; x-axis: months since last purchase, y-axis: number of art books purchased]

Linear SVM: Separable Case

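The line having the largest margin is:

$$w_1 x_1 + w_2 x_2 + b = 0$$

where
x1 = months since last purchase
x2 = number of art books purchased

The margin is bounded by the two parallel lines $w_1 x_1 + w_2 x_2 + b = +1$ and $w_1 x_1 + w_2 x_2 + b = -1$.

Note:
$$w_1 x_{i1} + w_2 x_{i2} + b \ge +1 \quad \text{for every point } i \text{ of the first group}$$
$$w_1 x_{j1} + w_2 x_{j2} + b \le -1 \quad \text{for every point } j \text{ of the second group}$$

[Figure: the separating line and its two margin boundaries; x-axis: x1 (months since last purchase), y-axis: x2 (number of art books purchased)]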

Linear SVM: Separable Case


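The width of the margin is given by:

$$\text{margin} = \frac{1 - (-1)}{\sqrt{w_1^2 + w_2^2}} = \frac{2}{\|w\|}$$

Note: maximizing the margin $2/\|w\|$ is equivalent to minimizing $\|w\|$, which is equivalent to minimizing $\|w\|^2 / 2$.

The two constraints
$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1$$
$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1$$
can be written as a single condition:
$$y_i (w \cdot x_i + b) - 1 \ge 0 \quad \text{for all } i$$

[Figure: margin between the lines $w_1 x_1 + w_2 x_2 + b = +1$ and $w_1 x_1 + w_2 x_2 + b = -1$; x-axis: x1 (months since last purchase), y-axis: x2 (number of art books purchased)]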
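As a small numeric illustration of the margin formula, the sketch below computes 2/||w|| for a hand-picked line and checks the condition y_i (w·x_i + b) ≥ 1 on a toy data set. The weights, bias and points are made up for illustration; they are not the book-purchase data from the slides.

```python
import math

def margin_width(w):
    """Width of the margin, 2 / ||w||, for a separating line w.x + b = 0."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, b, points, labels):
    """True if y_i * (w.x_i + b) >= 1 holds for every training point."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1.0
        for x, y in zip(points, labels)
    )

# Hypothetical toy data: two points per class.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]

# Hand-picked line 0.5*x1 + 0.5*x2 - 1 = 0 (an assumption, not a fitted SVM).
w, b = (0.5, 0.5), -1.0
width = margin_width(w)
feasible = satisfies_constraints(w, b, points, labels)
```

Scaling w and b by a common factor keeps the same line but changes the margin value, which is why the constraints fix the scale at ±1 on the margin boundaries.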

Linear SVM: Separable Case


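Maximizing the margin $2/\|w\|$ is equivalent to minimizing $\|w\|$, or equivalently $\|w\|^2 / 2$.

The optimization problem for SVM is:

$$\text{minimize } L(w) = \frac{\|w\|^2}{2}$$

subject to:
$$w_1 x_{i1} + w_2 x_{i2} + b \ge +1 \quad \text{for every point } i \text{ of the first group}$$
$$w_1 x_{j1} + w_2 x_{j2} + b \le -1 \quad \text{for every point } j \text{ of the second group}$$

[Figure: the maximum-margin line in the (x1, x2) plane]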

Linear SVM: Separable Case


Support vectors

Support vectors are those points that lie on the boundaries of the margin.

The decision surface (line) is determined only by the support vectors. All other points are irrelevant.

[Figure: support vectors lying on the margin boundaries in the (x1, x2) plane]

Linear SVM: Nonseparable Case


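Non-separable case: there is no line separating the two groups without errors.
Training set: 1000 targeted customers

[Figure: overlapping buyers and non-buyers in the (x1, x2) plane with the line $w_1 x_1 + w_2 x_2 + b = 0$]

Here, SVM minimizes L(w, C):

$$L(w, C) = \frac{\|w\|^2}{2} + C \sum_i \xi_i$$

L(w, C) = Complexity + Errors: the first term maximizes the margin, the second term minimizes the training errors.

subject to:
$$w_1 x_{i1} + w_2 x_{i2} + b \ge +1 - \xi_i \quad \text{for every point } i \text{ of the first group}$$
$$w_1 x_{j1} + w_2 x_{j2} + b \le -1 + \xi_j \quad \text{for every point } j \text{ of the second group}$$
$$\xi_i, \xi_j \ge 0$$

Equivalently, with vectors $X_i$ and labels $y_i = \pm 1$, the classifier is $y = \text{sign}(w \cdot X + b)$ and

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 + C \sum_i \big[\,1 - y_i (w \cdot X_i + b)\,\big]_+$$

The margin and error vectors form the set S:
$$y_i (w \cdot X_i + b) \le 1, \quad i \in S$$
$$y = \text{sign}\Big(\sum_{i \in S} \alpha_i y_i \, X_i \cdot X + b\Big), \qquad w = \sum_{i \in S} \alpha_i y_i X_i$$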
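To make the "Complexity + Errors" trade-off concrete, here is a minimal sketch that evaluates the soft-margin objective in its hinge-loss form for a fixed line. The data, the line and the two values of C are illustrative assumptions; nothing is optimized here.

```python
def hinge(z):
    """Positive part [z]_+ = max(0, z)."""
    return max(0.0, z)

def soft_margin_objective(w, b, C, points, labels):
    """L(w, C) = 0.5 * ||w||^2 + C * sum_i [1 - y_i (w.x_i + b)]_+ ."""
    complexity = 0.5 * sum(wi * wi for wi in w)
    errors = sum(
        hinge(1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in zip(points, labels)
    )
    return complexity + C * errors

# Hypothetical toy data; the last point sits on the wrong side of the line.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (2.5, 2.5)]
labels = [+1, +1, -1, -1]
w, b = (0.5, 0.5), -1.0   # hand-picked line, not a fitted solution

loss_c1 = soft_margin_objective(w, b, 1.0, points, labels)
loss_c5 = soft_margin_objective(w, b, 5.0, points, labels)
```

Only the misclassified point contributes to the error term, so raising C from 1 to 5 multiplies its penalty by five while the complexity term stays the same.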

Linear SVM: The Role of C


Bigger C (figure: C = 5): increased complexity (thinner margin), smaller number of errors (better fit on the data)
Smaller C (figure: C = 1): decreased complexity (wider margin), bigger number of errors (worse fit on the data)

C varies both the complexity and the empirical error, by affecting the optimal w and the optimal number of training errors.

Non-linear SVMs
Transform $x \to \varphi(x)$
The linear algorithm depends only on $x \cdot x_i$, hence the transformed algorithm depends only on $\varphi(x) \cdot \varphi(x_i)$
Use a kernel function $K(x_i, x_j)$ such that $K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$

Nonlinear SVM: Nonseparable Case

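Mapping into a higher-dimensional space

$$\begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{l1} & x_{l2} \end{pmatrix} \mapsto \begin{pmatrix} x_{11}^2 & \sqrt{2}\,x_{11}x_{12} & x_{12}^2 \\ x_{21}^2 & \sqrt{2}\,x_{21}x_{22} & x_{22}^2 \\ \vdots & \vdots & \vdots \\ x_{l1}^2 & \sqrt{2}\,x_{l1}x_{l2} & x_{l2}^2 \end{pmatrix}$$

Optimization task: minimize L(w, C)

$$L(w, C) = \frac{\|w\|^2}{2} + C \sum_i \xi_i$$

subject to:
$$w_1 x_{i1}^2 + w_2 \sqrt{2}\,x_{i1}x_{i2} + w_3 x_{i2}^2 + b \ge +1 - \xi_i$$
$$w_1 x_{j1}^2 + w_2 \sqrt{2}\,x_{j1}x_{j2} + w_3 x_{j2}^2 + b \le -1 + \xi_j$$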

Nonlinear SVM: Nonseparable Case

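Map the data into a higher-dimensional space:

$$(x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$$

(1, 1) → (1, √2, 1)
(−1, −1) → (1, √2, 1)
(−1, 1) → (1, −√2, 1)
(1, −1) → (1, −√2, 1)

[Figure: the four points in the original (x1, x2) plane and their images in the transformed space]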

Nonlinear SVM: Nonseparable Case


Find the optimal hyperplane in the transformed space.

[Figure: the four points (1, 1), (−1, −1), (−1, 1), (1, −1) and their images under $(x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, with the optimal hyperplane drawn in the transformed space]

Nonlinear SVM: Nonseparable Case


Observe the decision surface in the original space (optional).

[Figure: the decision surface found in the transformed space, drawn back in the original (x1, x2) plane]

Nonlinear SVM: Nonseparable Case


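Dual formulation of the (primal) SVM minimization problem

Primal:
$$\min_{w, b, \xi} \; \frac{\|w\|^2}{2} + C \sum_i \xi_i$$
subject to
$$y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad y_i = \pm 1$$

Dual:
$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)$$
subject to
$$0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0, \quad y_i = \pm 1$$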

Nonlinear SVM: Nonseparable Case


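Dual formulation of the (primal) SVM minimization problem

With the mapping $(x_1, x_2) \mapsto \varphi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, the dual becomes

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,\big(\varphi(x_i) \cdot \varphi(x_j)\big)$$

where

$$\varphi(x_i) \cdot \varphi(x_j) = (x_{i1}^2, \sqrt{2}\,x_{i1}x_{i2}, x_{i2}^2) \cdot (x_{j1}^2, \sqrt{2}\,x_{j1}x_{j2}, x_{j2}^2) = \big((x_{i1}, x_{i2}) \cdot (x_{j1}, x_{j2})\big)^2 = (x_i \cdot x_j)^2$$

$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) \quad \text{(kernel function)}$$

so the dual can be written using only inner products in the original space:

$$\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)^2$$

subject to
$$0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0, \quad y_i = \pm 1$$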
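The identity behind the kernel function can be checked numerically. The sketch below compares the inner product of the explicitly mapped vectors with the squared inner product of the original vectors; the two test vectors are arbitrary illustrative values.

```python
import math

def phi(x):
    """Quadratic feature map (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever forming phi."""
    return dot(x, z) ** 2

# Arbitrary illustrative inputs.
x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))      # inner product in the 3-D feature space
implicit = quadratic_kernel(x, z)   # same value from the 2-D inputs
```

The two values agree up to floating-point rounding, which is the reason the dual never needs the mapped vectors themselves.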

Solving
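Construct & minimise the Lagrangian

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \,[\,y_i (w \cdot x_i + b) - 1\,]$$

subject to the constraints $\alpha_i \ge 0, \; i = 1, \dots, N$

Take derivatives w.r.t. w and b, equate them to 0

$$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0$$

$$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0$$

KKT condition: $\alpha_i \,[\,y_i (w \cdot x_i + b) - 1\,] = 0$

The parameters are expressed as a linear combination of training points
Only SVs will have non-zero $\alpha_i$

The Lagrange multipliers $\alpha_i$ are called dual variables
Each training point has an associated dual variable.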
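Substituting the two stationarity conditions back into the Lagrangian gives the dual objective in terms of the dual variables alone. A short derivation, assuming the hard-margin Lagrangian of this slide (no slack variables):

```latex
\begin{aligned}
L(w,b,\alpha)
  &= \tfrac{1}{2}\,\|w\|^2 - \sum_{i=1}^{N} \alpha_i \,[\,y_i (w \cdot x_i + b) - 1\,] \\
  &= \tfrac{1}{2}\, w \cdot w \;-\; w \cdot \sum_{i} \alpha_i y_i x_i \;-\; b \sum_{i} \alpha_i y_i \;+\; \sum_{i} \alpha_i \\
  &= \tfrac{1}{2}\, w \cdot w \;-\; w \cdot w \;-\; 0 \;+\; \sum_{i} \alpha_i
     \qquad \Big(\text{using } w = \sum_i \alpha_i y_i x_i,\ \sum_i \alpha_i y_i = 0\Big) \\
  &= \sum_{i=1}^{N} \alpha_i \;-\; \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
\end{aligned}
```

The result depends on the training points only through the inner products x_i · x_j, which is what allows a kernel to be substituted.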

Applications
Handwritten digits recognition
Of interest to the US Postal services
4% error was obtained
only about 4% of the training data were support vectors

Text categorisation
Face detection
DNA analysis

Architecture of SVMs
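Nonlinear classifier (using kernel)
Decision function

$$f(x) = \text{sgn}\Big(\sum_{i=1}^{l} v_i \,\big(\varphi(x) \cdot \varphi(x_i)\big) + b\Big) = \text{sgn}\Big(\sum_{i=1}^{l} v_i \,k(x, x_i) + b\Big)$$

$\varphi(x_i)$ substitutes for each training example $x_i$
$v_i = \alpha_i y_i$
The $v_i$ are computed as the solution of a quadratic program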
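The decision function can be sketched on the four-point XOR-style example, using the quadratic kernel k(x, z) = (x · z)². The dual variables below (alpha_i = 1/8 for every point, b = 0) are one valid solution worked out by hand for this toy set, not the output of a quadratic-programming solver.

```python
def quadratic_kernel(x, z):
    """k(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def decision_value(x, train_x, v, b, kernel):
    """sum_i v_i k(x, x_i) + b; the classifier outputs the sign of this value."""
    return sum(vi * kernel(x, xi) for vi, xi in zip(v, train_x)) + b

def sgn(value):
    return 1 if value >= 0 else -1

# XOR-style training set; labels follow the sign of x1 * x2.
train_x = [(1, 1), (-1, -1), (-1, 1), (1, -1)]
train_y = [+1, +1, -1, -1]

# Assumed dual solution (hand-derived): alpha_i = 1/8 for all i, b = 0.
alpha = [0.125] * 4
v = [a * y for a, y in zip(alpha, train_y)]   # v_i = alpha_i * y_i
b = 0.0

predictions = [
    sgn(decision_value(x, train_x, v, b, quadratic_kernel)) for x in train_x
]
```

With these values every training point has a decision value of exactly +1 or −1, so all four points lie on the margin and all four are support vectors.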

Artificial Neural Networks

Neural Network
Taxonomy of Neural Network Architecture

The architecture of a neural network refers to the arrangement of the connections between neurons, the processing elements, the number of layers, and the flow of signals in the network. There are mainly two categories of neural network architecture: feed-forward and feedback (recurrent) neural networks.

Neural Network
Feed-forward network, Multilayer Perceptron

Neural Network
Recurrent network

Multilayer Perceptron (MLP)


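[Figure: MLP structure. The input vector (x1, x2, x3, x4, ..., xn) enters the input layer, which feeds a hidden layer (h1, h2, ...) and then an output layer (O1, ...). Inset: a neuron processing element with inputs x1, ..., xn, weights w1, ..., wn, weighted sum y and output F(y).]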

Backpropagation Learning
Architecture:
Feedforward network of at least one layer of non-linear hidden nodes, e.g., # of layers L ≥ 2 (not counting the input layer)
Node function is differentiable
most common: sigmoid function

Learning: supervised, error driven,


generalized delta rule
Call this type of nets BP nets
The weight update rule
(gradient descent approach)
Practical considerations
Variations of BP nets
Applications

Backpropagation Learning
Notations:
Weights: two weight matrices:
$w^{(1,0)}$ from input layer (0) to hidden layer (1)
$w^{(2,1)}$ from hidden layer (1) to output layer (2)
$w^{(1,0)}_{2,1}$: weight from node 1 at layer 0 to node 2 in layer 1
Training samples: pairs $\{(x_p, d_p) \mid p = 1, \dots, P\}$, so it is supervised learning
Input pattern: $x_p = (x_{p,1}, \dots, x_{p,n})$
Output pattern: $o_p = (o_{p,1}, \dots, o_{p,K})$
Desired output: $d_p = (d_{p,1}, \dots, d_{p,K})$
Error: $l_{p,j} = d_{p,j} - o_{p,j}$, the error for output node j when $x_p$ is applied
Sum square error: $\sum_{p=1}^{P} \sum_{j=1}^{K} (l_{p,j})^2$
This error drives learning (changes in $w^{(1,0)}$ and $w^{(2,1)}$)

Backpropagation Learning
Sigmoid function again:

$$S(x) = \frac{1}{1 + e^{-x}}$$

Differentiable:

$$S'(x) = -\frac{1}{(1 + e^{-x})^2} \cdot (1 + e^{-x})' = -\frac{1}{(1 + e^{-x})^2} \cdot (-e^{-x}) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = S(x)\,(1 - S(x))$$

[Figure: sigmoid curve with a saturation region at each end]

When |net| is sufficiently large, the node moves into one of the two saturation regions, behaving like a threshold or ramp function.
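The derivative identity and the saturation effect are easy to check numerically. The sketch below is a plain-Python illustration; the finite-difference helper and the sample points are assumptions added only for the cross-check.

```python
import math

def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative via the identity S'(x) = S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only to cross-check the identity."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

slope_at_zero = sigmoid_prime(0.0)   # steepest point of the curve
slope_at_ten = sigmoid_prime(10.0)   # deep in the saturation region
check = numeric_derivative(sigmoid, 1.3)
```

The slope at x = 10 is several thousand times smaller than at x = 0, which is the numerical face of the saturation regions.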

Chain rule of differentiation:
if $z = f(y)$, $y = g(x)$, $x = h(t)$ then

$$\frac{dz}{dt} = \frac{dz}{dy} \cdot \frac{dy}{dx} \cdot \frac{dx}{dt} = f'(y)\, g'(x)\, h'(t)$$

Backpropagation Learning
Forward computing:
Apply an input vector x to the input nodes
Compute the output vector $x^{(1)}$ on the hidden layer

$$x^{(1)}_j = S(net^{(1)}_j) = S\Big(\sum_i w^{(1,0)}_{j,i}\, x_i\Big)$$

Compute the output vector o on the output layer

$$o_k = S(net^{(2)}_k) = S\Big(\sum_j w^{(2,1)}_{k,j}\, x^{(1)}_j\Big)$$

The network is said to be a map from input x to output o
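The forward computation can be sketched in a few lines of plain Python. The 2-2-1 network and its weights below are hypothetical values chosen for illustration, and bias terms are left out to match the formulas on this slide.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(weights, inputs):
    """Apply S(sum_i w[j][i] * inputs[i]) for every node j of one layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(w10, w21, x):
    """Map input x to (hidden vector x^(1), output vector o)."""
    hidden = layer_output(w10, x)        # x^(1)
    output = layer_output(w21, hidden)   # o
    return hidden, output

# Hypothetical 2-2-1 network with hand-picked weights.
w10 = [[0.5, -0.5], [1.0, 1.0]]   # w10[j][i]: input node i -> hidden node j
w21 = [[1.0, -1.0]]               # w21[k][j]: hidden node j -> output node k

hidden, output = forward(w10, w21, [1.0, 1.0])
```

Each weight matrix is stored row by row per receiving node, so `w10[j][i]` plays the role of $w^{(1,0)}_{j,i}$.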

Objective of learning:
Modify the 2 weight matrices to reduce the sum square error $\sum_{p=1}^{P} \sum_{k=1}^{K} (l_{p,k})^2$ for the given P training samples as much as possible (to zero if possible)

Backpropagation Learning
Idea of BP learning:
Update of weights in w(2, 1) (from hidden layer to output
layer):
delta rule as in a single layer net using sum square error
Delta rule is not applicable to updating weights in w(1, 0)
(from input layer to hidden layer) because we don't know the
desired values for hidden nodes
Solution: Propagating errors at output nodes down to
hidden nodes, these computed errors on hidden nodes
drives the update of weights in w(1, 0) (again by delta rule),
thus called error Back Propagation (BP) learning
How to compute errors on hidden nodes is the key
Error backpropagation can be continued downward if the
net has more than one hidden layer
Proposed first by Werbos (1974), current formulation by
Rumelhart, Hinton, and Williams (1986)

Backpropagation Learning
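Generalized delta rule:
Consider sequential learning mode: for a given sample $(x_p, d_p)$

$$E = \sum_k (l_{p,k})^2 = \sum_k (d_{p,k} - o_{p,k})^2$$

Update weights by gradient descent
For weight in $w^{(2,1)}$: $\Delta w^{(2,1)}_{k,j} = \eta \,\big(-\partial E / \partial w^{(2,1)}_{k,j}\big)$
For weight in $w^{(1,0)}$: $\Delta w^{(1,0)}_{j,i} = \eta \,\big(-\partial E / \partial w^{(1,0)}_{j,i}\big)$

Derivation of update rule for $w^{(2,1)}$:
since E is a function of $l_k = d_k - o_k$, $o_k$ is a function of $net^{(2)}_k$, and $net^{(2)}_k$ is a function of $w^{(2,1)}_{k,j}$, by chain rule

$$\frac{\partial E}{\partial w^{(2,1)}_{k,j}} = \frac{\partial E}{\partial o_k} \cdot \frac{\partial o_k}{\partial net^{(2)}_k} \cdot \frac{\partial net^{(2)}_k}{\partial w^{(2,1)}_{k,j}} = -2\,(d_k - o_k)\, S'(net^{(2)}_k)\, x^{(1)}_j$$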

Backpropagation Learning
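Derivation of update rule for $w^{(1,0)}_{j,i}$
consider hidden node j:
weight $w^{(1,0)}_{j,i}$ influences $net^{(1)}_j$
it sends $S(net^{(1)}_j)$ to all output nodes
all K terms in E are functions of $w^{(1,0)}_{j,i}$

$$E = \sum_k (d_k - o_k)^2, \quad o_k = S(net^{(2)}_k), \quad net^{(2)}_k = \sum_j x^{(1)}_j\, w^{(2,1)}_{k,j},$$
$$x^{(1)}_j = S(net^{(1)}_j), \quad net^{(1)}_j = \sum_i x_i\, w^{(1,0)}_{j,i}$$

by chain rule

$$\frac{\partial E}{\partial w^{(1,0)}_{j,i}} = \sum_k \frac{\partial E}{\partial o_k} \cdot \frac{\partial S(net^{(2)}_k)}{\partial net^{(2)}_k} \cdot \frac{\partial net^{(2)}_k}{\partial x^{(1)}_j} \cdot \frac{\partial x^{(1)}_j}{\partial net^{(1)}_j} \cdot \frac{\partial net^{(1)}_j}{\partial w^{(1,0)}_{j,i}}$$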

Backpropagation Learning
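Update rules (the constant factor 2 is absorbed into the learning rate $\eta$):
for outer layer weights $w^{(2,1)}$:

$$\Delta w^{(2,1)}_{k,j} = \eta\, \delta_k\, x^{(1)}_j \quad \text{where } \delta_k = (d_k - o_k)\, S'(net^{(2)}_k)$$

for inner layer weights $w^{(1,0)}$:

$$\Delta w^{(1,0)}_{j,i} = \eta\, \mu_j\, x_i \quad \text{where } \mu_j = \Big(\sum_k \delta_k\, w^{(2,1)}_{k,j}\Big)\, S'(net^{(1)}_j)$$

The sum in $\mu_j$ is the weighted sum of errors from the output layer.

Note: if S is a logistic function, then $S'(x) = S(x)\,(1 - S(x))$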
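The update rules can be sketched as one sequential-mode training step in plain Python. The network size, weights, sample and learning rate are illustrative assumptions; bias terms and momentum are omitted, and the factor 2 from the derivative is folded into `eta`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w10, w21, x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w10]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w21]
    return hidden, output

def squared_error(w10, w21, x, d):
    _, o = forward(w10, w21, x)
    return sum((dk - ok) ** 2 for dk, ok in zip(d, o))

def bp_step(w10, w21, x, d, eta):
    """One sequential-mode update; returns the new (w10, w21)."""
    hidden, o = forward(w10, w21, x)
    # Output error terms: (d_k - o_k) * S'(net_k), with S' = o_k * (1 - o_k).
    delta = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
    # Hidden error terms: weighted sum of output errors times S'(net_j).
    mu = [
        sum(delta[k] * w21[k][j] for k in range(len(w21))) * h * (1.0 - h)
        for j, h in enumerate(hidden)
    ]
    new_w21 = [[w + eta * delta[k] * hidden[j] for j, w in enumerate(row)]
               for k, row in enumerate(w21)]
    new_w10 = [[w + eta * mu[j] * x[i] for i, w in enumerate(row)]
               for j, row in enumerate(w10)]
    return new_w10, new_w21

# Hypothetical 2-2-1 network and a single training sample.
w10 = [[0.5, -0.5], [1.0, 1.0]]
w21 = [[1.0, -1.0]]
x, d = [1.0, 1.0], [1.0]

before = squared_error(w10, w21, x, d)
w10, w21 = bp_step(w10, w21, x, d, eta=0.5)
after = squared_error(w10, w21, x, d)
```

Note that the hidden error terms are computed from the old output-layer weights before either matrix is changed, which is what the chain-rule derivation requires.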

Backpropagation Learning
Pattern classification: an example
Classification of myoelectric signals
Input pattern: 2 features, normalized to real values
between -1 and 1
Output patterns: 3 classes

Network structure: 2-5-3
2 input nodes, 3 output nodes, 1 hidden layer of 5 nodes
Learning rate η = 0.95, momentum α = 0.4
Error bound e = 0.05
332 training samples
Maximum iteration = 20,000
When stopped, 38 patterns remain misclassified

[Figure: classification result, 38 patterns misclassified]

Strengths of BP Learning
Great representation power
Any L2 function can be represented by a BP net
Many such functions can be approximated by BP learning
(gradient descent approach)

Easy to apply
Only requires that a good set of training samples is
available
Does not require substantial prior knowledge or deep
understanding of the domain itself (ill structured problems)
Tolerates noise and missing data in training samples
(graceful degrading)

Easy to implement the core of the learning algorithm


Good generalization power
Often produce accurate results for inputs outside the
training set

Deficiencies of BP Learning
Learning often takes a long time to converge
Complex functions often need hundreds or thousands of
epochs

The net is essentially a black box


It may provide a desired mapping between input and output
vectors (x, o) but does not have the information of why a
particular x is mapped to a particular o.
It thus cannot provide an intuitive (e.g., causal) explanation
for the computed result.
This is because the hidden nodes and the learned weights do
not have clear semantics.
What can be learned are operational parameters, not general,
abstract knowledge of a domain

Unlike many statistical methods, there is no theoretically well-founded way to assess the quality of BP learning
What is the confidence level of o computed from input x using
such net?
What is the confidence level for a trained BP net, with the final
E (which may or may not be close to zero)?

Problem with gradient descent approach


only guarantees to reduce the total error to a local
minimum. (E may not be reduced to zero)
Cannot escape from the local minimum error state
Not every function that is representable can be
learned

How bad: depends on the shape of the error surface.


Too many valleys/wells will make it easy to be trapped
in local minima
Possible remedies:
Try nets with different # of hidden layers and hidden
nodes (they may lead to different error surfaces, some
might be better than others)
Try different initial weights (different starting points on the
surface)
Forced escape from local minima by random perturbation
(e.g., simulated annealing)

Generalization is not guaranteed even if the error


is reduced to 0
Over-fitting/over-training problem: trained net fits the training
samples perfectly (E reduced to 0) but it does not give accurate
outputs for inputs not in the training set
Possible remedies:
More and better samples
Using smaller net if possible
Using larger error bound
(forced early termination)
Introducing noise into samples
modify (x1, ..., xn) to (x1 + ε1, ..., xn + εn), where the εi are small random displacements
Cross-Validation
leave some (~10%) samples as test data (not used for weight update)
periodically check error on test data
learning stops when error on test data starts to increase

Network paralysis with sigmoid activation function


Saturation regions:
$S(x) = 1 / (1 + e^{-x})$; its derivative $S'(x) = S(x)\,(1 - S(x)) \to 0$ when $x \to \pm\infty$.
When x falls in a saturation region, S(x) hardly changes its value regardless of how fast the magnitude of x increases

Input to a node may fall into a saturation region when some of its incoming weights become very large during learning. Consequently, weights stop changing no matter how hard you try.
Possible remedies:
Use non-saturating activation functions
Periodically normalize all weights: $w_{k,j} := w_{k,j} / \|w_{\cdot k}\|_2$

The learning (accuracy, speed, and generalization) is highly dependent on a set of learning parameters
Initial weights, learning rate, # of hidden layers and # of nodes...
Most of them can only be determined empirically (via experiments)

Practical Considerations
A good BP net requires more than the core of the learning
algorithms. Many parameters must be carefully selected
to ensure a good performance.
Although the deficiencies of BP nets cannot be
completely cured, some of them can be eased by some
practical means.
Initial weights (and biases)
Random, [-0.05, 0.05], [-0.1, 0.1], [-1, 1]
Avoid bias in weight initialization
Normalize weights for hidden layer (w(1, 0)) (Nguyen-Widrow)

Randomly assign initial weights for all hidden nodes
For each hidden node j, normalize its weights by

$$w^{(1,0)}_{j,i} := \beta \cdot w^{(1,0)}_{j,i} / \|w^{(1,0)}_j\|_2 \quad \text{where } \beta = 0.7\,\sqrt[n]{m}$$

m = # of hidden nodes, n = # of input nodes
$\|w^{(1,0)}_j\|_2 = \beta$ after normalization
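A minimal sketch of this normalization step in plain Python follows. The initial range [-0.5, 0.5], the seed and the 2-input, 5-hidden-node size are assumptions made for the example; only the scaling by β = 0.7 · m^(1/n) comes from the slide.

```python
import math
import random

def nguyen_widrow_init(n_inputs, n_hidden, seed=0):
    """Return hidden-layer weights whose rows all have Euclidean norm beta."""
    rng = random.Random(seed)
    beta = 0.7 * n_hidden ** (1.0 / n_inputs)
    weights = []
    for _ in range(n_hidden):
        # Random starting weights for hidden node j (range is an assumption).
        row = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
        norm = math.sqrt(sum(w * w for w in row))
        # Rescale so that the weight vector of node j has length beta.
        weights.append([beta * w / norm for w in row])
    return weights, beta

weights, beta = nguyen_widrow_init(n_inputs=2, n_hidden=5)
row_norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
```

After the call, every entry of `row_norms` equals `beta` up to rounding, matching the "after normalization" statement above.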

Training samples:
Quality and quantity of training samples often determines the
quality of learning results
Samples must collectively represent well the problem space
Random sampling
Proportional sampling (with prior knowledge of the problem
space)
# of training patterns needed: there is no theoretically ideal number.
Baum and Haussler (1989): P = W/e, where
W: total # of weights to be trained (depends on net structure)
e: acceptable classification error rate
If the net can be trained to correctly classify (1 − e/2)P of the P training samples, then the classification accuracy of this net is 1 − e for input patterns drawn from the same sample space
Example: W = 27, e = 0.05, P = 540. If we can successfully train the network to correctly classify (1 − 0.05/2) * 540 = 526 of the samples, the net will work correctly 95% of the time with other input.
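The arithmetic of the example can be reproduced with two small helper functions. The function names are hypothetical, and the second result is rounded down because the slide's example rounds 526.5 down to 526.

```python
import math

def samples_needed(num_weights, error_rate):
    """Baum-Haussler estimate P = W / e, rounded to a whole sample count."""
    return round(num_weights / error_rate)

def required_correct(num_samples, error_rate):
    """(1 - e/2) * P, rounded down as in the slide's example."""
    return math.floor((1.0 - error_rate / 2.0) * num_samples)

P = samples_needed(27, 0.05)
correct = required_correct(P, 0.05)
```

Halving the acceptable error rate doubles the estimated number of training samples, since P is inversely proportional to e.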

How many hidden layers and hidden nodes per layer:
Theoretically, one hidden layer (possibly with many hidden nodes) is sufficient for any L2 function
There are no theoretical results on the minimum necessary # of hidden nodes
Practical rule of thumb:
n = # of input nodes; m = # of hidden nodes
For binary/bipolar data: m = 2n
For real data: m >> 2n
Multiple hidden layers with fewer nodes may be trained
faster for similar quality in some applications

Example: compressing character bitmaps.


Each character is represented by a 7 by 9 pixel bitmap, or a binary vector of dimension 63
10 characters (A to J) are used in the experiment
Error range:
tight: 0.1 (off: 0 to 0.1; on: 0.9 to 1.0)
loose: 0.2 (off: 0 to 0.2; on: 0.8 to 1.0)

Relationship between # hidden nodes, error range, and convergence rate
relaxing the error range may speed up convergence
increasing # hidden nodes (to a point) may speed up convergence

error range   hidden nodes   # epochs
0.1           10             400+
0.2           10             200+
0.1           20             180+
0.2           20             90+

no noticeable speed up when # hidden nodes increases beyond 22

Other applications.
Medical diagnosis
Input: manifestation (symptoms, lab tests, etc.)
Output: possible disease(s)
Problems:
no causal relations can be established
hard to determine what should be included as
inputs
Currently focus on more restricted diagnostic tasks
e.g., predict prostate cancer or hepatitis B based
on standard blood test
Process control
Input: environmental parameters
Output: control parameters
Learn ill-structured control functions

Stock market forecasting


Input: financial factors (CPI, interest rate, etc.) and
stock quotes of previous days (weeks)
Output: forecast of stock prices or stock indices
(e.g., S&P 500)
Training samples: stock market data of past few
years
Consumer credit evaluation
Input: personal financial information (income, debt,
payment history, etc.)
Output: credit rating
And many more
Key for successful application
Careful design of input vector (including all
important features): some domain knowledge
Obtain good training samples: time and other cost

Summary of BP Nets
Architecture
Multi-layer, feed-forward (full connection between
nodes in adjacent layers, no connection within a layer)
One or more hidden layers with non-linear activation
function (most commonly used are sigmoid functions)
BP learning algorithm
Supervised learning (samples (xp, dp))
Approach: gradient descent to reduce the total error, $\Delta w \propto -\partial E / \partial w$ (why it is also called generalized delta rule)
Error terms at output nodes are propagated back to give error terms at hidden nodes (why it is called error BP)
Ways to speed up the learning process
Adding momentum terms
Adaptive learning rate (delta-bar-delta)
Quickprop
Generalization (cross-validation test)

Strengths of BP learning
Great representation power
Wide practical applicability
Easy to implement
Good generalization power
Problems of BP learning
Learning often takes a long time to converge
The net is essentially a black box
Gradient descent approach only guarantees a local minimum error
Not every function that is representable can be learned
Generalization is not guaranteed even if the error is reduced to zero
No well-founded way to assess the quality of BP learning
Network paralysis may occur (learning is stopped)
Selection of learning parameters can only be done by trial-and-error
BP learning is non-incremental (to include new training samples, the
network must be re-trained with all old and new samples)

Experiments

Stock Prediction

Stock prediction is a difficult task because stock data is very noisy and time-varying.
The efficient market hypothesis claims that the future price of a stock is not predictable from publicly available information.
However, this theory has been challenged by many studies, and a few researchers have successfully applied machine learning approaches such as neural networks to stock prediction.

Is the Market Predictable?

Efficient Market Hypothesis (EMH) (Fama, 1965)


Stock market is efficient in that the current market prices reflect all information
available to traders, so that future changes cannot be predicted relying on past prices
or publicly available information.

Murphy's law : Anything that can go wrong will go wrong.


Fama et al. (1988) showed that 25% to 40% of the variance in stock returns over periods of three to five years is predictable from past returns.
Pesaran and Timmermann (1999) conclude that the UK stock market has been predictable over the past 25 years.
Saad (1998) successfully employed different neural network models to predict the short-term trend of various stocks.

Optimistic report

Implementation

In this paper we propose to investigate SVM, MLP and RBF networks for the task of predicting the future trend of 3 major stock indices:
a) Kuala Lumpur Composite Index (KLCI)
b) Hong Kong Hang Seng index
c) Nikkei 225 stock index
using inputs based on technical indicators.
This paper approaches the problem as a 2-class pattern classification task, formulated specifically to assist investors in making trading decisions.
The classifier is asked to recognise investment opportunities that can give a return of r% or more within the next h days, with r = 3% and h = 10 days.

System Block Diagram

The classifier predicts whether the stock index can rise by more than 3% within the next 10 days.

[Figure: system block diagram. Daily historical data are converted into technical analysis indicators and fed to the classifier, which answers the question "Increment achievable?" with Yes / No.]

Data Used

Kuala Lumpur Composite Index (KLCI) for the period 1992-1997
Hang Seng index (20/4/1992-1/9/1997)
Nikkei 225 stock index (20/4/1982-1/9/1987)

TABLE 1: DESCRIPTION OF INPUT TO CLASSIFIER

xi i=1,2,3 .12 n=15

Input to Classifier

DLN (t) = sign[q(t)-q(t-N)] * ln (q(t)/q(t-N) +1) (1)


q(t) is the index level at day t and DLN (t) is the actual input to the classifier.
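Equation (1) can be sketched as a small Python function. It follows the formula exactly as transcribed on the slide, with the +1 inside the logarithm; the index levels in `q` are made-up values for illustration.

```python
import math

def dln(q, t, n):
    """DLN(t) = sign[q(t) - q(t-N)] * ln(q(t)/q(t-N) + 1), as in equation (1)."""
    if t - n < 0:
        raise IndexError("not enough history for this lag")
    diff = q[t] - q[t - n]
    sign = (diff > 0) - (diff < 0)   # +1, 0 or -1
    return sign * math.log(q[t] / q[t - n] + 1.0)

# Hypothetical index levels, one value per trading day.
q = [100.0, 102.0, 101.0, 105.0, 103.0]

up = dln(q, t=3, n=2)     # 105 vs 102: positive input
down = dln(q, t=4, n=1)   # 103 vs 105: negative input
```

The sign factor carries the direction of the move over the last N days, while the logarithm compresses its size.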

Prediction Formulation

Consider ymax(t) as the maximum upward movement of the stock


index value within the period t and t + . y(t) represents the stock
index level at day t

Prediction Formulation

Classification
The prediction of stock trend is formulated as a two-class classification problem.
yr(t) > r%  →  Class 2
yr(t) ≤ r%  →  Class 1

Prediction Formulation
Classification

Let (xi, yi), 1 ≤ i ≤ N, be a set of N training examples. Each input example xi ∈ R^n, with n = 15 being the dimension of the input space, belongs to a class labelled by yi ∈ {+1, −1}.

[Figure: two classes of points, labelled yi = −1 and yi = +1]

Performance Measure
True Positive (TP) is the number of positive
predicted correctly as positive class.
False Positive (FP) is the number of negative
predicted wrongly as positive class.
False Negative (FN) is the number of positive
predicted wrongly as negative class.
True Negative (TN) is the number of negative
predicted correctly as negative class.


Performance Measure

Accuracy = (TP+TN) / (TP+FP+TN+FN)


Precision = TP/(TP+FP)
Recall rate (sensitivity) = TP/(TP+FN)
F1 = 2 * Precision * Recall/(Precision + Recall)
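The four measures can be computed directly from the confusion counts. The sketch below uses made-up label lists; the zero-denominator guards are an added assumption, since the slides do not say how empty classes are handled.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, FN, TN) for two equal-length label sequences."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical labels: +1 = positive class, -1 = negative class.
actual    = [1, 1, 1, -1, -1, -1, -1, 1]
predicted = [1, 1, -1, -1, 1, -1, -1, -1]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy, precision, recall, f1 = scores(tp, fp, fn, tn)
```

For a trading classifier, precision is the share of "buy" signals that were real opportunities, and recall is the share of real opportunities that were signalled.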

Testing Method
Rolling Window Method is Used to Capture Training and
Test Data

Train

Test

Train =600 data Test= 400 data
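One way to generate rolling train/test splits with these sizes is sketched below. The slides give only the window sizes (600 and 400); the step by which the window advances and the total number of samples are assumptions made for the example.

```python
def rolling_windows(n_samples, train_size=600, test_size=400, step=400):
    """Yield (train_range, test_range) index ranges that slide forward in time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step   # assumed: advance by one test window each time

# Hypothetical series of 1800 daily samples.
windows = list(rolling_windows(1800))
```

Each test window lies strictly after its training window, so no future data leaks into training.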

Experiment and Result


Experiments are conducted to predict the stock trend of three major stock indexes: KLCI, Hang Seng and Nikkei.
SVM, MLP and RBF networks are used to make trend predictions based on classification and regression approaches.
A hypothetical trading system is simulated to find the annualized profit generated from the given predictions.

Experiment and Result

Trading Performance
A hypothetical trading system is used
When a positive prediction is made, one unit of money
was invested in a portfolio reflecting the stock index. If
the stock index increased by more than r% (r=3%) within
the next h days (h=10) at day t, then the investment is
sold at the index price of day t. If not, the investment is
sold on day t+1 regardless of the price. A transaction fee
of 1% is charged for every transaction made.
Use annualised rate of return .

Trading Performance
Classifier Evaluation Using Hypothetical Trading
System

Trading Performance

Experiment and Result


Classification Result

Experiment and Result


The results show better performance of the neural network techniques when compared to the K nearest neighbour classifier. On average, SVM performs better overall than the MLP and RBF networks on most of the performance metrics used.

Experiment and Result


Comparison of Receiver Operating Characteristic (ROC) curves

Experiment and Result


Area under the ROC curve

Conclusion

We have investigated SVM, MLP and RBF networks as classifiers and regressors to assess their potential in the stock trend prediction task.

The support vector machine (SVM) has shown better performance than MLP and RBF.

The SVM classifier with probabilistic output outperforms the MLP and RBF networks in terms of error-reject tradeoff.

Both the classification and regression models can be used for a profitable trend prediction system. The classification model has the advantage that a pattern rejection scheme can be incorporated.

This report

Implementation
OnlineSVR by Francesco Parrella
http://onlinesvr.altervista.org/
BPN by Karsten Kutza
http://www.neural-networks-at-your-fingertips.com/

Results

Basically zero correlation between the predictions and the actual outcome

Suffered from many technical failures

Still have faith that these methods (when applied correctly) can predict the future better than a random guess

Tried many topologies of the BPN and many input values for the SVM; it looks like the secret does not lie there

Future investigation: use wavelet/noiselet coefficients as inputs

References
http://www.cs.unimaas.nl/datamining/slides2009/svm_presentation.ppt
http://merlot.stat.uconn.edu/~lynn/svm.ppt
http://www.cs.bham.ac.uk/~axk/ML_SVM05.ppt
http://www.stanford.edu/class/msande211/KKTgeometry.ppt
http://www.csee.umbc.edu/~ypeng/F09NN/lecture-notes/NN-Ch3.ppt
http://fit.mmu.edu.my/caiic/reports/report04/mmc/haris.ppt
http://www.youtube.com/watch?v=oQ1sZSCz47w
Google, Wikipedia and others
