
Test of imputation methods for missing values in currency time series: a case of European countries that have not adopted the euro

Aleksandar Petreski

November, 2007


1 Abstract
This study is motivated by the problem of missing data, its consequences for the accuracy of parameter estimation, and the reliability of results when missing data are used as input. Specifically, the test was performed on a data set of non-euro currency time series for Poland, Slovakia, the UK and Russia (European countries that have not adopted the euro).
Several methods of filling in missing historical data were used and their imputation accuracy was compared. The imputation efficiency of simple interpolation, regression analysis, Principal Component Analysis (PCA) and the Expectation Maximization (EM) algorithm was examined.
It was found that, for the periods and data series analysed, linear interpolation (for univariate series) and PCA (for multivariate series) outperformed the other methodologies.
The outperformance of a naive approach such as linear interpolation for univariate series may say something about the quality of the data and the market from which the data come. It confirms the intuition that, in illiquid markets, market returns exhibit autocorrelation and follow an interpolated pattern.
Furthermore, for the multivariate series, it was found that the accuracy of the imputation depends on the strength of the correlation between currencies.


Contents
1 Abstract
2 Methodology
   2.1 Interpolation
   2.2 Regression analysis (Simple linear regression)
   2.3 Principal Component Analysis (PCA)
   2.4 The Expectation Maximization (EM) Algorithm
      a. Maximum Likelihood Estimation
      b. The E and M-Step of the Algorithm
   2.5 Imputation Accuracy Measure
3 Data
4 Application of methodology
5 Results
6 References


2 Methodology
This study investigated and compared several imputation techniques for handling missing values in currency time series:
- Interpolation,
- Regression,
- Principal Component Analysis (PCA),
- The Expectation Maximization (EM) Algorithm.

2.1 Interpolation
Missing data can be interpolated using several methods:
- linear (a first-degree polynomial in one variable): f(x) = ax + b
- cubic (a polynomial of degree three): f(x) = ax^3 + bx^2 + cx + d
- spline (a special function defined piecewise by polynomials).
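As an illustration of how these interpolators can be applied to a series with gaps (the study itself uses the Matlab function fillts, described later), a minimal Python sketch follows; the series and values are purely hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical yield series with gaps marked as NaN
yields = pd.Series([4.12, 4.15, np.nan, np.nan, 4.23, 4.21, np.nan, 4.30])

# The three interpolators discussed above
linear_fill = yields.interpolate(method="linear")               # piecewise f(x) = ax + b
cubic_fill = yields.interpolate(method="polynomial", order=3)   # third-degree polynomial fit
spline_fill = yields.interpolate(method="spline", order=3)      # piecewise cubic spline

print(pd.DataFrame({"linear": linear_fill, "cubic": cubic_fill, "spline": spline_fill}))
```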

2.2 Regression analysis (Simple linear regression)

Regression analysis can be used to fill in missing values by assuming a constant relationship (correlation) between two variables (in our case two time series of yields) over the whole observation period. Hence, by estimating the relationship between the observed values on non-missing dates and knowing the values of the independent (regressor) variable on the missing dates, one can predict the value of the dependent variable (regressand) on the missing dates.
The general form of a simple linear regression is:

y_i = \alpha + \beta x_i + \varepsilon_i

where \alpha is the intercept, \beta is the slope, and \varepsilon_i is the error term, which picks up the unpredictable part of the response variable y_i. The error term is usually assumed to be normally distributed. The x's and y's are the data quantities from the sample of yield time series, and \alpha and \beta are the unknown parameters ("constants") to be estimated from the data. Estimates of \alpha and \beta can be derived by the method of ordinary least squares. The method is called "least squares" because the estimates of \alpha and \beta minimize the sum of squared errors for the given data set. The estimates of \alpha and \beta are often denoted \hat{\alpha} and \hat{\beta}. It can be shown that the least squares estimates are given by

\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}

and

\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}

where \bar{y} is the mean (average) of the y values and \bar{x} is the mean of the x values.
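A minimal sketch of this regression-based fill, assuming one fully observed series x and a target series y with gaps (the arrays and helper name below are illustrative, not part of the original study):

```python
import numpy as np

def regression_impute(x, y):
    """Fill NaNs in y using an OLS fit of y on x estimated over the jointly observed dates."""
    mask = ~np.isnan(y)
    x_obs, y_obs = x[mask], y[mask]
    beta = np.sum((x_obs - x_obs.mean()) * (y_obs - y_obs.mean())) / np.sum((x_obs - x_obs.mean()) ** 2)
    alpha = y_obs.mean() - beta * x_obs.mean()
    y_filled = y.copy()
    y_filled[~mask] = alpha + beta * x[~mask]   # predict on the missing dates
    return y_filled

# Illustrative data: x fully observed, y missing on two dates
x = np.array([1.00, 1.10, 1.30, 1.20, 1.40, 1.50])
y = np.array([2.10, 2.20, np.nan, 2.40, np.nan, 3.00])
print(regression_impute(x, y))
```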

2.3 Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a common method for extracting the most significant uncorrelated sources of variation in a multivariate system. PCA reduces dimensionality, so that only the most significant sources of information are used. This approach is very useful in highly correlated systems, such as the yield term structure, because there is a small number of independent sources of variation and most of the variation can be described by just a few principal components (Alexander, 2001). It avoids some of the problems with linear regression that are common to factor models, such as the choice of the explanatory variables. For example, assuming that a complete data set exists for some yield terms, one might fill in the missing data by performing regression analysis, but the problem of the choice of factor remains. Also, when there is strong multicollinearity in the system, the model parameter estimates and their standard errors can be affected.
Assume that the data on which the PCA is to be carried out are organized in matrix format (N x M), where N denotes the observations on each variable (in this case the time series of a particular yield term), indexed i = 1, 2, ..., N (rows), and M denotes the variables (yield terms), indexed j = 1, 2, ..., M. First, the sample mean of the j-th column is subtracted from each element x_{i,j} of that column, which is then divided by the sample standard deviation of the column. Consequently, a matrix X of standardized mean deviations is created, whose M columns (the stationary data) have mean 0 and variance 1.
PCA is based on the eigenvalue and eigenvector analysis of the unbiased covariance matrix V of the data X:

V = \frac{1}{N-1} X^{T} X

The PCs are uncorrelated with each other; the first PC explains the greatest amount of the total variation in X, the second the greatest amount of the remaining variation, and so on.
If W denotes the M x M matrix of eigenvectors of V and \Lambda the M x M diagonal matrix of eigenvalues of V, then

V W = W \Lambda
V = W \Lambda W^{T} \quad \text{since} \quad W^{-1} = W^{T}

The principal components are linear combinations of the returns,

P_{N \times M} = X_{N \times M} W_{M \times M}

where the weights are the columns of W:

P_m = w_{1m} X_1 + w_{2m} X_2 + \dots + w_{Mm} X_M

But since W^{-1} = W^{T}, we also have X_{N \times M} = P_{N \times M} W^{T}_{M \times M}, which means that the returns are themselves linear combinations of the PCs:

X_i = w_{i1} P_1 + w_{i2} P_2 + \dots + w_{iM} P_M    (4)

In highly correlated systems, such as the yield term structure, there is an intuitive interpretation of the leading PCs (trend, tilt, convexity).
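As an illustration of this decomposition, a brief Python sketch follows (in the study these steps are carried out with Excel matrix functions); the random input data are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 4))              # illustrative N x M matrix of yield changes

# Standardize each column to mean 0, variance 1
X = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)

# Unbiased covariance matrix V = X'X / (N - 1) and its eigendecomposition
V = X.T @ X / (X.shape[0] - 1)
eigenvalues, W = np.linalg.eigh(V)               # columns of W are the eigenvectors
order = np.argsort(eigenvalues)[::-1]            # sort PCs by explained variance
eigenvalues, W = eigenvalues[order], W[:, order]

P = X @ W                                        # principal components, P = X W
X_reconstructed = P @ W.T                        # returns as combinations of the PCs, equation (4)
print(np.allclose(X, X_reconstructed))           # exact when all PCs are kept
```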

2.4 The Expectation Maximization (EM) Algorithm

The Expectation Maximization (EM) algorithm is a common technique, developed by Dempster et al. (1977), for determining maximum-likelihood estimates when the data are not entirely available. Under the assumption that the data are normally distributed, the EM algorithm makes it possible to calculate efficient parameter estimates, for which the observed data are most likely.
Each iteration of the EM algorithm consists of two steps: the E-step and the M-step. In the expectation, or E-step, the missing data are estimated given the observed data and the current estimate of the model parameters. In the M-step, under the assumption that the missing data are known, the model parameters are estimated so as to maximize the likelihood function. The estimates from the M-step are then used in the next E-step, and the two steps are iterated until convergence occurs. Convergence is assured since the algorithm is guaranteed to increase the likelihood at each iteration.

a. Maximum Likelihood Estimation

The principle of maximum likelihood provides a means of choosing an asymptotically efficient estimator for a parameter or a set of parameters. To introduce the concept of likelihood, write the probability of an event X (the data) given the model parameters \theta as

P(X \mid \theta)

Then the likelihood is

L(\theta \mid X)

that is, the likelihood of the parameters given the data.
The aim of maximum likelihood estimation is to find the parameter value(s) that make the observed data most likely, because the likelihood of the parameters given the data is defined to be equal to the probability of the data given the parameters.
Consider a random sample of n observations from a normal distribution with mean \mu and variance \sigma^2. The probability density function of each observation is

f(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)

Since the observations are independent, their joint density is

f(x_1, x_2, \dots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right)

This expression gives the probability of observing this specific sample. Hence, there are values of the mean \mu and the variance \sigma^2 that make this sample most probable; these are the maximum likelihood estimates (MLE) \theta_{mle} of the mean and variance.
Since the log function is monotonically increasing and easier to work with, the MLE is usually obtained by maximizing the natural logarithm of the likelihood function:


\ln L(\mu, \sigma^2 \mid x_1, x_2, \dots, x_n) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2

This translates into finding solutions to the following first-order conditions:

\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0

\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0

Finally, the MLE estimates are:

\hat{\mu}_{ml} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}_n \qquad \text{and} \qquad \hat{\sigma}^2_{ml} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}_n\right)^2
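As a quick sanity check on these closed-form results, the sketch below compares them with a direct numerical maximization of the log-likelihood; the sample and starting values are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=500)      # illustrative sample

# Closed-form MLEs derived above
mu_ml, var_ml = x.mean(), ((x - x.mean()) ** 2).mean()

# Negative log-likelihood, maximized numerically for comparison
def neg_log_lik(theta):
    mu, log_var = theta
    var = np.exp(log_var)                         # keeps the variance positive
    return 0.5 * len(x) * np.log(2 * np.pi * var) + ((x - mu) ** 2).sum() / (2 * var)

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print(mu_ml, var_ml)                              # closed-form estimates
print(res.x[0], np.exp(res.x[1]))                 # numerical optimum, should agree
```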

b. The E and M-Step of the Algorithm

If there are missing observations Y_mis in the sample, the MLE estimates \theta_{mle} of the parameters are not directly obtainable, because the likelihood function is not defined at the missing data points. To resolve the problem, the EM algorithm is applied. Y_mis encloses information relevant to estimating \theta_{mle}, and \theta_{mle} in turn helps to compute probable values of Y_mis.
This relation suggests the following scheme for estimating \theta_{mle} in the presence of Y_obs alone:
- fill in the missing data Y_mis based on an initial estimate of \theta_{mle},
- re-estimate \theta_{mle} based on Y_obs and the filled-in Y_mis, and
- iterate until the estimates converge.
In any missing data case, the distribution of the complete data set Y can be factored as:

P(Y \mid \theta) = P(Y_{obs} \mid \theta)\, P(Y_{mis} \mid Y_{obs}, \theta)    (1)

Considering each term in the above equation as a function of \theta, it follows that:

l(\theta \mid Y) = l(\theta \mid Y_{obs}) + \log P(Y_{mis} \mid Y_{obs}, \theta) + c    (2)

where l(\theta \mid Y) = \log P(Y \mid \theta) denotes the complete-data log likelihood, l(\theta \mid Y_{obs}) = \log L(\theta \mid Y_{obs}) the observed-data log likelihood, and c an arbitrary constant.
Averaging equation (2) over the distribution P(Y_{mis} \mid Y_{obs}, \theta^{t}), where \theta^{t} is a current estimate of the unknown parameter, one gets:

Q(\theta \mid \theta^{t}) = l(\theta \mid Y_{obs}) + H(\theta \mid \theta^{t})    (3)

where

Q(\theta \mid \theta^{t}) = \int l(\theta \mid Y)\, P(Y_{mis} \mid Y_{obs}, \theta^{t})\, dY_{mis}

and

H(\theta \mid \theta^{t}) = \int \log P(Y_{mis} \mid Y_{obs}, \theta)\, P(Y_{mis} \mid Y_{obs}, \theta^{t})\, dY_{mis}


A basic conclusion of Dempster et al. (1977) is that if \theta^{t+1} is taken to be the value of \theta that maximizes Q(\theta \mid \theta^{t}), then \theta^{t+1} is an improved estimate compared to \theta^{t}:

l(\theta^{t+1} \mid Y_{obs}) \geq l(\theta^{t} \mid Y_{obs})

To sum up, it is convenient to think of EM as an iterative algorithm that operates in two steps:
E-step: the expectation step, in which the function Q(\theta \mid \theta^{t}) is calculated by averaging the complete-data log likelihood l(\theta \mid Y) over P(Y_{mis} \mid Y_{obs}, \theta^{t}); and
M-step: the maximization step, in which \theta^{t+1} is calculated by maximizing Q(\theta \mid \theta^{t}).
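The study runs the EM iteration with Matlab code by Meucci (2005); the sketch below is not that code, but a simplified illustration of the same idea for jointly normal data, alternating between a conditional-mean fill of the gaps (E-step) and re-estimation of the mean vector and covariance matrix (M-step). The function name, data and tolerances are illustrative, and the conditional-covariance correction of full EM is omitted for brevity:

```python
import numpy as np

def em_impute(Y, n_iter=100, tol=1e-8):
    """Simplified EM-style imputation for multivariate normal data with NaNs."""
    Y = Y.copy()
    miss = np.isnan(Y)
    col_means = np.nanmean(Y, axis=0)
    Y[miss] = np.take(col_means, np.where(miss)[1])       # initial fill: column means
    for _ in range(n_iter):
        mu = Y.mean(axis=0)                                # M-step: re-estimate parameters
        cov = np.cov(Y, rowvar=False)
        Y_old = Y.copy()
        for i in range(Y.shape[0]):                        # E-step: conditional means
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            reg = cov[np.ix_(m, o)] @ np.linalg.pinv(cov[np.ix_(o, o)])
            Y[i, m] = mu[m] + reg @ (Y[i, o] - mu[o])
        if np.max(np.abs(Y - Y_old)) < tol:
            break
    return Y

# Illustrative use: a 4-column matrix of returns with ~5% of cells deleted at random
rng = np.random.default_rng(2)
data = rng.multivariate_normal(np.zeros(4), 0.05 + np.eye(4) * 0.1, size=200)
data[rng.random(data.shape) < 0.05] = np.nan
filled = em_impute(data)
```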

2.5 Imputation Accuracy Measure

Finally, the Root Mean Squared Error (RMSE) was used to gauge the imputation accuracy of the alternative filling methods. The smaller the RMSE, the better the accuracy of the model:

RMSE = \sqrt{ \frac{1}{T} \sum_{t=1}^{T} \left( \hat{y}_t - y_t \right)^2 }

where \hat{y}_t is the estimated (modelled) value and y_t is the true value for a particular missing date.
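In code this measure is a one-liner; the helper name below is illustrative:

```python
import numpy as np

def rmse(y_hat, y_true):
    """Root mean squared error between imputed and true values on the missing dates."""
    y_hat, y_true = np.asarray(y_hat), np.asarray(y_true)
    return np.sqrt(np.mean((y_hat - y_true) ** 2))
```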

3 Data
In order to have an appropriate check of the imputation methods, the starting point was a complete data set, from which some data points were deleted at random while keeping a record of the deleted values. In this way it was possible to observe the difference between the estimated (modelled) missing values and the true values on the missing dates.
The test was done on the non-euro currency time series of PLN (Poland), SKK (Slovakia), GBP (UK) and RUB (Russia). Poland, Slovakia and the UK are EU Member States that use their own currency and have not adopted the euro, while Russia is not an EU member country.
The chosen period runs from 08/10/2001 to 31/12/2005, with 1103 yield (1102 return) observations, for 4 currencies (GBP, SKK, PLN and RUB) and 4 yield terms per currency (1M, 3M, 6M, 1Y).
This provides sufficiently long time series, diversity in the level of correlation across the 4 term series, and variety in the level of autocorrelation within each time series.
It should be noted that the deletion was first made in the yields data set and the corresponding data points were then deleted from the returns data set, which means 2 return data points for each deleted yield data point. From each series of 1002 observations (returns), the same number of data points (78) was deleted, for every currency, on the same dates, in the same time series (3M), in order to avoid any bias in the conclusions about the efficiency of the imputation methods.
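A minimal sketch of this test design, deleting a fixed number of points at random while keeping a record of the true values (the series, seed and counts below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
yields_3m = rng.normal(5.0, 0.2, size=1102)         # illustrative 3M yield series

n_delete = 78
deleted_idx = rng.choice(len(yields_3m), size=n_delete, replace=False)
true_values = yields_3m[deleted_idx].copy()          # record of the deleted (true) values

series_with_gaps = yields_3m.copy()
series_with_gaps[deleted_idx] = np.nan               # this is the input to the imputation methods

# After imputation, accuracy is measured against the recorded true values, e.g.
# rmse(imputed_series[deleted_idx], true_values)
```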


4 Application of methodology
Interpolation of missing data was done using the Matlab function fillts, which fills missing values in a time series using linear, cubic and spline functions. Interpolation was performed both on the yields directly and on the returns, in order to check which approach gives better imputation accuracy.
Regression was performed using the appropriate Excel add-in for such analysis, on all three non-missing time series and the missing one, in order to compare the relationship between the 1M maturity and each of the other three time series (3M, 6M, 1Y).
PCA was done using common Excel functions and add-in matrix functions. Before performing PCA the input data should be stationary; therefore the yield history is not used directly, but differences of yields are employed instead. Furthermore, these stationary data are normalized before the analysis; otherwise the first principal component might be dominated by the variable with the highest volatility. After the data have been made stationary and normalized, the covariance matrix V is calculated. From this matrix V, the matrix W of eigenvectors is calculated¹. Then equation (4) is used, and the PCs are calculated by multiplying the standardized data matrix X by the transpose of the matrix of eigenvectors W.
Suppose that the yield term with missing observations is X_1, and that X_2, ..., X_4 are the correlated yield series. PCA is used to fill in the missing data according to the following scheme (a hedged code sketch of steps 1-4 is given after the list):
step 1: A PCA is performed on X_1, ..., X_4, using only the dates on which the data set is complete, to obtain the PCs and the factor weights w_11, w_12, w_13, w_14. The choice of the number of principal components depends on how highly correlated the system is, with the restriction that it should be less than the number of variables (<4).
step 2: One more PCA is performed on X_2, ..., X_4, using the data from all dates on these variables. The number of principal components (3) is the same as the number of variables.
step 3: Using the first three factor weights from step 1, w_11, w_12, w_13, and the principal components from step 2, P_1, ..., P_3, the data on X_1 are remodelled according to the equation:

X_1 = w_11 P_1 + w_12 P_2 + w_13 P_3

step 4: The values obtained from the above equation are standardized, so they are multiplied by their standard deviation and added to their mean to transform them back into unstandardized terms.
step 5: The final step is the calibration of the model. The true data on X_1 for the missing dates, for which a record is kept, are compared with the modelled data obtained from step 4. As expected, there is a difference between the original and the modelled data. By choosing an appropriate mean and standard deviation, the standard error between the observed original and modelled data (for the non-missing dates) is minimized, while simultaneously trying to minimize the standard error between the missing and modelled data (for the missing dates).
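As announced above, a hedged sketch of steps 1-4 (the study carries these out with Excel matrix functions and the MEigenvecPow add-in; the helper below is illustrative only and leaves out the step 5 calibration):

```python
import numpy as np

def pca_fill(X, missing_col=0, n_pc=3):
    """Sketch of steps 1-4: remodel the column with gaps from the PCs of the other columns."""
    miss = np.isnan(X[:, missing_col])
    complete = X[~miss]                                  # dates where the data set is complete
    mu = complete[:, missing_col].mean()                 # de-standardization statistics (step 4)
    sd = complete[:, missing_col].std(ddof=1)

    def standardize(A):
        return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

    # Step 1: PCA on all variables, complete dates only -> weights of the target column
    Z1 = standardize(complete)
    _, W1 = np.linalg.eigh(np.cov(Z1, rowvar=False))
    w = W1[:, ::-1][missing_col, :n_pc]                  # w11, w12, w13

    # Step 2: PCA on the fully observed variables, using all dates
    Z2 = standardize(np.delete(X, missing_col, axis=1))
    _, W2 = np.linalg.eigh(np.cov(Z2, rowvar=False))
    P = Z2 @ W2[:, ::-1][:, :n_pc]                       # P1, P2, P3

    # Step 3: X1 = w11*P1 + w12*P2 + w13*P3, then Step 4: back to unstandardized terms
    return (P @ w) * sd + mu
```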
The optimization problem was solved using the Excel Solver add-in. One might expect that minimizing the observed error would always minimize the unobserved error; however, there was a case for the PLN currency where the standard error of the unobserved values increased after the minimization.
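The calibration itself is a small constrained optimization; the study solves it with Excel Solver, and a roughly equivalent sketch with scipy (argument names and bounds are illustrative) might look like this:

```python
import numpy as np
from scipy.optimize import minimize

def calibrate(x1_model_std, x1_observed, obs_mask, mu0, sd0, mu_bounds, sd_bounds):
    """Choose the de-standardization mean/sd that minimize the error on observed dates."""
    def objective(params):
        mu, sd = params
        fitted = x1_model_std * sd + mu
        return np.sqrt(np.mean((fitted[obs_mask] - x1_observed[obs_mask]) ** 2))
    result = minimize(objective, x0=[mu0, sd0], bounds=[mu_bounds, sd_bounds])
    return result.x      # calibrated mean and standard deviation
```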
¹ The function MEigenvecPow returns all eigenvectors of a given matrix. Optional parameters: IterMax sets the maximum number of iterations allowed (default 1000); Norm, if TRUE, makes the function return normalized vectors |v| = 1 (default FALSE).


A possible justification for the calibration procedure is that the observed parameters (mean and standard deviation) are estimated sample parameters and can differ from the true population parameters. Therefore, there is some freedom in choosing the parameters, which in turn enables this optimization procedure of minimizing the standard error. This freedom must not be misused, so there have to be constraints on the choice of mean and standard deviation that keep them reasonably close to the observed ones.
In order to impose reasonable constraints on the parameters, a distribution of dynamic (conditional) means and standard deviations was created, from which the minimum and maximum tails were then estimated using Generalized Extreme Value (GEV) theory².
For the creation of the dynamic mean/standard deviation distributions, the complete data set for the observed dates was used, with a rolling window of 25 data points. The length of the rolling window was chosen so as to create an (at least visually) smooth distribution of block maxima (or minima)³.
From these series, the maximum/minimum value was taken over an appropriate rolling window of data (20 data points), creating a distribution of block maxima (or minima). Finally, the Matlab GEV fitting procedure was used to fit the distribution and estimate its parameters.
The distribution function of the GEV is given by

H_k(x) =
\begin{cases}
\exp\!\left( -\left(1 + kx\right)^{-1/k} \right), & k \neq 0 \\
\exp\!\left( -e^{-x} \right), & k = 0
\end{cases}

where 1 + kx > 0 and k is the shape parameter.
The estimated parameters were used, along with an appropriate confidence level (0.05 = 1 - 0.95), to estimate the tails (minimum/maximum).
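The study performs this fit with Matlab's GEV routine; a hedged scipy sketch of the same step, using illustrative block maxima built from a series of rolling means, is given below. Note that scipy's shape parameter c corresponds to -k in the parameterization above:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(4)
rolling_means = rng.normal(0.0, 0.1, size=1000)      # illustrative dynamic (rolling-window) means

# Block maxima over consecutive windows of 20 points (the study uses a rolling window)
window = 20
block_max = np.array([rolling_means[i:i + window].max()
                      for i in range(0, len(rolling_means) - window + 1, window)])

# Fit the GEV and take the 95% quantile as the upper constraint bound
c, loc, scale = genextreme.fit(block_max)
upper_tail = genextreme.ppf(0.95, c, loc=loc, scale=scale)
print(c, loc, scale, upper_tail)
```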
The EM iteration was done using Matlab⁴. EM was used to fill in missing values not only in multivariate series, but in univariate series as well. Use was made of the fact that, in less liquid financial markets, yield series are autocorrelated. Hence, assuming significant autocorrelation, a multivariate series was created by adding to the univariate series its lagged values (with lags of 1 and 2 data points), and the EM algorithm was then applied to it.
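A small sketch of this lag-augmentation trick, stacking a single series with its lagged copies so that any multivariate method (for example the em_impute helper sketched earlier) can be applied; the data are illustrative:

```python
import numpy as np

def add_lags(series, lags=(1, 2)):
    """Stack a univariate series with its lagged copies into an N x (1 + len(lags)) matrix."""
    max_lag = max(lags)
    cols = [series[max_lag:]]                                   # y_t
    for lag in lags:
        cols.append(series[max_lag - lag:len(series) - lag])    # y_{t-lag}
    return np.column_stack(cols)

y = np.array([1.00, 1.10, np.nan, 1.30, 1.25, np.nan, 1.40, 1.50])
augmented = add_lags(y)                                         # columns: y_t, y_{t-1}, y_{t-2}
```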

² The GEV can be defined constructively as the limiting distribution of block maxima (or minima). That is, if a large number of independent random values are generated from a single probability distribution and their maximum is taken, the distribution of that maximum is approximately a GEV.
³ The original distribution determines the shape parameter, k, of the resulting GEV distribution. Distributions whose tails fall off as a polynomial, such as Student's t, lead to a positive shape parameter. Distributions whose tails decrease exponentially, such as the normal, correspond to a zero shape parameter. Distributions with finite tails, such as the beta, correspond to a negative shape parameter.
⁴ The code was created by A. Meucci (2005).


5 Results
To sum up, two different data structures were used, univariate and multivariate, and each was approached with different methods of filling in the missing values.
Missing values in the univariate series were filled in with three interpolation methods (linear, cubic, spline) and by using the EM algorithm on a specially constructed multivariate series (the series of lagged values of the univariate series was added to the initial one). This special case was based on the assumption of significant autocorrelation within the univariate time series.
Missing values in the multivariate series were filled in using regression analysis, the EM algorithm and PCA. These methods are based on an assumed correlation between the time series of different yield terms.
All the methods were applied to the four currencies and the RMSE was calculated to gauge the imputation accuracy of the different filling methods. It should be noted that it is not meaningful to compare accuracy across currencies within one method, since that would compare different scales: the four currencies have different absolute values, and returns are calculated as the difference between two consecutive yields (an absolute, not a relative, measure).
The results are presented below, by currency and by method:
Table 1
Standard error between true and estimated missing values, by currency and method

Interpolation (on yields)
        linear      cubic       spline
GBP     0.012634    0.012634    0.013450
PLN     0.109914    0.119677    0.140475
RUB     0.367476    0.361841    0.427538
SKK     0.061995    0.062694    0.081617

Interpolation (on returns)
        linear      cubic       spline
GBP     0.022444    0.022883    0.030971
PLN     0.224596    0.224950    0.266227
RUB     0.696896    0.723855    0.988194
SKK     0.135018    0.144017    0.211207

EM
        univariate  multivariate
GBP     0.276059    0.271685
PLN     0.189223    0.145121
RUB     0.528138    0.314370
SKK     0.094536    0.084691

Regression analysis
        3M beta     6M beta     1Y beta
GBP     0.014285    0.018383    0.020128
PLN     0.147775    0.176477    0.185126
RUB     0.311836    0.348184    0.357611
SKK     0.086751    0.087307    0.090837

PCA
        uncalibrated  calibrated  note
GBP     0.014334      0.014184    calibrated without restriction
PLN     0.133532      0.145507    calibrated without restriction
RUB     0.317607      0.307854    calibrated without restriction
SKK     0.114076      0.084888    with restriction in calibration

Table 2
Correlation between different yield term series, by currency

GBP     1M      3M      6M      1Y
1M      1
3M      0.74    1
6M      0.47    0.80    1
1Y      0.32    0.61    0.92    1

PLN     1M      3M      6M      1Y
1M      1
3M      0.51    1
6M      0.38    0.45    1
1Y      0.29    0.42    0.46    1

RUB     1M      3M      6M      1Y
1M      1
3M      0.82    1
6M      0.73    0.90    1
1Y      0.73    0.89    0.97    1

SKK     1M      3M      6M      1Y
1M      1
3M      0.37    1
6M      0.28    0.53    1
1Y      0.25    0.46    0.60    1

Table 3
Autocorrelation within yield term series, by currency and extent of lag

Lag 1                           GBP             PLN             RUB             SKK
autocorrelation coefficient     0.03434         0.00796         0.17289         -0.23805
indicator                       insignificant   insignificant   significant     significant
standard error                  0.03367         0.03267         0.03402         0.03236
t-stat                          1.02003         0.24367         5.08169         -7.35592
t-inv                           1.96272         1.96271         1.96271         1.96271

Lag 2                           GBP             PLN             RUB             SKK
autocorrelation coefficient     -0.02466        -0.14814        -0.00218        0.11565
indicator                       insignificant   significant     insignificant   significant
standard error                  0.03396         0.03644         0.03754         0.03539
t-stat                          -0.72615        -4.06547        -0.05804        3.26796
t-inv                           1.96299         1.96299         1.96298         1.96299

As can be seen in Table 1, linear interpolation applied directly to the yields outperformed the other methodologies for PLN and SKK, gave results similar to the multivariate methods for GBP, and was worse than the multivariate methods for RUB. The worst results were shown by spline interpolation applied to the returns, for all currencies except GBP, where univariate EM was the worst.
Hence, it can be concluded that the accuracy of the imputation depends on the strength of the correlation within the system. As the correlation within the system increases, the multivariate methods give better results at the expense of naive interpolation. As can be seen from Table 2, the correlation is strongest for the RUB time series, where the multivariate methods gave the best results.
The outperformance of a naive approach such as linear interpolation for univariate series may say something about the quality of the data and the market from which the data come. This confirms the intuition that, in illiquid markets, market returns exhibit autocorrelation and follow an interpolated pattern.
It is also worth mentioning the imputation accuracy of the EM algorithm used on the lagged values of the univariate series. From Table 3 it can be noticed that the accuracy of the imputation depends on the strength of the autocorrelation: as the autocorrelation increases, the imputation accuracy improves. Hence, the worst results are for GBP, where there is no significant autocorrelation between the univariate series and the series of its lagged values (neither at lag 1 nor at lag 2). The results are better as the autocorrelation increases (which is connected with illiquid and undeveloped markets). Therefore, for SKK, where there is significant autocorrelation at the first lag (1 day) and also at the second lag (2 days), univariate EM works reasonably well and its imputation accuracy is very close to that of the other methods.
Within the class of multivariate methods, PCA outperformed the EM algorithm and regression analysis for all currencies, while regression analysis showed the worst results for all currencies, except for GBP, where EM was the worst. This suggests that PCA and EM avoid the problem of multicollinearity in the system and the ambiguity of choosing the explanatory variables. As yield data are highly correlated, PCA is intuitively an appropriate methodology for filling in missing observations.


6 References
Alexander, C. (2001). Market Models: A Guide to Financial Data Analysis. Wiley and Sons Ltd, Chichester, UK, pp. 175-178.
Borman, S. (2006). The Expectation Maximization Algorithm: A Short Tutorial.
Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1-38.
Neftci, S. (2000). Value at Risk Calculations, Extreme Events, and Tail Estimation. Journal of Derivatives.
