Aleksandar Petreski
National Bank of the Republic of Macedonia
November, 2007
1 Abstract
This study is motivated by the problem of missing data and its consequences for the accuracy of parameter estimation, and by the problem of the reliability of output when missing data is used as input. Specifically, the test was done on a data set of non-euro currency time series of Poland, Slovakia, the UK and Russia (European countries that have not adopted the euro).
Several methods of filling missing historical data were used and their imputation accuracy was compared. The imputation efficiency of the following was examined: simple interpolation, regression analysis, Principal Component Analysis (PCA) and the Expectation Maximization (EM) algorithm.
It was found that for the periods and the data series analysed, linear interpolation (for univariate series) and PCA (for multivariate series) outperformed the other methodologies.
The outperformance of a naïve approach such as linear interpolation for univariate series might speak to the quality of the data and of the market the data comes from. This confirms the intuition that in illiquid markets, market returns exhibit autocorrelation and follow an interpolated pattern.
Furthermore, for the multivariate series, it was found that the accuracy of the imputation depends on the strength of the correlation between currencies.
Contents

1 Abstract
2 Methodology
  2.1 Interpolation
  2.2 Regression analysis (Simple linear regression)
  2.3 Principal Component Analysis (PCA)
  2.4 The Expectation Maximization (EM) Algorithm
    a. Maximum Likelihood Estimation
    b. The E- and M-Steps of the Algorithm
  2.5 Imputation Accuracy Measure
3 Data
4 Application of methodology
5 Results
6 References
2 Methodology
This study investigated and compared several imputation techniques for overcoming missing values in currency time series:
Interpolation,
Regression,
Principal Component Analysis (PCA),
The Expectation Maximization (EM) Algorithm.
2.1 Interpolation
Missing data can be interpolated using several methods; in this study, linear, cubic and spline interpolation were applied.
2.2 Regression analysis (Simple linear regression)

In the simple linear regression model $y_i = \alpha + \beta x_i + \varepsilon_i$, the x's and y's are the data quantities from the sample of yield time-series, and $\alpha$ and $\beta$ are the unknown parameters ("constants") to be estimated from the data. Estimates for the values of $\alpha$ and $\beta$ can be derived by the method of ordinary least squares. The method is called "least squares" because the estimates of $\alpha$ and $\beta$ minimize the sum of squared error estimates for the given data set. The estimates of $\alpha$ and $\beta$ are often denoted by $\hat{\alpha}$ and $\hat{\beta}$. It can be shown that the least squares estimates are given by

$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

and

$$\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

where $\bar{y}$ is the mean (average) of the y values and $\bar{x}$ is the mean of the x values.
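The least squares estimates above can be computed directly in code. The study itself used an Excel regression add-in; the following standalone Python version (the function name `ols_fit` is illustrative, not from the paper) evaluates the same closed-form estimates:

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS estimates for the model y = alpha + beta * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    # beta = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    beta = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    # alpha = ybar - beta * xbar
    alpha = y.mean() - beta * x.mean()
    return alpha, beta
```

A missing value of y can then be imputed as alpha + beta * x for the corresponding date.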
2.3 Principal Component Analysis (PCA)

The principal components are obtained from the standardized data matrix $X$ and the matrix $W$ of eigenvectors of its covariance matrix:

$$P = XW^{T} \qquad (4)$$

In highly correlated systems, such as a yield term structure, there is an intuitive interpretation of the first principal components (trend, tilt, convexity).
2.4 The Expectation Maximization (EM) Algorithm

a. Maximum Likelihood Estimation

For a sample $x_1, x_2, \ldots, x_n$ drawn from a normal distribution, the density of a single observation is

$$f(x_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

and the likelihood of the whole sample is

$$f(x_1, x_2, \ldots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^{n} f(x_i \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-n/2}\,\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$$

This equation gives the probability of observing this specific sample. Hence, there are values of the mean $\mu$ and variance $\sigma^2$ that make this sample most probable; in other words, the maximum likelihood estimates (MLE) $\mu_{mle}$ and $\sigma^2_{mle}$.
Since the log function is monotonically increasing and easier to work with, the MLE is usually calculated by maximizing the natural logarithm of the likelihood function:
$$\ln L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

This translates into finding solutions to the following first order conditions:

$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$$

$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$

whose solutions are

$$\mu_{ml} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}_n \qquad \text{and} \qquad \sigma^2_{ml} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2$$
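These closed-form estimators are easy to verify numerically. A minimal Python check (the helper name is hypothetical, not part of the study's toolchain):

```python
import numpy as np

def normal_mle(sample):
    """Closed-form ML estimates for a normal sample.
    Note the 1/n (not 1/(n-1)) variance, as in the derivation above."""
    x = np.asarray(sample, dtype=float)
    mu = x.mean()                     # (1/n) * sum(x_i)
    sigma2 = np.mean((x - mu) ** 2)   # (1/n) * sum((x_i - mu)^2)
    return mu, sigma2
```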
b. The E- and M-Steps of the Algorithm

In any missing data case, the distribution of the complete dataset Y can be factored as:

$$P(Y \mid \theta) = P(Y_{obs} \mid \theta)\, P(Y_{mis} \mid Y_{obs}, \theta) \qquad (1)$$

Considering each term in the above equation as a function of $\theta$, it follows that:

$$l(\theta \mid Y) = l(\theta \mid y_{obs}) + \log P(y_{mis} \mid y_{obs}, \theta) + c \qquad (2)$$

where $l(\theta \mid Y) = \log P(Y \mid \theta)$ indicates the complete data log likelihood, $l(\theta \mid Y_{obs}) = \log L(\theta \mid Y_{obs})$ the observed data log likelihood, and $c$ an arbitrary constant.
Averaging equation (2) over the distribution $P(y_{mis} \mid y_{obs}, \theta^t)$, where $\theta^t$ is an initial estimate of the unknown parameter, one gets:

$$Q(\theta \mid \theta^t) = l(\theta \mid y_{obs}) + H(\theta \mid \theta^t) \qquad (3)$$

where

$$Q(\theta \mid \theta^t) = \int l(\theta \mid Y)\, P(Y_{mis} \mid Y_{obs}, \theta^t)\, dY_{mis}$$

and

$$H(\theta \mid \theta^t) = \int \log P(Y_{mis} \mid Y_{obs}, \theta)\, P(Y_{mis} \mid Y_{obs}, \theta^t)\, dY_{mis}$$
It can be shown that each EM iteration does not decrease the observed data log likelihood:

$$l(\theta^{t+1} \mid y_{obs}) \geq l(\theta^t \mid y_{obs})$$

To sum up, it is suitable to think of the EM as an iterative algorithm that operates in two steps as follows:

E-step: the expectation step, in which the function $Q(\theta \mid \theta^t)$ is computed by averaging the complete data log likelihood over the current conditional distribution of the missing data;

M-step: the maximization step, in which a new estimate $\theta^{t+1}$ is obtained by maximizing $Q(\theta \mid \theta^t)$ over $\theta$.
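As an illustration of the two steps, here is a sketch of EM for a bivariate normal sample with values missing in one coordinate. This is a simplified stand-in for the study's actual multivariate setting; the function name, the bivariate restriction, and the fixed iteration count are all assumptions of the sketch:

```python
import numpy as np

def em_bivariate(X, n_iter=50):
    """EM for a bivariate normal with values missing in column 0.
    E-step: replace each missing x0 by its conditional expectation given x1,
    and account for the conditional variance in the second moment.
    M-step: re-estimate the mean vector and covariance matrix."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X[:, 0])
    obs = ~miss
    n = len(X)
    # initialize parameters from complete cases
    mu = X[obs].mean(axis=0)
    S = np.cov(X[obs], rowvar=False, bias=True)
    for _ in range(n_iter):
        # E-step: conditional mean of x0 given x1
        b = S[0, 1] / S[1, 1]
        X[miss, 0] = mu[0] + b * (X[miss, 1] - mu[1])
        cvar = S[0, 0] - S[0, 1] ** 2 / S[1, 1]   # conditional variance of x0 | x1
        # M-step: update parameters from the completed data
        mu = X.mean(axis=0)
        S = np.cov(X, rowvar=False, bias=True)
        S[0, 0] += miss.sum() * cvar / n          # variance correction for imputed cells
    return X, mu, S
```

At convergence the missing entries hold their conditional expectations given the observed coordinate.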
2.5 Imputation Accuracy Measure

Imputation accuracy is measured by the root mean squared error:

$$RMSE = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{y}_t - y_t\right)^2}$$

where $\hat{y}_t$ represents the estimated (modelled) value and $y_t$ the true value for a particular missing date.
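The accuracy measure is straightforward to implement; an illustrative Python equivalent:

```python
import numpy as np

def rmse(estimated, true):
    """Root mean squared error between imputed and true values."""
    e = np.asarray(estimated, dtype=float)
    t = np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((e - t) ** 2)))
```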
3 Data
In order to have a proper check of the imputation methods, the study started from a complete data set, from which some data points were deleted at random, while keeping a record of the deleted data. In this way it was possible to observe the difference between the estimated (modelled) missing values and the true values for particular missing dates.
The test was done on the non-euro currency time series of PLN (Poland), SKK (Slovakia), GBP (UK) and RUB (Russia). Poland, Slovakia and the UK are EU Member States that use their own currency and have not adopted the euro, while Russia is not an EU member country.
The chosen period runs from 08/10/2001 to 31/12/2005, with 1103 yield observations (1102 returns) for 4 currencies (GBP, SKK, PLN and RUB) and 4 yield terms for each currency (1M, 3M, 6M, 1Y).
This provides sufficiently long time-series, diversity in the level of correlation across the 4 term series, and variety in the level of autocorrelation within each particular time series.
It should be noted that the deletion was first made in the yields data set, and the corresponding data points were then deleted from the returns data set (one deleted yield removes two return data points).
So, from each series of 1102 observations (returns), the same number of data points (78) was deleted for every currency, on the same dates, in the same time series (3M), in order to avoid any bias in the conclusions about the efficiency of the imputation methods.
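The deletion procedure described above might be sketched as follows (a Python stand-in for the study's spreadsheet workflow; the function name and the fixed seed are assumptions for reproducibility):

```python
import numpy as np

def delete_at_random(series, n_missing, seed=0):
    """Blank out n_missing points at random, keeping a record of the true values."""
    rng = np.random.default_rng(seed)
    s = np.asarray(series, dtype=float).copy()
    idx = rng.choice(len(s), size=n_missing, replace=False)
    record = {int(i): float(s[i]) for i in idx}  # positions -> deleted true values
    s[idx] = np.nan
    return s, record
```

The record allows the later comparison of imputed values against true values on exactly the deleted dates.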
4 Application of methodology
Interpolation of missing data is done using the Matlab function fillts, which fills missing values in a time series using linear, cubic and spline methods. Interpolation was applied both to the yields directly and to the returns, in order to check which approach gives better imputation accuracy.
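A Python analogue of the linear case of fillts (the function name is assumed; the paper's computations were done in Matlab):

```python
import numpy as np

def interpolate_linear(series):
    """Fill NaNs by linear interpolation against the observation index."""
    s = np.asarray(series, dtype=float).copy()
    t = np.arange(len(s))
    miss = np.isnan(s)
    s[miss] = np.interp(t[miss], t[~miss], s[~miss])
    return s
```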
Regression was performed using the appropriate Excel add-in, on all three non-missing time-series and the missing one, in order to exploit the relationship between the 1M maturity and each of the other three time series (3M, 6M, 1Y).
PCA was done using Excel's common and add-in matrix functions. Before performing PCA, the input data should be stationary. Therefore, the yield data history is not used directly; instead, differences of yields are employed. Furthermore, the stationary data are normalized before the analysis; otherwise the first principal component might be dominated by the variable with the highest volatility.
After the data have been made stationary and normalized, the covariance matrix V is calculated. From this matrix V, the matrix W of eigenvectors is computed¹. Then equation (4) is used, and the principal components are calculated by multiplying the standardized data matrix X with the transpose of the matrix of eigenvectors W.
Suppose that the yield term with missing observations is X1, and assume that X2, ..., X4 are the correlated yield series. PCA is used to fill the missing data according to the following schedule:

step 1: A PCA is performed on X1, ..., X4, using only the dates where the dataset is complete, to obtain the principal components and the factor weights w11, w12, w13, w14. The choice of the number of principal components depends on how highly correlated the system is, with the restriction that it should be less than the number of variables (<4).

step 2: One more PCA is performed on X2, ..., X4, using the data from all the dates on these variables. The number of principal components (3) is the same as the number of variables.

step 3: Using the first 3 factor weights from step 1 (w11, w12, w13) and the principal components from step 2 (P1, P2, P3), the data on X1 is remodelled according to the equation:

X1 = w11 P1 + w12 P2 + w13 P3

step 4: The values obtained by the above equation are standardized, and are therefore multiplied by their standard deviation and added to their mean to transform them back into un-standardized terms.

step 5: The final step is the calibration of the model. The true data on X1 for the missing dates, for which a record is kept, is compared with the modelled data obtained from step 4. As expected, there is a difference between the original and the modelled data. By choosing an appropriate mean and standard deviation, the standard error between the observed original and modelled data (for non-missing dates) is minimized, while simultaneously trying to minimize the standard error between the missing and modelled data (for missing dates).
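Steps 1-4 amount to projecting the standardized data onto a few leading principal components and mapping back. A compact Python sketch of that projection (a generic PCA reconstruction, not the author's exact Excel procedure; the function name is illustrative):

```python
import numpy as np

def pca_reconstruct(X, n_comp):
    """Rebuild X from its first n_comp principal components."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std                      # standardize
    V = np.cov(Z, rowvar=False)               # covariance matrix V
    eigval, W = np.linalg.eigh(V)             # eigh returns ascending eigenvalues
    W = W[:, ::-1][:, :n_comp]                # keep leading eigenvectors (weights)
    P = Z @ W                                 # principal components
    Z_hat = P @ W.T                           # remodelled standardized data
    return Z_hat * std + mean                 # un-standardize (step 4)
```

With all components retained, the reconstruction is exact; dropping components gives the smoothed version used to fill the missing series.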
The optimization problem was solved using the Excel Solver add-in. One might expect that minimizing the observed error would always minimize the unobserved error too; however, there was a case (the PLN currency) where, after minimization, the standard error of the unobserved values increased.
¹ The function MEigenvecPow returns all eigenvectors of a given matrix. Optional parameters: IterMax sets the maximum number of iterations allowed (default 1000); Norm, if TRUE, makes the function return normalized vectors |v| = 1 (default FALSE).
A possible justification for the calibration procedure is that the observed parameters (mean and standard deviation) are estimated sample parameters, and they can differ from the true population parameters. Therefore, there is some freedom in choosing the parameters, which in turn enables this optimization procedure of minimizing the standard error. It must be noted that this freedom should not be misused, so there must be some constraints on the choice of mean and standard deviation, which keep them reasonably close to the observed ones.
In order to impose reasonable constraints on the parameters, a distribution of dynamic (conditional) means and standard deviations was created, from which the minimum and maximum tails were then estimated using Generalized Extreme Value² (GEV) theory.
For the creation of the dynamic mean/standard deviation distributions, the complete dataset for the observed dates was used with a rolling window of 25 data points. The length of the rolling window was chosen so as to create a smooth distribution of block maxima (or minima)³ (at least visually).
From these series, using an appropriate rolling window of 20 data points, the maximum/minimum values were taken and the distribution of block maxima (or minima) was created. Finally, Matlab's GEV fitting procedure was used to fit the distribution and estimate its parameters.
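The block-maxima construction can be sketched as follows (non-overlapping 20-point blocks are assumed in this illustration; the GEV fit itself was done with Matlab's fitting routine and is omitted here):

```python
import numpy as np

def block_maxima(series, window=20):
    """Maximum of each consecutive window-sized block of the series."""
    s = np.asarray(series, dtype=float)
    n_blocks = len(s) // window
    return np.array([s[i * window:(i + 1) * window].max() for i in range(n_blocks)])
```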
The distribution function of the GEV is given by

$$H_k(x) = \exp\!\left(-\left(1 + k\,\frac{x-\mu}{\sigma}\right)^{-1/k}\right), \quad k \neq 0$$

$$H_k(x) = \exp\!\left(-e^{-(x-\mu)/\sigma}\right), \quad k = 0$$
² The GEV can be defined constructively as the limiting distribution of block maxima (or minima). That is, if you generate a large number of independent random values from a single probability distribution and take their maximum value, the distribution of that maximum is approximately a GEV.
³ The original distribution determines the shape parameter, k, of the resulting GEV distribution. Distributions whose tails fall off as a polynomial, such as Student's t, lead to a positive shape parameter. Distributions whose tails decrease exponentially, such as the normal, correspond to a zero shape parameter. Distributions with finite tails, such as the beta, correspond to a negative shape parameter.
4
Code is created by A. Meucci (2005)
5 Results
To sum up, two different data structures were used, univariate and multivariate, each approached with different methods of filling missing values.
Missing values in the univariate series were filled with three interpolation methods (linear, cubic, spline) and by using the EM algorithm on a specially constructed multivariate series (series of lagged values of the univariate series were added to the initial one). This special case was based on the assumption of significant autocorrelation within the univariate time series.
Missing values in the multivariate series were filled by using regression analysis, the EM algorithm and PCA. These methods are based on the assumed correlation between the time series of different yield terms.
All the methods were used for the four currencies, and the RMSE was calculated to gauge the imputation accuracy of the different filling methods. It should be noted that it is wrong to compare accuracy across currencies within one method, since that compares across different scales: the four currencies have different absolute values, and returns are calculated as the difference between two consecutive yields (an absolute, not a relative, measure).
The results are presented below, by currency and by method:
Table 1
Standard error between true and estimated missing values, by currency and method

Method                        GBP       PLN       RUB       SKK
Interpolation (yields)
  linear                    0.012634  0.109914  0.367476  0.061995
  cubic                     0.012634  0.119677  0.361841  0.062694
  spline                    0.013450  0.140475  0.427538  0.081617
Interpolation (returns)
  linear                    0.022444  0.224596  0.696896  0.135018
  cubic                     0.022883  0.224950  0.723855  0.144017
  spline                    0.030971  0.266227  0.988194  0.211207
Regression analysis
  3M beta                   0.014285  0.147775  0.311836  0.086751
  6M beta                   0.018383  0.176477  0.348184  0.087307
  1Y beta                   0.020128  0.185126  0.357611  0.090837
EM
  univariate                0.276059  0.189223  0.528138  0.094536
  multivariate              0.271685  0.145121  0.314370  0.084691
PCA
  uncalibrated              0.014334  0.133532  0.317607  0.114076
  calibrated                0.014184  0.145507  0.307854  0.084888
Table 2
Correlation between different yield term series, by currency

GBP     1M    3M    6M        PLN     1M    3M    6M
3M     0.74                   3M     0.51
6M     0.47  0.80             6M     0.38  0.45
1Y     0.32  0.61  0.92       1Y     0.29  0.42  0.46

RUB     1M    3M    6M        SKK     1M    3M    6M
3M     0.82                   3M     0.37
6M     0.73  0.90             6M     0.28  0.53
1Y     0.73  0.89  0.97       1Y     0.25  0.46  0.60
Table 3
Autocorrelation within yield term series, by currency and extent of lag

Lag 1                          GBP            PLN            RUB            SKK
autocorrelation coefficient    0.03434        0.00796        0.17289       -0.23805
indicator                      insignificant  insignificant  significant    significant
standard error                 0.03367        0.03267        0.03402        0.03236
t-stat                         1.02003        0.24367        5.08169       -7.35592
t-inv                          1.96272        1.96271        1.96271        1.96271

Lag 2                          GBP            PLN            RUB            SKK
autocorrelation coefficient   -0.02466       -0.14814       -0.00218        0.11565
indicator                      insignificant  significant    insignificant  significant
standard error                 0.03396        0.03644        0.03754        0.03539
t-stat                        -0.72615       -4.06547       -0.05804        3.26796
t-inv                          1.96299        1.96299        1.96298        1.96299
As can be seen in Table 1, linear interpolation applied directly to the yields outperformed the other methodologies for PLN and SKK, has similar results to the multivariate methods for GBP, and is worse than the multivariate methods for RUB. The worst results came from spline interpolation applied to the returns, for all currencies except GBP, where univariate EM is the worst.
Hence, it can be concluded that the accuracy of the imputation depends on the strength of the correlation within the system. As the correlation within the system increases, the multivariate methods give better results at the expense of naïve interpolation. As can be seen from Table 2, the correlation is strongest for the RUB time series, where the multivariate methods gave the best results.
The outperformance of a naïve approach such as linear interpolation for univariate series might speak to the quality of the data and of the market the data comes from. This confirms the intuition that in illiquid markets, market returns exhibit autocorrelation and follow an interpolated pattern.
It is also worth mentioning the imputation accuracy of the EM algorithm used on lagged values of the univariate series. From Table 3 it can be noticed that the accuracy of the imputation depends on the strength of the autocorrelation: as the autocorrelation increases, the imputation accuracy improves. Hence, the worst results are for GBP, where there is no significant autocorrelation between the univariate series and the series of its lagged values (neither lagged by 1 day nor by 2 days). The results are better when the autocorrelation increases (which is connected with illiquid and undeveloped markets). Therefore, for SKK, where there is significant autocorrelation of first order (lagged by 1 day) and also of second order (lagged by 2 days), univariate EM works reasonably well and its imputation accuracy is very close to the other methods.
Within the class of multivariate series, PCA outperformed the EM algorithm and regression analysis for all currencies, while regression analysis shows the worst results for all currencies except GBP, where EM is the worst. This suggests that PCA and EM avoided the problem of multicollinearity in the system and the ambiguous decision about the choice of explanatory variables. As yield data is highly correlated, PCA is intuitively an appropriate methodology for filling missing observations.
6 References
Alexander, C. (2001). Market Models: A Guide to Financial Data Analysis. Wiley and Sons Ltd, Chichester, UK, pp. 175-178.
Borman, S. (2006). The Expectation Maximization Algorithm: A Short Tutorial.
Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1), 1-38.
Neftci, S. (2000). Value at Risk Calculations, Extreme Events, and Tail Estimation. Journal of Derivatives.