A survey on multi-output regression

Hanen Borchani,1 Gherardo Varando,2 Concha Bielza2 and Pedro Larrañaga2
In recent years, a plethora of approaches have been proposed to deal with the increasingly challenging task of multi-output regression. This study provides a survey of state-of-the-art multi-output regression methods, which are categorized as problem transformation and algorithm adaptation methods. In addition, we present the most commonly used performance evaluation measures, publicly available data sets for real-world multi-output regression problems, as well as open-source software frameworks. © 2015 John Wiley & Sons, Ltd.
INTRODUCTION
MULTI-OUTPUT REGRESSION
Let us consider the training data set $D$ of $N$ instances, containing a value assignment for each variable $X_1,\dots,X_m, Y_1,\dots,Y_d$, i.e., $D = \{(\mathbf{x}^{(1)}, \mathbf{y}^{(1)}),\dots,(\mathbf{x}^{(N)}, \mathbf{y}^{(N)})\}$. Each instance is characterized by an input vector of $m$ descriptive or predictive variables $\mathbf{x}^{(l)} = \left(x_1^{(l)},\dots,x_j^{(l)},\dots,x_m^{(l)}\right)$ and an output vector of $d$ target variables $\mathbf{y}^{(l)} = \left(y_1^{(l)},\dots,y_i^{(l)},\dots,y_d^{(l)}\right)$, with $i \in \{1,\dots,d\}$, $j \in \{1,\dots,m\}$, and $l \in \{1,\dots,N\}$. The task is to learn a multi-target regression model from $D$ consisting of finding a function $h$ that assigns to each instance, given by the vector $\mathbf{x}$, a vector $\mathbf{y}$ of $d$ target values:

$$h: \mathcal{X}_1 \times \dots \times \mathcal{X}_m \longrightarrow \mathcal{Y}_1 \times \dots \times \mathcal{Y}_d$$
$$\mathbf{x} = (x_1,\dots,x_m) \longmapsto \mathbf{y} = (y_1,\dots,y_d),$$

where $\mathcal{X}_j$ and $\mathcal{Y}_i$ denote the sample spaces of each predictive variable $X_j$, for all $j \in \{1,\dots,m\}$, and of each target variable $Y_i$, for all $i \in \{1,\dots,d\}$, respectively. Note that all target variables are considered to be continuous here. The learned multi-target model will be used afterward to simultaneously predict the values $\left\{\hat y_1^{(N+1)},\dots,\hat y_d^{(N+1)}\right\}$ of all target variables of the new incoming instances.
PROBLEM TRANSFORMATION METHODS
Single-Target Method
In the baseline ST method,4 a multi-target model is comprised of $d$ single-target models, each trained on a transformed training set $D_i = \left\{\left(\mathbf{x}^{(1)}, y_i^{(1)}\right),\dots,\left(\mathbf{x}^{(N)}, y_i^{(N)}\right)\right\}$, $i \in \{1,\dots,d\}$, to predict the value of a single target variable $Y_i$.
In this way, the target variables are predicted independently and potential relationships between them
cannot be exploited. The ST method is also known as
binary relevance in the literature.13
As the multi-target prediction problem is transformed into several single-target problems, any off-the-shelf ST regression algorithm can be used. For
instance, Spyromitros-Xioufis et al.4 used four
well-known regression algorithms, namely, ridge
regression,25 SVR machines,26 regression trees,27 and
stochastic gradient boosting.28
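As a concrete illustration, the following R sketch (ours, not code from the cited studies) fits the ST baseline by training one independent regressor per target; lm is used here only as a stand-in for any of the base learners mentioned above.

# Single-target (ST) baseline: one independent model per target.
# X: data.frame of m inputs; Y: matrix or data.frame of d numeric targets.
st_fit <- function(X, Y) {
  lapply(seq_len(ncol(Y)), function(i) {
    lm(y ~ ., data = cbind(y = Y[, i], X))   # any base regressor could be plugged in here
  })
}

st_predict <- function(models, Xnew) {
  sapply(models, function(m) predict(m, newdata = Xnew))   # Ntest x d matrix of predictions
}

# Small synthetic example (3 inputs, 2 targets)
set.seed(1)
X <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y <- cbind(y1 = X$x1 + rnorm(100, sd = 0.1), y2 = X$x2 - X$x3 + rnorm(100, sd = 0.1))
models <- st_fit(X, Y)
Yhat <- st_predict(models, X)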
Moreover, Hoerl and Kennard25 proposed the separate ridge regression method to deal with multi-variate regression problems. It consists of performing a separate ridge regression of each individual target $Y_i$ on the predictor variables $X = (X_1,\dots,X_m)$. The regression coefficient estimates $\hat a_{ij}$, with $i \in \{1,\dots,d\}$ and $j \in \{1,\dots,m\}$, are the solution to a penalized least squares criterion:

$$\left\{\hat a_{ij}\right\}_{j=1}^{m} = \arg\min_{\{a_j\}_{j=1}^{m}} \sum_{l=1}^{N}\left[y_i^{(l)} - \sum_{j=1}^{m} a_j x_j^{(l)}\right]^2 + \lambda_i \sum_{j=1}^{m} a_j^2, \quad i \in \{1,\dots,d\},$$

where $\lambda_i \ge 0$ is the ridge penalty for target $Y_i$.
Multi-Target Regressor Stacking

The multi-target regressor stacking (MTRS) method4 trains its models in two stages: a first stage of $d$ single-target models, followed by a second stage in which a set of $d$ meta-models is learned, one for each target $Y_i$. Each meta-model is trained on a transformed training set $D_i^* = \left\{\left(\mathbf{x}^{*(1)}, y_i^{(1)}\right),\dots,\left(\mathbf{x}^{*(N)}, y_i^{(N)}\right)\right\}$, where $\mathbf{x}^{*(l)} = \left(x_1^{(l)},\dots,x_m^{(l)}, \hat y_1^{(l)},\dots,\hat y_d^{(l)}\right)$ is a transformed input vector consisting of the original input vector of the training set augmented by predictions (or estimates) of the target variables
yielded by the first-stage models. In fact, MTRS is
based on the idea that a second-stage model is able to
correct the prediction of a first-stage model by using
information about the predictions of other first-stage
models.
The predictions for a new instance $\mathbf{x}^{(N+1)}$ are obtained by first applying the first-stage models to produce the estimated output vector $\hat{\mathbf{y}}^{(N+1)} = \left(\hat y_1^{(N+1)},\dots,\hat y_d^{(N+1)}\right)$, then applying the second-stage models on the transformed input vector $\mathbf{x}^{*(N+1)} = \left(x_1^{(N+1)},\dots,x_m^{(N+1)}, \hat y_1^{(N+1)},\dots,\hat y_d^{(N+1)}\right)$ to produce the final estimated multi-output targets $\hat{\hat{\mathbf{y}}}^{(N+1)} = \left(\hat{\hat y}_1^{(N+1)},\dots,\hat{\hat y}_d^{(N+1)}\right)$.
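A minimal R sketch of this two-stage scheme (illustrative only, using lm as the base learner and in-sample first-stage predictions rather than the cross-validated variant):

# Multi-target regressor stacking (MTRS), illustrative sketch.
# Stage 1: d independent models; Stage 2: d meta-models trained on the
# original inputs augmented with the stage-1 predictions.
mtrs_fit <- function(X, Y) {
  stage1 <- lapply(seq_len(ncol(Y)), function(i)
    lm(y ~ ., data = cbind(y = Y[, i], X)))
  Yhat1 <- sapply(stage1, function(m) predict(m, newdata = X))
  colnames(Yhat1) <- paste0("yhat", seq_len(ncol(Y)))
  Xaug <- cbind(X, as.data.frame(Yhat1))
  stage2 <- lapply(seq_len(ncol(Y)), function(i)
    lm(y ~ ., data = cbind(y = Y[, i], Xaug)))
  list(stage1 = stage1, stage2 = stage2, d = ncol(Y))
}

mtrs_predict <- function(fit, Xnew) {
  Yhat1 <- sapply(fit$stage1, function(m) predict(m, newdata = Xnew))
  colnames(Yhat1) <- paste0("yhat", seq_len(fit$d))
  Xaug <- cbind(Xnew, as.data.frame(Yhat1))
  sapply(fit$stage2, function(m) predict(m, newdata = Xaug))
}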
Regressor Chains
The RC method4 is inspired by the recent multi-label
chain classifiers.31 RC is another problem transformation method, based on the idea of chaining
single-target models. The training of RC consists of
selecting a random chain (i.e., permutation) of the set
of target variables, then building a separate regression model for each target following the order of the
selected chain.
Assuming that the ordered set or the full chain $C = (Y_1, Y_2,\dots,Y_d)$ is selected, the first model is only concerned with the prediction of $Y_1$. Then, subsequent models for $Y_i$, s.t. $i > 1$, are trained on the transformed data sets $D_i^* = \left\{\left(\mathbf{x}_i^{*(1)}, y_i^{(1)}\right),\dots,\left(\mathbf{x}_i^{*(N)}, y_i^{(N)}\right)\right\}$, where $\mathbf{x}_i^{*(l)} = \left(x_1^{(l)},\dots,x_m^{(l)}, y_1^{(l)},\dots,y_{i-1}^{(l)}\right)$ is a transformed input vector consisting of the original input vector of the training set augmented by the actual values of all previous targets in the chain.
Spyromitros-Xioufis et al.4 then introduced the
regressor chain corrected (RCC) method that uses
cross-validation estimates instead of the actual values
in the data transformation step.
However, the main problem with the RC and RCC methods is that they are sensitive to the selected chain ordering. To avoid this problem, and similarly to Ref 31, Spyromitros-Xioufis et al.4 proposed a set of regression chain models with differently ordered chains: if
the number of distinct chains was less than 10, they
created exactly as many models as the number of distinct label chains; otherwise, they selected 10 chains
randomly. The resulting approaches are called ensemble of regressor chains (ERC) and ensemble of regressor chains corrected (ERCC).
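The chaining idea itself is short to write down; the following R sketch (ours, with lm and a fixed chain order passed as an argument) builds one such chain. At prediction time the previously predicted targets are fed forward, since the actual values are unavailable for new instances.

# Regressor chain (RC), illustrative sketch with a fixed chain order.
rc_fit <- function(X, Y, chain = seq_len(ncol(Y))) {
  Xaug <- X
  models <- list()
  for (i in chain) {
    models[[length(models) + 1]] <- lm(y ~ ., data = cbind(y = Y[, i], Xaug))
    # augment the inputs with the actual values of the target just modeled
    Xaug <- cbind(Xaug, setNames(data.frame(Y[, i]), paste0("prev", i)))
  }
  list(models = models, chain = chain)
}

rc_predict <- function(fit, Xnew) {
  Xaug <- Xnew
  preds <- matrix(NA_real_, nrow(Xnew), length(fit$chain))
  for (k in seq_along(fit$chain)) {
    i <- fit$chain[k]
    preds[, k] <- predict(fit$models[[k]], newdata = Xaug)
    # feed the prediction forward under the same column name used in training
    Xaug <- cbind(Xaug, setNames(data.frame(preds[, k]), paste0("prev", i)))
  }
  colnames(preds) <- paste0("y", fit$chain)
  preds
}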
ALGORITHM ADAPTATION METHODS
Statistical Methods
The statistical approaches can be considered the first attempt to deal with simultaneously predicting multiple real-valued targets. They aim to take advantage
of correlations between the target variables in order
to improve predictive accuracy compared with the
traditional procedure of doing individual regressions
of each target variable on the common set of predictor
variables.
Izenman33 proposed reduced-rank regression, which places a rank constraint on the matrix of estimated regression coefficients. Consider the following regression model:

$$y_i = \sum_{j=1}^{m} a_{ij} x_j + \epsilon_i, \quad i \in \{1,\dots,d\},$$

where $\mathrm{rank}(A) = r$, $A = (a_{ij})$ is the $d \times m$ coefficient matrix, and the error vector $\boldsymbol\epsilon = (\epsilon_1,\dots,\epsilon_d)^T$ has covariance matrix $\Gamma = E\left(\boldsymbol\epsilon\boldsymbol\epsilon^T\right)$. The above equation is then solved as $\hat A^{(r)} = B_r \hat A$, where $\hat A$ ($d \times m$) is the matrix of the ordinary least squares (OLS) estimates and the reduced-rank shrinking matrix $B_r$ ($d \times d$) is given by

$$B_r = T^{-1} I_r T,$$

where $I_r = \mathrm{diag}\left\{\mathbb{1}(i \le r)\right\}_{i=1}^{d}$ and $T$ is the canonical co-ordinate matrix that seeks to maximize the correlation between the $d$-vector $\mathbf{y}$ and the $m$-vector $\mathbf{x}$.
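As an illustration, the following R sketch computes a simplified reduced-rank estimate under the additional assumption of an identity error covariance, in which case the canonical transformation reduces to projecting the OLS fit onto its leading singular directions; it is not the exact estimator of Ref 33.

# Simplified reduced-rank regression sketch (identity error covariance assumed):
# project the OLS fit onto its r leading singular directions.
rrr_fit <- function(X, Y, r) {
  A_ols <- solve(crossprod(X), crossprod(X, Y))    # m x d OLS coefficient matrix
  V <- svd(X %*% A_ols)$v                          # right singular vectors of the fitted values
  Vr <- V[, seq_len(r), drop = FALSE]
  A_ols %*% Vr %*% t(Vr)                           # rank-r coefficient matrix (m x d)
}

# Synthetic check: a true rank-1 relation between 5 inputs and 3 targets
set.seed(2)
X <- matrix(rnorm(200 * 5), 200, 5)
B <- matrix(rnorm(5), 5, 1) %*% matrix(rnorm(3), 1, 3)
Y <- X %*% B + matrix(rnorm(200 * 3, sd = 0.1), 200, 3)
A_r <- rrr_fit(X, Y, r = 1)
qr(A_r)$rank   # rank 1, as imposed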
Later, Brown and Zidek7 presented a multi-variate version of the Hoerl-Kennard ridge regression rule and proposed the estimator $\hat{\boldsymbol\beta}^*(K)$:

$$\hat{\boldsymbol\beta}^*(K) = \left(X^T X \otimes I_d + I_m \otimes K\right)^{-1}\left(X^T X \otimes I_d\right)\hat{\boldsymbol\beta}^*,$$

where $K$ ($d \times d$) $> 0$ is the ridge matrix, $\otimes$ denotes the usual Kronecker product, and $\boldsymbol\beta^*$, $\hat{\boldsymbol\beta}^*$ are $(md \times 1)$ vectors of estimators of $\boldsymbol\beta^* = (\boldsymbol\beta_1,\dots,\boldsymbol\beta_m)^T$, where $\boldsymbol\beta_1,\dots,\boldsymbol\beta_m$ are each $(1 \times d)$ row vectors of $\boldsymbol\beta$. $\hat{\boldsymbol\beta}^*$ represents the maximum likelihood estimator of $\boldsymbol\beta^*$ corresponding to $K = 0$.
Furthermore, van der Merwe and Zidek34 introduced the filtered canonical y-variate regression (FICYREG) method, defined as a generalization to the multi-variate regression problem of the James-Stein estimator. The estimated coefficient matrix $\hat C$ ($d \times m$) takes the form

$$\hat C = \hat B \hat A,$$

where $\hat A$ is the matrix of OLS estimates and $\hat B$ is a shrinking matrix built from canonical shrinkage factors $f_i$ that depend on the squared sample canonical correlations $c_i^2$ and on the sample size $N$, and are truncated at zero, i.e., $f_i \leftarrow \max\left\{0, f_i\right\}$.
In addition, one of the most prominent approaches for dealing with the multi-output regression problem is the curds and whey (C&W) method proposed by Breiman and Friedman in Ref 6. Basically, given $d$ targets $\mathbf{y} = (y_1,\dots,y_d)^T$ with separate least squares regressions $\hat{\mathbf{y}} = (\hat y_1,\dots,\hat y_d)^T$, where $\bar{\mathbf{y}}$ and $\bar{\mathbf{x}}$ are the sample means of $\mathbf{y}$ and $\mathbf{x}$, respectively, a more accurate predictor $\tilde y_i$ of each $y_i$ is obtained using a linear combination

$$\tilde y_i = \bar y_i + \sum_{k=1}^{d} b_{ik}\left(\hat y_k - \bar y_k\right), \quad i \in \{1,\dots,d\}.$$
Similä and Tikka10 presented a simultaneous variable selection method, L2-SVS, in which the importance of an input in the model is measured by the L2-norm of the regression coefficients associated with the input. To solve the L2-SVS, $W$, the $m \times d$ matrix of regression coefficients, is estimated by minimizing the error sum of squares subject to a sparsity constraint as follows:
$$\min_{W} f(W) = \frac{1}{2}\left\|Y - XW\right\|_F^2 \quad \text{subject to} \quad \sum_{j=1}^{m}\left\|\mathbf{w}_j\right\|_2 \le r,$$

where the subscript $F$ denotes the Frobenius norm, i.e., $\|B\|_F^2 = \sum_{ij} b_{ij}^2$. The factor $\left\|\mathbf{w}_j\right\|_2$ is a measure of the
importance of the jth input in the model, and r is a free
parameter that controls the amount of shrinkage that
is applied to the estimate.
If the value of $r \ge 0$ is large enough, the optimal
W is equal to the OLS solution, whereas small values
of r impose a row-sparse structure on W, which
means that only some of the inputs are effective in the
estimate.
Abraham et al.35 coupled linear regressions and quantile mapping to both minimize the residual errors and capture the joint (including nonlinear) relationships among variables. The method was tested on
bivariate and trivariate output spaces showing that it
is able to reduce residual errors while keeping the joint
distribution of the output variables.
In the C&W method above, each separate least squares prediction takes the form

$$\hat y_i = \bar y_i + \sum_{j=1}^{m} \hat a_{ij}\left(x_j - \bar x_j\right), \quad \text{s.t.} \quad \left\{\hat a_{ij}\right\}_{j=1}^{m} = \arg\min_{\{a_j\}_{j=1}^{m}} \sum_{l=1}^{N}\left(y_i^{(l)} - \bar y_i - \sum_{j=1}^{m} a_j\left(x_j^{(l)} - \bar x_j\right)\right)^2.$$

Multi-Output Support Vector Regression
Traditionally, SVR is used with a single output variable. It aims to determine the mapping between the input vector $\mathbf{x}$ and the single output $y_i$ from a given training data set $D_i$, by finding the regressor $\mathbf{w}$ ($m \times 1$) and the bias term $b$ that minimize

$$\frac{1}{2}\left\|\mathbf{w}\right\|^2 + C \sum_{l=1}^{N} L\left(y^{(l)} - \phi\left(\mathbf{x}^{(l)}\right)^T \mathbf{w} - b\right),$$

where $\phi(\cdot)$ is a nonlinear transformation to a higher dimensional Hilbert space $\mathcal{H}$, and $C$ is a parameter chosen by the user that determines the trade-off between the regularization and the error reduction term, first and second addend, respectively. $L$ is the Vapnik $\epsilon$-insensitive loss function, which is equal to 0 for $\left|y^{(l)} - \left(\phi\left(\mathbf{x}^{(l)}\right)^T \mathbf{w} + b\right)\right| < \epsilon$ and to $\left|y^{(l)} - \left(\phi\left(\mathbf{x}^{(l)}\right)^T \mathbf{w} + b\right)\right| - \epsilon$ for $\left|y^{(l)} - \left(\phi\left(\mathbf{x}^{(l)}\right)^T \mathbf{w} + b\right)\right| \ge \epsilon$. The solution ($\mathbf{w}$ and $b$) is induced by a linear combination of the training instances in the transformed space with an absolute error equal to or greater than $\epsilon$.
Hence, in order to deal with the multi-output case, single-output SVR can easily be applied independently to each output (see Multi-Output Support Vector Regression in Problem Transformation Methods section). However, this exhibits the serious drawback of not taking into account the possible correlations between outputs, and several approaches have therefore extended SVR to the multi-output setting by jointly minimizing

$$\frac{1}{2}\sum_{i=1}^{d}\left\|\mathbf{w}_i\right\|^2 + C \sum_{l=1}^{N} L\left(\mathbf{y}^{(l)} - \phi\left(\mathbf{x}^{(l)}\right)^T W - \mathbf{b}\right),$$

where the $m \times d$ matrix $W = (\mathbf{w}_1, \mathbf{w}_2,\dots,\mathbf{w}_d)$ and $\mathbf{b} = (b_1, b_2,\dots,b_d)^T$.
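Before turning to these joint formulations, note that the independent baseline mentioned above is easy to set up in R; a sketch with the e1071 package (assumed to be installed; not a multi-output formulation):

library(e1071)  # provides svm() with eps-regression for numeric targets

# Independent single-output SVR per target: ignores correlations between outputs.
svr_st_fit <- function(X, Y, ...) {
  lapply(seq_len(ncol(Y)), function(i) svm(x = X, y = Y[, i], type = "eps-regression", ...))
}
svr_st_predict <- function(models, Xnew) {
  sapply(models, function(m) predict(m, Xnew))
}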
For instance, Vazquez and Walter36 extended
SVR by considering the so-called Cokriging37
method, which is a multi-output version of Kriging
that exploits the correlations due to the proximity in the space of factors and outputs. In this
way, with an appropriate choice of covariance
and cross-covariances models, the authors showed
that multi-output SVR yields better results than an
independent prediction of the outputs.
Sánchez-Fernández et al.19 introduced a generalization of SVR. The so-called multiregressor SVR
(M-SVR) is based on an iterative reweighted least
squares (IRWLS) procedure that iteratively estimates
the weights W and the bias parameters b until convergence, i.e., until reaching a stationary point where
there is no more improvement of the considered loss
function.
Similarly, Brudnak38 developed a vector-valued
SVR by extending the notions of the estimator,
loss function and regularization functional from
the scalar-valued case; and Tuia et al.18 proposed
a multi-output SVR method by extending the
single-output SVR to multiple outputs while maintaining the advantages of a sparse and compact
solution using a cost function. Later, Deger et al.39
adapted Tuia et al.'s18 approach to tackle the problem
of reflectance recovery from multispectral camera
output, and proved through their empirical results
that it has the advantages of being simpler and faster
to compute than a scalar-valued based method.
In Ref 40, Cai and Cherkassky described a
new methodology for regression problems, combining Vapnik's SVM+ regression method and the MTL
setting. SVM+, also known as learning with structured data, extends the standard SVM regression by
taking into account the group information available
in the training data. The SVM+ approach learns
a single regression model using information on all
groups, whereas the proposed SVM+MTL approach
learns several related regression models, specifically
one model for each group.
In Ref 41, Liu et al. considered the output
space as a Riemannian submanifold to incorporate its
geometric structure into the regression process, and
they proposed a locally linear transformation (LLT)
mechanism to define the loss functions on the output manifold.
Kernel Methods
A study of vector-valued learning with kernel methods was started by Micchelli and Pontil,9 where they
analyzed the regularized least squares from the computational point of view. They also analyzed the theoretical aspects of reproducing kernel Hilbert spaces
(RKHS) in the range-space of the estimator, and they
generalized the representer theorem for Tikhonov regularization to the vector-valued setting.
Baldassarre et al.44 later studied a class of
regularized kernel methods for multi-output learning which are based on filtering the spectrum of
the kernel matrix. They considered methods also
including Tikhonov regularization as a special case,
and alternatives such as vector-valued extensions of
squared loss function (L2) boosting and other iterative schemes.
Multi-Target Regression Trees

In multi-target regression trees, the heterogeneity of the instances falling in a node is measured by the sum of squared errors over all targets,

$$\sum_{l}\sum_{i=1}^{d}\left(y_i^{(l)} - \bar y_i\right)^2,$$

where $y_i^{(l)}$ denotes the value of the output variable $Y_i$ for the instance $l$, $\bar y_i$ denotes the mean of $Y_i$ in the node, and the outer sum runs over the instances in the node. Each split is selected to minimize this sum of squared errors. Finally, each leaf of the tree can be characterized by the multi-variate mean of its instances, the number of instances at the leaf, and its variance.
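A sketch of this split criterion in R (illustrative only; real implementations such as predictive clustering trees add many refinements):

# Within-node multi-target sum of squared errors: squared deviations
# from the per-target node means, summed over all targets.
node_sse <- function(Y) {
  if (nrow(Y) == 0) return(0)
  sum(sweep(Y, 2, colMeans(Y))^2)
}

# Score a candidate split of input j at threshold t: total SSE of the two children.
split_score <- function(X, Y, j, t) {
  left <- X[, j] <= t
  node_sse(Y[left, , drop = FALSE]) + node_sse(Y[!left, , drop = FALSE])
}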
Rule Methods
Aho et al.58 presented a new method for learning
rule ensembles for multi-target regression problems
and simultaneously predicting multiple numeric target
attributes. The so-called FItted Rule Ensemble (FIRE)
algorithm transcribes an ensemble of regression trees
into a large collection of rules, then an optimization
procedure is used to select the best (and much smaller)
subset of these rules and determine their respective
weights.
The prediction of the resulting rule ensemble for an instance $\mathbf{x}$ combines a constant baseline vector $\mathbf{avg}$ (the average values of the targets), the rules $\mathbf{r}_k(\mathbf{x})$, and per-target linear terms:

$$\hat{\mathbf{y}} = f(\mathbf{x}) = w_0\,\mathbf{avg} + \sum_{k} w_k\,\mathbf{r}_k(\mathbf{x}) + \sum_{i=1}^{d}\sum_{j=1}^{m} w_{ij}\,\mathbf{x}_{ij},$$

where $\mathbf{x}_{ij} = \left(0,\dots,0, x_j, 0,\dots,0\right)$ is a $d$-dimensional vector whose $i$th component equals $x_j$ and whose remaining components are zero.
Finally, the values of all weights $w_{ij}$ are also determined using a gradient-directed optimization algorithm that depends on a gradient threshold. Thus, the optimization procedure is repeated using different values of this threshold in order to find a set of weights with the smallest validation error.
Table 1 summarizes the reviewed multi-output
regression algorithms.
DISCUSSION
Note that, even though the ST method is a simple
approach, it does not imply simpler models. In fact,
exploiting relationships among the output variables
could be used to improve the precision or reduce
computational costs as explained in what follows.
First, let us point out that some transformation
algorithms fail to properly exploit the multi-output
relationships, and therefore they may be considered as
ST methods. For instance, this is the case of RC using
linear regression as base models, namely, OLS or ridge
estimators of the coefficients.
Table 1. Summary of the reviewed multi-output regression approaches, with their main references and years.

Problem transformation methods:
Single target: Spyromitros-Xioufis et al.4 (2012)
Random linear target combinations: Tsoumakas et al.5 (2014)
Separate ridge regression: Hoerl and Kennard25 (1970)
Multi-target regressor stacking: Spyromitros-Xioufis et al.4 (2012)
Regressor chains: Spyromitros-Xioufis et al.4 (2012)
Multi-output SVR: Zhang et al.32 (2012)

Algorithm adaptation methods:
Statistical methods: Izenman33 (1975), van der Merwe and Zidek34 (1980), Brown and Zidek7 (1980), Breiman and Friedman6 (1997), Similä and Tikka10 (2007), Abraham et al.35 (2013)
Multi-output SVR: Brudnak38 (2006), Cai et al.40 (2009), Deger et al.39 (2012), Han et al.17 (2012), Liu et al.41 (2009), Sánchez et al.19 (2004), Tuia et al.18 (2011), Vazquez and Walter36 (2003), Xu et al.43 (2013)
Kernel methods: Baldassarre et al.44 (2012), Evgeniou and Pontil45 (2004), Evgeniou et al.46 (2005), Micchelli and Pontil9 (2005), Álvarez et al.47 (2012)
Multi-target regression trees: De'ath49 (2002), Appice and Džeroski2 (2007), Kocev et al.3 (2009), Kocev et al.52 (2012), Struyf and Džeroski48 (2006), Ikonomovska et al.54 (2011), Stojanova et al.55 (2012), Appice et al.56 (2014), Levatić et al.57 (2014)
Rule methods: Aho et al.58 (2009), Aho et al.1 (2012)
Predictive Performance
Considering the models predictive performance as a
comparison criterion, the benefits of using MTRS and
RC (or ERC and the corrected versions) instead of
the baseline ST approach are not so clear. In fact,
in Ref 4, an extensive empirical comparison of these
methods is presented, and the results show that ST
methods outperform several variants of MTRS and
ERC. This fact is especially notable in the straightforward applications. In particular, the benefits of MTRS
and RC methods appear to derive uniquely from the
randomization process (e.g., due to the order of the
chain) and from the ensemble model (e.g., ERC).
Statistical methods could notably improve the performance with respect to a baseline ST regression, but only if specific assumptions are fulfilled, i.e., a relation among the outputs truly exists.
Computational Complexity
For a large number of output variables, all problem
transformation methods face the challenging problems
of either solving a large number of ST problems
(e.g., ST, MTRS, and RC) or a single large problem
(e.g., LS-SVR32). Nevertheless, note that ST and some implementations of RC could be sped up in the training and/or prediction phases using parallel
computation (see Open Source Software Frameworks
section and Appendix B).
Using ST with kernel methods as base models may also lead to computing the same kernel over the same points more than once. In this case, it is computationally more efficient to consider multi-output kernels
and thus avoid redundant computations.
Multi-target regression trees and rule methods
are also designed to be more competitive from the
point of view of computational and memory complexity, especially compared to their ST counterparts.
PERFORMANCE EVALUATION MEASURES

In this section, we introduce the performance evaluation measures used to assess the behavior of learned models when applied to an unseen or test data set of size $N_{test}$, and thereby to assess the multi-output regression methods used for model induction. Let $\mathbf{y}^{(l)}$ and $\hat{\mathbf{y}}^{(l)}$ be the vectors of the actual and predicted outputs for $\mathbf{x}^{(l)}$, respectively, and let $\bar{\mathbf{y}}$ and $\bar{\hat{\mathbf{y}}}$ be the vectors of averages of the actual and predicted outputs, respectively. Besides measuring the computing times,1,3,17,52 the most commonly used evaluation measures for assessing multi-output regression models include:

The average correlation coefficient (aCC)3,43,52:

$$aCC = \frac{1}{d}\sum_{i=1}^{d} CC_i = \frac{1}{d}\sum_{i=1}^{d} \frac{\sum_{l=1}^{N_{test}}\left(y_i^{(l)} - \bar y_i\right)\left(\hat y_i^{(l)} - \bar{\hat y}_i\right)}{\sqrt{\sum_{l=1}^{N_{test}}\left(y_i^{(l)} - \bar y_i\right)^2 \sum_{l=1}^{N_{test}}\left(\hat y_i^{(l)} - \bar{\hat y}_i\right)^2}}$$

The average root-mean-squared error (aRMSE)3,17,18,39,52:

$$aRMSE = \frac{1}{d}\sum_{i=1}^{d} RMSE_i = \frac{1}{d}\sum_{i=1}^{d}\sqrt{\frac{\sum_{l=1}^{N_{test}}\left(y_i^{(l)} - \hat y_i^{(l)}\right)^2}{N_{test}}}$$

The average relative root-mean-squared error (aRRMSE)1,3,52:

$$aRRMSE = \frac{1}{d}\sum_{i=1}^{d} RRMSE_i = \frac{1}{d}\sum_{i=1}^{d}\sqrt{\frac{\sum_{l=1}^{N_{test}}\left(y_i^{(l)} - \hat y_i^{(l)}\right)^2}{\sum_{l=1}^{N_{test}}\left(y_i^{(l)} - \bar y_i\right)^2}}$$

Relative measures, such as the RRMSE, automatically re-scale the error contributions of each target variable, and hence there might be no need here to use an extra normalization operator.
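The following R helper (a sketch following the definitions above) computes aCC, aRMSE, and aRRMSE from matrices of actual and predicted outputs:

# Y, Yhat: Ntest x d matrices of actual and predicted target values.
multi_output_scores <- function(Y, Yhat) {
  d <- ncol(Y)
  acc    <- mean(sapply(seq_len(d), function(i) cor(Y[, i], Yhat[, i])))
  armse  <- mean(sapply(seq_len(d), function(i) sqrt(mean((Y[, i] - Yhat[, i])^2))))
  arrmse <- mean(sapply(seq_len(d), function(i)
    sqrt(sum((Y[, i] - Yhat[, i])^2) / sum((Y[, i] - mean(Y[, i]))^2))))
  c(aCC = acc, aRMSE = armse, aRRMSE = arrmse)
}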
DATA SETS
Despite the many interesting applications of
multi-target regression, there are only a few publicly
available data sets. There follows a brief description of those data sets, which are then summarized
in Table 2 including details about the number of
instances (represented as training/testing or total
number of instances/CV where cross-validation (CV)
is applied for the evaluation process), the number of
targets, and the number of features.
Solar Flare60: data set for predicting how often three potential types of solar flare, namely common, moderate, and severe (i.e., d = 3), occur in a 24-h period. The prediction is performed from
the input information of 10 feature variables
describing active regions on the sun.
Water Quality61 : data set for inferring chemical
from biological parameters of river water quality.
The data are provided by the Hydrometeorological Institute of Slovenia and cover the six-year
period from 1990 to 1995. It includes the measured values of 16 different chemical parameters
and 14 bioindicator taxa.
OES97 and OES104 : data gathered from the
annual Occupation Employment Survey compiled by the US Bureau of Labor Statistics for
the years 1997 (OES97) and 2010 (OES10). Each
row provides the estimated number of full-time
equivalent employees across many employment
types for a specific metropolitan area. The input
variables are a randomly sequenced subset of
employment types, and the targets (d = 16) are
randomly selected from the entire set of categories above the 50% threshold.
ATP1d and ATP7d4 : data sets of airline
ticket prices where the rows are sequences
of time-ordered observations over several days.
The input variables include details about the
flights (such as prices, stops, and departure
date), and the six target variables are the minimum prices observed over the next 7 days for six
flight preferences (namely, any airline with any
number of stops, any airline nonstop only, Delta
Airlines, Continental Airlines, AirTran Airlines,
and United Airlines).
RF1 and RF24: the river flow domain is a temporal prediction task designed to test predictions of future river flows.
Table 2. Publicly available multi-output regression data sets. Instances are given as training/testing, or as total/CV where cross-validation (CV) is used for evaluation; target counts that could not be recovered are marked with a dash.

Data set: Instances; Features; Targets
Solar Flare60: 1389/CV; 10; 3
Water Quality61: 1060/CV; 16; 14
OES974: 323/CV; 263; 16
OES104: 403/CV; 298; 16
ATP1d4: 201/136; 411; 6
ATP7d4: 188/108; 411; 6
RF14: 4108/5017; 64; –
RF24: 4108/5017; 576; –
EDM62: 154/CV; 16; 2
Polymer43: 41/20; 10; 4
Forestry-Kras63: 60607/CV; 160; –
Soil quality64: 1945/CV; 142; –
OPEN-SOURCE SOFTWARE
FRAMEWORKS
We present now a brief summary of available implementations of some multi-output regression algorithms.
Statistical Methods
The glmnet71 R package offers the possibility of
learning multi-target linear models with penalized
maximum likelihood. In particular, using this package,
it is possible to perform LASSO, ridge or mixed
penalty estimation of the coefficients.
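For example, a multi-response Gaussian model with a group (row-wise) penalty on the coefficient matrix, in the spirit of the simultaneous variable selection methods discussed earlier, can be fit as follows (a sketch; the synthetic data and parameter choices are illustrative):

library(glmnet)

set.seed(3)
X <- matrix(rnorm(100 * 20), 100, 20)
B <- rbind(matrix(rnorm(5 * 3), 5, 3), matrix(0, 15, 3))  # only 5 inputs are relevant
Y <- X %*% B + matrix(rnorm(100 * 3, sd = 0.1), 100, 3)

# family = "mgaussian" penalizes the L2-norm of each row of coefficients,
# so an input is either kept or dropped for all targets simultaneously.
cvfit <- cv.glmnet(X, Y, family = "mgaussian")
coef(cvfit, s = "lambda.min")                       # list of d sparse coefficient vectors
Yhat <- predict(cvfit, newx = X, s = "lambda.min")[, , 1]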
CONCLUSION
In this study, the state of the art of multi-output regression is thoroughly surveyed, presenting details of the
main approaches that have been proposed in the literature, and including a theoretical comparison of these
approaches in terms of predictive performance, computational complexity, and representation and interpretability. Moreover, we have presented the most
often used performance evaluation measures, as well
as the publicly available data sets for multi-output
regression problems, and we have provided a summary of the related open-source software frameworks.
To the best of our knowledge, there is no other
review paper addressing the challenging problem of
multi-output regression. An interesting line of future
work would be to perform a comparative experimental study of the different approaches presented here
on the publicly available data sets to round out this
review. Another interesting extension of this review is
to consider different categorizations of the described
multi-output regression approaches, such as grouping them based on how they model the relationships
among the multiple target variables.
ACKNOWLEDGMENTS
This work has been partially supported by the Spanish Ministry of Economy and Competitiveness
through the Cajal Blue Brain project (C080020-09; the Spanish partner of the Blue Brain initiative
from EPFL), the TIN2013-41592-P project, and the Regional Government of Madrid through the
S2013/ICE-2845-CASI-CAM-CM project.
REFERENCES
1. Aho T, Ženko B, Džeroski S, Elomaa T. Multi-target regression with rule ensembles. J Mach Learn Res 2009, 373:2055–2066.
14. Bielza C, Li G, Larrañaga P. Multi-dimensional classification with Bayesian networks. Int J Approx Reason 2011, 52:705–727.
25. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970, 12:55–67.
26. Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support vector regression machines. In: Proceedings of the Advances in Neural Information Processing Systems 9, Denver, CO, 1997, 155–161.
27. Breiman L. Bagging predictors. Mach Learn 1997, 24:123–140.
28. Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal 2002, 38:367–378.
29. Godbole S, Sarawagi S. Discriminative methods for multi-labeled classification. In: Proceedings of the Eighth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 2004, 22–30. Springer Verlag.
30. Wolpert DH. Stacked generalization. Neural Netw 1992, 5:241–259.
31. Read J, Pfahringer B, Holmes G, Frank E. Classifier chains for multi-label classification. Mach Learn 2011, 85:333–359.
32. Zhang W, Liu X, Ding Y, Shi D. Multi-output LS-SVR machine in extended feature space. In: Proceedings of the 2012 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications, Tianjin, China, 2012, 130–134.
33. Izenman AJ. Reduced-rank regression for the multivariate linear model. J Multivar Anal 1975, 5:248–264.
34. van der Merwe A, Zidek JV. Multivariate regression analysis and canonical variates. Can J Stat 1980, 8:27–39.
35. Abraham Z, Tan P, Perdinan P, Winkler J, Zhong S, Liszewska M. Position preserving multi-output prediction. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Prague, Czech Republic, 2013, 320–335. Springer Verlag.
36. Vazquez E, Walter E. Multi-output support vector regression. In: Proceedings of the Thirteenth IFAC Symposium on System Identification, Rotterdam, The Netherlands, 2003, 1820–1825.
37. Chilès JP, Delfiner P. Geostatistics: Modeling Spatial Uncertainty. Wiley Series in Probability and Statistics. Wiley; 1999.
38. Brudnak M. Vector-valued support vector regression. In: Proceedings of the 2006 International Joint Conference on Neural Networks, Vancouver, Canada, 2006, 1562–1569. IEEE Press.
39. Deger F, Mansouri A, Pedersen M, Hardeberg JY. Multi- and single-output support vector regression for spectral reflectance recovery. In: Proceedings of the Eighth International Conference on Signal Image Technology and Internet Based Systems, Sorrento, Italy, 2012, 139–148. IEEE Press.
40. Cai F, Cherkassky V. SVM+ regression and multi-task learning. In: Proceedings of the 2009 International Joint Conference on Neural Networks.
APPENDIX A
PROOF OF LEMMA 1
We present here the proof of Lemma 1, when OLS
estimations of the coefficients are used. The case of
ridge regression is similar.
Let $X$ be the $N \times m$ matrix of input observations and $Y$ the $N \times d$ matrix of output observations. Let us assume that $X^t X$ is invertible; otherwise, OLS estimation cannot be applied. Let us also consider that the ordering of the chain is exactly as follows: $Y_1,\dots,Y_d$. Hence, the coefficients of the first target are estimated as the OLS ones:

$$\hat{\boldsymbol\beta}_1 = \left(X^t X\right)^{-1} X^t \mathbf{y}_1 \quad (m \times 1),$$

where $\mathbf{y}_1$ is the first column of $Y$, corresponding to the observations of $Y_1$. Next, in the second training step of the chain, the OLS estimation of the coefficients $\hat{\boldsymbol\beta}_2$ is computed as the regression of $Y_2$ over $X_1,\dots,X_m, Y_1$, i.e., over the augmented design matrix $Z = \left(X \ \ \mathbf{y}_1\right)$, as follows:

$$\begin{pmatrix}\hat{\boldsymbol\beta}_2^{*}\\ \hat\beta_{2,1}\end{pmatrix} = \left(Z^t Z\right)^{-1} Z^t \mathbf{y}_2 = \begin{pmatrix}\left(X^t X\right)^{-1} + \boldsymbol\alpha_1 C D & -\boldsymbol\alpha_1 C\\ -C D & C\end{pmatrix}\begin{pmatrix}X^t \mathbf{y}_2\\ \mathbf{y}_1^t \mathbf{y}_2\end{pmatrix},$$

where

$$\boldsymbol\alpha_1 = \left(X^t X\right)^{-1} X^t \mathbf{y}_1 \quad (m \times 1),$$
$$C = \left(\mathbf{y}_1^t \mathbf{y}_1 - \mathbf{y}_1^t X \left(X^t X\right)^{-1} X^t \mathbf{y}_1\right)^{-1} \quad (1 \times 1),$$
$$D = \boldsymbol\alpha_1^t = \mathbf{y}_1^t X \left(X^t X\right)^{-1} \quad (1 \times m).$$

Assuming that $\mathbf{y}_1^t \mathbf{y}_1 - \mathbf{y}_1^t X \left(X^t X\right)^{-1} X^t \mathbf{y}_1$ is invertible, i.e., it is different from 0, we have

$$\begin{pmatrix}\hat{\boldsymbol\beta}_2^{*}\\ \hat\beta_{2,1}\end{pmatrix} = \begin{pmatrix}\left(X^t X\right)^{-1} X^t \mathbf{y}_2 + \boldsymbol\alpha_1 C D X^t \mathbf{y}_2 - \boldsymbol\alpha_1 C \mathbf{y}_1^t \mathbf{y}_2\\ -\,C D X^t \mathbf{y}_2 + C \mathbf{y}_1^t \mathbf{y}_2\end{pmatrix},$$

and the model of the first two steps of the chain can be expressed as

$$\hat y_1 = \hat{\boldsymbol\beta}_1^t \begin{pmatrix}x_1\\ \vdots\\ x_m\end{pmatrix} \quad\text{and}\quad \hat y_2 = \hat{\boldsymbol\beta}_2^{*t}\begin{pmatrix}x_1\\ \vdots\\ x_m\end{pmatrix} + \hat\beta_{2,1}\,\hat y_1 = \left(\hat{\boldsymbol\beta}_2^{*} + \hat\beta_{2,1}\hat{\boldsymbol\beta}_1\right)^t\begin{pmatrix}x_1\\ \vdots\\ x_m\end{pmatrix}.$$

Therefore, it is easy to see now that

$$\hat{\boldsymbol\beta}_2^{*} + \hat\beta_{2,1}\hat{\boldsymbol\beta}_1 = \left(X^t X\right)^{-1} X^t \mathbf{y}_2, \tag{A1}$$

i.e., the coefficients obtained by chaining coincide with the ST (direct OLS) estimates for $Y_2$.
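The identity (A1) is also easy to check numerically; the following R snippet (illustrative) compares the chained coefficients with the direct OLS ones:

# Numerical check of (A1): a two-step OLS chain collapses to direct OLS.
set.seed(4)
N <- 200; m <- 4
X  <- matrix(rnorm(N * m), N, m)
y1 <- X %*% rnorm(m) + rnorm(N, sd = 0.1)
y2 <- X %*% rnorm(m) + 0.5 * y1 + rnorm(N, sd = 0.1)

beta1 <- solve(crossprod(X), crossprod(X, y1))   # first chain step
Z     <- cbind(X, y1)                            # augmented design for the second step
gamma <- solve(crossprod(Z), crossprod(Z, y2))   # (beta2*, beta_{2,1})
beta2 <- gamma[1:m]; beta21 <- gamma[m + 1]

ols2  <- solve(crossprod(X), crossprod(X, y2))   # direct (ST) OLS for Y2
max(abs((beta2 + beta21 * beta1) - ols2))        # ~ 0 up to numerical error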
APPENDIX B
R CODE FOR A PARALLEL
IMPLEMENTATION OF ST
We developed the following R source code as an example for a parallel implementation of the ST method.
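A minimal sketch of such a parallel ST implementation, using the parallel package and lm as a placeholder base learner (illustrative only, not the authors' original listing):

library(parallel)

# Parallel single-target (ST) training: fit the d per-target models on separate cores.
# X: data.frame of inputs; Y: matrix or data.frame of d numeric targets.
st_fit_parallel <- function(X, Y, ncores = max(1, detectCores() - 1)) {
  cl <- makeCluster(ncores)
  on.exit(stopCluster(cl))
  clusterExport(cl, varlist = c("X", "Y"), envir = environment())
  parLapply(cl, seq_len(ncol(Y)), function(i) {
    lm(y ~ ., data = cbind(y = Y[, i], X))
  })
}

st_predict <- function(models, Xnew) {
  sapply(models, function(m) predict(m, newdata = Xnew))
}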