Low rank tensor product smooths for GAMMs

Low rank scale invariant tensor product smooths for
Generalized Additive Mixed Models
Simon N. Wood
Department of Statistics, University of Glasgow
Glasgow G12 8QQ U.K.
September 1, 2004
Abstract
This paper considers generalized additive models and generalized additive mixed models
in which the smooth terms are represented using any relatively low rank basis, with an
associated quadratic penalty imposing smoothness, and estimation is via penalized likelihood
maximization and GCV or likelihood, REML or PQL maximization. A general method for
using low rank tensor product smooths to represent smooth functions of several variables in
GAMs and GAMMs is suggested. The method applies a separate wiggliness penalty for each
argument of the smooth, these penalties being induced in a natural way by the ‘marginal’
penalties associated with the smooths from which the tensor product smooth is constructed.
The key features of the method are (i) that the resulting smooth terms are invariant to
linear rescaling of the arguments of the smooth; (ii) that the smooths have a useful tuneable
range of smoothness (in contrast to more conventional approaches to low rank tensor product
smoothing); (iii) that the smooths have relatively low rank, and are hence computationally
efficient to use; (iv) that the smooths can be generated completely automatically from any
marginal smoothing bases and associated quadratic penalties and (v) that the method applies
equally well to smooths of any number of covariates. The combination of these features gives
the approach practical utility. The paper also suggests a simple parameterization of variance
components that enhances numerical stability when estimating smoothing parameters in a
mixed model representation of a GAM or GAMM.
1 Introduction
An Additive Mixed Model (special case of a GAMM, Lin and Zhang, 1999; Fahrmeir and Lang,
2001) has a structure something like
$$y_i = X_i\theta + w_{1i} f_1(x_{1i}) + w_{2i} f_2(x_{2i}, x_{3i}) + \ldots + Z_i b + \epsilon_i \qquad (1)$$
where $y_i$ is a univariate response; $\theta$ is a vector of fixed parameters; $X_i$ is a row of a fixed
effects model matrix; the $w_{ji}$ are covariates, dummy variables or often simply 1 (they are used
in ‘variable coefficient models’: Hastie and Tibshirani, 1993); the $f_j$ are smooth functions of
covariates $x_k$; $Z_i$ is a row of a random effects model matrix; $b \sim N(0, \psi)$ is a vector of random
effects coefficients with unknown positive definite covariance matrix $\psi$; $\epsilon \sim N(0, \Lambda)$ is a residual
error vector, with ith element $\epsilon_i$ and covariance matrix $\Lambda$, which is usually assumed to have some
simple pattern. Generalized additive mixed models replace the normal residuals assumption with
an assumption that $y_i|b$ has some exponential family distribution and $E(y_i|b)$ is some monotonic
function of the right-hand side of (1), excluding the $\epsilon_i$ term. These models are closely related to
the geoadditive models of Kammann and Wand (2003), and Ruppert, Wand and Carroll (2003)
discuss a number of examples of models of this type. Additive mixed models also bear some
relation to the models for designed experiments discussed, for example, by Verbyla et al. (1999)
and implemented, for example, by Ball (2003). Generalized additive models (GAMs, Hastie and
Tibshirani, 1990; see also Wahba, 1990) are the special case of GAMMs with no $Z_i b$ term.
GAMMs have an advantage over GAMs in that the more complex stochastic structure allows
treatment of autocorrelation and repeated measures situations. The way in which smooths are
actually incorporated into GAMMs varies. Lin and Zhang (1999) used cubic smoothing splines
to represent the univariate smooths that they considered, while Wang (1998) represented a full
smoothing spline ANOVA model (see e.g. Gu, 2002) as a normal linear mixed model. But other
authors have tended to opt for the more computationally parsimonious penalized regression
splines; either P-splines (Eilers and Marx, 1996) estimated using MCMC (Fahrmeir and Lang,
2001) or some variant on the thin plate spline basis with estimation by REML (Kammann and
Wand, 2003, Ruppert, Wand and Carroll, 2003). In the latter case the basis allows the smooth
terms to be neatly separated into an un-penalized component to be treated as a fixed effect and
a wiggly component to be treated as a random effect. Not all bases (for example the P-splines)
are so convenient, but in that case a simple re-parameterization is always possible which splits
the smooth into fixed and random components, as reviewed in section 2 of this paper.
Three approaches to representing smooths of more than one variable in GAMMs have been
suggested. Either low rank approximations to thin plate splines have been employed (Kammann
and Wand, 2003; Ruppert, Wand and Carroll, 2003) or tensor product P-splines have been
suggested, with the single penalty given by the Kronecker product of the penalties associated
with the marginal bases from which the smoothing basis is constructed (Fahrmeir and Lang,
2001). Finally, for smooths of 2 predictors in a fully Bayesian setting, and recognizing the
undersmoothing that results from single Kronecker product penalties, Lang and Brezger (2004)
suggested employing tensor products of equally spaced B-spline basis functions in conjunction
with spatially symmetric priors on the B-spline coefficients based on neighbouring coefficients.
Lang and Brezger also generalized this to allow the degree of smoothing to vary over space: these
smooths perform well but are not invariant to re-scaling of the covariates.
By contrast, in non-GAMM settings the full tensor product smoothing splines of Wahba
(1990) and Gu (2002) effectively have a separate smoothing penalty associated with each marginal
basis of the tensor product, allowing the smooth to adapt to different degrees of underlying “wig-
gliness” with respect to different variables. Similarly Eilers and Marx (2002) have used tensor
products of B-splines to represent two dimensional surfaces, with separate difference penalties
applied to the coefficients of the B-splines along the two covariate axes. When it is not
appropriate to assume isotropy of a smooth of several variables, the invariance of such tensor
product smooths to rescaling of the covariates is an important property. Section 3 of this paper shows how to form smooths
of several variables from tensor products of any set of bases with quadratic penalties in a way
that: (i) allows the smooth to be decomposed into fixed and random components suitable for
incorporation into a generalized linear mixed model, (ii) produces smooths that are invariant
to rescaling of their arguments and (iii) produces smooths that are computationally efficient to
work with, due to their relatively low rank.
A final practical issue is the numerical reliability of estimating GAMMs using standard mixed
modelling software and section 4 discusses a simple alternative to the usual log parameterization
of variance components that can enhance this. It is not the purpose of this paper to provide a
review of the topic of smoothing in general, or even mixed model approaches to smoothing, for
which the reader should consult the books by e.g. Ruppert, Wand and Carroll (2003), Hastie
and Tibshirani (1990) or Gu (2002) (or the papers by Verbyla et al., 1999 or Fahrmeir, Kneib
and Lang, 2004).
The work reported here was motivated by the need to provide GAMM and GAM methods
general enough to allow the modeller to pick the most appropriate penalized regression basis for
the problem at hand (e.g. regression splines, P-splines, pseudosplines or something else) while
generating well behaved smooth interaction terms from any mixture of lower dimensional bases
in a consistent and automatic manner.
2 Mixed model components from general basis functions and single quadratic penalties
The representation of smooth model terms as random effects estimable via standard mixed
modelling software is now well established methodology (see e.g. Ruppert, Wand and Carroll,
2003; Verbyla et al. 1999). The purpose of this section is to show how this approach can be
adapted to any representation of a smooth using a set of basis functions and a single quadratic
penalty. The methods reported here are a straightforward generalization of those given in
Wood (2004, Appendix) or Fahrmeir, Kneib and Lang (2004, section 2.3), but are essential to
understanding the tensor product approach.
Suppose that a smooth term $f(x)$ can be represented as
$$f(x) = \sum_{j=1}^{k} a_j(x)\beta_j$$
where $a_j$ is a known function of the covariates $x$ and $\beta_j$ is an unknown coefficient. Examples
of the aj might be B-splines, tensor products of B-splines, thin plate regression spline basis
functions, the truncated power basis for cubic splines, a ‘pseudospline’ basis (Hastie, 1996) or
some radial basis functions.
Further suppose that the wiggliness of $f$ can be measured by a functional $J(f)$ which can
be expressed as a quadratic form $\beta^T S\beta$ where $S$ is a positive semi-definite matrix of known
coefficients. Examples of the quantities represented by such a penalty are the thin-plate spline
penalties; the integrated square of second derivative penalty for a cubic spline and the various
difference penalties used in P-spline smoothing. In general S is only semi-definite because most
smoothness criteria consider some space of functions to be ‘completely smooth’: the number of
zero eigenvalues of S is the dimension of such a space, which will be denoted M .
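As a concrete illustration of a semi-definite penalty, the following minimal sketch builds a P-spline style second order difference penalty $S = D^T D$ and checks that constant and linear coefficient sequences incur no penalty, so that $M = 2$ for this penalty. The function names (`difference_matrix`, `penalty_matrix`, `quad_form`) and the choice of 6 coefficients are illustrative assumptions, not part of the original text.

```python
def difference_matrix(k, order=2):
    """Return the (k - order) x k matrix that takes order-th differences."""
    d = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(order):
        # Each pass replaces the rows by differences of adjacent rows.
        d = [[d[r + 1][c] - d[r][c] for c in range(k)] for r in range(len(d) - 1)]
    return d


def penalty_matrix(k, order=2):
    """Return S = D^T D for the order-th difference matrix D."""
    d = difference_matrix(k, order)
    return [[sum(row[i] * row[j] for row in d) for j in range(k)] for i in range(k)]


def quad_form(s, beta):
    """Evaluate the wiggliness measure beta^T S beta."""
    n = len(beta)
    return sum(beta[i] * s[i][j] * beta[j] for i in range(n) for j in range(n))


S = penalty_matrix(6)
constant = [1.0] * 6                     # in the null space of S
linear = [float(i) for i in range(6)]    # also in the null space of S
wiggly = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # penalized

print(quad_form(S, constant), quad_form(S, linear), quad_form(S, wiggly))
```

The two unpenalized directions are exactly the ‘completely smooth’ functions for this penalty; any other coefficient vector receives a strictly positive penalty.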
Given a basis and data from which to estimate $f$, it is straightforward to produce a model
matrix, $X$, for $f$ where $X_{ij} = a_j(x_i)$. The mixed model approach to estimating $f$ starts from
the premise that, by stating that $f$ is smooth, we really believe that it is more probable that $f$
is smooth than that $f$ is wiggly. This can be formalized by specifying a prior for the wiggliness
of the model which is $\propto \exp(-\lambda\beta^T S\beta/2)$, say. Such a prior implies an improper Gaussian prior
for $\beta$ itself (Silverman, 1985).
Now it is possible to proceed by treating all the coefficients of $f$ as random effects with
$\beta \sim N(0, S^+/\lambda)$, where $S^+$ is a pseudoinverse of $S$ (e.g. Fahrmeir and Lang, 2001), but
the improper prior is awkward to handle. If estimation by standard mixed effects methods is
required, it is better to try and split f into a component with a proper prior and a component
with a completely improper prior which will be treated as a fixed effects term. Some basis-
penalty combinations allow this to be done quite straightforwardly. For example Kammann and
Wand (2003), and Ruppert, Wand and Carroll (2003) use a low rank thin plate spline basis
in which the un-penalized space is represented using a polynomial basis; however, for many
smooths of practical interest the basis does not separate so easily.
In general the solution is simply to work in the eigenspace of the penalty matrix $S$. To do this
let $S = UDU^T$ where $U$ is an orthogonal matrix, the columns of which are the eigenvectors of
$S$, and $D$ is a diagonal matrix with the corresponding eigenvalues arranged in descending order
on the leading diagonal. Let $D_+$ denote the smallest sub-matrix of $D$ containing all the strictly
positive eigenvalues. Now reparameterize so that the new coefficient vector can be written
$(b_R^T, \beta_F^T)^T \equiv U^T\beta$, where $\beta_F$ is of dimension $M$. It is clear that $\beta^T S\beta = b_R^T D_+ b_R$ and that
the coefficients $\beta_F$ are unpenalized. Partitioning the eigenvector matrix so that $U \equiv [U_R : U_F]$,
where $U_F$ has $M$ columns and defining $X_F \equiv XU_F$ while $X_R = XU_R$ the mixed model
representation of the smooth in terms of a linear predictor and random effects distribution is
now
$$X_F\beta_F + X_R b_R, \quad b_R \sim N(0, D_+^{-1}/\lambda)$$
where $\lambda$ and $\beta_F$ are fixed parameters to be estimated. For convenient estimation with standard
software a further reparameterization is required. Defining $b = \sqrt{D_+}\, b_R$ and $Z = X_R \sqrt{D_+^{-1}}$,
so that $Zb = X_R b_R$ and the covariance matrix of $b$ is $\sqrt{D_+}\, D_+^{-1} \sqrt{D_+}/\lambda = I/\lambda$,
the mixed model representation of the term is
$$X_F\beta_F + Zb, \quad b \sim N(0, I/\lambda)$$
Including such a term in a mixed model is simply a matter of appending the columns of XF to
the fixed effect model matrix, appending the columns of Z to the random effects model matrix
and specifying the given random effects covariance matrix. Obviously, the multiple smooth
terms of an additive model are easily combined (although some simple identifiability constraints
are then required).
3 Tensor products
The previous section dealt with any smooth with a single penalty, such as univariate smooths; the
thin plate regression splines of Wood (2003); the tensor product splines suggested by Fahrmeir
and Lang (2001) or Green and Silverman (1994) or non-spatially adaptive versions of the tensor
product splines proposed by Lang and Brezger (2004). However single penalty smooths of
multiple covariates are usually problematic if the covariates are not naturally on the same scale.
For example, if one covariate is time and another is some measure of location, then the relative
scaling of these has to be chosen in order to apply most single penalty smoothing methods and,
in the absence of a systematic method, the choice will be necessarily ad hoc. In the case of
tensor product splines with a single penalty, constructed from a Kronecker product of ‘marginal
penalties’, the problem is different: in this case the smooth is invariant to rescaling, but only at
the cost of employing a penalty of impractically low rank (see below for further details).
Tensor product smooths with separate penalties associated with each covariate can provide
smooths which are invariant to rescaling of covariates and have a useful smoothness range. This
has long been recognized in the literature on full spline function estimation (e.g. Wahba, 1990;
Wang, 1998; Gu, 2002), while in the penalized regression spline literature Eilers and Marx
(2003) have successfully used bivariate tensor products of B-splines in which separate difference
penalties on the B-spline coefficients were applied in the two covariate directions. However
most applications of tensor products of regression splines have used only single penalties. This
section provides a general recipe for constructing low rank tensor product smooths from arbitrary
‘marginal’ basis function - quadratic penalty combinations, with separate penalties for each
covariate direction induced in a natural way from the marginal penalties. This is done in a
manner that allows straightforward incorporation of the resulting smooths into generalized linear
mixed models. In a GAM context the results provide low-rank, and hence computationally
efficient, analogues of full spline tensor product models, while generalizing the approach of
Eilers and Marx (2003) to any basis-penalty combination and smooths of arbitrary numbers of
covariates. In the GAMM context they provide a satisfactory means of generating smooths of
more than one variable when it is not reasonable to assume isotropy.
Consider the construction of a smooth function $f(x_1, x_2, \ldots, x_d)$, where the covariates $x_i$ are
scalar or more rarely vector variables. Suppose that for each covariate $x_i$ we have a set of basis
functions for representing smooth functions $m_i$ of $x_i$ alone. If this set is $\{a_{i,j}(x_i) : j = 1, \ldots, k_i\}$
then the ith ‘marginal’ smooth function can be written as
$$m_i(x_i) = \sum_{j=1}^{k_i} a_{i,j}(x_i)\gamma_j \qquad (2)$$
where the $\gamma_j$ are unknown coefficients. Typical basis functions might be those for B-splines, thin
plate regression splines or pseudosplines. Let $J_i(m_i)$ be a functional associated with the basis,
which measures the ‘wiggliness’ of $m_i$, and suppose that it can be expressed as a quadratic
form $J_i(m_i) \equiv \gamma^T S_i \gamma$ where $\gamma^T = [\gamma_1, \gamma_2, \ldots, \gamma_{k_i}]$ and $S_i$ is a matrix of known coefficients:
typical examples might be the squared derivative penalties used in spline smoothing, the squared
difference penalties of P-splines or the penalty on a pseudospline basis. A particular basis-penalty
combination gives the ingredients for estimating a smooth function of xi by penalized regression,
or as a component of a mixed model.
Now the usual tensor product approach uses products of the marginal basis functions $a_{i,j}$ to
construct a basis for $f$, leading to the following representation:
$$f(x_1, x_2, \ldots, x_d) = \sum_{j_1=1}^{k_1} \sum_{j_2=1}^{k_2} \cdots \sum_{j_d=1}^{k_d} \beta_{j_1,j_2,\ldots,j_d} \prod_{i=1}^{d} a_{i,j_i}(x_i)$$
where the $\beta_{j_1,j_2,\ldots,j_d}$ are unknown parameters.

To motivate production of an appropriate set of penalties for the tensor product smooth, it
helps to re-write the smooth as
$$f(x_1, x_2, \ldots, x_d) = \sum_{j_1 \ldots j_{l-1}, j_{l+1} \ldots j_d = 1}^{k_1 \ldots k_{l-1}, k_{l+1} \ldots k_d} \Big[ \prod_{i \neq l} a_{i,j_i}(x_i) \Big] f^{[l]}_{j_1 \ldots j_{l-1}, j_{l+1} \ldots j_d}(x_l)$$
where
$$f^{[l]}_{j_1 \ldots j_{l-1}, j_{l+1} \ldots j_d}(x_l) = \sum_{j_l=1}^{k_l} \beta_{j_1,j_2,\ldots,j_d}\, a_{l,j_l}(x_l).$$
Hence the variation of $f$ in the $x_l$ direction is characterized entirely by the $\prod_{j \neq l} k_j$ functions
$f^{[l]}_{j_1 \ldots j_{l-1}, j_{l+1} \ldots j_d}$, which are each of the form (2). This suggests that a natural way of characterizing
the wiggliness of $f$ in the $x_l$ direction is via the functional
$$J^{[l]}(f) = \sum_{j_1 \ldots j_{l-1}, j_{l+1} \ldots j_d = 1}^{k_1 \ldots k_{l-1}, k_{l+1} \ldots k_d} J_l\big(f^{[l]}_{j_1 \ldots j_{l-1}, j_{l+1} \ldots j_d}\big);$$
that is via the sum of the marginal functionals applied to each of the $f^{[l]}_{j_1 \ldots j_{l-1}, j_{l+1} \ldots j_d}$. It is
relatively straightforward to show that
$$J^{[l]}(f) \equiv \beta^T \tilde{S}_l \beta$$
where $\tilde{S}_l = I_{k_1} \otimes \cdots \otimes I_{k_{l-1}} \otimes S_l \otimes I_{k_{l+1}} \otimes \cdots \otimes I_{k_d}$ and $\beta$ is a vector containing the $\beta_{j_1,j_2,\ldots,j_d}$ in
appropriate order. $\otimes$ is the Kronecker product and $I_R$ the rank $R$ identity matrix.
As in the single penalty case, given the tensor product basis and data from which to estimate
$f$, it is straightforward to construct a model matrix $X$ for the term, but if the coefficients
of the term are to be treated as random effects then their distribution would now be
$$\beta \sim N\Big(0, \Big(\sum_{i=1}^{d} \lambda_i \tilde{S}_i\Big)^{+}\Big),$$
where the $\lambda_i$ are parameters to be estimated.
The dimension, $M_T$, of the null space of the covariance matrix is readily shown to be given by
the product of the dimensions of the null spaces of the marginal penalty matrices $S_i$ (provided
that $\lambda_i > 0\ \forall\ i$) and as in the single penalty case the resulting rank deficiency of the covariance
matrix would cause problems for estimation of the model via standard mixed effects methods.
Again it helps to reparameterize using an eigenspace related to the penalty. Specifically let
$$\sum_{i=1}^{d} \tilde{S}_i = UDU^T$$
where $U$ is an orthogonal matrix of eigenvectors and $D$ is a diagonal matrix of eigenvalues, with
$M_T$ zero elements at the end of the leading diagonal. Notice that there are no $\lambda_i$ parameters
in the sum that is decomposed: this is reasonable since the null space of the penalty does not
depend on these parameters (however given finite precision arithmetic it might be necessary to
scale the $\tilde{S}_i$ matrices in some cases).
It is not now possible to achieve the sort of simple representation of a term that was obtained
with a single penalty, so the reparameterization is simpler. Partitioning the eigenvector matrix
so that $U \equiv [U_R : U_F]$ where $U_F$ has $M_T$ columns, it is necessary to define $X_F \equiv XU_F$,
$Z \equiv XU_R$ and $\bar{S}_i = U_R^T \tilde{S}_i U_R$. A mixed model representation of the tensor product term (i.e.
the linear predictor and random effects distribution) is
$$X_F\beta_F + Zb, \quad b \sim N\Big(0, \Big(\sum_i \lambda_i \bar{S}_i\Big)^{-1}\Big)$$
where the λi and βF parameters have to be estimated. Clearly the covariance matrix structure
is not completely standard, but it can be handled quite easily in the nlme software of Pinheiro
and Bates (2000) by writing a new pdMat class. Given such a class, incorporation of one or more
tensor product terms into a linear mixed model is straightforward.
3.1 Simple example: a 3-way interaction
The general expressions given above can mask the basic simplicity of the approach, so it is worth
considering the construction of a tensor product smooth, $f(w, x, z)$, of 3 variables from the bases
and penalties of 3 univariate smooths. Firstly, three marginal bases and penalties are obtained,
as if three univariate smooth terms $m_w(w)$, $m_x(x)$ and $m_z(z)$ were to be represented. Let the
bases be $\{a_{wi}(w) : i = 1 \ldots I\}$, $\{a_{xj}(x) : j = 1 \ldots J\}$ and $\{a_{zk}(z) : k = 1 \ldots K\}$ with associated
penalty matrices $S_w$, $S_x$ and $S_z$ respectively. The bases and penalties need not be of the same
sort, of course. By the usual tensor product construction the smooth $f$ would be represented as
$$f(w, x, z) = \sum_{ijk} a_{wi}(w)\, a_{xj}(x)\, a_{zk}(z)\, \beta_{ijk}$$
where the $IJK$ coefficients $\beta_{ijk}$ are unknown. For the marginal $z$ basis, the penalty with
coefficient matrix $S_z$ penalizes the marginal coefficients associated with the sequence of basis
functions $a_{z1}(z), a_{z2}(z) \ldots a_{zK}(z)$. Hence in the tensor product basis it is natural to use $S_z$ to
similarly penalize all $IJ$ sequences of coefficients associated with $k$ running from 1 to $K$ while
$i$ and $j$ stay fixed. Symmetric arguments apply for the $w$ and $x$ directions. Hence 3 wiggliness
penalties are induced for the three covariate directions of $f$:
$$\beta^T \tilde{S}_W \beta \quad \text{where} \quad \tilde{S}_W = S_w \otimes I_J \otimes I_K,$$
$$\beta^T \tilde{S}_X \beta \quad \text{where} \quad \tilde{S}_X = I_I \otimes S_x \otimes I_K,$$
$$\beta^T \tilde{S}_Z \beta \quad \text{where} \quad \tilde{S}_Z = I_I \otimes I_J \otimes S_z.$$
To see how these penalties relate to the marginal penalties suppose that the marginal
penalties are the usual spline integrated squared second derivative penalties. In this case the penalty
on the marginal smooth $m_z$ would be $\int m_z''(z)^2\, dz$. Now $f$ can be re-written as:
$$f(w, x, z) = \sum_{ij} a_{wi}(w)\, a_{xj}(x)\, f_{i,j}(z)$$
where
$$f_{i,j}(z) = \sum_{k} a_{zk}(z)\, \beta_{ijk},$$
in which case it can be shown that
$$\beta^T \tilde{S}_Z \beta \equiv \sum_{ij} \int f_{i,j}''(z)^2\, dz.$$
Similar expressions apply for the other two penalties.
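The induced penalties can be checked numerically on a toy case. The sketch below is illustrative only: it assumes $I = J = 2$, $K = 3$, stores $\beta_{ijk}$ with $k$ varying fastest (the ordering implied by the Kronecker products above), and uses small difference penalties as stand-ins for $S_w$ and $S_z$. The helper names `kron`, `identity` and `quad_form` are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def quad_form(s, v):
    n = len(v)
    return sum(v[i] * s[i][j] * v[j] for i in range(n) for j in range(n))


I, J, K = 2, 2, 3
S_w = [[1.0, -1.0], [-1.0, 1.0]]                               # first difference penalty
S_z = [[1.0, -2.0, 1.0], [-2.0, 4.0, -2.0], [1.0, -2.0, 1.0]]  # second difference penalty

S_W = kron(kron(S_w, identity(J)), identity(K))  # S_w (x) I_J (x) I_K
S_Z = kron(kron(identity(I), identity(J)), S_z)  # I_I (x) I_J (x) S_z

# beta[(i * J + j) * K + k] holds beta_ijk, so k varies fastest.
beta = [1.0, 2.0, 4.0, 0.0, 1.0, 0.0, 3.0, 3.0, 3.0, 2.0, 0.0, 1.0]

pen_z = quad_form(S_Z, beta)
# The same penalty, computed by applying S_z to each of the I*J coefficient sequences.
blockwise = sum(quad_form(S_z, beta[b * K:(b + 1) * K]) for b in range(I * J))
pen_w = quad_form(S_W, beta)
print(pen_z, blockwise, pen_w)
```

The agreement of `pen_z` and `blockwise` is the discrete analogue of the identity above: the $z$ direction penalty is the marginal penalty summed over all sequences with $i$ and $j$ held fixed.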
3.2 Why multiple penalties are preferable to a single penalty
There is a substantial practical difference between using the approach suggested here and a
simpler tensor product approach employing a single penalty matrix such as Sπ = Sw ⊗ Sx ⊗ Sz .
The problem with such single penalties is their degree of rank deficiency. For example, a smooth
of three variables constructed from three cubic spline bases, each of rank five, would have 125
parameters and a penalty of rank 27. Hence the effective degrees of freedom of the term would
have to lie between 98 and 125, rendering the penalization effectively useless. In contrast, using
the same marginal bases and the approach advocated here, the degrees of freedom of the smooth
would lie between 8 and 125: a much more useful range for practical work. Alternatively one
could employ a higher rank single penalty in association with a tensor product basis, but in
that case the resulting smooth is no longer invariant to linear re-scaling of the arguments of the
smooth.
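The degrees of freedom ranges quoted above follow from simple rank arithmetic, sketched below. The sketch assumes that each rank five cubic spline basis has a penalty with a two dimensional null space (constant and linear functions), which is consistent with the figures 27 and 8 in the text; the function names are illustrative.

```python
from math import prod


def edf_range_single(basis_dims, null_dims):
    """EDF range under one Kronecker product penalty S_w (x) S_x (x) S_z.

    The rank of a Kronecker product is the product of the marginal ranks,
    so every remaining coefficient direction is left unpenalized.
    """
    total = prod(basis_dims)
    penalty_rank = prod(k - m for k, m in zip(basis_dims, null_dims))
    return total - penalty_rank, total


def edf_range_multiple(basis_dims, null_dims):
    """EDF range with one penalty per covariate direction.

    The joint null space has dimension equal to the product of the
    marginal null space dimensions.
    """
    return prod(null_dims), prod(basis_dims)


dims, nulls = [5, 5, 5], [2, 2, 2]
print(edf_range_single(dims, nulls))    # lower and upper limits, single penalty
print(edf_range_multiple(dims, nulls))  # lower and upper limits, three penalties
```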
3.3 Nesting
Consider once again marginal bases $\{a_{i,j}(x_i) : j = 1, \ldots, k_i\}$ such that the ith ‘marginal’ smooth
function can be written as
$$m_i(x_i) = \sum_{j=1}^{k_i} a_{i,j}(x_i)\gamma_j$$
where the $\gamma_j$ are unknown coefficients. If there exists a set of coefficients $c_j$ such that
$$\sum_{j=1}^{k_i} a_{i,j}(x_i)\, c_j = 1 \quad \forall\ x_i$$
(i.e. if the space spanned by the basis functions includes the constant function), then a model
component with the additive form
$$m_1(x_1) + m_2(x_2) + \ldots + m_d(x_d)$$
is nested within a model component of the form
$$f(x_1, x_2, \ldots, x_d)$$
if $f$ is represented by a tensor product smooth constructed from the marginal bases. This is
obvious if each set of marginal basis functions contains a constant function, since then each
marginal basis function appears ‘on its own’ somewhere in the set of tensor product basis
functions. Otherwise it follows from the previous argument and the fact that marginal bases
whose span includes the constant functions can always be re-parameterized to include a constant
in the basis, without, of course, changing the space spanned by the tensor product basis.
Since the xi can be vectors, and since any tensor product smooth can be built up in various
ways as a product of lower dimensional tensor product smooths it follows that any additive model
structure dependent on x1 , x2 , . . . , xd is nested within the tensor product smooth of all these
variables, provided that the same marginal bases are used for constructing the representation
of the additive model and the full tensor product, and provided that the spaces spanned by the
marginal bases include the constant functions. Hence these general low rank tensor product
schemes can be used for low rank versions of smoothing spline ANOVA, in both GAM and
GAMM contexts.
4 ‘notLog’ parameterization of variance components
The parameters, λ, in the preceding two sections take the role of variance parameters in the
mixed model representation. They must be non-negative, and the usual way of ensuring this
is to perform optimization on the parameters η = log(λ). This approach is taken in standard
mixed modelling software, but has some problems if, as in the case of smooths, the variance
parameters can legitimately become very large or close to zero. For such extreme values the
likelihood, PQL or REML score can be quite flat and this tends to lead to large trial steps for η
during Newton type iterative maximization. Large η steps can unfortunately lead to numerical
overflow or underflow of λ. For example, using 64 bit double arithmetic, if the magnitude of η is
larger than somewhere in the region of 700-800 then overflow or underflow will occur: for MLE,
REML or PQL either is problematic. This problem occurs inconveniently often in practice if
one tries to estimate GAMMs using standard software such as lme (Pinheiro and Bates, 2000).
Fortunately there is a simple solution. Rather than use λ = exp(η) one can use λ = notExp(η)
where
$$\text{notExp}(\eta) = \begin{cases} e\,(\eta^2 + 1)/2 & \eta > 1 \\ e^{\eta} & -1 \le \eta \le 1 \\ 2e^{-1}/(\eta^2 + 1) & \eta < -1 \end{cases} \qquad (3)$$
This monotonically maps the real line to the positive real line and is continuous to second
derivative, as required for Newton type optimization, but in finite precision arithmetic does not
overflow until η is of the order of the square root of the largest representable number, nor underflow
until −η is of the order of the reciprocal square root of the smallest representable positive number. The inverse function
(notLog, say) is easily obtained, as are derivatives if these are required. In practice, use of
this parameterization appears to considerably increase the reliability of the GAMM estimation
process, at least when using Pinheiro and Bates (2000) lme routine as the underlying estimation
engine. Obvious generalizations are possible if higher orders of continuity are needed.
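A minimal sketch of the notExp function defined in (3), together with its inverse, is given below. It is written directly from the piecewise definition above; the names `not_exp` and `not_log` are illustrative and this is not the code used with lme.

```python
import math


def not_exp(eta):
    """Piecewise map from the real line to the positive reals, as in (3)."""
    if eta > 1.0:
        return math.e * (eta * eta + 1.0) / 2.0
    if eta < -1.0:
        return 2.0 / (math.e * (eta * eta + 1.0))
    return math.exp(eta)


def not_log(lam):
    """Inverse of not_exp, obtained by solving each branch of (3) for eta."""
    if lam > math.e:
        return math.sqrt(2.0 * lam / math.e - 1.0)
    if lam < 1.0 / math.e:
        return -math.sqrt(2.0 / (math.e * lam) - 1.0)
    return math.log(lam)


# exp(1000) overflows in double precision, whereas not_exp(1000) is finite,
# and not_exp(-1000) is still strictly positive.
print(not_exp(1000.0), not_exp(-1000.0), not_log(not_exp(-50.0)))
```

Because the outer branches grow or decay only quadratically in η, a large Newton step in η changes λ by many orders of magnitude less than it would under the exponential map.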
5 Generalization and Confidence Intervals
The discussion so far has focused on additive mixed models, and general methods for setting
these up in a manner allowing estimation using standard software such as the nlme library of
Pinheiro and Bates (2000). Estimation in the generalized case can proceed in a completely
straightforward manner using the approximate PQL methods of e.g. Breslow and Clayton
(1993). Venables and Ripley (2002) provide a suitable function glmmPQL based on iterative calls
to the mixed modelling function lme from the nlme library.
The remaining issue is the calculation of confidence intervals. In most applications of
GAMMs these would be required primarily for the smooth terms and the fixed effects. If
this is the case then, following Silverman (1985), a Bayesian posterior covariance matrix for the
coefficients of these terms can be obtained. Conditioning on the parameter estimates for the
random effects, it is first necessary to calculate the covariance matrix for the response data (or
pseudodata in the PQL case) implied by the estimated random effects structure excluding the
smooth terms; suppose this is $V$. Then if $\theta$ is the vector of all the fixed parameters plus the
coefficients of the smooths, $X$ is the model matrix corresponding to these terms and $S_i$ is the ith
penalty matrix (padded with zeros if necessary so that $\theta^T S_i \theta$ is the correct penalty) then
$$\theta | y \sim N\Big(\hat{\theta}, \Big(X^T V^{-1} X + \sum_i \lambda_i S_i\Big)^{-1}\Big)$$
where $\hat{\theta}$ is the vector of estimates or predictions of the elements of $\theta$. This is essentially the
approach taken in Lin and Zhang (1999), and allows the required intervals to be obtained. The
only quantity not readily available from standard software is the estimate of $V$, but with some
effort it is possible to extract it, at least from lme fits. As usual the degrees of freedom per
element of $\theta$ can be estimated from the leading diagonal of $(X^T V^{-1} X + \sum_i \lambda_i S_i)^{-1} X^T V^{-1} X$.
6 Examples
This section illustrates the utility of the methods using two simulated examples (so that the
correct answer is known) and a short real example. In the first example data were simulated
from the model
$$y_i = f(x_i) + e_i, \quad i = 1, \ldots, 400$$
where $f(x) = x^{11}[10(1 - x)]^6/5 + 10(10x)^3(1 - x)^{10}$, $e_i = 0.6e_{i-1} + \epsilon_i$ for $i = 1 \ldots 400$, $e_0 = 0$,
$\epsilon_i \sim N(0, 1.5^2)$ and the $x_i$ were uniformly spaced on [0,1]. The function $f$ was then treated
as unknown and represented by a rank 20 P-spline basis (cubic B-splines, penalized by a 2nd
order difference penalty: see Eilers and Marx, 1996), while the noise was modelled as an AR(1)
process with unknown correlation parameter. The example is interesting, since the P-spline
basis does not have immediately identifiable fixed (i.e. unpenalized) and random (penalized)
components, so the approach of section 2 is required. After using this method to represent
the model as a linear mixed model it was estimated using REML (S routine lme, Pinheiro and
Bates, 2000). For comparison, fits were also made assuming i.i.d. errors using REML and
performing estimation by penalized likelihood using GCV for smoothness selection (gam from R
package mgcv). Figure 1 shows typical results: the mixed model with AR(1) errors produces a
reasonable reconstruction of the truth, with plausible 95% confidence bands, while the methods
that neglect autocorrelation overfit, and produce overly narrow confidence bands.
[Figure 1 plot: four panels against x on [0, 1], with y-axis labels y − mean(y), s(x,7.74), s(x,13.69) and s(x,15.59).]
Figure 1: Reconstructing a smooth function sampled with auto-regressive error. The upper
left plot shows the data. The upper right plot shows the reconstruction using a mixed model
representation of a P-spline model for the smooth function with an AR(1) error model, estimated
using REML; the bold line is the true function, the thin continuous line the reconstruction and
the dashed lines are 95% confidence limits. The lower left panel is similar but assuming i.i.d.
errors. The lower right panel is as the lower left, but estimated using penalized likelihood
maximization with smoothness selected by GCV. In all panels the plots are centered to have
zero mean over the covariate values. The figures in the y-axis labels give the estimated degrees
of freedom for the smooths.
[Figure 2 plot: four scatter panels of y against x, z, w and v.]
Figure 2: Scatter plots of the square root of the response data against each candidate covariate
for the GAMM repeated measures example. Note how difficult it would be to judge what the
appropriate scaling of x and z ought to be by straightforward inspection of the data.
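The data generating process of the first example can be sketched as follows. This is a minimal illustration rather than the simulation code behind Figure 1: the random number generator, the seed and the choice to place the evenly spaced covariate values at i/(n − 1) are assumptions, and the names `f` and `simulate_ar1_example` are hypothetical.

```python
import random


def f(x):
    """True function of the first simulated example."""
    return 0.2 * x**11 * (10 * (1 - x))**6 + 10 * (10 * x)**3 * (1 - x)**10


def simulate_ar1_example(n=400, rho=0.6, sigma=1.5, seed=0):
    """Simulate y_i = f(x_i) + e_i with e_i = rho * e_{i-1} + eps_i and e_0 = 0."""
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]  # evenly spaced on [0, 1] (assumed end points)
    y = []
    e = 0.0  # e_0 = 0
    for xi in x:
        e = rho * e + rng.gauss(0.0, sigma)  # eps_i ~ N(0, sigma^2)
        y.append(f(xi) + e)
    return x, y


x, y = simulate_ar1_example()
print(len(x), f(0.5))
```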
The second example uses a ‘repeated measures’ additive mixed model with one tensor product
term and Poisson errors. 400 Poisson data $y_i$ were simulated from $y_i \sim \text{Poi}(e^{\eta_i})$ where
$$\eta_i = f_1(x_i, z_i) + f_2(w_i) + b_j \quad \text{if observation } i \text{ is from group } j.$$
There were 10 $b_j$ terms which were i.i.d. $N(0, 1)$; each group contained 40 observations; the $x_i$
and $w_i$ were independent uniform random deviates on (0,1); the $z_i$ were independent random
deviates on (0,0.05);
$$f_1(x, z) = 2\exp\{-(x - 0.2)^2/\sigma_x^2 - (z - 0.015)^2/\sigma_z^2\} + 1.3\exp\{-(x - 0.7)^2/\sigma_x^2 - (z - 0.04)^2/\sigma_z^2\}$$
where $\sigma_x = 0.3$ and $\sigma_z = 0.02$; $f_2(w) = \sin(2\pi w)$. The response data are plotted
against the three covariates, and a spurious covariate v, in figure 2. A typical situation producing
such data would be a marine biological survey conducted by several different research vessels,
where it is usually prudent to include a random effect for vessel in any analysis.
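The true component functions of the second example are easily written down, and doing so makes the anisotropy of $f_1$ explicit: a shift of 0.02 in $z$ changes $f_1$ far more than the same shift in $x$. The sketch below only evaluates the Poisson mean for given covariates and random effect; the function names are illustrative and no sampling scheme is implied.

```python
import math

SIGMA_X = 0.3
SIGMA_Z = 0.02


def f1(x, z):
    """Two-bump surface with very different length scales in x and z."""
    bump1 = 2.0 * math.exp(-(x - 0.2) ** 2 / SIGMA_X ** 2 - (z - 0.015) ** 2 / SIGMA_Z ** 2)
    bump2 = 1.3 * math.exp(-(x - 0.7) ** 2 / SIGMA_X ** 2 - (z - 0.04) ** 2 / SIGMA_Z ** 2)
    return bump1 + bump2


def f2(w):
    return math.sin(2.0 * math.pi * w)


def poisson_mean(x, z, w, b_j):
    """Mean exp(eta_i) for an observation in a group with random effect b_j."""
    return math.exp(f1(x, z) + f2(w) + b_j)


# Same sized step away from the first peak, taken in x and then in z.
print(f1(0.22, 0.015), f1(0.2, 0.035))
```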
Three models were fitted to the data: all assumed Poisson errors and a log link; represented
f2 with a cyclic penalized cubic regression spline with 10 knots and included a final nuisance term
(f3 , not in the truth) represented by a 10 knot ‘P-spline’. The first model was a GAMM including
a random effect for group and representing f1 with a tensor product of penalized cubic regression
splines with 6 knots per direction (piecewise cubic Hermite polynomial basis used). The second
model was the same as the first except that f1 was represented by a rank 36 isotropic smooth
(a thin plate regression spline, Wood, 2003). The final model was as the first, but without the
random effect and estimated by penalized likelihood maximization with smoothing parameters
chosen by an unbiased risk estimator (see Wahba, 1990), which is approximately AIC.
Results for a typical replicate simulation are shown in figure 3. It is clear that using an
isotropic smooth is unsatisfactory, while neglecting the correlation structure in the data leads
to considerable over-fitting. Hence, for this type of data, the work reported in this paper is a
necessary addition to GAMM and GAM methods. Note also that neither the cubic regression
spline nor P-spline bases used here have obvious components representing the space of un-
penalized functions, so again the methods of sections 2 and 3 are essential for their use in a
GAMM.
6.1 Mackerel example
Fish stock assessments are sometimes undertaken by surveying the eggs of a particular species
in order to work out egg abundance, from which total mass of the spawning stock of fish can be
indirectly inferred. One such survey was undertaken in 1992 off the west coasts of Britain, Eire
and France targeting Mackerel eggs. Several fisheries research vessels sampled on an ‘irregular
grid’ by hauling a fine meshed net vertically through the water column and counting the mackerel
eggs found in the net (see the leftmost panel of figure 4). Generalized Additive Models were used to
model these data by Borchers et al. (1997). The best models in terms of explaining the egg
abundances tend to depend almost exclusively on geographic predictors, such as longitude,
latitude and distance from the 200m sea bed contour (proxy for the distance from the edge of
the continental shelf). Such models are fine for stock assessment, but not completely satisfactory
in terms of biological interpretability, since they depend on quantities which the fish are unlikely
to be directly sensitive to.
Biologically, it would be interesting to try and base prediction entirely on variables that the
fish might be able to sense, such as salinity, water temperature, sea bed depth and perhaps
latitude (since day length varies with latitude over the survey area). For the purposes of this
example, square root of observed egg density per square metre of sea surface, y, is used as the
[Figure 3: grid of panels. Top row (x on (0, 1) against z on (0, 0.05)): f1(x, z), s(x, z, 15.5),
s(x, z, 21.5), s(x, z, 33.9). Middle row (against w): f2(w) and estimates labelled s(w, 5.38),
s(w, 5.33), s(w, 4.12). Bottom row (against v): f3(v) and estimates labelled s(v, 1.15), s(v, 1.96),
s(v, 8.81).]
Figure 3: The true and estimated component functions of the simulated repeated measures
GAMM example. The rows show, from top to bottom f1 , f2 and the spurious function (f3 , say).
The columns, from left to right, show: the true functions used in simulation; the component
functions of a GAMM estimated by PQL with f1 represented as a tensor product term and
with a random effect for group; the same as the previous column, but with an isotropic smooth
term for f1 ; finally a GAM assuming i.i.d. errors, but with a tensor product smooth for f1 ,
estimated by penalized likelihood maximization, with smoothing parameters chosen by unbiased
risk estimation (approximate AIC). In all cases the figure in the response axis label gives the
effective degrees of freedom of the plotted smooth term estimate.
[Figure 4: three map panels, each with axes lon (−14 to −2) and lat (44 to 58).]
Figure 4: Left panel: locations of mackerel egg samples with symbol areas proportional to egg
density per square metre of sea surface. Middle panel: model predicted square root egg density
over the survey area. Right panel: 5× the standard error of the estimates in the middle panel,
on the same scale as the middle panel.
response, and this is modelled as having a normal distribution (modelling the counts directly and
using a Poisson distribution is also possible, but in that case there is substantial overdispersion
to be dealt with). The model used was then:
$$\sqrt{y_i} = f_1(\mathrm{r.bd}_i, \mathrm{lat}_i, \mathrm{temp}_i) + f_2(\mathrm{sal}_i) + b_j + \epsilon_i$$
assuming that observation i was obtained by boat j. The random effects bj are assumed i.i.d.
Normal, while the vector of residuals is ε ∼ N (0, Λ), Λ being given by the assumption that the
residuals are correlated in a manner that decays exponentially with geographic distance between
observations nested within vessel (see Pinheiro and Bates, 2000). The vessel effect is included
to allow for differences in operating procedures etc. between the boats. The spatial correlation
is to account for aggregation not explicable by the covariates. It seems sensible to nest this
correlation within vessel, since in practice different vessels tend to be separated in time when
proximate in space. The smooth function f1 was represented using a tensor product smooth,
with marginal cubic regression spline bases of dimension 6: it is a function of the square root of
sea bed depth, latitude and temperature at 20 metres depth. f2 was represented using a rank
[Figure 5: three panels titled lat=48, temp=15 and depth=545, with axes r.bd, temp and lat.]
Figure 5: Each figure shows f1 against two covariates with the other covariate held at its mean
value in the data set. The function is only plotted for values of the covariates sufficiently close
to values observed in the data.
10 thin plate regression spline and is a function of salinity. It seemed biologically unlikely that
salinity would interact strongly with the other covariates.
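The residual correlation structure described for this model (exponential decay with distance, nested within vessel) can be illustrated with a small sketch. It assumes a correlation of the form exp(−d/ρ) with a hypothetical range parameter `rho`, and uses plain Euclidean distance between coordinate pairs as a simplification; the function name and the toy inputs are illustrative only and are not taken from the paper.

```python
import math


def exp_corr_within_vessel(coords, vessel, rho):
    """Correlation exp(-d / rho) for observation pairs from the same vessel, 0 otherwise."""
    n = len(coords)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if vessel[i] == vessel[k]:
                d = math.dist(coords[i], coords[k])  # Euclidean distance (simplification)
                R[i][k] = math.exp(-d / rho)
    return R


# Toy inputs: two observations from vessel "A", one from vessel "B".
coords = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]
vessel = ["A", "A", "B"]
R = exp_corr_within_vessel(coords, vessel, rho=2.0)
```

Observations 0 and 2 share a location but not a vessel, so their correlation is zero: this is the sense in which the spatial correlation is nested within vessel.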
The model was estimated by likelihood maximization (REML estimates are very similar).
The salinity effect is estimated to be a straight line with slope very close to zero, and no sensible
model selection criterion would leave it in the model, so it was dropped. The standard deviation
of the vessel effect was estimated to be only 1% of the residual standard deviation, and the
spatial auto-correlation was similarly close to zero; however, these were not dropped, given their
role as nuisance factors included purely to avoid being misled about the other effects.
Figure 5 shows some slices through the estimated f1 : note the apparent preference for
relatively cool deep water, and the way that temperature preference does not seem to change
greatly with latitude. Figure 4 also shows predicted square root of egg density and its standard
deviation. Notice how the bulk of the distribution is off the shelf edge, and the survey area is
failing to cover the whole distribution: in part this is because the fish were expected to be rather
closer to the shelf edge (200 metre contour) than appears to actually be the case.
7 Computational issues
The computations reported above were conducted using R 1.8.1 (R Development Core Team,
2003). Basis and penalty construction were handled using the general basis construction facilities
provided in Wood’s (2000) package mgcv, which is written to allow straightforward definition
of new smooth classes. Model estimation was accomplished using Pinheiro and Bates’ (2000)
routine lme from their package nlme or Venables and Ripley’s (2004) routine glmmPQL from
their MASS library. To facilitate use of the non standard random effects covariance matrices and
the notLog parameterization of the λs with lme/glmmPQL two new pdMat classes (pdTens and
pdIdnot) were written, and are available on request along with a function for extracting the
matrix V required for confidence interval construction.
One issue of computational detail is the most efficient way in which to set up tensor product
bases. The approach given in section 3 is very straightforward, but it would be more computationally
efficient to identify the null space bases of each marginal smooth first, using the
methods of section 2, and then form the null space of the tensor product from the tensor product
of the marginal null spaces. However this is much more tedious to handle computationally, and
is unlikely to offer significant computational savings while the eigen-decomposition employed in
section 3 remains a small part of the total cost of model estimation.
8 Conclusion
The main innovation reported in this paper is a general method for producing low rank, scale
invariant tensor product smooths for inclusion into GAMMs and GAMs, which have a practically
useful smoothness range when smoothness is to be estimated as part of model fitting. The
importance of scale invariance is well illustrated in the first row of figure 3, where an example
of the suggested smooths is compared with an isotropic smooth: if the covariates of a smooth
are not on the same scale, then assuming isotropy can lead to very poor results, which the
suggested method overcomes by using a separate penalty for each covariate direction. The
importance of a useful smoothness range is illustrated by figure 5: a tensor product smooth
with a single penalty, with coefficient matrix constructed from a Kronecker product of marginal
penalty matrices, could not have represented a function as smooth as the one shown.
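The smoothness-range point can be checked numerically on a small case. The sketch below compares the dimension of the unpenalized space (the penalty null space) under separate per-direction penalties with that under a single Kronecker-product penalty, for a two-dimensional tensor product with 6 coefficients per margin. The marginal penalty is a second-order difference penalty, chosen here purely as a convenient stand-in and not the cubic regression spline penalty used in the examples; all function names are illustrative.

```python
from fractions import Fraction


def diff_penalty(k):
    """S = D'D for second differences D: a stand-in marginal penalty of rank k - 2."""
    D = [[0] * k for _ in range(k - 2)]
    for r in range(k - 2):
        D[r][r], D[r][r + 1], D[r][r + 2] = 1, -2, 1
    return [[sum(D[r][i] * D[r][j] for r in range(k - 2)) for j in range(k)]
            for i in range(k)]


def identity(k):
    return [[int(i == j) for j in range(k)] for i in range(k)]


def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def rank(M):
    """Exact rank by Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    rk = 0
    for col in range(len(M[0])):
        pivot = next((r for r in range(rk, len(M)) if M[r][col] != 0), None)
        if pivot is None:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for r in range(rk + 1, len(M)):
            if M[r][col] != 0:
                factor = M[r][col] / M[rk][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[rk])]
        rk += 1
        if rk == len(M):
            break
    return rk


k = 6
S, I = diff_penalty(k), identity(k)
separate = add(kron(S, I), kron(I, S))  # one penalty per covariate direction, summed
single = kron(S, S)                     # single Kronecker-product penalty

null_dim_separate = k * k - rank(separate)  # 4
null_dim_single = k * k - rank(single)      # 20
```

With separate penalties only a 4-dimensional space escapes penalization, whereas the single Kronecker-product penalty leaves 20 of the 36 coefficients' worth of freedom unpenalized however large its smoothing parameter becomes, so the smoothest achievable fit is far more flexible.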
Of course one way of obtaining ‘scale invariant’ smooths is to simply rescale all covariates of a
smooth to have the same range (or to construct a penalty that does this implicitly). This ad hoc
approach only works if the degree of variability of the smooth should really be the same relative
to all covariate axes: usually there is no way of knowing in advance if this is an appropriate
assumption or not. Instead, the tensor product approach suggested here amounts to estimating
the appropriate scaling of covariates relative to each other as an integral part of model fitting.
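A short worked calculation may make this concrete. It assumes, purely for illustration, that wiggliness is measured by integrated squared second derivatives; the penalties actually induced by the construction of section 3 differ in detail. Let $z' = cz$ and $g(x, z') = f(x, z'/c)$, so that $g_{z'z'} = c^{-2} f_{zz}$, $g_{xz'} = c^{-1} f_{xz}$ and $dz' = c\,dz$. For an isotropic penalty with a single smoothing parameter,

$$
\lambda \int\!\!\int \left( g_{xx}^2 + 2 g_{xz'}^2 + g_{z'z'}^2 \right) dx\, dz'
= \lambda \left( c \int\!\!\int f_{xx}^2 \, dx\, dz + 2 c^{-1} \int\!\!\int f_{xz}^2 \, dx\, dz + c^{-3} \int\!\!\int f_{zz}^2 \, dx\, dz \right),
$$

so rescaling z changes the relative weight given to wiggliness in each direction, and no choice of the single $\lambda$ can undo this. With one penalty per direction,

$$
\lambda_x \int\!\!\int g_{xx}^2 \, dx\, dz' + \lambda_z \int\!\!\int g_{z'z'}^2 \, dx\, dz'
= c\,\lambda_x \int\!\!\int f_{xx}^2 \, dx\, dz + c^{-3} \lambda_z \int\!\!\int f_{zz}^2 \, dx\, dz ,
$$

and the rescaling is absorbed exactly by replacing $\lambda_x$ with $\lambda_x / c$ and $\lambda_z$ with $c^3 \lambda_z$. Since the smoothing parameters are estimated, the fitted smooth is unaffected by the choice of units for z.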
These advantages of tensor product smooths with multiple penalties have been recognized in
the literature on smoothing spline ANOVA for some time (see e.g. Wahba, 1990 or Gu, 2002 for
good summaries), and in the penalized regression spline context Eilers and Marx (2002) have
demonstrated the utility of using separate penalties for smooths of two variables constructed
from tensor products of B-splines. The work reported here is hence a generalization of Eilers
and Marx (2002) to tensor product smooths of any number of variables constructed from any
combination of marginal smoothing bases and penalties and to a GAMM or GAM setting.
Similarly it brings the advantages of SS-ANOVA style tensor product smoothing to GAMs
based on low rank smooths and to GAMMs.
While the methods presented here provide a practically useful means for modelling smooth
interactions in GAMMs (and GAMs), there are obvious deficiencies in the approach. For example
the scale invariance has been obtained at the cost of rotational invariance: tensor product
smooths are not invariant to rotation of the covariate space. It might be useful to have truly
anisotropic smooths available within the GAMM or GAM frameworks. The difficulty is in finding
a way of doing this that can be fully integrated with such models. It is also not clear how much
would be gained in practice if such smooths could be produced. The practical difference between
being able to use scale invariant smooths and not is rather substantial, as the examples in this
paper partly illustrate, but typically the tensor product smooths seem to do a good enough
job at representing the underlying truth that full rotational invariance is likely to offer rather
limited additional benefits.
Another appealing generalization would allow locally varying degrees of smoothness. Lang
and Bresger (2004) have achieved this in a GAMM setting, but not in conjunction with scale
invariance, and a fully Bayesian approach to computation is required. Again, the fully local,
anisotropic (or even scale invariant) smooth that can easily be incorporated into GAMMs or
GAMs is not likely to be easy to achieve, even if data are available that might contain sufficient
information for reliable estimation of such terms.
In conclusion, the generality of the framework presented for constructing smooth terms for
inclusion in GAMMs and GAMs makes it straightforward to work with whatever basis and
quadratic penalty combination appears most suitable for representing the smooths in a model.
The production of scale invariant smooths with a useful smoothness range is automatic within
this framework and represents a useful advance. These advantages, in combination with the
numerically robust parameterization presented in section 4 and the use of computationally effi-
cient low rank smooth terms should help to make the use of Generalized Additive Mixed Models
a more routine undertaking than has previously been the case. The practical computational
advantages are underlined by the model fitting reported in section 6: the largest amount of
computer time required for an example model estimation was 80 seconds (for the full GAMM
fit to the second example; Pentium IV 1.7 GHz, Windows XP).
Acknowledgements
I’d like to thank Stefan Lang for very helpful discussion of GAMMs, Mark Bravington and
Sharon Hedley for useful discussions on the separability of smooth trend and autocorrelated
errors, Rod Ball for comments on the manuscript and helpful pointers and Douglas Bates for
help with lme.
References
Ball, R.D. (2003) lmeSplines. R News 3(3), 24-28
Borchers, D.L., S.T. Buckland, I.G. Priede and S. Ahmadi (1997) Improving the precision of
the daily egg production method using generalized additive models. Can. J. Fish. Aquat. Sci.
54:2727-2742.
Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in generalized linear mixed
models. Journal of the American Statistical Association 88: 9-25.
Eilers, P.H.C. and Marx, B.D. (1996) Flexible smoothing with B-splines and penalties. Statistical
Science 11:89-121
Eilers, P.H.C. and Marx, B.D. (2003) Multivariate calibration with temperature interaction using
two-dimensional penalized signal regression. Chemometrics and intelligent laboratory systems
66:159-174
Fahrmeir, L., Kneib, T. and Lang, S. (2004) Penalized structured additive regression for space-
time data: a Bayesian perspective. in press Statistica Sinica.
Fahrmeir, L. and Lang, S. (2001) Bayesian inference for generalized additive mixed models based
on Markov random field priors Applied Statistics 50, 201-220
Green, P.J. and Silverman, B.W. (1994) Nonparametric Regression and Generalized Linear Mod-
els Chapman and Hall, London.
Gu, C. (2002) Smoothing spline ANOVA models Springer-Verlag, New York
Hastie, T. (1996) Pseudosplines. J.R. Statist. Soc. B 58(2), 379-396.
Hastie, T.J. and Tibshirani, R.J. (1990) Generalized additive models. London, Chapman and
Hall.
Hastie, T.J. and Tibshirani, R.J. (1993) Varying coefficient models. J.R. Statist. Soc. B 55,
757-796.
Kammann, E.E. and Wand M.P. (2003) Geoadditive models. Applied Statistics 52:1-18
Lang, S. and Bresger, D. (2004) Bayesian P-splines. J. Comp. Graph. Statist. 13:183-212
Lin X. and Zhang, D. (1999) Inference in generalized additive mixed models using smoothing
splines. J.R. Statist. Soc. B 61, 381-400.
Pinheiro, J.C. and Bates, D.M. (2000) Mixed-Effects Models in S and S-PLUS Springer-Verlag,
New York.
R Development Core Team (2003). R: A language and environment for statistical comput-
ing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3, URL
http://www.R-project.org.
Ruppert, D., Wand, M.P. and Carroll, R.J. (2003) Semiparametric Regression. Cambridge
Silverman, B.W. (1985) Some aspects of the spline smoothing approach to nonparametric re-
gression curve fitting. J.R. Statist. Soc. B 47,1-52.
Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics in S 4th ed. , Springer-Verlag,
New York.
Verbyla, A.P., Cullis, B.R., Kenward, M.G. and Welham, S.J. (1999) The analysis of designed
experiments and longitudinal data by using smoothing splines Applied Statistics 48: 269-313
Wahba, G. (1990) Spline models for observational data. CBMS-NSF Reg. Conf. Ser. Appl. Math.:
59.
Wood, S.N. (2000) Modelling and smoothing parameter estimation with multiple quadratic
penalties. J. R. Statist. Soc. B 62, 413-428.
Wood, S.N. (2003) Thin plate regression splines. J. R. Statist. Soc. B 65, 95-114.
Wang, Y. (1998) Mixed effects smoothing spline analysis of variance J.R. Statist. Soc. B 60,
159-174.