You are on page 1of 53

Econometric Modelling

David F. Hendry Nufeld College, Oxford University. July 18, 2000


Abstract The theory of reduction explains the origins of empirical models, by delineating all the steps involved in mapping from the actual data generation process (DGP) in the economy far too complicated and high dimensional ever to be completely modeled to an empirical model thereof. Each reduction step involves a potential loss of information from: aggregating, marginalizing, conditioning, approximating, and truncating, leading to a local DGP which is the actual generating process in the space of variables under analysis. Tests of losses from many of the reduction steps are feasible. Models that show no losses are deemed congruent; those that explain rival models are called encompassing. The main reductions correspond to well-established econometrics concepts (causality, exogeneity, invariance, innovations, etc.) which are the null hypotheses of the mis-specication tests, so the theory has considerable excess content. General-to-specic (Gets) modelling seeks to mimic reduction by commencing from a general congruent specication that is simplied to a minimal representation consistent with the desired criteria and the data evidence (essentially represented by the local DGP). However, in small data samples, model selection is difcult. We reconsider model selection from a computer-automation perspective, focusing on general-to-specic reductions, embodied in PcGets an Ox Package for implementing this modelling strategy for linear, dynamic regression models. We present an econometric theory that explains the remarkable properties of PcGets. Starting from a general congruent model, standard testing procedures eliminate statistically-insignicant variables, with diagnostic tests checking the validity of reductions, ensuring a congruent nal selection. Path searches in PcGets terminate when no variable meets the pre-set criteria, or any diagnostic test becomes signicant. Non-rejected models are tested by encompassing: if several are acceptable, the reduction recommences from their union: if they re-appear, the search is terminated using the Schwartz criterion. Since model selection with diagnostic testing has eluded theoretical analysis, we study modelling strategies by simulation. The Monte Carlo experiments show that PcGets recovers the DGP specication from a general model with size and power close to commencing from the DGP itself, so model selection can be relatively non-distortionary even when the mechanism is unknown. Empirical illustrations for consumers expenditure and money demand will be shown live. Next, we discuss sample-selection effects on forecast failure, with a Monte Carlo study of their impact. This leads to a discussion of the role of selection when testing theories, and the problems inherent in conventional approaches. Finally, we show that selecting policy-analysis models by forecast accuracy is not generally appropriate. We anticipate that Gets will perform well in selecting models for policy.
Financial support from the UK Economic and Social Research Council under grant L138251009 Modelling Nonstationary Economic Time Series, R000237500, and Forecasting and Policy in the Evolving Macro-economy, L138251009, is gratefully acknowledged. The research is based on joint work with Hans-Martin Krolzig of Oxford University.

Contents
1 2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Theory of reduction . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Empirical models . . . . . . . . . . . . . . . . . . . . . 2.2 DGP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Data transformations and aggregation . . . . . . . . . . 2.4 Parameters of interest . . . . . . . . . . . . . . . . . . . 2.5 Data partition . . . . . . . . . . . . . . . . . . . . . . . 2.6 Marginalization . . . . . . . . . . . . . . . . . . . . . . 2.7 Sequential factorization . . . . . . . . . . . . . . . . . . 1 2.7.1 Sequential factorization of WT . . . . . . . . . 1 2.7.2 Marginalizing with respect to VT . . . . . . . . 2.8 Mapping to I(0) . . . . . . . . . . . . . . . . . . . . . . 2.9 Conditional factorization . . . . . . . . . . . . . . . . . 2.10 Constancy . . . . . . . . . . . . . . . . . . . . . . . . . 2.11 Lag truncation . . . . . . . . . . . . . . . . . . . . . . . 2.12 Functional form . . . . . . . . . . . . . . . . . . . . . . 2.13 The derived model . . . . . . . . . . . . . . . . . . . . 2.14 Dominance . . . . . . . . . . . . . . . . . . . . . . . . 2.15 Econometric concepts as measures of no information loss 2.16 Implicit model design . . . . . . . . . . . . . . . . . . . 2.17 Explicit model design . . . . . . . . . . . . . . . . . . . 2.18 A taxonomy of evaluation information . . . . . . . . . . General-to-specic modelling . . . . . . . . . . . . . . . . . . . 3.1 Pre-search reductions . . . . . . . . . . . . . . . . . . . 3.2 Additional paths . . . . . . . . . . . . . . . . . . . . . . 3.3 Encompassing . . . . . . . . . . . . . . . . . . . . . . . 3.4 Information criteria . . . . . . . . . . . . . . . . . . . . 3.5 Sub-sample reliability . . . . . . . . . . . . . . . . . . . 3.6 Signicant mis-specication tests . . . . . . . . . . . . . The econometrics of model selection . . . . . . . . . . . . . . . . 4.1 Search costs . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Selection probabilities . . . . . . . . . . . . . . . . . . 4.3 Deletion probabilities . . . . . . . . . . . . . . . . . . . 4.4 Path selection probabilities . . . . . . . . . . . . . . . . 4.5 Improved inference procedures . . . . . . . . . . . . . . PcGets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 The multi-path reduction process of PcGets . . . . . . . 5.2 Settings in PcGets . . . . . . . . . . . . . . . . . . . . . 5.3 Limits to PcGets . . . . . . . . . . . . . . . . . . . . . 5.3.1 Collinearity . . . . . . . . . . . . . . . . . . 5.4 Integrated variables . . . . . . . . . . . . . . . . . . . . Some Monte Carlo results . . . . . . . . . . . . . . . . . . . . . 6.1 Aim of the Monte Carlo . . . . . . . . . . . . . . . . . . 6.2 Design of the Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 5 5 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 9 9 9 9 10 10 11 11 11 11 11 11 14 15 15 16 17 18 19 22 23 24 25 25 25 26

6.3 Evaluation of the Monte Carlo . . . . . . . . . . 6.4 Diagnostic tests . . . . . . . . . . . . . . . . . . 6.5 Size and power of variable selection . . . . . . . 6.6 Test size analysis . . . . . . . . . . . . . . . . . 7 Empirical Illustrations . . . . . . . . . . . . . . . . . . . 7.1 DHSY . . . . . . . . . . . . . . . . . . . . . . . 7.2 UK Money Demand . . . . . . . . . . . . . . . . 8 Model selection in forecasting, testing, and policy analysis 8.1 Model selection for forecasting . . . . . . . . . . 8.1.1 Sources of forecast errors . . . . . . . 8.1.2 Sample selection experiments . . . . . 8.2 Model selection for theory testing . . . . . . . . 8.3 Model selection for policy analysis . . . . . . . . 8.3.1 Congruent modelling . . . . . . . . . . 9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 10 Appendix: encompassing . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

27 27 29 31 33 33 37 40 40 40 42 43 44 45 46 48 50

1 Introduction
The economy is a complicated, dynamic, non-linear, simultaneous, high-dimensional, and evolving entity; social systems alter over time; laws change; and technological innovations occur. Time-series data samples are short, highly aggregated, heterogeneous, non-stationary, time-dependent and interdependent. Economic magnitudes are inaccurately measured, subject to revision and important variables not unobservable. Economic theories are highly abstract and simplied, with suspect aggregation assumptions, change over time, and often rival, conicting explanations co-exist. In the face of this welter of problems, econometric modelling of economic time series seeks to discover sustainable and interpretable relationships between observed economic variables. However, the situation is not as bleak as it may seem, provided some general scientic notions are understood. The rst key is that knowledge accumulation is progressive: one does not need to know all the answers at the start (otherwise, no science could have advanced). Although the best empirical model at any point will later be supplanted, it can provide a springboard for further discovery. Thus, model selection problems (e.g., data mining) are not a serious concern: this is established below, by the actual behaviour of model-selection algorithms. The second key is that determining inconsistencies between the implications of any conjectured model and the observed data is easy. Indeed, the ease of rejection worries some economists about econometric models, yet is a powerful advantage. Conversely, constructive progress is difcult, because we do not know what we dont know, so cannot know how to nd out. The dichotomy between construction and destruction is an old one in the philosophy of science: critically evaluating empirical evidence is a destructive use of econometrics, but can establish a legitimate basis for models. To understand modelling, one must begin by assuming a probability structure and conjecturing the data generation process. However, the relevant probability basis is unclear, sincet the economic mechanism is unknown. Consequently, one must proceed iteratively: conjecture the process, develop the associated probability theory, use that for modelling, and revise the starting point when the results do not match consistently. This can be seen in the gradual progress from stationarity assumptions, through integrated-cointegrated systems, to general non-stationary, mixing processes: further developments will undoubtedly occur, leading to a more useful probability basis for empirical modelling. These notes rst review the theory of reduction in 2 to explain the origins of empirical models, then discuss some methodological issues that concern many economists. Despite the controversy surrounding econometric methodology, the LSE approach (see Hendry, 1993, for an overview) has emerged as a leading approach to empirical modelling. One of its main tenets is the concept of general-to-specic modelling (Gets general-to-specic): starting from a general dynamic statistical model, which captures the essential characteristics of the underlying data set, standard testing procedures are used to reduce its complexity by eliminating statistically-insignicant variables, checking the validity of the reductions at every stage to ensure the congruence of the selected model. Section 3 discusses Gets, and relates it to the empirical analogue of reduction. 
Recently econometric model-selection has been automated in a program called PcGets, which is an Ox Package (see Doornik, 1999, and Hendry and Krolzig, 1999a) designed for Gets modelling, currently focusing on reduction approaches for linear, dynamic, regression models. The development of PcGets has been stimulated by Hoover and Perez (1999), who sought to evaluate the performance of Gets. To implement a general-to-specic approach in a computer algorithm, all decisions must be mechanized. In doing so, Hoover and Perez made some important advances in practical modelling, and our approach builds on these by introducing further improvements. Given an initial general model, many reduction paths could be considered, and different selection strategies adopted for each path. Some of

these searches may lead to different terminal specications, between which a choice must be made. Consequently, the reduction process is inherently iterative. Should multiple congruent contenders eventuate after a reduction round, encompassing can be used to test between them, with only the surviving usually non-nested specications retained. If multiple models still remain after this testimation process, a new general model is formed from their union, and the simplication process re-applied. Should that union repeat, a nal selection is made using information criteria, otherwise a unique congruent and encompassing reduction has been located. Automating Gets throws further light on several methodological issues, and prompts some new ideas, which are discussed in section 4. While the joint issue of variable selection and diagnostic testing using multiple criteria has eluded most attempts at theoretical analysis, computer automation of the model-selection process allows us to evaluate econometric model-selection strategies by simulation. Section 6 presents the results of some Monte Carlo experiments to investigate if the model-selection process works well or fails badly; their implications for the calibration of PcGets are also analyzed. The empirical illustrations presented in section 7 demonstrate the usefulness of PcGets for applied econometric research. Section 8 then investigates model selection in forecasting, testing, and policy analysis and shows the drawbacks of some widely-used approaches.

2 Theory of reduction
First we dene the notion of an empirical model, then explain the the origins of such models by the theory of reduction. 2.1 Empirical models In an experiment, the output is caused by the inputs and can be treated as if it were a mechanism: yt = f (zt ) + t [output] [input] [perturbation] (1)

where yt is the observed outcome of the experiment when zt is the experimental input, f () is the mapping from input to output, and t is a small, random perturbation which varies between experiments conducted at the same values of z. Given the same inputs {zt }, repeating the experiment generates essentially the same outputs. In an econometric model, however: yt = g (zt ) + t [observed] [explanation] [remainder] (2)

yt can always be decomposed into two components, namely g (zt ) (the part explained) and t (unexplained). Such a partition is feasible even when yt does not depend on g (zt ). In econometrics:
t

= yt g (zt ) .

(3)

Thus, models can be designed by selection of zt . Design criteria must be analyzed, and lead to the notion of a congruent model: one that matches the data evidence on the measured attributes. Successive congruent models should be able to explain previous ones, which is the concept of encompassing, and thereby progress can be achieved.

2.2 DGP Let {ut } denote a stochastic process where ut is a vector of n random variables. Consider the sample U1 = (u1 . . . uT ), where U1 = (u1 . . . ut1 ). Denote the initial conditions by U0 = t1 T (. . . ur . . . u1 u0 ), and let Ut1 = U0 : U1 . The density function of U1 conditional on U0 t1 T is given by DU U1 |U0 , where DU () is represented parametrically by a k-dimensional vector of T parameters = (1 . . . k ) with parameter space Rk . All elements of need not be the same at each t, and some of the {i } may reect transient effects or regime shifts. The data generation process (DGP) of {ut } is written as: DU U1 | U0 , T with Rk . (4)

The complete sample {ut , t = 1, . . . , T } is generated from DU () by a population parameter value p . The sample joint data density DU U1 |U0 , is called the Haavelmo distribution (see e.g., Spanos, T 1989). The complete set of random variables relevant to the economy under investigation over t = 1, . . . T is denoted {u }, where denotes a perfectly measured variable and U1 = (u , . . . , u ), dened on t 1 T T the probability space (, F, P). The DGP induces U1 = (u1 , . . . , uT ) but U1 is unmanageably large. T T Operational models are dened by a sequence of data reductions, organized into eleven stages. 2.3 Data transformations and aggregation
1 1 1 One-one mapping of U1 to a new data set WT : U1 WT . The DGP of U1 , and so of WT is T T T characterized by the joint density: 1 DU U1 | U0 , 1 = DW WT | W0 , 1 T T T

(5)

where 1 and 1 , making parameter change explicit The transformation from U to W T T affects the parameter space, so is transformed into . 2.4 Parameters of interest M. Identiable, and invariant to an interesting class of interventions. 2.5 Data partition
1 Partition WT into the two sets: 1 1 WT = X1 : VT T

(6)

where the X1 matrix is T n. Everything about must be learnt from analyzing the X1 alone, so that T T 1 VT is not essential to inference about . 2.6 Marginalization
1 1 DW WT | W0 , 1 = DV|X VT | X1 , W0 , 1 DX X1 | W0 , 1 . T T a,T T b,T 1 1 Eliminate VT by discarding the conditional density DV|X VT |X1 , W0 , 1 T a,T

(7)

in (7), while retaining

the marginal density DX X1 |W0 , 1 . must be a function of 1 alone, given by = f 1 . T b,T b,T b,T A cut is required, so that 1 : 1 a,T b,T a b .

2.7 Sequential factorization To create the innovation process sequentially factorize X1 as: T D X X1 T Mean innovation error process |
t

W0 , 1 b,T

=
t=1

Dx xt | X1 , W0 , b,t . t1

(8)

= xt E xt |X1 . t1

1 2.7.1 Sequential factorization of WT .

Alternatively:
1 DW WT | W0 , 1 = T T

Dw (wt | Wt1 , t ) .
t=1

(9)

RHS innovation process is t = wt E

1 wt |Wt1

1 2.7.2 Marginalizing with respect to VT . 1 Dw (wt | Wt1 , t ) = Dv|x (vt | xt , Wt1 , a,t ) Dx xt | Vt1 , X1 , W0 , b,t , t1

(10)

as Wt1 = 1 Vt1 :

1 Vt1 , X1 , W0 t1

. must be obtained from { b,t } alone. Marginalize with respect to (11)

1 Dx xt | Vt1 , X1 , W0 , b,t = Dx xt | X1 , W0 , . t1 t1 b,t

No loss of information if and only if b,t = t, so the conditional, sequential distribution of {xt } b,t 1 does not depend on Vt1 (Granger non-causality). 2.8 Mapping to I(0) Needed to ensure conventional inference is valid, though many inferences will be valid even if this reduction is not enforced. Cointegration would need to be treated in a separate set of lectures. 2.9 Conditional factorization Factorize the density of xt into sets of n1 and n2 variables where n1 + n2 = n: xt = yt : zt , where the yt are endogenous and the zt are non-modelled. Dx xt | X1 , W0 , bt = Dy|z yt | zt , X1 , W0 , a,t Dz zt | X1 , W0 , b,t t1 t1 t1 zt is weakly exogenous for if (i) = f ( a,t ) alone; and (ii) ( a,t , b,t ) a b . 2.10 Constancy Complete parameter constancy is : a,t = a t where a a , so that is a function of a : = f ( a ).
T t=1

(12)

(13)

(14)

Dy|z yt | zt , X1 , W0 , a t1

(15)

with a .

2.11 Lag truncation Fix the extent of the history of X1 in (15) at s earlier periods: t1
ts Dy|z yt | zt , X1 , W0 , a = Dy|z yt | zt , Xt1 , W0 , . t1

(16)

2.12 Functional form


Map yt into yt = h (yt ) and zt into z = g (zt ), and denote the resulting data by X . Assume that yt t and z simultaneously make Dy |z () approximately normal and homoscedastic, denoted Nn1 [ t , ]: t ts Dy|z yt | zt , Xt1 , W0 , = Dy |z yt | z , Xts , W0 , t t1

(17)

2.13 The derived model

where t app Nn1 [0, ], and A (L) and B (L) are polynomial matrices (i.e., matrices whose elements are polynomials) of order s in the lag operator L. t is a derived, and not an autonomous, process dened by: (19) t = A (L) h (y)t B (L) g (z)t . The reduction to the generic econometric equation involves all the stages of aggregation, marginalization, conditioning etc., transforming the parameters from which determines the stochastic features of the data, to the coefcients of the empirical model. 2.14 Dominance Consider two distinct scalar empirical models denoted M1 and M2 with mean-innovation processes (MIPs) {t } and { t } relative to their own information sets, where t and t have constant, nite vari2 2 ances and 2 respectively. Then M1 variance dominates M2 if < 2 , denoted by M1 M2 . Variance dominance is transitive since if M1 M2 and M2 M3 then M1 M3 , and anti-symmetric since if M1 M2 then it cannot be true that M2 M1 . A model without a MIP error can be variance dominated by a model with a MIP on a common data set. The DGP cannot be variance dominated in the population by any models thereof (see e.g. Theil, 1971, p543). Let Ut1 denote the universe of information for the DGP and let Xt1 be the subset, with associated innovation sequences {u,t } and {x,t }. Then as {Xt1 } {Ut1 }, E [u,t |Xt1 ] = 0, whereas E [x,t |Ut1 ] need not be zero. A model with an innovation error cannot be variance dominated by a model which uses only a subset of the same information. If t = xt E [xt |Xt1 ], then 2 is no larger than the variance of any other empirical model error dened by t = xt G [xt |Xt1 ] whatever the choice of G []. The conditional expectation is the minimum mean-square error predictor. These implications favour general rather than simple empirical models, given any choice of information set, and suggest modelling the conditional expectation. A model which nests all contending explanations as special cases must variance dominate in its class. Let model Mj be characterized by parameter vector j with j elements, then as in Hendry and Richard (1982): M1 is parsimoniously undominated in the class {Mi } if i, 1 i and no Mi M1 . Model selection procedures (such as AIC or the Schwarz criterion: see Judge, Grifths, Hill, L tkepohl u and Lee (1985)) seek parsimoniously undominated models, but do not check for congruence.

A (L) h (y)t = B (L) g (z)t +

(18)

2.15 Econometric concepts as measures of no information loss [1] Aggregation entails no loss of information on marginalizing with respect to disaggregates when the retained information comprises a set of sufcient statistics for the parameters of interest . [2] Transformations per se do not entail any associated reduction but directly introduce the concept of parameters of interest, and indirectly the notions that parameters should be invariant and identiable. [3] Data partition is a preliminary although the decision about which variables to include and which to omit is perhaps the most fundamental determinant of the success or otherwise of empirical modelling. [4] Marginalizing with respect to vt is without loss providing the remaining data are sufcient for , 1 whereas marginalizing without loss with respect to Vt1 entails both Granger non-causality for xt and a cut in the parameters. [5] Sequential factorization involves no loss if the derived error process is an innovation relative to the history of the random variables, and via the notion of common factors, reveals that autoregressive errors are a restriction and not a generalization. [6] Integrated data systems can be reduced to I(0) by suitable combinations of cointegration and differencing, allowing conventional inference procedures to be applied to more parsimonious relationships. [7] Conditional factorization reductions, which eliminate marginal processes, lead to no loss of information relative to the joint analysis when the conditioning variables are weakly exogenous for the parameters of interest. [8] Parameter constancy, implicitly relates to invariance as constancy across interventions which affect the marginal processes. [9] Lag truncation involves no loss if the error process remains an innovation despite excluding some of the past of relevant variables. [10] Functional form approximations need involve no reduction (logs of log-normally distributed variables): e.g. when the two densities in (17) are equal. [11] The derived model, as a reduction of the DGP, is nested within that DGP and its properties are explained by the reduction process: knowledge of the DGP entails knowledge of all reductions thereof. When knowledge of one model entails knowledge of another, the rst is said to encompass the second. 2.16 Implicit model design This correcponds to the symptomatology approach in econometrics, testing for problems (autocorrelation, heteroscedasticity, omitted variables, multicollinearity, non-constant parameters etc.), and correcting these. 2.17 Explicit model design Mimic reduction theory in practical research to minimize the losses due to the reductions selected: leads to Gets modelling. 2.18 A taxonomy of evaluation information Partition the data X1 used in modelling into the three information sets: T [a] past data; [b] present data [c] future data. X1 = X1 : xt : Xt+1 T t1 T (20)

10

[d] theory information, which often is the source of parameters of interest, and is a creative stimulus in economics; [e] measurement information, including price index theory, constructed identities such as consumption equals income minus savings, data accuracy and so on; and: [f] data of rival models, which could be analyzed into past, present and future in turn. The six main criteria which result for selecting an empirical model are: [a] homoscedastic innovation errors; [b] weakly exogenous conditioning variables for the parameters of interest; [c] constant, invariant parameters of interest; [d] theory consistent, identiable structures; [e] data admissible formulations on accurate observations; and [f] encompass rival models. Models which satisfy the rst ve information sets are said to be congruent: an encompassing congruent model satises all six criteria.

3 General-to-specic modelling
The practical embodiment of reduction is general-to-specic (Gets) modelling. The DGP is replaced by the concept of the local DGP (LDGP), namely the joint distribution of the subset of variables under analysis. Then a general unrestricted model (GUM) is formulated to provide a congruent approximation to the LDGP, given the theoretical and previous empirical background. The empirical analysis commences from this general specication, after testing for mis-specications, and if none are apparent, is simplied to a parsimonious, congruent representation, each simplication step being checked by diagnostic testing. Simplication can be done in many ways: and although the goodness of a model is intrinsic to it, and not a property of the selection route, poor routes seem unlikely to deliver useful models. Even so, some economists worry about the impact of selection rules on the properties of the resulting models, and insist on the use of a priori specications: but these need knowledge of the answer before we start, so deny empirical modelling any useful role and in practice, it has rarely contributed. Few studies have investigated how well general-to-specic modelling does. However, Hoover and Perez (1999) offer important evidence in a major Monte Carlo, reconsidering the Lovell (1983) experiments. They place 20 macro variables in databank; generate one (y) as a function of 05 others; regress y on all 20 plus all lags thereof, then let their algorithm simplify that GUM till it nds a congruent (encompassing) irreducible result. They check up to 10 different paths, testing for mis-specication, collect the results from each, then select one choice from the remainder by following many paths, the algorithm is protected against chance false routes, and delivers an undominated congruent model. Nevertheless, Hendry and Krolzig (1999b) improve on their algorithm in several important respects and this section now describes these. 3.1 Pre-search reductions First, groups of variables are tested in the order of their absolute t-values, commencing with a block where all the p-values exceed 0.9, and continuing down towards the pre-assigned selection criterion, when deletion must become inadmissible. A less-stringent signicance level is used at this step, usually 10%, since the insignicant variables are deleted permanently. If no test is signicant, the F-test on all variables in the GUM has been calculated, establishing that there is nothing to model.

11

3.2 Additional paths Blocks of variables constitute feasible search paths, in addition to individual-coefcients, like the block F-tests in the preceding sub-section but along search paths. All paths that also commence with an insignicant t-deletion are explored. 3.3 Encompassing Encompassing tests select between the candidate congruent models at the end of path searches. Each contender is tested against their union, dropping those which are dominated by, and do not dominate, another contender. If a unique model results,select that; otherwise, if some are rejected, form the union of the remaining models, and repeat this round till no encompassing reductions result. That union then constitutes a new starting point, and the complete path-search algorithm repeats till the union is unchanged between successive rounds. 3.4 Information criteria When a union coincides with the original GUM, or with a previous union, so no further feasible reductions can be found, PcGets selects a model by an information criterion. The preferred nal-selection rule presently is the Schwarz criterion, or BIC, dened as: SC = 2 log L/T + p log(T )/T, where L is the maximized likelihood, p is the number of parameters and T is the sample size. For T = 140 and p = 40, minimum SC corresponds approximately to the marginal regressor satisfying |t| 1.9. 3.5 Sub-sample reliability For that nally-selected model, sub-sample reliability is evaluated by the HooverPerez overlapping split-sample test. PcGets concludes that some variables are denitely excluded; some denitely included, and some have an uncertain role, varying from a reliability of 25% (included in the nal model, but insignicant and insignicant in both sub-samples), through to 75% (signicant overall and in one sub-sample, or in both sub-samples). 3.6 Signicant mis-specication tests If the initial mis-specication tests are signicant at the pre-specied level, we raise the required signicance level, terminating search paths only when that higher level is violated. Empirical investigators would re-specify the GUM on rejection. To see why Gets does well, we develop the analytics for several of its procedures.

4 The econometrics of model selection


The key issue for any model-selection procedure is the cost of search, since there are always bound to be mistakes in statistical inference: specically, how bad is it to search across many alternatives? The conventional statistical analysis of repeated testing provides a pessimistic background: every test has a non-zero null rejection frequency (or size, if independent of nuisance parameters), and so type I errors

12

accumulate. Setting a small size for every test can induce low power to detect the inuences that really matter. Critics of general-to-specic methods have pointed to a number of potential difculties, including the problems of lack of identication, measurement without theory, data mining, pre-test biases, ignoring selection effects, repeated testing, and the potential path dependence of any selection: see inter alia, Faust and Whiteman (1997), Koopmans (1947), Lovell (1983), Judge and Bock (1978), Leamer (1978), Hendry, Leamer and Poirier (1990), and Pagan (1987). The following discussion draws on Hendry (2000a). Koopmans critique followed up the earlier attack by Keynes (1939, 1940) on Tinbergen (1940a, 1940b), and set the scene for doubting all econometric analyses that failed to commence from prespecied models. Lovells study of trying to select a small relation (zero to ve regressors) hidden in a large database (40 variables) found a low success rate, thereby suggesting that search procedures had high costs, and supporting an adverse view of data-based model selection. The third criticism concerned applying signicance tests to select variables, arguing that the resulting estimator was biased in general by being a weighted average of zero (when the variable was excluded) and an unbiased coefcient (on inclusion). The fourth concerned biases in reported coefcient standard errors from treating the selected model as if there was no uncertainty in the choice. The next argued that the probability of retaining variables that should not enter a relationship would be high because a multitude of tests on irrelevant variables must deliver some signicant outcomes. The sixth suggested that how a model was selected affected its credibility: at its extreme, we nd the claim in Leamer (1983) that the mapping is the message, emphasizing the selection process over the properties of the nal choice. In the face of this barrage of criticism, many economists came to doubt the value of empirical evidence, even to the extent of referring to it as a scientic illusion (Summers, 1991). The upshot of these attacks on empirical research was that almost all econometric studies had to commence from pre-specied models (or pretend they did). Summers (1991) failed to notice that this was the source of his claimed scientic illusion: econometric evidence had become theory dependent, with little value added, and a strong propensity to be discarded when fashions in theory changed. Much empirical evidence only depends on low-level theories which are part of the background knowledge base not subject to scrutiny in the current analysis so a data-based approach to studying the economy is feasible. Since theory dependence has at least as many drawbacks as sample dependence, data modelling procedures are essential: see Hendry (1995a). Indeed, all of these criticisms are refutable, as we now show. First, identication has three attributes, as discussed in Hendry (1997), namely uniqueness, satisfying the required interpretation, and correspondence to the desired entity. A non-unique result is clearly not identied, so the rst attribute is necessary, but insufcient, since uniqueness can be achieved by arbitrary restrictions (criticized by Sims, 1980, inter alia). There can exist a unique combination of several relationships which is incorrectly interpreted as one of those equations: e.g., a reduced form that has a positive price effect, wrongly interpreted as a supply relation. 
Finally, a unique, interpretable model of (say) a money-demand relation may in fact correspond to a Central Banks supply schedule, and this too is sometimes called a failure to identify the demand relation. Because economies are highly interdependent, simultaneity was long believed to be a serious problem, but higher frequencies of observation have attenuated this problem. Anyway, simultaneity is not invariant under linear transformations although linear systems are so can be avoided by eschewing contemporaneous regressors until weak exogeneity is established. Conditioning ensures a unique outcome, although it cannot guarantee that the resulting model corresponds to the underlying reality. Next, Keynes appears to have believed that statistical work in economics is impossible without

13

knowledge of everything in advance. But if partial explanations are devoid of use, and empirically we could discover nothing not already known, then no science could have progressed. That is clearly refuted by the historical record. The fallacy in Keyness argument is that since theoretical models are incomplete and incorrect, an econometrics that is forced to use such theories as the only permissible starting point for data analysis can contribute little useful knowledge, except perhaps rejecting the theories. When invariant features of reality exist, progressive research can discover them in part without prior knowledge of the whole: see Hendry (1995b). A similar analysis applies to the attack in Koopmans on the study by Burns and Mitchell: he relies on the (unstated) assumption that only one sort of economic theory is applicable, that it is correct, and that it is immutable (see Hendry and Morgan, 1995). Data mining is revealed when conicting evidence exists or when rival models cannot be encompassed and if they can, then an undominated model results despite the inappropriate procedure. Thus, stringent critical evaluation renders the data mining criticism otiose. Gilbert (1986) suggests separating output into two groups: the rst contains only redundant results (those parsimoniously encompassed by the nally-selected model), and the second contains all other ndings. If the second group is not null, then there has been data mining. On such a characterization, Gets cannot involve data mining, despite depending heavily on data basing. When the LDGP is known a priori from economic theory, but an investigator did not know that the resulting model was in fact true, so sought to test conventional null hypotheses on its coefcients, then inferential mistakes will occur in general. These will vary as a function of the characteristics of the LDGP, and of the particular data sample drawn, but for many parameter values, the selected model will differ from the LDGP, and hence have biased coefcients. This is the pre-test problem, and is quite distinct from the costs of searching across a general set of specications for a congruent representation of the LDGP. If a wide variety of models would be reported when applying any given selection procedure to different samples from a common DGP, then the results using a single sample apparently understate the true uncertainty. Coefcient standard errors only reect sampling variation conditional on a xed specication, with no additional terms from changes in that specication (see e.g., Chateld, 1995). Thus, reported empirical estimates must be judged conditional on the resulting equation being a good approximation to the LDGP. Undominated (i.e., encompassing) congruent models have a strong claim to provide such an approximation, and conditional on that, their reported uncertainty is a good measure of the uncertainty inherent in such a specication for the relevant LDGP. The theory of repeated testing is easily understood: the probability p that none of n tests rejects at 100% is: p = (1 )n . When 40 tests of correct null hypotheses are conducted at = 0.05, p0.05 0.13, whereas p0.005 0.89. However, it is difcult to obtain spurious t-test values much in excess of three despite repeated testing: as Sargan (1981) pointed out, the t-distribution is thin tailed, so even the 0.5% critical value is less than three for 50 degrees of freedom. Unfortunately, stringent criteria for avoiding rejections when the null is true lower the power of rejection when it is false. 
The logic of repeated testing is accurate as a description of the statistical properties of mis-specication testing: conducting four independent diagnostic tests at 5% will lead to about 19% false rejections. Nevertheless, even in that context, there are possible solutions such as using a single combined test which can substantially lower the size without too great a power loss (see e.g., Godfrey and Veale, 1999). It is less clear that the analysis is a valid characterization of selection procedures in general when more one path is searched, so there is no error correction for wrong reductions. In fact, the serious practical difculty is not one of avoiding

14

spuriously signicant regressors because of repeated testing when many hypotheses are tested, it is retaining all the variables that genuinely matter. Path dependence is when the results obtained in a modelling exercise depend on the simplication sequence adopted. Since the quality of a model is intrinsic to it, and progressive research induces a sequence of mutually-encompassing congruent models, proponents of Gets consider that the path adopted is unlikely to matter. As Hendry and Mizon (1990) expressed the matter: the model is the message. Nevertheless, it must be true that some simplications lead to poorer representations than others. One aspect of the value-added of the approach discussed below is that it ensures a unique outcome, so the path does not matter. We conclude that each of these criticisms of Gets can be refuted. Indeed, White (1990) showed that with sufciently-rigorous testing, the selected model will converge to the DGP. Thus, any overtting and mis-specication problems are primarily nite sample. Moreover, Mayo (1981) emphasized the importance of diagnostic test information being effectively independent of the sufcient statistics from which parameter estimates are derived. Hoover and Perez (1999) show how much better Gets is than any method Lovell considered, suggesting that modelling per se need not be bad. Indeed, overall, the size of their selection procedure is close to that expected, and the power is reasonable. Moreover, re-running their experiments using our version (PcGets) delivered substantively better outcomes (see Hendry and Krolzig, 1999b). Thus, the case against model selection is far from proved. 4.1 Search costs Let pdgp denote the probability of retaining the ith variable out of k when commencing from the DGP i specication and applying the relevant selection test at the same signicance level as the search procedure. Then 1 pdgp is the expected cost of inference. For irrelevant variables, pdgp 0, so that i i whole cost for those is attributed to search. Let pgum denote the probability of retaining the ith variable i when commencing from the GUM, and applying the same selection test and signicance level. Then, the search costs are pdgp pgum . False rejection frequencies of the null can be lowered by increasing i i the required signicance levels of selection tests, but only at the cost of also reducing power. However, it is feasible to lower the former and raise the latter simultaneously by an improved search algorithm, subject to the bound of attaining the same performance as knowing the DGP from the outset. To keep search costs low, any model-selection process must satisfy a number of requirements. First, it must start from a congruent statistical model to ensure that selection inferences are reliable: consequently, it must test for model mis-specication initially, and such tests must be well calibrated (nominal size close to actual). Secondly, it must avoid getting stuck in search paths that initially inadvertently delete relevant variables, thereby retaining many other variables as proxies: consequently, it must search many paths. Thirdly, it must check that eliminating variables does not induce diagnostic tests to become signicant during searches: consequently, model mis-specication tests must be computed at every stage. Fourthly, it must ensure that any candidate model parsimoniously encompasses the GUM, so no loss of information has occurred. 
Fifthly, it must have a high probability of retaining relevant variables: consequently, a loose signicance level and powerful selection tests are required. Sixthly, it must have a low probability of retaining variables that are actually irrelevant: consequently, this clashes with the fth objective in part, but requires an alternative use of the available information. Finally, it must have powerful procedures to select between the candidate models, and any models derived from them, to end with a good model choice, namely one for which:
k

L=
i=1

pdgp pgum i i

15

is close to zero. 4.2 Selection probabilities When searching a large database for that DGP, an investigator could well retain the relevant regressors much less often than when the correct specication is known, in addition to retaining irrelevant variables in the nally-selected model. We rst examine the problem of retaining signicant variables commencing from the DGP, then turn to any additional power losses resulting from search. For a regression coefcient i , hypothesis tests of the null H0 : i = 0 will reject with a probability dependent on the non-centrality parameter of the test. We consider the slightly more general setting where t-tests are used to check an hypothesis, denoted t(n, ) for n degrees of freedom, when is the non-centrality parameter, equal to zero under the null. For a critical value c , P (|t| c |H0 ) = where H0 implies = 0. The following table records some approximate power calculations when one coefcient null hypothesis is tested and when four are tested, in each case, precisely once. t-test powers n P (|t| c ) P (|t| c )4 1 100 0.05 0.16 0.001 2 50 0.05 0.50 0.063 2 100 0.01 0.26 0.005 3 50 0.01 0.64 0.168 4 50 0.05 0.98 0.902 4 50 0.01 0.91 0.686 6 50 0.01 1.00 0.997 Thus, there is little hope of retaining variables with = 1, and only a 5050 chance of retaining a single variable with a theoretical |t| of 2 when the critical value is also 2, falling to 3070 for a critical value of 2.6. When = 3, the power of detection is sharply higher, but still leads to more than 35% mis-classications. Finally, when = 4, one such variable will almost always be retained. However, the nal column shows that the probability of retaining all four relevant variables with the given non-centrality is essentially negligible even when they are independent, except in the last few cases. Mixed cases (with different values of ) can be calculated by multiplying the probabilities in the fourth column (e.g., for = 2, 3, 4, 6 the joint P () = 0.15 at = 0.01). Such combined probabilities are highly non-linear in , since one is almost certain to retain all four when = 6, even at a 1% signicance level. The important conclusion is that, despite knowing the DGP, low signalnoise variables will rarely be retained using t-tests when there is any need to test the null; and if there are many relevant variables, all of them are unlikely to be retained even when they have quite large non-centralities. 4.3 Deletion probabilities The most extreme case where low deletion probabilities might entail high search costs is when many variables are included but none actually matters. PcGets systematically checks the reducibility of the GUM by testing simplications up to the empty model. A one-off F-test FG of the GUM against the null model using critical value c would have size P (FG c ) = under the null if it was the only test implemented. Consequently, path searches would only commence % of the time, and some of these could also terminate at the null model. Let there be k regressors in the GUM, of which n are retained

16

when t-test selection is used should the null model be rejected. In general, when there are no relevant variables, the probability of retaining no variables using t-tests with critical value c is: P (|ti | < c i = 1, . . . , k) = (1 )k . Combining (21) with the FG -test, the null model will be selected with approximate probability: pG = (1 ) + (1 )k , (22) (21)

where is the probability of FG rejecting yet no regressors being retained (conditioning on FG c cannot decrease the probability of at least one rejection). Since is set at quite a high value, such as 0.20, whereas = 0.05 is more usual, FG c0.20 can occur without any |ti | c0.05 . Evaluating (22) for = 0.20, = 0.05 and k = 20 yields pG 0.87; whereas the re-run of the HooverPerez experiments with k = 40 reported by Hendry and Krolzig (1999b) using = 0.01 yielded 97.2% in the Monte Carlo as against a theory prediction from (22) of 99%. Alternatively, when = 0.1 and = 0.01 (22) has an upper bound of 96.7%, falling to 91.3% for = 0.05. Thus, it is relatively easy to obtain a high probability of locating the null model, even when 40 irrelevant variables are included, using relatively tight signicance levels, or a reasonable probability for looser signicance levels. 4.4 Path selection probabilities We now calculate how many spurious regressors will be retained in path searches. The probability distribution of one or more null coefcients being signicant in pure t-test selection at signicance level is given by the k + 1 terms of the binomial expansion of: ( + (1 ))k . The following table illustrates by enumeration for k = 3: event probability number retained P (|ti | < c , i = 1, . . . 3) (1 )3 0 P (|ti | c | |tj | < c , j = i) 3 (1 )2 1 2 P (|ti | < c | |tj | c , j = i) 3 (1 ) 2 3 P (|ti | c , i = 1, . . . 3) 3 Thus, for k = 3, the average number of variables retained is: n = 3 3 + 2 3 (1 ) 2 + 3 (1 )2 = 3 = k. The result n = k is general. When = 0.05 and k = 40, n equals 2, falling to 0.4 for = 0.01: so even if only t-tests are used, few spurious variables will be retained. Combining the probability of a non-null model with the number of variables selected when the GUM F-test rejects: p = , (where p is the probability any given variable will be retained), which does not depend on k. For = 0.1, = 0.01, we have p = 0.001. Even for = 0.25 and = 0.05, p = 0.0125 before search paths and diagnostic testing are included in the algorithm. The actual behaviour of PcGets is much more complicated than this, but can deliver a small overall size. Following the event FG c when = 0.1 (so the null is incorrectly rejected 10% of the time), and approximating by 0.5 variables retained when

17

that occurs, then the average non-deletion probability (i.e., the probability any given variable will be retained) is pr = n/k = 0.125%, as against the reported value of 0.19% found by Hendry and Krolzig (1999b). These are very small retention rates of spuriously-signicant variables. Thus, in contrast to the relatively high costs of inference discussed in the previous section, those of search arising from retaining additional irrelevant variables are almost negligible. For a reasonable GUM with (say) 40 variables where 25 are irrelevant, even without the pre-selection and multiple path searches of PcGets, and using just t-tests at 5%, roughly one spuriously signicant variable will be retained by chance. Against that, from the previous section, there is at most a 50% chance of retaining each of the variables that have non-centralites around 2, and little chance of keeping them all: the difcult problem is retention of relevance, not elimination of irrelevance. The only two solutions are better inference procedures, or looser critical values; we will consider them both. 4.5 Improved inference procedures An inference procedure involves a sequence of steps. As a simple example, consider a procedure comprising two F-tests: the rst is conducted at the = 50% level, the second at = 5%. The variables to be tested are rst ordered by their t-values in the GUM, such that t2 t2 t2 , and the rst F-test 1 2 k adds in variables from the smallest observed t-values till a rejection would occur, with either F1 > c or an individual |t| > c (say). All those variables except the last are then deleted from the model, and a second F-test conducted of the null that all remaining variables are signicant. If that rejects, so F2 > c , all the remaining variables are retained, otherwise, all are eliminated. We will now analyze the probability properties of this 2-step test when all k regressors are orthogonal for a regression model estimated from T observations. Once m variables are included in the rst step, non-rejection requires that (a) the diagnostics are insignicant; (b) m 1 variables did not induce rejection, (c) |tm | < c and (d): F1 (m, T k) t2 i 1 m
m i=1

t2 c . i

(23)

Clearly, any 1 reduces the mean F1 statistic, and since P(|ti | < 1) = 0.68, when k = 40, approximately 28 variables fall in that group; and P(|ti | 1.65) = 0.1 so only 4 variables should chance to have a larger |ti | value on average. In the conventional setting where = 0.05 with P(|ti | < 2) 0.95, only 2 variables will chance to have larger t-values, whereas slightly more than half will have t2 < 0.5 or smaller. Since P(F1 (20, 100) < 1|H0 ) 0.53, a rst step with = 0.5 i 2 1, and some larger t-values as well hence the need to check should eliminate all variables with ti that |tm | < c (below we explain why collinearity between variables that matter and those that do not should not jeopardize this step). A crude approximation to the likely value of (23) under H0 is to treat all t-values within blocks as having a value equal to the mid-point. We use the ve ranges t2 < 0.5, 1, 1.652 , 4, and greater than 4, i using the expected numbers falling in each of the rst four blocks, which yields: 1 31.8 0.25 20 + 0.75 8 + 1.332 8 + 1.822 2 = 0.84, 38 38 noting P(F1 (38, 100) < 0.84|H0 ) 0.72 (setting all ts equal to the upper bound of each block yields an illustrative upper bound of about 1.3 for F1 ). Thus, surprisingly-large values of , such as 0.75, can be selected for this step yet have a high probability of eliminating almost all the irrelevant variables. Indeed, using = 0.75 entails c 0.75 when m = 20, since: F1 (38, 100) P (F1 (20, 100) < 0.75 | H0 ) 0.75,

18 or c 0.8 for m = 30. When the second F-test is invoked for a null model, it will falsely reject more than % of the time since all small t-values have already been eliminated, but the resulting model will still be tiny in comparison to the GUM. Conversely, this procedure has a much higher probability of retaining a block of relevant variables. For example, commencing with 40 regressors of which m = 35 (say) were correctly eliminated, should the 5 remaining variables all have expected t-values of two the really difcult case in section 4.2 then: E [F2 (5, 100)] When = 0.05, c = 2.3 and: P F2 (5, 100) 2.3 | F2 = 4 > 0.99, (using a non-central 2 () approximation to F2 ), thereby almost always retaining all ve relevant variables. This is obviously a dramatic improvement over the near zero probability of retaining all ve variables using t-tests on the DGP in section 4.2. Practical usage of PcGets suggests its operational characteristics are well described by this analysis. 1 5
40 i=36

E t2 i

4.

(24)

5 PcGets
PcGets attempts to meet all of the criteria in section 4. First, it always starts from a congruent general linear, dynamic statistical model, using a battery of mis-specication tests to ensure congruence. Secondly, it is recommended that the GUM have near orthogonal, non-integrated regressors so test outcomes are relatively orthogonal. Then PcGets pre-selection tests for variables at a loose signicance level (25% or 10%, say), to remove variables that are highly irrelevant then simplies the model to be searched accordingly by eliminating those variables. It then explores multiple selection paths each of which begins by eliminating one or more statistically-insignicant variables, with diagnostic tests checking the validity of all reductions, thereby ensuring a congruent nal model. Path searches continue till no further reductions are feasible, or a diagnostic tests rejects. All the viable terminal selections resulting from these search paths are stored. If there is more than one terminal model, parsimonious encompassing tests are conducted of each against their union to eliminate models that are dominated and do not dominate any others. If a non-unique outcome does not result, the search procedure is repeated from the new union. Finally, if mutually-encompassing contenders remain, information criteria are used to select between these terminal reductions. Additionally, sub-sample signicance is used to assess the reliability of the resulting model choice. For further details, see e.g., Hendry and Krolzig (1999b). There is little research on how to design model-search algorithms in econometrics. The search procedure must have a high probability of retaining variables that do matter in the LDGP, and eliminating those that do not. To achieve that goal, PcGets uses encompassing tests between alternative reductions. Balancing the objectives of small size and high power still involves a trade-off, but one that is dependent on the algorithm: the upper bound is probably determined by the famous lemma in Neyman and Pearson (1928). Nevertheless, to tilt the size-power balance favourably, sub-sample information is also exploited, building on the further development in Hoover and Perez of investigating split samples for signicance (as against constancy). Since non-central t-values diverge with increasing sample size, whereas central ts uctuate around zero, the latter have a low probability of exceeding any given critical value in two sub-samples, even when those sample overlap. Thus, adventitiously-signicant variables may be revealed by their insignicance in one or both of the sub-samples.

19

PcGets embodies some further developments. First, PcGets undertakes pre-search simplication F-tests to exclude variables from the general unrestricted model (GUM), after which the GUM is reformulated. Since variables found to be irrelevant on such tests are excluded from later analyses, this step uses a loose signicance level (such as 10%). Next, many possible paths from that GUM are investigated: reduction paths considered include both multiple deletions as well as single, so t and/or F test statistics are used as simplication criteria. The third development concerns the encompassing step: all distinct contending valid reductions are collected, and encompassing is used to test between these (usually non-nested) specications. Models which survive encompassing are retained; all encompassed equations are rejected. If multiple models survive this testimation process, their union forms a new general model, and selection path searches recommence. Such a process repeats till a unique contender emerges, or the previous union is reproduced, then stops. Fourthly, the diagnostic tests require careful choice to ensure they characterize the salient attributes of congruency, are correctly sized, and do not overly restrict reductions. A further improvement concerns model choice when mutually-encompassing distinct models survive the encompassing step. A minimum standard error rule, as used by Hoover and Perez (1999), will probably over-select as it corresponds to retaining all variables with |t| > 1. Instead, we employ information criteria which penalize the likelihood function for the number of parameters. Finally, sub-sample information is used to accord a reliability score to variables, which investigators may use to guide their model choice. In Monte Carlo experiments, a progressive research strategy (PRS) can be formulated in which decisions on the nal model choice are based on the outcomes of such reliability measure. 5.1 The multi-path reduction process of PcGets The starting point for Gets model-selection is the general unrestricted model, so the key issues concern its specication and congruence. The larger the initial regressor set, the more likely adventitious effects will be retained; but the smaller the GUM, the more likely key variables will be omitted. Further, the less orthogonality between variables, the more confusion the algorithm faces, leading to a proliferation of mutual-encompassing models, where nal choices may only differ marginally (e.g., lag 2 versus 1).1 Finally, the initial specication must be congruent, with no mis-specication tests failed at the outset. Empirically, the GUM would be revised if such tests rejected, and little is known about the consequences of doing so (although PcGets will enable such studies in the near future). In Monte Carlo experiments, the program automatically changes the signicance levels of such tests. The reduction path relies on a classical, sequential-testing approach. The number of paths is increased to try all single-variable deletions, as well as various block deletions from the GUM. Different critical values can be set for multiple and single selection tests, and for diagnostic tests. Denote by the signicance level for the mis-specication tests (diagnostics) and by the signicance level for the selection t-tests (we ignore F tests for the moment). The corresponding p-values of these are denoted and , respectively. During the specication search, the current specication is simplied only if no diagnostic test rejects its null. 
This corresponds to a likelihood-based model evaluation, where the likelihood function of model M is given by the density:

L_M(θ_M) = f_M(Y; θ_M)  if min[π_M(Y; θ̂_M) − η] ≥ 0,  and  L_M(θ_M) = 0  if min[π_M(Y; θ̂_M) − η] < 0,

where f_M(Y; θ_M) is the probability density function (pdf) associated with model M at the parameter vector θ_M, for the sample Y. The vector of test-statistic p-values, π_M(Y; θ_M), is evaluated at the maximum likelihood estimate θ̂_M under model M, and mapped into its marginal rejection probabilities. So the pdf of model M is only accepted as the likelihood function if the sample information coheres with the underlying assumptions of the model itself.

¹ Some empirical examples for autoregressive-distributed lag (ADL) models and single-equation equilibrium-correction models (EqCM) are presented in section 7.

In Monte Carlo experiments, PcGets sets the significance levels of the mis-specification tests endogenously: when a test of the DGP (or true model) reveals a significant diagnostic outcome (as must happen when tests have a non-zero size), the significance level is adjusted accordingly. In the event that the GUM fails a mis-specification test at the desired significance level η, a more stringent critical value is used. If the GUM also fails at the reduced significance level η∗ < η, the test statistic is excluded from the test battery during the following search. Thus, for the kth test:

η_k = η    if p_{k,GUM}(Y, θ̂_GUM) ∈ [η, 1]      (desired significance level)
      η∗   if p_{k,GUM}(Y, θ̂_GUM) ∈ [η∗, η)     (reduced significance level)
      0    if p_{k,GUM}(Y, θ̂_GUM) ∈ [0, η∗)     (test excluded)

where 0 < η∗ < η < 1.²

Each set of search paths is ended by an encompassing step (see e.g., Mizon and Richard, 1986, and Hendry and Richard, 1989). Used as the last step of model selection, encompassing seems to help control the size resulting from many path searches. When a given path eliminates a variable x that matters, other variables proxy such an effect, leading to a spuriously large and mis-specified model. However, some other paths are likely to retain x, and in the encompassing tests the proxies will frequently be revealed as conditionally redundant, inducing a smaller final model focused on the genuine causal factors.
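A minimal sketch of these two rules (accepting a reduction only if no retained diagnostic rejects, and down-weighting or excluding a diagnostic that the GUM itself fails) is given below; the dictionary interface and the chosen η and η∗ values are assumptions for illustration, not the PcGets implementation:

# Sketch (assumed interface): diagnostic p-values are supplied as {test_name: p_value};
# eta is the desired significance level, eta_star < eta the reduced level used when the
# GUM itself fails a test.

def calibrate_battery(gum_pvalues, eta=0.01, eta_star=0.001):
    """Assign each diagnostic its level from the GUM outcome: keep eta if the GUM
    passes, fall back to eta_star if it fails at eta but passes at eta_star, and
    exclude the test (level 0) otherwise."""
    levels = {}
    for test, p in gum_pvalues.items():
        if p >= eta:
            levels[test] = eta        # desired significance level
        elif p >= eta_star:
            levels[test] = eta_star   # reduced significance level
        else:
            levels[test] = 0.0        # test excluded from the battery
    return levels

def reduction_is_congruent(model_pvalues, levels):
    """A simplification is only accepted if no retained diagnostic rejects."""
    return all(model_pvalues[test] >= level
               for test, level in levels.items() if level > 0.0)

# Illustrative usage with made-up p-values:
gum = {"AR(1-5)": 0.30, "normality": 0.04, "hetero": 0.0004}
levels = calibrate_battery(gum, eta=0.01, eta_star=0.001)
print(levels)   # hetero is excluded; normality keeps eta since 0.04 >= 0.01
print(reduction_is_congruent({"AR(1-5)": 0.20, "normality": 0.02, "hetero": 0.9}, levels))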

² In contrast, Hoover and Perez (1999) drop such a test from the checking set (so an ever-increasing problem of that type may lurk undetected). Their procedure was justified on the grounds that if the GUM failed a specification test in a practical application, then an LSE economist would expand the search universe to more variables, more lags, or transformations of the variables. In a Monte Carlo setting, however, it seems better to initially increase the nominal level for rejection, and if, during any search path, that higher level is exceeded, then stop; we find that sometimes such GUM tests cease to be significant as reduction proceeds, and sometimes increase to reveal a flawed path.


Table 1  The PcGets algorithm.

Stage I
(1) Estimation and testing of the GUM
  (a) If all variables are significant, the GUM is the final model, and the algorithm stops;
  (b) if a diagnostic test fails for the GUM, its significance level is adjusted or the test is excluded from the test battery during simplifications of the GUM;
  (c) otherwise, search paths start by removing an insignificant variable, or a set of insignificant variables.
(2) Multiple reduction paths: sequential simplification and testing of the GUM
  (a) If any diagnostic tests fail, that path is terminated, and the algorithm returns to the last accepted model of the search path:
    (i) if the last accepted model cannot be further reduced, it becomes the terminal model of the particular search path;
    (ii) otherwise, the last removed variable is re-introduced, and the search path continues with a new reduction by removing the next least-insignificant variable of the last accepted model.
  (b) If all tests are passed, but one or more variables are insignificant, the least significant variable is removed: if that specification has already been tested on a previous path, the current search path is terminated;
  (c) if all diagnostic tests are passed, and all variables are significant, the model is the terminal model of that search path.
(3) Encompassing
  (a) If none of the reductions is accepted, the GUM is the final model;
  (b) if only one model survives the testimation process, it is the final model;
  (c) otherwise, the terminal models are tested against their union:
    (i) if all terminal models are rejected, their union is the final model;
    (ii) if exactly one of the terminal models is not rejected, it is the final model;
    (iii) otherwise, rejected models are removed, and the remaining terminal models tested against their union:
      1. if all remaining terminal models are rejected, their union is the final model;
      2. if exactly one remaining terminal model is not rejected, it is the final model;
      3. otherwise, the union of the surviving models becomes the GUM of Stage II.

Stage II
(1) Estimation and testing of the GUM as in Stage I (significance levels remain fixed)
(2) Multiple reduction paths as in Stage I
(3) Encompassing and final model selection
  (a) If only one model survives the testimation process of Stage II, it is the final model;
  (b) otherwise, the terminal models of Stage II are tested against their union:
    (i) if all terminal models are rejected, their union is the final model;
    (ii) if exactly one terminal model is not rejected, it is the final model;
    (iii) otherwise, the set of non-dominated terminal models is reported, or information criteria are applied to select a unique final model.
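The control flow of the Stage I multi-path search in Table 1 can be sketched in a few lines of code; the toy DGP, the single crude residual-autocorrelation diagnostic, and the restriction to single-variable deletions are simplifying assumptions relative to PcGets:

import numpy as np
from scipy import stats

# Simplified sketch of the Stage I multi-path search: t-tests at level alpha, one
# crude diagnostic standing in for the full battery, single-variable deletions only.
rng = np.random.default_rng(1)
T = 100
X = rng.standard_normal((T, 6))
y = 0.4 * X[:, 0] + 0.3 * X[:, 1] + rng.standard_normal(T)   # DGP uses regressors 0 and 1

def ols(cols):
    """Return {column: p-value} for the retained regressors plus a diagnostic p-value."""
    Z = X[:, cols]
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ b
    s2 = e @ e / (T - len(cols))
    se = np.sqrt(np.diag(s2 * np.linalg.inv(Z.T @ Z)))
    pvals = dict(zip(cols, 2 * stats.t.sf(np.abs(b / se), T - len(cols))))
    r1 = np.corrcoef(e[1:], e[:-1])[0, 1]                 # first-order residual correlation
    diag_p = 2 * stats.norm.sf(abs(r1) * np.sqrt(T))      # rough AR(1) diagnostic
    return pvals, diag_p

def search(cols, alpha=0.05, eta=0.01, visited=None, terminals=None):
    visited = set() if visited is None else visited
    terminals = set() if terminals is None else terminals
    key = frozenset(cols)
    if key in visited:                         # specification already examined on another path
        return terminals
    visited.add(key)
    pvals, _ = ols(cols)
    insig = [c for c in cols if pvals[c] > alpha]
    if not insig:
        terminals.add(key)                     # all retained variables significant: terminal
        return terminals
    reduced_somewhere = False
    for c in sorted(insig, key=pvals.get, reverse=True):   # start with the least significant
        reduced = [k for k in cols if k != c]
        if not reduced:
            continue
        _, diag_p = ols(reduced)
        if diag_p > eta:                       # a reduction is kept only if still congruent
            reduced_somewhere = True
            search(reduced, alpha, eta, visited, terminals)
    if not reduced_somewhere:
        terminals.add(key)                     # no congruent reduction: terminal model
    return terminals

print([sorted(m) for m in search(list(range(6)))])         # terminal models, e.g. [[0, 1]]

In PcGets proper, the resulting set of terminal models would then go to the encompassing step of Table 1 (3), rather than being reported directly.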


The selection of the final model also improves upon Hoover and Perez (1999). Instead of selecting the best-fitting equation, PcGets focuses on encompassing testing between the candidate congruent selections. If a unique choice occurs, then the algorithm is terminated; otherwise, the union of the variables is formed as a new starting point for the reduction. Should that coincide with the previous union, then a model is selected by an information criterion (AIC, HQ, SC); otherwise the algorithm retries all the available paths again from that smaller union: if no simpler encompassing congruent model appears, the final choice is by AIC, HQ or SC, etc.³ Table 1 records details of the basic algorithm.
To control the overall size of the model-selection procedure, two extensions of the original algorithm were adopted. First, the introduction before Stage I of block (F) tests of groups of variables, ordered by their t-values in the GUM (but potentially according to economic theory). This set includes the overall F-test of all regressors, to check that there is something to model. Variables that are insignificant at this step, usually at a liberal critical value, are eliminated from the analysis, and a smaller GUM is formulated. Secondly, following Hoover and Perez (1999), Stage III was introduced as a check for potential over-selection in Stage II, using a sub-sample split to eliminate problematic variables from the reduction search. This mimics the idea of recursive estimation, since a central t-statistic wanders around the origin, while a non-central t diverges. Thus, the ith variable might be significant by chance for T1 observations, yet not for T2 > T1, whereas the opposite holds for the jth if there is not too much sample overlap. Consequently, a progressive research strategy (shown as PRS below) can gradually eliminate adventitiously-significant variables. Hoover and Perez (1999) found that by adopting a progressive search procedure (as in Stage III), the number of spurious regressors can be lowered (inducing a lower overall size), without losing much power. Details of the resulting algorithm are shown in Table 2.

5.2 Settings in PcGets
The testimation process of PcGets depends on the choice of: the n diagnostic checks in the test battery; the parameters of these diagnostic tests; the significance levels of the n diagnostics; pre-search F-test simplification; the significance levels of such tests; the simplification tests (t and/or F); the significance levels of the simplification tests; the significance levels of the encompassing tests; the sub-sample split; the significance levels of the sub-sample tests; and the weights accorded to measure reliability.

The choice of mis-specification alternatives determines the number and form of the diagnostic tests. Their individual significance levels in turn determine the overall significance level of the test battery. Since significant diagnostic-test values terminate search paths, they act as constraints on moving away
³ The information criteria are defined as follows:
AIC = −2 log L/T + 2n/T,
SC = −2 log L/T + n log(T)/T,
HQ = −2 log L/T + 2n log(log(T))/T,
where L is the maximized likelihood, n is the number of parameters and T is the sample size: see Akaike (1985), Schwarz (1978), and Hannan and Quinn (1979).
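For concreteness, these three criteria can be transcribed directly as follows; the log-likelihood, parameter count and sample size in the usage line are made-up numbers for illustration:

import math

def info_criteria(loglik, n_params, T):
    """AIC, SC and HQ as defined above: -2 log L / T plus a penalty in n and T."""
    base = -2.0 * loglik / T
    return {
        "AIC": base + 2.0 * n_params / T,
        "SC":  base + n_params * math.log(T) / T,
        "HQ":  base + 2.0 * n_params * math.log(math.log(T)) / T,
    }

# Illustrative values only: a log-likelihood of 350 with 6 parameters and T = 100.
print(info_criteria(350.0, 6, 100))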


Table 2  Additions to the basic PcGets algorithm.

Stage 0
(1) Pre-simplification and testing of the GUM
  (a) If a diagnostic test fails for the GUM, the significance level of that test is adjusted, or the test is excluded from the test battery during simplifications of the GUM;
  (b) if all variables are significant, the GUM is the final model, and the algorithm stops;
  (c) otherwise, F-tests of sets of individually-insignificant variables are conducted:
    (i) if one or more diagnostic tests fails, that F-test reduction is cancelled, and the algorithm returns to the previous step;
    (ii) if all diagnostic tests are passed, the blocks of variables that are insignificant are removed and a simpler GUM specified;
    (iii) if all diagnostic tests are passed, and all blocks of variables are insignificant, the null model is the final model.

Stage III
(1) Post-selection sub-sample evaluation
  (a) Test the significance of every variable in the final model from Stage II in two overlapping sub-samples (e.g., the first and last r%):
    (i) if a variable is significant overall and in both sub-samples, accord it 100% reliable;
    (ii) if a variable is significant overall and in one sub-sample, accord it 75% reliable;
    (iii) if a variable is significant overall and in neither sub-sample, accord it 50% reliable;
    (iv) if a variable is insignificant overall but significant in both sub-samples, accord it 50% reliable;
    (v) if a variable is insignificant overall and significant in only one sub-sample, accord it 25% reliable;
    (vi) if a variable is insignificant overall and in neither sub-sample, accord it 0% reliable.
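The Stage III grading rules in Table 2 amount to a small scoring function; a sketch, with an assumed boolean interface for the three significance outcomes, is:

# Sketch of the Stage III reliability score in Table 2: a variable is graded by its
# significance in the full sample and in two overlapping sub-samples.  The three
# boolean inputs are assumed to come from t-tests at the chosen significance level.

def reliability(sig_full, sig_sub1, sig_sub2):
    n_sub = int(sig_sub1) + int(sig_sub2)
    if sig_full:
        return {2: 1.00, 1: 0.75, 0: 0.50}[n_sub]
    return {2: 0.50, 1: 0.25, 0: 0.00}[n_sub]

# Example: significant overall and in one of the two sub-samples -> 75% reliable.
print(reliability(True, True, False))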

from the GUM. Thus, if a search is to progress towards an appropriate simplification, such tests must be well focused and have the correct size. The pre-search tests were analyzed above, as were the path searches. The choices of critical values for pre-selection, selection and encompassing tests are important for the success of PcGets: the tighter the size, the fewer the spurious inclusions of irrelevant variables, but the more the false exclusions of relevant ones. In the final analysis, the calibration of PcGets depends on the characteristics valued by the user: if PcGets is employed as a first pre-selection step in a user's research agenda, the optimal values of these significance levels may be higher than when the focus is on controlling the overall size of the selection process. The non-expert user settings reflect this. In section 6, we will use simulation techniques to investigate the calibration of PcGets for the operational characteristics of the diagnostic tests, the selection probabilities of DGP variables, and the deletion probabilities of non-DGP variables. However, little research has been undertaken to date to optimize any of the choices, or to investigate the impact on model selection of their interactions.

5.3 Limits to PcGets
Davidson and Hendry (1981, p.257) mentioned four main problems in the general-to-specific methodology: (i) the chosen general model can be inadequate, comprising a very special case of the DGP; (ii) data limitations may preclude specifying the desired relation; (iii) the non-existence of an optimal sequence for simplification leaves open the choice of reduction path; and (iv) potentially-large type-II
error probabilities of the individual tests may be needed to avoid a high type-I error of the overall sequence. By adopting the multiple-path development of Hoover and Perez (1999), and implementing a range of important improvements, PcGets overcomes many of the problems associated with points (iii) and (iv). However, the empirical success of PcGets must depend crucially on the creativity of the researcher in specifying the general model and the feasibility of estimating it from the available data, aspects that are beyond the capabilities of the program, other than the diagnostic tests serving their usual role of revealing model mis-specification. There is a central role for economic theory in the modelling process in prior specification, prior simplification, and suggesting admissible data transforms. The first of these relates to the inclusion of potentially-relevant variables, the second to the exclusion of irrelevant effects, and the third to the appropriate formulations in which the influences to be included are entered, such as log or ratio transforms etc., differences and cointegration vectors, and any likely linear transformations that might enhance orthogonality between regressors. The LSE approach argued for a close link of theory and model, and explicitly opposed running regressions on every variable on the database as in Lovell (1983) (see e.g., Hendry and Ericsson, 1991a). PcGets currently focuses on general-to-simple reductions for linear, dynamic, regression models, and economic theory often provides little evidence for specifying the lag lengths in empirical macro-models. Even when the theoretical model is dynamic, the lags are usually chosen either for analytical convenience (e.g., first-order differential equation systems), or to allow for certain desirable features (as in the choice of a linear second-order single-equation model to replicate cycles). Therefore, we adopt the approach of starting with an unrestricted rational-lag model with a maximal lag length set according to available evidence (e.g., as 4 or 5 for quarterly time series, to allow for seasonal dynamics). Prior analysis remains essential for appropriate parameterizations; functional forms; choice of variables; lag lengths; and indicator variables (including seasonals, special events, etc.). Orthogonalization helps notably in selecting a unique representation, as does validly reducing the initial GUM. The present performance of PcGets on previously-studied empirical problems is impressive, even when the GUM is specified in highly inter-correlated, and probably non-stationary, levels. Hopefully, PcGets' support in automating the reduction process will enable researchers to concentrate their efforts on designing the GUM: that could again significantly improve the empirical success of the algorithm.

5.3.1 Collinearity
Perfect collinearity denotes an exact linear dependence between variables; perfect orthogonality denotes no linear dependencies. However, any state in between these is both harder to define and to measure, as it depends on which version of a model is inspected. Most econometric models contain subsets of variables that are invariant to linear transformations, whereas measures of collinearity are not invariant: if two standardized variables x and z are nearly perfectly correlated, each can act as a close proxy for the other, yet x + z and x − z are almost uncorrelated.
Moreover, observed correlation matrices are not reliable indicators of potential problems in determining if either or both variables should enter a model: the source of their correlation matters. For example, inter-variable correlations of 0.9999 easily arise in systems with unit roots and drift, but there is no difficulty determining the relevance of variables. Conversely, in the simple bivariate normal:

(x_t, z_t)′ ∼ IN₂[0, Ω], with unit variances and correlation ρ = E[x_t z_t],   (25)

where we are interested in the DGP:

y_t = βx_t + γz_t + ε_t   (26)
(for a well-behaved ε_t, say), when ρ = 0.9999 there would be almost no hope of determining which variables mattered in (26), even if the DGP formulation were known. In economic time series, however, the former case is common, whereas (25) is almost irrelevant (although it might occur when trying to let estimation determine which of several possible measures of a variable is best). Transforming the variables to a near-orthogonal representation before modelling would substantially resolve this problem, but otherwise, eliminating one of the two variables seems inevitable. Of course, which is dropped depends on the vagaries of sampling, and that might be thought to induce considerable unmeasured uncertainty, as the chosen model oscillates between retaining x_t or z_t. However, either variable individually is a near-perfect proxy for the dependence of y_t on x_t + z_t, and so long as the entire system remains constant, selecting either, or the appropriate sum, does not actually increase the uncertainty greatly. That remains true even when one of the variables is irrelevant, although then the multiple-path search is highly likely to select the correct equation. And if the system is not constant, the collinearity will be broken. Nevertheless, the outcome of a Monte Carlo model-selection study of (26) given (25) when ρ = 0.9999 might suggest that model uncertainty was large and coefficient estimates badly biased, simply because different variables were retained in different replications. The appropriate metric is to see how well x_t + z_t is captured. In some cases, models are estimated to facilitate economic policy, and in such a collinear setting, changing only one variable will not have the anticipated outcome, although it will end the collinearity and so allow precise estimates of the separate effects. Transforming the variables to a near-orthogonal representation before modelling is assumed to have occurred in the remainder of the chapter. By having a high probability of selecting the LDGP in such an orthogonal setting, the reported uncertainties (such as estimated coefficient standard errors) in PcGets are not much distorted by selection effects.

5.4 Integrated variables
To date, PcGets conducts all inferences as if the data were I(0). Most selection tests will in fact be valid even when the data are I(1), given the results in, say, Sims, Stock and Watson (1990). Only t- or F-tests for an effect that corresponds to a unit root require non-standard critical values. The empirical examples on I(1) data provided below do not reveal problems, but in principle it would be useful to implement cointegration tests and appropriate transformations after Stage 0, and prior to Stage I reductions. Similarly, Wooldridge (1999) shows that diagnostic tests on the GUM (and presumably simplifications thereof) remain valid even for integrated time series.

6 Some Monte Carlo results


6.1 Aim of the Monte Carlo
Although the sequential nature of PcGets and its combination of variable-selection and diagnostic testing has eluded most attempts at theoretical analysis, the properties of the PcGets model-selection process can be evaluated in Monte Carlo (MC) experiments. In the MC considered here, we aim to measure the size and power of the PcGets model-selection process, namely the probability of inclusion in the final model of variables that do not (do) enter the DGP. First, the properties of the diagnostic tests under the potential influence of nuisance regressors are investigated. Based on these results, a decision can be made as to which diagnostics to include in the test battery. Then the size and power of PcGets is compared to the empirical and theoretical properties

of a classical t-test. Finally, we analyze how the success and failure of PcGets are affected by the choice of: (i) the significance levels of the diagnostic tests; and (ii) the significance levels of the specification tests.

6.2 Design of the Monte Carlo
The Monte Carlo simulation study of Hoover and Perez (1999) considered the Lovell database, which embodies many dozens of relations between variables as in real economies, and is of the scale and complexity that can occur in macro-econometrics: the rerun of those experiments using PcGets is discussed in Hendry and Krolzig (1999b). In this paper, we consider a simpler experiment, which, however, allows an analytical assessment of the simulation findings. The Monte Carlo reported here uses only Stages I and II in Table 1: Hendry and Krolzig (1999b) show the additional improvements that can result from adding Stages 0 and III to the study in Hoover and Perez (1999). The DGP is a Gaussian regression model, where the strongly-exogenous variables are Gaussian white-noise processes:

y_t = Σ_{k=1}^{5} β_{k,0} x_{k,t} + ε_t,   ε_t ∼ IN[0, 1],   v_t ∼ IN₁₀[0, I₁₀]   for t = 1, …, T,   (27)
x_t = v_t,

where β_{1,0} = 2/√T, β_{2,0} = 3/√T, β_{3,0} = 4/√T, β_{4,0} = 6/√T, β_{5,0} = 8/√T. The GUM is an ADL(1, 1) model which includes as non-DGP variables the lagged endogenous variable y_{t−1}, the strongly-exogenous variables x_{6,t}, …, x_{10,t}, and the first lags of all regressors:

y_t = β_{0,1} y_{t−1} + Σ_{k=1}^{10} Σ_{i=0}^{1} β_{k,i} x_{k,t−i} + β_{0,0} + u_t,   u_t ∼ IN[0, σ²].   (28)

The sample size T is 100 or 1000, and the number of replications M is 1000. The orthogonality of the regressors allows an easier analysis. Recall that the t-test of the null β_k = 0 versus the alternative β_k ≠ 0 is given by:

t_k = β̂_k / SE(β̂_k) = β̂_k / (σ̂² (X′X)⁻¹_{kk})^{1/2}.

The population value of the t-statistic is:

β_k / σ_{β_k} = β_k / (σ²_ε T⁻¹ Q^{kk})^{1/2},

where Q^{kk} denotes the (k, k) element of Q⁻¹ and the moment matrix Q = lim_{T→∞} T⁻¹ X′X is assumed to exist. Since the regressors are orthogonal, we have that β̂_k = σ̂_{x_k y}/σ̂²_{x_k} and σ²_{β̂_k} = σ²_ε/(T σ²_{x_k}), so:

t_k = β_k/σ_{β_k} = √T β_k σ_{x_k}/σ_ε.

With unit variances, the non-zero population t-values are thus 2, 3, 4, 6, 8. In (28), 17 of the 22 regressors are nuisance.
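A minimal simulation of the DGP (27), under the stated coefficient values, might look as follows; it reproduces only the data generation and the resulting t-values, not the full PcGets experiment, and the seed and OLS helper are arbitrary choices:

import numpy as np

# Minimal simulation of (27): five relevant orthogonal regressors with coefficients
# (2, 3, 4, 6, 8)/sqrt(T) and IN[0, 1] errors, plus five irrelevant regressors.  The
# GUM (28) would add a constant, y_{t-1} and one lag of every x_{k,t}; only the data
# generation and the resulting t-values are reproduced here.

def simulate(T=100, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.array([2.0, 3.0, 4.0, 6.0, 8.0]) / np.sqrt(T)
    x = rng.standard_normal((T, 10))            # x_t = v_t ~ IN_10[0, I_10]
    y = x[:, :5] @ betas + rng.standard_normal(T)
    return y, x

def t_values(y, X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return b / se

y, x = simulate(T=100)
print(np.round(t_values(y, x), 2))   # the first five should scatter around 2, 3, 4, 6, 8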


6.3 Evaluation of the Monte Carlo
The evaluation of Monte Carlo experiments always involves measurement problems: see Hendry (1984). A serious problem here is that, with some positive probability, the GUM and the truth will get rejected ab initio on diagnostic tests. Tests are constructed to have non-zero nominal size under their null, so sometimes the truth will be rejected: and the more often, the more tests that are used. Three possible strategies suggest themselves: one rejects that data sample, and randomly re-draws; one changes the rejection level of the offending test; or one specifies a more general GUM which is congruent. We consider these alternatives in turn. Hoover and Perez (1999) use a criterion of two significant test rejections to discard a sample and redraw, which probably slightly favours the performance of Gets. In our Monte Carlo with PcGets, the problem is solved by endogenously adjusting the significance levels of tests that reject the GUM (e.g., 1% to 0.1%). Such a solution is feasible in a Monte Carlo, but metaphysical in reality, as one could never know that a sample from an economy was unrepresentative, since time series are not repeatable. Thus, an investigator could never know that the DGP was simpler empirically than the data suggest (although such a finding might gradually emerge in a PRS), and so would probably generalize the initial GUM. We do not adopt that solution here, partly because of the difficulties inherent in the constructive use of diagnostic-test rejections, and partly because it is moot whether the PcGets algorithm fails by overfitting on such aberrant samples, when in a non-replicable world, one would conclude that such features really were aspects of the DGP. Notice that fitting the true equation, then testing it against such alternatives, would also lead to rejection in this setting, unless the investigator knew the truth, and knew that she knew it, so no tests were needed. While more research is needed on cases where the DGP would be rejected against the GUM, here we allow PcGets to adjust significance levels endogenously. Another major decision concerns the basis of comparison: the truth seems to be a natural choice, and both Lovell (1983) and Hoover and Perez (1999) measure how often the search finds the DGP exactly or nearly. Nevertheless, we believe that finding the DGP exactly is not a good choice of comparator, because it implicitly entails a basis where the truth is known, and one is certain that it is the truth. Rather, to isolate the costs of selection per se, we seek to match probabilities with the same procedures applied to testing the DGP. In each replication, the correct DGP equation is fitted, and the same selection criteria applied: we then compare the retention rates for DGP variables from PcGets with those that occur when no search is needed, namely when inference is conducted once for each DGP variable, and additional (non-DGP) variables are never retained.

6.4 Diagnostic tests
PcGets records the rejection frequencies of both specification and mis-specification tests for the DGP, the initial GUM, and the various simplifications thereof based on the selection rules. Figure 1 displays quantile-quantile (QQ) plots of the empirical distributions of seven potential mis-specification tests for the estimated correct specification, the general model, and the finally-selected model.
Some strong deviations from the theoretical distributions (the diagonal) are evident: the portmanteau statistic (see Box and Pierce, 1970) rejects serial independence of the errors too often in the correct specification, never in the general model, and too rarely in the final model. The hetero-x test (see White, 1980) faced degrees-of-freedom problems for the GUM, and in any case does not look good for the true and final models either. Since this incorrect finite-sample size of the diagnostic tests induces an excessively-early termination of search paths, resulting in an increased overall size for variable selection, we decided to exclude the portmanteau and hetero-x diagnostics from the test battery. Thus, the following results use the five remaining diagnostic tests in Table 3.


[Figure 1. Selecting diagnostics: QQ plots of the Chow1, Chow2, portmanteau, normality, AR, hetero and hetero-x test statistics (and the number of failed tests) for the correct model, the general model, and the final (specific) model; M = 1000 and T = 100.]

[Figure 2. Diagnostics for small and large samples: QQ plots of the Chow1, Chow2, normality, AR and hetero test statistics (and the number of failed tests) for the correct and general models at T = 1000 and T = 100; M = 1000.]


Table 3  Test battery.

Test            | Alternative                                        | Statistic            | Sources
Chow(λ₁T)       | Predictive failure over a subset of (1−λ₁)T obs.   | F((1−λ₁)T, λ₁T−k)    | Chow (1960, p.594-595), Hendry (1979)
Chow(λ₂T)       | Predictive failure over a subset of (1−λ₂)T obs.   | F((1−λ₂)T, λ₂T−k)    |
portmanteau(r)  | rth-order residual autocorrelation                 | χ²(r)                | Box and Pierce (1970)
normality test  | Skewness and excess kurtosis                       | χ²(2)                | Jarque and Bera (1980), Doornik and Hansen (1994)
AR 1-p test     | pth-order residual autocorrelation                 | F(p, T−k−p)          | Godfrey (1978), Harvey (1981, p.173)
hetero test     | Heteroscedasticity quadratic in regressors x²ᵢ     | F(q, T−k−q−1)        | White (1980), Nicholls and Pagan (1983), Hendry and Doornik (1996)
hetero-x test   | Heteroscedasticity quadratic in regressors xᵢxⱼ    | F(q, T−k−q−1)        |
F-test          | General                                            | F(q, T−k−q)          |

There are T observations and k regressors in the model under the null. The value of q may differ across statistics, as may those of k and T across models. By default, PcGets sets p = 4, r = 12, λ₁ = [0.5T]/T, and λ₂ = [0.9T]/T.

Figure 2 demonstrates that for large samples (T = 1000), the empirical distributions of the test statistics are unaffected by the strongly-exogenous nuisance regressors. For small samples (T = 100), the properties of the mis-specification tests are still satisfactory, and, except for the heteroscedasticity test, close to the distributions of the test statistics under the null of the true model.

6.5 Size and power of variable selection
Simplification can at best eliminate the nuisance regressors all or most of the time (size), yet retain the substance nearly as often as the DGP (power). The metric to judge the costs of reduction and of mis-specification testing was noted above. The probability is low of detecting an effect that has a scaled population t-value less than 2 in absolute value when the empirical selection criterion is larger. This suggests weighting the failure of PcGets in relation to a variable's importance, statistically and economically. Then, missing a variable with |t| < 2 would count for less than missing an effect with |t| > 4 (say). With such a baseline, low signal-noise variables will still rarely be selected, but that is attributable as a cost of inference, not a flaw of Gets-type searches. In the following, we measure the outcome of PcGets by comparing its power and size with those of classical t-tests applied once to the correct DGP equation. The power function of a t-test of size α for including the kth variable x_{k,t} with coefficient β_{k,0} ≠ 0 is given by:

Pr(Include x_{k,t} | β_{k,0} ≠ 0) = Pr(|t_k| ≥ c_α | β_{k,0} ≠ 0),

where c_α satisfies Pr(|t_k| ≥ c_α | β_{k,0} = 0) = α. The rejection probability is given by a non-central t-distribution with ν degrees of freedom and non-centrality parameter ψ, which can be approximated by a normal distribution Φ(x) (see Abramowitz and Stegun, 1970), where:

x = [t(1 − 1/(4ν)) − ψ] / (1 + t²/(2ν))^{1/2}.

The power of a t-test with size α and ν degrees of freedom is then given by the parametric variation of the population t-value ψ in:

Pr(Include x_{k,t} | β_{k,0} ≠ 0) = Pr(t_k ≤ −c_α | β_{k,0} ≠ 0) + Pr(t_k ≥ c_α | β_{k,0} ≠ 0).

For α = 0.01 and 0.05, and T = 100 and 1000, the power function is depicted in figure 3.
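The normal approximation just described is easy to evaluate directly; the following sketch (with ν fixed at 100 and the grid of population t-values an arbitrary choice) yields values close to the theoretical entries quoted below, for example roughly 0.51 at a population t-value of 2 with α = 0.05:

from math import sqrt
from scipy import stats

def t_test_power(psi, alpha=0.05, nu=100):
    """Two-sided t-test rejection probability at population t-value psi, via the
    normal approximation x = (t(1 - 1/(4 nu)) - psi) / sqrt(1 + t^2 / (2 nu))."""
    c = stats.t.ppf(1.0 - alpha / 2.0, nu)      # two-sided critical value
    def approx_cdf(t):
        x = (t * (1.0 - 1.0 / (4.0 * nu)) - psi) / sqrt(1.0 + t * t / (2.0 * nu))
        return stats.norm.cdf(x)
    # Pr(t_k <= -c) + Pr(t_k >= c) under the non-central alternative
    return approx_cdf(-c) + (1.0 - approx_cdf(c))

for psi in (0, 2, 3, 4, 6, 8):
    # approximately 0.05, 0.51, 0.84, 0.98, 1.00, 1.00 for alpha = 0.05, nu = 100
    print(psi, round(t_test_power(psi, alpha=0.05, nu=100), 4))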


[Figure 3. Power function of a t-test for α = 0.01 and 0.05, and T = 100 and 1000: Pr(Include x | population t-value of x) plotted against the population t-value.]

Table 4 shows that for large samples (T = 1000), the size (0.052 versus 0.05) and power (0.51 versus 0.515) are nearly at the theoretical levels one would expect from a t-test of the true model. Hence the loss in size and power from the PcGets search is primarily a small-sample feature. But even for 100 observations and 22 regressors, the results from PcGets are promising. For α = η = 0.01, the loss in power is less than 0.025 and the size is 0.019. The difference in the Monte Carlo experiments between PcGets and the empirical size and power of the t-test is even smaller. All of these experiments used AIC when more than one model survived both the model search and the encompassing process: using SC had little effect on the outcome, but improved the size slightly (see Table 5). This match of actual and theoretical size suggests that the algorithm is not subject to spurious overfitting, since in a search over a large number of regressors, one must expect to select some adventitious effects. Table 5 also shows the dramatic size improvements that can result from adding Stage 0 (pre-search reduction) to the algorithm, at some cost in retaining relevant variables until |t| > 4. Smaller significance levels (α = 0.01 versus 0.05) receive some support from the size-power trade-offs here when there are many irrelevant variables. Tighter significance levels for the t-tests reduce the empirical sizes accordingly, but lower the power substantially for variables with population t-values of 2 or 3, consistent with the large vertical differences between the power functions in figure 3 at such t-values. Otherwise, using 1% for all tests does well at the sample sizes current in macroeconomics, and dominates 5% dramatically on overall size, without much power loss for larger values of t. For example, on conventional t-tests alone, the probability of selecting no regressors when all 22 variables are irrelevant in (28) is:

Pr(|t_{k,i}| < c_α ∀ k, i | β_{k,i} = 0) = (1 − α)²²,

which is 0.80 when α = 0.01 but falls to 0.32 when α = 0.05. Figure 4 clarifies the success and failure of PcGets for the 22 regressors of the GUM (the coefficients are in the order of equation 28).
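That arithmetic is immediate to verify (22 being the number of regressors in the GUM (28)):

# Probability that none of 22 irrelevant regressors is retained on one-shot t-tests.
for alpha in (0.01, 0.05):
    print(alpha, round((1.0 - alpha) ** 22, 2))   # prints 0.8 and 0.32, as in the text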


Table 4  Power and Size I: Impact of diagnostic tests on PcGets t-tests.

              PcGets                                                                              t-test: simulated            t-test: theoretical
α             0.05             0.05             0.05             0.05             0.01            0.05     0.05     0.01       0.05     0.05     0.01
η             0.05             0.01             0.00             0.01             0.01
IC            AIC              AIC              AIC              AIC              AIC
T             100              100              100              1000             100             100      1000     100        100      1000     100

t = 0   0.0812 (0.0022)  0.0686 (0.0021)  0.0646 (0.0019)  0.0521 (0.0017)  0.0189 (0.0010)       -        -        -       0.0500   0.0500   0.0100
t = 2   0.5090 (0.0158)  0.4930 (0.0158)  0.4910 (0.0158)  0.5100 (0.0158)  0.2820 (0.0142)    0.4730   0.5010   0.2580     0.5083   0.5152   0.2713
t = 3   0.8070 (0.0125)  0.8020 (0.0125)  0.8010 (0.0126)  0.8340 (0.0118)  0.6210 (0.0153)    0.8120   0.8360   0.6130     0.8440   0.8502   0.6459
t = 4   0.9750 (0.0049)  0.9720 (0.0049)  0.9710 (0.0053)  0.9880 (0.0034)  0.9000 (0.0095)    0.9760   0.9850   0.9020     0.9773   0.9791   0.9127
t = 6   0.9990 (0.0010)  0.9990 (0.0010)  0.9990 (0.0010)  1.0000 (0.0000)  0.9990 (0.0010)    1.0000   1.0000   0.9990     1.0000   1.0000   0.9996
t = 8   1.0000 (0.0000)  1.0000 (0.0000)  1.0000 (0.0000)  1.0000 (0.0000)  1.0000 (0.0000)    1.0000   1.0000   1.0000     1.0000   1.0000   1.0000

The table reports the size (selection probability when the population t = 0) and the power (non-deletion probability when the population t > 0) of a standard t-test and of PcGets. Standard errors in parentheses.

Table 5  Power and Size II: Information criteria and pre-search reduction.

              α = 0.05                                                              α = 0.01
IC            AIC              HQ               SC               SC                 AIC              HQ               SC               SC
Pre-search    -                -                -                yes                -                -                -                yes

t = 0   0.0686 (0.0021)  0.0679 (0.0019)  0.0677 (0.0019)  0.0477 (0.0015)    0.0189 (0.0010)  0.0183 (0.0010)  0.0185 (0.0010)  0.0088 (0.0006)
t = 2   0.4930 (0.0158)  0.4930 (0.0158)  0.4930 (0.0158)  0.4080 (0.0140)    0.2820 (0.0142)  0.2810 (0.0142)  0.2820 (0.0142)  0.1538 (0.0106)
t = 3   0.8020 (0.0125)  0.8020 (0.0126)  0.8020 (0.0126)  0.7330 (0.0124)    0.6210 (0.0153)  0.6220 (0.0153)  0.6200 (0.0154)  0.4278 (0.0147)
t = 4   0.9720 (0.0049)  0.9720 (0.0052)  0.9720 (0.0053)  0.9390 (0.0061)    0.9000 (0.0095)  0.9000 (0.0096)  0.8980 (0.0096)  0.7645 (0.0125)
t = 6   0.9990 (0.0010)  0.9990 (0.0010)  0.9990 (0.0010)  0.9980 (0.0012)    0.9990 (0.0010)  0.9990 (0.0010)  0.9990 (0.0010)  0.9865 (0.0034)
t = 8   1.0000 (0.0000)  1.0000 (0.0000)  1.0000 (0.0000)  1.0000 (0.0000)    1.0000 (0.0000)  1.0000 (0.0000)  1.0000 (0.0000)  1.0000 (0.0000)

All MC experiments use η = 0.01 and T = 100. Standard errors in parentheses.

6.6 Test size analysis
The Monte Carlo reveals strong effects of the choice of significance levels on the outcome of Gets. As Table 4 shows, it is not only the significance level α of the t-tests that matters, but also the level η of the diagnostics: lowering the significance level of the diagnostic tests from 0.05 to 0.01 reduces the size by 0.0126 without affecting the power (e.g., a loss of 0.0050 at t = 3 and T = 100). This is a striking effect which merits closer examination.

[Figure 4. Probability of including each variable: selection probabilities for the 22 GUM regressors (M = 1000), for (α, η, T) = (0.05, 0.01, 100), (0.05, 0.05, 100), (0.05, 0.01, 1000) and (0.01, 0.01, 100), with the DGP variables marked at population t-values 2, 3, 4, 6 and 8.]

It is important to distinguish between the individual significance levels and the overall significance level of the test battery, which can be difficult to determine. Suppose we have a battery of n mis-specification tests, each evaluated at the significance level η. Assuming independence of the tests, the overall rejection probability under the null is given by 1 − (1 − η)ⁿ. For example, if n = 5 and η = 0.05, then the probability of rejecting the DGP is 0.2262, which is substantial. To ensure an overall rejection probability of 0.05 under the null, the individual significance level has to satisfy (1 − η)ⁿ = 0.95: for, say, n = 5 mis-specification tests, η ≈ 0.01 is necessary.
The combined variable-selection and specification-testing approach implies that the significance level η of the diagnostics also affects the deletion probabilities for any non-DGP variable. The probability of selecting a non-DGP variable must be higher than the nominal size (see figure 4), since, for pure random sampling, excluding even a nuisance variable can alter the outcome of one or more diagnostic tests to a rejection. Thus, despite an insignificant t-statistic, PcGets would keep such a variable in the model. The joint issue of variable selection and diagnostic testing in PcGets is hard to analyse, but consider the following one-step reduction from a GUM denoted M1 to the DGP M2:

M1 (GUM): y_t = x_t′β₁ + z_t′β₂ + u_t,
M2 (DGP): y_t = x_t′β + ε_t,

where ε_t is a homoscedastic innovation process. The reduction step consists of a specification test for β₂ = 0 (a t-test for scalar z_t, or an F-test in the vector case) and a set of n mis-specification tests. The individual significance levels are α and η, respectively: we assume that these tests are independent. Let R_M (R̄_M) denote a rejection (non-rejection) on the set of mis-specification tests for model M. For M1:

Pr(R̄_M1) = (1 − η)ⁿ = 1 − Pr(R_M1),   (29)
which is the initial probability that the GUM is not rejected by the mis-specification test battery, as discussed above. Then the overall size γ (the probability of no reduction from M1) is:

γ = α + (1 − α) Pr(R_M2 | R̄_M1) ≈ α + (1 − α)[1 − (1 − η∗(η))ⁿ],   (30)

where the second term shows the inflation of the nominal size due to diagnostic testing. It seems reasonable to let Pr(R_M2 | R̄_M1) = 1 − (1 − η∗(η))ⁿ, as rejection is increasing in n. Equation (30) provides a lower bound for the multi-path model-search procedure. For example, it does not take into account that DGP variables might be eliminated by selection tests, and there might be a sequence of t-tests rather than a single F-test for the same set of variables. The multi-step, multi-path reduction process of PcGets makes an analytical assessment intractable, but we get insights from the Monte Carlo findings. From Table 4 (where n = 5), the empirical sizes of PcGets selection at a nominal selection-test size of 5% are 0.065, 0.069, and 0.081 when conducted without diagnostic testing (i.e., η = 0), with η = 0.01, and with η = 0.05 respectively. Solving (30) for these effects gives estimates of η∗(0.01) = 0.0008 and η∗(0.05) = 0.0035. Although a Monte Carlo is always problem dependent, such findings cohere with the established theory: the additional checking does not increase size greatly so long as we control the overall size of the test battery.
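Both calculations can be checked numerically; in the sketch below, taking the η = 0 empirical size (0.065) as the baseline when inverting (30) is an assumption about how the quoted η∗ values were obtained:

# Overall rejection probability of a battery of n independent tests at level eta, and
# the effective diagnostic level eta* implied by inverting equation (30), using the
# eta = 0 empirical size from Table 4 as the baseline (an assumption).

def battery_size(eta, n=5):
    return 1.0 - (1.0 - eta) ** n

def implied_eta_star(gamma, baseline=0.065, n=5):
    # gamma = baseline + (1 - baseline) * (1 - (1 - eta*)^n), solved for eta*
    inflation = (gamma - baseline) / (1.0 - baseline)
    return 1.0 - (1.0 - inflation) ** (1.0 / n)

print(round(battery_size(0.05), 4))          # 0.2262, as quoted above
print(round(1.0 - 0.95 ** (1.0 / 5), 4))     # ~0.0102: individual level for a 5% battery
print(round(implied_eta_star(0.0686), 4))    # 0.0008 for eta = 0.01
print(round(implied_eta_star(0.0812), 4))    # 0.0035 for eta = 0.05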

7 Empirical Illustrations
In this section, we apply PcGets to two well-known macro-econometric models: Davidson, Hendry, Srba and Yeo (1978) and Hendry and Ericsson (1991b).

7.1 DHSY
We reconsider the single-equation equilibrium-correction model of Davidson et al. (1978) (DHSY), who proposed the following model of consumers' expenditure in the UK:

Δ₄c_t = −0.09(c − y)_{t−4} + 0.48Δ₄y_t − 0.23Δ₁Δ₄y_t − 0.12Δ₄p_t − 0.31Δ₁Δ₄p_t + 0.006Δ₄D_t,   (31)

where y_t denotes real personal disposable income, c_t is real consumers' expenditure on non-durable goods and services, p_t is the implicit deflator of c_t (all in logs), and D is a dummy variable with the value unity in 68(i) and 73(i), −1 in the following quarter, and zero otherwise. We started with a GUM that generalizes (31) by allowing for 5 lags of all differenced variables, their EqCM, and a constant. The results are summarized in Table 6. All three final models are valid non-dominated reductions of the GUM, only differing slightly in their dynamic specification. Model 1 would be preferable if one applied the AIC criterion, model 2 in the case of HQ and SC. It is worth noting that introducing centered seasonal dummies into the general model (GUM 2) eliminates model 3 from the set of final models, but otherwise results in identical models. This is surprising, as there are significant seasonal effects in GUM 2. A comparison with the model chosen by DHSY shows the potential power of an automated general-to-specific model selection: PcGets detects models that dominate DHSY's final selection. The DHSY results reflect their decision to delete the Δ₄c_{t−3} term, which possibly leads to an economically more sensible model. The value of structuring the problem before specifying the GUM becomes obvious in Table 7, where we try up to 5 lags of each variable for Δ₄c_t on Δ₄y_t, Δ₄p_t, and a constant, to see what PcGets finds. In Table 8, seasonals are added. It might be reassuring to see how the power of an artificial-intelligence model-specification system is limited by the researcher's specification of the GUM. This is very much in line with the ideas presaged in section 5.3.


Table 6 DHSY 1959 (2) - 1975 (4) .


DHSY Coeff 4 ct1 4 ct2 4 ct3 4 ct4 4 ct5 4 yt 4 yt1 4 yt2 4 yt3 4 yt4 4 yt5 4 p t 4 pt1 4 pt2 4 pt3 4 pt4 4 pt5 (c y)t4 Constant 4 Dt CSeason 1 CSeason 2 CSeason 3 RSS sigma R2 R2 adj LogLik AIC HQ SC Chow(1967:4) Chow(1973:4) normality test AR 1-5 test hetero test 1.0142 0.7281 0.1120 0.2239 0.6478 t-value GUM 1 Coeff t-value 0.0743 0.5011 -0.0472 -0.3502 0.2304 1.7445 -0.0545 -0.3959 0.0821 0.6450 0.2332 5.2662 0.1618 2.5850 0.0267 0.4318 -0.0226 -0.3912 0.0069 0.0938 -0.0300 -0.4333 -0.4309 -3.3889 0.2633 1.3715 -0.0210 -0.1146 0.2005 1.0743 -0.2537 -1.3480 0.2110 1.4990 -0.0349 -1.1461 0.0026 0.5100 0.0083 2.7285 GUM 2 Coeff t-value 0.0041 0.0273 -0.0275 -0.2069 0.2317 1.7767 -0.0399 -0.2939 -0.0045 -0.0338 0.2594 5.6941 0.1787 2.8231 0.0353 0.5757 -0.0261 -0.4551 -0.0265 -0.3489 -0.0130 -0.1853 -0.4339 -3.4581 0.2170 1.1377 0.0322 0.1745 0.1677 0.9034 -0.2317 -1.2414 0.1215 0.8370 -0.1315 -2.3522 -0.0069 -1.0139 0.0074 2.4178 0.0085 2.0815 0.0051 1.6312 0.0043 1.5988 0.0016 0.0061 0.8970 0.5891 356.0237 -9.9410 -9.6415 -9.1842 0.7620 0.4718 0.0374 0.4442 0.1980 0.7388 0.8966 0.9815 0.8148 0.9891 0.8585 0.6256 0.5170 0.1000 0.6231 Final model 1 (GUM1, GUM2) Coeff t-value Final model 2 (GUM1, GUM2) Coeff t-value Final model 3 (GUM1) Coeff t-value

0.1743

3.0144

0.2572

3.7712

0.1706

2.7910

0.2508 0.2300

7.0717 5.7883

0.2486 0.1891

7.5433 4.8657

0.2614 0.1505

8.1151 4.3485

0.2362 0.1874

6.8964 4.7449

-0.4227 -4.3869 0.3051 2.9585

-0.4284 -4.7064 0.2668 2.5528

-0.2930 -6.5579

-0.3708 -3.9073 0.2307 2.1328

0.2490 0.1150 2.3280 -0.0591 -3.9155 0.0063 3.0116

3.9625 0.1250 2.5184 3.6101 2.8834

-0.0930 -7.7312 0.0065 2.8869

-0.0504 -3.0787 0.0068 3.2809 0.0087 0.0060

0.0023 0.0062 0.8528 0.7764 344.0657 -10.0915 -10.0134 -9.8941 0.4887 0.6945 0.9455 0.9507 0.7906

0.0018 0.0062 0.8867 0.6220 352.8531 -9.9359 -9.6755 -9.2778 0.9630 0.7187 0.6559 0.3498 0.3855 0.5573 0.7017 0.7204 0.8795 0.9772

0.0019 0.0057 0.8770 0.7723 350.0929 -10.2117 -10.1076 -9.9485 0.6641 0.7847 0.7722 0.9917 0.8469 0.8551 0.7414 0.8055 0.3536 0.6810

0.0020 0.0058 0.8720 0.7809 348.7563 -10.2017 -10.1105 -9.9713 0.6687 0.6824 0.6685 0.8778 0.7801 0.9570 0.7863 1.4087 0.4999 0.4172

0.0020 0.0058 0.8731 0.7688 349.0408 -10.1803 -10.0762 -9.9171 0.5527 0.6417 0.4944 0.7750 0.9611

(1) DHSY corresponds to equation (8.45)** in Davidson et al. (1978). (2) Δ₄D_t is the fourth difference of a dummy variable which is +1, −1 in 1968(i), 1968(ii) and 1973(i), 1973(ii), reflecting budget effects in 1968 and the introduction of VAT in 1973 (see footnote 5 in Davidson et al., 1978). (3) CSeason 1, 2, 3 are centralised seasonal dummies.


Table 7 DHSY without SEASONALS, 1959 (2) - 1976 (2) .


General model Coeff t-value 4 ct1 4 ct2 4 ct3 4 ct4 4 ct5 4 yt 4 yt1 4 yt2 4 yt3 4 yt4 4 yt5 4 p t 4 pt1 4 pt2 4 pt3 4 pt4 4 pt5 (c y)t4 Constant RSS sigma R2 Radj2 LogLik AIC HQ SC Chow(1967:4) Chow(1973:4) normality test AR 1-5 test hetero test 1.3920 0.9134 0.2589 0.2791 0.2646 -0.0900 -0.0377 0.2406 -0.0079 0.0395 0.2489 0.2030 0.0540 -0.0064 -0.0934 0.0186 -0.3919 0.2399 0.0186 0.0601 -0.2125 0.2154 -0.0514 0.0033 -0.6248 -0.2660 1.7255 -0.0551 0.2943 5.3523 3.1752 0.8294 -0.1074 -1.3259 0.2645 -3.0655 1.1928 0.0958 0.3192 -1.0761 1.4561 -1.6258 0.6162 0.0022 0.0066 0.8804 0.6380 357.8109 -9.8206 -9.5765 -9.2054 0.2434 0.5305 0.8786 0.9222 0.9992 1.1658 0.6604 0.3234 0.2078 0.6009 Final model 1 Coeff t-value Final model 2 Coeff t-value Final model 3 Coeff t-value Final model 4 Coeff t-value

0.2554

3.1972

0.1471

2.4153

0.1871

2.7581

0.1524

2.3475

0.2630 0.1896

7.5598 4.8375

0.2762 0.1879

7.8876 4.6804

0.2859 0.1585

8.2106 4.1983

0.2738 0.1346

7.8156 3.7473

-0.0804 -0.3999 0.3032

-2.0297 -5.0674 3.6100 -0.3919 0.2868 -4.8537 3.3490 -0.2742 0.1829 -5.3097 3.0574 -0.1693 0.1829 -5.8668 3.0574

-0.0696

-4.6275 0.0023 0.0061 0.8705 0.7822 355.0706 -10.0890 -9.9991 -9.8624 0.3414 0.7551 0.8507 0.9579 0.8505

-0.0699

-4.5376 0.0025 0.0063 0.8619 0.7869 352.8512 -10.0537 -9.9766 -9.8594

-0.0643

-3.8717 0.0026 0.0064 0.8583 0.7837 351.9716 -10.0282 -9.9511 -9.8339

0.1427 -0.0643 0.0099

3.3997 -3.8717 3.9229 0.0025 0.0063 0.8626 0.7876 353.0330 -10.0589 -9.9819 -9.8647

1.1699 0.7546 0.0989 0.2303 0.7030

0.3354 0.6706 0.9517 0.9478 0.7409

1.0505 0.8571 0.4123 0.2538 0.7415

0.4494 0.5776 0.8137 0.9362 0.7047

1.3654 0.8929 0.3682 0.4904 0.4176

0.1978 0.5459 0.8319 0.7821 0.9318


Table 8 DHSY with SEASONALS, 1959 (2) - 1976 (2) .


General model Coeff t-value 4 ct1 4 ct2 4 ct3 4 ct4 4 ct5 4 yt 4 yt1 4 yt2 4 yt3 4 yt4 4 yt5 4 p t 4 pt1 4 pt2 4 pt3 4 pt4 4 pt5 (c y)t4 Constant CSeason 1 CSeason 2 CSeason 3 RSS sigma R2 Radj2 LogLik AIC HQ SC Chow(1967:4) Chow(1973:4) normality test AR 1-5 test hetero test 1.0230 0.5628 0.0787 1.2453 0.1703 -0.1582 -0.0260 0.2234 0.0117 -0.0609 0.2726 0.2221 0.0588 -0.0047 -0.1178 0.0426 -0.4303 0.2027 0.0557 0.0612 -0.1809 0.1200 -0.1711 -0.0092 0.0107 0.0065 0.0046 -1.1301 -0.1915 1.6707 0.0851 -0.4550 5.9369 3.5704 0.9378 -0.0831 -1.7262 0.6174 -3.4959 1.0457 0.2932 0.3375 -0.9521 0.8230 -3.2051 -1.3280 2.7376 2.0817 1.6746 0.0019 0.0063 0.8972 0.6111 363.0433 -9.8853 -9.6027 -9.1730 0.5082 0.8330 0.9614 0.3054 0.9999 1.0827 0.5121 0.4538 0.1856 0.6109 Final model 1 Coeff t-value Final model 2 Coeff t-value Final model 3 Coeff t-value

0.2256

2.8686

0.2539

2.9652

0.1315

2.0653

0.2587 0.1839

7.6527 4.8264

0.2694 0.1555

7.9734 4.3204

0.2888 0.1331

8.7700 3.7959

-0.0836 -0.3794 0.2661

-2.1744 -4.9201 3.1994

-0.0801 -0.2646 0.1622

-2.0466 -5.3552 2.7605 -0.2035 -7.0134

-0.0827 0.0039

-5.2521 2.2111

-0.0784 0.0039

-4.5545 2.1400

0.1244 -0.0765 0.0039

2.9540 -4.4559 2.1593

0.0022 0.0060 0.8801 0.7781 357.7306 -10.1371 -10.0344 -9.8781 0.4201 0.8736 0.7970 0.9669 0.8504 1.1253 0.5742 0.7825 0.2376 0.7793

0.0022 0.0061 0.8755 0.7740 356.4398 -10.0997 -9.9969 -9.8407 0.3797 0.8272 0.6762 0.9442 0.6924 1.1338 0.8614 0.6811 0.4448 0.8515

0.0023 0.0061 0.8710 0.7826 355.2075 -10.0930 -10.0031 -9.8663 0.3697 0.5739 0.7114 0.8153 0.6062


7.2 UK Money Demand
We now reconsider the Hendry and Ericsson (1991b) model (HE) of narrow money demand in the UK:

Δ(m − p)_t = −0.093(m − p − x)_{t−1} − 0.17Δ(m − p − x)_{t−1} − 0.69Δp_t − 0.63R_t + 0.023,   (32)

where the lower-case data are in logs: m is M1, x is real total final expenditure in 1985 prices, p is its deflator, and R is the opportunity cost of holding money (the 3-month local-authority interest rate minus the retail sight-deposit rate). The results using PcGets are given in Table 9. Two GUMs are considered, both nesting (32) without imposing the homogeneity restriction on (m − p) and x. As GUMs A and B are linked by linear transformations of the same set of regressors leading to an identical fit, A and B are observationally equivalent. Although transformations per se do not entail any associated reduction, the different structuring of the information set affects the reduction process. This highlights the role of variable orthogonalization. Just one model survived the selection process in each case. Final model A corresponds to the HE model without imposing the homogeneity restriction and, hence, leaves a further valid reduction: the x_{t−1} coefficient is dropped by PcGets after acceptance by the corresponding t-test (also supported by HQ and SC). The final model resulting from GUM B is also essentially the HE model, as:

−0.8Δ²p_t − 0.7Δp_{t−1} = −0.8Δp_t + 0.8Δp_{t−1} − 0.7Δp_{t−1} = −0.8Δp_t + 0.1Δp_{t−1},

and the last term is irrelevant. But only an expert system would notice the link between the regressors to cancel the redundant term. However, the initial formulation of regressors clearly matters, supporting EqCM forms, and confirming that orthogonalization helps. Alternatively, if the cointegrating vector (m − p − x)_{t−1} + 7Δp_{t−1} + 0.7R_{t−1} is used, then both Δ²p and ΔR enter (see Hendry and Doornik, 1994), so that is also a correct feature detected. PcGets seems to be doing remarkably well as an expert on the empirical problems, as well as mimicking the good size and power properties which Hoover and Perez (1999) claim. To illustrate the benefits from structuring the problem, we consider a simple unrestricted autoregressive-distributed lag model for UK M1, regressing m on p, x, and R with a lag length of 4. As shown in Table 10, three reduced models survive the model-selection process, and differ regarding their dynamic specification, but are close regarding their long-run effects. Model 1 uniformly dominates 2 and 3. When rewritten in an EqCM form, the selected outcome is again similar to HE:

m_t = 0.33m_{t−1} + 0.21m_{t−4} + 0.33p_t − 0.20p_{t−3} + 0.13x_t − 0.58R_t − 0.34R_{t−2}
  ⇒  Δm_t = −0.11(m − p − x)_{t−1} − 0.21Δ₃m_{t−1} + 0.20Δ₃p_t + 0.11Δx_t − 0.92R_t + 0.34Δ₂R_t.   (33)

If pre-search F-tests are used (at 10%), the final model is the same as (33) other than omitting R_{t−2}. It must be stressed that these cases benefit from fore-knowledge (e.g., of dummies, lag length, etc.), some of which took the initial investigators time to find.


Table 9 UKM1 Money Demand, 1964 (1) - 1989 (2).


HE Coeff (m p x)t1 pt pt1 rt rt1 (m p)t1 (m p)t2 (m p)t3 (m p)t4 xt xt1 xt2 xt3 xt4 2 p t 2 pt1 2 pt2 2 pt3 2 pt4 rt rt1 rt2 rt3 rt4 Constant RSS sigma R2 Radj2 LogLik AIC HQ SC Chow(1967:4) Chow(1973:4) normality test AR 1-5 test hetero test 0.5871 0.6367 1.9766 1.6810 1.7820 -0.0928 -0.6870 -0.6296 -0.1746 t-value -10.8734 -5.4783 -10.4641 -3.0102 Unrestricted HE Coeff t-value -0.0938 -0.6952 -0.6411 -0.1926 -10.6160 -5.4693 -9.8391 -2.7637 General A Coeff t-value -0.1584 -1.0499 -1.1121 -0.2827 -0.0407 -0.2906 -0.1446 -0.0623 0.0718 0.0083 -0.2274 -0.0925 0.3222 0.3483 0.6813 0.2944 -0.0430 0.6884 0.3293 0.2038 0.1631 0.0872 0.0434 -6.0936 -4.8670 -6.1012 -2.7449 -0.3696 -2.6800 -1.3519 -0.5509 0.5870 0.0720 -1.8802 -0.7815 1.0738 1.2135 2.4708 1.1908 -0.2134 3.1647 1.7151 1.2647 1.1558 0.7119 5.1072 0.0130 0.0130 0.8103 0.6240 447.2861 -8.4857 -8.2432 -7.8865 0.6029 0.6135 6.2503 0.4112 0.5902 0.9404 0.8439 0.0439 0.8395 0.9479 0.5464 0.5489 5.2260 1.4427 1.7668 Final model A Coeff t-value -0.0934 -0.7005 -0.6468 -0.1858 -10.5108 -5.4831 -9.8893 -2.6569 -1.1121 -0.2827 -0.0407 -0.2906 -0.1446 -0.0623 0.0718 0.0083 -0.2274 -0.0925 -0.7276 0.3483 0.6813 0.2944 -0.0430 -0.4236 0.3293 0.2038 0.1631 0.0872 0.0434 -6.1012 -2.7449 -0.3696 -2.6800 -1.3519 -0.5509 0.5870 0.0720 -1.8802 -0.7815 -3.5087 1.2135 2.4708 1.1908 -0.2134 -3.6397 1.7151 1.2647 1.1558 0.7119 5.1072 0.0130 0.0130 0.8103 0.6240 447.2861 -8.4857 -8.2432 -7.8865 0.6029 0.6135 6.2503 0.4112 0.5553 0.9404 0.8439 0.0439 0.8395 0.9649 0.4777 0.5379 6.9513 1.4285 1.0354 -0.7223 -0.2520 -9.2106 -2.7609 General B Coeff t-value -0.1584 -1.0499 -6.0936 -4.8670 Final model B Coeff t-value -0.1035 -0.7021 -9.1048 -4.8215

0.1746

3.0102

0.1384

1.4392

-0.8010

-4.3777

-0.4842

-4.5733

0.0234

5.8186 0.0164 0.0131 0.7616 0.7235 435.8552 -8.6171 -8.5644 -8.4868 0.9658 0.8267 0.3722 0.1473 0.0916

0.0244

5.3756 0.0163 0.0132 0.7622 0.7165 435.9734 -8.5995 -8.5362 -8.4432

0.0262

5.9862 0.0167 0.0133 0.7569 0.7191 434.8837 -8.5977 -8.5450 -8.4674 0.9805 0.8958 0.0733 0.2167 0.0948

0.0276

6.1696 0.0160 0.0131 0.7664 0.7128 436.8711 -8.5974 -8.5236 -8.4151 0.9937 0.9031 0.0309 0.2219 0.4256

0.5712 0.6204 2.4432 1.7672 1.1874

0.9717 0.8406 0.2948 0.1278 0.3112


Table 10

UKM1 Money Demand, 1964 (1) - 1989 (2).


Final model 1 Coeff t-value 0.6661 9.3474 Final model 2 Coeff t-value 0.6963 9.9698 Final model 3 Coeff t-value 0.6309 0.2708 6.4812 2.9985

General model Coeff t-value mt1 mt2 mt3 mt4 pt pt1 pt2 pt3 pt4 xt xt1 xt2 xt3 xt4 rt rt1 rt2 rt3 rt4 Constant RSS sigma R2 Radj2 LogLik AIC HQ SC Chow(1967:4) Chow(1973:4) normality test AR 1-5 test hetero test 0.4367 0.5427 6.1768 0.9351 0.8275 0.6265 0.1744 -0.2084 0.2815 0.1466 0.3099 -0.0557 -0.4272 0.1470 -0.0140 0.2946 -0.1351 -0.1585 0.1693 -0.4164 -0.3253 -0.0726 -0.0346 -0.0282 -0.3145 5.8463 1.4059 -1.6438 2.7652 0.6854 0.8998 -0.1613 -1.2779 0.7631 -0.1297 2.2751 -1.0353 -1.2075 1.5595 -3.6849 -1.9202 -0.4207 -0.2030 -0.2363 -0.7918 0.0135 0.0128 0.9998 0.8038 455.5085 -8.5394 -8.3310 -8.0247 0.9958 0.9064 0.0456 0.4631 0.7223

0.2083 0.3322

3.5952 5.9563

0.1847 0.4484

3.2485 5.7072 -4.4808 0.3868 -0.2880 5.4743 -4.1340

-0.2049 0.1290

-4.0702 6.6776

-0.3278

0.1222

6.3752

0.1008

6.6021

-0.5812 -0.3380

-8.6121 -3.0200

-0.5219 -0.3400

-7.9668 -2.9697

-0.4217 -0.2880

-4.4029 -2.3268

0.0153 0.0127 0.9998 0.9312 449.0368 -8.6674 -8.5944 -8.4872 0.5440 0.4743 6.1584 1.2622 1.3887 0.9814 0.9470 0.0460 0.2872 0.1781 0.5418 0.4374 5.8507 1.3347 1.4050

0.0157 0.0128 0.9998 0.9311 447.8882 -8.6449 -8.5719 -8.4647 0.9820 0.9628 0.0536 0.2569 0.1703 0.5243 0.4716 6.0494 1.4750 1.4021

0.0159 0.0129 0.9998 0.9311 447.2097 -8.6316 -8.5586 -8.4514 0.9864 0.9483 0.0486 0.2058 0.1717


8 Model selection in forecasting, testing, and policy analysis


This section overviews recent research on model selection in forecasting, testing, and policy analysis. In their research program into economic forecasting, Clements and Hendry (1998b, 1999a) delineate the sources of forecast errors, deduce their implications, and propose a number of solutions. They show that unanticipated deterministic shifts are the major cause of forecast failure, and hence model selection seems to play a relatively peripheral role in that arena. Section 8.1 discusses these potential sources of forecast errors, considering deterministic factors, shifts in these, model mis-specification, and estimation uncertainty, before investigating sample-selection effects and describing a Monte Carlo study of their operational impact on forecast failure. Secondly, theory testing is often conducted without first ascertaining the congruence of the models used. The dangers of doing so are discussed in section 8.2. Finally, Hendry and Mizon (2000) consider selecting policy-analysis models by their forecast accuracy, and show that such a criterion is not generally appropriate. This arises because, in the framework postulated by Clements and Hendry, it is impossible to prove the potential dominance in forecasting of causal variables over non-causal, of well-specified models over badly mis-specified, or even of 1-step forecasts over multi-step. Consequently, models may perform well in forecasting purely because of their robustness to deterministic shifts, not because of their intrinsic ability to characterize the economy: indeed, the same model may do better or worse depending on whether its proprietors implement appropriate intercept corrections when forecasting. Taken together, these results establish the following about data-based model selection: a fully-structured search procedure has excellent properties in terms of locating the DGP; model selection does not greatly distort forecasts; theory testing is hazardous unless congruent models are used; and forecast accuracy should not be used to select models in the policy arena.

However, we have not yet formally established that Gets should be used for selecting policy models from a theory-based general unrestricted model (GUM), as that almost certainly does not nest the DGP in practice and indeed such a proof may not be possible. 8.1 Model selection for forecasting Forecast performance in a world of deterministic shifts is not a good guide to model choice, unless the sole objective is short-term forecasting. This is because models which omit causal factors (including cointegrating relations, by imposing additional unit roots), may adapt more quickly in the face of unmodelled shifts, and so provide more accurate forecasts after a break. Consequently, there do not seem to be good grounds for selecting the best forecasting model for other purposes, such as economic policy analysis, or testing economic theories. We rst consider the sources of forecast errors, then investigate the impact of selection on forecast errors. 8.1.1 Sources of forecast errors A separate set of lecture notes discusses this topic in detail: the aim here is simply to provide a summary for continuity. Econometric models have three main components: deterministic terms, whose future values are known; observed stochastic variables with unknown future values; and unobserved errors all of whose


values are unknown. Any, or all, of these three components could be the source of forecast errors. Moreover, a model's relationships could be mis-specified, poorly estimated, based on inaccurate or non-stationary data, or selected by data-based methods; they could involve collinear variables in non-parsimonious formulations; and they could suffer structural breaks. Given the complexity of modern economies, most of these possible causes will be present in any empirical macro-model, and will reduce forecast performance, either from inaccuracy or imprecision. However, some mistakes have more pernicious effects on forecasts than others, and most combinations do not seem to induce systematic forecast failure: the analysis implicates structural breaks in deterministic terms as the primary cause of forecast failure. Forecast failure is defined as a significant deterioration in forecast performance relative to the anticipated outcome, usually based on the historical performance of a model. Consequently, forecast failure would rarely occur in a constant-parameter, stationary world, with the implication that data-based selection should not play a key role. Moreover, many conventional results that can be proved in a constant-parameter, stationary setting change radically under parameter non-constancy: examples include the potential dominance in forecasting of causal variables by non-causal, of well-specified models by badly mis-specified, of 1-step forecasts by multi-step, and of known parameter values by estimated, as well as the value-added of intercept corrections and differencing transforms. The impact of model selection again seems secondary relative to the impacts of unmodelled non-stationarities. The potential sources of forecast errors include mistakes that derive from:
(1) formulating a forecasting model according to incomplete theoretical notions,
(2) selecting by inappropriate empirical criteria,
(3) mis-specifying the system (to an unknown extent),
(4) estimating its parameters (possibly inconsistently),
(5) from (inaccurate) observations,
(6) generated by an integrated-cointegrated process,
(7) subject to intermittent structural breaks.

Such a framework closely mimics the empirical setting, and any resulting forecast-error taxonomy must include a source for each of the effects in (1)–(7), partitioned appropriately for deterministic and stochastic influences. Although the decompositions of the errors are not unique, they can be expressed for any given system in nearly-orthogonal effects corresponding to influences on forecast-error means and variances respectively. The former involve all the deterministic terms; the latter the remainder. We now briefly consider these major categories of error, commencing with mean effects, then turn to variance components. The documentation supporting the claims made in the remainder of this section is provided in Clements and Hendry (1998b, 1999a).

Mean effects Systematic forecast-error biases derive from deterministic factors being mis-specified, mis-estimated, or non-constant. The key effect derives from the uncertainty in, or mis-specification of, or changes to, the equilibrium mean (not necessarily the intercept in a model), which is the unconditional expectation E[y_t] (where that exists) of the non-integrated components of the vector of variables under analysis. Let E_m[·] denote the expectation based on the model; then the 1-step forecast-error bias is E[y_{T+1}] − E_m[y_{T+1}]. This will differ from zero when E[y_{T+1}] shifts unexpectedly relative to the model's mean, or when E_m[y_{T+1}] is a biased estimate of E[y_{T+1}]. Consequently, E[y_{T+h}] − E_m[y_{T+h}] is a major determinant of systematic forecast failure. In non-stationary dynamic processes, where unconditional expectations may not be defined, this expression can be generalized to E[y_{T+h} | Y_T] − E_m[y_{T+h} | Y_T], where Y_T denotes the forecast origin.
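To see the mean-shift mechanism in isolation, the following minimal simulation (an illustrative sketch: the AR(1) design, the parameter values, and the no-change comparison are assumptions chosen for exposition) shifts the equilibrium mean at the forecast origin and compares 1-step forecast errors from the estimated model with those from a differencing-based "no-change" predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H = 100, 20            # estimation sample and forecast period (assumed)
rho, sigma = 0.5, 1.0     # AR(1) dynamics
mu_in, mu_out = 0.0, 5.0  # equilibrium mean before/after the forecast origin

def simulate(mu_path, rho, sigma, y0=0.0):
    """Generate y_t = mu_t*(1 - rho) + rho*y_{t-1} + eps_t, so E[y_t] tracks mu_t."""
    y, prev = np.empty(len(mu_path)), y0
    for t, mu in enumerate(mu_path):
        prev = mu * (1.0 - rho) + rho * prev + rng.normal(0.0, sigma)
        y[t] = prev
    return y

y = simulate(np.r_[np.full(T, mu_in), np.full(H, mu_out)], rho, sigma)

# Estimate the AR(1) with intercept on the pre-shift sample only.
X = np.column_stack([np.ones(T - 1), y[:T - 1]])
b = np.linalg.lstsq(X, y[1:T], rcond=None)[0]      # [intercept, rho_hat]

idx = np.arange(T, T + H)                          # post-shift forecast origins
e_model = y[idx] - (b[0] + b[1] * y[idx - 1])      # 1-step errors, estimated model
e_rw = y[idx] - y[idx - 1]                         # 1-step errors, no-change predictor

print(f"mean 1-step error, estimated AR(1): {e_model.mean():5.2f}")
print(f"mean 1-step error, no-change rule : {e_rw.mean():5.2f}")
```

In this design the estimated model's 1-step errors are systematically positive, at roughly (1 − ρ) times the shift, whereas the no-change predictor adapts after a single period: exactly the pattern that makes robust devices competitive after deterministic shifts.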


The admissible deductions on observing either the presence or absence of forecast failure are rather stark, particularly for general methodologies which believe that forecasts are the appropriate way to judge empirical models. In this setting of structural change, there may exist non-causal models (i.e., models none of whose explanatory variables enter the DGP) that do not suffer forecast failure, and indeed may forecast absolutely more accurately, on reasonable measures, than previously congruent, theory-based models. Consequently, even relative success or failure is not a reliable basis for selecting between models other than for forecasting purposes. Conversely, a model that suffers severe forecast failure may nonetheless have constant parameters on ex post re-estimation: apparent failure in forecasting need have no implications for the goodness of a model, or its theoretical underpinnings, as it may arise from incorrect data that are later corrected (see the concept of extended constancy in Hendry, 1996). Mis-specification of zero-mean stochastic components is unlikely to be a major source of forecast failure, but interaction with breaks elsewhere in the economy could precipitate failure. Equally, the false inclusion of variables which experienced equilibrium-mean shifts could have a marked impact on forecast failure: the model mean shifts although the data mean does not, thereby changing E_m[y_{T+h}] when E[y_{T+h}] is unaltered. Finally, forecast-origin mis-measurement can also be pernicious, as an incorrect starting level is carried forward in dynamic models, so most forecasting agencies carefully appraise the latest observations for consistency with other available information.

Variance effects Estimation uncertainty for the parameters of stochastic variables seems a secondary problem, as such errors add variance terms of O(1/T) for stationary components and O(1/T^2) for non-stationary components, for samples of size T. However, mis-estimation of the coefficients of deterministic terms could be deleterious to forecast accuracy when E[y_{T+h}] − E_m[y_{T+h}] is large by chance. Neither collinearity nor a lack of parsimony per se seems a key culprit, although interaction with breaks occurring elsewhere in the economy could induce serious problems. Concerning the former, the ratio of in-sample to forecast-period MSE depends on the sum of the ratios of the respective eigenvalues of the regressor second-moment matrix, and the ratios of the smallest eigenvalues could change radically when collinearity alters, consistent with the need to eliminate non-systematic effects when forecasting. Similarly, concerning the latter, if a lack of parsimony entailed false inclusion of variables that suffered mean shifts, forecast failure could ensue, but on the combined sample the offending variables would be dropped, and the model would re-appear as constant (this effect may partly explain why arbitrary prior restrictions on VARs can improve forecast performance: by excluding most influences, one will exclude irrelevant effects which later happen to change).

8.1.2 Sample selection experiments

The impact of over-fitting in inducing forecast failure is the subject of the Monte Carlo study by Clements and Hendry (1999b). They examine whether simplification can induce spurious forecast failure by comparing a model which coincides with the DGP, an unrestricted model (over- or under-parameterized relative to the DGP), and a simplified model based on a t-test simplification using rules of the form |t| < 2.
Scenarios with and without breaks, for over- and under-specified models, are considered.

False inclusion without breaks The DGP is:

$$y_t = \mu + \beta z_t + \epsilon_t, \qquad \epsilon_t \sim \mathsf{IN}[0, 1], \tag{34}$$

where z_t is strongly exogenous for (μ, β), generated by z_t ~ IN[0, σ_z^2] with z_t and ε_t independent. They set β = 0 in (34) and fitted three models: the GUM (equation (34)); the restricted model (which is then the DGP, y_t = μ + ε_t); and a selected model based on a test of the significance of z_t in the GUM. For 100,000 Monte Carlo replications, at T = 6, 9, . . . , 30, they find that there is little size distortion.
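A stripped-down re-implementation conveys the design (an illustrative sketch only: the 1-step Chow test, the intercept value μ = 1, the replication count, and the single sample size are assumptions for exposition, not the settings of the original study).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def chow_1step(y, X, y_new, x_new, level=0.05):
    """1-step forecast-failure (Chow) test, F(1, T-k) under the null of no break."""
    T, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ b) ** 2) / (T - k)
    var_fe = s2 * (1.0 + x_new @ np.linalg.inv(X.T @ X) @ x_new)
    return (y_new - x_new @ b) ** 2 / var_fe > stats.f.ppf(1 - level, 1, T - k)

def false_inclusion(T=20, reps=5000):
    """DGP: y_t = mu + eps_t (beta = 0); z_t is an irrelevant candidate regressor."""
    rej = {"GUM": 0, "DGP": 0, "selected": 0}
    for _ in range(reps):
        z = rng.normal(0.0, 1.0, T + 1)
        y = 1.0 + rng.normal(0.0, 1.0, T + 1)
        Xg, Xr = np.column_stack([np.ones(T), z[:T]]), np.ones((T, 1))
        xg, xr = np.array([1.0, z[T]]), np.array([1.0])
        # 'selection': retain z only if its |t-ratio| >= 2 in the GUM
        b, *_ = np.linalg.lstsq(Xg, y[:T], rcond=None)
        s2 = np.sum((y[:T] - Xg @ b) ** 2) / (T - 2)
        keep_z = abs(b[1]) / np.sqrt(s2 * np.linalg.inv(Xg.T @ Xg)[1, 1]) >= 2.0
        rej["GUM"] += chow_1step(y[:T], Xg, y[T], xg)
        rej["DGP"] += chow_1step(y[:T], Xr, y[T], xr)
        rej["selected"] += chow_1step(y[:T], Xg if keep_z else Xr, y[T],
                                      xg if keep_z else xr)
    return {m: r / reps for m, r in rej.items()}

print(false_inclusion())   # all three rejection frequencies should sit near 5%
```

With no break and an irrelevant z_t, the rejection frequencies of the forecast test stay close to the nominal level for all three models, echoing the "little size distortion" finding.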

False exclusion without breaks Now the DGP is:

$$y_t = \mu + \beta z_t + \epsilon_t, \qquad \epsilon_t \sim \mathsf{IN}[0, 1],$$

and they considered combinations of β = {0.15, 0.3, . . . , 1.5} and T = {3, 6, . . . , 30}. The restricted and unrestricted models' forecast-failure rejection frequencies (RFs) were close to the nominal 5% size. When selecting at a nominal 1% level, the RFs never exceeded 2% and fell in T. For 5%-level selection tests, the RFs fell in T, from 9% at T = 3 to 5% at T = 30.

False exclusion and post-sample breaks For forecast-period shifts, they set T = 20, where β = {0.15, 0.3, . . . , 1.5}, and z_t was given by:

$$z_t = \lambda\, 1_{(t \geq \tau)} + \nu_t, \qquad \nu_t \sim \mathsf{IN}[0, 1],$$

where 1_{(t ≥ τ)} is unity if the subscript is true (τ = 21), with λ = {0.3, . . . , 3}. Thus, y_t depends on z_t, but the mean of z_t shifts by λ at t = 21. The estimation period is t = 1, . . . , 20, followed by a 1-step ahead forecast test for period t = 21. The RFs for the restricted model are increasing in β and λ, but the RFs for the selected model are smaller than those from the restricted model, so selection limits how badly one can do.

False exclusion and in-sample breaks Now τ = 19 (an in-sample break). The RFs for the restricted model are again increasing in β and λ, but the RFs for the selected model are smaller than those from the restricted model, confirming that selection limits how badly one can do. Forecast-failure RFs for both the restricted and selected models are lower than for the forecast-period shift. Rejection frequencies of the in-sample test, H0: β = 0, are higher when the break occurs within-sample.

Taken together, these experiments suggest that observed forecast failure is probably due to some of the other sources discussed above. Since the MSFEs of the restricted models are smaller than those of the initially-unrestricted equations, it is generally beneficial to simplify. When excess rejection occurs, it is usually because simplification reduced the in-sample standard error below the uncertainty implicit in the model, so it is almost of the false-rejection variety. Thus, they find little evidence that strategies such as general-to-specific induce significant over-fitting, or thereby cause forecast-failure rejection rates to greatly exceed nominal sizes. Parameter non-constancies put a premium on correct specification, but in general, model-selection effects appear to be relatively small, and progressive research is often able to detect the mis-specifications.
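The corresponding post-sample-break experiment can be sketched in the same way (again purely illustrative: β and λ are single values taken from within the grids above, and the forecast-failure criterion is a standard 1-step Chow test).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def forecast_fail(y, X, y_new, x_new, level=0.05):
    """1-step forecast-failure (Chow) test at significance level 'level'."""
    T, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ b) ** 2) / (T - k)
    var_fe = s2 * (1.0 + x_new @ np.linalg.inv(X.T @ X) @ x_new)
    return (y_new - x_new @ b) ** 2 / var_fe > stats.f.ppf(1 - level, 1, T - k)

def break_experiment(beta=0.6, lam=1.5, tau=21, T=20, reps=5000):
    """Mean of z shifts by 'lam' at t = tau; forecast test for t = T + 1 = 21."""
    rej_restricted = rej_selected = 0
    for _ in range(reps):
        t = np.arange(1, T + 2)
        z = lam * (t >= tau) + rng.normal(0.0, 1.0, T + 1)
        y = 1.0 + beta * z + rng.normal(0.0, 1.0, T + 1)
        Xg, Xr = np.column_stack([np.ones(T), z[:T]]), np.ones((T, 1))
        # retain z in the 'selected' model only if its |t-ratio| >= 2 in the GUM
        b, *_ = np.linalg.lstsq(Xg, y[:T], rcond=None)
        s2 = np.sum((y[:T] - Xg @ b) ** 2) / (T - 2)
        keep = abs(b[1]) / np.sqrt(s2 * np.linalg.inv(Xg.T @ Xg)[1, 1]) >= 2.0
        rej_restricted += forecast_fail(y[:T], Xr, y[T], np.array([1.0]))
        if keep:
            rej_selected += forecast_fail(y[:T], Xg, y[T], np.array([1.0, z[T]]))
        else:
            rej_selected += forecast_fail(y[:T], Xr, y[T], np.array([1.0]))
    return rej_restricted / reps, rej_selected / reps

print(break_experiment())   # the restricted (z-omitting) model fails more often
```

Because the restricted model omits z_t entirely, its forecast misses the whole βλ effect of the shift, whereas the selected model retains z_t whenever it is significant in-sample, so its rejection frequency is lower: selection limits how badly one can do.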


8.2 Model selection for theory testing

Although not normally perceived as a selection issue, tests of economic theories based on whole-sample goodness-of-fit comparisons can be seriously misled by breaks. If unmodelled shifts lead to lagged information from other variables appearing irrelevant, theory tests can be badly distorted. For example, if cointegration fails, then long-run relationships, often viewed as the statistical embodiment of economic-theory predictions, will receive no support. A potential example concerns tests of the implications of the Hall (1978) Euler-equation consumption theory when credit rationing changes, as happened in the UK (see Clements and Hendry, 1998a). The log of real consumers' expenditure on non-durables and services (c) is not cointegrated with log real personal disposable income (y) over 1962(2)–1992(4) (a unit-root test using 5 lags of each variable, a constant and seasonals delivers t_ur = 0.97, so does not reject: see Banerjee and Hendry, 1992, on the properties of this test), although the solved long-run relation is:

$$c = \underset{(0.99)}{0.53} + \underset{(0.10)}{0.98}\, y + \text{Seasonals}. \tag{35}$$

Lagged income terms are individually (max t = 1.5) and jointly (F(5, 109) = 1.5) insignificant in explaining Δ_4 c_t = c_t − c_{t−4}. Such evidence appears to support the Hall life-cycle model, which entails that consumption changes are unpredictable, with permanent consumption proportional to fully-anticipated permanent income. As fig. 5a shows for annual changes, the data behaviour is at odds with the theory after 1985, since consumption first grows faster than income for several years, then falls faster: far from smoothing. Moreover, the large departure from equilibrium in (35) is manifest in panel b, resulting in a marked deterioration in the resulting (fixed-parameter) 1-step forecast errors from the model in Davidson et al. (1978) after 1984(4) (the period to the right of the vertical line in fig. 5c). Finally, an autoregressive model for ΔΔ_4 c_t = Δ_4 c_t − Δ_4 c_{t−1} produces 1-step forecast errors which are smaller than average after 1984(4), consistent with a deterministic shift around the mid-1980s (see Hendry, 1994, and Muellbauer, 1994, for explanations based on financial deregulation inducing a major reduction in credit rationing), which neither precludes the ex ante predictability of consumption from a congruent model, nor consumption and income being cointegrated. The apparent insignificance of additional variables may be an artefact of mis-specifying a crucial shift, so the selected model is not valid support for the theory; and conversely, non-causal proxies for a break may seem significant. Thus, models used to test theories must first be demonstrated to be congruent and encompassing.

8.3 Model selection for policy analysis

A statistical forecasting system is one having no economic-theory basis, in contrast to econometric models, for which economic theory is the hallmark. Since the former system will rarely have implications for economic-policy analysis, and may not even entail links between target variables and policy instruments, being the best available forecasting device is insufficient to ensure its value for policy analysis. Consequently, the main issue is the converse: does the existence of a dominating forecasting procedure invalidate the use of an econometric model for policy? Since forecast failure often results from factors unrelated to the policy change in question, an econometric model may continue to characterize the responses of the economy to a policy, despite its forecast inaccuracy. Further, when policy changes are implemented, forecasts from a statistical model may be improved by combining them with the predicted policy responses from an econometric model. The forecasting model may nevertheless remain distinct from that policy model, for reasons explained by Hendry and Mizon (2000). The rationale for this analysis follows from the taxonomy of forecast errors in section 8.1.1, which recorded that deterministic shifts were the primary source of systematic forecast failure in econometric models. Devices like intercept corrections can robustify forecasting models against breaks which have occurred prior to forecasting (see, e.g., Clements and Hendry, 1996, and Hendry and Clements, 1999). While such tricks may mitigate forecast failure, the resulting models need not have useful policy implications. Conversely, post-forecasting policy changes will induce breaks in models that did not embody the relevant policy links, whereas econometric systems need not experience that policy-regime shift. Consequently, when both structural breaks and regime shifts occur, neither class of model alone

45

is adequate: this suggests that they should be combined, and Hendry and Mizon (2000) provide an empirical illustration of doing so.

[Figure 5: UK real consumers' expenditure and income with model residuals. Panels show the annual changes Δ_4 c and Δ_4 y, the cointegration residual from (35), and the residuals from the DHSY and ΔΔ_4 c models, over roughly 1960–1990.]

8.3.1 Congruent modelling

As a usable knowledge base, theory-related, congruent, encompassing econometric models remain undominated by matching the data in all measurable respects (see, e.g., Hendry, 1995a). For empirical understanding, such models seem likely to remain an integral component of any progressive research strategy. Nevertheless, even the best available model can be caught out when forecasting by a sudden and unanticipated outbreak of (say) a major war or other crisis for which no effect was included in the forecast. However, if empirical models which are congruent within sample remain subject to a non-negligible probability of failing out of sample, then a critic might doubt their worth. Our defence of the program of attempting to discover such models rests on the fact that empirical research is part of a progressive strategy, in which knowledge gradually accumulates. This includes knowledge about general causes of structural changes, such that later models incorporate measures accounting for previous events, and hence are more robust (e.g., to wars, changes in credit rationing, financial innovations, etc.). For example, the dummies for purchase-tax changes in Davidson et al. (1978), which at the time mopped up forecast failure, later successfully predicted the effects of introducing VAT, as well as the consequences of its doubling in 1979; and the First World-War shift in money demand in Ericsson, Hendry and Prestwich (1998) matched that needed for the Second World War. Since we now have an operational selection methodology with excellent properties, Gets seems a natural way to select policy-analysis models. When the GUM is a congruent representation, embedding the relevant theory knowledge that is available with the target-instrument linkages, and parsimoniously encompassing previous empirical findings, the Gets selection strategy offers scope for selecting policy


models. Four features favour such a view. First, for a given null rejection frequency, variables that matter in the DGP are selected with the same probabilities as if the DGP were known. Absent omniscience, it is difficult to imagine doing much better. Secondly, although estimates are biased on average, conditional on retaining a variable, its coefficient provides an unbiased estimate of the policy reaction parameter. This is essential for economic policy: if a variable is included, PcGets delivers the right response; otherwise, when it is excluded, one is simply unaware that such an effect exists.4 Thirdly, the probability of retaining adventitiously significant variables is around the nominal size of the selection t-tests for the variables that remain after pre-selection simplification. If that is (say) even as large as 30 regressors, of which 5 actually matter, then at 1% significance, 0.25 variables will be retained on average: i.e., one additional spuriously-significant variable per four equations. This seems unlikely to distort policy in important ways. Finally, the sub-sample, or more generally recursive, selection procedures help reveal which variables have non-central t-statistics, and which have central ones (and hence should be eliminated). Conversely, the very problem that plagues forecasting, namely deterministic shifts, simply reflects the ease of detecting such breaks: other forms of break are not so detectable, as shown by Hendry and Doornik (1997) and Hendry (2000b). Indeed, changes to the coefficients of zero-mean variables can be difficult to detect in dynamic models, which may help explain the apparent empirical irrelevance of the Lucas (1976) critique: see Ericsson and Irons (1995). For policy models, such undetected changes could be hazardous: the estimated parameters would appear to be constant, but in fact be mixtures across regimes, leading to inappropriate advice (e.g., estimated impulse responses could have the wrong sign). In a progressive research context (i.e., from the perspective of learning), this is unproblematic, since most policy changes involve deterministic shifts (as opposed to mean-preserving spreads), hence earlier incorrect inferences will be detected rapidly; but that is cold comfort to the policy maker, or to the economic agents subjected to the wrong policies. Overall, the role of Gets in selecting policy models is promising, but a few steps remain to establish all its credentials.

4 This is one of three reasons why we have not explored shrinkage estimators, which have been proposed as a solution to the pre-test problem, namely, they deliver biased estimators (see, e.g., Judge and Bock, 1978). The second, and main, reason is that such a strategy has no theoretical underpinnings in processes subject to intermittent parameter shifts. The final reason concerns the need for progressivity, explaining more by less, which such an approach hardly facilitates.
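In symbols, if N candidate regressors survive pre-selection and k of them actually matter, the expected number of adventitiously retained variables at selection significance level α is approximately

$$(N - k)\,\alpha = (30 - 5) \times 0.01 = 0.25,$$

matching the "one additional spuriously-significant variable per four equations" figure quoted above; at α = 0.05 the corresponding count would be 1.25.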

9 Conclusions
The aim of these notes was to describe the theory of reduction, link it to general-to-specific modelling, and evaluate computerized model-selection strategies to see if they worked well, indifferently, or failed badly. Recent developments seem set to resolve several of the central problems in econometric methodology. Building on the Hoover and Perez (1999) approach of searching many reduction paths, Krolzig and Hendry (2000) have automated Gets in a program called PcGets. First, the initial general statistical model is tested for congruence, which is maintained throughout the selection process by diagnostic checks, thereby ensuring a congruent final model. Next, statistically-insignificant variables are eliminated by selection tests, both in blocks and individually. Many reduction paths are searched, to prevent the algorithm from becoming stuck in a sequence that inadvertently eliminates a variable that matters and thereby retains other variables as proxies. If several models are selected, encompassing tests resolve the choice; and if more than one congruent, mutually-encompassing choice remains, model-selection criteria are the final arbiter. Lastly, sub-sample significance helps identify spuriously significant regressors. PcGets, therefore, implements all of the methodological prerequisites argued for in Hendry (1995a).

As an analogy, PcGets works like a series of sieves: after testing that the initial model is congruent, it first removes the completely-irrelevant variables subject to retaining congruence, then checks all initially-feasible reduction paths to remove less-obviously irrelevant variables, before it tests between the contending selections by encompassing. The chosen congruent and encompassing model is then examined by checking the significance of variables in sub-samples to see if any fool's gold remains. Thus, an operational version of the Gets methodology confirms its power. The results are very positive: the diagnostic-test operational characteristics are fine; selection-test probabilities match those relevant to the DGP; and deletion-test probabilities are reasonable even when no sub-sample testing is used. On two empirical modelling problems, given the GUM that earlier investigators used, PcGets selects either closely similar, or somewhat improved, specifications. Thus, we deem the implementation successful, and deduce that the underlying methodology is appropriate for model selection in econometrics.

Computer automation of model selection is in its infancy, and already considerable progress has been achieved. The exceptional progress to date merely sets a lower bound on performance. Moreover, there is a burgeoning symbiosis between the implementation and the theory: developments in either stimulate advances in the other. Nevertheless, PcGets is a first attempt: consequently, we believe it is feasible to circumvent the baseline nominal selection probabilities. First, since diagnostic tests must be insignificant at every stage to proceed, PcGets avoids spurious inclusion of a variable simply because wrong standard errors are computed (e.g., from residual autocorrelation). Thus, it could attain the same lower bound as in a pure white-noise setting, since every selection must remain both congruent and encompassing. Secondly, following multiple paths reduces the overall size, relative to (say) stepwise regression, despite the hugely increased number of selection (and diagnostic) tests conducted. Such an outcome highlights that an alternative statistical theory of model selection is needed to the conventional Bonferroni-type of analysis, and Hendry and Krolzig (1999a) present the basics thereof. Intuitively, the iterative loops around sequences of path searches could be viewed as sieves of ever-decreasing meshes filtering out the relevant from the irrelevant variables: as an analogy, first large rocks are removed, then stones, then pebbles, so finally only the gold dust remains. Post-selection tests may further reduce the probability of including non-DGP variables below the nominal size of selection t-tests, at possible costs in the power of retaining relevant variables, and possibly the diagnostics becoming significant. Although the costs of missing real influences rise, the power-size trade-off in Hoover and Perez (1999) is quite flat around an 80-20 split.

So far, we have not examined the role of structural breaks in model selection other than in forecasting. In general, breaks in regressors in-sample should not alter the selection probabilities: there is still an α% chance of false inclusion, but different variables will be selected. However, breaks after selection, as the sample grows, should help to eliminate adventitious influences. PcGets tests for constancy as one diagnostic, and conducts sub-sample evaluation of reliability, but does not make decisions based on the latter information. In a progressive research strategy (PRS), such accrual of information is essential.
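To fix ideas, the multi-path logic can be caricatured in a few lines of code. The sketch below is emphatically not PcGets: it omits the congruence diagnostics, block (pre-selection) tests, encompassing comparisons, and sub-sample checks described above, and it lets the Schwarz/BIC criterion stand in as the sole arbiter among terminal models; every function name and setting is illustrative only.

```python
import numpy as np

def ols_fit(y, X, cols):
    """OLS on the regressors indexed by 'cols'; returns (coefficients, t-ratios, BIC)."""
    Z = X[:, cols]
    T, k = Z.shape
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ b
    se = np.sqrt((e @ e / (T - k)) * np.diag(np.linalg.inv(Z.T @ Z)))
    bic = T * np.log(e @ e / T) + k * np.log(T)
    return b, b / se, bic

def multi_path_search(y, X, t_crit=2.0):
    """Start one backward-elimination path per initially-insignificant variable,
    simplify each path until all retained variables are significant, then pick
    among the terminal models by BIC."""
    full = list(range(X.shape[1]))
    _, tvals, _ = ols_fit(y, X, full)
    starts = [v for v, t in zip(full, tvals) if abs(t) < t_crit] or [None]
    terminals = set()
    for first_drop in starts:
        keep = [v for v in full if v != first_drop]
        while len(keep) > 1:
            _, tvals, _ = ols_fit(y, X, keep)
            t_min, weakest = min(zip(np.abs(tvals), keep))
            if t_min >= t_crit:
                break
            keep.remove(weakest)
        terminals.add(tuple(sorted(keep)))
    return min(terminals, key=lambda cols: ols_fit(y, X, list(cols))[2])

# toy check: an intercept, two relevant regressors, and eight irrelevant ones
rng = np.random.default_rng(1)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 10))])
y = 1.0 + 0.5 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(size=T)
print(multi_path_search(y, X))   # usually (0, 1, 2): the constant plus the two relevant variables
```

Even this caricature shows why following several paths matters: a single backward path that happens to drop a relevant variable early can retain proxies for it, whereas comparing the terminal models from many starting points, and arbitrating between them, guards against that outcome.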
Extensions of the algorithm discussed above are being explored. One concerns fixing economically-essential variables, then applying PcGets to the remaining orthogonalized variables. Forced search paths also merit consideration (e.g., all even-numbered variables in a Monte Carlo). Suppose two investigators commenced with distinct subsets of the GUM: would they converge to the same reduction if, after separate reduction exercises, they conducted encompassing tests, and recommenced from the union of their models? Such a search could be a forced path in the algorithm, and may well be a good one to follow, but remains a topic for future research. Implementing cointegration reductions and other linear transforms could also help here. Further work comparing Gets on the same data to simple-to-general approaches, and to other strategies such as just using information criteria to select, is merited. More detailed Monte Carlo studies

48

are required to investigate the impacts of breaks (interacting with the effects of sub-sample selection), collinearity, integration and cointegration. But the door is open, and we anticipate that some fascinating developments will follow for model selection. We have not yet established that Gets should be used for selecting policy models from a theory-based GUM, and such a proof may not be possible, despite the accuracy with which the DGP is located. Nevertheless, achieving that aim represents the next step of our research program, and we anticipate that Gets will perform well in selecting models for policy.

10 Appendix: encompassing
Consider two models M1 and M2 of a random variable y, neither model necessarily being the DGP (M0), where their respective families of density functions are given by:

$$\mathrm{M}_1:\ \mathsf{D}_1(y \mid \alpha), \quad \alpha \in \mathcal{A} \subseteq \mathbb{R}^k, \tag{36}$$

$$\mathrm{M}_2:\ \mathsf{D}_2(y \mid \delta), \quad \delta \in \mathcal{D} \subseteq \mathbb{R}^n, \tag{37}$$

$$\mathrm{M}_0:\ \mathsf{D}_y(y \mid \theta), \quad \theta \in \Theta \subseteq \mathbb{R}^p. \tag{38}$$

Two models M1 and M2 are isomorphic when their parameters are one–one transformations of each other, so they are the same model in different parameterizations, denoted by M1 ≐ M2. We assume that M1 and M2 are not isomorphic. From the theory of reduction, M1 and M2 are nested in M0, denoted by M1 ⊆ M0, etc. For nested models, there exist transformations t_1(θ) = ψ = (ψ_1 : ψ_2), such that M1 is derived by imposing ψ_2 = 0, and hence ψ_1 = α. There is no loss from reducing M0 to M1 when ψ_2 is in fact zero, but there is no necessity that ψ_2 = 0. Similarly, t_2(θ) = λ = (λ_1 : λ_2) delivers M2 when λ_2 = 0, with λ_1 = δ. Since some parameters of a model may be zero, when M3 is the extension of M1 which potentially allows ψ_2 ≠ 0 even though ψ_2 is zero, then we define the two models to be equivalent, denoted M1 ∼ M3. Thus, M1 ≐ M3 implies M1 ∼ M3. Equivalent models need not have the same number of parameters, but in a minimal-sufficient parameterization they would be isomorphic. Since M1 and M2 are nested in M0, their parameters are functions of those of the DGP, denoted by α = α(θ) and δ = δ(θ). In practice, M1 and M2 may not be nested with respect to each other, despite being nested in D_y(·), and one aim of encompassing is to compare their parameters irrespective of nesting. However, when M1 is the DGP, then M2 ⊆ M1 despite its appearance, so that δ must be a function of α. We denote this function by δ = δ_21(α), where δ_21 : A → D, and the subscripts record between which models the mapping occurs. Thus, as in the previous example, δ_21(α) is what M1 predicts M2 should find as the value of the parameter of D_2(y | δ). Even though θ is unknown, we can compare the predicted value δ_21(α(θ)) with the value δ = δ(θ) which M2 actually finds. The population encompassing difference between M2 and the prediction of M2 based on M1 is:

$$\phi(\theta_p) = \delta(\theta_p) - \delta_{21}(\alpha(\theta_p)). \tag{39}$$

Now we can define encompassing formally (see Hendry and Richard, 1989).

DEFINITION 1. Model 1 encompasses model 2, denoted by M1 E M2, with respect to δ, if φ = 0.

Encompassing is an exact property in the population, but is a contingent relation: it cannot be expected to hold for all values of θ. We could have M1 E M2 when θ = θ_1, whereas M2 E M1 when θ = θ_2. For φ to vanish (apart from probability-zero events), α(θ) must be a sufficient re-parameterization of θ in so far as δ is concerned. Encompassing will not be possible if there are aspects of θ on which δ depends but α does not. Letting E^c denote 'does not encompass', if M1 E^c M2, then M2 must reflect specific features of the DGP that are not included in M1.
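As a concrete special case (an illustrative sketch with assumed linear-regression notation, not a reproduction of the earlier example), let both models be linear in the same y_t:

$$\mathrm{M}_1:\; y_t = x_t'\alpha + \epsilon_{1t}, \qquad \mathrm{M}_2:\; y_t = w_t'\delta + \epsilon_{2t}.$$

Under the DGP, the value that M2 actually finds is the linear-projection coefficient δ(θ_p) = Σ_ww^{-1} E[w_t y_t], with Σ_ww = E[w_t w_t']. If M1 were a complete explanation, so that E[w_t ε_{1t}] = 0, it would predict that M2 should find

$$\delta_{21}(\alpha) = \Sigma_{ww}^{-1}\,\Sigma_{wx}\,\alpha, \qquad \Sigma_{wx} = \mathsf{E}[w_t x_t'],$$

so the encompassing difference (39) becomes

$$\phi = \Sigma_{ww}^{-1}\left(\mathsf{E}[w_t y_t] - \Sigma_{wx}\,\alpha\right),$$

which vanishes exactly when x_t'α already accounts for all of the covariation between w_t and y_t. Sample analogues of such differences underlie the encompassing test statistics of Mizon and Richard (1986).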

THEOREM 1. If M1 E M2, then either M2 E^c M1 or the two models are equivalent.

First consider that both M2 E M1 and M1 E M2 hold. The former implies α = α_12(δ), and hence that α(θ_p) − α_12(δ(θ_p)) = 0. Since φ = 0, then δ = δ_21(α) as well, so that:

$$\delta = \delta_{21}\!\left(\alpha_{12}(\delta)\right), \quad \text{and hence} \quad \delta_{21} \circ \alpha_{12} = \mathsf{I}_n. \tag{40}$$

Thus, after suitable re-parameterizations and the elimination of redundant parameters, the models M1 and M2 are isomorphic. Conversely, when φ = 0 but α − α_12(δ) ≠ 0, then M2 E^c M1.

The minimal nesting model is M_m, with parameter vector in R^m, such that M1 ⊆ M_m and M2 ⊆ M_m, and no smaller model with integer dimension less than m nests both M1 and M2.

Encompassing implements the notion of a progressive research strategy: a model-building strategy in which knowledge is gradually accumulated as codified, reproducible information about the world. It must have three properties:
(a) reflexivity: M1 E M1;
(b) anti-symmetry: if M1 E M2 and M2 E M1, then M1 ∼ M2;
(c) transitivity: if M1 E M2 and M2 E M3, then M1 E M3.
Then E defines a partial ordering across models. We define parsimonious encompassing (denoted E_p):

DEFINITION 2. M1 parsimoniously encompasses M2, written as M1 E_p M2, if M1 ⊆ M2 and M1 E M2.

Statistical-theory derivations are often simpler when M1 is nested in M2. Parsimonious encompassing is related to the original concept of encompassing via a refinement of the notion of the minimal nesting model M_m. Since M_m cannot contain any information additional to that in M1 and M2 if it is to be minimal, we have M_m = M1 ∪ M̃2, where M̃2 denotes the model which represents all aspects of M2 that do not overlap with M1. Then we have the following theorem:

THEOREM 2. When M1 ⊆ M_m and M1 E_p M_m, then M1 ∼ M_m.

Since M1 ⊆ M_m implies that M_m E M1, and, by definition 2, M1 E_p M_m implies M1 E M_m, then by theorem 1, M1 and M_m are equivalent. Thus we can establish the theorem:

THEOREM 3. If M1 ⊆ M_m and M2 ⊆ M_m, then M1 E M2 if and only if M1 E_p M_m.

From theorem 2, when M1 ⊆ M_m, if M1 E_p M_m, then M_m ∼ M1; and as M2 ⊆ M_m ∼ M1, then M1 E M2. Conversely, if M1 E_p^c M_m when M1 ⊆ M_m = M1 ∪ M̃2, then, since M1 E M1, M̃2 must contain additional information to that in M1, and hence M1 E^c M2.

Parsimonious encompassing E_p is transitive: if M1 E_p M_m and M_m E_p M_g, then M1 E_p M_g: a valid reduction of a valid reduction remains a valid reduction.


References
Abramowitz, M., and Stegun, N. C. (1970). Handbook of Mathematical Functions. New York: Dover Publications Inc.
Akaike, H. (1985). Prediction and entropy. In Atkinson, A. C., and Fienberg, S. E. (eds.), A Celebration of Statistics, pp. 1–24. New York: Springer-Verlag.
Banerjee, A., and Hendry, D. F. (1992). Testing integration and cointegration: An overview. Oxford Bulletin of Economics and Statistics, 54, 225–255.
Box, G. E. P., and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association, 65, 1509–1526.
Chatfield, C. (1995). Model uncertainty, data mining and statistical inference. Journal of the Royal Statistical Society, A, 158, 419–466. With discussion.
Chow, G. C. (1960). Tests of equality between sets of coefficients in two linear regressions. Econometrica, 28, 591–605.
Clements, M. P., and Hendry, D. F. (1996). Intercept corrections and structural change. Journal of Applied Econometrics, 11, 475–494.
Clements, M. P., and Hendry, D. F. (1998a). Forecasting economic processes. International Journal of Forecasting, 14, 111–131.
Clements, M. P., and Hendry, D. F. (1998b). Forecasting Economic Time Series: The Marshall Lectures on Economic Forecasting. Cambridge: Cambridge University Press.
Clements, M. P., and Hendry, D. F. (1999a). Forecasting Non-stationary Economic Time Series: The Zeuthen Lectures on Economic Forecasting. Cambridge, Mass.: MIT Press.
Clements, M. P., and Hendry, D. F. (1999b). Modelling methodology and forecast failure. Unpublished typescript, Economics Department, University of Oxford.
Davidson, J. E. H., and Hendry, D. F. (1981). Interpreting econometric evidence: The behaviour of consumers' expenditure in the UK. European Economic Review, 16, 177–192. Reprinted in Hendry, D. F. (1993), Econometrics: Alchemy or Science? Oxford: Blackwell Publishers.
Davidson, J. E. H., Hendry, D. F., Srba, F., and Yeo, J. S. (1978). Econometric modelling of the aggregate time-series relationship between consumers' expenditure and income in the United Kingdom. Economic Journal, 88, 661–692. Reprinted in Hendry, D. F. (1993), Econometrics: Alchemy or Science? Oxford: Blackwell Publishers.
Doornik, J. A. (1999). Object-Oriented Matrix Programming using Ox, 3rd edn. London: Timberlake Consultants Press.
Doornik, J. A., and Hansen, H. (1994). A practical test for univariate and multivariate normality. Discussion paper, Nuffield College.
Ericsson, N. R., Hendry, D. F., and Prestwich, K. M. (1998). The demand for broad money in the United Kingdom, 1878–1993. Scandinavian Journal of Economics, 100, 289–324.
Ericsson, N. R., and Irons, J. S. (1995). The Lucas critique in practice: Theory without measurement. In Hoover, K. D. (ed.), Macroeconometrics: Developments, Tensions and Prospects. Dordrecht: Kluwer Academic Press.
Faust, J., and Whiteman, C. H. (1997). General-to-specific procedures for fitting a data-admissible, theory-inspired, congruent, parsimonious, encompassing, weakly-exogenous, identified, struc-


tural model of the DGP: A translation and critique. Carnegie–Rochester Conference Series on Public Policy, 47, 121–161.
Gilbert, C. L. (1986). Professor Hendry's econometric methodology. Oxford Bulletin of Economics and Statistics, 48, 283–307. Reprinted in Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press.
Godfrey, L. G. (1978). Testing for higher order serial correlation in regression equations when the regressors include lagged dependent variables. Econometrica, 46, 1303–1313.
Godfrey, L. G., and Veale, M. R. (1999). Alternative approaches to testing by variable addition. Mimeo, York University, UK.
Granger, C. W. J. (ed.) (1990). Modelling Economic Series. Oxford: Clarendon Press.
Hall, R. E. (1978). Stochastic implications of the life cycle-permanent income hypothesis: Evidence. Journal of Political Economy, 86, 971–987.
Hannan, E. J., and Quinn, B. G. (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, B, 41, 190–195.
Harvey, A. C. (1981). The Econometric Analysis of Time Series. Deddington: Philip Allan.
Hendry, D. F. (1979). Predictive failure and econometric modelling in macro-economics: The transactions demand for money. In Ormerod, P. (ed.), Economic Modelling, pp. 217–242. London: Heinemann. Reprinted in Hendry, D. F. (1993), Econometrics: Alchemy or Science? Oxford: Blackwell Publishers.
Hendry, D. F. (1984). Monte Carlo experimentation in econometrics. In Griliches, Z., and Intriligator, M. D. (eds.), Handbook of Econometrics, Vols. 2–3, Ch. 16. Amsterdam: North-Holland.
Hendry, D. F. (1993). Econometrics: Alchemy or Science? Oxford: Blackwell Publishers.
Hendry, D. F. (1994). HUS revisited. Oxford Review of Economic Policy, 10, 86–106.
Hendry, D. F. (1995a). Dynamic Econometrics. Oxford: Oxford University Press.
Hendry, D. F. (1995b). Econometrics and business cycle empirics. Economic Journal, 105, 1622–1636.
Hendry, D. F. (1996). On the constancy of time-series econometric equations. Economic and Social Review, 27, 401–422.
Hendry, D. F. (1997). On congruent econometric relations: A comment. Carnegie–Rochester Conference Series on Public Policy, 47, 163–190.
Hendry, D. F. (2000a). Econometrics: Alchemy or Science? Oxford: Oxford University Press.
Hendry, D. F. (2000b). On detectable and non-detectable structural change. Structural Change and Economic Dynamics, Anniversary Issue. Forthcoming.
Hendry, D. F., and Clements, M. P. (1999). Economic forecasting in the face of structural breaks. In Holly, S., and Weale, M. (eds.), Econometric Modelling: Techniques and Applications. Cambridge: Cambridge University Press. Forthcoming.
Hendry, D. F., and Doornik, J. A. (1994). Modelling linear dynamic econometric systems. Scottish Journal of Political Economy, 41, 1–33.
Hendry, D. F., and Doornik, J. A. (1996). Empirical Econometric Modelling using PcGive 9 for Windows. London: Timberlake Consultants Press.
Hendry, D. F., and Doornik, J. A. (1997). The implications for econometric modelling of forecast failure. Scottish Journal of Political Economy, 44, 437–461. Special Issue.
Hendry, D. F., and Ericsson, N. R. (1991a). An econometric analysis of UK money demand in Monetary


Trends in the United States and the United Kingdom by Milton Friedman and Anna J. Schwartz. American Economic Review, 81, 8–38.
Hendry, D. F., and Ericsson, N. R. (1991b). Modeling the demand for narrow money in the United Kingdom and the United States. European Economic Review, 35, 833–886.
Hendry, D. F., and Krolzig, H.-M. (1999a). General-to-specific model specification using PcGets for Ox. Mimeo, Economics Department, Oxford University.
Hendry, D. F., and Krolzig, H.-M. (1999b). Improving on 'Data mining reconsidered' by K. D. Hoover and S. J. Perez. Econometrics Journal, 2, 41–58.
Hendry, D. F., Leamer, E. E., and Poirier, D. J. (1990). A conversation on econometric methodology. Econometric Theory, 6, 171–261.
Hendry, D. F., and Mizon, G. E. (1990). Procrustean econometrics: or stretching and squeezing data. In Granger (1990), pp. 121–136.
Hendry, D. F., and Mizon, G. E. (2000). On selecting policy analysis models by forecast accuracy. In Atkinson, A. B., Glennerster, H., and Stern, N. (eds.), Putting Economics to Work: Volume in Honour of Michio Morishima, pp. 71–113. London School of Economics: STICERD.
Hendry, D. F., and Morgan, M. S. (1995). The Foundations of Econometric Analysis. Cambridge: Cambridge University Press.
Hendry, D. F., and Richard, J.-F. (1982). On the formulation of empirical models in dynamic econometrics. Journal of Econometrics, 20, 3–33. Reprinted in Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press, and in Hendry, D. F. (1993), Econometrics: Alchemy or Science? Oxford: Blackwell Publishers.
Hendry, D. F., and Richard, J.-F. (1989). Recent developments in the theory of encompassing. In Cornet, B., and Tulkens, H. (eds.), Contributions to Operations Research and Economics. The XXth Anniversary of CORE, pp. 393–440. Cambridge, MA: MIT Press.
Hoover, K. D., and Perez, S. J. (1999). Data mining reconsidered: Encompassing and the general-to-specific approach to specification search. Econometrics Journal, 2, 1–25.
Jarque, C. M., and Bera, A. K. (1980). Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters, 6, 255–259.
Judge, G. G., and Bock, M. E. (1978). The Statistical Implications of Pre-Test and Stein-Rule Estimators in Econometrics. Amsterdam: North-Holland Publishing Company.
Judge, G. G., Griffiths, W. E., Hill, R. C., Lütkepohl, H., and Lee, T.-C. (1985). The Theory and Practice of Econometrics, 2nd edn. New York: John Wiley.
Keynes, J. M. (1939). Professor Tinbergen's method. Economic Journal, 44, 558–568.
Keynes, J. M. (1940). Comment. Economic Journal, 50, 154–156.
Koopmans, T. C. (1947). Measurement without theory. Review of Economics and Statistics, 29, 161–179.
Krolzig, H.-M., and Hendry, D. F. (2000). Computer automation of general-to-specific model selection procedures. Journal of Economic Dynamics and Control, forthcoming.
Leamer, E. E. (1978). Specification Searches: Ad-Hoc Inference with Non-Experimental Data. New York: John Wiley.
Leamer, E. E. (1983). Let's take the con out of econometrics. American Economic Review, 73, 31–43. Reprinted in Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press.


Lovell, M. C. (1983). Data mining. Review of Economics and Statistics, 65, 1–12.
Lucas, R. E. (1976). Econometric policy evaluation: A critique. In Brunner, K., and Meltzer, A. (eds.), The Phillips Curve and Labor Markets, Vol. 1 of Carnegie–Rochester Conferences on Public Policy, pp. 19–46. Amsterdam: North-Holland Publishing Company.
Mayo, D. (1981). Testing statistical testing. In Pitt, J. C. (ed.), Philosophy in Economics, pp. 175–230. D. Reidel Publishing Co. Reprinted as pp. 45–73 in Caldwell, B. J. (1993), The Philosophy and Methodology of Economics, Vol. 2. Aldershot: Edward Elgar.
Mizon, G. E., and Richard, J.-F. (1986). The encompassing principle and its application to non-nested hypothesis tests. Econometrica, 54, 657–678.
Muellbauer, J. N. J. (1994). The assessment: Consumer expenditure. Oxford Review of Economic Policy, 10, 1–41.
Neyman, J., and Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A, 175–240, 263–294.
Nicholls, D. F., and Pagan, A. R. (1983). Heteroscedasticity in models with lagged dependent variables. Econometrica, 51, 1233–1242.
Pagan, A. R. (1987). Three econometric methodologies: A critical appraisal. Journal of Economic Surveys, 1, 3–24. Reprinted in Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press.
Sargan, J. D. (1981). The choice between sets of regressors. Mimeo, Economics Department, London School of Economics.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48, 1–48. Reprinted in Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press.
Sims, C. A., Stock, J. H., and Watson, M. W. (1990). Inference in linear time series models with some unit roots. Econometrica, 58, 113–144.
Spanos, A. (1989). On re-reading Haavelmo: A retrospective view of econometric modeling. Econometric Theory, 5, 405–429.
Summers, L. H. (1991). The scientific illusion in empirical macroeconomics. Scandinavian Journal of Economics, 93, 129–148.
Theil, H. (1971). Principles of Econometrics. London: John Wiley.
Tinbergen, J. (1940a). Statistical Testing of Business-Cycle Theories. Geneva: League of Nations. Vol. I: A Method and its Application to Investment Activity.
Tinbergen, J. (1940b). Statistical Testing of Business-Cycle Theories. Geneva: League of Nations. Vol. II: Business Cycles in the United States of America, 1919–1932.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48, 817–838.
White, H. (1990). A consistent model selection. In Granger (1990), pp. 369–383.
Wooldridge, J. M. (1999). Asymptotic properties of some specification tests in linear models with integrated processes. In Engle, R. F., and White, H. (eds.), Cointegration, Causality and Forecasting, pp. 366–384. Oxford: Oxford University Press.
