
Time Series, Cross-Sectional Analysis

Note on ARIMA models in Stata. When Stata estimates an AR(1) model with no covariates, for instance, it does not just put the lagged dependent variable on the right-hand side. Instead, it uses FGLS to put rho times the lag on the right-hand side. Note the slight difference between these two models, which is very small only because rho is so close to one in this dataset. This means that if the right specification means including a lag, you can't simply run an AR(1) model. Neal Beck has done some simulations to determine in what situations this is particularly problematic, and I think they are in the Beck/Katz 1995 APSR piece on panel-corrected standard errors (PCSEs). That article is the best place to learn about estimating standard errors correctly with this sort of data, and can be read if you know what a Kronecker product is. In today's lecture, I'll focus on coefficient estimates and specifications.
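In symbols (a sketch, using notation not in the original notes): the AR(1)-error model that arima fits can be written

$y_t = x_t\beta + u_t, \qquad u_t = \rho u_{t-1} + \varepsilon_t,$

which, after quasi-differencing, is equivalent to

$y_t = \rho y_{t-1} + x_t\beta - \rho x_{t-1}\beta + \varepsilon_t,$

while the lagged-dependent-variable regression is

$y_t = \rho y_{t-1} + x_t\beta + \varepsilon_t.$

With no covariates (x is just a column of ones) the two specifications coincide up to the intercept parameterization, so estimates differ only through the estimation method and the treatment of the first observation. With covariates, the AR(1)-error model imposes the "common factor" restriction that lagged x enters with coefficient minus rho times beta, which a lagged-dependent-variable specification does not; this is why a true lag in the specification cannot simply be handled by running an AR(1)-error model.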
. arima air if time>1949, ar(1)
(setting optimization to BHHH)
Iteration 0:   log likelihood = -706.79588
Iteration 6:   log likelihood = -706.59758

ARIMA regression

Sample:  1960m3 to 1972m1                       Number of obs      =       143
                                                Wald chi2(1)       =   2519.29
Log likelihood = -706.5976                      Prob > chi2        =    0.0000

------------------------------------------------------------------------------
             |                 OPG
         air |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
air          |
       _cons |   279.6855   67.52138     4.14   0.000     147.3461     412.025
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .9637296   .0192007    50.19   0.000      .926097    1.001362
-------------+----------------------------------------------------------------
      /sigma |    33.5509   1.848817    18.15   0.000     29.92729    37.17452
------------------------------------------------------------------------------

. reg air laggedair

      Source |       SS       df       MS              Number of obs =     143
-------------+------------------------------           F(  1,   141) = 1666.08
       Model |  1871165.72     1  1871165.72           Prob > F      =  0.0000
    Residual |   158355.94   141  1123.09177           R-squared     =  0.9220
-------------+------------------------------           Adj R-squared =  0.9214
       Total |  2029521.66   142  14292.4061           Root MSE      =  33.513

------------------------------------------------------------------------------
         air |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   laggedair |    .958932     .023493    40.82   0.000    .9124878    1.005376
       _cons |    13.7055    7.133673     1.92   0.057   -.3972779    27.80829
------------------------------------------------------------------------------
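For reference, a do-file sketch of how the lagged variable in the regression above could be built with Stata's time-series operators (this assumes a time variable named time, as in the arima command; it is a sketch, not the exact code used here):

tsset time                   // declare the time variable so L. operators work
gen laggedair = L.air        // lag of the dependent variable
regress air laggedair        // same as: regress air L.air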

VIII. Modeling Choices with Data Across Space and Time


A. Definitions. Suppose you have a dataset with information on N units (which may be countries, states, NGOs, or survey respondents) over T periods, giving you N × T observations. If N is very large relative to T, your dataset is cross-sectionally dominated, and if T is less than about five, you have panel data. This name comes from survey panels with lots of respondents observed for only a few time periods. Lucky economists with this type of data have focused on particular kinds of error relationships and proved their asymptotics as N gets very large. If your dataset has more time observations than units (for instance, trade balances since WWII in the Benelux nations), then you have time series-cross section data. You may want to pay more attention to time series processes with this sort of dataset, but the rest of the issues that we will discuss today apply to both sorts of data (although to varying degrees).

B. Describing your data to Stata. You should begin by reading the help file on xt, which sends you to a variety of commands used when you have data over units and time. Many of these commands can be used without describing your data to Stata, but if you are going to use any time series process models (like AR), then you need to use the tsset command, followed by the variable that identifies the units of the data and then the variable that identifies the time at which they are observed. For instance, in my Medicaid dataset, I can identify and then describe my data. Then I am free to use Stata's canned features that allow me to estimate the models described today, either using an OLS-like setup or incorporating a tobit, logit, poisson, negative binomial, or probit likelihood function.
. tsset alpha year
       panel variable:  alpha, 1 to 50
        time variable:  year, 1975 to 1993

The xt series of commands provide tools for analyzing cross-sectional time-series (panel) datasets:

help xtdes      Describe pattern of xt data
help xtsum      Summarize xt data
help xttab      Tabulate xt data
help xtdata     Faster specification searches with xt data
help xtline     Line plots with xt data

help xtreg      Fixed-, between- and random-effects, and population-averaged linear models
help xtregar    Fixed- and random-effects linear models with an AR(1) disturbance
help xtgls      Panel-data models using GLS
help xtpcse     OLS or Prais-Winsten models with panel-corrected standard errors
help xtrchh     Hildreth-Houck random coefficients models
help xtivreg    Instrumental variables and two-stage least squares for panel-data models
help xttobit    Random-effects tobit models
help xtintreg   Random-effects interval data regression models
help xtlogit    Fixed-effects, random-effects, & population-averaged logit models
help xtprobit   Random-effects and population-averaged probit models
help xtpoisson  Fixed-effects, random-effects, & population-averaged Poisson models
help xtnbreg    Fixed-effects, random-effects, & population-averaged negative binomial models
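To illustrate the common syntax these commands share, here is a minimal do-file sketch. The outcome and covariate names (spending, income, adopted, counts) are hypothetical, not from the Medicaid dataset:

tsset alpha year                 // declare unit and time variables first
xtreg spending income, fe        // fixed-effects linear model
xtreg spending income, re        // random-effects linear model
xtlogit adopted income, re       // random-effects logit
xtpoisson counts income, fe      // fixed-effects Poisson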

C. The Most Constrained Model. The simplest model for this type of data is the one that contains the most restrictive assumptions. In these observations, the units are indexed by i = 1, ..., N and the times are indexed by t = 1, ..., T.

$y_{i,t} = x_{i,t}\beta + e_{i,t}$

We assume that the intercept is the same for all units (and we don't even write it out separately, remembering that it is contained in the coefficient estimate for a column of ones included in the X matrix). Suppose we are explaining the GDP of European countries over the past 50 years. This assumption means that the intercepts for the countries, their baseline wealth controlling for all of our explanatory variables, are about the same. This may be the least tenable assumption, and both fixed effects and random effects are methods that loosen it and allow each unit to have its own intercept. If you needed to include unit-specific intercepts but specified your model incorrectly, the coefficients of variables that do not vary across time within a unit will be biased, because they will pick up all of the effects that should have been attributed to the intercepts.

We assume that the intercept is the same for each time period. If we did not use the price of oil in our model, all of our observations in, say, 1978 will likely have a lower intercept. We might want to alter this model by allowing intercepts to vary by year, and we do this in practice by including a dummy variable for T-1 years (year fixed effects); see the do-file sketch at the end of this section. If we don't, our errors will have contemporaneous autocorrelation.

We assume that our betas pool across time and space, that the effects of independent variables are the same in all units and at all times. As social scientists in search of general laws of politics, we want to be able to make this assumption.

But we may want to be cautious, or to specify that a certain variable has one impact in agricultural economies and another in industrial markets. If so, we can interact that variable with a dummy specifying the type of economy, thus allowing its effect to vary across types of units.

We assume that our errors are independent. In fact, there may be some heteroskedasticity that is real (which we know how to fix by running FGLS) or merely a product of constraining all intercepts to be the same (which we can fix by specifying the intercepts correctly). There may also be some AR or similar processes in each unit's time series, which we can model by including things like the lagged dependent variable.

D. The Least Constrained Model. If we don't want to make any of those assumptions, we could write out the following model, which gives each unit and time its own intercept and its own slope for each variable:

$y_{i,t} = \alpha_{i,t} + \beta_{i,t} x_{i,t} + e_{i,t}$

Nothing pools. Why not take this safe route? The problem is that this model would have no degrees of freedom: it has more parameters than observations. So we must pick the appropriate restrictions. But be careful, because if you constrain your model too much, you will have bias and not merely inefficiency.
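Returning to the fixes mentioned in section C, a minimal do-file sketch of year fixed effects and an interaction term. All variable names here (gdp, oil, industrial) are hypothetical:

tsset alpha year                        // unit and time variables, as above
xtreg gdp oil, fe                       // unit-specific intercepts (fixed effects)
xi: regress gdp oil i.year              // year fixed effects: dummies for T-1 years
gen oilXind = oil*industrial            // interaction with economy-type dummy
regress gdp oil oilXind industrial      // lets oil's effect vary by economy type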
