
Reporter: MELVIN R. ESTOLANO

Title: POISSON REGRESSION ANALYSIS

STAT135 (INTRODUCTION TO REGRESSION ANALYSIS)

INTRODUCTION

In clinical work, one often encounters discrete response variables whose possible outcomes are counts. Often it is a count of rare events, such as the number of new cases of lung cancer occurring over a certain period of time, the number of births in a year, or the number of automobile thefts in 1995. The generalized linear model for such data assumes a Poisson distribution for the random component because, like counts, Poisson variates can take any nonnegative integer value. The aim of regression analysis in such instances is to model the mean of the dependent variable Y using some or all of the explanatory variables. Recall that when the response variable had a normal distribution, we found that its mean could be linked to a set of explanatory variables using a linear function, $\mu = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. In the case of binary regression, the fact that a probability lies between 0 and 1 imposes a constraint. The normality assumption of multiple linear regression is lost, and so is the constancy of variance. Without these assumptions the F and t tests have no basis. The solution was to use the logistic transformation of the probability p, or logit p, such that $\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$.

But in the case of Poisson regression, where the response variable is a count, we face a different constraint. Counts are nonnegative integers, and counts of rare events are approximately Poisson distributed. The Poisson mean must be greater than 0, so the model has to keep the mean positive. Though one can model the Poisson mean in generalized linear models using the identity link, it is more common to model the log of the mean, so that the logarithm of the mean of the response variable is linked to a linear function of the explanatory variables such that, for a single explanatory variable X,

$$\log(\mu) = \beta_0 + \beta_1 x$$

For this model, the mean satisfies the exponential relationship

$$\mu = \exp(\beta_0 + \beta_1 x)$$

or

$$\mu = e^{\beta_0}\left(e^{\beta_1}\right)^{x}$$

In other words, the typical Poisson regression model expresses the log outcome rate as a linear function of a set of predictors. So, a one-unit increase in X has a multiplicative impact of $e^{\beta_1}$ on the mean of Y: the mean of Y at x + 1 equals the mean of Y at x multiplied by $e^{\beta_1}$. If $\beta_1 = 0$, then $e^{\beta_1} = e^0 = 1$ and the multiplicative factor is 1; that is, the mean of Y does not change as X changes. If $\beta_1 > 0$, then $e^{\beta_1} > 1$, and the mean of Y increases as X increases; if $\beta_1 < 0$, then $e^{\beta_1} < 1$, and the mean of Y decreases as X increases.
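The multiplicative interpretation can be checked numerically. The short sketch below is only an illustration: the coefficient values and the function name `poisson_mean` are hypothetical and do not come from the fitted model in this report.

```python
import math

def poisson_mean(beta0, beta1, x):
    """Mean of Y under the log-link model log(mu) = beta0 + beta1 * x."""
    return math.exp(beta0 + beta1 * x)

# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = 0.5, 0.2

for x in (0, 1, 5):
    ratio = poisson_mean(beta0, beta1, x + 1) / poisson_mean(beta0, beta1, x)
    # The ratio does not depend on x; it always equals exp(beta1).
    print(x, round(ratio, 6), round(math.exp(beta1), 6))
```

Whatever starting value of x is used, the ratio of the two means is the same, which is why the coefficient is read as a multiplicative effect per unit of X.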

METHODOLOGY

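A Generalized Linear Model (GLM) assumes that the response data $Y_i$ have a distribution from an exponential family (normal, inverse Gaussian, gamma, Poisson, binomial). The generalized model is written as

$$g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} = x_i'\beta$$

where $Y_i$ follows an exponential family distribution with mean $\mu_i = E(Y_i)$, and $g(\cdot)$ is a link function, generally non-linear, that links the mean of the random component $Y_i$ to the systematic component $\eta_i = x_i'\beta$.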

Poisson regression analysis is a technique for modeling a dependent variable $Y_i$ that describes count data. It is a generalized linear model that assumes that the response data follow a Poisson distribution (Agresti, 1996).

Remember that, for generalized linear models, we use a link function to transform the mean of Y, so if we take the log of the mean and set it equal to the regression equation, we get a Poisson regression,

$$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$$

where log is the natural logarithm. We assume that the expected value of the observed response can be written as

$$E(y_i) = \mu_i$$

and that there is a function g that relates the mean of the response to a linear predictor, say

$$g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} = x_i'\beta$$

As in least-squares regression, the relationship is assumed to be linear, but on the log scale: the log of the mean changes linearly as a function of the explanatory variables.

The log link (much more common) is $g(\mu_i) = \log(\mu_i)$. The log of the mean is the natural parameter of the Poisson distribution, so the log link is the canonical link for GLMs with a Poisson distribution. The Poisson regression model for counts (with a log link) is $\log(\mu_i) = x_i'\beta$. This is often referred to as the Poisson loglinear model. The relationship between the mean of the response variable and the linear predictor for the log link function is

$$\mu_i = g^{-1}(x_i'\beta) = e^{x_i'\beta}$$

Poisson regression models are built on the same linear predictor as linear regression (which is based on the normal distribution) but use a log link to accommodate the non-linear relationship between the mean count and the predictors. This log link is particularly attractive for Poisson regression because it ensures that all of the predicted values of the response variable will be nonnegative. For the estimation of model parameters in Poisson regression we use the method of maximum likelihood; the development closely follows the approach used in logistic regression. If we have a random sample of n observations on the response Y and the predictor x, then the likelihood function is

$$L(y, \beta) = \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n} \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}$$

where $\mu_i = g^{-1}(x_i'\beta)$. Once the link function is specified, we maximize the log-likelihood

$$\ln L(y, \beta) = \sum_{i=1}^{n} y_i \ln(\mu_i) - \sum_{i=1}^{n} \mu_i - \sum_{i=1}^{n} \ln(y_i!)$$

Iteratively reweighted least squares can be used to find the maximum likelihood estimates of the parameters in Poisson regression. Once the parameter estimates $\hat{\beta}$ are obtained, the fitted Poisson regression model is

$$\hat{y}_i = g^{-1}(x_i'\hat{\beta})$$

For example, if the identity link is used, the prediction equation becomes

$$\hat{y}_i = g^{-1}(x_i'\hat{\beta}) = x_i'\hat{\beta}$$
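As a rough illustration of iteratively reweighted least squares with the log link, the sketch below fits a one-predictor model in plain Python. The data values and the function name `fit_poisson_irls` are made up for illustration; they are not from this report, and this is not the code that SAS runs.

```python
import math

def fit_poisson_irls(x, y, n_iter=25):
    """Fit log(mu_i) = b0 + b1 * x_i by iteratively reweighted least squares."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

# Hypothetical toy data (not the lung cancer data).
x_demo = [0, 1, 2, 3, 4, 5]
y_demo = [1, 1, 2, 4, 5, 9]
b0_hat, b1_hat = fit_poisson_irls(x_demo, y_demo)
print(round(b0_hat, 4), round(b1_hat, 4))
```

At the maximum likelihood estimates the residuals y - mu sum to zero, both unweighted and weighted by x, which gives a simple way to confirm that the iterations have converged.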

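The parameters are usually estimated using the maximum likelihood procedure. The Wald test is used for testing the significance of individual parameters in a GLM. The Wald test is based on the behavior of the log-likelihood function at the maximum likelihood (ML) estimate $\hat{\beta}$, and has the chi-squared form

$$z^2 = \left(\frac{\hat{\beta}}{ASE}\right)^2$$

where ASE stands for asymptotic standard error. Under the null hypothesis $\beta = 0$, this statistic has approximately a chi-squared distribution with 1 degree of freedom.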
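The Wald statistic can be reproduced by hand from an estimate and its standard error. The sketch below uses the intercept estimate and standard error reported in the Analysis of Parameter Estimates table of this report; the function name `wald_test` is an assumption made for illustration. Because the printed estimates are rounded to four decimals, the result matches the SAS chi-square only approximately.

```python
import math

def wald_test(estimate, ase):
    """Return the Wald chi-square (1 df) and its p-value for H0: beta = 0."""
    z = estimate / ase
    chi_sq = z * z
    # For 1 df, P(chi-square > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c / 2)).
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value

# Intercept row of the reported output: estimate -4.6710, standard error 0.9890.
chi_sq, p_value = wald_test(-4.6710, 0.9890)
print(round(chi_sq, 2), p_value < 0.0001)
```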

A Generalized Linear Model (GLM) gives an accurate description and valid inference only if the model fits the data well. The Pearson goodness-of-fit statistic is used to test whether the model adequately fits the data or not. The Pearson statistic has the form

$$X^2 = \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$$

where $y_i$ is the ith value of the response variable and $\hat{\mu}_i$ is the fitted mean of the response.

Poisson-distributed data raise three main problems for regression analysis under the classical assumptions. First, the Poisson distribution is skewed, while traditional regression assumes a symmetric distribution of errors. Second, the Poisson distribution is nonnegative, while classical regression assumes a distribution of values that may be negative. Third, the variance of a Poisson distribution increases as the mean increases, while traditional regression assumes a constant variance. Thus, Poisson regression is considered when the dependent variable is a count whose variance grows with its mean and which shows these other distributional problems.

Some important assumptions in Poisson regression:
1. Rates are log-linear (the log of the rate changes linearly with increasing exposure).
2. Interactions are on the multiplicative scale.
3. At each level of each covariate, the number of cases has variance equal to its mean.
4. Observations are independent.
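The first assumption is often written with the exposure entering as an offset. As a sketch, assuming $t_i$ denotes the person-years of observation for category i (a symbol introduced here only for illustration), a log-linear model for the rate is:

```latex
\log\left(\frac{\mu_i}{t_i}\right) = x_i'\beta
\quad\Longleftrightarrow\quad
\log(\mu_i) = \log(t_i) + x_i'\beta
```

In this form $\log(t_i)$ enters the linear predictor with its coefficient fixed at 1. The SAS analysis in this report instead enters personyears as an ordinary covariate.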

Some important properties of the Poisson distribution:
1. As $\mu$ increases, the mass of the distribution shifts to the right. Specifically, $E(Y) = \mu$. The parameter $\mu$ is known as the rate, since it is the expected number of times that an event occurs per unit of time. $\mu$ can also be interpreted as the mean or expected count.
2. The variance is equal to the mean, $Var(Y) = E(Y) = \mu$. The equality of the mean and the variance is known as equidispersion. In practice, count variables often have a variance greater than the mean, which is called overdispersion. A negative binomial regression model can be a remedy for this violation.
3. As $\mu$ increases, the probability of 0s decreases. For many count variables, there are more observed 0s than predicted by the Poisson distribution.
4. As $\mu$ increases, the Poisson distribution approximates a normal distribution.

Violations in Poisson regression:
1. Overdispersion: the variance of the counts is greater than the mean, contrary to the Poisson assumption. Poisson models may not be appropriate in this instance, and the negative binomial regression model can be used to remedy this violation.
2. Non-independence of events: this can be due to residual confounding. An alternative solution is to use autocorrelation or autoregressive models.
3. Excess zero observations: the data contain more 0 counts than the Poisson distribution predicts. A zero-inflated regression model can remedy this violation.
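The first three properties can be verified directly from the Poisson probability function. The sketch below is a small numerical check; the function names and the chosen values of the mean are assumptions made for illustration.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson variate with mean mu > 0, computed on the log scale."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def poisson_moments(mu, k_max=200):
    """Mean and variance computed from the pmf, truncating the sum at k_max."""
    probs = [poisson_pmf(k, mu) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

for mu in (0.5, 2.0, 10.0):
    mean, var = poisson_moments(mu)
    # Mean and variance both equal mu; P(Y = 0) = exp(-mu) shrinks as mu grows.
    print(mu, round(mean, 6), round(var, 6), round(poisson_pmf(0, mu), 6))
```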

RESULTS AND DISCUSSION

SAS Data Analysis: POISSON REGRESSION

A cohort of subjects, some non-smokers and others smokers, was observed for several years. The number of cases of cancer of the lung diagnosed among the different categories was recorded. Data regarding the number of years of smoking were also obtained from each individual. For each category the person-years of observation were calculated. The investigators wish to address the question of the relative risks of smokers.

Description of the Data

The data consist of 35 categories of smokers and non-smokers with different numbers of cases of lung cancer. The response variable of interest is cases: we predict the number of cases of cancer of the lung from the number of cigarettes smoked per day, the number of years of smoking, and the person-years of observation for each category.
data poissonreg;
   input cigsperday yearssmoking personyears cases;
   cards;
   [data lines as listed under Raw Data in the Appendices]
;
proc means data = poissonreg mean std min max var;
   var cases cigsperday yearssmoking personyears;
run;

The MEANS Procedure

Variable              Mean       Std Dev       Minimum       Maximum      Variance
----------------------------------------------------------------------------------
cases            2.2571429     3.0615534             0    10.0000000     9.3731092
cigsperday      17.0000000    12.7648414             0    40.0000000   162.9411765
yearssmoking    35.0000000    14.3486011    15.0000000    55.0000000   205.8823529
personyears        2403.63       2149.21   104.0000000      10366.00    4619091.83
----------------------------------------------------------------------------------

proc univariate data = poissonreg noprint;
   histogram cases / midpoints = 0 to 20 by 1 vscale = count;
run;

[Figure: histogram of cases from PROC UNIVARIATE. Horizontal axis: cases (0 to 20); vertical axis: Count (0 to 17.5).]

Since, over a number of years of observation, some cases of cancer of the lung can be expected to arise from causes not related to smoking, we use this as our baseline (uninformative) model and perform the analysis.
proc freq data = poissonreg;
   tables cases;
run;

The FREQ Procedure

                                  Cumulative    Cumulative
cases    Frequency     Percent     Frequency       Percent
-----------------------------------------------------------
    0           16       45.71            16         45.71
    1            5       14.29            21         60.00
    2            3        8.57            24         68.57
    3            2        5.71            26         74.29
    4            1        2.86            27         77.14
    5            2        5.71            29         82.86
    6            1        2.86            30         85.71
    7            2        5.71            32         91.43
    9            2        5.71            34         97.14
   10            1        2.86            35        100.00
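The summary statistics and the frequency table allow a quick, informal comparison with a single Poisson distribution. The sketch below uses the sample mean, sample variance, and zero count reported above; it ignores the covariates entirely, so it is only a marginal check and not a test of the regression model. The variable names are assumptions made for illustration.

```python
import math

n = 35                      # number of observations
mean_cases = 2.2571429      # sample mean of cases (MEANS output)
var_cases = 9.3731092       # sample variance of cases (MEANS output)
observed_zeros = 16         # frequency of cases = 0 (FREQ output)

# Zeros expected if all 35 counts came from one Poisson with this mean.
expected_zeros = n * math.exp(-mean_cases)

# Under equidispersion this ratio would be close to 1.
variance_to_mean = var_cases / mean_cases

print(round(expected_zeros, 2), observed_zeros, round(variance_to_mean, 2))
```

Some of the gap between observed and expected values may be explained once the predictors are in the model, which is why the conditional checks on the fitted model matter more than this marginal one.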

SAS POISSON REGRESSION ANALYSIS


proc genmod data = poissonreg;
   model cases = cigsperday yearssmoking personyears / dist=poisson;
run;

The GENMOD Procedure

Model Information
Data Set              WORK.POISSONREG
Distribution          Poisson
Link Function         Log
Dependent Variable    cases
Observations Used     35

Criteria for Assessing Goodness of Fit
Criterion             DF       Value    Value/DF
Deviance              31     74.1146      2.3908
Scaled Deviance       31     74.1146      2.3908
Pearson Chi-Square    31     62.6362      2.0205
Scaled Pearson X2     31     62.6362      2.0205
Log Likelihood               16.9020

Algorithm converged.

Analysis of Parameter Estimates
                              Standard      Wald 95%
Parameter      DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept       1    -4.6710    0.9890   -6.6094   -2.7325       22.30       <.0001
cigsperday      1     0.0559    0.0100    0.0363    0.0756       31.12       <.0001
yearssmoking    1     0.0888    0.0166    0.0562    0.1214       28.50       <.0001
personyears     1     0.0004    0.0001    0.0002    0.0006       15.54       <.0001
Scale           0     1.0000    0.0000    1.0000    1.0000

NOTE: The scale parameter was held fixed.

From the Analysis of Parameter Estimates table of the output, we can see that the variables CIGARETTES PER DAY, YEARS SMOKING and PERSON YEARS are highly significant (p < .0001). In the Criteria for Assessing Goodness of Fit section of the output, Value/DF is 2.39 for the deviance and 2.02 for the Pearson chi-square statistic. Values close to 1 would indicate that the Poisson model adequately describes the counts of CASES; values of about 2 suggest some overdispersion, so the model-based standard errors may be too small. The output also summarizes the Poisson regression coefficients for each of the variables along with the corresponding standard errors, Wald 95% confidence intervals, Wald chi-squared statistics and p-values. To obtain robust standard errors for the Poisson regression coefficients, we rerun proc genmod with the repeated statement.
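The Value/DF ratios can be recomputed from the reported deviance and Pearson chi-square, and a common rule of thumb compares them with 1. If the ratios are judged to be noticeably above 1, one widely used adjustment (not the one applied in this report) multiplies each model-based standard error by the square root of the Pearson ratio. The sketch below shows that arithmetic with the reported values; the variable names are assumptions made for illustration.

```python
import math

df = 31
deviance = 74.1146
pearson_chi_sq = 62.6362

deviance_ratio = deviance / df          # Value/DF for the deviance
pearson_ratio = pearson_chi_sq / df     # Value/DF for the Pearson chi-square

# Model-based standard errors from the Analysis of Parameter Estimates table.
model_se = {"cigsperday": 0.0100, "yearssmoking": 0.0166, "personyears": 0.0001}

# Pearson-scaled standard errors: inflate by sqrt(Pearson chi-square / DF).
scale = math.sqrt(pearson_ratio)
scaled_se = {name: se * scale for name, se in model_se.items()}

print(round(deviance_ratio, 4), round(pearson_ratio, 4))
for name, se in scaled_se.items():
    print(name, round(se, 4))
```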

proc genmod data = poissonreg;
   class cases;
   model cases = cigsperday yearssmoking personyears / dist=poisson;
   repeated subject = cases / type=cs;
run;

The GENMOD Procedure

Model Information
Data Set              WORK.POISSONREG
Distribution          Poisson
Link Function         Log
Dependent Variable    cases
Observations Used     35

Class Level Information
Class    Levels    Values
cases        10    0 1 2 3 4 5 6 7 9 10

Criteria For Assessing Goodness Of Fit
Criterion             DF       Value    Value/DF
Deviance              31     74.1146      2.3908
Scaled Deviance       31     74.1146      2.3908
Pearson Chi-Square    31     62.6362      2.0205
Scaled Pearson X2     31     62.6362      2.0205
Log Likelihood               16.9020

Algorithm converged.

Analysis Of Initial Parameter Estimates
                              Standard      Wald 95%
Parameter      DF   Estimate     Error   Confidence Limits   ChiSquare   Pr > ChiSq
Intercept       1    -4.6710    0.9890   -6.6094   -2.7325       22.30       <.0001
cigsperday      1     0.0559    0.0100    0.0363    0.0756       31.12       <.0001
yearssmoking    1     0.0888    0.0166    0.0562    0.1214       28.50       <.0001
personyears     1     0.0004    0.0001    0.0002    0.0006       15.54       <.0001
Scale           0     1.0000    0.0000    1.0000    1.0000

NOTE: The scale parameter was held fixed.

GEE Model Information
Correlation Structure           Exchangeable
Subject Effect                  cases (10 levels)
Number of Clusters              10
Correlation Matrix Dimension    16
Maximum Cluster Size            16
Minimum Cluster Size            1

Exchangeable Working Correlation
Correlation    0.9999

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates
                           Standard     95% Confidence
Parameter      Estimate       Error         Limits             Z    Pr > |Z|
Intercept        1.5472      0.2128    1.1300    1.9644     7.27      <.0001
cigsperday       0.0000      0.0000    0.0000    0.0000     3.50      0.0005
yearssmoking     0.0000      0.0000    0.0000    0.0000     4.15      <.0001
personyears      0.0000      0.0000    0.0000    0.0000     3.13      0.0017


The robust standard errors attempt to adjust for heterogeneity in the model. Since the standard errors change considerably when the robust standard errors are used, the robust version is more appropriate. The z-tests remain significant while giving more realistic p-values. The main body of the output shows the Poisson coefficients, robust standard errors, p-values and 95% confidence intervals for the coefficients. The p-value of every variable is less than the level of significance, so all of the variables are significant in the model and we do not need to drop any variable and rerun the model. Because every predictor in the final Poisson model is significant, we do not run a null model for comparison with the current model using the chi-squared test on the difference of log-likelihoods. If a variable were insignificant, we would drop it, rerun the model, and compare the log-likelihoods of the full and the reduced model using the test statistic

$$G^2 = -2\left(\ln L_{reduced} - \ln L_{full}\right)$$

which is approximately chi-squared distributed, with degrees of freedom equal to the number of dropped parameters.

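We accept the model and obtain a regression equation which, using the maximum likelihood estimates from the Analysis of Parameter Estimates table, could be written as

$$\log(\hat{\mu}) = -4.6710 + 0.0559\,(\text{cigsperday}) + 0.0888\,(\text{yearssmoking}) + 0.0004\,(\text{personyears})$$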

This equation is useful for estimating or predicting the relative risk of developing lung cancer by the number of cigarettes smoked (i.e. strength of the dose), or by the number of years smoking (total dose).
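A prediction from the fitted log-linear equation is obtained by exponentiating the linear predictor. The sketch below assumes the maximum likelihood estimates reported in the Analysis of Parameter Estimates table and evaluates them for observation 29 of the raw data (27 cigarettes per day, 45 years of smoking, 1409 person-years, 10 observed cases). The function name `predicted_cases` is hypothetical, and because the printed coefficients are rounded (the person-years coefficient to a single significant digit), the result is only approximate.

```python
import math

# Reported estimates, rounded to four decimals in the SAS output.
COEF = {
    "intercept": -4.6710,
    "cigsperday": 0.0559,
    "yearssmoking": 0.0888,
    "personyears": 0.0004,
}

def predicted_cases(cigsperday, yearssmoking, personyears):
    """Predicted mean count: exp of the linear predictor (log link)."""
    eta = (COEF["intercept"]
           + COEF["cigsperday"] * cigsperday
           + COEF["yearssmoking"] * yearssmoking
           + COEF["personyears"] * personyears)
    return math.exp(eta)

# Observation 29 in the Appendices.
mu_29 = predicted_cases(27, 45, 1409)
print(round(mu_29, 2))
```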

Some Results: Computation of the predicted increase in the number of cases of lung cancer for each predictor variable, holding the other predictors fixed.

For each additional cigarette smoked per day, the predicted number of cases of lung cancer is multiplied by

$$e^{0.0559} \approx 1.057$$

For each additional year of smoking, the predicted number of cases of lung cancer is multiplied by

$$e^{0.0888} \approx 1.093$$

For each additional person-year of observation, the predicted number of cases of lung cancer is multiplied by

$$e^{0.0004} \approx 1.0004$$


CONCLUSION

The typical Poisson regression model expresses the natural logarithm of the expected count of the event or outcome of interest as a linear function of a set of predictors. Poisson regression is yet another form of regression in which the response variable of interest is not normally distributed. The dependent variable is a count of the occurrences of interest, e.g. the number of cases of a disease that occur over a period of follow-up. Typically, one can estimate a rate ratio associated with a given predictor or exposure. The goodness of fit of the Poisson regression model is assessed by comparing the deviance statistic of a baseline model against that of a fuller model.

Poisson regression is not a very robust model compared to Cox regression, logistic regression, etc. If you have individual data, you should use these other models. However, Poisson regression may be the only option if you have count data and want to adjust for multiple confounders. It is not recommended that Poisson models be applied to small samples. The assumption of the Poisson model that the conditional mean is equal to the conditional variance needs to be checked, and other techniques for testing the assumptions, such as tests for overdispersion of the data, can be considered. Rejecting the null hypothesis that there is no overdispersion usually indicates a problem with the model specification, related to omitted predictor variables or to the functional form of the predictor variables.


REFERENCES

ANDERSON, C. Poisson Regression or Regression of Counts & Rate. Illinois.
AGRESTI, ALAN. (1996). An Introduction to Categorical Data Analysis. New York: John Wiley & Sons, Inc.
CAMERON, A.C. and TRIVEDI, P.K. (1998). Regression Analysis of Count Data. Cambridge University Press.
BARRIOS, E. and RURU, Y. (2003). Poisson Regression Models of Malaria Incidence in Jayapura, Indonesia. pp. 32.
HALEKOH, U. and HOJSGAARD, S. (2006). Count Data and Poisson Regression. Presentation. Danish Institute of Agricultural Sciences.
PARAGAS, E. (2010). Poisson Regression Analysis. Report.

Internet Sources:
http://en.wikipedia.org/wiki/Poisson_regression
http://www.oxfordjournals.org/our_journals/tropej/online/ma_chap13.pdf
http://www.crim.upenn.edu/faculty/papers/berk/regression.pdf
http://www.statpower.net/Content/MLRM/Lecture%20Slides/PoissonRegression.pdf


APPENDICES

Raw Data

Obs.   cigsperday (X1)   yearssmoking (X2)   personyears (X3)   cases (Y)
  1           0                 15                10366              1
  2           0                 25                 5969              0
  3           0                 35                 3512              0
  4           0                 45                 1421              0
  5           0                 55                  862              2
  6           5                 15                 3121              0
  7           5                 25                 2288              0
  8           5                 35                 1648              1
  9           5                 45                  927              0
 10           5                 55                  606              0
 11          11                 15                 3577              0
 12          11                 25                 2546              1
 13          11                 35                 1826              0
 14          11                 45                  988              2
 15          11                 55                  449              3
 16          16                 15                 4317              0
 17          16                 25                 3185              0
 18          16                 35                 1893              0
 19          16                 45                  849              2
 20          16                 55                  280              5
 21          20                 15                 5683              0
 22          20                 25                 5483              1
 23          20                 35                 3646              5
 24          20                 45                 1567              9
 25          20                 55                  416              7
 26          27                 15                 3042              0
 27          27                 25                 4290              4
 28          27                 35                 3529              9
 29          27                 45                 1409             10
 30          27                 55                  284              3
 31          40                 15                  670              0
 32          40                 25                 1482              0
 33          40                 35                 1336              6
 34          40                 45                  556              7
 35          40                 55                  104              1
