Stata Handbook

INTRODUCTION TO STATA SOFTWARE
21/09/2013
SOFTWARE OPTIONS (NO SUCH THING AS THE BEST)

MS Excel
SPSS
STATA
EViews
R
SAS (+ SAP, Oracle, Business Objects, etc.)
MATLAB
Shazam
Compare statistical packages online
Install student/trial versions
OBJECTIVES
As per prof. Verbi Introduction to the Stata software:
Capabilities
User interface
Menus vs. commands
Statistics features
Regression features
USEFUL INTRO INFO

STATA Help for all STATA commands
The STATA User's Guide and Reference Manual
STATA Journal - a quarterly publication containing articles about statistics, data analysis, teaching methods, and effective use of STATA's language
STATA Tutorial for Stock and Watson, Introduction to Econometrics, Pearson, 2003
University of Toronto, Department of Economics, Elena Capatina
www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm
www.iies.su.se/~masa/stata.htm
www.princeton.edu/~erp/stata/main.html
HOW TO GET DATA

World Bank databases
Penn World Tables
COMPUSTAT
OECD National Accounts Database
Greene, Wooldridge, Hsiao, Gujarati, Kennedy, etc.
Websites of professors
Google Scholar
STATA WINDOWS
The command window
The viewer/results window
The review of commands window
The variable window
[Screenshot: STATA main window showing the drop-down menu, Review window, Results window, Variables window, and Command window]
copy + later paste
DATA EDITOR VS. DATA BROWSER
Data editor shows you your data and you can edit it
Data browser shows you your data but you cannot edit it
Check your data frequently, especially after commands you are unsure about
[Screenshots: data browser and data editor]
TYPE OF COMMANDS
1. Administrative commands that tell STATA where to save results, how to manage computer memory, and so on
2. Commands that tell STATA to read and manage datasets
3. Commands that tell STATA to modify existing variables or to create new variables
4. Commands that tell STATA to carry out the statistical analysis
WORKING WITH STATA
Once you have started STATA, you will see a large window containing several smaller windows. At this point you can load the dataset and begin the statistical analysis. STATA can be operated interactively or in batch mode. When you use STATA interactively, you type each STATA command in the STATA command window and hit the Return/Enter key on your keyboard.
STATA executes the command and the results are displayed in the STATA Results window.
Then you enter the next command, STATA executes it, and so forth, until the analysis is complete. Even the simplest statistical analysis will involve several STATA commands.
When STATA is used in batch mode, all of the commands for the analysis are listed in a file, and STATA is told to read the file and execute all of the commands. These files are called do files by STATA and are saved using a .do suffix. When STATA executes a .do file, all of the empirical results for some work/paper/research study are produced.
Using STATA in batch mode has three important advantages over using STATA interactively.
1. .do files provide an audit trail for your work. The file provides an exact record of each STATA command, which allows you to be more efficient in your research.
2. .do files allow others to learn from your work, replicate your work, and find other ways to improve their or your research.
3. Everyone makes errors when using STATA. When a command contains an error, it will not be executed by STATA, or if it is, it will produce the wrong result. Following an error, it is often necessary to start the analysis from the beginning. If you are using STATA interactively, you must retype all of the commands. If you are using a .do file, then you only need to correct the command containing the error and rerun the file.
For these reasons, you are strongly encouraged to use .do files.
Two ways to run commands:
1. From the command window
2. Using a .do file
A .do file is a text file that can be edited using any text editor (the STATA do-file editor, Notepad, Word, etc.), but you need to save it as filename.do for STATA to read it
From the drop-down menu: File > Do... for STATA to execute all commands
Let's run our first STATA .do file
EXAMPLE: STATA1.DO
clear
log using caschool.log
use caschool.dta
describe
generate income = avginc*1000
summarize income
log close
exit
THE LOG USING COMMAND

The log file is an output file
log using creates and saves a log with all the actions performed by STATA and all the results
How do I view this log file?
From the drop-down menu: File > Log > View..., then search for your filename, keeping in mind it has the extension .log
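The logging steps above can be sketched as a short .do file. This is a minimal sketch, not part of the original slides: the replace and text options and the capture prefix are additions, and the file name is taken from the STATA1.DO example.

```stata
* Sketch: open a plain-text log, write to it, close it, and read it back
capture log close                      // close any log left open by an earlier run
log using caschool.log, replace text   // replace overwrites an existing caschool.log
display "this line is recorded in the log"
log close
type caschool.log                      // print the saved log in the Results window
```

Without the replace option, log using stops with an error if caschool.log already exists, which is a common surprise when a .do file is rerun.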
LOADING YOUR DATA
3 ways to enter your data:
1. If your data is in STATA format, for example filename.dta, then enter: use filename.dta
2. If your data is a comma-delimited file, then enter: insheet using filename.txt
3. Or simply copy and paste your data from your .xls file. Warning: This is the most troublesome and error-prone way of loading your data.
For other formats, you can use the Stat/Transfer software to convert to STATA format
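A minimal sketch of option 2, assuming a small made-up dataset so that the example runs on its own. The file name example.csv and the variables id and income are hypothetical. In Stata 13 and later, import delimited is the newer replacement for insheet; it is shown only as a comment here.

```stata
* Sketch: write a small comma-delimited file, then read it back
clear
input id income
1 100
2 200
end
outsheet using example.csv, comma replace   // create the text file to load
insheet using example.csv, comma clear      // load it, replacing data in memory
describe
* Stata 13 or later (assumption about your version):
* import delimited using example.csv, clear
```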
USEFUL COMMANDS:
describe will list all the variables, their labels, and types, and tell you the number of observations
Two types of variables:
1. Numerical
2. String (usually appear in red in the data browser)
You can convert a string variable to numerical using the destring command. For example:
destring var1, replace
or
destring var1, force replace
** NO UNDO OPTION **
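Because there is no undo, one cautious pattern is to write the converted values to a new variable instead of replacing the original. The sketch below uses a made-up string variable; the name var1_num is hypothetical.

```stata
* Sketch: convert a string variable without overwriting it
clear
input str6 var1
"10"
"20"
"n/a"
end
destring var1, generate(var1_num) force   // non-numeric values become missing
list
```

With force, any value that cannot be read as a number is set to missing, so it is worth listing or tabulating the result before dropping the original variable.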
MORE COMMANDS:
generate or gen creates a new variable
e.g. generate income = avginc*1000
e.g. generate log_inc = log(income)
e.g. gen inc_sq = (income)^2
MORE COMMANDS:
summarize or summ tells STATA to compute summary statistics (mean, standard deviations, etc.) for all variables
This is useful to identify outliers and get an idea of your data!
e.g. summarize
e.g. summ income inc_sq
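For spotting outliers, the detail option adds percentiles and the largest and smallest values. A small sketch with made-up income values (one of them deliberately extreme):

```stata
* Sketch: summary statistics with percentiles
clear
input income
22000
25000
27000
30000
250000
end
summarize income, detail
display r(p50)    // median, stored by summarize, detail
display r(max)    // largest value, here the suspicious one
```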
ENDING THE DO FILE
log close closes the log file (caschool.log in the example above) that contains the output.
The command exit tells STATA that the program has ended.
Let's run our second STATA .do file
EXAMPLE: STATA2.DO
# delimit ;
* Increases memory for STATA to do work;
set memory 600m;
* Administrative commands;
set more off;
clear;
log using caschool.log, replace;
* Read in the Dataset;
use caschool.dta;
describe;
* Transform data and Create New Variables;
**** Construct average district income in $'s;
generate income = avginc*1000;
* Carry Out Statistical Analysis;
***** Summary Statistics for Income;
summarize income;
* End the Program ;
log close;
exit;
COMMENTS IN .DO FILE:
Star (*): STATA ignores the text that comes after *
These lines can be used to describe what the commands are doing
Allows you to write comments (usually administrative commands)
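Besides the star, .do files accept two other comment styles. This sketch is an addition to the slides and shows all three side by side:

```stata
* A line starting with a star is a comment
// Two slashes also start a comment in a .do file
display "hello"   // a comment can follow a command on the same line
/* A block comment
   can span several lines */
display 2 + 2
```

The // style is meant for .do files; it is not intended for commands typed interactively in the Command window.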
SOME USEFUL COMMANDS
# delimit ;
Tells STATA that each STATA command ends with a semicolon. Useful for long commands that span several lines. Do not forget the ; at the end of every command, including the comment lines that start with *.
set more off
Ensures STATA executes all commands without pausing. If the output is too long, the Results window might fill up, and STATA will display --more-- at the bottom and wait for a key press before continuing
set memory 600m
Increases memory available for STATA to do work (needed only in Stata 11 and earlier; newer versions manage memory automatically)
OTHER COMMANDS
tabulate shows the frequency and percent of each value of a certain variable in the dataset
e.g. tabulate county
generate ... if
e.g. generate teachers_new = teachers if teachers<=10
replace ... if
e.g. replace teachers_new=0 if teachers>10
MORE COMMANDS
summarize ... if
e.g. summarize teachers if county=="Nevada"
Note: use == (not =) to test equality, and put string values in double quotes. summarize teachers if county=Nevada and summarize teachers if county==Nevada will not work.
Some frequent operators:
< less than
> greater than
<= less than or equal to
>= greater than or equal to
== equal to
~= not equal to (!= also works)
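Conditions can be combined with & (and) and | (or). A small sketch with made-up data; the variable names follow the examples above, but the values are hypothetical.

```stata
* Sketch: combining conditions in an if qualifier
clear
input str10 county teachers
"Nevada" 12
"Nevada" 4
"Kern" 30
"Kern" 8
end
summarize teachers if county == "Nevada" & teachers > 5
count if county != "Nevada" | teachers <= 5
```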
MORE COMMANDS
by performs whatever command is given for each category of a variable; the data must be sorted by that variable first
e.g. by county: summarize income (after sort county)
e.g. by county, sort: summarize income
sort simply sorts data in ascending order (for descending order see gsort)
e.g. sort income
e.g. sort county income
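A short sketch of both ideas, using made-up data. bysort is shorthand for by ..., sort:, and gsort with a minus sign sorts in descending order.

```stata
* Sketch: group-wise summaries and descending sort
clear
input str10 county income
"Nevada" 15000
"Kern" 22000
"Nevada" 18000
"Kern" 20000
end
bysort county: summarize income   // same as: by county, sort: summarize income
gsort -income                     // largest income first
list
```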
DELETING VARIABLES AND OBSERVATIONS
drop: use this command to delete variables or observations
e.g. drop avginc deletes the average income variable
e.g. drop if teachers<=5 deletes only the observations for which the variable teachers is less than or equal to 5
keep: use this command to keep variables or observations; the opposite of drop; using keep drops everything that is not in the keep command
e.g. keep if teachers>=7
BASIC STATISTICAL RELATIONSHIPS

Correlation: correlate
Remember: Correlation coefficients which are close to -1 or +1 indicate a strong linear correlation. Values close to 0 indicate a weak linear correlation; 0 indicates no linear correlation at all.
e.g. correlate income teachers
e.g. correlate income teachers computer
Regression: reg performs OLS regression, predicting the value of the dependent variable from one or more independent variables
e.g. reg income teachers
e.g. reg income teachers computer
GRAPHS: SCATTER PLOTS
e.g. graph twoway scatter income computer
e.g. graph twoway scatter income computer || lfitci income computer
STATA graph editor
SAVING YOUR DATA
Saving data in Stata format (the usual way): File > Save As... or save filename.dta (on my PC the file is saved to C:\data)
Export your data in another format: File > Export (choose file format)
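The same steps can be done from a .do file. A minimal sketch with a made-up dataset; the file names are hypothetical. outsheet is the older export command, and Stata 13 and later also offer export delimited.

```stata
* Sketch: save in Stata format, export as comma-delimited text, and reload
clear
input id income
1 100
2 200
end
save example_data.dta, replace                   // replace overwrites an existing file
outsheet using example_data.csv, comma replace   // export as a comma-delimited file
use example_data.dta, clear                      // reload the saved dataset
```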
SOME DATA CLEANING COMMANDS
reshape transforms (converts) data from long to wide format or from wide to long format
Before using reshape, you need to determine whether the data are in long or wide form.
Also determine the logical observation (i) and the subobservation (j) by which to organize the data.
EXAMPLES
Wide format
famid  faminc96  faminc97  faminc98
1      40000     40500     41000
2      45000     45400     45800
3      75000     76000     77000

Long format
famid  year  faminc
1      96    40000
1      97    40500
1      98    41000
2      96    45000
2      97    45400
2      98    45800
3      96    75000
3      97    76000
3      98    77000

Data source: UCLA ATS
RESHAPE COMMANDS

Let's practice:
use http://www.ats.ucla.edu/stat/stata/modules/kidshtwt, clear
Then save example2.dta
Ask yourself:
Q: What is the stem of the variable going from wide to long?
A: The stems are ht and wt.
Q: What variable uniquely identifies an observation when it is in the wide form?
A: famid and birth together uniquely identify the wide observations.
Q: What do we want to call the variable which contains the suffix of ht (and wt)?
A: Let's call the suffix age.
From wide to long:
reshape long stem-of-wide-vars, i(wide-id-var) j(var-for-suffix)
Example:
browse
list famid birth ht1 ht2 wt1 wt2
reshape long ht wt, i(famid birth) j(age)
list famid birth ht wt
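The family income table from the EXAMPLES slide can be used for a self-contained sketch that goes from wide to long and back again. The data are typed in with input so no download is needed; the reshape wide step is an addition to the slides.

```stata
* Sketch: wide -> long -> wide with the family income example
clear
input famid faminc96 faminc97 faminc98
1 40000 40500 41000
2 45000 45400 45800
3 75000 76000 77000
end
reshape long faminc, i(famid) j(year)   // stem faminc, suffix stored in year
list
reshape wide faminc, i(famid) j(year)   // back to one row per family
list
```

Here famid is the logical observation (i) and year is the subobservation (j).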
EGEN COMMAND
Extended generate (egen) is more powerful than ordinary gen

Examples:
egen age_mean = mean(age), by(year)
egen stdage = std(age)
LAGGED VARIABLES

[_n-1] tells STATA this is the previous observation
[_n-2] is 2 observations before
Examples (first sort your data!):
gen GDP_lagged = GDP[_n-1]
gen GDP_2 = GDP[_n-2]
Other uses: filling in missing data by ID:
replace education=1 if education[_n-1]==1 & education[_n+1]==1 & ID[_n-1]==ID[_n+1]
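When the data contain several units (countries, firms, persons), a plain [_n-1] would carry the last value of one unit into the first row of the next. A minimal sketch of one way to avoid this, using bysort with made-up data; the variable names id, year and gdp are hypothetical.

```stata
* Sketch: lag within each id, sorted by year
clear
input id year gdp
1 2001 100
1 2002 110
1 2003 121
2 2001 200
2 2002 210
end
bysort id (year): generate gdp_lag = gdp[_n-1]   // _n restarts within each id
list, sepby(id)
```

The first observation of each id gets a missing lag, as it should.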
COLLAPSE COMMAND

Let's practice: use http://www.ats.ucla.edu/stat/Stata/modules/collapse.htm
Then save example3.dta
Example: create one record per family (famid) with the average age (avgage) and average weight (avgwt) within each family, and the number of kids (numkids) per family
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
NEW DATA AFTER COLLAPSE
famid  avgage    avgwt  numkids
1      6         40     3
2      5.333333  50     3
3      4         40     3
PRESERVING DATA
preserve tells STATA to keep a copy of your data, so if your next commands modify it, you can come back to your original data
restore gives you back your original data
Example:
use data1.dta
preserve
collapse (mean) age, by(family)
save data2.dta
restore
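A self-contained sketch of the same pattern, with made-up data in place of data1.dta. The file name family_means.dta is hypothetical.

```stata
* Sketch: collapse inside preserve/restore, then return to the full data
clear
input family age
1 10
1 12
2 7
2 9
end
preserve
collapse (mean) age, by(family)     // data in memory now has one row per family
save family_means.dta, replace
restore                             // the original four rows are back
list
```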
SIMPLE REGRESSION
Example:
use http://www.ats.ucla.edu/stat/stata/notes/hsb2
browse
regress science math female socst read
OUTPUT
[Screenshot: regress output with three parts marked: ANOVA table, Model fit, Parameter estimates]
ANOVA TABLE
Source: Looking at the breakdown of variance in the outcome variable, these are the categories we examine: Model, Residual, and Total. Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Residual, sometimes called Error).
SS: These are the Sums of Squares associated with the three sources of variance: Total, Model and Residual.
df: These are the degrees of freedom associated with the sources of variance.
DF
The total variance has N-1 degrees of freedom. The model degrees of freedom corresponds to the number of coefficients estimated minus 1. Including the intercept, there are 5 coefficients, so the model has 5-1=4 degrees of freedom. The Residual degrees of freedom is the DF total minus the DF model, 199-4=195.
DF is the number of free or linearly independent observations used in the calculation of the statistic. Equivalently, the DF of a statistic is the number of quantities that enter into the calculation of the statistic minus the number of constraints connecting these quantities, or the number of independent pieces of information available to estimate another piece of information.
In other words, the number of degrees of freedom is the number of independent observations in a sample of data that are available to estimate a parameter of the population from which that sample is drawn.
MS: These are the Mean Squares: the Sums of Squares divided by their respective DF.
MODEL FIT
Number of obs: This is the number of observations used in the regression analysis.
F(4, 195): This is the F-statistic. It is the Mean Square Model (2385.93019) divided by the Mean Square Residual (51.0963039), yielding F=46.69. The numbers in parentheses are the Model and Residual degrees of freedom from the previous ANOVA table.
Prob > F: This is the p-value associated with the F-statistic. It is used in testing the null hypothesis that all of the model coefficients are 0.
R-squared: This is the proportion of variance in the dependent variable (science) which can be explained by the independent variables (math, female, socst and read).
R-squared is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable.
Adj R-squared: This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model. Adjusted R-squared is computed using the following formula: $1 - (1 - R^2)\frac{N - 1}{N - k - 1}$, where k is the number of predictors.
Root MSE: Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Residual (or Error).
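The model fit quantities can be recomputed by hand from the results that regress stores in e(). The sketch below uses the auto dataset that ships with Stata instead of hsb2, an assumption made so that it runs without a download; the numbers will therefore differ from those on the slides.

```stata
* Sketch: recompute F, adjusted R-squared and Root MSE from stored results
sysuse auto, clear
regress price mpg weight
display e(mss) / e(df_m)                                      // Mean Square Model
display e(rss) / e(df_r)                                      // Mean Square Residual
display (e(mss)/e(df_m)) / (e(rss)/e(df_r))                   // matches e(F)
display 1 - (1 - e(r2)) * (e(N) - 1) / (e(N) - e(df_m) - 1)   // matches e(r2_a)
display sqrt(e(rss) / e(df_r))                                // matches e(rmse)
```

Here e(df_m) plays the role of k, the number of predictors.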
PARAMETER ESTIMATES
science: This column shows the dependent variable at the top (science) with the predictor variables below it (math, female, socst, read and _cons). The last variable (_cons) represents the constant or intercept.
Coef.: These are the values for the regression equation for predicting the dependent variable from the independent variables. The regression equation has the following form:
$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$
The Coef. column provides the estimates of $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$.
science(predicted) = 12.32529 + .3893102 math - 2.009765 female + .0498443 socst + .3352998 read
math: The coefficient is .3893102. So for every unit increase in math, a .389 point increase in science is predicted, holding all other variables constant.
female: Be careful! Since female is coded 0/1 (0=male, 1=female), we interpret the coefficient as follows: the predicted science score is about 2 points lower for a female than for a male, holding all other variables constant.
socst: The coefficient for socst is .0498443. So for every unit increase in socst, we expect an approximately .05 point increase in the science score, holding all other variables constant.
read: The coefficient for read is .3352998. So for every unit increase in read, we expect a .34 point increase in the science score, holding all other variables constant.
t and P>|t|: These columns provide the t-value and 2-tailed p-value used in testing the null hypothesis that the coefficient (parameter) is 0.
Remember: In significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level (alpha), which is often 0.01 or 0.05. When the null hypothesis is rejected, the result is said to be statistically significant.
[95% Conf. Interval]: This shows a 95% confidence interval for the coefficient.
Remember: The coefficient will not be statistically significant at the 5% level if the 95% confidence interval includes 0.
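The t-value, p-value and confidence interval for one coefficient can be rebuilt from _b[] and _se[]. As an assumption for runnability, the sketch uses the auto dataset that ships with Stata instead of hsb2; the scalar names are hypothetical.

```stata
* Sketch: t, two-tailed p-value and 95% CI for the mpg coefficient
sysuse auto, clear
regress price mpg weight
scalar t_mpg = _b[mpg] / _se[mpg]                // coefficient / standard error
scalar p_mpg = 2 * ttail(e(df_r), abs(t_mpg))    // two-tailed p-value
scalar tcrit = invttail(e(df_r), 0.025)          // critical value for a 95% CI
display t_mpg
display p_mpg
display _b[mpg] - tcrit * _se[mpg]               // lower bound
display _b[mpg] + tcrit * _se[mpg]               // upper bound
```

The displayed values should agree with the t, P>|t| and [95% Conf. Interval] columns of the regress output for mpg.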
PREDICTED VALUES

After the regression, type
predict yhat
This command creates a new variable yhat with the predicted values for the dependent variable (science).
Next ... Regression diagnostics is beyond the scope of our short intro course, but ... Some issues:
heteroskedasticity (when disturbances do not all have the same variance),
autocorrelation (when disturbances are correlated with one another),
multicollinearity (two or more independent variables are approximately linearly related in the sample data)
DIAGNOSTICS
Let's check homoscedasticity of residuals
predict r, residuals
One of the main assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals. If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the variance of the residuals is non-constant, then the residual variance is said to be heteroscedastic.
There are graphical and non-graphical methods for detecting heteroscedasticity. A commonly used graphical method is to plot the residuals versus fitted (predicted) values. We do this by issuing the rvfplot command. yline(0) puts a reference line at y=0.
rvfplot, yline(0)
estat imtest (White test)

estat hettest (Breusch-Pagan test)
White and Breusch-Pagan tests test the null hypothesis that the variance of the residuals is homogeneous. If the p-value is very small, we would have to reject the hypothesis and accept the alternative hypothesis that the variance is not homogeneous. In this case, the evidence is against the null hypothesis that the variance is homogeneous. These tests are very sensitive to model assumptions, such as the assumption of normality. So, it is a common practice to combine the tests with diagnostic plots to make a judgment on the severity of the heteroscedasticity and to decide if any correction is needed for heteroscedasticity.
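The diagnostic steps from the last few slides can be run in one sequence. As an assumption for runnability, the sketch uses the auto dataset that ships with Stata instead of hsb2. The white option of estat imtest, which adds White's original test to the output, is an addition to the slides.

```stata
* Sketch: predicted values, residuals, residual plot and both tests
sysuse auto, clear
regress price mpg weight
predict yhat               // fitted values
predict r, residuals       // residuals
rvfplot, yline(0)          // residuals vs fitted, reference line at 0
estat hettest              // Breusch-Pagan test
estat imtest, white        // information matrix test plus White's test
```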
LINEAR REGRESSION WITH PANEL DATA
Declaring the data to be a panel: for example, where the data consist of many firms, each observed over 5 years
iis Firm ; tis Year ;
xt is the prefix for the commands in this class; xtreg should be used for regressions with panel data
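iis and tis are older commands; in Stata 10 and later the panel is declared with xtset. A minimal sketch with a small made-up panel of three firms over four years; all values are hypothetical, and the variable names lnc and lny are borrowed from the xtreg examples on the following slides.

```stata
* Sketch: declare a panel with xtset and run FE and RE regressions
clear
input firm year lnc lny
1 2001 2.10 3.00
1 2002 2.25 3.20
1 2003 2.31 3.35
1 2004 2.50 3.60
2 2001 1.80 2.70
2 2002 1.95 2.90
2 2003 2.02 3.00
2 2004 2.20 3.25
3 2001 2.60 3.50
3 2002 2.68 3.55
3 2003 2.85 3.80
3 2004 2.95 3.95
end
xtset firm year        // panel variable first, then time variable
xtreg lnc lny, fe      // fixed effects
xtreg lnc lny, re      // random effects
```

With only three firms the estimates are for illustration only.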
FIXED EFFECTS:
$y_{it} = a + x_{it} b + v_i + e_{it}$
e.g. xtreg lnc lny, fe

Equivalent to including a dummy variable for each case (i.e. firm). But not really!
RANDOM EFFECTS (RE)

If you think some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, then you can include both types by using random effects. Stata's RE estimator is a weighted average of fixed and between effects
e.g. xtreg lnc lny, re
CHOOSING BETWEEN FIXED AND RANDOM EFFECTS

Run a Hausman test: estimate the FE model, save the coefficients, estimate the RE model, and then do the comparison. Example:
xtreg dependentvar var1 var2 var3 ... , fe
estimates store fixed
xtreg dependentvar var1 var2 var3 ... , re
estimates store random
hausman fixed random
If the p-value is significant, use FE
Source: http://dss.princeton.edu/online_help/analysis/panel.htm
TIME SERIES DATA

tsset declares data to be time-series data
Examples:
tsset time, yearly (For an annual time series, time takes on values such as 1990, 1991, ...)
tsset company year, yearly (For yearly panel data, variable company being the panel ID variable and year being a four-digit calendar year)
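Once the data are declared with tsset, the time-series operators L. (lag) and D. (difference) become available as an alternative to the [_n-1] notation. A minimal sketch with a made-up annual series; the variable gdp and its values are hypothetical.

```stata
* Sketch: lag and first difference with time-series operators
clear
input time gdp
1990 100
1991 104
1992 109
1993 115
end
tsset time, yearly
generate gdp_lag = L.gdp      // previous year's value
generate gdp_diff = D.gdp     // change from the previous year
list
```

Unlike [_n-1], the operators respect gaps in the time variable: if a year is missing, the lag is set to missing rather than taken from the wrong year.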
QUESTIONS
Ensar Sehic, PhD, Assistant Professor Academic Unit for Quantitative Economics University of Sarajevo, School of Economics and Business Trg Oslobodjenja 1, office #69, 71000 Sarajevo, B&H Tel: +387 33 253 767 Mob: +387 62 225 123 Email: ensar.sehic@efsa.unsa.ba Skype: ensar.sehic