
Introduction to Stata




Contents
SECTION 1: INTRODUCTION TO STATA ................................................................................ 2
SECTION 2: EXPLORING DATA FILES .................................................................................... 5
SECTION 3: STORING COMMANDS AND OUTPUT ............................................................ 18
SECTION 4: CREATING NEW VARIABLES ........................................................................... 23
SECTION 5: MODIFYING VARIABLES .................................................................................. 31
SECTION 6: ADVANCED DESCRIPTIVE STATISTICS ........................................................ 35
SECTION 7: PRESENTING DATA WITH GRAPH (GRAPHING DATA) .............................. 40
SECTION 8: NORMALITY AND OUTLIER ............................................................................. 42
SECTION 9: STATISTICAL TESTS .......................................................................................... 53
SECTION 10: LINEAR REGRESSION ...................................................................................... 56
SECTION 11: LOGISTIC REGRESSION ................................................................................... 67
SECTION 12: PANEL DATA ANALYSIS ................................................................................ 72
SECTION 13: DATA MANAGEMENT..................................................................................... 74
SECTION 14: ADVANCED PROGRAMMING ......................................................................... 80
SECTION 15: TROUBLESHOOTING AND UPDATE ............................................................. 82














SECTION 1: INTRODUCTION TO STATA

Stata is a package that offers a good combination of ease of learning and power. It has numerous
powerful yet simple commands for data management, which allow users to perform complex
manipulations with ease. Under Stata/SE, a data file can contain up to 32,768 variables, and up to
11,000 variables can be used in any estimation command.

Stata performs most general statistical analyses (regression, logistic regression, ANOVA, factor
analysis, and some multivariate analysis). The greatest strengths of Stata are probably in
regression and logistic regression. Stata also has a very nice array of robust methods that are easy
to use, including robust regression and regression with robust standard errors; many other
estimation commands offer robust standard errors as well.

Stata can easily download programs developed by other users, and you can create your own
Stata programs that seamlessly become part of Stata. Many cutting-edge statistical procedures
written by other users can be found and incorporated into your own Stata work. Stata uses
one-line commands, which can be entered one at a time in the Command window or many at a
time in a Stata program.

When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5
windows (the number of windows open depends on which windows were open the last time Stata
was used). Each is described briefly below.

The Stata Interface

1. Windows

The Stata windows give you all the key information about the data file you are using, recent
commands, and the results of those commands. Some of them open automatically when you start
Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool
bar.

These are the Stata windows:
Stata Results To see recent commands and output
Stata Command To enter a command
Stata Browser To view the data file (needs to be opened)
Stata Editor To edit the data file (needs to be opened)
Stata Viewer To get help on how to use Stata
Variables To see a list of variables
Review To see recent commands
Stata Do-file Editor To write or edit a program (needs to be opened)





The Command window on the bottom right is where you'll enter commands. When you press
ENTER, they are pasted into the Stata Results window above, which is where you will see your
commands executed and view the results. You can also use recent commands again by using the
PageUp key (to go to the previous command) and Page Down key (to go to the next command).

The Results window (with the black background) shows all recent commands, output, error
messages, and help. The text is color-coded as follows:
Green General information and the frame and headings of output tables
Blue Commands or error messages that can be clicked on for more information
White Stata commands
Yellow Numbers in output tables
Red Error messages

The scroll bar on the right side can be used to look at earlier results that are no longer on the
screen. However, unlike SPSS, the Stata Results window does not keep all the output generated.
It keeps roughly 300-600 lines of the most recent output, deleting earlier output. If you want to
store output in a file, you must use the log command (more on this later).

Stata Browser This window shows all the data in memory. The Stata Browser does not appear
automatically when you start Stata. The only way to open the Browser is to click on the button
with a table and magnifying glass. Unlike SPSS, when the Stata Browser is open, you cannot
execute any commands, either from the Stata Command window or from the Do-file Editor. In
addition, you cannot change any of the data. You can, however, sort the data or hide certain
variables using buttons at the top of the Stata Browser window.

Stata Editor This window is exactly like the Stata Browser window except that you can change
the data. We do not recommend using this window because you will have no record of the
changes you make in the data. It is better to correct errors in the data using a Do-file program
that can be saved.

Stata Viewer This window provides help on Stata commands and rules. To open the Stata
Viewer window, you can click on Windows/Viewer or click on the eye button on the tool bar. To
use the Stata Viewer window, type a command in the space at the top and the Viewer will give
you the purpose and rules for using that command, along with some examples. Any blue text in
the Viewer can be clicked on for more information about that command.

Variables This window (tall with a white background) lists all the variables that exist in
memory. When you open a Stata data file, it lists the variables in the file. If you create new
variables, they will be added to the list of variables. If you delete variables, they will be removed
from the list. You can insert a variable into the Stata Command window by clicking on it in the
Variables window.

Do-file Editor This window allows you to write, edit, save, and execute a Stata program. A Stata
program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the
Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and
rerun a set of commands. Exploratory analysis of the data can be done with the Stata Command window,
but any serious data analysis should be carried out using the Do-file Editor, not the Stata
Command window. The Do-file Editor can be opened by clicking on Windows/Do-file Editor or by
clicking on the envelope button. With so many windows, it is sometimes difficult to fit them all on the
screen. You can adjust the size and position of each window the way you like it and then save the layout
by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be
arranged according to your preferred layout.

On the right are two convenient windows. The Variables window keeps a list of your current
variables. If you click on one of them, its name will be pasted into the current command at the
location of the cursor, which saves a little typing. The Review window keeps a list of all the
commands you've typed in the Stata session. Click on one, and it will be pasted into the
command window, which is handy for fixing typos. Double-click, and the command will be
pasted and re-executed. You can also export everything in the Review window into a .do file
(more on them later) so you can run the exact same commands at any time. To do this right-click
the Review window.

When you first open Stata, all these windows are blank except for the Stata Results window. You
can resize these windows independently, and you can resize the outer window as well. To save
your window size changes, click on Prefs, then Save Windowing Preferences.

Entering commands in Stata works pretty much as you would expect. BACKSPACE deletes the
character to the left of the cursor, DELETE the character to the right, the arrow keys move the
cursor around, and if you type, the text is inserted at the current location of the cursor. The up
arrow does not retrieve previous commands, but you can do that by pressing PAGE UP, or
CTRL-R, or by using the Review window.






2. Menus

Stata displays 9 drop-down menus across the top of the outer window, from left to right:
File
Open open a Stata data file (use)
Save/Save as save the Stata data in memory to disk
Do execute a do-file
Filename copy a filename to the command line
Print print log or graph
Exit quit Stata
Edit
Copy/Paste copy text among the Command, Results, and Log windows
Copy Table copy table from Results window to another file
Table copy options what to do with table lines in Copy Table
Prefs Various options for setting preferences. For example, you can save
a particular layout of the different Stata windows or change the
colors used in Stata windows.
Data
Graphics
Statistics build and run Stata commands from menus
User menus for user-supplied Stata commands (download from Internet)
Window bring a Stata window to the front
Help Stata command syntax and keyword searches

3. Button bar

The buttons on the button bar are from left to right (equivalent command is in bold):
Open a Stata data file: use
Save the Stata data in memory to disk: save
Print a log or graph
Open a log, or suspend/close an open log: log
Open a new viewer
Bring Results window to front
Bring Graph window to front
New Do-file Editor: doedit
Edit the data in memory: edit
Browse the data in memory: browse
Scroll another page when --more-- is displayed: Space Bar
Stop current command or do-file: Ctrl-Break

SECTION 2: EXPLORING DATA FILES

2.1. Common Stata Syntax
This section covers commands that are used for preliminary exploration of data in a file. Stata
commands follow the same syntax:

[by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]

Items inside square brackets are optional and not available for every command. This
syntax applies to all Stata commands. To use the by prefix, the dataset must first be sorted on
the by variable(s); it repeats a Stata command on subsets of the data.
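
For instance, a sketch that exercises most pieces of this syntax on the ERHS consumption file used in the examples later in this section (remember to sort before using by):

. sort q1a
. by q1a: summarize rconspc if sexh==0, detail

This repeats a detailed summary of real per capita consumption, restricted to female-headed households, once for each region.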

Logical operators used in Stata

~ Not
== Equal
~= not equal
!= not equal
> greater than
>= greater than or equal
< less than
<= less than or equal
& And
| Or

Note that == represents IS EQUAL TO.

Stata allows four kinds of weights in most commands (please refer to stata manual for further
information)

1. fweight, or frequency weights, are weights that indicate the number of duplicated
observations. They are used when your data set has been collapsed and contains a variable that
tells how many times each record occurred.

2. pweight, or sampling weights, are weights that denote the inverse of the probability that the
observation is included because of the sampling design. pweights are the correct choice for
sample survey data. The pweight option causes Stata to use the sampling weight as the number of
subjects in the population that each observation represents when computing estimates such as
proportions, means, and regression parameters. A robust variance estimation technique will
automatically be used to adjust for the design characteristics so that variances, standard errors,
and confidence intervals are correct.

3. aweight, or analytic weights, are weights that are inversely proportional to the variance of an
observation; i.e., the variance of the j-th observation is assumed to be sigma^2/w_j, where w_j
are the weights. Typically, the observations represent averages and the weights are the number
of elements that gave rise to the average. For most Stata commands, the recorded scale of
aweights is irrelevant; Stata internally rescales them to sum to N, the number of observations in
your data, when it uses them.

Analytic weights are used when you want to compute a linear regression on data that are
observed means. Do not use aweights to specify sampling weights. The formulas that use
aweights assume that larger weights designate more accurately measured observations, whereas
one observation from a sample survey is no more accurately measured than any other
observation. Hence, using aweight to specify sampling weights will cause Stata to estimate
incorrect values of the variances and standard errors of estimates, and incorrect p-values
for hypothesis tests.

4. iweight, or importance weights, are weights that indicate the "importance" of the observation
in some vague sense. iweights have no formal statistical definition; any command that supports
iweights will define exactly how they are treated. In most cases, they are intended for use by
programmers who need to implement their own analytical techniques by using some of the
available estimation commands. Special care should be taken when using importance weights to
understand how they are used in the formulas for estimates and variance. This information is
available in the Methods and Formulas section in the Stata manual for each estimation command.
In general, these formulas will be incorrect for computing the variance for data from a sample
survey.
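
As a rough sketch of how the weight types appear in commands (the weight variables freqwt, sampwt, and cellsize are placeholders, not variables in the ERHS file):

. summarize hhsize [fweight = freqwt]       frequency-weighted summary statistics
. mean rconspc [pweight = sampwt]           survey-weighted mean with robust standard errors
. regress cons hhsize [aweight = cellsize]  regression on data that are group means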

2.2 Examining dataset

clear
The clear command deletes all data, variables, and labels from memory to get ready to use a
new data file. You can clear memory using the clear command or by using the clear option
as part of the use command (see the use command). This command does not delete any data
saved to the hard drive.

set memory
First you can check to see how much memory is allocated to hold your data using the memory
command. For instance, we are now running StataSE 9 under Windows, and this is what the
memory command told us.
Figure 2: Working memory space
. memory
                                              bytes
--------------------------------------------------------------------
Details of set memory usage
    overhead (pointers)                       5,808        0.06%
    data                                    107,448        1.02%
                                        ----------------------------
    data + overhead                         113,256        1.08%
    free                                 10,372,496       98.92%
                                        ----------------------------
    Total allocated                      10,485,752      100.00%
--------------------------------------------------------------------
Other memory usage
    set maxvar usage                      1,816,666
    set matsize usage                     1,315,200
    programs, saved results, etc.             3,338
                                        ---------------
    Total                                 3,135,204
--------------------------------------------------------------------
Grand total                              13,620,956


We have about 10 MB free for reading in a data file. Whenever we try to read a data file bigger
than this free space, we will get an error message that reads:

no room to add more observations
r(901);

In this case, we have to allocate more memory, say 25 MB (if 25 MB is sufficient for the current
file), with the set memory command before trying to use the file.

set memory 25m

Figure 3: Current memory allocation after set memory 25m command
Current memory allocation

current memory usage
settable value description (1M = 1024k)
--------------------------------------------------------------------
set maxvar 5000 max. variables allowed 1.733M
set memory 25M max. data space 25.000M
set matsize 400 max. RHS vars in models 1.254M
-----------
27.987M

Now that we have allocated enough memory, we will be able to read bigger files, provided they
fit within the specified memory space. After setting the memory to 25m, the memory command
reports:

Figure 4: Adjusted working memory space
. memory
                                              bytes
--------------------------------------------------------------------
Details of set memory usage
    overhead (pointers)                       5,808        0.02%
    data                                    107,448        0.41%
                                        ----------------------------
    data + overhead                         113,256        0.43%
    free                                 26,101,136       99.57%
                                        ----------------------------
    Total allocated                      26,214,392      100.00%
--------------------------------------------------------------------
Other memory usage
    set maxvar usage                      1,816,666
    set matsize usage                     1,315,200
    programs, saved results, etc.             1,778
                                        ---------------
    Total                                 3,133,644
--------------------------------------------------------------------
Grand total                              29,348,036

If we want to allocate 250m (250 megabytes) every time we start Stata, we can type:

. set memory 250m, permanently

And then Stata will allocate this amount of memory every time we start Stata.

use

This command opens an existing Stata data file. The syntax is:

use filename [, clear ] opens new file
use [varlist] [if exp] [in range] using filename [, clear ] opens selected parts of file

If there is no extension, Stata assumes it is .dta.
If there is no path, Stata assumes it is in the current folder.
You can use a path name such as: use C:\...\ERHScons1999
If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999".
You can open selected variables of a file using a variable list.
You can open selected records of a file using if or in.

Here are some examples of the use command:
use ERHScons1999 opens the file ERHScons1999.dta for analysis.
use ERHScons1999 if q1a ==1 opens data from region 1
use ERHScons1999 in 5/25 opens records 5 through 25 of file
use hhid hhsize cons using ERHScons1999 opens 3 variables from ERHScons1999 file
use C:\training\ ERHScons1999 opens the file ERHScons1999.dta in the specified
folder
use "C:\data files\ERHScons1999" use quotation marks if there are spaces
use ERHScons1999, clear clears memory before opening the new file

When loading a data set in a Do-file program, it is common to combine the use command with the
clear option. For instance, here we load the ERHScons1999 data set. The clear option then allows
Stata to clear the memory of the previous data set in order to load the new one.

. use C:\...\ERHScons1999.dta, clear

Stata does not want you to lose the changes that you have made to the data in memory. If you
really want to discard those changes, the clear option specifies that it is okay to replace the
data in memory, even though the current data have not been saved to disk.

save
The save command will save the dataset as a .dta file under the name you choose. Editing the
dataset changes the data in the computer's memory; it does not change the data stored on the
computer's disk.

. save C:\...\consumption.dta, replace

The replace option allows you to save a changed file to the disk, replacing the original file. Stata
is worried that you will accidentally overwrite your data file. You need to use the replace option
to tell Stata that you know that the file exists and you want to replace it.

edit
This command opens the Data Editor window, which allows you to view all observations in
memory. You can change the data using the Data Editor window, but we do not recommend doing
so because you will have no record of the changes you make in the data. It is better to correct
errors in the data using a Do-file program that can be saved (we will see Do-file programs later).

browse
This window is exactly like the Stata Editor window except that you cannot change the data.

Note: Unlike SPSS, when the Stata Editor or Browser is open, you cannot execute any
commands, either from the Stata Command window or from the Do-file Editor. In addition, you
also cannot change any of the data. You can, however, sort the data or hide certain variables
using buttons at the top of the Stata Browser window.

describe

This command provides a brief description of the data file. You can use des or d and Stata
will understand. The output includes:
the number of variables
the number of observations (records)
the size of the file
the list of variables and their characteristics

Example 1: Using describe to show information about a data file
. des

Contains data from C:\training\ERHSCONS1999.dta
  obs:         1,452
 vars:            15                          24 Feb 2007 07:07
 size:       113,256 (98.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------------
q1a             float   %9.0g      reg        Region
q1b             double  %15.0g     w          Wereda
q1c             double  %17.0g     pa         Peasant association
q1d             double  %12.0g                Household id
sexh            byte    %8.0g      sexhh      Sex of household head
ageh            float   %9.0g      p1s1q4     Age of household head
cons            float   %9.0g                 consumption per month
food            float   %9.0g                 food cons per month
hhsize          byte    %8.0g                 household size
aeu             float   %9.0g                 adult equivalent units in household
fpi             float   %9.0g                 food price index
rconspc         float   %9.0g                 real consumption per capita 1994 prices
rconsae         float   %9.0g                 real consumption per adult 1994 prices
poor            double  %8.2f
hhid            double  %12.0f                selected household unique id
-------------------------------------------------------------------------------------
Sorted by: hhid


It also provides the following information on each variable in the data file:
the variable name
the storage type: byte is used for binary variables, int is used for integers, and float is used
for continuous variables that may have decimals. To see the limits on each storage type,
type help datatypes.
the display type indicates how it will appear in the output.
the value label is the name of a set of labels for different values
the variable label is a name for the variable that is used in output.

list
This command lists values of variables in data set. The syntax is:

list [varlist] [if exp] [in range]

With varlist, you can specify which variables' values will be presented. If varlist is not
specified, all variables will be listed. With if and in, you can specify which records will be
listed. Here are some examples:
. list lists entire dataset
. list in 1/10 lists observations 1 through 10
. list hhsize q1a food lists selected variables
. list hhsize sex in 1/20 lists observations 1-20 for selected variables
. list if q1a <6 lists cases where region is 1 through 5


if
This command is used to select certain records in carrying out a command. This is similar to the
process if command in SPSS, except that in Stata it is not considered a separate command. The
syntax is:

command if exp

Examples include:

. list hhid q1a food if food>12000 lists data if food is above 12000
. tab q1a if cons>10000 & cons<20000 frequency table of region if consumption is in range
. summarize food if q1a==3 | q1a==4 statistics on food consumption for regions 3 and 4
. browse hhid q1a food if food>=12000 browse data if food consumption is at least 12000

Note that if statements always use ==, not a single =. Also note that | indicates "or" while &
indicates "and".

in
We have also used in to select records based on the case number. The syntax is:

command in exp
For example:
. list in 10 list observation number 10
. summarize in 10/20 summarize observations 10-20
Example 2: Using list to look at data
. list hhid q1a q1b q1c q1d hhsize rconspc in 10/25

     +--------------------------------------------------------------------+
     |         hhid      q1a     q1b       q1c   q1d   hhsize    rconspc |
     |--------------------------------------------------------------------|
 10. | 101010000010   Tigray   Atsbi   Haresaw    10        4   134.5961 |
 11. | 101010000011   Tigray   Atsbi   Haresaw    11        3   168.9437 |
 12. | 101010000012   Tigray   Atsbi   Haresaw    12        3   135.1815 |
 13. | 101010000013   Tigray   Atsbi   Haresaw    13        7   102.3454 |
 14. | 101010000014   Tigray   Atsbi   Haresaw    14        9   68.04964 |
     |--------------------------------------------------------------------|
 15. | 101010000015   Tigray   Atsbi   Haresaw    15       12   49.61188 |
 16. | 101010000016   Tigray   Atsbi   Haresaw    16        4   85.05015 |
 17. | 101010000017   Tigray   Atsbi   Haresaw    17        5   84.72104 |
 18. | 101010000018   Tigray   Atsbi   Haresaw    18        2   95.42028 |
 19. | 101010000019   Tigray   Atsbi   Haresaw    19       10   140.7843 |
     |--------------------------------------------------------------------|
 20. | 101010000020   Tigray   Atsbi   Haresaw    20        3   80.58356 |
 21. | 101010000021   Tigray   Atsbi   Haresaw    21        3   95.98959 |
 22. | 101010000022   Tigray   Atsbi   Haresaw    22        5   68.05075 |
 23. | 101010000023   Tigray   Atsbi   Haresaw    23        4    52.4964 |
 24. | 101010000024   Tigray   Atsbi   Haresaw    24        3   91.86269 |
     |--------------------------------------------------------------------|
 25. | 101010000025   Tigray   Atsbi   Haresaw    25        5   149.1702 |
     +--------------------------------------------------------------------+

. list q1a cons aeu poor in 200/215

      +------------------------------------+
      |    q1a       cons     aeu    poor |
      |------------------------------------|
 200. | Amhara   661.3979    1.82    0.00 |
 201. | Amhara   321.7693    8.14    1.00 |
 202. | Amhara    169.784     2.3    0.00 |
 203. | Amhara   907.9995    3.14    0.00 |
 204. | Amhara   232.6273   4.148    1.00 |
      |------------------------------------|
 205. | Amhara   432.4525    6.86    1.00 |
 206. | Amhara      59.53    1.46    1.00 |
 207. | Amhara     228.22     3.4    0.00 |
 208. | Amhara   1298.875    5.44    0.00 |
 209. | Amhara    144.494    3.48    1.00 |
      |------------------------------------|
 210. | Amhara    266.974    4.28    0.00 |
 211. | Amhara   43.97179     .74    1.00 |
 212. | Amhara   216.0467   3.408    1.00 |
 213. | Amhara   492.4958    2.94    0.00 |
 214. | Amhara   437.7144    2.46    0.00 |
      |------------------------------------|
 215. | Amhara    166.354    1.74    0.00 |
      +------------------------------------+

If you are not careful with list, you will get a lot more output than you want. If Stata starts
giving you more output than you really want, use the stop button (red button with an X).

codebook
The codebook command is a great tool for getting a quick overview of the variables in the data
file. It produces a kind of electronic codebook from the data file, displaying information about
variables' names, labels and values.




















Example 3: Using codebook to look at data
. codebook sexh

sexh                                                     Sex of household head
--------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexhh

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/1452

            tabulation:  Freq.   Numeric  Label
                           400         0  Female
                          1052         1  Male

. codebook rconspc

rconspc                                  real consumption per capita 1994 prices
--------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [4.2201104,1018.2954]        units:  1.000e-07
         unique values:  1448                     missing .:  3/1452

                  mean:   90.3674
              std. dev:   81.9962

           percentiles:        10%       25%       50%       75%       90%
                           25.1043   39.9402   65.9926   114.253   180.891


inspect
It is another useful command for getting a quick overview of a data file. inspect command
displays information about the values of variables and is useful for checking data accuracy.

Example 4: Using inspect to look at data
. inspect sexh

sexh:  Sex of household head                 Number of Observations
-----------------------------                                       Non-
                                            Total   Integers   Integers
|  #                       Negative             -          -          -
|  #                       Zero               400        400          -
|  #                       Positive          1052       1052          -
|  #                                       ------     ------     ------
|  #      #                Total             1452       1452          -
|  #      #                Missing              -
+----------------------------
 0                      1                    1452
   (2 unique values)

sexh is labeled and all values are documented in the label.





count
The count command shows the number of observations satisfying the if condition. If no
condition is specified, count displays the number of observations in the data.

. count
  1452

. count if q1a==3
  466


2.3. Preliminary Descriptive Statistics

tabulate, tab1, tab2
These are three related commands that produce frequency tables for discrete variables. They can
produce one-way frequency tables (tables with the frequency of one variable) or two-way
frequency tables (tables with a row variable and a column variable). These commands are similar
to the frequency and crosstab commands in SPSS. How do they differ?

tabulate or tab produce a frequency table for one or two variables
tab1 produces a one-way frequency table for each variable in the
variable list
tab2 produces all possible two-variable tables from the list of variables

You can use several options with these commands:
all gives all the tests of association for two-way tables
cell gives the overall percentage for two-way tables
column gives column percentages for two-way tables
row gives row percentages for two-way tables
nofreq suppresses printing the frequencies.
chi2 provides the chi squared test for two-way tables

There are many other options, including other statistical tests. For more information, type help
tabulate

Some examples of the tabulate commands are:
. tabulate q1a produces table of frequency by region
. tabulate q1a sexh produces a cross-tab of frequencies by region and sex of head
. tabulate q1a hhsize, row produces a cross-tab by region and hhsize with row percentages
. tabulate sexh hhsize, cell nofreq produces a cross-tab of overall percent by sex and hhsize.
. tab1 q1a q1b hhsize produces three tables, a frequency table for each variable
. tab2 q1a poor sexh produces three tables, a cross-tab of each pair of variables
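
As a sketch combining some of these options on variables from the file used in the examples below:

. tab q1a sexh, row chi2 cross-tab of region by sex of head with row percentages and a chi-squared test
. tab sexh poor, col nofreq column percentages only, with frequencies suppressed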










Example 5: Using tabulate on categorical variables
. tab q1b

         Wereda |      Freq.     Percent        Cum.
----------------+-----------------------------------
          Atsbi |         84        5.79        5.79
   Sebhassahsie |         66        4.55       10.33
        Ankober |         86        5.92       16.25
Basso na Worana |        175       12.05       28.31
        Enemayi |         61        4.20       32.51
         Bugena |        144        9.92       42.42
           Adaa |         95        6.54       48.97
          Kersa |         95        6.54       55.51
         Dodota |        109        7.51       63.02
     Shashemene |         97        6.68       69.70
          Cheha |         65        4.48       74.17
  Kedida Gamela |         74        5.10       79.27
           Bule |        134        9.23       88.50
         Boloso |         96        6.61       95.11
       Daramalo |         71        4.89      100.00
----------------+-----------------------------------
          Total |      1,452      100.00

. tab q1b sexh

                | Sex of household head
         Wereda |    Female       Male |     Total
----------------+----------------------+----------
          Atsbi |        48         36 |        84
   Sebhassahsie |        29         37 |        66
        Ankober |        13         73 |        86
Basso na Worana |        52        123 |       175
        Enemayi |        11         50 |        61
         Bugena |        55         89 |       144
           Adaa |        23         72 |        95
          Kersa |        31         64 |        95
         Dodota |        26         83 |       109
     Shashemene |        26         71 |        97
          Cheha |        22         43 |        65
  Kedida Gamela |        15         59 |        74
           Bule |        11        123 |       134
         Boloso |        25         71 |        96
       Daramalo |        13         58 |        71
----------------+----------------------+----------
          Total |       400      1,052 |     1,452


In one-way tables, Stata gives the count, the percentage, and the cumulative percentage
(see first example in box).
In two-way tables, Stata gives the count only, unless you ask for other statistics (see
second example in box)
col, row, and cell request Stata to include percentages in two-way tables

summarize
The summarize command produces statistics on continuous variables like age, food, cons, and hhsize.
The syntax looks like this:

summarize [varlist] [if exp] [in range] [, [detail]]

By default, it produces the following statistics:
Number of observations
Average (or mean)
Standard deviation
Minimum
Maximum
If you specify detail Stata gives you additional statistics, such as
skewness,
kurtosis,
the four smallest values
the four largest values
various percentiles.

Here are some examples:
. summarize gives statistics on all variables
. summarize hhsize food gives statistics on selected variables
. summarize hhsize cons if q1a==3 gives statistics on two variables for one region
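
To see the additional statistics, add the detail option; for example:

. summarize rconspc, detail gives detailed statistics (percentiles, skewness, kurtosis) on real per capita consumption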

Example 6: Using summarize to study continuous variables
. sum rconspc rconsae hhsize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |      1449    90.36742    81.99623    4.22011   1018.295
     rconsae |      1449    108.7874    97.27053   4.811201   1212.256
      hhsize |      1452    5.782369    2.740968          1         17

. sum rconspc rconsae hhsize if q1a==4

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |       395    111.6185    99.09839   8.393298   1018.295
     rconsae |       395    132.6018    116.6133   9.608795   1212.256
      hhsize |       396    6.209596    2.853203          1         16

The first example gives the statistics for the whole sample, while the second gives the statistics
only for households in Region 4.

by
This prefix goes before a command and asks Stata to repeat the command for each value of a
variable. The general syntax is:

by varlist: command

Note: the bysort command is commonly used instead; it sorts the data and applies the by prefix in one step.




Some examples of the by prefix are:

bysort sexh: sum rconspc for each sex of household head, give stats on real per capita
consumption.


Example 7: Using the by prefix
-> sexh = Female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |       398    100.2183    89.18895   7.068164   624.1437

--------------------------------------------------------------------------
-> sexh = Male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |      1051    86.63701    78.82594    4.22011   1018.295

help

The help command gives you information about any Stata command or topic

help [command]

For example,
. help tabulate gives a description of the tabulate command
. help summarize gives a description of the summarize command


SECTION 3: STORING COMMANDS AND OUTPUT

In this section, we discuss how to store commands and output for later use. First, we describe
how to store commands using a program (Stata calls it a Do-file), how to edit the program, and
how to run it. Second, we present different ways of saving and using the output generated by
Stata. The following topics are covered:

Using the Do-file Editor
log using
log off
log on
log close
set logtype, and moving tables from Stata to Word and Excel

Using the Do-file Editor

The Do-file Editor allows you to store a program (a set of commands) so that you can edit it and
execute it later. Why use the Do-file Editor?
It makes it easier to check and fix errors,
It allows you to run the commands later,
It lets you show others how you got your result, and
It allows you to collaborate with others on the analysis.

In general, any time you are running more than 10 commands to get a result, it is easier and safer
to use a Do-file to store the commands.

To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on
the Tool Bar.

Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of
editing functions. The menu bar is similar to the one in Microsoft Word:

File/New to open a new, blank Do-file
File/Open to open an existing Do-file
File/Save to save the current Do-file
File/Save as to save the current Do-file under a new name
File/Insert file to insert another file into the current one
File/Print to print the Do-file
File/Close to close the Do-file
Edit/Undo to undo the last command
Edit/Cut to delete or move the marked text in the Do-file
Edit/Copy to copy the marked text in the Do-file
Edit/Paste to insert the copied or cut text into the Do-file
Search/Find to find a word or phrase in the Do-file
Search/Replace to find and replace a word or phrase in the Do-file
Tools/Do to execute all the commands or the marked commands in the Do-file
Tools/Run to execute all the commands or the marked commands in the Do-file
without showing any output in the Stata Results window

The tool bar buttons can be used to carry out some of these tasks more quickly. For example,
there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy,
Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the second-to-last
one that shows a page with text on it. This is the Do button for executing the program or the
marked part of the program.

Finally, the keyboard commands may be even quicker to use than the buttons. The most useful
keyboard commands are:

Control-O Open file
Control-S Save file
Control-C Copy
Control-X Cut
Control-V Paste
Control-Z Undo
Control-F Find
Control-H Find and Replace

To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or
click on Tools/Do. If you want to run one or just a few commands rather than the whole file,
mark the commands and click on the Do button. You do not have to mark the whole command,
but at least one character in the command must be marked in order for the command to be
executed (unlike SPSS, it is not enough to have the cursor on a command). Although layout is a
matter of personal preference, it may be useful to have the Stata Results window and the other
windows on one side of the screen and the Do-file Editor window on the other. This makes it
easy to switch back and forth. When you arrange the windows the way you like, you can save the
layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your
chosen layout.

Note: If you would like to add a note to a do-file but do not want Stata to execute it, enclose the
note in /* */.

/* This Stata program illustrates how to create a do file */
log using C:\...\eeatraining.log,replace
log close
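
Putting these pieces together, a minimal sketch of a complete do-file might look like the following (the paths and file names are placeholders):

/* explore.do -- exploratory look at the ERHS consumption file */
log using C:\training\explore.log, replace text
use C:\training\ERHScons1999.dta, clear
describe
summarize rconspc hhsize
tab q1a sexh, row
log close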

Saving the Output
As mentioned in earlier section, the Stata Results window does not keep all the output you
generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results
as you add new results. You can increase the amount of memory allocated to the Stata Results
Window. But even this will probably not be enough for a long session with Stata. Thus, we need
to use log to save the output.

There are four ways to control the log operations.
1. You can use the log button on the tool bar. It looks like a scroll.
2. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log
off), and Resume (log on).
3. You can use log commands in the Stata Command window.
4. You can use log commands in the Stata Do-file Editor.

In this section, we describe the commands, which can be used in the Stata Command window or
in a do-file (program).

log using

This command creates a file with a copy of all the commands and output from Stata. The first
time you open a log, you must give a name to the new file to be created. The syntax is:

log using filename [, append replace [ text | smcl ] ]

where filename is that name you give the new file. The options are:

append adds the output to an existing file
replace replaces an existing file with the output
text tells Stata to create the log file in text (ASCII) format
smcl tells Stata to create the log file in SMCL format

Here are some examples:

log using temp22 saves output to a file called temp22
log using temp20, replace saves output to an existing file, temp20, replacing its contents
log using regoutput, append saves output to an existing file, regoutput, adding to its contents
log using "d:\my data\myfile.txt" saves output in the specified file in the specified folder

Several points should be remembered in using this command:

if you use an existing file name but do not say replace or append, Stata will give
an error message that the file already exists
log files in text format can be opened with WordPad, Notepad, the DOS editor, or any
word processor, but the file does not have any formatting
smcl files have formatting (bold, colors, etc.) but can only be opened with Stata
smcl format is the default

log off

This command temporarily turns off the logging of output, so that any subsequent output is not
copied to the log file. This is useful if you want to save some of the output but not all. Log off
only works after a log using command.

log on

This command is used to restart the logging, copying any new output to the log file that was
already defined. log on only works after a log using and a log off command.

log close

This command is used to turn off the logging and save the file. How are log off and log close
different? log off allows you to turn logging back on easily with log on, continuing to use the same
log file. After log close, however, the only way to start logging again is with log using.
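
As a sketch of how the log commands fit together in a single session (the log file name is a placeholder):

. log using session1.log, replace text
. summarize rconspc          output is logged
. log off
. list hhid cons in 1/5      output is not logged
. log on
. tab q1a                    output is logged again, to the same file
. log close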

set logtype text

This command tells Stata to always save the log files in text (ASCII) format. It is the same as
adding the text subcommand to every log using command, but it is easier. If you prefer text
format log files, this is the best way to make sure all the log files are in this format.

set logtype smcl

This command tells Stata to always save log files in SMCL format. It is the same as adding the
smcl subcommand to every log using command.


Exercise 1: Exploring the ERHS

This section includes some questions that you can answer using the r5ERHS files provided on
your computer and the commands described in this section. Remember two tricks to make it
easier to fix your mistakes:

You can use PageUp to retrieve the most recent command.
You can click on a variable in the Variables window to paste it into the Command window.

Summary file The file ERHScons1999 contains summary variables calculated from various
other data files. It is at the household level. Open the file by entering use
C:\training\ERHScons1999.dta, clear in the Command window and pressing Enter. Open
do and log files to save your commands and output. Using the log file, copy and paste some of the
output tables into Excel and Word files.


1. How many variables and how many records are in ERHScons1999?
2. What percentages of households have female heads?
3. Is there a statistically significant difference in the percentage of female-headed
households between poor and non-poor households?
4. What percentage of Amhara households are considered poor?
5. What percentage of households are in the SNNP region?
6. How does the percentage of female-headed households vary by region?
7. What is the average size of a household?
8. What is the average size of a household in the Oromia region?
9. How does household size vary across poverty status? (use the poor variable)

Household members The file p1sec1_rv1 contains information about each member of the
household. It is at the individual level (each record is a person). You can answer the following
questions using this file:

1. What percentage of individuals are female?
2. What percentage of individuals over 45 years old are female?
3. What percentage of individuals under 5 are female?
4. What percentage of women are married?
5. What percentage of women over the age of 18 are married?
6. Does this percentage vary among regions?
7. What is the status of individuals as compared to round 4?
8. What is the reason given for household members who left since round 4?
9. What was the major occupation of household heads?
10. What was the major occupation of household members aged 7 to 15?


Food and cash crops The file p2s1b_rv1 contains information on production of food and cash
crops. The data are at the crop level, meaning that each record represents one crop for one
household. Only crops that are grown by each household are included in the file. The crop codes
and labels are given in variable crop. You can answer the following questions with this file.

1. How many households in the sample grow maize and wheat?
2. Among maize growers, what was the average area with maize?
3. Among maize growers, what was the average amount of maize harvested?
4. Among wheat growers, what was the average amount of wheat harvested?
5. Does the average amount of Maize harvested vary among regions?
6. Does the average amount of Wheat harvested vary among regions?
7. Among farmers with more than 1 hectare of maize, what was the average amount of
maize harvested?
8. What is the average amount harvested for the major cereal crops (teff, barley, wheat, maize,
and sorghum)?
9. Farmers were asked "Was any of the land cultivated under the new extension program?"
What was the average response?
10. Farmers were also asked "Was any of the land cultivated irrigated?" and the percentage of the
land irrigated. Explore these variables.


SECTION 4: CREATING NEW VARIABLES

In the previous sections, we described how to explore the data using existing variables. In this
section, we discuss how to create new variables. When new variables are created, they are in
memory and they will appear in the Data Browser, but they will not be saved on the hard-disk
unless you use the save command.

In this section, we will cover the following commands and options.
generate
replace
tab , generate
operators
functions
recode
xtile
generate
This command is used to create a new variable. It is similar to compute in SPSS. The syntax is:
generate newvar =exp [if exp]
where exp is an expression like price*quant or 1000*kg. Several points about this command:
Unlike compute in SPSS, generate cannot be used to change the definition of an
existing variable. If you want to change an existing variable, you need to use replace,
You can use gen or g as an abbreviation for generate
If the expression is an equality or inequality, the variable will take the values 0 if the
expression is false and 1 if it is true
If you use if, the new variable will have missing values when the if statement is false

For example,
generate age2 =age*age create age squared variable
gen yield =outputkg/area if area>0 create new yield variable if area is positive
gen price =value/quant if quant>0 create new price variable if quant is positive
gen highprice =(price>1000) creates a dummy variable equal to 1 for high prices



replace
This command is used to change the definition of an existing variable. The syntax is the same:
replace oldvar =exp [if exp] [in exp]
Some points to remember:
Replace cannot be used to create a new variable. Stata will give an error message if the variable
does not exist.
There is no abbreviation for replace. Stata wants to make sure you really want to change the
variable.
If you use the if option, then the old values will be retained when the if statement is false
You can use the period (.) to represent missing values

For example,
replace price =avgprice if price >100000 replaces high values with an average price
replace income =. if income<=0 replace negative income with missing value
replace age =25 in 1007 replace age=25 in observation #1007

tabulate generate
This command is useful for creating a set of dummy variables (variables with a value of 0 or 1)
depending on the value of an existing categorical variable. The syntax is:
tabulate oldvariable, generate(newvariable)
The old variable is a categorical (or discrete) variable. The new variables will take the form newvariable1,
newvariable2, newvariable3, etc. newvariableK will equal 1 if oldvariable takes its K-th value and 0 otherwise.
It is easier to explain with an example. The variable q1a (region) takes the values 1, 3, 4, 7, 8,
and 9 for the different regions of Ethiopia. We can create six dummy variables as follows:
tab q1a, gen(region)
This creates 6 new variables:
region1 = 1 if q1a==1 and 0 otherwise
region2 = 1 if q1a==3 and 0 otherwise
…
region6 = 1 if q1a==9 and 0 otherwise

In Example 8, notice that there are 396 households in region 4 (Oromia) and the same number of
households for which region3 = 1.








Example 8: Using tab, gen to create dummy variables
. tab q1a, gen(region)

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
          7 |        139        9.57       79.27
          8 |        134        9.23       88.50
          9 |        167       11.50      100.00
------------+-----------------------------------
      Total |      1,452      100.00

. tab region3

q1a==Oromia |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,056       72.73       72.73
          1 |        396       27.27      100.00
------------+-----------------------------------
      Total |      1,452      100.00


egen
This is an extended version of generate (extended generate), used to create a new variable by
aggregating the existing data. It is a powerful and useful command that does not exist in SPSS. It
adds summary statistics to each observation. To do the same thing in SPSS, you would need to
create a new file with aggregate and merge it with the original file using match files. The
syntax is:
egen newvar =fcn(arguments) [if exp] [in range] , by(var)
where newvar is the new variable to be created; fcn is one of numerous functions such as:

count() number of non-missing values
diff() compares variables, 1 if different, 0 otherwise
fill() fill with a pattern
group() creates a group id from a list of variables
iqr() interquartile range
ma() moving average
max() maximum value
mean() mean
median() median
min() minimum value
pctile() percentile
rank () rank
rmean() mean across variables
sd () standard deviation
std() standardize variables
sum () sums


The argument is normally just a variable; the var in the by() option must be a categorical variable.

Here are some other examples:
egen avg =mean(yield) creates variable of average yield over entire sample
egen avg2 =median(income), by(sex) creates variable of median income for each sex
egen regprod =sum(prod), by(region) creates variable of total production for each region

Example 9: Using egen to calculate averages
. egen avecon=mean(cons), by(q1c)
. gen highavecon=(cons > avecon)
. list hhid q1c cons avecon highavecon in 650/675

      +-----------------------------------------------------------------+
      |         hhid              q1c       cons     avecon   highav~n |
      |-----------------------------------------------------------------|
 650. | 407070000039   Sirbana Godeti    673.582   940.6532          0 |
 651. | 407070000040   Sirbana Godeti     793.05   940.6532          0 |
 652. | 407070000041   Sirbana Godeti    985.257   940.6532          1 |
 653. | 407070000042   Sirbana Godeti    844.477   940.6532          0 |
 654. | 407070000043   Sirbana Godeti    946.014   940.6532          1 |
      |-----------------------------------------------------------------|
 655. | 407070000044   Sirbana Godeti   2206.057   940.6532          1 |
 656. | 407070000045   Sirbana Godeti   570.0535   940.6532          0 |
 657. | 407070000046   Sirbana Godeti   1340.926   940.6532          1 |
 658. | 407070000047   Sirbana Godeti    901.222   940.6532          0 |
 659. | 407070000048   Sirbana Godeti    887.775   940.6532          0 |
      |-----------------------------------------------------------------|
 660. | 407070000049   Sirbana Godeti   1026.795   940.6532          1 |
 661. | 407070000051   Sirbana Godeti   1392.845   940.6532          1 |
 662. | 407070000052   Sirbana Godeti    574.218   940.6532          0 |
 663. | 407070000053   Sirbana Godeti     363.63   940.6532          0 |
 664. | 407070000054   Sirbana Godeti    926.551   940.6532          0 |
      |-----------------------------------------------------------------|
 665. | 407070000055   Sirbana Godeti   1256.021   940.6532          1 |
 666. | 407070000057   Sirbana Godeti    753.478   940.6532          0 |
 667. | 407070000058   Sirbana Godeti   1378.575   940.6532          1 |
 668. | 407070000059   Sirbana Godeti   1640.834   940.6532          1 |
 669. | 407070000060   Sirbana Godeti    472.841   940.6532          0 |
      |-----------------------------------------------------------------|
 670. | 407070000062   Sirbana Godeti    721.425   940.6532          0 |
 671. | 407070000063   Sirbana Godeti   1341.702   940.6532          1 |
 672. | 407070000064   Sirbana Godeti     781.82   940.6532          0 |
 673. | 407070000065   Sirbana Godeti   1962.697   940.6532          1 |
 674. | 407070000070   Sirbana Godeti    945.045   940.6532          1 |
      |-----------------------------------------------------------------|
 675. | 407070000071   Sirbana Godeti   1742.247   940.6532          1 |
      +-----------------------------------------------------------------+

In Example 9, we want to know which households have expenditure (cons) above the village
average. First, we calculate the average expenditure for each village with the egen command.
Then we create a dummy variable based on the expression (cons > avecon). The list output
shows how the village average is repeated for every household in the village and confirms that
the dummy variable is correctly calculated.


operators
This is not a Stata command, but a topic related to creating new variables. Most of the operators
are obvious, but some are not. Unlike SPSS, you cannot use words like "or", "and", "eq", or
"gt".
Arithmetic
+ addition
- subtraction
* multiplication
/ division
^ power

Relational
> greater than
< less than
>= more than or equal
<= less than or equal
== equal
~= not equal
!= not equal

Logical
~ not
| or
& and


Note: The most difficult rule to remember is when to use = and when to use ==.

Use a single equal symbol (=) when defining a variable.
Use a double equal symbol (==) when you are testing equality, such as in an if
statement or when creating a dummy variable.

Here are some examples to illustrate the use of these operators. Suppose you want to create a
dummy variable indicating households in the Amhara region. One way is to write:

generate AmD =0

replace AmD =1 if q1a==3

Or you can get exactly the same result with just one command:

generate AmD =(q1a==3)

If the expression in parentheses is true, the value is set to 1. If it is false, the value is 0.

Logical operators are useful if you want to impose more than one condition. For example,
suppose you want to create a dummy variable for female household heads in Dodota. In other
words, a household head must be female and in Dodota wereda to be selected.


gen DDfemale =0

replace DDfemale =1 if q1b==9 & sexh==0

or an easier way to do this would be:

gen DDfemale =(q1b==9 & sexh==0)

Or suppose you wanted to create a dummy variable for households in the two regions (Amhara
and Oromia). This variable can be created with:

gen amaoro =0

replace amaoro =1 if q1a==3 | q1a==4

or by one command:

gen amaoro =(q1a==3 | q1a==4)

You can also combine conditions using parentheses. Suppose you wanted a dummy variable that
indicates whether a household is a poor farmer in either the Tigray or the Amhara region. We will
define poor as being in the bottom 20 percent and use the variable poor.

gen PDF =((q1a==1 | q1a==3) & poor==1)

Note: Here is a list of some of the more commonly used additional functions for creating new
variables in Stata; a brief illustration follows the list. Other functions can be found by typing help functions in the Stata Command
window.

abs(x) computes the absolute value of x
exp(x) calculates e to the x power.
ln(x) computes the natural logarithm of x
log(x) is a synonym for ln(x), the natural logarithm.
log10(x) computes the log base 10 of x.
sqrt(x) computes the square root of x.
invnorm(p) provides the inverse cumulative normal; invnorm(norm(z)) =z.
normden(z) provides the standard normal density.
normden(z,s) provides the normal density. normden(z,s) =normden(z)/s if s>0 and s not
missing, otherwise, the result is missing.
norm(z) provides the cumulative standard normal.
group(x) creates a categorical variable that divides the data into x as nearly equal-
sized subsamples as possible, numbering the first group 1, the second
group 2, etc. It uses the current order of the data.
int(x) gives the integer obtained by truncating x.
round(x,y) gives x rounded into units of y.
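To illustrate, here is a minimal sketch of how a couple of these functions can be used with generate (the variables food and hhsize come from the training data used in this manual; the new names lnfood and roothh are only illustrative):

. generate lnfood = ln(food)      /* natural log of food expenditure */
. generate roothh = sqrt(hhsize)  /* square root of household size */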




recode
This command changes the values of a categorical variable according to the rules specified. It is
like the recode command in SPSS except that in Stata you do not necessarily use parentheses.
The syntax is:

recode varname old=new old=new ... [if exp] [in range]

Here are some examples:
recode x 1=2 changes all values of x=1 to x=2
recode x 1=2 3=4 changes 1 to 2 and 3 to 4
recode x 1=2 2=1 exchanges the values 1 and 2 in x
recode x 1=2 *=3 changes 1 in x to 2 and all other values to 3
recode x 1/5=2 changes 1 through 5 in x to 2
recode x 1 3 4 5 =6 changes 1, 3, 4 and 5 to 6
recode x .=9 changes missing to 9
recode x 9=. changes 9 to missing

Notice that you can use some special symbols in the rules:

* means all other values
. means missing values
x/y means all values from x to y
x y means x and y

For example, to recode region values 8 and 9 to 7:

Example 10: Using recode to define a new variable
. t ab q1a

Regi on | Fr eq. Per cent Cum.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Ti gr ay | 150 10. 33 10. 33
Amhar a | 466 32. 09 42. 42
Or omi a | 396 27. 27 69. 70
7 | 139 9. 57 79. 27
8 | 134 9. 23 88. 50
9 | 167 11. 50 100. 00
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 1, 452 100. 00

. r ecode q1a 8 9=7
( q1a: 301 changes made)

. t ab q1a

Regi on | Fr eq. Per cent Cum.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Ti gr ay | 150 10. 33 10. 33
Amhar a | 466 32. 09 42. 42
Or omi a | 396 27. 27 69. 70
7 | 440 30. 30 100. 00
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 1, 452 100. 00



xtile

This command creates a new variable that indicates which category a record falls into, when the
sample is sorted by an existing variable and divided into n groups of equal size. It is probably
easier to explain with examples. xtile can be used to create a variable that indicates which
income quintile a household belongs to, which decile in terms of farm size, or which tercile in
terms of coffee production. The syntax is:
xtile newvar = variable [if exp] [in range], nq(#)
where newvar is the new categorical variable created; variable is the existing variable used to
create the quantiles (e.g. income, farm size); # is the number of categories (e.g. 5 for
quintiles, 3 for terciles)
For example,
xtile incquint =income, nq(5)
xtile farmdec =farmsize, nq(10)
Suppose we want to create a variable indicating the deciles of expenditure per capita.

Example 11: Using xtile to generate deciles (using the ERHS99cons data)
. xt i l e r conseadec= r consae, nq( 10)
. t ab r conseadec

10 |
quant i l es |
of r consae | Fr eq. Per cent Cum.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 | 145 10. 01 10. 01
2 | 145 10. 01 20. 01
3 | 145 10. 01 30. 02
4 | 145 10. 01 40. 03
5 | 145 10. 01 50. 03
6 | 145 10. 01 60. 04
7 | 145 10. 01 70. 05
8 | 145 10. 01 80. 06
9 | 145 10. 01 90. 06
10 | 144 9. 94 100. 00
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 1, 449 100. 00

. t ab r conseadec sexh, col nof r e

10 |
quant i l es | Sex of househol d head
of r consae | Femal e Mal e | Tot al
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
1 | 7. 79 10. 85 | 10. 01
2 | 10. 30 9. 90 | 10. 01
3 | 8. 04 10. 75 | 10. 01
4 | 10. 30 9. 90 | 10. 01
5 | 8. 79 10. 47 | 10. 01
6 | 10. 30 9. 90 | 10. 01
7 | 10. 55 9. 80 | 10. 01
8 | 10. 05 9. 99 | 10. 01
9 | 10. 05 9. 99 | 10. 01
10 | 13. 82 8. 47 | 9. 94
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
Tot al | 100. 00 100. 00 | 100. 00


Exercise 2
1. Use the file ERHScons1999. Create a variable called reg4 which indicates whether a
household is in the Oromia or other regions. Then do a frequency table of the new
variable.
2. Using the same file, create a variable called hhquint that indicates the quintile of
household size. Then do a frequency table on the new variable.
3. Using the same file, create a dummy variable called enbug that is equal to 1 if the
household is in the Enemayi or Bugena weredas and 0 otherwise. Then do a frequency
table on the new variable.
4. Create a new variable avgexp which is equal to the wereda average of food
expenditure (food). Then calculate a new variable equal to the difference between the
household food expenditure and the wereda average expenditure.
5. Using the same file, create a new variable splot which is 1 if the person is cultivating
single plots and 0 otherwise.
6. Use file p1sec1_rv1. Create a set of dummy variables called relatxx based on the
relationship of the person to the household head. For example, relat01 is a dummy for
being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on.


SECTION 5: MODIFYING VARIABLES

In this section, we introduce some more powerful and flexible commands for generating results
from survey data. We begin with an explanation of how to rename and label variables and data in
Stata. Then we see how to format variables. These are the topics and commands covered in this section:
rename variable
label variable
label define
label values
format variable
rename
This command is used to rename a variable, giving it a new name. The syntax is:

. rename old_variable new_variable

For instance, generate regional dummy variables and then rename them:

Example 12: renaming variable
. t ab q1a, gen( i ndex)

Regi on | Fr eq. Per cent Cum.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Ti gr ay | 150 10. 33 10. 33
Amhar a | 466 32. 09 42. 42
Or omi a | 396 27. 27 69. 70
SNNP | 440 30. 30 100. 00
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 1, 452 100. 00



. t ab i ndex1

q1a==Ti gr ay | Fr eq. Per cent Cum.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0 | 1, 302 89. 67 89. 67
1 | 150 10. 33 100. 00
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 1, 452 100. 00

. t ab i ndex2

q1a==Amhar a | Fr eq. Per cent Cum.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0 | 986 67. 91 67. 91
1 | 466 32. 09 100. 00
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 1, 452 100. 00



rename index1 Tigray      renames the index1 variable to Tigray
rename index2 Amhara      renames the index2 variable to Amhara
rename index3 Oromia      renames the index3 variable to Oromia
rename index4 SNNP        renames the index4 variable to SNNP

label variable

This command is used to attach labels to variables in order to make the output easier to
understand. For example, we know that Tigray is region 1 and SNNP is region 7, so we may
want to label the variables as follows:

label variable Tigray "Region 1"
label variable Amhara "Region 3"
label variable Oromia "Region 4"
label variable SNNP "Region 7"

You can use the abbreviation label var
If there are spaces in the label, you must use double quotation marks.
If there are no spaces, quotation marks are optional.
This command is like variable label in SPSS except that you can only label one variable per
command and Stata uses double quotation marks, not single
The limit is 80 characters for a label, but any labels over 30 characters will probably not look
good in a table.

label define
This command gives a name to a set of value labels. For example, instead of numbering the regions, we
can assign a label to each region. Instead of numbering the different sources of income, we can give them
labels. The syntax is:
label define lblname # "label" # "label" ... [, add modify]
where
lblname is the name given to the set of value labels
# are the value numbers
"label" are the value labels
add means that you want to add these value labels to the existing set
modify means that you want to change these values in the existing set

Note that:
You can use the abbreviation label def
The double quotation marks are only necessary if there are spaces in the labels
Stata will not let you define an existing label unless you say modify or add
This command is similar to value label in SPSS except that in Stata you give the labels a name
and later attach it to the variable, while in SPSS you attach it to the variable in the same command.


label values
This command attaches named set of value labels to a categorical variable. The syntax is:
label values varname [lblname] [, nofix]
where varname is the categorical variable which will get the labels and lblname is a set of labels
that has already been defined by label define
Here are some examples of labeling values in Stata.
label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP", modify
label values q1a reg

Some additional commands that may be useful in labeling (a short illustration follows this list):
label dir to request a list of existing label names
label list to request a list of all the existing value labels
label drop to delete one or more value-label sets
label save using to save label definitions as a do-file
label data to give a label to a data file
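For instance, a minimal sketch of these commands in use (reg is the value-label set defined above; the file name reglabels.do is only illustrative):

. label data "This data is used for training"    /* attach a label to the data file */
. label dir                                      /* list existing label names */
. label list reg                                 /* show the labels in the set reg */
. label save reg using reglabels.do, replace     /* save the definitions as a do-file */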

format
The format command allows you to specify the display format for variables. The internal
precision of the variables is unaffected.

The syntax for format command is

. format varlist %fmt

where %fmt is one of the display formats listed below (a short example follows the table):

%fmt          description                              example
-------------------------------------------------------------------------------------
Right-justified formats
%#.#g         general numeric format                   %9.0g
%#.#f         fixed numeric format                     %9.2f
%#.#e         exponential numeric format               %10.7e
%d            default numeric elapsed date format      %d
%d...         user-specified elapsed date format       %dM/D/Y
%#s           string format                            %15s

Right-justified, comma formats
%#.#gc        general numeric format                   %9.0gc
%#.#fc        fixed numeric format                     %9.2fc

Leading-zero formats
%0#.#f        fixed numeric format                     %09.2f
%0#s          string format                            %015s

Left-justified formats
%-#.#g        general numeric format                   %-9.0g
%-#.#f        fixed numeric format                     %-9.2f
%-#.#e        exponential numeric format               %-10.7e
%-d           default numeric elapsed date format      %-d
%-d...        user-specified elapsed date format       %-dM/D/Y
%-#s          string format                            %-15s

Left-justified, comma formats
%-#.#gc       general numeric format                   %-9.0gc
%-#.#fc       fixed numeric format                     %-9.2fc

Centered formats
%~#s          string format (special)                  %~15s
-------------------------------------------------------------------------------------
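For example, a minimal sketch of setting a display format (cons and food are variables from the training data; the chosen format is only illustrative):

. format cons food %9.2f      /* display both variables with 2 decimal places */
. list cons food in 1/5       /* the display changes; the stored values do not */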

Exercise 3
1. Use exercise 2 and label the values and variables of the newly created variables.
2. Label the data file with "This data is used for training".
3. List the existing label names.


SECTION 6: ADVANCED DESCRIPTIVE STATISTICS

In Section 3, we saw preliminary descriptive statistics used mostly to explore the nature of the
data. In this section we explore more advanced statistics.

tabulate summarize

This command creates one- and two-way tables that summarize continuous variables. The
command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations).
With the summarize option, we can instead display the mean and other statistics of a continuous variable
in each cell. The syntax is:

tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options

where
varname1 is a categorical row variable
varname2 is a categorical column variable (optional)
varname3 is the continuous variable summarized in each cell
options can be used to tell Stata which statistics you want

Some notes regarding this command:
The default statistics are the mean, the standard deviation, and the frequency.
You can specify which statistics to display with the options means, standard, and freq
You can use the abbreviation tabsum( )

Some examples:
tab q1a, sum(cons) gives the mean, std deviation, and frequency of per capita
expenditure for each region
tab q1b, sum(cons) mean gives the mean consumption for each village
tab q1a sexh, sum(food) gives the mean, std deviation, and frequency in each cell of
hh head sex per region

The first table is a one-way table (just one categorical variable) showing the mean, standard
deviation, and frequency of monthly consumption expenditure for each region.
In the second table, we use the mean option so only the mean consumption is shown, here for each wereda.
In the third table, we add a second categorical variable (sexh), making it a two-way table.
Although we could have requested all the default statistics in the two-way table, that makes the
table difficult to read, so we do not advise it.

Example 13: Using tabulate with the sum() option to generate tables
. t ab q1a, sum( cons)

| Summar y of consumpt i on per mont h
Regi on | Mean St d. Dev. Fr eq.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Ti gr ay | 413. 93552 297. 701 149
Amhar a | 545. 91653 467. 28072 465
Or omi a | 697. 09029 478. 55749 395
SNNP | 331. 7384 221. 15601 440
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 508. 51838 420. 4014 1449

. t ab q1b, sum( cons) mean

| Summar y of consumpt i on per mont h
Wer eda | Mean
- - - - - - - - - - - - +- - - - - - - - - - - -
At sbi | 417. 16834
Sebhassah | 409. 87
Ankober | 301. 87563
Basso na | 777. 31823
Enemayi | 234. 392
Bugena | 542. 38657
Adaa | 940. 65322
Ker sa | 567. 89355
Dodot a | 526. 58473
Shashemen | 775. 34926
Cheha | 342. 54209
Kedi da Ga | 239. 09955
Bul e | 379. 28676
Bol oso | 266. 93705
Dar amal o | 416. 28045
- - - - - - - - - - - - +- - - - - - - - - - - -
Tot al | 508. 51838

. t ab q1a sexh, sum( cons)

Means, St andar d Devi at i ons and Fr equenci es of consumpt i on per mont h

| Sex of househol d head
Regi on | Femal e Mal e | Tot al
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
Ti gr ay | 342. 44136 488. 3678 | 413. 93552
| 277. 62091 301. 46008 | 297. 701
| 76 73 | 149
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
Amhar a | 450. 61424 582. 89951 | 545. 91653
| 368. 60452 495. 93838 | 467. 28072
| 130 335 | 465
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
Or omi a | 610. 49528 728. 85178 | 697. 09029
| 518. 32024 459. 98768 | 478. 55749
| 106 289 | 395
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
SNNP | 271. 02927 346. 48695 | 331. 7384
| 171. 91652 229. 33158 | 221. 15601
| 86 354 | 440
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
Tot al | 433. 7347 536. 83799 | 508. 51838
| 389. 69001 428. 24021 | 420. 4014
| 398 1051 | 1449


tabstat
This command gives summary statistics for a set of continuous variable for each value of a
categorical variable. The syntax is:
tabstat varlist [if exp] [in range] , stat(statname [...]) by(varname)
where
varlist is a list of continuous variables
statname is a type of statistic
varname is a categorical variable

Some facts about this command:
The default statistic is the mean.
Optional statistics subcommands include mean, sum, max, min, range, sd (standard deviation),
var (variance), skewness, kurtosis, median, and pn (nth percentile).
Without the by() option, tabstat is like summarize except that it allows you to specify the list of
statistics to be displayed.
With the by() option, tabstat is like tabulate, summarize() except that tabstat is more flexible in
the statistics and format displayed.
It is very similar to the SPSS command means.

Examples
tabstat food hhsize, stats(mean max min) gives mean, max, and min of food &
hhsize
tabstat food hhsize, by(q1a) gives mean of two variables for each region
tabstat food, stats(median) by(q1a) gives the median food consumption for each
region
The tabstat command displays summary statistics for a series of numeric variables in a single
table.

Example 14: Using tabstat to create a table
. tabstat rconsae, s(mean p50 sd cv min max) by( rconseadec) missing

Summar y f or var i abl es: r consae
by cat egor i es of : r conseadec ( 10 quant i l es of r consae)

r conseadec | mean p50 sd cv mi n max
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 | 21. 80935 21. 9194 5. 773654 . 264733 4. 811201 30. 40175
2 | 36. 24088 36. 03099 3. 400392 . 0938275 30. 6191 42. 70621
3 | 48. 52454 48. 31921 3. 09388 . 0637591 42. 74319 53. 91997
4 | 60. 38483 60. 0903 3. 811244 . 0631159 54. 00354 66. 85229
5 | 73. 09496 72. 92955 3. 61339 . 0494342 66. 90016 79. 38206
6 | 89. 3758 89. 33151 5. 708862 . 0638748 79. 39233 99. 11871
7 | 110. 407 110. 2909 6. 692319 . 060615 99. 12563 122. 8186
8 | 137. 7846 137. 5525 9. 298181 . 0674835 123. 5698 154. 9666
9 | 179. 5007 176. 1209 17. 33479 . 0965723 155. 0732 214. 4674
10 | 332. 2927 285. 4411 135. 2309 . 4069633 214. 4888 1212. 256
. | . . . . . .
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 108. 7874 79. 38206 97. 27053 . 8941343 4. 811201 1212. 256
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

table
This command creates a wide variety of tables. It is probably the most flexible and useful of all
the table commands in Stata. The syntax is:

table rowvar colvar [if exp] [in range], c(clist) [row col]

where
rowvar is the categorical row variable
colvar is the categorical column variable
clist is a list of statistic and variables
row is an option to include a summary row
col is an option to include a summary column
Some useful facts about this command:
The default statistic is the frequency.
Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn
(nth percentile).
The c( ) is short for contents of each cell.
Like tab, it can be used to create one- and two-way frequency tables, but table cannot do
percentages
Like tabsum, it can be used to calculate basic stats for each value of a categorical variable
Its advantage over tabsum is that it can do more statistics and it can take more than one
continuous variable
Like tabstat, it can be used to calculate advanced stats for each value of a categorical variable
Its advantage over tabstat is that it can do two-way (and higher) tables, but its disadvantage is
that it has fewer statistics.
It is similar to table in SPSS, but easier to learn and less flexible in formatting

Here are some examples:
table q1a , row table of frequencies by region with total row
table q1a, c(mean income) table of average income by region
table q1a, c(mean yield sd yield median yield) table of yield statistics by region
table q1a, c(mean yield) format(%9.2f) table of average yields by region with a
specified display format
table q1a sexh, c(mean yield) table of average yield by region and sex
table q1a sexh, c(mean income mean yield) table of avg yield & income by region & sex

Some output from table commands is shown in Example 15.

The table command calculates and displays tables of statistics, including frequency, mean,
standard deviation, sum, and 1st to 99th percentiles. The row and col options specify an additional
row and column to be added to the table, reflecting the totals across rows and columns.

Example 15: Tabulate median real per capita consumption by region vs sex of household head
table q1a sexh, contents(p50 rconsae) row col missing

| Sex of househol d head
Regi on | Femal e Mal e Tot al
- - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Ti gr ay | 73. 05909 74. 20448 73. 56232
Amhar a | 124. 9734 95. 00103 104. 7363
Or omi a | 98. 59296 99. 43469 98. 75433
SNNP | 53. 73735 50. 34177 51. 14911
|
Tot al | 90. 04483 77. 18623 79. 38206


. t abl e r conseadec, c( mean r consae)

10|
quant i l es |
of |
r consae | mean( r consae)
- - - - - - - - - - +- - - - - - - - - - - - - -
1 | 21. 80935
2 | 36. 24088
3 | 48. 52454
4 | 60. 38483
5 | 73. 09496
6 | 89. 3758
7 | 110. 407
8 | 137. 7846
9 | 179. 5007
10 | 332. 2927

Exercise 4
1. Use ERHScons1999 and tabulate basic summary statistics showing the mean, standard
deviation, and frequency of per capita food consumption for each village. Interpret the
result.
2. Repeat the same procedure as in question 1 but report only the median of food consumption.
3. Tabulate basic summary statistics for food consumption by sex of household head and
region (use a single table).
4. Tabulate the mean, 25th percentile, median, 75th percentile, sd, cv, min, and max summary
statistics for real food consumption per capita by deciles of real consumption per capita.
5. Tabulate median real food consumption per capita by sex of household head and deciles
of real consumption per capita (use a single table).

SECTION 7: PRESENTING DATA WITH GRAPH (GRAPHING DATA)

This section provides a brief introduction to creating graphs. In Stata, all graphs are made with
the graph command, but there are 8 types of charts and numerous subcommands for controlling
the type and format of graph. In this section, we focus on four types of graph and a few options.

The commands that draw graphs are
graph twoway scatterplots, line plots, etc.
graph matrix scatterplot matrices
graph bar bar charts
graph dot dot charts
graph box box-and-whisker plots
graph pie pie charts

Graph commands can also be used to produce histograms, box plots, kernel densities, P-P plots, and Q-Q plots,
but we postpone these until normality is introduced later. Let us first acquaint ourselves with
some twoway graph commands.

A two-way scatterplot can be drawn using the (graph) twoway scatter command to show the
relationship between two variables, cons (total consumption) and food (food consumption). As
we would expect, there is a positive relationship between the two variables.

. graph twoway scatter cons food
[Graph omitted: scatterplot of consumption per month (cons, y-axis) against food cons per month (food, x-axis)]


We can show the regression line predicting cons from food using the lfit option.

. twoway lfit cons food
[Graph omitted: fitted values from the linear fit of cons on food, plotted against food cons per month]



The two graphs can be overlaid like this:

. twoway (scatter cons hhsize) (lfit cons hhsize)
[Graph omitted: scatterplot of consumption per month against household size, with the fitted regression line overlaid]


Exercise 5:
Draw a two-way scatterplot with a fitted line for consumption per capita versus household size and
explain its pattern.
SECTION 8: NORMALITY AND OUTLIER

Check for Normality

An outlier is an observation that lies at an abnormal distance from other values in a random
sample from a population. We must be extremely mindful of possible outliers and their adverse
effects during any attempt to measure the relationship between two continuous variables.

There are no official rules to identify outliers. In a sense, this definition leaves it up to the analyst
(or a consensus process) to decide what will be considered abnormal. Sometimes it is obvious
when an outlier is simply miscoded (for example, age reported as 230) and hence should be set to
missing. But most of the time it is not so clear-cut.

Before abnormal observations can be singled out, it is necessary to characterize normal
observations.

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or
data set, is symmetric if it looks the same to the left and right of the center point. The skewness
for a normal distribution is zero and any symmetric data should have a skewness near zero.
Negative values for the skewness indicate data that are skewed left and positive values for the
skewness indicate data that are skewed right. By skewing left, we mean that the left tail is
heavier than the right tail. Similarly, skewing right means that the right tail is heavier than the
left tail.

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That
is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly,
and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than
a sharp peak. A uniform distribution would be the extreme case. In Stata's definition (as reported by
summarize with the detail option), a normal distribution has a kurtosis of 3, i.e., an excess kurtosis of
zero. Kurtosis well above 3 indicates a "peaked" distribution and kurtosis below 3 indicates a "flat"
distribution. A value of 6 or larger indicates a large departure from normality.

We can obtain skewness and kurtosis values by using the detail option of the summarize command.
Clearly, the variable rconspc (real consumption per capita) is skewed to the right and has a peaked
distribution. Both statistics indicate the distribution of rconspc is far from normal.


. sum rconspc

Var i abl e | Obs Mean St d. Dev. Mi n Max
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
r conspc | 1449 90. 36742 81. 99623 4. 22011 1018. 295

. sum rconspc, detail

r eal consumpt i on per capi t a 1994 pr i ces
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Per cent i l es Smal l est
1% 11. 65814 4. 22011
5% 18. 67906 6. 865227
10% 25. 10425 7. 068164 Obs 1449
25% 39. 94022 8. 201794 Sumof Wgt . 1449

50% 65. 99258 Mean 90. 36742
Lar gest St d. Dev. 81. 99623
75% 114. 2533 577. 1937
90% 180. 8909 624. 1437 Var i ance 6723. 382
95% 236. 1537 660. 1689 Skewness 3. 212314
99% 405. 8775 1018. 295 Kur t osi s 21. 69683


Besides commands for descriptive statistics, such as summarize, we can also check normality of
a variable visually by looking at some basic graphs in Stata, including histograms, boxplots,
kdensity, pnorm, and qnorm. Let's keep using rconspc from the ERHScons1999.dta file for
making some graphs.

The histogram command is an effective graphical technique for showing both the skewness and
kurtosis of rconspc.

histogram rconspc
[Graph omitted: histogram of real consumption per capita, 1994 prices (rconspc)]


The normal option can be used to get a normal overlay. This shows the skew to the right in
rconspc.

. histogram rconspc, normal
[Graph omitted: histogram of rconspc with a normal density overlay]


We can use the bin() option to increase the number of bins to 100; this option specifies how many
bins the data are aggregated into and better illustrates the distribution of rconspc. Notice that the
histogram resembles a bell-shaped curve truncated at 0.

. histogram rconspc, normal bin(100)

[Graph omitted: histogram of rconspc with 100 bins and a normal density overlay]


graph box draws vertical box plots. In a vertical box plot, the y axis is numerical, and the x axis
is categorical. The upper and lower bounds of the box are defined by the 25th and 75th percentiles of
rconspc, and the line within the box is the median. The ends of the whiskers mark the 5th and 95th
percentiles of rconspc. The graph box command can be used to produce a boxplot, which can help us
examine the distribution of rconspc. If rconspc were normal, the median would be in the center of
the box and the ends of the whiskers would be equidistant from the box.

The boxplot for rconspc shows positive skew. The median is pulled toward the low end of the box,
and the 95th percentile is stretched out away from the box, for both male- and female-headed
households. In fact it seems worse for male-headed households.

. graph box rconspc, by(sexh)

[Graph omitted: box plots of real consumption per capita (1994 prices) by sex of household head (Female, Male)]



The kdensity command with the normal option displays a kernel density estimate of a variable with a
normal distribution superimposed on the graph. (After a regression, the same command is particularly
useful for verifying that the residuals are normally distributed, an important assumption for inference.)
The plot shows that rconspc is skewed to the right relative to the superimposed normal distribution.

. kdensity rconspc, normal

[Graph omitted: kernel density estimate of rconspc with a normal density overlay]



Graphical alternatives to the kdensity command are the P-P plot and Q-Q plot.

The pnorm command produces a P-P plot, which graphs a standardized normal probability plot. It should
be approximately linear if the variable follows a normal distribution. The straighter the line formed
by the P-P plot, the more the variable's distribution conforms to the normal distribution.

. pnorm rconspc

[Graph omitted: standardized normal probability (P-P) plot of rconspc, Normal F[(rconspc-m)/s] against Empirical P[i] = i/(N+1)]


The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution.
If the Q-Q plot shows a line that is close to the 45-degree line, the variable is approximately
normally distributed.

. qnorm rconspc

[Graph omitted: quantile-normal (Q-Q) plot of rconspc against the inverse normal]


Both the P-P and Q-Q plots show that rconspc is not normal, with a long tail to the right. The qnorm
plot is more sensitive to departures from normality in the tails of the distribution, while the
pnorm plot is more sensitive to departures near the mean of the distribution.

From the statistics and graphs we can confidently conclude that outliers exist, especially at
the upper end of the distribution.

Dealing with outliers
There are generally three ways to deal with outliers. The easiest is to delete them from analyses.
The second one is to use measures that are not sensitive to them, such as median instead of mean,
or transform the data to be more normal. The most complicated one is to replace them by
imputation.

Since our data is heavily right-tailed, we will focus on very large outliers. A customary criterion for
identifying an outlier is a value more than three standard deviations from the median. Note that we use
the median because it is a robust statistic: if there are big outliers, the mean will shift a lot but not
the median.

Example 16: Using robust statistics to replace outliers
/* Calculate number of standard deviations from the median, by sex of hh head */
. use "C:\..\training\ERHScons1999.dta", clear
. egen median=median(rconspc), by(sexh)
. egen sd=sd(rconspc), by(sexh)
. gen ratio=(rconspc-median)/sd
(3 missing values generated)
. gen outlier=1 if ratio>3 & ratio~=.
(1414 missing values generated)
. replace outlier=0 if outlier==. & ratio~=.
(1411 real changes made)

. t abul at e out l i er , mi ssi ng

out l i er | Fr eq. Per cent Cum.
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0 | 1, 411 97. 18 97. 18
1 | 38 2. 62 99. 79
. | 3 0. 21 100. 00
- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Tot al | 1, 452 100. 00


Only 38 observations are identified as outliers. When we compare the mean and median
values using the table command, excluding the outliers lowers the mean by roughly 12% and 9%
among female- and male-headed households, respectively, while the medians are much less sensitive
to the outliers.
Example 17: Comparing mean and median values with and without outliers
. t abl e sexh out l i er , cont ent s( mean r conspc) r ow col mi ssi ng

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Sex of |
househol d | out l i er
head | 0 1 Tot al
- - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Femal e | 88. 56179 419. 9406 100. 2183
Mal e | 78. 57423 431. 6569 86. 63701
|
Tot al | 81. 29232 427. 3403 90. 36742
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


. t abl e sexh out l i er , cont ent s( medi an r conspc) r ow col mi ssi ng

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Sex of |
househol d | out l i er
head | 0 1 Tot al
- - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Femal e | 70. 84578 398. 7476 73. 41055
Mal e | 63. 45253 371. 2374 64. 25856
|
Tot al | 64. 36755 385. 9552 65. 99258
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -



Method 1: Listwise deletion
In this approach, any observation that contains an outlier is recoded to missing so that the observation
is dropped from the analysis. Although easy to understand and perform, it runs the risk of causing bias.
Stata performs listwise deletion of missing values automatically by default in estimation commands such
as regression.
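As a minimal sketch, assuming the outlier dummy from Example 16 has already been created, the recoding to missing could be done as follows:

. replace rconspc = . if outlier==1    /* set outlying values to missing */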

Sometimes, by dropping outliers, we can greatly reduce the adverse effect of extreme
values. But it does not work well for our data, as indicated by the histogram below.

. histogram rconspc if outlier==0, normal
[Graph omitted: histogram of rconspc excluding outliers, with a normal density overlay]

Method 2: Robust statistics

An alternative is to choose robust statistics that are not sensitive to outliers, such as the median over
the mean, as illustrated above.

When we are concerned about outliers or skewed distributions, the rreg command can be used for
robust regression. Robust regression produces regression coefficients and standard errors that differ
from those of OLS; it is not the same as the regress command with the robust option, which changes
only the standard errors.

Example 18: Robust statistics
. r eg r conspc hhsi ze

Sour ce | SS df MS Number of obs = 1449
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - F( 1, 1447) = 193. 80
Model | 1149884. 79 1 1149884. 79 Pr ob > F = 0. 0000
Resi dual | 8585572. 4 1447 5933. 36033 R- squar ed = 0. 1181
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Adj R- squar ed = 0. 1175
Tot al | 9735457. 19 1448 6723. 38204 Root MSE = 77. 028

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
r conspc | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | - 10. 29891 . 7398004 - 13. 92 0. 000 - 11. 75011 - 8. 847716
_cons | 150. 0144 4. 738429 31. 66 0. 000 140. 7195 159. 3093
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

. r r eg r conspc hhsi ze

Huber i t er at i on 1: maxi mumdi f f er ence i n wei ght s = . 92791317
Huber i t er at i on 2: maxi mumdi f f er ence i n wei ght s = . 26100886
Huber i t er at i on 3: maxi mumdi f f er ence i n wei ght s = . 08196986
Huber i t er at i on 4: maxi mumdi f f er ence i n wei ght s = . 02097291
Bi wei ght i t er at i on 5: maxi mumdi f f er ence i n wei ght s = . 29378905
Bi wei ght i t er at i on 6: maxi mumdi f f er ence i n wei ght s = . 0589816
Bi wei ght i t er at i on 7: maxi mumdi f f er ence i n wei ght s = . 01602466
Bi wei ght i t er at i on 8: maxi mumdi f f er ence i n wei ght s = . 0038901

Robust r egr essi on Number of obs = 1449
F( 1, 1447) = 146. 84
Pr ob > F = 0. 0000

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
r conspc | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | - 5. 583005 . 4607371 - 12. 12 0. 000 - 6. 486789 - 4. 679221
_cons | 105. 3791 2. 951026 35. 71 0. 000 99. 59032 111. 1678
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


Method 3: Data transformation
The variable rconspc is skewed to the right and bounded below by zero. In this case, a log
transformation is appropriate. The logarithm function tends to squeeze together the larger values
in a data set and stretch out the smaller values, which can sometimes produce a distribution that is
closer to symmetric. In addition, a log transformation can help pull in the outliers on the high end
and bring them closer to the rest of the data.
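The log-transformed variable used below can be created with generate; a minimal sketch (lnrconspc is the variable name used in the histogram and summarize output that follow):

. generate lnrconspc = ln(rconspc)    /* natural log of real consumption per capita */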

Let's have a look at the distribution after the log transformation.

. histogram lnrconspc if rconspc~=., normal

[Graph omitted: histogram of lnrconspc with a normal density overlay]


Statistics from the summarize command also indicate a nearly normal distribution.



Example 19: Basic descriptive statistics after data transformation
. sum lnrconspc if outlier==0, detail

l nr conspc
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Per cent i l es Smal l est
1% 2. 456005 1. 439861
5% 2. 921272 1. 926469
10% 3. 210526 1. 955601 Obs 1411
25% 3. 680637 2. 104353 Sumof Wgt . 1411

50% 4. 16461 Mean 4. 155071
Lar gest St d. Dev. . 7208543
75% 4. 696181 5. 706972
90% 5. 07678 5. 772238 Var i ance . 5196309
95% 5. 333407 5. 778316 Skewness - . 2227742
99% 5. 612014 5. 802852 Kur t osi s 2. 724598


Method 4: Imputation
After identifying outliers, usually we first denote them as missing values. Missing data usually
present a problem in statistical analyses. If missing values are correlated with the outcome of
interest, then ignoring them will bias the results of statistical tests. In addition, most statistical
software packages (e.g., SAS, Stata) automatically drop observations that have missing values
for any variables used in an analysis. This practice reduces the analytic sample size, lowering the
power of any test carried out.

Other than simply dropping missing values, there is more than one imputation approach for filling
in missing values. We will focus only on single imputation, which refers to filling a missing
value with a single replacement value.

The easy approach is to use arbitrary methods to impute missing data, such as mean substitution.
Substitution of the simple grand mean will reduce the variance of the variable. Reduced variance
can bias correlation downward (attenuation) or, if the same cases are missing for two variables
and means are substituted, correlation can be inflated. These effects on correlation carry over in a
regression context to lack of reliability of the beta weights and of the related estimates of the
relative importance of independent variables. That is, mean substitution in the case of one
variable can lead to bias in estimates of the effects of other or all variables in the regression
analysis, because bias in one correlation can affect the beta weights of all variables. Mean
substitution is no longer recommended.
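For reference only, here is a minimal sketch of how simple mean substitution would be coded (not recommended, as discussed above; the variable names are illustrative):

. egen meancons = mean(rconspc)               /* grand mean over non-missing observations */
. replace rconspc = meancons if rconspc==.    /* substitute the mean for missing values */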

Another approach is regression-based imputation. In this strategy, it is assumed that the same
model explains the data for the non-missing cases as for the missing cases. First the analyst
estimates a regression model in which the dependent variable has missing values for some
observations, using all non-missing data. In the second step, the estimated regression coefficients
are used to predict (impute) missing values of that variable. The proper regression model
depends on the form of the dependent variable. A probit or logit is used for binary variables,
Poisson or other count models for integer-valued variables, and OLS or related models for
continuous variables. Even though this may introduce unrealistically low levels of noise in the
data, it performs more robustly than mean substitution and is less complex than multiple
imputation. Thus it is the preferred single-imputation approach here.

Assuming we already coded outliers of rconspc as missing, now the missing values are replaced
(imputed) with predicted values.

. xi: regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
. predict yhat
(option xb assumed; fitted values)

. replace lnrconspc=yhat if rconspc==.
(51 real changes made)

There is another Stata command to perform imputation. The impute command fills in missing
values by regression and puts the imputed values into a new variable named by the generate()
option.

. xi: impute lnrconspc i.q1a i.sexh i.poor hhsize ageh, gen(new1)
i.q1a        _Iq1a_1-9      (naturally coded; _Iq1a_1 omitted)
i.sexh       _Isexh_0-1     (naturally coded; _Isexh_0 omitted)
i.poor       _Ipoor_0-1     (naturally coded; _Ipoor_0 omitted)
0.21% (3) observations imputed

. xi: regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
. predict yhat
. replace lnrconspc=yhat if rconspc==.
. compare lnrconspc new1



- - - - - - - - - - di f f er ence - - - - - - - - - -
count mi ni mum aver age maxi mum
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
l nr conspc=new1 1452
- - - - - - - - - -
j oi nt l y def i ned 1452 0 0 0
- - - - - - - - - -
t ot al 1452


The impute command produces exactly the same results.


Exercise 6:
1. Use ERHScons1999 and check normality of real consumption per adult equivalent using
histogram, boxplot, kdensity, pnorm, and qnorm.
2. If there are outliers, identify them using robust statistics.
3. Excluding outliers, check the normality of the distribution.
4. If the distribution is still not normal, apply an appropriate transformation and check its
normality.








SECTION 9: STATISTICAL TESTS

compare command

The compare command is an easy way to check whether two variables are the same. Let's first create
a new variable comparecons, which equals cons if cons is not missing, and equals 0 if cons is
missing.

. gen comparecons=cons if cons~=.
( 51 mi ssi ng val ues gener at ed)

. replace comparecons=0 if cons==.
( 51 r eal changes made)

. compare cons comparecons

---------- difference ----------
count minimum average maximum
------------------------------------------------------------------------
cons=compare~s 1449
----------
jointly defined 1401 0 0 0
cons missing only 51
----------
total 1452

correlate command

The correlate command displays a matrix of Pearson correlations for the variables listed.

. correlate cons hhsize
(obs=1449)

| cons hhsize
-------------+------------------
cons | 1.0000
hhsize | 0.2601 1.0000

ttest command

We would like to see whether the mean of hhsize equals 6 by using a single-sample t-test, testing
whether the sample was drawn from a population with a mean of 6. The ttest command is used for
this purpose.

. t t est hhsi ze=6

One- sampl e t t est
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Var i abl e | Obs Mean St d. Er r . St d. Dev. [ 95%Conf . I nt er val ]
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | 1452 5. 782369 . 0719318 2. 740968 5. 641268 5. 923471
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
mean = mean( hhsi ze) t = - 3. 0255
Ho: mean = 6 degr ees of f r eedom= 1451

Ha: mean < 6 Ha: mean ! = 6 Ha: mean > 6
Pr ( T < t ) = 0. 0013 Pr ( | T| > | t | ) = 0. 0025 Pr ( T > t ) = 0. 9987


We are also interested in whether the mean of cons differs from the mean of food; this is a paired t-test.

. ttest cons=food

Pai r ed t t est
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Var i abl e | Obs Mean St d. Er r . St d. Dev. [ 95%Conf . I nt er val ]
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cons | 1449 508. 5184 11. 04409 420. 4014 486. 8543 530. 1825
f ood | 1449 437. 3696 10. 21292 388. 7621 417. 3359 457. 4033
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
di f f | 1449 71. 14877 2. 130751 81. 10861 66. 96908 75. 32846
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
mean( di f f ) = mean( cons - f ood) t = 33. 3914
Ho: mean( di f f ) = 0 degr ees of f r eedom= 1448

Ha: mean( di f f ) < 0 Ha: mean( di f f ) ! = 0 Ha: mean( di f f ) > 0
Pr ( T < t ) = 1. 0000 Pr ( | T| > | t | ) = 0. 0000 Pr ( T > t ) = 0. 0000


The t-test for independent groups comes in two varieties: pooled variance and unequal variance.
We want to look at the differences in cons between male and female hh head. We will begin with
the ttest command for independent groups with pooled variance and compare the results to the
ttest command for independent groups using unequal variance.

. ttest cons, by(sexh)

Two- sampl e t t est wi t h equal var i ances
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Gr oup | Obs Mean St d. Er r . St d. Dev. [ 95%Conf . I nt er val ]
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Femal e | 398 433. 7347 19. 5334 389. 69 395. 3329 472. 1365
Mal e | 1051 536. 838 13. 20949 428. 2402 510. 918 562. 758
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
combi ned | 1449 508. 5184 11. 04409 420. 4014 486. 8543 530. 1825
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
di f f | - 103. 1033 24. 60287 - 151. 3644 - 54. 84217
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
di f f = mean( Femal e) - mean( Mal e) t = - 4. 1907
Ho: di f f = 0 degr ees of f r eedom= 1447

Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0
Pr ( T < t ) = 0. 0000 Pr ( | T| > | t | ) = 0. 0000 Pr ( T > t ) = 1. 0000


. ttest cons, by(sexh) unequal

Two- sampl e t t est wi t h unequal var i ances
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Gr oup | Obs Mean St d. Er r . St d. Dev. [ 95%Conf . I nt er val ]
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Femal e | 398 433. 7347 19. 5334 389. 69 395. 3329 472. 1365
Mal e | 1051 536. 838 13. 20949 428. 2402 510. 918 562. 758
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
combi ned | 1449 508. 5184 11. 04409 420. 4014 486. 8543 530. 1825
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
di f f | - 103. 1033 23. 58059 - 149. 3921 - 56. 81448
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
di f f = mean( Femal e) - mean( Mal e) t = - 4. 3724
Ho: di f f = 0 Sat t er t hwai t e' s degr ees of f r eedom= 781. 352

Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0
Pr ( T < t ) = 0. 0000 Pr ( | T| > | t | ) = 0. 0000 Pr ( T > t ) = 1. 0000


The by() option can be extended to group mean comparison test.

. ttest cons, by(q1a)
. ttest cons, by(q1a) unequal


Other statistical tests

The hotelling command performs Hotelling's T-squared test of whether the means are equal
between two groups.

. hotel cons, by(sexh)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- > sexh = Femal e

Var i abl e | Obs Mean St d. Dev. Mi n Max
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cons | 398 433. 7347 389. 69 6. 514 3170. 157

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- > sexh = Mal e

Var i abl e | Obs Mean St d. Dev. Mi n Max
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cons | 1051 536. 838 428. 2402 15. 98 3883. 536


2- gr oup Hot el l i ng' s T- squar ed = 17. 561974
F t est st at i st i c: ( ( 1449- 1- 1) / ( 1449- 2) ( 1) ) x 17. 561974 = 17. 561974

H0: Vect or s of means ar e equal f or t he t wo gr oups
F( 1, 1447) = 17. 5620
Pr ob > F( 1, 1447) = 0. 0000

The tabulate command with the chi2 option performs a chi-square test to see whether two variables are independent.
. tab sexh poor, chi2

Sex of |
househol d | poor
head | 0. 00 1. 00 | Tot al
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
Femal e | 277 123 | 400
Mal e | 659 393 | 1, 052
- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -
Tot al | 936 516 | 1, 452

Pear son chi 2( 1) = 5. 5231 Pr = 0. 019


SECTION 10: LINEAR REGRESSION

This section describes the use of Stata to do regression analysis. Regression analysis involves
estimating an equation that best describes the data. One variable is considered the dependent
variable, while the others are considered independent (or explanatory) variables. Stata is capable
of many types of regression analysis and associated statistical tests. In this section, we touch on
only a few of the more common commands and procedures. The commands described in this
section are:
regress
test, testparm
predict
probit
ovtest
hettest

regress

This is an example of ordinary linear regression using the regress command.

. r eg cons hhsi ze

Sour ce | SS df MS Number of obs = 1449
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - F( 1, 1447) = 104. 98
Model | 17310207. 1 1 17310207. 1 Pr ob > F = 0. 0000
Resi dual | 238605459 1447 164896. 654 R- squar ed = 0. 0676
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Adj R- squar ed = 0. 0670
Tot al | 255915666 1448 176737. 338 Root MSE = 406. 07

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cons | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | 39. 95906 3. 900049 10. 25 0. 000 32. 30871 47. 60942
_cons | 277. 0922 24. 97986 11. 09 0. 000 228. 0916 326. 0929
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This regression tells us that for every extra person (hhsize) added to a household, total monthly
expenditure (cons) will increase by about 40 Ethiopian Birr. This increase is statistically
significant, as indicated by the p-value of 0.000 associated with this coefficient.

The other important piece of information is the R-squared, which equals 0.0676. In essence,
this value tells us that our independent variable (hhsize) accounts for approximately 7% of the
variation in the dependent variable (cons).

We can run the regression with robust standard errors, which remain valid when the residuals are not
identically distributed, i.e., when there is heteroskedasticity of the error variance. The robust option
does not affect the estimates of the regression coefficients, only their standard errors.

. r eg cons hhsi ze, r obust

Li near r egr essi on Number of obs = 1449
F( 1, 1447) = 98. 44
Pr ob > F = 0. 0000
R- squar ed = 0. 0676
Root MSE = 406. 07

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Robust
cons | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | 39. 95906 4. 027386 9. 92 0. 000 32. 05893 47. 8592
_cons | 277. 0922 22. 27592 12. 44 0. 000 233. 3957 320. 7888
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

The regress command without any arguments redisplays the last regression analysis.
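For example, after the regression above you could simply type:

. regress    /* redisplays the results of the most recent regression */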

Extract results
Stata stores results from estimation commands in e(), and you can see a list of what exactly is
stored using the ereturn list command.

. er et ur n l i st

scal ar s:
e( N) = 1449
e( df _m) = 1
e( df _r ) = 1447
e( F) = 98. 44285111812539
e( r 2) = . 0676402792171247
e( r mse) = 406. 0746904916928
e( mss) = 17310207. 09089088
e( r ss) = 238605458. 7112162
e( r 2_a) = . 0669959393962658
e( l l ) = - 10758. 51351538218
e( l l _0) = - 10809. 25501198254

macr os:
e( t i t l e) : " Li near r egr essi on"
e( depvar ) : " cons"
e( cmd) : " r egr ess"
e( pr oper t i es) : " b V"
e( pr edi ct ) : " r egr es_p"
e( model ) : " ol s"
e( est at _cmd) : " r egr ess_est at "
e( vcet ype) : " Robust "

mat r i ces:
e( b) : 1 x 2
e( V) : 2 x 2

f unct i ons:
e( sampl e)


Using the generate command, we can extract those results, such as estimated coefficients and
standard errors, to be used in other Stata commands.


. reg cons hhsize
. gen intercept=_b[_cons]

. display intercept
277. 09225
. gen slope=_b[hhsize]
. display slope
39. 959064
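Standard errors can be extracted in the same way from _se[]; a minimal sketch (the name se_slope and the displayed ratio are only illustrative):

. gen se_slope = _se[hhsize]        /* standard error of the hhsize coefficient */
. display _b[hhsize]/_se[hhsize]    /* reproduces the t statistic for hhsize */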

The estimates table command displays a table with coefficients and statistics for one or more
estimation sets in parallel columns. In addition, standard errors, t statistics, p-values, and scalar
statistics may be listed by b, se, t, p options.

. est i mat es t abl e, b se t p

- - - - - - - - - - - - - - - - - - - - - - - - - - -
Var i abl e | act i ve
- - - - - - - - - - - - - +- - - - - - - - - - - - -
hhsi ze | 39. 959065
| 3. 9000493
| 10. 25
| 0. 0000
_cons | 277. 09225
| 24. 979856
| 11. 09
| 0. 0000
- - - - - - - - - - - - - - - - - - - - - - - - - - -
l egend: b/ se/ t / p


Prediction commands
The predict command computes a predicted value and a residual for each observation. The default,
shown below, is to calculate the predicted value of cons.

. predict pred
( opt i on xb assumed; f i t t ed val ues)

When using the resid option the predict command calculates the residual.

. predict e, residual

We can plot the predicted and observed values using the graph twoway command.

. regress cons hhsize
. capture drop pred
. predict pred
. graph twoway (scatter cons hhsize) (lfit pred hhsize)
[Graph: scatter of cons (consumption per month) against household size, with fitted values overlaid]


The rvfplot command is a convenience command that generates a plot of the residual versus the
fitted values. It is used after regress command.

. regress cons food
. rvfplot

[Graph: residuals versus fitted values]


The rvpplot command is another convenience command which produces a plot of the residual
versus a specified predictor and it is also used after regress. In this example, it produces the same
graph as above.

. regress cons food
. rvpplot food


Hypothesis tests

The test command performs Wald tests for simple and composite linear hypotheses about the
estimated parameters.

. recode q1a 7/9=7
. gen reg1=q1a==1
. gen reg3=q1a==3
. gen reg4=q1a==4
. gen reg7=q1a==7

. regress cons hhsize reg1 reg3 reg4 reg7

Sour ce | SS df MS Number of obs = 1449
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - F( 4, 1444) = 87. 29
Model | 49831582. 4 4 12457895. 6 Pr ob > F = 0. 0000
Resi dual | 206084083 1444 142717. 509 R- squar ed = 0. 1947
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Adj R- squar ed = 0. 1925
Tot al | 255915666 1448 176737. 338 Root MSE = 377. 78

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
cons | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | 44. 0891 3. 719572 11. 85 0. 000 36. 79276 51. 38544
r eg1 | ( dr opped)
r eg3 | 155. 8197 35. 62022 4. 37 0. 000 85. 94683 225. 6926
r eg4 | 252. 3235 36. 41307 6. 93 0. 000 180. 8953 323. 7517
r eg7 | - 118. 6372 35. 93946 - 3. 30 0. 001 - 189. 1363 - 48. 13807
_cons | 170. 4098 37. 14745 4. 59 0. 000 97. 54106 243. 2786
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

As stated earlier, consumption expenditure is positively related to household size. In addition,
household consumption expenditure in the Amhara (reg3) and Oromia (reg4) regions is
significantly greater than in Tigray (reg1), while household expenditure in SNNP (reg7) is
significantly less than in Tigray. Note that Stata automatically drops one of the regional dummy
variables (reg1 in this case) to avoid perfect multicollinearity.


. test reg3=0

 ( 1)  reg3 = 0

       F(  1,  1444) =   19.14
            Prob > F =    0.0000


. test reg3=reg4=reg7

 ( 1)  reg3 - reg4 = 0
 ( 2)  reg3 - reg7 = 0

       F(  2,  1444) =  109.84
            Prob > F =    0.0000

The example above gives the results of some tests related to the regression analysis shown
earlier. The first test command tests the hypothesis that the coefficient on reg3 is zero (test
reg3=0); the second tests that the coefficients on all the region variables are equal (test
reg3=reg4=reg7). In both cases the probability is very low (less than 0.0005), so we can reject the
hypothesis. This is not surprising, since each coefficient is statistically significant on its own.

If you want to test the hypothesis that a set of related variables are all equal to zero, you can use
the related testparm command.

. testparm reg* test of hypothesis that all region* dummies are zero


. testparm reg*

 ( 1)  reg3 = 0
 ( 2)  reg4 = 0
 ( 3)  reg7 = 0

       F(  3,  1444) =   75.96
            Prob > F =    0.0000

The hypothesis of no regional influence is rejected, meaning that the regional coefficients are
jointly significant (i.e., region does influence total consumption).

Note: test and predict are commands that can be used in conjunction with all of the above
estimation procedures.

The suest (seemingly unrelated estimation) command combines the estimation results from
several regressions (including parameter estimates and associated covariance matrices) into a
single parameter vector and a simultaneous covariance matrix of the sandwich/robust type.

Typical applications of the suest command are tests of intra-model and cross-model hypotheses
using the test or testnl command, such as a generalized Hausman specification test or a Chow test
for a structural break.

Before we perform any test using the suest command, we must first save the estimation results
with the estimates store command.

. reg cons hhsize if poor==1
. estimates store spoor
. reg cons hhsize if poor==0
. estimates store npoor
. suest spoor npoor

Simultaneous results for spoor, npoor

                                                  Number of obs   =       1449

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Robust
| Coef . St d. Er r . z P>| z| [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
spoor _mean |
hhsi ze | 30. 48242 2. 014054 15. 13 0. 000 26. 53495 34. 4299
_cons | 35. 48593 11. 76232 3. 02 0. 003 12. 4322 58. 53965
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
spoor _l nvar |
_cons | 9. 16965 . 0679724 134. 90 0. 000 9. 036426 9. 302873
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
npoor _mean |
hhsi ze | 85. 00465 5. 166889 16. 45 0. 000 74. 87774 95. 13157
_cons | 210. 9857 25. 48277 8. 28 0. 000 161. 0404 260. 931
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
npoor _l nvar |
_cons | 11. 95647 . 1078529 110. 86 0. 000 11. 74508 12. 16786
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


. test hhsize

 ( 1)  [spoor_mean]hhsize = 0
 ( 2)  [npoor_mean]hhsize = 0

           chi2(  2) =  499.73
         Prob > chi2 =    0.0000


Next we want to see whether the same hhsize coefficient holds for poor and non-poor households.
We can type

. test [spoor_mean]hhsize=[npoor_mean]hhsize

 ( 1)  [spoor_mean]hhsize - [npoor_mean]hhsize = 0

           chi2(  1) =   96.66
         Prob > chi2 =    0.0000

Or we can test whether the coefficients are equal across the two equations, that is, perform a Chow test.

. test ([spoor_mean]hhsize=[npoor_mean]hhsize) ([spoor_mean]_cons=[npoor_mean]_cons)

 ( 1)  [spoor_mean]hhsize - [npoor_mean]hhsize = 0
 ( 2)  [spoor_mean]_cons - [npoor_mean]_cons = 0

           chi2(  2) = 1179.33
         Prob > chi2 =    0.0000

This is equivalent to using the accumulate option of the test command, which tests a hypothesis
jointly with previously tested hypotheses.

. test [spoor_mean]hhsize=[npoor_mean]hhsize
. test [spoor_mean]_cons=[npoor_mean]_cons, accumulate

 ( 1)  [spoor_mean]hhsize - [npoor_mean]hhsize = 0
 ( 2)  [spoor_mean]_cons - [npoor_mean]_cons = 0

           chi2(  2) = 1179.33
         Prob > chi2 =    0.0000

ovtest
Regression analysis generates the best linear unbiased estimates of the true coefficients
provided that certain assumptions are satisfied. One assumption is that there are no omitted
variables that are correlated with the error term. The ovtest command performs a Ramsey RESET
test for omitted variables (misspecification). The syntax is:

ovtest [, rhs]

This test amounts to estimating y = xb + zt + u and then testing t = 0. If the rhs option is not
specified, powers of the fitted values are used for z. Otherwise, powers of the individual
independent variables are used. Examples of the test are:

regress cons hhsize reg3 reg4 reg7
. ovtest tests significance of powers of predicted cons
. ovtest, rhs tests significance of powers of hhsize, reg3, reg4 and reg7

Example;

. ovtest

Ramsey RESET test using powers of the fitted values of cons
       Ho:  model has no omitted variables
                 F(3, 1441) =      4.47
                  Prob > F  =      0.0039

The ovtest rejects the hypothesis that there are no omitted variables, indicating that we need to
improve the specification.
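
As described above, the rhs option uses powers of the individual regressors instead of powers of the fitted values. A minimal sketch, re-using the same model:

. regress cons hhsize reg3 reg4 reg7
. ovtest, rhs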

Heteroskedasticity

We can always visually check how well the regression surface fits the data by plotting residuals
against fitted values, as with the rvfplot or rvpplot commands. In addition, there are a number of
statistical tests for heteroskedasticity in regression errors.

We can use the hettest command to run an auxiliary regression of ln(e_i^2) on the fitted values.

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of cons

         chi2(1)      =    81.50
         Prob > chi2  =   0.0000

The hettest results indicate that there is heteroskedasticity, which needs to be dealt with.
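
One simple remedy, already introduced above, is to re-estimate the model with robust standard errors. A minimal sketch:

. regress cons hhsize reg3 reg4 reg7, robust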

We can also use the information matrix test via the imtest command, which provides a summary
test of violations of the assumptions on the regression errors.

. imtest

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      16.46      2    0.0003
            Skewness |      24.54      1    0.0000
            Kurtosis |       6.66      1    0.0099
---------------------+-----------------------------
               Total |      47.66      4    0.0000
---------------------------------------------------

The imtest also confirms the presence of heteroskedasticity, skewness, and kurtosis problems.

Exercise 7:
1. Using real consumption per capita as the dependent variable, repeat the above regression and
interpret the results.
2. Using the log of consumption as the dependent variable, repeat the above regression and
interpret the results. What are the differences between the results of q1 and q2?
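
A minimal sketch for the second question, keeping the same regressors as above (the variable name lncons is introduced here purely for illustration):

. gen lncons = ln(cons)
. regress lncons hhsize reg3 reg4 reg7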

xi command for categorical data

When we have categorical data, it can be inefficient to generate a series of dummy variables by
hand. The xi prefix is used to dummy-code categorical variables, and we tag these variables with
an i. in front of each target variable.

In our example, the explanatory variable q1a has 4 levels and requires 3 dummy variables. The
test command is used to test the collective effect of the 3 dummy-coded variables; in other
words, it tests the main effect of variable q1a. Note that the dummy-coded variable names must
be written exactly as they appear in the regression results, including the uppercase I.

. xi: regress cons hhsize i.q1a, robust
i.q1a             _Iq1a_1-7           (naturally coded; _Iq1a_1 omitted)

Li near r egr essi on Number of obs = 1449
F( 4, 1444) = 83. 67
Pr ob > F = 0. 0000
R- squar ed = 0. 1947
Root MSE = 377. 78

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Robust
cons | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | 44. 0891 3. 965998 11. 12 0. 000 36. 30937 51. 86884
_I q1a_3 | 155. 8197 31. 30962 4. 98 0. 000 94. 40252 217. 2369
_I q1a_4 | 252. 3235 32. 75505 7. 70 0. 000 188. 0709 316. 5761
_I q1a_7 | - 118. 6372 25. 7164 - 4. 61 0. 000 - 169. 0827 - 68. 19171
_cons | 170. 4098 31. 09044 5. 48 0. 000 109. 4225 231. 3971
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -



. test _Iq1a_3 _Iq1a_4 _Iq1a_7

( 1) _Iq1a_3 = 0
( 2) _Iq1a_4 = 0
( 3) _Iq1a_7 = 0

F( 3, 1444) = 93.96
Prob > F = 0.0000

We reject the null hypothesis of no regional effects since the p-value is small.

Similarly, we can apply the xi command to create village dummies (q1b).

. xi: regress cons hhsize i.q1b, robust
i.q1b             _Iq1b_1-16          (naturally coded; _Iq1b_1 omitted)

Li near r egr essi on Number of obs = 1449
F( 15, 1433) = 42. 56
Pr ob > F = 0. 0000
R- squar ed = 0. 3266
Root MSE = 346. 79

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Robust
cons | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | 45. 2582 3. 714029 12. 19 0. 000 37. 97269 52. 54372
_I q1b_2 | - 27. 54804 46. 85358 - 0. 59 0. 557 - 119. 457 64. 36091
_I q1b_3 | - 87. 41371 45. 04186 - 1. 94 0. 052 - 175. 7688 . 9413348
_I q1b_4 | 355. 9933 49. 9613 7. 13 0. 000 257. 9882 453. 9984
_I q1b_5 | - 169. 5377 35. 10535 - 4. 83 0. 000 - 238. 4011 - 100. 6743
_I q1b_6 | 158. 2973 42. 26349 3. 75 0. 000 75. 39231 241. 2022
_I q1b_7 | 512. 4817 60. 70446 8. 44 0. 000 393. 4026 631. 5608
_I q1b_8 | 112. 9675 44. 07454 2. 56 0. 010 26. 50996 199. 425
_I q1b_9 | 63. 10264 44. 01605 1. 43 0. 152 - 23. 24016 149. 4454
_I q1b_10 | 292. 1852 63. 00507 4. 64 0. 000 168. 5932 415. 7773
_I q1b_12 | - 123. 2652 40. 63929 - 3. 03 0. 002 - 202. 9841 - 43. 54632
_I q1b_13 | - 271. 599 37. 48298 - 7. 25 0. 000 - 345. 1264 - 198. 0716
_I q1b_14 | - 86. 31787 36. 13403 - 2. 39 0. 017 - 157. 1991 - 15. 43661
_I q1b_15 | - 182. 1813 36. 70234 - 4. 96 0. 000 - 254. 1774 - 110. 1852
_I q1b_16 | - 11. 66292 46. 28264 - 0. 25 0. 801 - 102. 4519 79. 12606
_cons | 176. 1548 36. 033 4. 89 0. 000 105. 4717 246. 8379
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


. test _Iq1b_2 _Iq1b_3 _Iq1b_4 _Iq1b_5 _Iq1b_6 _Iq1b_7 _Iq1b_8 _Iq1b_9 _Iq1b_10
  _Iq1b_12 _Iq1b_13 _Iq1b_14 _Iq1b_15 _Iq1b_16

 ( 1)  _Iq1b_2 = 0
 ( 2)  _Iq1b_3 = 0
 ( 3)  _Iq1b_4 = 0
 ( 4)  _Iq1b_5 = 0
 ( 5)  _Iq1b_6 = 0
 ( 6)  _Iq1b_7 = 0
 ( 7)  _Iq1b_8 = 0
 ( 8)  _Iq1b_9 = 0
 ( 9)  _Iq1b_10 = 0
 (10)  _Iq1b_12 = 0
 (11)  _Iq1b_13 = 0
 (12)  _Iq1b_14 = 0
 (13)  _Iq1b_15 = 0
 (14)  _Iq1b_16 = 0

       F( 14,  1433) =   41.73
            Prob > F =    0.0000

Thus, we reject the null hypothesis of no village (area) effects since the p-value is small.

The xi prefix can also be used to create dummy variables for q1b and for the interaction of q1b
and hhsize. The first test command tests the overall interaction and the second test command
tests the main effect of the areas.

. xi: regress cons hhsize i.q1b*hhsize, robust
. test _Iq1bXhhsi_2 _Iq1bXhhsi_3 _Iq1bXhhsi_4 _Iq1bXhhsi_5 _Iq1bXhhsi_6
  _Iq1bXhhsi_7 _Iq1bXhhsi_8 _Iq1bXhhsi_9 _Iq1bXhhsi_10 _Iq1bXhhsi_12
  _Iq1bXhhsi_13 _Iq1bXhhsi_14 _Iq1bXhhsi_15 _Iq1bXhhsi_16

. test _Iq1b_2 _Iq1b_3 _Iq1b_4 _Iq1b_5 _Iq1b_6 _Iq1b_7 _Iq1b_8 _Iq1b_9
  _Iq1b_10 _Iq1b_12 _Iq1b_13 _Iq1b_14 _Iq1b_15 _Iq1b_16

By default, Stata selects the first category of the categorical variable as the reference category. If
we would like to declare a different category as the reference category, the char (characteristics)
command is needed.

In the model above, we would like to use SNNP (q1a==7) as the reference region, and the commands are

. char q1a[omit] 7

. xi: regress cons hhsize i.q1a, robust
i.q1a             _Iq1a_1-7           (naturally coded; _Iq1a_7 omitted)

Li near r egr essi on Number of obs = 1449
F( 4, 1444) = 83. 67
Pr ob > F = 0. 0000
R- squar ed = 0. 1947
Root MSE = 377. 78

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Robust
cons | Coef . St d. Er r . t P>| t | [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | 44. 0891 3. 965998 11. 12 0. 000 36. 30937 51. 86884
_I q1a_1 | 118. 6372 25. 7164 4. 61 0. 000 68. 19171 169. 0827
_I q1a_3 | 274. 4569 24. 86033 11. 04 0. 000 225. 6907 323. 2232
_I q1a_4 | 370. 9607 25. 45855 14. 57 0. 000 321. 021 420. 9004
_cons | 51. 7726 26. 34448 1. 97 0. 050 . 0950491 103. 4502
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

. test _Iq1a_1 _Iq1a_3 _Iq1a_4

 ( 1)  _Iq1a_1 = 0
 ( 2)  _Iq1a_3 = 0
 ( 3)  _Iq1a_4 = 0

       F(  3,  1444) =   93.96
            Prob > F =    0.0000

Some estimation procedures in Stata are included here:

anova analysis of variance and covariance
arch autoregressive conditional heterosce. family of estimators
arima autoregressive integrated moving average models
bsqreg quantile regression with bootstrapped standard errors
cnreg censored-normal regression
cnsreg constrained linear regression
ereg maximum-likelihood exponential distribution models
glm generalized linear models
ivreg instrumental variable and two-stage least squares regression
Lnormal maximum-likelihood lognormal distribution models
mvreg multivariate regression
nl nonlinear least squares
poisson maximum-likelihood poisson regression
qreg quantile regression
reg3 three-stage least squares regression
regress linear regression
rreg robust regression using IRLS
sureg seemingly unrelated regression
tobit tobit regression
vwls variance-weighted least squares regression
zinb zero-inflated negative binomial model
zip zero-inflated poisson models



SECTION 11: LOGISTIC REGRESSION

Logistic regression
We are not going to discuss the theory behind logistic regression here, but will focus on how to
perform logistic regression analyses and interpret the results using Stata. It is assumed that users
are familiar with logistic regression.

The logistic command by default reports the output as odds ratios but can display the
coefficients if the coef option is used.
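
To see the default odds-ratio output, one would simply type the command without the coef option (a minimal sketch):

. logistic poor hhsize ageh sexh

The example below uses the coef option so that the coefficients themselves are displayed.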

. logistic poor hhsize ageh sexh, coef

Logi st i c r egr essi on Number of obs = 1452
LR chi 2( 3) = 120. 33
Pr ob > chi 2 = 0. 0000
Log l i kel i hood = - 884. 66248 Pseudo R2 = 0. 0637

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
poor | Coef . St d. Er r . z P>| z| [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | . 2340767 . 0230735 10. 14 0. 000 . 1888534 . 2792999
ageh | - . 00178 . 003769 - 0. 47 0. 637 - . 0091671 . 0056071
sexh | - . 1278524 . 1363766 - 0. 94 0. 349 - . 3951457 . 1394408
_cons | - 1. 813833 . 2422366 - 7. 49 0. 000 - 2. 288608 - 1. 339058
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -



The exact same results can be obtained by using the logit command.

. logit poor hhsize ageh sexh

Iteration 0:   log likelihood = -944.82915
Iteration 1:   log likelihood = -885.06496
Iteration 2:   log likelihood = -884.66261
Iteration 3:   log likelihood = -884.66248

Logi st i c r egr essi on Number of obs = 1452
LR chi 2( 3) = 120. 33
Pr ob > chi 2 = 0. 0000
Log l i kel i hood = - 884. 66248 Pseudo R2 = 0. 0637

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
poor | Coef . St d. Er r . z P>| z| [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | . 2340767 . 0230735 10. 14 0. 000 . 1888534 . 2792999
ageh | - . 00178 . 003769 - 0. 47 0. 637 - . 0091671 . 0056071
sexh | - . 1278524 . 1363766 - 0. 94 0. 349 - . 3951457 . 1394408
_cons | - 1. 813833 . 2422366 - 7. 49 0. 000 - 2. 288608 - 1. 339058
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

The xi prefix can also be used in a logistic model to include categorical variables.

. xi: logit poor hhsize ageh sexh i.q1b
i.q1b             _Iq1b_1-16          (naturally coded; _Iq1b_1 omitted)

Iteration 0:   log likelihood = -944.82915
Iteration 1:   log likelihood = -736.12463
Iteration 2:   log likelihood = -723.35239
Iteration 3:   log likelihood = -721.57232
Iteration 4:   log likelihood = -721.28496
Iteration 5:   log likelihood = -721.26641
Iteration 6:   log likelihood = -721.26628

Logi st i c r egr essi on Number of obs = 1452
LR chi 2( 17) = 447. 13
Pr ob > chi 2 = 0. 0000
Log l i kel i hood = - 721. 26628 Pseudo R2 = 0. 2366

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
poor | Coef . St d. Er r . z P>| z| [ 95%Conf . I nt er val ]
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | . 2540365 . 0268947 9. 45 0. 000 . 2013237 . 3067492
ageh | - . 0017158 . 0043324 - 0. 40 0. 692 - . 0102073 . 0067756
sexh | - . 2484387 . 1609912 - 1. 54 0. 123 - . 5639757 . 0670983
_I q1b_2 | 1. 00783 . 362959 2. 78 0. 005 . 2964436 1. 719217
_I q1b_3 | 1. 459217 . 3487419 4. 18 0. 000 . 7756951 2. 142738
_I q1b_4 | - . 6362756 . 3294573 - 1. 93 0. 053 - 1. 282 . 0094489
_I q1b_5 | . 8389042 . 3749527 2. 24 0. 025 . 1040104 1. 573798
_I q1b_6 | - 1. 031675 . 3775811 - 2. 73 0. 006 - 1. 77172 - . 2916295
_I q1b_7 | - 3. 720754 1. 040637 - 3. 58 0. 000 - 5. 760366 - 1. 681142
_I q1b_8 | . 3595295 . 3376553 1. 06 0. 287 - . 3022627 1. 021322
_I q1b_9 | - . 5104919 . 3506632 - 1. 46 0. 145 - 1. 197779 . 1767954
_I q1b_10 | - . 6988431 . 3668961 - 1. 90 0. 057 - 1. 417946 . 02026
_I q1b_12 | 1. 659669 . 3781756 4. 39 0. 000 . 9184587 2. 40088
_I q1b_13 | 2. 55036 . 4331979 5. 89 0. 000 1. 701308 3. 399413
_I q1b_14 | . 6482351 . 3205897 2. 02 0. 043 . 0198909 1. 276579
_I q1b_15 | 1. 617736 . 3421814 4. 73 0. 000 . 947073 2. 288399
_I q1b_16 | . 1479744 . 3737048 0. 40 0. 692 - . 5844736 . 8804224
_cons | - 2. 152062 . 360365 - 5. 97 0. 000 - 2. 858365 - 1. 44576
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Extract results

We can use the ereturn or estat commands to retrieve results from the estimation, just as with
other regression commands.


. ereturn list
scal ar s:
e( N) = 1452
e( l l _0) = - 944. 829150727132
e( l l ) = - 721. 2662778966318
e( df _m) = 17
e( chi 2) = 447. 1257456610003
e( r 2_p) = . 2366172473176216

macr os:
e( t i t l e) : " Logi st i c r egr essi on"
e( depvar ) : " poor "
e( cmd) : " l ogi t "
e( cr i t t ype) : " l og l i kel i hood"
e( pr edi ct ) : " l ogi t _p"
e( pr oper t i es) : " b V"
e( est at _cmd) : " l ogi t _est at "
e( chi 2t ype) : " LR"

mat r i ces:
e( b) : 1 x 18
e( V) : 18 x 18

f unct i ons:
e( sampl e)


. estat summarize

Est i mat i on sampl e l ogi t Number of obs = 1452

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Var i abl e | Mean St d. Dev. Mi n Max
- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
poor | . 3553719 . 4787908 0 1
hhsi ze | 5. 782369 2. 740968 1 17
ageh | 49. 34711 15. 64917 18 95
sexh | . 7245179 . 4469108 0 1
_I q1b_2 | . 0454545 . 2083707 0 1
_I q1b_3 | . 0592287 . 2361335 0 1
_I q1b_4 | . 1205234 . 3256848 0 1
_I q1b_5 | . 042011 . 2006834 0 1
_I q1b_6 | . 0991736 . 2989979 0 1
_I q1b_7 | . 065427 . 247363 0 1
_I q1b_8 | . 065427 . 247363 0 1
_I q1b_9 | . 0750689 . 2635932 0 1
_I q1b_10 | . 0668044 . 249769 0 1
_I q1b_12 | . 0447658 . 2068607 0 1
_I q1b_13 | . 0509642 . 2200004 0 1
_I q1b_14 | . 0922865 . 2895297 0 1
_I q1b_15 | . 0661157 . 2485698 0 1
_I q1b_16 | . 0488981 . 2157292 0 1
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -




. estat ic

--------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df        AIC       BIC
-------------+------------------------------------------------------------
           . |   1452   -944.8292   -721.2663     18   1478.533  1573.585
--------------------------------------------------------------------------


Marginal effects

We use the mfx command to numerically calculate the marginal effects or elasticities, and their
standard errors, after estimation. Several options are available for the calculation of marginal
effects:
dydx is the default.
eyex specifies that elasticities be calculated in the form of d(lny)/d(lnx)
dyex specifies that elasticities be calculated in the form of d(y)/d(lnx)
eydx specifies that elasticities be calculated in the form of d(lny)/d(x)

. mfx, dydx

Marginal effects after logit
      y  = Pr(poor) (predict)
         =  .29194589
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
var i abl e | dy/ dx St d. Er r . z P>| z| [ 95%C. I . ] X
- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
hhsi ze | . 0525128 . 00572 9. 19 0. 000 . 04131 . 063715 5. 78237
ageh | - . 0003547 . 0009 - 0. 40 0. 692 - . 00211 . 0014 49. 3471
sexh*| - . 0524912 . 03475 - 1. 51 0. 131 - . 120599 . 015617 . 724518
_I q1b_2*| . 2364405 . 08983 2. 63 0. 008 . 060381 . 4125 . 045455
_I q1b_3*| . 3449546 . 08142 4. 24 0. 000 . 18538 . 50453 . 059229
_I q1b_4*| - . 1173583 . 0532 - 2. 21 0. 027 - . 221631 - . 013085 . 120523
_I q1b_5*| . 1947244 . 09284 2. 10 0. 036 . 012753 . 376696 . 042011
_I q1b_6*| - . 1735392 . 04883 - 3. 55 0. 000 - . 269243 - . 077835 . 099174
_I q1b_7*| - . 3321019 . 02071 - 16. 04 0. 000 - . 37269 - . 291514 . 065427
_I q1b_8*| . 0787698 . 07774 1. 01 0. 311 - . 073607 . 231146 . 065427
_I q1b_9*| - . 0953845 . 05835 - 1. 63 0. 102 - . 209748 . 018979 . 075069
_I q1b_10*| - . 1248791 . 0551 - 2. 27 0. 023 - . 23287 - . 016888 . 066804
_I q1b_12*| . 3912309 . 08351 4. 69 0. 000 . 22756 . 554902 . 044766
_I q1b_13*| . 5568326 . 06416 8. 68 0. 000 . 43108 . 682586 . 050964
_I q1b_14*| . 1464237 . 07716 1. 90 0. 058 - . 004807 . 297654 . 092287
_I q1b_15*| . 3809777 . 07704 4. 95 0. 000 . 229984 . 531971 . 066116
_I q1b_16*| . 0314128 . 08135 0. 39 0. 699 - . 128026 . 190852 . 048898
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(*) dy/dx is for discrete change of dummy variable from 0 to 1
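
If elasticities are wanted instead of marginal effects, the eyex option described above can be used in the same way. A minimal sketch:

. mfx, eyex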


Hypothesis Tests

Likelihood-ratio test

The lrtest command performs a likelihood-ratio test for the null hypothesis that the parameter
vector of a statistical model satisfies some smooth constraint. To conduct the test, both the
unrestricted and the restricted models must be fitted using the maximum likelihood method (or
some equivalent method), and the results of at least one must be stored using estimates store.

The lrtest command provides an important alternative to Wald testing for models fitted by
maximum likelihood. Wald testing requires fitting only one model (the unrestricted model).
Hence, it is computationally more attractive than likelihood-ratio testing. Most statisticians,
however, favor using likelihood-ratio testing whenever feasible since the null-distribution of the
LR test statistic is often "more closely" chi-square distributed than the Wald test statistic.

We would like to see whether the introduction of regional dummies will help our estimation. We
perform a likelihood-ratio test using the lrtest command.


. xi: logit poor hhsize ageh i.q1a
. estimates store n1
. logit poor hhsize ageh

. lrtest n1

Likelihood-ratio test                                  LR chi2(5)  =    169.86
(Assumption: . nested in n1)                           Prob > chi2 =    0.0000


The null hypothesis is firmly rejected.

Other hypothesis tests for parameters are the same as described in OLS.


Other related commands
Stata has a variety of commands for performing estimation when the dependent variable is
dichotomous or polychotomous. Here is a list of some estimation commands for discrete
dependent variables. See estimation commands for a complete list of all of Stata's estimation
commands.

asmprobit alternative-specific multinomial probit regression
binreg GLM models for the binomial family
biprobit bivariate probit regression
blogit logit regression for grouped data
bprobit probit regression for grouped data
clogit conditional logistic regression
cloglog complementary log-log regression
glogit weighted least squares logit on grouped data
gprobit weighted least squares probit on grouped data
heckprob probit model with selection
hetprob heteroskedastic probit model
ivprobit probit model with endogenous regressors
logistic logistic regression
logit maximum-likelihood logit regression
mlogit maximum-likelihood multinomial logit models
mprobit multinomial probit regression
nbreg maximum-likelihood negative binomial regression
nlogit nested logit regression
ologit maximum-likelihood ordered logit
oprobit maximum-likelihood ordered probit
probit maximum-likelihood probit estimation
rologit rank-ordered logistic regression
scobit skewed logistic regression
slogit stereotype logistic regression
xtcloglog random-effects and population-averaged cloglog models
xtgee GEE population-averaged generalized linear models
xtlogit fixed-effects, random-effects, and population-averaged logit models
xtprobit random-effects and population-averaged probit models


Exercise 8:
Generate a variable containing food expenditure terciles, ftercile. Using ftercile as the dependent
variable, run a multinomial logit in which the independent variables are the same as in the
example above, and interpret the results.
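
A minimal sketch of one way to approach this, assuming the food expenditure variable is called food (ftercile is the tercile variable named in the exercise):

. xtile ftercile = food, nq(3)
. xi: mlogit ftercile hhsize ageh sexh i.q1b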

SECTION 12: PANEL DATA ANALYSIS
Panel data have both cross-sectional and longitudinal (time-series) dimensions. An example is the
ERHS data, which were collected six times between 1994 and 2004. Panel data may have group
effects, time effects, or both. These effects are analyzed with fixed effect and random effect models.

A panel data set contains observations on n individuals, each measured at T points in time. In
other words, each individual (1 through n subjects) has T observations (1 through T time
periods). Thus, the total number of observations is nT. Figure 6 below illustrates the data
arrangement of a panel data set.

Figure 6: Data arrangement of panel data
Group Time Variable1 Variable2 Variable3
1 1
1 2

2 1
2 2
.
n 1
n 2
.
n T



Fixed Effect versus Random Effect Models
Panel data models estimate fixed and/or random effects models using dummy variables. The core
difference between fixed and random effect models lies in the role of these dummies. If the
dummies are treated as part of the intercept, it is a fixed effect model; in a random effect model,
they are treated as part of the error term (see Figure 7).

The fixed effect model examines group differences in intercepts, assuming the same slopes and
constant variances across groups. Fixed effect models use least squares dummy variable (LSDV),
within effect, and between effect estimation methods. Thus, ordinary least squares (OLS)
regressions with dummies are, in fact, fixed effect models.

Figure 7: Fixed effects and Random effects models

                    Fixed Effect Model                      Random Effect Model
Functional form     y_it = (α + u_i) + X_it'β + v_it        y_it = α + X_it'β + (u_i + v_it)
Intercepts          Varying across group and/or time        Constant
Error variances     Constant                                Varying across group and/or time
Slopes              Constant                                Constant
Estimation          LSDV, within effect, between effect     GLS, FGLS
Hypothesis test     Incremental F test                      Breusch-Pagan LM test

where v_it ~ IID(0, σ_v²)

The random effect model, by contrast, estimates variance components for groups and error,
assuming the same intercept and slopes. The difference among groups (or time periods) lies in
the variance of the error term. This model is estimated by generalized least squares (GLS) when
Ω, the variance structure among groups, is known. The feasible generalized least squares
(FGLS) method is used to estimate the variance structure when Ω is not known. A typical
example is the groupwise heteroscedastic regression model (Greene 2003). There are various
estimation methods for FGLS, including maximum likelihood methods and simulations (Baltagi
and Chang 1994).

Fixed effects are tested by the (incremental) F test, while random effects are examined by the
Lagrange multiplier (LM) test (Breusch and Pagan 1980). If the null hypothesis is not rejected,
the pooled OLS regression is favored. The Hausman specification test (Hausman 1978)
compares the fixed effect and random effect models, as summarized in Figure 7.

Group effect models create dummies using grouping variables (e.g., region, wereda, etc.). If one
grouping variable is considered, it is called a one-way fixed or random group effect model. Two-
way group effect models have two sets of dummy variables, one for a grouping variable and the
other for a time variable.

LSDV regression, the within effect model, the between effect model (group or time mean
model), GLS and FGLS are all fundamentally based on OLS in terms of estimation. Thus, any
procedure and command for OLS also applies to the panel data models.

The Stata xtreg command estimates within effect (fixed effect) models with the fe option,
between effect models with the be option, and random effect models with the re option (a
minimal sketch follows the command list below). The following xt commands are the families
used to run panel data analysis. I recommend further reading to understand them in detail.

xtdes      Describe pattern of xt data
xtsum      Summarize xt data
xttab      Tabulate xt data
xtdata     Faster specification searches with xt data
xtline     Line plots with xt data
xtreg      Fixed-, between- and random-effects, and population-averaged linear models
xtregar    Fixed- and random-effects linear models with an AR(1) disturbance
xtgls      Panel-data models using GLS
xtpcse     OLS or Prais-Winsten models with panel-corrected standard errors
xtrc       Random coefficients models
xtivreg    Instrumental variables and two-stage least squares for panel-data models
xtabond    Arellano-Bond linear, dynamic panel data estimator

xttobit    Random-effects tobit models
xtintreg   Random-effects interval data regression models

xtlogit    Fixed-effects, random-effects, & population-averaged logit models
xtprobit   Random-effects and population-averaged probit models
xtcloglog  Random-effects and population-averaged cloglog models

xtpoisson  Fixed-effects, random-effects, & population-averaged Poisson models
xtnbreg    Fixed-effects, random-effects, & population-averaged negative binomial models

xtgee      Population-averaged panel-data models using GEE
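
As referenced above, a minimal sketch of how xtreg and the Hausman test might be used, assuming the data have been appended into a panel with household identifier hhid and more than one survey round (an assumption for illustration, not commands run on the single-round file used so far):

. xtreg cons hhsize, fe i(hhid)
. estimates store fixed
. xtreg cons hhsize, re i(hhid)
. hausman fixed .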

Notes: Using different examples, we will discuss more about panel data during training sessions.

SECTION 13: DATA MANAGEMENT

Subset data

We can subset data by keeping or dropping variables, or by keeping and dropping observations.

1. keep and drop variables

Suppose our data file has many variables, but we care about just a handful of them. We can
subset our data file to keep just the variables of interest. The keep command keeps the variables
in the list while dropping all other variables.

. keep hhid cons food

Instead of keeping just a handful of variables, we might want to get rid of just one or two
variables in the data file. The drop command drops the variables in the list while keeping all
other variables.

. drop cons

2. keep and drop observations

The keep if command keeps observations for which a condition is met.

. keep if sexh==1
( 400 obser vat i ons del et ed)

Here we focus on male-headed households, which means that the 400 female-headed households
are dropped from the data set.

A similar logic applies to the drop if command. For example, we can eliminate observations with
missing values using drop if. The expression after drop if specifies which observations should be
dropped.

. drop if missing(cons)
( 3 obser vat i ons del et ed)

3. Using the use command to drop variables and observations

You can eliminate both variables and observations with the use command. Let's read in just
hhid, cons, q1a, sexh, and hhsize from the ERHScons1999.dta file.

. use hhid cons q1a sexh hhsize using ERHScons1999.dta

We can also restrict the read to observations meeting a condition, for example:

. use hhid cons q1a sexh hhsize if sexh==1 using ERHScons1999.dta


Organize data

sort
The sort command arranges the observations of the current data into ascending order based on
the values of the variables listed. There is no limit to the number of variables in the variable list.
Missing numeric values are interpreted as being larger than any other number, so they are placed
last. When you sort on a string variable, however, null strings are placed first.

. sort hhid sexh cons

Variable ordering
The order command helps us organize variables in a way that makes sense by changing the
order of the variables. While there are several possible orderings that are logical, we usually put
the id variable first, followed by the demographic variables, such as region, zone, gender, and
urban/rural. Here we order the variables of the household total expenditure file as follows.

. order hhid q1a q1b q1c q1d sexh ageh cons

Using _n and _N in conjunction with the by command can produce some very useful results.
When used with by command, _N is the total number of observations within each group listed in
by command, and _n is the running counter to uniquely identify observations within the group.
To use the by command we must first sort our data on the by variable.


. sort group
. by group: generate n1=_n
. by group: generate n2=_N
. list
     +-----------------------------------+
     | score   group   id   nt   n1   n2 |
     |-----------------------------------|
  1. |    72       1    1    7    1    4 |
  2. |    85       1    7    7    2    4 |
  3. |    76       1    3    7    3    4 |
  4. |    90       1    6    7    4    4 |
  5. |    84       2    2    7    1    2 |
     |-----------------------------------|
  6. |    82       2    5    7    2    2 |
  7. |    89       3    4    7    1    1 |
     +-----------------------------------+

Now n1 is the observation number within each group and n2 is the total number of observations
for each group. This is very useful in programming, especially in identifying duplicate
observations.

To use _n to find duplicated observations, we can type:

. sort group
. list if id == id[_n+1]

To use _N to identify duplicated observations, use:

. sort group score
. by group score: generate ngroup=_N
. list if ngroup>1

If there are a lot of variables in the data set, it could take a long time to type them all out. We
can make use of the * and ? wildcards to indicate that we wish to use all the variables.

Further, we can combine the sort and by commands into a single bysort statement. Below is a
simplified version of the code that yields the same results as above.

. bysort *: generate nn=_N
. list if nn>1



Create one data set from two or more data sets

Appending data files

We can create a new dataset with the append command, which concatenates two datasets, that is,
sticks them together vertically, one after the other.

Suppose we are given one file with data for the rural households (called rural.dta) and a file for
the urban households (called urban.dta). We need to combine these files to be able to analyze
them together.

. use ERHS1999.dta, clear
. append using ERHS1997.dta
. append using ERHS1995.dta

The append command does not require that the two datasets contain the same variables, even
though this is typically the case. However, it is highly recommended to use an identical list of
variables when appending, to avoid creating missing values for variables that exist in only one of
the datasets.


One-to-one match merging

Another way of combining data files is match merging. The merge command sticks two datasets
together horizontally, one next to the other. Before any merge, both datasets must be sorted by
the same merge variable(s).

Assume we are working on the ERHS 1999 data and have been given two files. One file has all
the consumption information from own production (called p2sec9a.dta) and the other file has
community prices by wereda (called p_r5.dta). Both data sets have been cleaned and sorted by
hhid and item1234. We would like to merge the two files together by hhid and item1234.

. use p2sec9a.dta, clear
. list
. sort hhid item1234
. save consumption.dta, replace

. use p_r5, clear
. list
. sort hhid item1234
. save comprice.dta, replace

. use consumption.dta, clear
. merge hhid item1234 using comprice.dta

After the merge command, a _merge variable appears. The _merge variable indicates, for each
observation, how the merge went. This is especially useful for identifying mismatched records.
_merge can take one of three values when merging file A using file B:

_merge==1   the record contains information from the master data file A only
_merge==2   the record contains information from the using data file B only
_merge==3   the record contains information from both files

When there are many records, tabulating _merge is very useful for summarizing how many
mismatched observations you have. In this example, only the records with _merge==3 appear in
both files, so there is a substantial number of mismatched records.

. tab _merge

     _merge |      Freq.     Percent        Cum.
------------+------------------------------------
          1 |      3,605       24.21       24.21
          2 |      9,732       65.37       89.58
          3 |      1,551       10.42      100.00
------------+------------------------------------
      Total |     14,888      100.00
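
If only the matched records are needed for the analysis, a common follow-up (a minimal sketch using commands introduced in this section) is:

. keep if _merge==3
. drop _merge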

One-to-many match merging

Another kind of merge is called a one-to-many merge. Say we have one data file, household.dta,
that contains household information, and another data file, individual.dta, that contains
information on each individual in the household. If we merge household.dta with individual.dta,
there can be multiple individuals per household, and hence it is a one-to-many merge.

The strategy for the one-to-many merge is really the same as for the one-to-one match merge.

. use household.dta, clear
. list
. sort hhid
. save h1.dta, replace

. use individual.dta, clear
. list
. sort hhid
. save h2.dta, replace

. use h1.dta, clear
. merge hhid using h2.dta

It does not matter which file is the master and which is the using file; the results are the same.
The only difference is the order of the records after the merge.



Label data

Besides giving labels to variables, we can also label the data set itself so that we will remember
what the data are. The label data command places a label on the whole dataset.

. label data "relabeled household"

We can also add notes to the data set. The note: command (note the colon, :) allows you
to place notes in the dataset.

. notes hhsize: the variable fsize was renamed to hhsize

The notes command display all notes in the data set.

. notes
hhsize:
    the variable fsize was renamed to hhsize

Collapse

Sometimes we have data files that need to be aggregated at a higher level to be useful. For
example, we may have household data but really be interested in regional data. The collapse
command serves this purpose by converting the dataset in memory into a dataset of means, sums,
medians or percentiles. Note that the collapse command creates a new dataset: all household
information disappears and only the specified variable aggregations remain at the region level.
The resulting summary table can be viewed with the edit command.

For instance, we would like to see the mean of cons for each region (q1a) and sex of the household head.

. collapse (mean) cons, by(q1a sex)
. edit

regco   urban   tot_exp
1 0 12. 067
1 1 14. 899
2 0 13. 022
2 1 17. 849
3 0 11. 612
3 1 16. 507
4 0 13. 324
4 1 17. 790
5 0 15. 152
5 1 22. 627
6 0 11. 890
6 1 18. 261
7 0 12. 313
7 1 18. 591
12 0 10. 851
12 1 19. 714
13 0 19. 528
13 1 20. 021
14 0 21. 568
14 1 30. 597
15 0 16. 627
15 1 19. 574

However, this table is not easy to interpret. We can call it a long format, since the urban and rural
values are listed vertically. We will use the reshape command to convert it into a wide format in
which the rural and urban values are arranged horizontally in a two-way table. The reshape wide
command tells Stata that we want to go from long to wide. The i() option specifies the row
variable while j() specifies the column variable.

. reshape wide tot_exp, i(regco) j(urban)
(note: j = 0 1)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       22   ->   11
Number of variables                   3   ->   3
j variable (2 values)             urban   ->   (dropped)
xij variables:
                                tot_exp   ->   tot_exp0 tot_exp1
-----------------------------------------------------------------------------
The converted table is a two-way table:

regco   tot_exp0   tot_exp1
1 12. 067 14. 899
2 13. 022 17. 849
3 11. 612 16. 507
4 13. 324 17. 790
5 15. 152 22. 627
6 11. 890 18. 261
7 12. 313 18. 591
12 10. 851 19. 714
13 19. 528 20. 021
14 21. 568 30. 597
15 16. 627 19. 574

If needed, the table can be converted back into the long format with reshape long.

. reshape long tot_exp, i(regco) j(urban)

The collapse and reshape commands are examples of the power and simplicity of Stata in its
ability to shape data files.

SECTION 14: ADVANCED PROGRAMMING

Besides simple one-line commands, we can get more out of Stata through more sophisticated
programming.

Looping

Consider the sample program below, which reads in income data for twelve months.

input famid inc1-inc12
1 3281 3413 3114 2500 2700 3500 3114 3319 3514 1282 2434 2818
2 4042 3084 3108 3150 3800 3100 1531 2914 3819 4124 4274 4471
3 6015 6123 6113 6100 6100 6200 6186 6132 3123 4231 6039 6215
end

Say that we want to compute the amount of tax (10%) paid in each month, which means
computing 12 new variables by multiplying each of the inc* variables by 0.10.

There is more than one way to execute part of your do file more than once.

1. The simplest way is to use 12 generate commands.

generate taxinc1 = inc1 * .10
generate taxinc2 = inc2 * .10
generate taxinc3 = inc3 * .10
generate taxinc4 = inc4 * .10
generate taxinc5 = inc5 * .10
generate taxinc6 = inc6 * .10
generate taxinc7 = inc7 * .10
generate taxinc8 = inc8 * .10
generate taxinc9 = inc9 * .10
generate taxinc10= inc10 * .10
generate taxinc11= inc11 * .10
generate taxinc12= inc12 * .10

2. Another way to compute the 12 variables is to use the foreach command.

In the example below, we use the foreach command to cycle through the variables inc1 to inc12
and compute the taxable income as taxinc1-taxinc12.

foreach var of varlist inc1-inc12 {
generate tax`var' = `var' * .10
}

The initial foreach statement tells Stata that we want to cycle through the variables inc1 to inc12
using the statements that are surrounded by the curly braces. Note that the opening curly brace
must appear at the end of the foreach command line. The first time we cycle through the
statements, the value of var will be inc1; the second time the value of var will be inc2, and so on
until the final iteration, where the value of var will be inc12. Each statement within the loop (in
this case, just the one generate statement) is evaluated and executed. When we are inside the
foreach loop, we can access the value of var by surrounding it with the special quotation marks,
like this: `var'. The ` is the quote right below the ~ on your keyboard and the ' is the quote below
the " on your keyboard. The first time through the loop, `var' is replaced with inc1, so the statement


generate tax`var' = `var' * .10

becomes

generate taxinc1 = inc1 * .10

This is repeated for inc2 and then inc3, and so on until inc12. So this foreach loop is the
equivalent of executing the 12 generate commands manually, but is much easier and less error
prone.

3. The third way is to use a while loop.

First we define a Stata local variable that is going to be the loop counter. As with the foreach
command, the code inside the loop refers to the local variable through the quotes, here `i'.

local i=1
while `i'<=12 {
generate taxinc`i'=inc`i'*0.10
local i=`i'+1
}

Local variable i can be seen as a counter, and the while condition states how many times the
commands within the while loop will be repeated; it basically says keep going until the counter
reaches the limit of 12. Note that the opening curly brace must appear at the end of the while
command line. All commands between the curly braces will be executed each time the system
goes through the while loop. So first the statement

generate taxinc`i'=inc`i'*0.10

becomes

generate taxinc1=inc1*0.10

The counter value is then increased by 1. Note that the fourth line means the value of the local
variable i will be increased by 1 from its current value stored in `i'.

SECTION 15: TROUBLESHOOTING AND UPDATE

The help command followed by a Stata command brings up the on-line help system for that
command. It can be used from the command line or from the help window. With help you must
spell the full name of the command completely and correctly.

. help regress

The help contents command lists all the commands that can be accessed through the help system.

. help contents

The search command looks for the term in help files, Stata Technical Bulletins and Stata FAQs.
It can be used from the command line or from the help window.

. search logit

The findit command can be used to search the Stata site and other sites for Stata-related
information, including ado files. Say that we are interested in panel data; we can search for
related programs from within Stata by typing

. findit panel data

The Stata viewer window appears and we are shown a number of resources related to this key
word.

Stata is composed of an executable file and official ado files. Ado stands for automatically
loaded do file. An ado file is a Stata command created by users like you. Once installed on your
computer, ado files work in much the same way as built-in Stata commands. Stata files are
regularly updated, so it is important to make sure that you are always running the most up-to-date
Stata; please update regularly.

The update command reports on the current update level and installs official updates to Stata. It
helps users stay up to date with the latest Stata ado and executable files, and copies and installs
the ado files into the directory specified.

. update
. update ado, into(d:\ado)

You can keep track of all the user-written ado files that you have added over time with the ado
command, which lists all of them, with information on where each came from and what it does.

. ado
[1] package spost9_ado from http://www.indiana.edu/~jslsoc/stata
    spost9_ado.  Stata 9 commands for the post-estimation interpretation of

[2] package st0081 from http://www.stata-journal.com/software/sj5-1
    SJ5-1 st0081.  Visualizing main effects and interactions...

These ado files can be deleted with the ado uninstall command.

. ado uninstall st0081
package st0081 from http://www.stata-journal.com/software/sj5-1
    SJ5-1 st0081.  Visualizing main effects and interactions...

(package uninstalled)

Helpful Sources
http://www.stata.com/
http://www.stata.com/statalist/
Statalist is hosted at the Harvard School of Public Health and is an email listserver where Stata
users, from experts who write Stata programs to users like us, maintain a lively dialogue about
all things statistical and Stata. You can sign up to Statalist so that you can receive, as well as
post, questions through email.

http://ideas.repec.org/s/boc/bocode.html
http://www.princeton.edu/~erp/stata/main.html
http://www.cpc.unc.edu/services/computer/presentations/statatutorial/
http://www.ats.ucla.edu/stat/stata/
