
Training Module: Using Stata for Survey Data Analysis

Ethiopian Economics Association (Ethiopian Economic Policy Research Institute)

September 2009

Background

This is a one-week training module offered as part of EEA/EEPRI training for its members. EEA has a long tradition of organizing short training courses for its members to update their skills and capacity in research, analysis, and planning. The training modules consist of (i) a brief introduction to the rural household survey modules and (ii) using Stata for survey data analysis.

Three characteristics of these modules need to be emphasized because they have implications for the role of the participants. First, the training modules are semi-structured handouts in which trainees use computers to learn different methods of analyzing data. Thus, active participation of the trainees is expected and necessary to get the most out of the training. Second, the training modules focus on how to use computer software to implement a wide range of topics and analytical methods. In order to cover this range of methods, the course cannot provide detailed explanations of the statistical methods themselves, so it is assumed that trainees have some familiarity with concepts such as means, frequency distributions, and regression analysis. Third, the training modules are cumulative in the sense that understanding the material of one day depends on having attended the training course the day before. If you cannot attend the course every day for the full day, it will be difficult to understand the new material.

Objectives

The objective of this training module is to improve the ability of the trainees to use Stata to generate descriptive statistics and tables from survey data, as well as to carry out preliminary linear, non-linear, and panel data regression analysis. In particular, the course aims to train the participants in the following methods:

- basic file management such as opening, modifying, and saving files
- advanced file management such as merging, appending, and aggregating files
- documenting data files with variable labels and value labels
- generating new variables using various functions and operations
- creating tables to describe the distribution of continuous and discrete variables
- creating tables to describe the relationships between two or more variables
- using regression analysis to study the impact of various variables on a dependent variable
- testing hypotheses using statistical methods

Organization of the course

The training course is divided into 16 sections. We will cover some material in all sections, but we may not be able to cover all of the material, depending on the time allocated for the training and the background of the trainees.

Section 1: Introduction to Stata
Section 2: Exploring data files with Stata
Section 3: Storing commands and output
Section 4: Creating new variables
Section 5: Modifying variables
Section 6: Advanced descriptive statistics
Section 7: Presenting data with graphs (graphing data)
Section 8: Normality and outliers
Section 9: Statistical tests
Section 10: Linear regression
Section 11: Logistic regression
Section 12: Panel data analysis (regression)
Section 13: Data management
Section 14: Advanced programming
Section 15: Troubleshooting and updates

Each section will include some training in the use of Stata commands and a practical application of these commands to the analysis of household survey data. The ERHS1999, for example, contains over fifty files, but we will focus our attention on a few of them:

SECTION 1: INTRODUCTION TO STATA

Stata is a package that offers a good combination of ease of learning and power. It has numerous powerful yet simple commands for data management, which allow users to perform complex manipulations with ease. Under Stata/SE, one can have up to 32,768 variables in a Stata data file and up to 11,000 variables in any estimation command. Stata performs most general statistical analyses (regression, logistic regression, ANOVA, factor analysis, and some multivariate analysis). The greatest strengths of Stata are probably in regression and logistic regression. Stata also has a very nice array of robust methods that are very easy to use, including robust regression and regression with robust standard errors, and many other estimation commands include robust standard errors as well.

Stata can easily download programs developed by other users, and you can create your own Stata programs that seamlessly become part of Stata. One can find many cutting-edge statistical procedures already written by other users and incorporate them into one's own Stata programs. Stata uses one-line commands, which can be entered one at a time in the Command window or many at a time in a Stata program.

When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows open depends on which windows were open the last time Stata was used). Each is described briefly below.

The Stata Interface

1. Windows

The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool bar. These are the Stata windows:

Stata Results:         to see recent commands and output
Stata Command:         to enter a command
Stata Browser:         to view the data file (needs to be opened)
Stata Editor:          to edit the data file (needs to be opened)
Stata Viewer:          to get help on how to use Stata
Variables:             to see a list of variables
Review:                to see recent commands
Stata Do-file Editor:  to write or edit a program (needs to be opened)

The Command window on the bottom right is where you'll enter commands. When you press ENTER, they are pasted into the Stata Results window above, which is where you will see your commands executed and view the results. You can also reuse recent commands with the Page Up key (to go to the previous command) and the Page Down key (to go to the next command).

The Results window (with the black background) shows all recent commands, output, error messages, and help. The text is color-coded as follows:

green    general information and the frame and headings of output tables
blue     commands or error messages that can be clicked on for more information
white    Stata commands
yellow   numbers in output tables
red      error messages

The scroll bar on the right side can be used to look at earlier results that are no longer on the screen. However, unlike SPSS, the Stata Results window does not keep all the output generated. It keeps about 300-600 lines of the most recent output, deleting earlier output. If you want to store output in a file, you must use the log command (more on this later).
Stata Browser

This window shows all the data in memory. The Stata Browser does not appear automatically when you start Stata; the only way to open the Browser is to click on the button with a table and magnifying glass. Unlike SPSS, when the Stata Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. In addition, you also cannot change any of the data. You can, however, sort the data or hide certain variables using buttons at the top of the Stata Browser window.
Stata Editor

This window is exactly like the Stata Browser window except that you can change the data. We do not recommend using this window, because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved.
Stata Viewer

This window provides help on Stata commands and rules. To open the Stata Viewer window, you can click on Windows/Viewer or click on the eye button on the tool bar. To use the Stata Viewer window, type a command in the space at the top and the Viewer will give you the purpose and rules for using that command, along with some examples. Any blue text in the Viewer can be clicked on for more information about that command.

Variables

This window (tall with a white background) lists all the variables that exist in memory. When you open a Stata data file, it lists the variables in the file. If you create new variables, they will be added to the list; if you delete variables, they will be removed from it. You can insert a variable into the Stata Command window by clicking on it in the Variables window.

Do-file Editor

This window allows you to write, edit, save, and execute a Stata program. A Stata program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and rerun a set of commands. Exploratory analysis of the data can be done with the Stata Command window, but any serious data analysis should be carried out using the Do-file Editor, not the Stata Command window. The Do-file Editor can be opened by clicking on Windows/Do-file Editor or by clicking on the envelope button.

With so many windows, it is sometimes difficult to fit them all on the screen. You can adjust the size and position of each window the way you like and then save the layout by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be arranged according to your preferred layout.

On the right are two convenient windows. The Variables window keeps a list of your current variables. If you click on one of them, its name will be pasted into the current command at the location of the cursor, which saves a little typing. The Review window keeps a list of all the commands you've typed in the Stata session. Click on one, and it will be pasted into the Command window, which is handy for fixing typos. Double-click, and the command will be pasted and re-executed. You can also export everything in the Review window into a .do file (more on them later) so you can run the exact same commands at any time; to do this, right-click the Review window.

When you first open Stata, all these windows are blank except for the Stata Results window. You can resize these windows independently, and you can resize the outer window as well. To save your window size changes, click on the Prefs button, then Save Windowing Preferences.

Entering commands in Stata works pretty much as you would expect. BACKSPACE deletes the character to the left of the cursor, DELETE the character to the right, the arrow keys move the cursor around, and if you type, the text is inserted at the current location of the cursor. The up arrow does not retrieve previous commands, but you can do that by pressing PAGE UP, or CTRL-R, or by using the Review window.

2. Menus

Stata displays 8 drop-down menus across the top of the outer window, from left to right:

File
  Open                open a Stata data file (use)
  Save/Save as        save the Stata data in memory to disk
  Do                  execute a do-file
  Filename            copy a filename to the command line
  Print               print log or graph
  Exit                quit Stata

Edit
  Copy/Paste          copy text among the Command, Results, and Log windows
  Copy Table          copy table from Results window to another file
  Table copy options  what to do with table lines in Copy Table

Data, Graphics, Statistics   build and run Stata commands from menus
User                         menus for user-supplied Stata commands (downloaded from the Internet)
Window                       bring a Stata window to the front
Help                         Stata command syntax and keyword searches

3. Button bar
The buttons on the button bar are, from left to right (the equivalent command is in bold):

Open a Stata data file: use
Save the Stata data in memory to disk: save
Print a log or graph
Open a log, or suspend/close an open log: log
Open a new Viewer
Bring Graph window to front
New Do-file Editor: doedit
Edit the data in memory: edit
Browse the data in memory: browse
Clear the -more- condition: Space Bar
Stop current command or do-file: Ctrl-Break

SECTION 3: EXPLORING DATA FILES

3.1. Common Stata Syntax

This section covers commands that are used for preliminary exploration of data in a file. Stata commands follow the same syntax:

[by varlist1:] command [varlist2] [if exp] [in range] [weight], [options]

Items inside the square brackets are optional and are not available for every command. This syntax applies to all Stata commands. In order to use the by prefix, the dataset must first be sorted on the by variable(s); the prefix repeats a Stata command on subsets of the data.
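For example, here is a minimal sketch that combines several of these syntax elements (it assumes the ERHScons1999 file used throughout this module is in memory):

. sort q1a
. by q1a: summarize cons if hhsize > 2, detail

This sorts the data by region and then reports detailed summary statistics of consumption separately for each region, using only households with more than two members.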

Logical and relational operators used in Stata:

~     not
==    is equal to
~=    not equal
!=    not equal
>     greater than
>=    greater than or equal
<     less than
<=    less than or equal
&     and
|     or

Note that == represents IS EQUAL TO.

Stata allows four kinds of weights in most commands (please refer to the Stata manual for further information):

1. fweights, or frequency weights, are weights that indicate the number of duplicated observations. They are used when your data set has been collapsed and contains a variable that tells the frequency with which each record occurred.

2. pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included due to the sampling design. pweights are the correct weights to use for sample survey data. The pweight option causes Stata to use the sampling weight as the number of subjects in the population that each observation represents when computing estimates such as proportions, means, and regression parameters. A robust variance estimation technique will automatically be used to adjust for the design characteristics so that variances, standard errors, and confidence intervals are correct.

3. aweights, or analytic weights, are weights that are inversely proportional to the variance of an observation; i.e., the variance of the j-th observation is assumed to be sigma^2/w_j, where w_j are the weights. Typically, the observations represent averages and the weights are the number of elements that gave rise to the average. For most Stata commands, the recorded scale of aweights is irrelevant; Stata internally rescales them to sum to N, the number of observations in your data, when it uses them. Analytic weights are used when you want to compute a linear regression on data that are observed means. Do not use aweights to specify sampling weights. The formulas that use aweights assume that larger weights designate more accurately measured observations, whereas one observation from a sample survey is no more accurately measured than any other observation. Hence, using aweights to specify sampling weights will cause Stata to estimate incorrect values of the variances and standard errors of estimates, and of p-values for hypothesis tests.

4. iweights, or importance weights, are weights that indicate the "importance" of the observation in some vague sense. iweights have no formal statistical definition; any command that supports iweights will define exactly how they are treated. In most cases, they are intended for use by programmers who need to implement their own analytical techniques using some of the available estimation commands. Special care should be taken when using importance weights to understand how they are used in the formulas for estimates and variances. This information is available in the Methods and Formulas section of the Stata manual for each estimation command. In general, these formulas will be incorrect for computing the variance for data from a sample survey.
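To illustrate, here is a minimal sketch of how weights are attached to commands. The weight variables wt and n are hypothetical; the ERHS file used in this module does not contain them:

. mean cons [pweight=wt]              weighted mean with survey-corrected standard errors
. regress cons hhsize [pweight=wt]    weighted regression with robust standard errors
. tabulate q1a [fweight=n]            frequency table from collapsed data, where n holds the counts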

3.2 Examining the dataset

clear

The clear command deletes all data, variables, and labels from memory to get ready to use a new data file. You can clear memory using the clear command or by using the clear option as part of the use command (see the use command). This command does not delete any data saved to the hard disk.

set memory

First, you can check how much memory is allocated to hold your data using the memory command. For instance, we are now running Stata/SE 11 under Windows, and this is what the memory command told us.

Figure 2: Working memory space

. memory
                                          bytes
--------------------------------------------------------------------
Details of set memory usage
     overhead (pointers)                  5,808         0.06%
     data                               107,448         1.02%
     --------------------------------------------------------
     data + overhead                    113,256         1.08%
     free                            10,372,496        98.92%
     --------------------------------------------------------
     Total allocated                 10,485,752       100.00%
--------------------------------------------------------------------
Other memory usage
     set maxvar usage                 1,816,666
     set matsize usage                1,315,200
     programs, saved results, etc.        3,338
     ------------------------------------------
     Total                            3,135,204
--------------------------------------------------------------------
Grand total                          13,620,956

We have about 10 MB free for reading in a data file. Whenever we try to read a data file bigger than this free space, we will get an error message:

no room to add more observations
r(901);

In this case, we have to allocate more memory, say 25 MB (if 25 MB is sufficient for the current file), with the set memory command before trying to use the file.
. set memory 25m

Figure 3: Current memory allocation after the set memory 25m command

Current memory allocation

                   current                              memory usage
    settable       value     description                (1M = 1024k)
    -----------------------------------------------------------------
    set maxvar      5000     max. variables allowed           1.733M
    set memory       25M     max. data space                 25.000M
    set matsize      400     max. RHS vars in models          1.254M
                                                         -----------
                                                             27.987M

Now that we have allocated enough memory, we will be able to read bigger files, provided they fit within the specified memory space. After setting the memory to 25m, the memory command reports:

Figure 4: Adjusted working memory space

. memory
                                          bytes
--------------------------------------------------------------------
Details of set memory usage
     overhead (pointers)                  5,808         0.02%
     data                               107,448         0.41%
     --------------------------------------------------------
     data + overhead                    113,256         0.43%
     free                            26,101,136        99.57%
     --------------------------------------------------------
     Total allocated                 26,214,392       100.00%
--------------------------------------------------------------------
Other memory usage
     set maxvar usage                 1,816,666
     set matsize usage                1,315,200
     programs, saved results, etc.        1,778
     ------------------------------------------
     Total                            3,133,644
--------------------------------------------------------------------
Grand total                          29,348,036

If we want to allocate 250m (250 megabytes) every time we start Stata, we can type:
. set memory 250m, permanently

Stata will then allocate this amount of memory every time it starts.

use

This command opens an existing Stata data file. The syntax is:

use filename [, clear]                                       opens a new file
use [varlist] [if exp] [in range] using filename [, clear]   opens selected parts of a file

If there is no extension, Stata assumes it is .dta. If there is no path, Stata assumes the file is in the current folder. You can use a path name such as: use C:\...\ERHScons1999. If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999". You can open selected variables of a file using a variable list, and you can open selected records of a file using if or in.

Here are some examples of the use command:

use ERHScons1999                           opens the file ERHScons1999.dta for analysis
use ERHScons1999 if q1a == 1               opens data from region 1 only
use ERHScons1999 in 5/25                   opens records 5 through 25 of the file
use hhid hhsize cons using ERHScons1999    opens 3 variables from the ERHScons1999 file
use C:\training\ERHScons1999               opens the file ERHScons1999.dta in the specified folder
use "C:\data files\ERHScons1999"           use quotation marks if the path contains spaces
use ERHScons1999, clear                    clears memory before opening the new file

When running a Do-file program, we often use the use command and the clear option together. For instance, here we load a raw data set from ERHScons1999; the clear option allows Stata to clear the previous data set from memory in order to load the new one.
. use C:\...\ERHScons1999.dta, clear

Stata does not want you to lose the changes that you have made to the data in memory. If you really want to discard those changes, the clear option specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk.

save

The save command saves the dataset as a .dta file under the name you choose. Editing the dataset changes the data in the computer's memory; it does not change the data stored on the computer's disk.
. save C:\...\consumption.dta, replace

The replace option allows you to save a changed file to the disk, replacing the original file. Stata is worried that you will accidentally overwrite your data file; you need the replace option to tell Stata that you know the file exists and you want to replace it.

edit

This command opens the Data Editor window, which allows us to view all observations in memory. You can change the data using the Data Editor window, but we do not recommend doing so, because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved (we will see Do-file programs later).

browse

This window is exactly like the Stata Editor window except that you cannot change the data.
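To tie these commands together, here is a minimal sketch of the recommended workflow: correct the data in a do-file rather than in the Data Editor. The file paths and the correction shown are hypothetical, and the replace command is covered in Section 5:

. use C:\training\ERHScons1999.dta, clear
. replace hhsize = . if hhsize < 0
. save C:\training\ERHScons1999_fixed.dta, replace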

describe

This command provides a brief description of the data file. You can abbreviate it as des or d and Stata will understand. The output includes:

- the number of variables
- the number of observations (records)
- the size of the file
- the list of variables and their characteristics


Example 1: Using describe to show information about a data file


. des

Contains data from C:\training\ERHSCONS1999.dta
  obs:         1,452
 vars:            15                          24 Feb 2007 07:07
 size:       113,256 (98.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
q1a             float  %9.0g       reg        Region
q1b             double %15.0g      w          Wereda
q1c             double %17.0g      pa         Peseant association
q1d             double %12.0g                 Household id
sexh            byte   %8.0g       sexhh      Sex of household head
ageh            float  %9.0g       p1s1q4     Age of household head
cons            float  %9.0g                  consumption per month
food            float  %9.0g                  food cons per month
hhsize          byte   %8.0g                  household size
aeu             float  %9.0g                  adult equivalent units in household
fpi             float  %9.0g                  food price index
rconspc         float  %9.0g                  real consumption per capita 1994 prices
rconsae         float  %9.0g                  real consumption per adult 1994 prices
poor            double %8.2f
hhid            double %12.0f                 selected household unique id
-------------------------------------------------------------------------------
Sorted by:  hhid

The output also provides the following information on each variable in the data file:

- the variable name
- the storage type: byte is used for binary variables, int for integers, and float for continuous variables that may have decimals (to see the limits on each storage type, type help data types)
- the display format, which indicates how the variable will appear in the output
- the value label, which is the name of a set of labels for the different values
- the variable label, which is a name for the variable that is used in output

list

This command lists the values of variables in the data set. The syntax is:

list [varlist] [if exp] [in range]

With varlist, you can specify which variables' values will be presented. If varlist is not specified, all variables will be listed. With if and in, you can specify which records will be listed. Here are some examples:

. list                       lists the entire dataset
. list in 1/10               lists observations 1 through 10
. list hhsize q1a food       lists selected variables
. list hhsize sex in 1/20    lists observations 1-20 for selected variables
. list if q1a < 6            lists cases where region is 1 through 5

if

This qualifier is used to select certain records when carrying out a command. It is similar to the process if command in SPSS, except that in Stata it is not considered a separate command. The syntax is:

command if exp

Examples include:

. list hhid q1a food if food > 2000        lists data if food consumption is above 2000
. tab q1a if cons > 1000 & cons < 2000     frequency table of region if consumption is in the given range
. summarize food if q1a==3 | q1a==4        statistics on food consumption for regions 3 and 4
. browse hhid q1a food if food >= 1200     browse data if food consumption is at least 1200

Note that if conditions always use ==, not a single =. Also note that | indicates or, while & indicates and.

in

We can also use in to select records based on the case number. The syntax is:

command in exp

For example:

. list in 10            lists observation number 10
. summarize in 10/20    summarizes observations 10-20

Example 2: Using list to look at data

. list hhid q1a q1b q1c q1d hhsize rconspc in 10/25

      +--------------------------------------------------------------------+
      |         hhid      q1a     q1b       q1c   q1d   hhsize    rconspc |
      |--------------------------------------------------------------------|
  10. | 101010000010   Tigray   Atsbi   Haresaw    10        4   134.5961 |
  11. | 101010000011   Tigray   Atsbi   Haresaw    11        3   168.9437 |
  12. | 101010000012   Tigray   Atsbi   Haresaw    12        3   135.1815 |
  13. | 101010000013   Tigray   Atsbi   Haresaw    13        7   102.3454 |
  14. | 101010000014   Tigray   Atsbi   Haresaw    14        9   68.04964 |
      |--------------------------------------------------------------------|
  15. | 101010000015   Tigray   Atsbi   Haresaw    15       12   49.61188 |
  16. | 101010000016   Tigray   Atsbi   Haresaw    16        4   85.05015 |
  17. | 101010000017   Tigray   Atsbi   Haresaw    17        5   84.72104 |
  18. | 101010000018   Tigray   Atsbi   Haresaw    18        2   95.42028 |
  19. | 101010000019   Tigray   Atsbi   Haresaw    19       10   140.7843 |
      |--------------------------------------------------------------------|
  20. | 101010000020   Tigray   Atsbi   Haresaw    20        3   80.58356 |
  21. | 101010000021   Tigray   Atsbi   Haresaw    21        3   95.98959 |
  22. | 101010000022   Tigray   Atsbi   Haresaw    22        5   68.05075 |
  23. | 101010000023   Tigray   Atsbi   Haresaw    23        4    52.4964 |
  24. | 101010000024   Tigray   Atsbi   Haresaw    24        3   91.86269 |
      |--------------------------------------------------------------------|
  25. | 101010000025   Tigray   Atsbi   Haresaw    25        5   149.1702 |
      +--------------------------------------------------------------------+

. list q1a cons aeu poor in 200/215

       +------------------------------------+
       |    q1a       cons     aeu   poor |
       |------------------------------------|
  200. | Amhara   661.3979    1.82   0.00 |
  201. | Amhara   321.7693    8.14   1.00 |
  202. | Amhara    169.784     2.3   0.00 |
  203. | Amhara   907.9995    3.14   0.00 |
  204. | Amhara   232.6273   4.148   1.00 |
       |------------------------------------|
  205. | Amhara   432.4525    6.86   1.00 |
  206. | Amhara      59.53    1.46   1.00 |
  207. | Amhara     228.22     3.4   0.00 |
  208. | Amhara   1298.875    5.44   0.00 |
  209. | Amhara    144.494    3.48   1.00 |
       |------------------------------------|
  210. | Amhara    266.974    4.28   0.00 |
  211. | Amhara   43.97179     .74   1.00 |
  212. | Amhara   216.0467   3.408   1.00 |
  213. | Amhara   492.4958    2.94   0.00 |
  214. | Amhara   437.7144    2.46   0.00 |
       |------------------------------------|
  215. | Amhara    166.354    1.74   0.00 |
       +------------------------------------+

If you are not careful with list, you will get a lot more output than you want. If Stata starts giving you more output than you really want, use the stop button (the button with an X).

codebook

The codebook command is a great tool for getting a quick overview of the variables in a data file. It produces a kind of electronic codebook, displaying information about variables' names, labels, and values.


Example 3: Using codebook to look at data


. codebook sexh

sexh                                                 Sex of household head
---------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexhh

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/1452

            tabulation:  Freq.   Numeric  Label
                           400         0  Female
                         1,052         1  Male

. codebook rconspc

rconspc                             real consumption per capita 1994 prices
---------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [4.2201104,1018.2954]        units:  1.000e-07
         unique values:  1448                     missing .:  3/1452

                  mean:   90.3674
              std. dev:   81.9962

           percentiles:      10%       25%       50%       75%       90%
                         25.1043   39.9402   65.9926   114.253   180.891

inspect

This is another useful command for getting a quick overview of a data file. The inspect command displays information about the values of variables and is useful for checking data accuracy.

Example 4: Using inspect to look at data
. inspect sexh

sexh:  Sex of household head              Number of Observations
-----------------------------             -----------------------------------
                                                              Non-
                                            Total  Integers   integers
|   #                    Negative               -         -         -
|   #                    Zero                 400       400         -
|   #                    Positive           1,052     1,052         -
|   #                                      ------    ------    ------
|   #        #           Total              1,452     1,452         -
|   #        #           Missing                -
+---------------------                     ------
    0        1                              1,452
(2 unique values)

sexh is labeled and all values are documented in the label.


count

The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data.

. count
  1452

. count if q1a==3
  466

3.3. Preliminary Descriptive Statistics

tabulate, tab1, tab2

These are three related commands that produce frequency tables for discrete variables. They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable). These commands are similar to the frequency and crosstab commands in SPSS. How do they differ?

tabulate (or tab)  produces a frequency table for one or two variables
tab1               produces a one-way frequency table for each variable in the variable list
tab2               produces all possible two-variable tables from the list of variables

You can use several options with these commands:

all      gives all the tests of association for two-way tables
cell     gives the overall percentage for two-way tables
column   gives column percentages for two-way tables
row      gives row percentages for two-way tables
nofreq   suppresses printing of the frequencies
chi2     provides the chi-squared test for two-way tables

There are many other options, including other statistical tests. For more information, type help tabulate.

Some examples of the tabulate commands are:

. tabulate q1a                        frequency table of region
. tabulate q1a sexh                   cross-tab of frequencies by region and sex of head
. tabulate q1a hhsize, row            cross-tab by region and household size with row percentages
. tabulate sexh hhsize, cell nofreq   cross-tab of overall percentages by sex and household size
. tab1 q1a q1b hhsize                 three tables: a frequency table for each variable
. tab2 q1a poor sexh                  three tables: a cross-tab of each pair of variables


Example 5: Using tabulate on categorical variables


. tab q1b

         Wereda |      Freq.     Percent        Cum.
----------------+-----------------------------------
          Atsbi |         84        5.79        5.79
   Sebhassahsie |         66        4.55       10.33
        Ankober |         86        5.92       16.25
Basso na Worana |        175       12.05       28.31
        Enemayi |         61        4.20       32.51
         Bugena |        144        9.92       42.42
           Adaa |         95        6.54       48.97
          Kersa |         95        6.54       55.51
         Dodota |        109        7.51       63.02
     Shashemene |         97        6.68       69.70
          Cheha |         65        4.48       74.17
  Kedida Gamela |         74        5.10       79.27
           Bule |        134        9.23       88.50
         Boloso |         96        6.61       95.11
       Daramalo |         71        4.89      100.00
----------------+-----------------------------------
          Total |      1,452      100.00

. tab q1b sexh

                |  Sex of household head
         Wereda |    Female       Male |     Total
----------------+----------------------+----------
          Atsbi |        48         36 |        84
   Sebhassahsie |        29         37 |        66
        Ankober |        13         73 |        86
Basso na Worana |        52        123 |       175
        Enemayi |        11         50 |        61
         Bugena |        55         89 |       144
           Adaa |        23         72 |        95
          Kersa |        31         64 |        95
         Dodota |        26         83 |       109
     Shashemene |        26         71 |        97
          Cheha |        22         43 |        65
  Kedida Gamela |        15         59 |        74
           Bule |        11        123 |       134
         Boloso |        25         71 |        96
       Daramalo |        13         58 |        71
----------------+----------------------+----------
          Total |       400      1,052 |     1,452

In one-way tables, Stata gives the count, the percentage, and the cumulative percentage (see the first example in the box). In two-way tables, Stata gives the count only, unless you ask for other statistics (see the second example in the box); col, row, and cell request Stata to include percentages in two-way tables.

summarize

The summarize command produces statistics on continuous variables such as age, food, cons, and hhsize. The syntax looks like this:

summarize [varlist] [if exp] [in range] [, detail]

By default, it produces the following statistics:

- number of observations
- average (or mean)
- standard deviation
- minimum
- maximum

If you specify detail, Stata gives you additional statistics, such as:

- skewness and kurtosis
- the four smallest and the four largest values
- various percentiles

Here are some examples:

. summarize                          gives statistics on all variables
. summarize hhsize food              gives statistics on selected variables
. summarize hhsize cons if q1a==3    gives statistics on two variables for one region

Example 6: Using summarize to study continuous variables


. sum rconspc rconsae hhsize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |      1449    90.36742    81.99623    4.22011   1018.295
     rconsae |      1449    108.7874    97.27053   4.811201   1212.256
      hhsize |      1452    5.782369    2.740968          1         17

. sum rconspc rconsae hhsize if q1a==4

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |       395    111.6185    99.09839   8.393298   1018.295
     rconsae |       395    132.6018    116.6133   9.608795   1212.256
      hhsize |       396    6.209596    2.853203          1         16

The first example gives the statistics for the whole sample, while the second gives the statistics only for households in Region 4 (Oromia).

by

This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:

by varlist: command

Note: the data must first be sorted on the by variable(s). The bysort prefix is most commonly used because it sorts the data and applies by in a single step.
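For instance, the following two approaches are equivalent (a small sketch using the ERHS consumption file):

. sort q1a
. by q1a: summarize hhsize

. bysort q1a: summarize hhsize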


Some examples of the by prefix are:

. bysort sexh: sum rconspc    for each sex of household head, gives statistics on real per capita consumption

Example 7: Using the by prefix


-> sexh = Female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |       398    100.2183    89.18895   7.068164   624.1437

-----------------------------------------------------------------------
-> sexh = Male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |      1051    86.63701    78.82594    4.22011   1018.295

help

The help command gives you information about any Stata command or topic:

help [command]

For example:

. help tabulate     gives a description of the tabulate command
. help summarize    gives a description of the summarize command

SECTION 4: STORING COMMANDS AND OUTPUT

In this section, we discuss how to store commands and output for later use. First, we describe how to store commands using a program (Stata calls it a Do-file), how to edit the program, and how to run it. Second, we present different ways of saving and using the output generated by Stata. The following topics are covered:

- using the Do-file Editor
- log using
- log off
- log on
- log close

Using the Do-file Editor

The Do-file Editor allows you to store a program (a set of commands) so that you can edit it and execute it later. Why use the Do-file Editor?

- It makes it easier to check and fix errors.
- It allows you to run the commands later.
- It lets you show others how you got your results.
- It allows you to collaborate with others on the analysis.

In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands. To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on the tool bar.

Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of editing functions. The menu bar is similar to the one in Microsoft Word:

File/New          open a new, blank Do-file
File/Open         open an existing Do-file
File/Save         save the current Do-file
File/Save as      save the current Do-file under a new name
File/Insert file  insert another file into the current one
File/Print        print the Do-file
File/Close        close the Do-file
Edit/Undo         undo the last command
Edit/Cut          delete or move the marked text in the Do-file
Edit/Copy         copy the marked text in the Do-file
Edit/Paste        insert the copied or cut text into the Do-file
Search/Find       find a word or phrase in the Do-file
Search/Replace    find and replace a word or phrase in the Do-file
Tools/Do          execute all the commands or the marked commands in the Do-file
Tools/Run         execute all the commands or the marked commands in the Do-file without showing any output in the Stata Results window

The tool bar buttons can be used to carry out some of these tasks more quickly. For example, there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy, Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the last one, which shows a page with text on it. This is the Do button for executing the program or the marked part of the program.

Finally, the keyboard commands may be even quicker to use than the buttons. The most useful keyboard commands are:

Control-O   Open file
Control-S   Save file
Control-C   Copy
Control-X   Cut
Control-V   Paste
Control-Z   Undo
Control-F   Find
Control-H   Find and Replace

To run the commands in a Do-file, you can click on the Do button (the last one) or click on Tools/Do. If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button. You do not have to mark the whole command, but at least one character in the command must be marked in order for the command to be executed (unlike SPSS, it is not enough to have the cursor on a command).

Although layout is a matter of personal preference, it may be useful to have the Stata Results window and the other windows on one side of the screen and the Do-file Editor window on the other. This makes it easy to switch back and forth. When you arrange the windows the way you like, you can save the layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your chosen layout.

Note: if you would like to add a note to a do-file but do not want Stata to execute it, enclose the note in /* */.

/* This Stata program illustrates how to create a do-file */
log using C:\...\eeatraining.log, replace
log close

Saving the Output

As mentioned in an earlier section, the Stata Results window does not keep all the output you generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results. You can increase the amount of memory allocated to the Stata Results window, but even this will probably not be enough for a long session with Stata. Thus, we need to use log to save the output. There are four ways to control the log operations:

1. You can use the log button on the tool bar. It looks like a scroll.
2. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log off), and Resume (log on).
3. You can use log commands in the Stata Command window.
4. You can use log commands in the Stata Do-file Editor.

In this section, we describe the commands, which can be used in the Stata Command window or in a do-file (program).

log using

This command creates a file with a copy of all the commands and output from Stata. The first time you open a log, you must give a name to the new file to be created. The syntax is:

log using filename [, append replace [ text | smcl ] ]

where filename is the name you give the new file. The options are:

append    adds the output to an existing file
replace   replaces an existing file with the output
text      tells Stata to create the log file in text (ASCII) format
smcl      tells Stata to create the log file in SMCL format

Here are some examples:

log using temp22                    saves output to a file called temp22
log using temp20, replace           saves output to an existing file, temp20, replacing its contents
log using regoutput, append         saves output to an existing file, regoutput, adding to its contents
log using "d:\my data\myfile.txt"   saves output to the specified file in the specified folder
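Below is a minimal sketch of a complete logging session (the log off, log on, and log close commands are described next; the file path is hypothetical):

. log using C:\training\session1.log, replace text
. describe
. log off                output from the next command is not recorded
. list in 1/5
. log on                 recording resumes
. summarize cons
. log close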


Several points should be remembered in using this command:

- If you use an existing file name but do not specify replace or append, Stata will give an error message that the file already exists.
- Log files in text format can be opened with WordPad, Notepad, the DOS editor, or any word processor, but the file does not have any formatting.
- SMCL files have formatting (bold, colors, etc.) but can only be opened with Stata. SMCL format is the default.

log off

This command temporarily turns off the logging of output, so that any subsequent output is not copied to the log file. This is useful if you want to save some of the output but not all. log off only works after a log using command.

log on

This command restarts the logging, copying any new output to the log file that was already defined. log on only works after a log using and a log off command.

log close

This command turns off the logging and saves the file. How are log off and log close different? log off allows you to turn logging back on easily with log on, continuing to use the same log file. After a log close, however, the only way to start logging again is with log using.

set logtype text

This command tells Stata to always save log files in text (ASCII) format. It is the same as adding the text subcommand to every log using command, but it is easier. If you prefer text-format log files, this is the best way to make sure all the log files are in this format.

set logtype smcl

This command tells Stata to always save log files in SMCL format. It is the same as adding the smcl subcommand to every log using command.

Exercise 1: Exploring the ERHS

This section includes some questions that you can answer using the r5ERHS files provided on your computer and the commands described in this section. Remember two tricks to make it easier to fix your mistakes:

- You can use PageUp to retrieve the most recent command.
- You can click on a variable in the Variables window to paste it into the Command window.


Summary file

The file ERHScons1999 contains summary variables calculated from various other data files. It is at the household level. Open the file by entering use C:\training\ERHScons1999.dta, clear in the Command window and pressing Return. Open do and log files to save your commands and output. Using the log file, copy and paste some of the output tables into Excel and Word files.

1. How many variables and how many records are in ERHScons1999?
2. What percentage of households have female heads?
3. Is there a statistically significant difference between the percentage of female-headed households among the poor and the non-poor?
4. What percentage of Amhara households are considered poor?
5. What percentage of households are in the SNNP region?
6. How does the percentage of female-headed households vary by region?
7. What is the average size of a household?
8. What is the average size of a household in the Oromia region?
9. How does household size vary across poverty status? (Use the poor variable.)

Household members

The file p1sec1_rv1 contains information about each member of the household. It is at the individual level (each record is a person). You can answer the following questions using this file:

1. What percentage of individuals are female?
2. What percentage of individuals over 45 years old are female?
3. What percentage of individuals under 5 are female?
4. What percentage of women are married?
5. What percentage of women over the age of 18 are married?
6. Does this percentage vary among regions?
7. What is the status of individuals as compared to round 4?
8. What is the reason given for household members who left since round 4?
9. What was the major occupation of the household head?
10. What was the major occupation of household members aged 7 to 15?

Food and cash crops

The file p2s1b_rv1 contains information on production of food and cash crops. The data are at the crop level, meaning that each record represents one crop for one household. Only crops that are grown by each household are included in the file. The crop codes and labels are given in the variable crop. You can answer the following questions with this file:

1. How many households in the sample grow maize and wheat?
2. Among maize growers, what was the average area planted with maize?
3. Among maize growers, what was the average amount of maize harvested?
4. Among wheat growers, what was the average amount of wheat harvested?
5. Does the average amount of maize harvested vary among regions?
6. Does the average amount of wheat harvested vary among regions?
7. Among farmers with more than 1 hectare of maize, what was the average amount of maize harvested?
8. What is the average amount harvested for the major cereal crops (teff, barley, wheat, maize, and sorghum)?
9. Farmers were asked "Was any of the land cultivated under the new extension program?" What was the average response?
10. Farmers were also asked "Was any of the land cultivated irrigated?" and what percentage of the land was irrigated. Explore these variables.

SECTION 5: CREATING NEW VARIABLES

In the previous sections, we described how to explore the data using existing variables. In this section, we discuss how to create new variables. When new variables are created, they are in memory and will appear in the Data Browser, but they will not be saved to the hard disk unless you use the save command. In this section, we will cover the following commands and topics:

- generate
- replace
- tab, generate
- egen
- operators
- functions
- recode
- xtile

generate

This command is used to create a new variable. It is similar to compute in SPSS. The syntax is:

generate newvar = exp [if exp]
where exp is an expression such as price*quant or 1000*kg. Several points about this command:

- Unlike compute in SPSS, generate cannot be used to change the definition of an existing variable. If you want to change an existing variable, you need to use replace.
- You can use gen or g as an abbreviation for generate.
- If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.
- If you use if, the new variable will have missing values where the if condition is false.

For example:

generate age2 = age*age                creates an age-squared variable
gen yield = outputkg/area if area>0    creates a new yield variable where area is positive
gen price = value/quant if quant>0     creates a new price variable where quantity is positive
gen highprice = (price>1000)           creates a dummy variable equal to 1 for high prices


replace

This command is used to change the definition of an existing variable. The syntax is the same:

replace oldvar = exp [if exp] [in exp]

Some points to remember:

- replace cannot be used to create a new variable. Stata will give an error message if the variable does not exist.
- There is no abbreviation for replace; Stata wants to make sure you really want to change the variable.
- If you use the if option, the old values will be retained where the if condition is false.
- You can use the period (.) to represent missing values.

For example:

replace price = avgprice if price > 1000    replaces high values with an average price
replace income = . if income<=0             replaces non-positive income with a missing value
replace age = 25 in 1007                    replaces age with 25 in observation #1007

tabulate, generate

This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable. The syntax is:

tabulate oldvariable, generate(newvariable)

The old variable is a categorical (or discrete) variable. The new variables will take the form newvariable1, newvariable2, newvariable3, etc., where the x-th new variable equals 1 for the x-th category of the old variable and 0 otherwise. It is easier to explain with an example. The variable q1a (region) takes the values 1, 3, 4, 7, 8, and 9 for the different regions of Ethiopia. We can create six dummy variables as follows:

tab q1a, gen(region)

This creates 6 new variables:

region1 = 1 if q1a==1 (Tigray) and 0 otherwise
region2 = 1 if q1a==3 (Amhara) and 0 otherwise
region3 = 1 if q1a==4 (Oromia) and 0 otherwise
...
region6 = 1 if q1a==9 and 0 otherwise

In Example 8, notice that there are 396 households in region 4 (Oromia), and the same number of households for which region3=1.


Example 8: Using tab, gen to create dummy variables


. tab q1a, gen(region)

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
          7 |        139        9.57       79.27
          8 |        134        9.23       88.50
          9 |        167       11.50      100.00
------------+-----------------------------------
      Total |      1,452      100.00

. tab region3

 q1a==Oromia |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |      1,056       72.73       72.73
           1 |        396       27.27      100.00
-------------+-----------------------------------
       Total |      1,452      100.00

egen

This is an extended version of generate (egen stands for "extended generate") that creates a new variable by aggregating the existing data. It is a powerful and useful command that does not exist in SPSS: it adds summary statistics to each observation. To do the same thing in SPSS, you would need to create a new file with aggregate and merge it with the original file using match files. The syntax is:

egen newvar = fcn(arguments) [if exp] [in range], by(var)

where newvar is the new variable to be created, and fcn is one of numerous functions such as:

count()    number of non-missing values
diff()     compares variables: 1 if different, 0 otherwise
fill()     fill with a pattern
group()    creates a group id from a list of variables
iqr()      interquartile range
ma()       moving average
max()      maximum value
mean()     mean
median()   median
min()      minimum value
pctile()   percentile
rank()     rank
rmean()    mean across variables
sd()       standard deviation
std()      standardized variables
sum()      sum


The argument is normally just a variable, and var in the by() subcommand must be a categorical variable. Here are some other examples:

egen avg = mean(yield)                   creates a variable with the average yield over the entire sample
egen avg2 = median(income), by(sex)      creates a variable with the median income for each sex
egen regprod = sum(prod), by(region)     creates a variable with total production for each region

Example 9: Using egen to calculate averages


. egen avecon = mean(cons), by(q1c)
. gen highavecon = (cons > avecon)
. list hhid q1c cons avecon highavecon in 650/675

       +----------------------------------------------------------------+
       |         hhid              q1c       cons     avecon   highav~n |
       |----------------------------------------------------------------|
  650. | 407070000039   Sirbana Godeti    673.582   940.6532          0 |
  651. | 407070000040   Sirbana Godeti     793.05   940.6532          0 |
  652. | 407070000041   Sirbana Godeti    985.257   940.6532          1 |
  653. | 407070000042   Sirbana Godeti    844.477   940.6532          0 |
  654. | 407070000043   Sirbana Godeti    946.014   940.6532          1 |
       |----------------------------------------------------------------|
  655. | 407070000044   Sirbana Godeti   2206.057   940.6532          1 |
  656. | 407070000045   Sirbana Godeti   570.0535   940.6532          0 |
  657. | 407070000046   Sirbana Godeti   1340.926   940.6532          1 |
  658. | 407070000047   Sirbana Godeti    901.222   940.6532          0 |
  659. | 407070000048   Sirbana Godeti    887.775   940.6532          0 |
       |----------------------------------------------------------------|
  660. | 407070000049   Sirbana Godeti   1026.795   940.6532          1 |
  661. | 407070000051   Sirbana Godeti   1392.845   940.6532          1 |
  662. | 407070000052   Sirbana Godeti    574.218   940.6532          0 |
  663. | 407070000053   Sirbana Godeti     363.63   940.6532          0 |
  664. | 407070000054   Sirbana Godeti    926.551   940.6532          0 |
       |----------------------------------------------------------------|
  665. | 407070000055   Sirbana Godeti   1256.021   940.6532          1 |
  666. | 407070000057   Sirbana Godeti    753.478   940.6532          0 |
  667. | 407070000058   Sirbana Godeti   1378.575   940.6532          1 |
  668. | 407070000059   Sirbana Godeti   1640.834   940.6532          1 |
  669. | 407070000060   Sirbana Godeti    472.841   940.6532          0 |
       |----------------------------------------------------------------|
  670. | 407070000062   Sirbana Godeti    721.425   940.6532          0 |
  671. | 407070000063   Sirbana Godeti   1341.702   940.6532          1 |
  672. | 407070000064   Sirbana Godeti     781.82   940.6532          0 |
  673. | 407070000065   Sirbana Godeti   1962.697   940.6532          1 |
  674. | 407070000070   Sirbana Godeti    945.045   940.6532          1 |
       |----------------------------------------------------------------|
  675. | 407070000071   Sirbana Godeti   1742.247   940.6532          1 |
       +----------------------------------------------------------------+

In Example 9, we want to know which households have expenditure (cons) above the village average. First, we calculate the average expenditure for each village with the egen command. Then we create a dummy variable based on the expression (cons > avecon). The list output shows how the village average is repeated for every household in the village and confirms that the dummy variable is correctly calculated.


operators

This is not a Stata command but a topic related to creating new variables. Most of the operators are obvious, but some are not. Unlike SPSS, you cannot use words like or, and, eq, or gt.

Arithmetic
+    addition
-    subtraction
*    multiplication
/    division
^    power

Relational
>    greater than
<    less than
>=   greater than or equal
<=   less than or equal
==   equal
~=   not equal
!=   not equal

Logical
~    not
|    or
&    and

Note: the most difficult rule to remember is when to use = and when to use ==. Use a single equals sign (=) when defining a variable. Use a double equals sign (==) when testing equality, such as in an if condition or when creating a dummy variable.

Here are some examples to illustrate the use of these operators. Suppose you want to create a dummy variable indicating households in the Amhara region. One way is to write:

generate AmD = 0
replace AmD = 1 if q1a==3

Or you can get exactly the same result with just one command:

generate AmD = (q1a==3)

If the expression in parentheses is true, the value is set to 1; if it is false, the value is 0. Logical operators are useful if you want to impose more than one condition. For example, suppose you want to create a dummy variable for female-headed households in Dodota. In other words, a household must have a female head and be in Dodota wereda to be selected.


gen DDfemale = 0
replace DDfemale = 1 if q1b==9 & sexh==0

or an easier way to do this would be:

gen DDfemale = (q1b==9 & sexh==0)

Or suppose you wanted to create a dummy variable for households in two regions (Amhara and Oromia). This variable can be created with:

gen amaoro = 0
replace amaoro = 1 if q1a==3 | q1a==4

or by one command:

gen amaoro = (q1a==3 | q1a==4)

You can also combine conditions using parentheses. Suppose you wanted a dummy variable that indicates whether a household is a poor farmer in either the Tigray or the Amhara region. We will define poor as being in the bottom 20 percent and use the variable poor:

gen PDF = ((q1a==1 | q1a==3) & poor==1)

Note: here is a list of some of the more commonly used additional functions for creating new variables in Stata. Other functions can be found by typing help functions in the Stata Command window.

abs(x)         computes the absolute value of x
exp(x)         calculates e to the power x
ln(x)          computes the natural logarithm of x
log(x)         is a synonym for ln(x), the natural logarithm
log10(x)       computes the log base 10 of x
sqrt(x)        computes the square root of x
invnorm(p)     provides the inverse cumulative normal; invnorm(norm(z)) = z
normden(z)     provides the standard normal density
normden(z,s)   provides the normal density: normden(z,s) = normden(z)/s if s>0 and s is not missing; otherwise, the result is missing
norm(z)        provides the cumulative standard normal
group(x)       creates a categorical variable that divides the data into x nearly equal-sized subsamples, numbering the first group 1, the second group 2, etc.; it uses the current order of the data
int(x)         gives the integer obtained by truncating x
round(x,y)     gives x rounded into units of y
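As a quick sketch of how these functions are used (the new variable names are hypothetical):

. gen lncons = ln(cons)            natural log of consumption
. gen agedecade = int(ageh/10)     completed decades of the household head's age
. gen cons_r = round(cons, 1)      consumption rounded to the nearest birr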


recode

This command changes the values of a categorical variable according to the rules specified. It is like the recode command in SPSS, except that in Stata you do not necessarily use parentheses. The syntax is:

recode varname old=new [old=new ...] [if exp] [in range]

Here are some examples:

recode x 1=2             changes all values of x=1 to x=2
recode x 1=2 3=4         changes 1 to 2 and 3 to 4
recode x 1=2 2=1         exchanges the values 1 and 2 in x
recode x 1=2 *=3         changes 1 in x to 2 and all other values to 3
recode x 1/5=2           changes 1 through 5 in x to 2
recode x 1 3 4 5 = 6     changes 1, 3, 4, and 5 to 6
recode x .=9             changes missing to 9
recode x 9=.             changes 9 to missing

Notice that you can use some special symbols in the rules:

*      means all other values
.      means missing values
x/y    means all values from x to y
x y    means x and y

For example, to recode region values 8 and 9 to 7:

Example 10: Using recode to redefine a variable
. tab q1a

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
          7 |        139        9.57       79.27
          8 |        134        9.23       88.50
          9 |        167       11.50      100.00
------------+-----------------------------------
      Total |      1,452      100.00

. recode q1a 8 9=7
(q1a: 301 changes made)

. tab q1a

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
          7 |        440       30.30      100.00
------------+-----------------------------------
      Total |      1,452      100.00


xtile

This command creates a new variable that indicates which category a record falls into when the sample is sorted by an existing variable and divided into n groups of equal size. It is probably easier to explain with examples: xtile can be used to create a variable that indicates which income quintile a household belongs to, which decile it is in in terms of farm size, or which tercile in terms of coffee production. The syntax is:

xtile newvar = variable [if exp] [in range], nq(#)

where newvar is the new categorical variable created; variable is the existing variable used to create the quantiles (e.g., income or farm size); and # is the number of categories (e.g., 5 for quintiles, 3 for terciles). For example:

xtile incquint = income, nq(5)
xtile farmdec = farmsize, nq(10)

Suppose we want to create a variable indicating the deciles of expenditure per adult equivalent.
. xtile rconseadec = rconsae, nq(10)

. tab rconseadec

10 quantiles |
  of rconsae |      Freq.     Percent        Cum.
-------------+-----------------------------------
           1 |        145       10.01       10.01
           2 |        145       10.01       20.01
           3 |        145       10.01       30.02
           4 |        145       10.01       40.03
           5 |        145       10.01       50.03
           6 |        145       10.01       60.04
           7 |        145       10.01       70.05
           8 |        145       10.01       80.06
           9 |        145       10.01       90.06
          10 |        144        9.94      100.00
-------------+-----------------------------------
       Total |      1,449      100.00

. tab rconseadec sexh, col nofre

10 quantiles | Sex of household head
  of rconsae |    Female       Male |     Total
-------------+----------------------+----------
           1 |      7.79      10.85 |     10.01
           2 |     10.30       9.90 |     10.01
           3 |      8.04      10.75 |     10.01
           4 |     10.30       9.90 |     10.01
           5 |      8.79      10.47 |     10.01
           6 |     10.30       9.90 |     10.01
           7 |     10.55       9.80 |     10.01
           8 |     10.05       9.99 |     10.01
           9 |     10.05       9.99 |     10.01
          10 |     13.82       8.47 |      9.94
-------------+----------------------+----------
       Total |    100.00     100.00 |    100.00


Exercise 2
1. Use the file ERHScons1999. Create a variable called reg4 which indicates whether a household is in the Oromia region or in the other regions. Then do a frequency table of the new variable.
2. Using the same file, create a variable called hhquint that indicates the quintile of household size. Then do a frequency table on the new variable.
3. Using the same file, create a dummy variable called enbug that is equal to 1 if the household is in the Enemayi or Bugena weredas and 0 otherwise. Then do a frequency table on the new variable.
4. Create a new variable avgexp which is equal to the wereda average of food expenditure (food). Then calculate a new variable equal to the difference between the household food expenditure and the wereda average expenditure.
5. Using the same file, create a new variable splot which is 1 if the person is cultivating a single plot and 0 otherwise.
6. Use file p1sec1_rv1. Create a set of dummy variables called relatxx based on the relationship of the person to the household head. For example, relat01 is a dummy for being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on.

SECTION 6: MODIFYING VARIABLES
In this section, we introduce some more powerful and flexible commands for generating results from survey data. We begin with an explanation of how to label data in Stata, and then see how to format variables. These are the topics and commands covered in this section:

rename variable
label variable
label define
label values
format variable

rename variables
This command is used to rename a variable, giving it a new name. The command is:

. rename old_variable new_variable

For instance, generate regional dummy variables and then rename them:

Example 12: Renaming variables
. tab q1a, gen(index)

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
       SNNP |        440       30.30      100.00
------------+-----------------------------------
      Total |      1,452      100.00


. tab index1

 q1a==Tigray |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |      1,302       89.67       89.67
           1 |        150       10.33      100.00
-------------+-----------------------------------
       Total |      1,452      100.00

. tab index2

 q1a==Amhara |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |        986       67.91       67.91
           1 |        466       32.09      100.00
-------------+-----------------------------------
       Total |      1,452      100.00

rename index1 Tigray
rename index2 Amhara
rename index3 Oromia
rename index4 SNNP

These commands rename the index1 variable to Tigray, index2 to Amhara, index3 to Oromia, and index4 to SNNP.

label variable
This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that Tigray is region 1 and SNNP is region 7, so we may want to label the variables as follows:
label variable Tigray "Region 1"
label variable Amhara "Region 3"
label variable Oromia "Region 4"
label variable SNNP "Region 7"

You can use the abbreviation label var. If there are spaces in the label, you must use double quotation marks; if there are no spaces, quotation marks are optional. This command is like variable label in SPSS, except that you can only label one variable per command and Stata uses double quotation marks, not single. The limit is 80 characters for a label, but any label over 30 characters will probably not look good in a table.

label define
This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region; instead of numbering the different sources of income, we can give them labels. The syntax is:

label define lblname # "label" # "label" # "label" [, add modify]

where
lblname      is the name given to the set of value labels
#            are the value numbers

"label"      are the value labels
add          means that you want to add these value labels to the existing set
modify       means that you want to change these values in the existing set

Note that:
You can use the abbreviation label def. The double quotation marks are only necessary if there are spaces in the labels. Stata will not let you redefine an existing label unless you specify modify or add. This command is similar to value label in SPSS, except that in Stata you give the labels a name and later attach it to the variable, while in SPSS you attach the labels to the variable in the same command.


label values
This command attaches a named set of value labels to a categorical variable. The syntax is:

label values varname [lblname] [, nofix]

where
varname      is the categorical variable which will get the labels
lblname      is a set of labels that have already been defined by label define

Here are some examples of labeling values in Stata:

label define reg 1"Tigray" 3"Amhara" 4"Oromia" 7"SNNP", modify
label values q1a reg

Some additional commands that may be useful in labeling:

label dir             requests a list of existing label names
label list            requests a list of all the existing value labels
label drop            deletes one or more labels
label save using      saves label definitions as a do-file
label data            gives a label to a data file
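These utilities can be chained after the label define above (a small sketch; the file name reglab.do and the data label text are only illustrative):

label dir
label list reg
label data "ERHS 1999 training file"
label save reg using reglab.do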

format
The format command allows you to specify the display format for variables. The internal precision of the variables is unaffected. The syntax for the format command is
. format varlist %fmt

where %fmt is one of the following:

%fmt        description                              example
-----------------------------------------------------------------
Right-justified formats
  %#.#g     general numeric format                   %9.0g
  %#.#f     fixed numeric format                     %9.2f
  %#.#e     exponential numeric format               %10.7e
  %d        default numeric elapsed date format      %d
  %d...     user-specified elapsed date format       %dM/D/Y
  %#s       string format                            %15s

Right-justified, comma formats
  %#.#gc    general numeric format                   %9.0gc
  %#.#fc    fixed numeric format                     %9.2fc

Leading-zero formats
  %0#.#f    fixed numeric format                     %09.2f
  %0#s      string format                            %015s

Left-justified formats
  %-#.#g    general numeric format                   %-9.0g
  %-#.#f    fixed numeric format                     %-9.2f
  %-#.#e    exponential numeric format               %-10.7e
  %-d       default numeric elapsed date format      %-d
  %-d...    user-specified elapsed date format       %-dM/D/Y
  %-#s      string format                            %-15s

Left-justified, comma formats
  %-#.#gc   general numeric format                   %-9.0gc
  %-#.#fc   fixed numeric format                     %-9.2fc

Centered formats
  %~#s      string format (special)                  %~15s
-----------------------------------------------------------------
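For instance (a minimal sketch, assuming the consumption variable cons is in memory):

format cons %9.2f        display cons with two decimals; stored values are unchanged
list cons in 1/5         the new display format applies in all subsequent output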

Exercise 3
1. Use exercise 2 and label the values and variables of the newly created variables.
2. Label the data file with "This data is used for training".
3. List the existing label names.

SECTION 7: ADVANCED DESCRIPTIVE STATISTICS
In Section 3, we saw preliminary descriptive statistics, mostly applied to explore the nature of the data. In this section we explore more advanced statistics.

tabulate summarize
This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the summarize option, we can put means and other statistics of a continuous variable in each cell. The syntax is:

tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options

where
varname1     is a categorical row variable
varname2     is a categorical column variable (optional)
varname3     is the continuous variable summarized in each cell
options      can be used to tell Stata which statistics you want

Some notes regarding this command: The default statistics are the mean, the standard deviation, and the frequency. You can specify which statistics to show with the options means, standard, and freq. You can use the abbreviation tabsum.
Some examples:

tab q1a, sum(cons)          gives the mean, std. deviation, and frequency of expenditure for each region
tab q1b, sum(cons) mean     gives the mean consumption for each village
tab q1a sexh, sum(food)     gives the mean, std. deviation, and frequency in each cell of household head sex by region


The first table is a one-way table (just one categorical variable) showing the mean, standard deviation, and frequency of expenditure for each region. In the second table, we use the mean option so only mean expenditure is shown. In the third table, we add a second categorical variable (sexh), making it a two-way table. Although we could have requested all the default statistics in the two-way table, that makes the table difficult to read, so we do not advise it.

Example 13: Using tabulate with the sum() option to generate tables


. tab q1a, sum(cons)

            | Summary of consumption per month
     Region |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     Tigray |   413.93552     297.701         149
     Amhara |   545.91653   467.28072         465
     Oromia |   697.09029   478.55749         395
       SNNP |    331.7384   221.15601         440
------------+------------------------------------
      Total |   508.51838    420.4014        1449

. tab q1b, sum(cons) mean

            | Summary of consumption per month
     Wereda |        Mean
------------+-------------
      Atsbi |   417.16834
  Sebhassah |      409.87
    Ankober |   301.87563
   Basso na |   777.31823
    Enemayi |     234.392
     Bugena |   542.38657
       Adaa |   940.65322
      Kersa |   567.89355
     Dodota |   526.58473
  Shashemen |   775.34926
      Cheha |   342.54209
  Kedida Ga |   239.09955
       Bule |   379.28676
     Boloso |   266.93705
   Daramalo |   416.28045
------------+-------------
      Total |   508.51838

. tab q1a sexh, sum(cons)

Means, Standard Deviations and Frequencies of consumption per month

           |  Sex of household head
    Region |    Female       Male |      Total
-----------+----------------------+-----------
    Tigray | 342.44136   488.3678 |  413.93552
           | 277.62091  301.46008 |    297.701
           |        76         73 |        149
-----------+----------------------+-----------
    Amhara | 450.61424  582.89951 |  545.91653
           | 368.60452  495.93838 |  467.28072
           |       130        335 |        465
-----------+----------------------+-----------
    Oromia | 610.49528  728.85178 |  697.09029
           | 518.32024  459.98768 |  478.55749
           |       106        289 |        395
-----------+----------------------+-----------
      SNNP | 271.02927  346.48695 |   331.7384
           | 171.91652  229.33158 |  221.15601
           |        86        354 |        440
-----------+----------------------+-----------
     Total |  433.7347  536.83799 |  508.51838
           | 389.69001  428.24021 |   420.4014
           |       398       1051 |       1449

tabstat
This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:

tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)

where
varlist      is a list of continuous variables
statname     is a type of statistic
varname      is a categorical variable
Some facts about this command: The default statistic is the mean. Optional statistics include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pn (the nth percentile). Without the by() option, tabstat is like summarize except that it allows you to specify the list of statistics to be displayed. With the by() option, tabstat is like tabulate, summarize() except that tabstat is more flexible in the statistics and format. It is very similar to the SPSS command means.

Examples:

tabstat food hhsize, stats(mean max min)     gives mean, max, and min of food & hhsize
tabstat food hhsize, by(q1a)                 gives the mean of the two variables for each region
tabstat food, stats(median) by(q1a)          gives the median food consumption for each region

The tabstat command displays summary statistics for a series of numeric variables in a single table.


Example 14: Using tabstat to create a table

. tabstat rconsae, s(mean p50 sd cv min max) by(rconseadec) missing
Summary for variables: rconsae
by categories of: rconseadec (10 quantiles of rconsae)

rconseadec |      mean       p50        sd        cv       min       max
-----------+-------------------------------------------------------------
         1 |  21.80935   21.9194  5.773654   .264733  4.811201  30.40175
         2 |  36.24088  36.03099  3.400392  .0938275   30.6191  42.70621
         3 |  48.52454  48.31921   3.09388  .0637591  42.74319  53.91997
         4 |  60.38483   60.0903  3.811244  .0631159  54.00354  66.85229
         5 |  73.09496  72.92955   3.61339  .0494342  66.90016  79.38206
         6 |   89.3758  89.33151  5.708862  .0638748  79.39233  99.11871
         7 |   110.407  110.2909  6.692319   .060615  99.12563  122.8186
         8 |  137.7846  137.5525  9.298181  .0674835  123.5698  154.9666
         9 |  179.5007  176.1209  17.33479  .0965723  155.0732  214.4674
        10 |  332.2927  285.4411  135.2309  .4069633  214.4888  1212.256
         . |         .         .         .         .         .         .
-----------+-------------------------------------------------------------
     Total |  108.7874  79.38206  97.27053  .8941343  4.811201  1212.256
---------------------------------------------------------------------------

table
This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:

table rowvar colvar [if exp] [in range], c(clist) [row col]

where
rowvar      is the categorical row variable
colvar      is the categorical column variable
clist       is a list of statistics and variables
row         is an option to include a summary row
col         is an option to include a summary column

Some useful facts about this command:


The default statistic is the frequency. Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn (the nth percentile). The c( ) is short for the contents of each cell. Like tab, it can be used to create one- and two-way frequency tables, but table cannot do percentages. Like tabsum, it can be used to calculate basic statistics for each value of a categorical variable; its advantage over tabsum is that it can do more statistics and can take more than one continuous variable. Like tabstat, it can be used to calculate advanced statistics for each value of a categorical variable; its advantage over tabstat is that it can do two-way (and higher) tables, but its disadvantage is that it has fewer statistics. It is similar to table in SPSS, but easier to learn and less flexible in formatting.

Here are some examples:

table q1a, row                                    table of frequencies by region with total row
table q1a, c(mean income)                         table of average income by region
table q1a, c(mean yield sd yield median yield)    table of yield statistics by region
table q1a, c(mean yield) format(%9.2f)            table of average yields by region with format
table q1a sexh, c(mean yield)                     table of average yield by region and sex
table q1a sexh, c(mean income mean yield)         table of avg yield & income by region & sex

Some output from the table command is shown in Example 15. The table command calculates and displays tables of statistics, including frequency, mean, standard deviation, sum, and the 1st to 99th percentiles. The row and col options add an extra row and column to the table, reflecting the totals across rows and columns.

Example 15: Tabulate median real per capita consumption by region and sex of household head
table q1a sexh, contents(p50 rconsae) row col missing

          |   Sex of household head
   Region |   Female      Male     Total
----------+------------------------------
   Tigray | 73.05909  74.20448  73.56232
   Amhara | 124.9734  95.00103  104.7363
   Oromia | 98.59296  99.43469  98.75433
     SNNP | 53.73735  50.34177  51.14911
          |
    Total | 90.04483  77.18623  79.38206

. table rconseadec, c(mean rconsae)

10 quantiles |
  of rconsae | mean(rconsae)
-------------+--------------
           1 |      21.80935
           2 |      36.24088
           3 |      48.52454
           4 |      60.38483
           5 |      73.09496
           6 |       89.3758
           7 |       110.407
           8 |      137.7846
           9 |      179.5007
          10 |      332.2927

Exercise 4
1. Use ERHScons1999 and tabulate basic summary statistics showing the mean, standard deviation, and frequency of per capita food consumption for each village. Interpret the result.
2. Repeat the same procedure as in question 1 but report only the median of food consumption.
3. Tabulate basic summary statistics for food consumption by sex of household head and region (use a single table).
4. Tabulate the mean, 25th percentile, median, 75th percentile, sd, cv, min, and max summary statistics for real food consumption per capita by deciles of real consumption per capita.

5. Tabulate median real food consumption per capita by sex of household head and deciles of real consumption per capita (use a single table).

SECTION 8: PRESENTING DATA WITH GRAPHS (GRAPHING DATA)
This section provides a brief introduction to creating graphs. In Stata, all graphs are made with the graph command, but there are 8 types of charts and numerous subcommands for controlling the type and format of a graph. In this section, we focus on four types of graph and a few options. The commands that draw graphs are:

graph twoway     scatterplots, line plots, etc.
graph matrix     scatterplot matrices
graph bar        bar charts
graph dot        dot charts
graph box        box-and-whisker plots
graph pie        pie charts

Graph commands can also be used to produce histograms, box plots, kdensity plots, P-P plots, and Q-Q plots, but we postpone these until the introduction of normality later. Let us first acquaint ourselves with some twoway graph commands. A two-way scatterplot can be drawn using the (graph) twoway scatter command to show the relationship between two variables, cons (total consumption) and food (food consumption). As we would expect, there is a positive relationship between the two variables.
. graph twoway scatter cons food
[Scatter plot: consumption per month (y-axis) against food consumption per month (x-axis)]

We can show the regression line predicting cons from food using the lfit plot type.
. twoway lfit cons food

[Line plot: fitted values from the regression of cons on food, against food cons per month]

The two kinds of plot can be overlaid in one graph, as in this example using household size:


. twoway (scatter cons hhsize) (lfit cons hhsize)
[Scatter plot of consumption per month against household size, with fitted line overlaid]
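The other graph types listed at the start of this section follow the same pattern. For instance (a minimal sketch using variables from the ERHS consumption file):

graph bar (mean) cons, over(q1a)     bar chart of mean consumption by region
graph box cons, over(sexh)           box plots of consumption by sex of household head
graph pie, over(q1a)                 pie chart of the sample share of each region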

Exercise 5: Draw a twoway scatter graph with a fitted line for consumption per capita against household size and explain the pattern.


SECTION 9: NORMALITY AND OUTLIER


Check for normality
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. We must be extremely mindful of possible outliers and their adverse effects during any attempt to measure the relationship between two continuous variables. There are no official rules for identifying outliers; in a sense, the definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Sometimes it is obvious that an outlier is simply miscoded (for example, age reported as 230) and hence should be set to missing, but most of the time this is not the case. Before abnormal observations can be singled out, it is necessary to characterize normal observations.

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. The skewness of a normal distribution is zero, and any symmetric data should have skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values indicate data that are skewed right. By skewed left, we mean that the left tail is heavier than the right tail; similarly, skewed right means that the right tail is heavier than the left tail.

Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak; a uniform distribution would be the extreme case. The standard normal distribution has a kurtosis of three, so higher values indicate a "peaked" distribution and lower values a "flat" distribution. A kurtosis of 6 or larger indicates a large departure from normality.

We can obtain skewness and kurtosis values by using the detail option of the summarize command. Clearly, the variable rconspc (real consumption per capita) is skewed to the right and has a peaked distribution. Both statistics indicate that the distribution of rconspc is far from normal.
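These two statistics can also be requested directly (a small sketch):

tabstat rconspc, stats(skewness kurtosis)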
. sum rconspc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
     rconspc |      1449    90.36742    81.99623    4.22011   1018.295


. sum rconspc, detail

             real consumption per capita 1994 prices
-------------------------------------------------------------
      Percentiles      Smallest
 1%     11.65814        4.22011
 5%     18.67906       6.865227
10%     25.10425       7.068164       Obs                1449
25%     39.94022       8.201794       Sum of Wgt.        1449

50%     65.99258                      Mean           90.36742
                        Largest       Std. Dev.      81.99623
75%     114.2533       577.1937
90%     180.8909       624.1437       Variance       6723.382
95%     236.1537       660.1689       Skewness       3.212314
99%     405.8775       1018.295       Kurtosis       21.69683

Besides commands for descriptive statistics, such as summarize, we can also check the normality of a variable visually by looking at some basic graphs in Stata, including histograms, box plots, kdensity plots, pnorm, and qnorm. Let's keep using rconspc from the ERHScons1999.dta file for making some graphs. The histogram command is an effective graphical technique for showing both the skewness and kurtosis of rconspc.
histogram rconspc
[Histogram: density of real consumption per capita 1994 prices]

The normal option can be used to get a normal overlay. This shows the skew to the right in rconspc.


. histogram rconspc, normal


[Histogram with normal overlay: real consumption per capita 1994 prices]

We can use the bin() option to increase the number of bins to 100; this option specifies how many bins the data are aggregated into. This better illustrates the distribution of rconspc. Notice that the histogram resembles a bell-shaped curve, but truncated at 0.
. histogram rconspc, normal bin(100)

[Histogram with 100 bins and normal overlay: real consumption per capita 1994 prices]

graph box draws vertical box plots. In a vertical box plot, the y axis is numerical and the x axis is categorical. The upper and lower bounds of the box are defined by the 25th and 75th percentiles of rconspc, and the line within the box is the median; the ends of the whiskers mark (approximately) the 5th and 95th percentiles of rconspc. The graph box command can be used to produce a box plot which can help us examine the distribution of rconspc. If rconspc were normal, the median would be in the center of the box and the ends of the whiskers would be equidistant from the box.


The box plot for rconspc shows positive skew. The median is pulled toward the low end of the box, and the upper whisker is stretched out away from the box, for both male- and female-headed households. In fact, it seems worse for male-headed households.
. graph box rconspc, by(sexh)

[Box plots of real consumption per capita 1994 prices, by sex of household head (Female, Male)]

The kdensity command with the normal option displays a kernel density estimate of the variable with a normal density superimposed on the graph. This is particularly useful for verifying that residuals are normally distributed, which is an important assumption for regression. The plot shows that rconspc is skewed to the right, with a long upper tail compared with the normal density.
. kdensity rconspc, normal
[Kernel density estimate of real consumption per capita 1994 prices, with normal density overlay]


Graphical alternatives to the kdensity command are the P-P plot and the Q-Q plot. The pnorm command produces a P-P plot, which graphs a standardized normal probability plot. It should be approximately linear if the variable follows a normal distribution: the straighter the line formed by the P-P plot, the more the variable's distribution conforms to the normal distribution.
. pnorm rconspc
[P-P plot: Normal F[(rconspc-m)/s] against Empirical P[i] = i/(N+1)]

The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution. If the Q-Q plot shows a line that is close to the 45-degree line, the variable is approximately normally distributed.
. qnorm rconspc
[Q-Q plot: real consumption per capita 1994 prices against the inverse normal]


Both the P-P and Q-Q plots show that rconspc is not normal, with a long tail to the right. The qnorm plot is more sensitive to deviations from normality in the tails of the distribution, whereas the pnorm plot is more sensitive to deviations near the mean of the distribution. From the statistics and graphs we can confidently conclude that there are outliers, especially at the upper end of the distribution.

Dealing with outliers


There are generally three ways to deal with outliers. The easiest is to delete them from the analysis. The second is to use measures that are not sensitive to them, such as the median instead of the mean, or to transform the data to be more normal. The most complicated is to replace them by imputation. Since our data are heavily right-tailed, we will focus on very large outliers. A customary criterion for identifying an outlier is a value more than three standard deviations from the median. Note that we use the median because it is a robust statistic: if there are big outliers, the mean will shift a lot but the median will not.

Example 16: Using robust statistics to identify outliers
/* Calculate number of standard deviations from median by sex of hh head */
. use "C:\..\training\ERHScons1999.dta", clear
. egen median=median(rconspc), by(sexh)
. egen sd=sd(rconspc), by(sexh)
. gen ratio=(rconspc-median)/sd
(3 missing values generated)
. gen outlier=1 if ratio>3 & ratio~=.
(1414 missing values generated)
. replace outlier=0 if outlier==. & ratio~=.
(1411 real changes made)
. tabulate outlier, missing

    outlier |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,411       97.18       97.18
          1 |         38        2.62       99.79
          . |          3        0.21      100.00
------------+-----------------------------------
      Total |      1,452      100.00

Only 38 observations are identified as outliers. When we compare the mean and median values using the table command, the mean value has dropped by around 5% and 14% among female- and male-headed households, respectively, while the medians are much less sensitive to the outliers.


Example 17: Comparing mean and median values in the presence of outliers


. table sexh outlier, contents(mean rconspc) row col missing

----------------------------------------
Sex of    |
household |         outlier
head      |        0         1     Total
----------+-----------------------------
   Female | 88.56179  419.9406  100.2183
     Male | 78.57423  431.6569  86.63701
          |
    Total | 81.29232  427.3403  90.36742
----------------------------------------

. table sexh outlier, contents(median rconspc) row col missing

----------------------------------------
Sex of    |
household |         outlier
head      |        0         1     Total
----------+-----------------------------
   Female | 70.84578  398.7476  73.41055
     Male | 63.45253  371.2374  64.25856
          |
    Total | 64.36755  385.9552  65.99258
----------------------------------------

Method 1: Listwise deletion
In this approach, any observation that contains an outlier is recoded to missing so that it is dropped from the analysis. Although easy to understand and to perform, it runs the risk of causing bias. Stata performs listwise deletion automatically by default in order to allow the data matrix to be inverted, a necessity for regression analysis. Sometimes by dropping outliers we can greatly reduce the adverse effect of extreme values, but it does not work in our data, as indicated by the histogram below.
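As a minimal sketch, the flagged observations from Example 16 could be recoded to missing like this (the flagged values are then dropped automatically from later analyses):

replace rconspc = . if outlier==1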


. histogram rconspc if outlier==0, normal

[Histogram with normal overlay: real consumption per capita 1994 prices, excluding outliers]

Method 2: Robust statistics
An alternative is to choose robust statistics that are not sensitive to outliers, such as the median over the mean, as shown above. When we are concerned about outliers or skewed distributions, the rreg command can be used for robust regression. Robust regression re-estimates the regression coefficients themselves, down-weighting influential observations, so both coefficients and standard errors differ from OLS; this is different from the regress command with the robust option, which changes only the standard errors.

Example 18: Robust statistics
. reg rconspc hhsize

      Source |       SS       df       MS              Number of obs =    1449
-------------+------------------------------           F(  1,  1447) =  193.80
       Model |  1149884.79     1  1149884.79           Prob > F      =  0.0000
    Residual |   8585572.4  1447  5933.36033           R-squared     =  0.1181
-------------+------------------------------           Adj R-squared =  0.1175
       Total |  9735457.19  1448  6723.38204           Root MSE      =  77.028

------------------------------------------------------------------------------
     rconspc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |  -10.29891   .7398004   -13.92   0.000    -11.75011   -8.847716
       _cons |   150.0144   4.738429    31.66   0.000     140.7195    159.3093
------------------------------------------------------------------------------


. rreg rconspc hhsize

   Huber iteration 1:  maximum difference in weights = .92791317
   Huber iteration 2:  maximum difference in weights = .26100886
   Huber iteration 3:  maximum difference in weights = .08196986
   Huber iteration 4:  maximum difference in weights = .02097291
Biweight iteration 5:  maximum difference in weights = .29378905
Biweight iteration 6:  maximum difference in weights = .0589816
Biweight iteration 7:  maximum difference in weights = .01602466
Biweight iteration 8:  maximum difference in weights = .0038901

Robust regression                                      Number of obs =    1449
                                                       F(  1,  1447) =  146.84
                                                       Prob > F      =  0.0000

------------------------------------------------------------------------------
     rconspc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |  -5.583005   .4607371   -12.12   0.000    -6.486789   -4.679221
       _cons |   105.3791   2.951026    35.71   0.000     99.59032    111.1678
------------------------------------------------------------------------------

Method 3: Data transformation
The variable rconspc is skewed to the right and bounded below by zero. In this case, a log transformation is an appropriate correction. The logarithm squeezes together the larger values in the data set and stretches out the smaller values, which can produce a distribution closer to symmetric. In addition, a log transformation can pull in the outliers on the high end and make them closer to the rest of the data. Let's have a look at the distribution after the log transformation.
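The transformed variable is not created in the extract above; as a one-line sketch (ln() returns missing for zero or negative values):

gen lnrconspc = ln(rconspc)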
. histogram lnrconspc if rconspc~=., normal

[Histogram with normal overlay: lnrconspc]

Statistics from the summarize command also indicate a distribution that is close to normal.


Example 19: Basic descriptive statistics after data transformation


sum lnrconspc if outlier==0, detail

                          lnrconspc
-------------------------------------------------------------
      Percentiles      Smallest
 1%     2.456005       1.439861
 5%     2.921272       1.926469
10%     3.210526       1.955601       Obs                1411
25%     3.680637       2.104353       Sum of Wgt.        1411

50%      4.16461                      Mean           4.155071
                        Largest       Std. Dev.      .7208543
75%     4.696181       5.706972
90%      5.07678       5.772238       Variance       .5196309
95%     5.333407       5.778316       Skewness      -.2227742
99%     5.612014       5.802852       Kurtosis       2.724598

Method 4: Imputation
After identifying outliers, we usually first recode them as missing values. Missing data usually present a problem in statistical analyses. If missing values are correlated with the outcome of interest, then ignoring them will bias the results of statistical tests. In addition, most statistical software packages (e.g., SAS, Stata) automatically drop observations that have missing values for any variable used in an analysis. This practice reduces the analytic sample size, lowering the power of any test carried out.

Other than simply dropping missing values, there is more than one approach to imputation, that is, to filling in the missing cells. We will only consider single imputation, which fills a missing value with one single replacement value.

The easy approach is to use an arbitrary method to impute missing data, such as mean substitution. Substitution of the simple grand mean will reduce the variance of the variable. Reduced variance can bias a correlation downward (attenuation) or, if the same cases are missing for two variables and means are substituted, the correlation can be inflated. These effects on correlation carry over in a regression context as unreliability of the beta weights and of the related estimates of the relative importance of the independent variables. That is, mean substitution for one variable can bias the estimated effects of other or all variables in the regression analysis, because bias in one correlation can affect the beta weights of all variables. Mean substitution is no longer recommended.

Another approach is regression-based imputation. In this strategy, it is assumed that the same model explains the data for the non-missing cases as for the missing cases. First, the analyst estimates a regression model in which the dependent variable has missing values for some observations, using all non-missing data. In the second step, the estimated regression coefficients are used to predict (impute) the missing values of that variable. The proper regression model depends on the form of the dependent variable: a probit or logit is used for binary variables, Poisson or other count models for integer-valued variables, and OLS or related models for continuous variables. Even though this may introduce unrealistically low levels of noise in the data, it performs more robustly than mean substitution and is less complex than multiple imputation. Thus it is the preferred approach to imputation here.

Assuming we already coded outliers of rconspc as missing, now the missing values are replaced (imputed) with predicted values.
. xi: regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
. predict yhat
(option xb assumed; fitted values)
. replace lnrconspc=yhat if rconspc==.
(51 real changes made)

There is another Stata command to perform imputation. The impute command fills in missing values by regression and puts the result into a new variable defined by the generate option.
. xi: impute lnrconspc i.q1a i.sexh i.poor hhsize ageh, gen(new1)
i.q1a             _Iq1a_1-9           (naturally coded; _Iq1a_1 omitted)
i.sexh            _Isexh_0-1          (naturally coded; _Isexh_0 omitted)
i.poor            _Ipoor_0-1          (naturally coded; _Ipoor_0 omitted)
0.21% (3) observations imputed

. xi: regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
. predict yhat
. replace lnrconspc=yhat if rconspc==.
. compare lnrconspc new1

                                 ---------- difference ----------
                     count       minimum      average      maximum
-------------------------------------------------------------------
lnrconspc=new1        1452
                 ---------
jointly defined       1452             0            0            0
                 ---------
total                 1452

The impute command produces exactly the same results.

Exercise 6:
1. Use ERHScons1999 and check normality for real consumption per adult equivalent using histogram, box plot, kdensity, pnorm, and qnorm.
2. If there are outliers, replace them using robust statistics.
3. Excluding outliers, check the normality of the distribution.
4. If the distribution is still not normal, apply an appropriate transformation and check its normality.


SECTION 10: STATISTICAL TESTS

compare command
The compare command is an easy way to check whether two variables are the same. Let's first create a variable comparecons, which equals cons if cons is not missing, and equals 0 if cons is missing.
. gen comparecons=cons if cons~=.
(51 missing values generated)
. replace comparecons=0 if cons==.
(51 real changes made)
. compare cons comparecons

                                   ---------- difference ----------
                       count       minimum      average      maximum
---------------------------------------------------------------------
cons=compare~s          1449
                   ---------
jointly defined         1401             0            0            0
cons missing only         51
                   ---------
total                   1452

correlate command
The correlate command displays a matrix of Pearson correlations for the variables listed.
. correlate cons hhsize
(obs=1449)

             |     cons   hhsize
-------------+------------------
        cons |   1.0000
      hhsize |   0.2601   1.0000
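A related command, pwcorr, uses pairwise rather than casewise deletion and can report significance levels (a small sketch):

pwcorr cons hhsize, sig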

ttest command
We would like to see if the mean of hhsize equals 6 by using a single-sample t-test, testing whether the sample was drawn from a population with a mean of 6. The ttest command is used for this purpose.


. ttest hhsize=6

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  hhsize |    1452    5.782369    .0719318    2.740968    5.641268    5.923471
------------------------------------------------------------------------------
    mean = mean(hhsize)                                           t =  -3.0255
Ho: mean = 6                                     degrees of freedom =     1451

    Ha: mean < 6                Ha: mean != 6                  Ha: mean > 6
 Pr(T < t) = 0.0013        Pr(|T| > |t|) = 0.0025           Pr(T > t) = 0.9987

We are also interested in whether cons is close to food; a paired t-test compares the two means.


. ttest cons=food

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    cons |    1449    508.5184    11.04409    420.4014    486.8543    530.1825
    food |    1449    437.3696    10.21292    388.7621    417.3359    457.4033
---------+--------------------------------------------------------------------
    diff |    1449    71.14877    2.130751    81.10861    66.96908    75.32846
------------------------------------------------------------------------------
mean(diff) = mean(cons - food)                                    t =  33.3914
Ho: mean(diff) = 0                               degrees of freedom =     1448

Ha: mean(diff) < 0          Ha: mean(diff) != 0          Ha: mean(diff) > 0
Pr(T < t) = 1.0000        Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

The t-test for independent groups comes in two varieties: pooled variance and unequal variance. We want to look at the difference in cons between male- and female-headed households. We will begin with the ttest command for independent groups with pooled variance and compare the results with the ttest command for independent groups with unequal variance.
. ttest cons, by(sexh)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  Female |     398    433.7347     19.5334      389.69    395.3329    472.1365
    Male |    1051     536.838    13.20949    428.2402     510.918     562.758
---------+--------------------------------------------------------------------
combined |    1449    508.5184    11.04409    420.4014    486.8543    530.1825
---------+--------------------------------------------------------------------
    diff |           -103.1033    24.60287               -151.3644   -54.84217
------------------------------------------------------------------------------
    diff = mean(Female) - mean(Male)                              t =  -4.1907
Ho: diff = 0                                     degrees of freedom =     1447

    Ha: diff < 0                Ha: diff != 0                  Ha: diff > 0
 Pr(T < t) = 0.0000        Pr(|T| > |t|) = 0.0000           Pr(T > t) = 1.0000


. ttest cons, by(sexh) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  Female |     398    433.7347     19.5334      389.69    395.3329    472.1365
    Male |    1051     536.838    13.20949    428.2402     510.918     562.758
---------+--------------------------------------------------------------------
combined |    1449    508.5184    11.04409    420.4014    486.8543    530.1825
---------+--------------------------------------------------------------------
    diff |           -103.1033    23.58059               -149.3921   -56.81448
------------------------------------------------------------------------------
    diff = mean(Female) - mean(Male)                              t =  -4.3724
Ho: diff = 0                     Satterthwaite's degrees of freedom =  781.352

    Ha: diff < 0                Ha: diff != 0                  Ha: diff > 0
 Pr(T < t) = 0.0000        Pr(|T| > |t|) = 0.0000           Pr(T > t) = 1.0000

The by() option can be extended to group mean comparison tests:

. ttest cons, by(q1a)
. ttest cons, by(q1a) unequal

Other statistical tests
The hotelling command performs Hotelling's T-squared test of whether the means are equal between two groups.
. hotel cons, by(sexh)
-----------------------------------------------------------------------
-> sexh = Female

    Variable |     Obs        Mean   Std. Dev.        Min         Max
-------------+---------------------------------------------------------
        cons |     398    433.7347      389.69      6.514    3170.157

-----------------------------------------------------------------------
-> sexh = Male

    Variable |     Obs        Mean   Std. Dev.        Min         Max
-------------+---------------------------------------------------------
        cons |    1051     536.838    428.2402      15.98    3883.536

2-group Hotelling's T-squared = 17.561974
F test statistic: ((1449-1-1)/(1449-2)(1)) x 17.561974 = 17.561974

H0: Vectors of means are equal for the two groups
    F(1,1447) = 17.5620
    Prob > F(1,1447) = 0.0000


The tabulate command performs a chi-square test to see if two variables are independent.
tab sexh poor, chi2

    Sex of |
 household |         poor
      head |      0.00       1.00 |     Total
-----------+----------------------+----------
    Female |       277        123 |       400
      Male |       659        393 |     1,052
-----------+----------------------+----------
     Total |       936        516 |     1,452

          Pearson chi2(1) =   5.5231   Pr = 0.019

SECTION 11: LINEAR REGRESSION


This section describes the use of Stata for regression analysis. Regression analysis involves estimating an equation that best describes the data. One variable is considered the dependent variable, while the others are considered independent (or explanatory) variables. Stata is capable of many types of regression analysis and associated statistical tests. In this section, we touch on only a few of the more common commands and procedures. The commands described in this section are:

regress
test, testparm
predict
probit
ovtest
hettest

regress
This is an example of ordinary linear regression using the regress command.
. reg cons hhsize

      Source |       SS       df       MS              Number of obs =    1449
-------------+------------------------------           F(  1,  1447) =  104.98
       Model |  17310207.1     1  17310207.1           Prob > F      =  0.0000
    Residual |   238605459  1447  164896.654           R-squared     =  0.0676
-------------+------------------------------           Adj R-squared =  0.0670
       Total |   255915666  1448  176737.338           Root MSE      =  406.07

------------------------------------------------------------------------------
        cons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   39.95906   3.900049    10.25   0.000     32.30871    47.60942
       _cons |   277.0922   24.97986    11.09   0.000     228.0916    326.0929
------------------------------------------------------------------------------

This regression tells us that for every extra person (hhsize) added to a household, total monthly expenditure (cons) will increase by about 40 Ethiopian Birr. This increase is statistically significant, as indicated by the 0.000 probability associated with this coefficient.


The other important piece of information is the R-squared (r2), which equals 0.0676. In essence, this value tells us that our independent variable (hhsize) accounts for approximately 7% of the variation in the dependent variable (cons).

We can run the regression with robust standard errors, which can tolerate a non-zero percentage of outliers, i.e., residuals that are not i.i.d. This is very useful when there is heterogeneity of variance. The robust option does not affect the estimates of the regression coefficients.
. reg cons hhsize, robust

Linear regression                                      Number of obs =    1449
                                                       F(  1,  1447) =   98.44
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.0676
                                                       Root MSE      =  406.07

------------------------------------------------------------------------------
             |               Robust
        cons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   39.95906   4.027386     9.92   0.000     32.05893     47.8592
       _cons |   277.0922   22.27592    12.44   0.000     233.3957    320.7888
------------------------------------------------------------------------------

The regress command without any arguments redisplays the last regression analysis.

Extract results
Stata stores results from estimation commands in e(), and you can see a list of what exactly is stored using the ereturn list command.
. ereturn list

scalars:
               e(N) =  1449
            e(df_m) =  1
            e(df_r) =  1447
               e(F) =  98.44285111812539
              e(r2) =  .0676402792171247
            e(rmse) =  406.0746904916928
             e(mss) =  17310207.09089088
             e(rss) =  238605458.7112162
            e(r2_a) =  .0669959393962658
              e(ll) =  -10758.51351538218
            e(ll_0) =  -10809.25501198254

macros:
           e(title) : "Linear regression"
          e(depvar) : "cons"
             e(cmd) : "regress"
      e(properties) : "b V"
         e(predict) : "regres_p"
           e(model) : "ols"
       e(estat_cmd) : "regress_estat"
         e(vcetype) : "Robust"

matrices:
               e(b) :  1 x 2
               e(V) :  2 x 2

functions:
        e(sample)

Using the generate command, we can extract those results, such as estimated coefficients and standard errors, to be used in other Stata commands.
. reg cons hhsize
. gen intercept=_b[_cons]
. display intercept
277.09225
. gen slope=_b[hhsize]
. display slope
39.959064
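Standard errors can be extracted the same way through _se[] (a small sketch; the variable name slope_se is only illustrative):

. display _se[hhsize]
. gen slope_se=_se[hhsize]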

The estimates table command displays a table with coefficients and statistics for one or more estimation sets in parallel columns. In addition, standard errors, t statistics, p-values, and scalar statistics may be listed with the b, se, t, and p options.
. estimates table, b se t p

---------------------------
    Variable |   active
-------------+-------------
      hhsize |  39.959065
             |  3.9000493
             |      10.25
             |     0.0000
       _cons |  277.09225
             |  24.979856
             |      11.09
             |     0.0000
---------------------------
              legend: b/se/t/p

Prediction commands
The predict command computes the predicted value and residual for each observation. The default, shown below, is to calculate the predicted value of cons.
. predict pred
(option xb assumed; fitted values)

When using the resid option, the predict command calculates the residuals.
. predict e, residual

We can plot the predicted and observed values using the graph twoway command.
. regress cons food
. predict pred
. graph twoway (scatter cons hhsize) (lfit pred hhsize)


[Scatter plot of consumption per month against household size, with fitted values overlaid]

The rvfplot command is a convenience command that generates a plot of the residuals versus the fitted values. It is used after the regress command.
. regress cons food
. rvfplot

[Residual-versus-fitted plot: residuals against fitted values]


Hypothesis tests
The test command performs Wald tests of simple and composite linear hypotheses about the parameters of the most recent estimation.
. recode q1a 7/9=7
. gen reg1=q1a==1
. gen reg3=q1a==3
. gen reg4=q1a==4
. gen reg7=q1a==7

. regress cons hhsize reg1 reg3 reg4 reg7

      Source |       SS       df       MS              Number of obs =    1449
-------------+------------------------------           F(  4,  1444) =   87.29
       Model |  49831582.4     4  12457895.6           Prob > F      =  0.0000
    Residual |   206084083  1444  142717.509           R-squared     =  0.1947
-------------+------------------------------           Adj R-squared =  0.1925
       Total |   255915666  1448  176737.338           Root MSE      =  377.78

------------------------------------------------------------------------------
        cons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |    44.0891   3.719572    11.85   0.000     36.79276    51.38544
        reg1 |  (dropped)
        reg3 |   155.8197   35.62022     4.37   0.000     85.94683    225.6926
        reg4 |   252.3235   36.41307     6.93   0.000     180.8953    323.7517
        reg7 |  -118.6372   35.93946    -3.30   0.001    -189.1363   -48.13807
       _cons |   170.4098   37.14745     4.59   0.000     97.54106    243.2786
------------------------------------------------------------------------------

As stated earlier, consumption expenditure is positively related to household size. In addition, household consumption expenditure in the Amhara (reg3) and Oromia (reg4) regions is significantly greater than in Tigray (reg1), while household expenditure in SNNP (reg7) is significantly less than in Tigray. Note that Stata automatically drops one of the regional dummy variables (reg1 in this case) to avoid perfect multicollinearity.
. test reg3=0

 ( 1)  reg3 = 0

       F(  1,  1444) =   19.14
            Prob > F =  0.0000

. test reg3=reg4=reg7

 ( 1)  reg3 - reg4 = 0
 ( 2)  reg3 - reg7 = 0

       F(  2,  1444) =  109.84
            Prob > F =  0.0000

The example above gives the results of some tests related to the regression analysis shown earlier. The first test command tests the hypothesis that the region 3 coefficient is zero (test reg3=0), and the second tests that the region coefficients are all equal (test reg3=reg4=reg7); in both cases the probability is very low (less than 0.001), so we can reject these hypotheses. This is not surprising since each coefficient is statistically significant on its own. If you want to test the hypothesis that a set of related variables are all equal to zero, you can use the related testparm command.
. testparm reg*

 ( 1)  reg3 = 0
 ( 2)  reg4 = 0
 ( 3)  reg7 = 0

       F(  3,  1444) =   75.96
            Prob > F =  0.0000

This is a test of the hypothesis that all the region dummies are zero.

The hypothesis of no regional influence is rejected, meaning that the regional coefficients are jointly significant (i.e., region does influence total consumption).

Note: test and predict are commands that can be used in conjunction with all of the above estimation procedures.

The suest (seemingly unrelated estimation) command combines the estimation results from several regressions (including parameter estimates and associated covariance matrices) into a single parameter vector and a simultaneous covariance matrix of the sandwich/robust type. Typical applications of the suest command are tests of within-model and cross-model hypotheses using the test or testnl command, such as a generalized Hausman specification test or a Chow test for structural break. Before we perform any test using the suest command, we must first store the estimation results with the estimates store command.
. reg cons hhsize if poor==1
. estimates store spoor
. reg cons hhsize if poor==0
. estimates store npoor
. suest spoor npoor

Simultaneous results for spoor, npoor

                                                       Number of obs =    1449

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
spoor_mean   |
      hhsize |   30.48242   2.014054    15.13   0.000     26.53495     34.4299
       _cons |   35.48593   11.76232     3.02   0.003      12.4322    58.53965
-------------+----------------------------------------------------------------
spoor_lnvar  |
       _cons |    9.16965   .0679724   134.90   0.000     9.036426    9.302873
-------------+----------------------------------------------------------------
npoor_mean   |
      hhsize |   85.00465   5.166889    16.45   0.000     74.87774    95.13157
       _cons |   210.9857   25.48277     8.28   0.000     161.0404     260.931
-------------+----------------------------------------------------------------
npoor_lnvar  |
       _cons |   11.95647   .1078529   110.86   0.000     11.74508    12.16786
------------------------------------------------------------------------------

. test hhsize

 ( 1)  [spoor_mean]hhsize = 0
 ( 2)  [npoor_mean]hhsize = 0

           chi2(  2) =  499.73
         Prob > chi2 =  0.0000

Next we want to see if the same hhsize coefficient holds for poor and non-poor households. We can type:
. test [spoor_mean]hhsize=[npoor_mean]hhsize

 ( 1)  [spoor_mean]hhsize - [npoor_mean]hhsize = 0

           chi2(  1) =   96.66
         Prob > chi2 =  0.0000

Or we can test whether the coefficients between the equations are equal, i.e., a Chow test.


. test ([spoor_mean]hhsize=[npoor_mean]hhsize) ([spoor_mean]_cons=[npoor_mean]_cons)

 ( 1)  [spoor_mean]hhsize - [npoor_mean]hhsize = 0
 ( 2)  [spoor_mean]_cons - [npoor_mean]_cons = 0

           chi2(  2) = 1179.33
         Prob > chi2 =  0.0000

This is equivalent to using the accumulate option of the test command, which tests a hypothesis jointly with previously tested hypotheses:
. test [spoor_mean]hhsize=[npoor_mean]hhsize
. test [spoor_mean]_cons=[npoor_mean]_cons, accumulate

 ( 1)  [spoor_mean]hhsize - [npoor_mean]hhsize = 0
 ( 2)  [spoor_mean]_cons - [npoor_mean]_cons = 0

           chi2(  2) = 1179.33
         Prob > chi2 =  0.0000

ovtest
Regression analysis generates the best unbiased linear estimates of the true coefficients provided that certain assumptions are satisfied. One assumption is that there are no omitted variables that are correlated with the error term. The ovtest command performs a Ramsey RESET test for omitted variables (misspecification). The syntax is:

ovtest [, rhs]

This test amounts to estimating y = xb + zt + u and then testing t = 0. If the rhs option is not specified, powers of the fitted values are used for z; otherwise, powers of the individual independent variables are used. Examples of the test are:

regress cons hhsize reg3 reg4 reg7
. ovtest          tests the significance of powers of predicted cons
. ovtest, rhs     tests the significance of powers of hhsize, reg3, reg4 and reg7

Example:
. ovtest

Ramsey RESET test using powers of the fitted values of cons
       Ho:  model has no omitted variables
                 F(3, 1441) =      4.47
                  Prob > F  =    0.0039

The ovtest rejects the hypothesis that there are no omitted variables, indicating that we need to improve the specification.

Heteroskedasticity
We can always visually check how well the regression surface fits the data by plotting residuals versus fitted values, with the rvfplot or rvpplot commands. In addition, there are several statistical tests for heteroskedasticity in the regression errors. We can use the hettest command, which runs an auxiliary regression of the squared residuals on the fitted values.
. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of cons

         chi2(1)      =    81.50
         Prob > chi2  =   0.0000

The hettest indicates that there is heteroskedasticity, which needs to be dealt with. We can also use the information matrix test via the imtest command, which provides a summary test of violations of the assumptions on the regression errors.
. imtest

Cameron & Trivedi's decomposition of IM-test

---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |      16.46      2    0.0003
            Skewness |      24.54      1    0.0000
            Kurtosis |       6.66      1    0.0099
---------------------+-----------------------------
               Total |      47.66      4    0.0000
---------------------------------------------------

The imtest also confirms the existence of heteroskedasticity, skewness, and kurtosis problems.

Exercise 7:
1. Using real consumption per capita as the dependent variable, repeat the above regression and interpret the results.
2. Using the log of consumption as the dependent variable, repeat the above regression and interpret the result. What are the differences between the results of q1 and q2?

xi command for categorical data

When there are categorical variables, it can be inefficient to generate a series of dummy variables by hand. The xi prefix is used to dummy-code categorical variables; we tag these variables with an i. in front of each target variable. In our example, the explanatory variable q1a has 4 levels and requires 3 dummy variables. The test command is used to test the collective effect of the 3 dummy-coded variables; in other words, it tests the main effect of the variable q1a. Note that the dummy-coded variable names must be written exactly as they appear in the regression results, including the uppercase I.
. xi: regress cons hhsize i.q1a, robust
i.q1a             _Iq1a_1-7           (naturally coded; _Iq1a_1 omitted)

Linear regression                                      Number of obs =    1449
                                                       F(  4,  1444) =   83.67
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1947
                                                       Root MSE      =  377.78

------------------------------------------------------------------------------
             |               Robust
        cons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |    44.0891   3.965998    11.12   0.000     36.30937    51.86884
     _Iq1a_3 |   155.8197   31.30962     4.98   0.000     94.40252    217.2369
     _Iq1a_4 |   252.3235   32.75505     7.70   0.000     188.0709    316.5761
     _Iq1a_7 |  -118.6372    25.7164    -4.61   0.000    -169.0827   -68.19171
       _cons |   170.4098   31.09044     5.48   0.000     109.4225    231.3971
------------------------------------------------------------------------------

. test _Iq1a_3 _Iq1a_4 _Iq1a_7

 ( 1)  _Iq1a_3 = 0
 ( 2)  _Iq1a_4 = 0
 ( 3)  _Iq1a_7 = 0

       F(  3,  1444) =   93.96
            Prob > F =  0.0000

We reject the null hypothesis of no regional effects since the p-value is small.

Similarly, we can apply the xi command to create village dummies (q1b):


. xi: regress cons hhsize i.q1b, robust i.q1b _Iq1b_1-16 (naturally coded; _Iq1b_1 omitted) Linear regression Number of obs F( 15, 1433) Prob > F R-squared Root MSE = = = = = 1449 42.56 0.0000 0.3266 346.79

-----------------------------------------------------------------------------| Robust cons | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------hhsize | 45.2582 3.714029 12.19 0.000 37.97269 52.54372 _Iq1b_2 | -27.54804 46.85358 -0.59 0.557 -119.457 64.36091 _Iq1b_3 | -87.41371 45.04186 -1.94 0.052 -175.7688 .9413348 _Iq1b_4 | 355.9933 49.9613 7.13 0.000 257.9882 453.9984 _Iq1b_5 | -169.5377 35.10535 -4.83 0.000 -238.4011 -100.6743 _Iq1b_6 | 158.2973 42.26349 3.75 0.000 75.39231 241.2022 _Iq1b_7 | 512.4817 60.70446 8.44 0.000 393.4026 631.5608 _Iq1b_8 | 112.9675 44.07454 2.56 0.010 26.50996 199.425 _Iq1b_9 | 63.10264 44.01605 1.43 0.152 -23.24016 149.4454 _Iq1b_10 | 292.1852 63.00507 4.64 0.000 168.5932 415.7773 _Iq1b_12 | -123.2652 40.63929 -3.03 0.002 -202.9841 -43.54632 _Iq1b_13 | -271.599 37.48298 -7.25 0.000 -345.1264 -198.0716 _Iq1b_14 | -86.31787 36.13403 -2.39 0.017 -157.1991 -15.43661 _Iq1b_15 | -182.1813 36.70234 -4.96 0.000 -254.1774 -110.1852 _Iq1b_16 | -11.66292 46.28264 -0.25 0.801 -102.4519 79.12606 _cons | 176.1548 36.033 4.89 0.000 105.4717 246.8379 ------------------------------------------------------------------------------

. test _Iq1b_2 _Iq1b_3 _Iq1b_4 _Iq1b_5 _Iq1b_6 _Iq1b_7 _Iq1b_8 _Iq1b_9 _Iq1b_10 _Iq1b_12 _Iq1b_13 _Iq1b_14 _Iq1b_15 _Iq1b_16

 ( 1)  _Iq1b_2 = 0
 ( 2)  _Iq1b_3 = 0
 ( 3)  _Iq1b_4 = 0
 ( 4)  _Iq1b_5 = 0
 ( 5)  _Iq1b_6 = 0
 ( 6)  _Iq1b_7 = 0
 ( 7)  _Iq1b_8 = 0
 ( 8)  _Iq1b_9 = 0
 ( 9)  _Iq1b_10 = 0
 (10)  _Iq1b_12 = 0
 (11)  _Iq1b_13 = 0
 (12)  _Iq1b_14 = 0
 (13)  _Iq1b_15 = 0
 (14)  _Iq1b_16 = 0

       F( 14,  1433) =   41.73
            Prob > F =   0.0000

Thus, we reject the null hypothesis of no village (area) effects since the p-value is small. The xi prefix can also be used to create dummy variables for q1b and for the interaction of q1b and hhsize. The first test command below tests the overall interaction, and the second tests the main effect of the areas.
. xi: regress cons hhsize i.q1b*hhsize, robust


. test _Iq1bXhhsi_2 _Iq1bXhhsi_3 _Iq1bXhhsi_4 _Iq1bXhhsi_5 _Iq1bXhhsi_6 _Iq1bXhhsi_7 _Iq1bXhhsi_8 _Iq1bXhhsi_9 _Iq1bXhhsi_10 _Iq1bXhhsi_12 _Iq1bXhhsi_13 _Iq1bXhhsi_14 _Iq1bXhhsi_15 _Iq1bXhhsi_16

. test _Iq1b_2 _Iq1b_3 _Iq1b_4 _Iq1b_5 _Iq1b_6 _Iq1b_7 _Iq1b_8 _Iq1b_9 _Iq1b_10 _Iq1b_12 _Iq1b_13 _Iq1b_14 _Iq1b_15 _Iq1b_16

By default, Stata selects the first category of the categorical variable as the reference category. If we would like to declare a certain category as the reference category, the char (characteristics) command is needed. In the model above, we would like to use the region coded 7 as the reference region, and the commands are
. char q1a[omit] 7
. xi: regress cons hhsize i.q1a, robust
i.q1a             _Iq1a_1-7           (naturally coded; _Iq1a_7 omitted)

Linear regression                                      Number of obs =    1449
                                                       F(  4,  1444) =   83.67
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1947
                                                       Root MSE      =  377.78

------------------------------------------------------------------------------
             |               Robust
        cons |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |    44.0891   3.965998    11.12   0.000     36.30937    51.86884
     _Iq1a_1 |   118.6372    25.7164     4.61   0.000     68.19171    169.0827
     _Iq1a_3 |   274.4569   24.86033    11.04   0.000     225.6907    323.2232
     _Iq1a_4 |   370.9607   25.45855    14.57   0.000      321.021    420.9004
       _cons |    51.7726   26.34448     1.97   0.050     .0950491    103.4502
------------------------------------------------------------------------------

. test _Iq1a_1 _Iq1a_3 _Iq1a_4

 ( 1)  _Iq1a_1 = 0
 ( 2)  _Iq1a_3 = 0
 ( 3)  _Iq1a_4 = 0

       F(  3,  1444) =   93.96
            Prob > F =   0.0000

Some estimation procedures in Stata are listed here:

anova      analysis of variance and covariance
arch       autoregressive conditional heteroscedasticity family of estimators
arima      autoregressive integrated moving average models
bsqreg     quantile regression with bootstrapped standard errors
cnreg      censored-normal regression
cnsreg     constrained linear regression
ereg       maximum-likelihood exponential distribution models
glm        generalized linear models
ivreg      instrumental variables and two-stage least squares regression
lnormal    maximum-likelihood lognormal distribution models
mvreg      multivariate regression
nl         nonlinear least squares
poisson    maximum-likelihood Poisson regression
qreg       quantile regression
reg3       three-stage least squares regression
regress    linear regression
rreg       robust regression using IRLS
sureg      seemingly unrelated regression
tobit      tobit regression
vwls       variance-weighted least squares regression
zinb       zero-inflated negative binomial models
zip        zero-inflated Poisson models

SECTION 12: LOGISTIC REGRESSION


Logistic regression
We are not going to discuss the theory behind logistic regression per se, but focus on how to perform logistic regression analyses and interpret the results using Stata. It is assumed that users are familiar with logistic regression. The logistic command by default reports the output as odds ratios but displays the coefficients if the coef option is used.
. logistic poor hhsize ageh sexh, coef
Logistic regression                               Number of obs   =       1452
                                                  LR chi2(3)      =     120.33
                                                  Prob > chi2     =     0.0000
Log likelihood = -884.66248                       Pseudo R2       =     0.0637

------------------------------------------------------------------------------
        poor |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   .2340767   .0230735    10.14   0.000     .1888534    .2792999
        ageh |    -.00178    .003769    -0.47   0.637    -.0091671    .0056071
        sexh |  -.1278524   .1363766    -0.94   0.349    -.3951457    .1394408
       _cons |  -1.813833   .2422366    -7.49   0.000    -2.288608   -1.339058
------------------------------------------------------------------------------

The exact same results can be obtained by using the logit command.
. logit poor hhsize ageh sexh

Iteration 0:   log likelihood = -944.82915
Iteration 1:   log likelihood = -885.06496
Iteration 2:   log likelihood = -884.66261
Iteration 3:   log likelihood = -884.66248

Logistic regression                               Number of obs   =       1452
                                                  LR chi2(3)      =     120.33
                                                  Prob > chi2     =     0.0000
Log likelihood = -884.66248                       Pseudo R2       =     0.0637

------------------------------------------------------------------------------
        poor |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   .2340767   .0230735    10.14   0.000     .1888534    .2792999
        ageh |    -.00178    .003769    -0.47   0.637    -.0091671    .0056071
        sexh |  -.1278524   .1363766    -0.94   0.349    -.3951457    .1394408
       _cons |  -1.813833   .2422366    -7.49   0.000    -2.288608   -1.339058
------------------------------------------------------------------------------
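As noted above, logistic without the coef option reports odds ratios, which are simply the exponentiated coefficients: for example, exp(.2340767) is about 1.264 for hhsize, meaning each additional household member raises the odds of being poor by roughly 26 percent. A quick check (the display line is our own illustration, not part of the original output):

. logistic poor hhsize ageh sexh
* the odds ratio for hhsize equals the exponentiated coefficient
. display exp(_b[hhsize])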

The xi prefix can also be used in logistic models to include categorical variables.
. xi: logit poor hhsize ageh sexh i.q1b
i.q1b             _Iq1b_1-16          (naturally coded; _Iq1b_1 omitted)

Iteration 0:   log likelihood = -944.82915
Iteration 1:   log likelihood = -736.12463
Iteration 2:   log likelihood = -723.35239
Iteration 3:   log likelihood = -721.57232
Iteration 4:   log likelihood = -721.28496
Iteration 5:   log likelihood = -721.26641
Iteration 6:   log likelihood = -721.26628

Logistic regression                               Number of obs   =       1452
                                                  LR chi2(17)     =     447.13
                                                  Prob > chi2     =     0.0000
Log likelihood = -721.26628                       Pseudo R2       =     0.2366

------------------------------------------------------------------------------
        poor |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   .2540365   .0268947     9.45   0.000     .2013237    .3067492
        ageh |  -.0017158   .0043324    -0.40   0.692    -.0102073    .0067756
        sexh |  -.2484387   .1609912    -1.54   0.123    -.5639757    .0670983
     _Iq1b_2 |    1.00783    .362959     2.78   0.005     .2964436    1.719217
     _Iq1b_3 |   1.459217   .3487419     4.18   0.000     .7756951    2.142738
     _Iq1b_4 |  -.6362756   .3294573    -1.93   0.053       -1.282    .0094489
     _Iq1b_5 |   .8389042   .3749527     2.24   0.025     .1040104    1.573798
     _Iq1b_6 |  -1.031675   .3775811    -2.73   0.006     -1.77172   -.2916295
     _Iq1b_7 |  -3.720754   1.040637    -3.58   0.000    -5.760366   -1.681142
     _Iq1b_8 |   .3595295   .3376553     1.06   0.287    -.3022627    1.021322
     _Iq1b_9 |  -.5104919   .3506632    -1.46   0.145    -1.197779    .1767954
    _Iq1b_10 |  -.6988431   .3668961    -1.90   0.057    -1.417946      .02026
    _Iq1b_12 |   1.659669   .3781756     4.39   0.000     .9184587     2.40088
    _Iq1b_13 |    2.55036   .4331979     5.89   0.000     1.701308    3.399413
    _Iq1b_14 |   .6482351   .3205897     2.02   0.043     .0198909    1.276579
    _Iq1b_15 |   1.617736   .3421814     4.73   0.000      .947073    2.288399
    _Iq1b_16 |   .1479744   .3737048     0.40   0.692    -.5844736    .8804224
       _cons |  -2.152062    .360365    -5.97   0.000    -2.858365    -1.44576
------------------------------------------------------------------------------

Extract results
We can use the ereturn or estat commands to retrieve results from the estimation, just as with other regression commands.


. ereturn list

scalars:
               e(N) =  1452
            e(ll_0) =  -944.829150727132
              e(ll) =  -721.2662778966318
            e(df_m) =  17
            e(chi2) =  447.1257456610003
            e(r2_p) =  .2366172473176216

macros:
           e(title) : "Logistic regression"
          e(depvar) : "poor"
             e(cmd) : "logit"
        e(crittype) : "log likelihood"
         e(predict) : "logit_p"
      e(properties) : "b V"
       e(estat_cmd) : "logit_estat"
        e(chi2type) : "LR"

matrices:
               e(b) :  1 x 18
               e(V) :  18 x 18

functions:
          e(sample)
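These stored results can be used directly in later calculations. For instance, the LR chi-square reported in the model header can be reproduced from the stored log likelihoods (a small illustration of our own):

* LR chi2 = 2 * (ll of the full model minus ll of the null model)
. display "LR chi2 = " 2*(e(ll) - e(ll_0))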

. estat summarize

Estimation sample logit                           Number of obs   =       1452

-------------------------------------------------------------
    Variable |       Mean    Std. Dev.        Min         Max
-------------+-----------------------------------------------
        poor |   .3553719     .4787908          0           1
      hhsize |   5.782369     2.740968          1          17
        ageh |   49.34711     15.64917         18          95
        sexh |   .7245179     .4469108          0           1
     _Iq1b_2 |   .0454545     .2083707          0           1
     _Iq1b_3 |   .0592287     .2361335          0           1
     _Iq1b_4 |   .1205234     .3256848          0           1
     _Iq1b_5 |    .042011     .2006834          0           1
     _Iq1b_6 |   .0991736     .2989979          0           1
     _Iq1b_7 |    .065427      .247363          0           1
     _Iq1b_8 |    .065427      .247363          0           1
     _Iq1b_9 |   .0750689     .2635932          0           1
    _Iq1b_10 |   .0668044      .249769          0           1
    _Iq1b_12 |   .0447658     .2068607          0           1
    _Iq1b_13 |   .0509642     .2200004          0           1
    _Iq1b_14 |   .0922865     .2895297          0           1
    _Iq1b_15 |   .0661157     .2485698          0           1
    _Iq1b_16 |   .0488981     .2157292          0           1
-------------------------------------------------------------

. estat ic
------------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+----------------------------------------------------------------
           . |   1452   -944.8292   -721.2663     18     1478.533    1573.585
------------------------------------------------------------------------------


Marginal effects
We use the mfx command to numerically calculate the marginal effects or elasticities and their standard errors after estimation. Several options are available for the calculation of marginal effects:

dydx   marginal effects d(y)/d(x); this is the default
eyex   elasticities in the form d(lny)/d(lnx)
dyex   elasticities in the form d(y)/d(lnx)
eydx   elasticities in the form d(lny)/d(x)
. mfx, dydx
Marginal effects after logit
      y  = Pr(poor) (predict)
         =  .29194589
------------------------------------------------------------------------------
variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]      X
---------+--------------------------------------------------------------------
  hhsize |   .0525128      .00572    9.19   0.000   .04131  .063715   5.78237
    ageh |  -.0003547       .0009   -0.40   0.692  -.00211    .0014   49.3471
   sexh* |  -.0524912      .03475   -1.51   0.131  -.120599  .015617  .724518
_Iq1b_2* |   .2364405      .08983    2.63   0.008   .060381    .4125   .045455
_Iq1b_3* |   .3449546      .08142    4.24   0.000    .18538   .50453   .059229
_Iq1b_4* |  -.1173583       .0532   -2.21   0.027  -.221631 -.013085   .120523
_Iq1b_5* |   .1947244      .09284    2.10   0.036   .012753  .376696   .042011
_Iq1b_6* |  -.1735392      .04883   -3.55   0.000  -.269243 -.077835   .099174
_Iq1b_7* |  -.3321019      .02071  -16.04   0.000   -.37269 -.291514   .065427
_Iq1b_8* |   .0787698      .07774    1.01   0.311  -.073607  .231146   .065427
_Iq1b_9* |  -.0953845      .05835   -1.63   0.102  -.209748  .018979   .075069
_Iq1b_10*|  -.1248791       .0551   -2.27   0.023   -.23287 -.016888   .066804
_Iq1b_12*|   .3912309      .08351    4.69   0.000    .22756  .554902   .044766
_Iq1b_13*|   .5568326      .06416    8.68   0.000    .43108  .682586   .050964
_Iq1b_14*|   .1464237      .07716    1.90   0.058  -.004807  .297654   .092287
_Iq1b_15*|   .3809777      .07704    4.95   0.000   .229984  .531971   .066116
_Iq1b_16*|   .0314128      .08135    0.39   0.699  -.128026  .190852   .048898
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
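In the same way, the other options listed above can be requested after the estimation. For example, to obtain elasticities instead of marginal effects (output omitted here):

. mfx, eyex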

Hypothesis Tests

Likelihood-ratio test
The lrtest command performs a likelihood-ratio test of the null hypothesis that the parameter vector of a statistical model satisfies some smooth constraint. To conduct the test, both the unrestricted and the restricted models must be fitted by maximum likelihood (or some equivalent method), and the results of at least one must be stored using estimates store. The lrtest command provides an important alternative to Wald testing for models fitted by maximum likelihood. Wald testing requires fitting only one model (the unrestricted model), so it is computationally more attractive than likelihood-ratio testing. Most statisticians, however, favor likelihood-ratio testing whenever feasible, since the null distribution of the LR test statistic is often more closely chi-square distributed than that of the Wald test statistic.

We would like to see if the introduction of regional dummies will help our estimation. We perform a likelihood-ratio test using the lrtest command.
. xi: logit poor hhsize ageh i.q1a
. estimates store n1
. logit poor hhsize ageh
. lrtest n1

Likelihood-ratio test                             LR chi2(5)  =     169.86
(Assumption: . nested in n1)                      Prob > chi2 =     0.0000

The null hypothesis is firmly rejected. Other hypothesis tests on the parameters work the same way as described for OLS.

Other related commands


Stata has a variety of commands for performing estimation when the dependent variable is dichotomous or polychotomous. Here is a list of some estimation commands for discrete dependent variables; see the estimation commands entry for a complete list of all of Stata's estimation commands.

asmprobit   alternative-specific multinomial probit regression
binreg      GLM models for the binomial family
biprobit    bivariate probit regression
blogit      logit regression for grouped data
bprobit     probit regression for grouped data
clogit      conditional logistic regression
cloglog     complementary log-log regression
glogit      weighted least squares logit on grouped data
gprobit     weighted least squares probit on grouped data
heckprob    probit model with selection
hetprob     heteroskedastic probit model
ivprobit    probit model with endogenous regressors
logistic    logistic regression
logit       maximum-likelihood logit regression
mlogit      maximum-likelihood multinomial logit models
mprobit     multinomial probit regression
nbreg       maximum-likelihood negative binomial regression
nlogit      nested logit regression
ologit      maximum-likelihood ordered logit
oprobit     maximum-likelihood ordered probit
probit      maximum-likelihood probit estimation

rologit     rank-ordered logistic regression
scobit      skewed logistic regression
slogit      stereotype logistic regression
xtcloglog   random-effects and population-averaged cloglog models
xtgee       GEE population-averaged generalized linear models
xtlogit     fixed-effects, random-effects, and population-averaged logit models
xtprobit    random-effects and population-averaged probit models

Exercise 8: Generate a variable containing food expenditure terciles, ftercile. Using ftercile as the dependent variable, run a multinomial logit where the independent variables are the same as in the above example and interpret the results.
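A possible first step for Exercise 8 is sketched below. It assumes a food expenditure variable named food exists in the data (adjust the name as needed); the independent variables follow the logit example above.

* create terciles of food expenditure
. xtile ftercile = food, nq(3)
* multinomial logit with the same regressors as the earlier example
. mlogit ftercile hhsize ageh sexh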

SECTION 13: PANEL DATA ANALYSIS


Panel data have both cross-sectional and longitudinal (time-series) dimensions. An example is the ERHS data, which was collected six times between 1994 and 2004. Panel data may have group effects, time effects, or both; these effects are analyzed by fixed effect and random effect models. A panel data set contains observations on n individuals, each measured at T points in time. In other words, each individual (1 through n subjects) includes T observations (1 through t time periods). Thus, the total number of observations is nT. Figure 6 below illustrates the data arrangement of a panel data set.

Figure 6: Data arrangement of panel data

 Group   Time   Variable1   Variable2   Variable3
     1      1
     1      2
     2      1
     2      2
     .      .
     n      1
     n      2
     .      .
     n      T

Fixed Effect versus Random Effect Models
Panel data models estimate fixed and/or random effects using dummy variables. The core difference between fixed and random effect models lies in the role of these dummies. If the dummies are considered part of the intercept, it is a fixed effect model; in a random effect model, the dummies act as part of the error term (see Figure 7). The fixed effect model examines group differences in intercepts, assuming the same slopes and constant variances across groups. Fixed effect models use least squares dummy variable (LSDV), within effect, and between effect estimation methods. Thus, ordinary least squares (OLS) regressions with dummies are, in fact, fixed effect models.

Figure 7: Fixed effect and random effect models

                    Fixed Effect Model                     Random Effect Model
Functional form     y_it = (a + u_i) + X'_it b + v_it      y_it = a + X'_it b + (u_i + v_it)
Intercepts          Varying across group and/or time       Constant
Error variances     Constant                               Varying across group and/or time
Slopes              Constant                               Constant
Estimation          LSDV, within effect, between effect    GLS, FGLS
Hypothesis test     Incremental F test                     Breusch-Pagan LM test

* v_it ~ IID(0, sigma_v^2)

The random effect model, by contrast, estimates variance components for groups and error, assuming the same intercept and slopes. The difference among groups (or time periods) lies in the variance of the error term. This model is estimated by generalized least squares (GLS) when the Omega matrix, the variance structure among groups, is known. The feasible generalized least squares (FGLS) method is used to estimate the variance structure when Omega is not known. A typical example is the groupwise heteroscedastic regression model (Greene 2003). There are various estimation methods for FGLS, including maximum likelihood methods and simulations (Baltagi and Chang 1994). Fixed effects are tested by the (incremental) F test, while random effects are examined by the Lagrange multiplier (LM) test (Breusch and Pagan 1980). If the null hypothesis is not rejected, the pooled OLS regression is favored. The Hausman specification test (Hausman 1978) compares fixed effect and random effect models. Figure 7 compares the fixed effect and random effect models. Group effect models create dummies using grouping variables (e.g., region, wereda, etc.). If one grouping variable is considered, it is called a one-way fixed or random group effect model. Two-way group effect models have two sets of dummy variables, one for a grouping variable and the other for a time variable. LSDV regression, the within effect model, the between effect model (group or time mean model), GLS, and FGLS are fundamentally based on OLS in terms of estimation. Thus, any procedure and command for OLS carries over to the panel data models. The Stata xtreg command estimates within effect (fixed effect) models with the fe option, between effect models with the be option, and random effect models with the re option. The following xt commands are the families used to run panel data analysis; a minimal worked example follows the command list below. I recommend further reading to understand them in detail.
xtdes      Describe pattern of xt data
xtsum      Summarize xt data
xttab      Tabulate xt data
xtdata     Faster specification searches with xt data
xtline     Line plots with xt data
xtreg      Fixed-, between- and random-effects, and population-averaged linear models
xtregar    Fixed- and random-effects linear models with an AR(1) disturbance


xtgls      Panel-data models using GLS
xtpcse     OLS or Prais-Winsten models with panel-corrected standard errors
xtrc       Random coefficients models
xtivreg    Instrumental variables and two-stage least squares for panel-data models
xtabond    Arellano-Bond linear, dynamic panel data estimator
xttobit    Random-effects tobit models
xtintreg   Random-effects interval data regression models
xtlogit    Fixed-effects, random-effects, and population-averaged logit models
xtprobit   Random-effects and population-averaged probit models
xtcloglog  Random-effects and population-averaged cloglog models
xtpoisson  Fixed-effects, random-effects, and population-averaged Poisson models
xtnbreg    Fixed-effects, random-effects, and population-averaged negative binomial models
xtgee      Population-averaged panel-data models using GEE
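As a minimal illustration, the sketch below runs the fixed and random effect models and the tests discussed above. It is a sketch only: the household identifier hhid and the survey-round variable year are assumed names for a panel version of the ERHS data.

* declare the panel structure (hhid and year are assumed names)
. tsset hhid year
* within (fixed effect) estimator
. xtreg cons hhsize, fe
. estimates store fixed
* GLS random effect estimator
. xtreg cons hhsize, re
* Breusch-Pagan LM test for random effects versus pooled OLS
. xttest0
* Hausman test: fixed versus random effects
. hausman fixed .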

Notes: Using different examples, we will discuss more about panel data during training sessions.

SECTION 14: DATA MANAGEMENT


Subset data
We can subset data by keeping or dropping variables, or by keeping or dropping observations.

1. keep and drop variables
Suppose our data file has many variables, but we only care about a handful of them. We can subset our data file to keep just those variables of interest. The keep command keeps the variables in the list while dropping all other variables.
. keep hhid cons food

Instead of keeping just a handful of variables, we might want to get rid of just one or two variables in the data file. The drop command drops the variables in the list while keeping all other variables.
. drop cons

2. keep and drop observations
The keep if command is used to keep observations when a condition is met.
. keep if sexh==1
(400 observations deleted)

Here we focus on male-headed households, which means the 400 female-headed households are dropped from the data set.

The drop if command works similarly. For example, we can eliminate observations with missing values; the expression after drop if specifies which observations should be dropped.
. drop if missing(cons)
(3 observations deleted)

3. Use the use command to select variables and observations
You can select both variables and observations with the use command. Let's read in just hhid, cons, q1a, sexh, and hhsize from the ERHScons1999.dta file.
. use hhid cons q1a sexh hhsize using ERHScons1999.dta

We can also limit the data we read in to male-headed households.


. use hhid cons q1a sexh hhsize if sexh==1 using ERHScons1999.dta

Organize data
sort
The sort command arranges the observations of the current data into ascending order based on the values of the variables listed. There is no limit to the number of variables in the variable list. Missing numeric values are interpreted as being larger than any other number, so they are placed last. When you sort on a string variable, however, null strings are placed first.
. sort hhid sexh cons

Variable ordering
The order command helps us to organize variables in a way that makes sense by changing the order of the variables. While there are several possible orderings that are logical, we usually put the id variable first, followed by the demographic variables, such as region, zone, gender, urban/rural. We will put the variables regarding the household total expenditure as follows.
. order hhid q1a q1b q1c q1d sexh ageh cons

Using _n and _N in conjunction with the by command can produce some very useful results. When used with by command, _N is the total number of observations within each group listed in by command, and _n is the running counter to uniquely identify observations within the group. To use the by command we must first sort our data on the by variable.


. sort group
. by group: generate n1=_n
. by group: generate n2=_N
. list

     +-----------------------------------+
     | score   group   id   nt   n1   n2 |
     |-----------------------------------|
  1. |    72       1    1    7    1    4 |
  2. |    85       1    7    7    2    4 |
  3. |    76       1    3    7    3    4 |
  4. |    90       1    6    7    4    4 |
  5. |    84       2    2    7    1    2 |
     |-----------------------------------|
  6. |    82       2    5    7    2    2 |
  7. |    89       3    4    7    1    1 |
     +-----------------------------------+

Now n1 is the observation number within each group and n2 is the total number of observations for each group. This is very useful in programming, especially for identifying duplicate observations. To use _n to find duplicated observations, we can type:
. sort id
. list if id == id[_n+1]

To use _N to identify duplicated observations, use:


. sort group score
. by group score: generate ngroup=_N
. list if ngroup>1

If there are a lot of variables in the data set, it could take a long time to type them all out twice. We can make use of the * and ? wildcards to indicate that we wish to use all the variables. Further, we can combine the sort and by commands into a single bysort statement. Below is a simplified version of the code that will yield the same kind of results as above.
. bysort *: generate nn=_N
. list if nn>1

Create one data set from two or more data sets


Appending data files
We can create a new dataset with the append command, which concatenates two datasets, that is, sticks them together vertically, one after another. Suppose we are given one file with data for the rural households (called rural.dta) and a file for the urban households (called urban.dta); we need to combine these files to be able to analyze them together. The example below appends three rounds of the ERHS data in the same way.

. use ERHS1999.dta, clear
. append using ERHS1997.dta
. append using ERHS1995.dta

The append command does not require that the two datasets contain the same variables, even though this is typically the case. It is highly recommended, however, to use identical variable lists when appending, to avoid creating missing values for variables that exist in only one dataset.
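A common practice (a sketch; the variable name round is our own) is to tag each file before appending so that every observation can be traced back to its source round:

. use ERHS1999.dta, clear
* tag the observations from the 1999 round
. generate round = 1999
. append using ERHS1997.dta
* observations just appended still have a missing round
. replace round = 1997 if missing(round)
. append using ERHS1995.dta
. replace round = 1995 if missing(round)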

One-to-one match merging


Another way of combining data files is match merging. The merge command sticks two datasets together horizontally, one next to the other. Before any merge, both datasets must be sorted by the identical merge variable(s). Assume we are working on the ERHS 1999 data and have been given two files: one with all the consumption information from own production (called p2sec9a.dta) and the other with community prices by wereda (called p_r5.dta). Both data sets have been cleaned and sorted by hhid and item1234. We would like to merge the two files by hhid and item1234.
. use p2sec9a.dta, clear
. list
. sort hhid item1234
. save consumption.dta, replace
. use p_r5, clear
. list
. sort hhid item1234
. save comprice.dta, replace

. use consumption.dta, clear
. merge hhid item1234 using comprice.dta

After the merge command, a _merge variable appears. The _merge variable indicates, for each observation, how the merge went; this is especially useful for identifying mismatched records. _merge can take one of three values when merging file A using file B:

_merge==1    the record contains information from the master file (A) only
_merge==2    the record contains information from the using file (B) only
_merge==3    the record contains information from both files

When there are many records, tabulating _merge is very useful for summarizing how many mismatched observations you have. In this case, most records do not match: only 1,551 observations (10.42 percent) carry information from both files.
. tab _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      3,605       24.21       24.21
          2 |      9,732       65.37       89.58
          3 |      1,551       10.42      100.00
------------+-----------------------------------
      Total |     14,888      100.00
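After inspecting the tabulation, a typical next step (not part of the original output) is to keep only the matched records and remove the marker variable before saving:

* keep observations found in both files
. keep if _merge==3
* _merge must be dropped before the next merge anyway
. drop _merge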


One-to-many match merging


Another kind of merge is a one-to-many merge. Say we have one data file, household.dta, that contains household information, and another data file, individual.dta, that contains information on each individual in the household. If we merge household.dta with individual.dta, there can be multiple individuals per household, hence a one-to-many merge. The strategy for the one-to-many merge is really the same as for the one-to-one match merge.
. use household.dta, clear
. list
. sort hhid
. save h1.dta, replace
. use individual.dta, clear
. list
. sort hhid
. save h2.dta, replace

. use h1.dta, clear
. merge hhid using h2.dta

The order of the files to be merged makes no difference to the results; only the order of the records after the merge changes.

Label data
Besides giving labels to variables, we can also label the data set itself so that we will remember what the data are. The label data command places a label on the whole dataset.
. label data "relabeled household"

We can also add some notes to the data set. The notes: command (note the colon, :) allows you to place notes into the dataset.
. notes hhsize: the variable fsize was renamed to hhsize

Typing notes by itself displays all the notes in the data set.


. notes

Collapse
Sometimes we have data files that need to be aggregated at a higher level to be useful. For example, we have household data but we are really interested in regional data. The collapse command serves this purpose by converting the dataset in memory into a dataset of means, sums, medians, or percentiles. Note that the collapse command creates a new dataset: all household information disappears and only the specified variable aggregations remain at the region level. The resulting summary table can be viewed with the edit command. For instance, we would like to see the mean cons in each q1a and sex of household head.
. collapse (mean) cons, by(q1a sex)
. edit

     regco   urban   tot_exp
         1       0    12.067
         1       1    14.899
         2       0    13.022
         2       1    17.849
         3       0    11.612
         3       1    16.507
         4       0    13.324
         4       1    17.790
         5       0    15.152
         5       1    22.627
         6       0    11.890
         6       1    18.261
         7       0    12.313
         7       1    18.591
        12       0    10.851
        12       1    19.714
        13       0    19.528
        13       1    20.021
        14       0    21.568
        14       1    30.597
        15       0    16.627
        15       1    19.574

However, this table is not easy to interpret; we can call it a long format since the data for urban and rural are listed vertically. We will use the reshape command to convert it into a wide format where the rural and urban values are arranged horizontally in a two-way table. The reshape wide command tells Stata that we want to go from long to wide. The i() option specifies the row variable while j() specifies the column variable.
. reshape wide tot_exp, i(regco) j(urban)
(note: j = 0 1)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       22   ->   11
Number of variables                   3   ->   3
j variable (2 values)             urban   ->   (dropped)
xij variables:
                                tot_exp   ->   tot_exp0 tot_exp1
-----------------------------------------------------------------------------

The converted table is a two-way table.


     regco   tot_exp0   tot_exp1
         1     12.067     14.899
         2     13.022     17.849
         3     11.612     16.507
         4     13.324     17.790
         5     15.152     22.627
         6     11.890     18.261
         7     12.313     18.591
        12     10.851     19.714
        13     19.528     20.021
        14     21.568     30.597
        15     16.627     19.574

If needed, the table can be converted back into the long form by reshape long.
. reshape long tot_exp, i(regco) j(urban)

The collapse and reshape commands are examples of the power and simplicity of Stata in its ability to shape data files.
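Because collapse replaces the data in memory, it is often wrapped in preserve and restore so that the household-level data can be recovered afterwards. A sketch using the same collapse call as above:

* snapshot the household-level data
. preserve
. collapse (mean) cons, by(q1a sex)
. list
* bring the original household-level data back
. restore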

Section 15: Advanced Programming


Besides simple one-line commands, we can always get more out of Stata through more sophisticated programming.

Looping
Consider the sample program below, which reads in income data for twelve months.
input famid inc1-inc12
1 3281 3413 3114 2500 2700 3500 3114 3319 3514 1282 2434 2818
2 4042 3084 3108 3150 3800 3100 1531 2914 3819 4124 4274 4471
3 6015 6123 6113 6100 6100 6200 6186 6132 3123 4231 6039 6215
end

Say that we want to compute the amount of tax (10%) paid for each month, which means computing 12 new variables by multiplying each of the inc* variables by 0.10. There is more than one way to execute part of your do-file more than once.

1. The simplest way is to use 12 generate commands.
generate taxinc1  = inc1  * .10
generate taxinc2  = inc2  * .10
generate taxinc3  = inc3  * .10
generate taxinc4  = inc4  * .10
generate taxinc5  = inc5  * .10
generate taxinc6  = inc6  * .10
generate taxinc7  = inc7  * .10
generate taxinc8  = inc8  * .10
generate taxinc9  = inc9  * .10
generate taxinc10 = inc10 * .10
generate taxinc11 = inc11 * .10
generate taxinc12 = inc12 * .10

2. Another way to compute the 12 variables is to use the foreach command. In the example below, we use foreach to cycle through the variables inc1 to inc12 and compute the taxable income as taxinc1-taxinc12.
foreach var of varlist inc1-inc12 {
    generate tax`var' = `var' * .10
}


The initial foreach statement tells Stata that we want to cycle through the variables inc1 to inc12 using the statements surrounded by the curly braces. Note that the opening curly brace must appear at the end of the foreach command line. The first time we cycle through the statements, the value of var will be inc1; the second time, inc2; and so on until the final iteration, where the value of var will be inc12. Each statement within the loop (in this case, just the one generate statement) is evaluated and executed. Inside the foreach loop, we access the value of var by surrounding it with the funny quotation marks, like this: `var'. The ` is the quote right below the ~ on your keyboard and the ' is the quote below the " on your keyboard. The first time through the loop, `var' is replaced with inc1, so the statement
generate tax`var' = `var' * .10

becomes
generate taxinc1 = inc1 * .10

This is repeated for inc2 and then inc3 and so on until inc12. So this foreach loop is the equivalent of executing the 12 generate commands manually, but much easier and less error prone.

3. The third way is to use a while loop. First we define a Stata local macro that serves as the loop counter. Similar to the foreach example, the code refers to the counter through the local macro `i'.
local i=1
while `i'<=12 {
    generate taxinc`i'=inc`i'*0.10
    local i=`i'+1
}

The local variable i can be seen as a counter, and the while condition states how many times the commands within the loop will be repeated: it basically says keep going until the counter passes the limit of 12. Note that the opening curly brace must appear at the end of the while command line. All commands between the curly braces are executed each time the system goes through the while loop. So first the statement
generate taxinc`i'=inc`i'*0.10

becomes
generate taxinc1=inc1*0.10

The counter value is then increased by 1 unit. Note that the fourth line means the value of the local variable i will be increased by 1 from its current value stored in `i'.
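A fourth, more compact alternative (not shown in the original) is Stata's forvalues loop, which builds the counter and the increment into a single statement and produces the same 12 variables:

forvalues i = 1/12 {
    /* `i' takes the values 1, 2, ..., 12 in turn */
    generate taxinc`i' = inc`i' * 0.10
}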


SECTION 16: TROUBLESHOOTING AND UPDATE


The help command followed by a Stata command brings up the on-line help system for that command. It can be used from the command line or from the help window. With help you must spell the full name of the command completely and correctly.
. help regress

Typing help contents will list all commands that can be accessed through the help command.
. help contents

The search command looks for the term in help files, Stata Technical Bulletins and Stata FAQs. It can be used from the command line or from the help window.
. search logit

The findit command can be used to search the Stata site and other sites for Stata-related information, including ado files. Say that we are interested in panel data; we can search for related materials from within Stata by typing
. findit panel data

The Stata viewer window appears and we are shown a number of resources related to this keyword. Stata is composed of an executable file and official ado files. Ado stands for automatically loaded do-file. An ado file is a Stata command created by users like you. Once installed on your computer, ado files work pretty much the same way as built-in Stata commands. Stata files are regularly updated. It is important to make sure that you are always running the most up-to-date Stata, so please update regularly. The update command reports on the current update level and installs official updates to Stata. It helps users stay up to date with the latest Stata ado and executable files; update ado copies and installs the ado files into the directory specified.
. update . update ado, into(d:\ado)

You can keep track of all the user-written ado files that you have added over time with the ado command, which lists all of them, with information on where each came from and what it does.
. ado
[1] package spost9_ado from http://www.indiana.edu/~jslsoc/stata
    spost9_ado Stata 9 commands for the post-estimation interpretation of ...

[2] package st0081 from http://www.stata-journal.com/software/sj5-1
    SJ5-1 st0081. Visualizing main effects and interactions...

These ado files can be deleted with the ado uninstall command.


. ado uninstall st0081
package st0081 from http://www.stata-journal.com/software/sj5-1
    SJ5-1 st0081. Visualizing main effects and interactions...
(package uninstalled)


Helpful Sources
http://www.stata.com/
http://www.stata.com/statalist/
Statalist is hosted at the Harvard School of Public Health. It is an email listserver where Stata users, from experts who write Stata programs to everyday users like us, maintain a lively dialogue about all things statistical and Stata. You can sign on to Statalist so that you can receive as well as post questions through email.
http://ideas.repec.org/s/boc/bocode.html
http://www.princeton.edu/~erp/stata/main.html
http://www.cpc.unc.edu/services/computer/presentations/statatutorial/
http://www.ats.ucla.edu/stat/stata/

