file, you need to clear the STATA memory. You can open a data file by typing use followed by the file name with its directory. Alternatively, you can open a file by pulling down the File menu and choosing Open. After opening a file, you can discard the data from the STATA memory by typing clear again. Note that the original file is still in the same folder, so you can open the same file again if you like. Next, I open the same STATA file and save it into a different folder.
. clear
. use C:/Docs/FASID/Classes/Econometrics/wooldridge_data/WAGE1.DTA
. save C:/Docs/tmp/WAGE1.DTA
Note that I am saving this file in a different folder so that I do not replace the original data file. My advice to you is: do not replace original files! If you create new variables (such as a squared variable) and want to save them, save the file in a different folder or under a different file name. There is always a danger of replacing an original file with a new file that has fewer observations or variables.

Descriptive Statistics
Next, there are two commands to obtain descriptive information: describe and summarize. describe shows the types and definitions of the variables. This is especially helpful when you use a data file for the first time. summarize provides descriptive statistics of variables: means, standard deviations, minimums, and maximums. If you type summarize x, d(etail), you get detailed information about the variable x. Here is how they work:
. describe
Contains data from C:\Docs\FASID\Classes\Econometrics\wooldridge_data\WAGE1.DTA
  obs:           526
 vars:            24                          16 Sep 1996 15:52
 size:
--------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
wage            float  %8.2g                  average hourly earnings
educ            byte   %8.0g                  years of education
exper           byte   %8.0g                  years potential experience
lwage           float  %9.0g                  log(wage)
expersq         int    %9.0g                  exper^2
tenursq         int    %9.0g                  tenure^2
--------------------------------------------------------------------------
. summarize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        wage |     526    5.896103   3.693086        .53      24.98
        educ |     526    12.56274   2.769022          0         18
. summarize wage, d

                  average hourly earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1.67            .53
 5%         2.75           1.43
10%         2.92            1.5       Obs                 526
25%         3.33            1.5       Sum of Wgt.         526

50%         4.65                      Mean           5.896103
                        Largest       Std. Dev.      3.693086
75%         6.88          21.86
90%           10           22.2       Variance       13.63888
95%           13          22.86       Skewness       2.007325
99%           20          24.98       Kurtosis       7.970083
To obtain the frequencies of a categorical variable, you can use table. table can also give you descriptive statistics of other variables for each value of the categorical variable.
. table educ

----------------------
 years of |
education |      Freq.
----------+-----------
        0 |          2
        2 |          1
      ... |
       12 |        198
       13 |         39
      ... |
       18 |         19
----------------------
. table educ, c(mean wage sd wage min wage max wage n wage)

---------------------------------------------------------------------
 years of |
education | mean(wage)   sd(wage)  min(wage)  max(wage)    N(wage)
----------+----------------------------------------------------------
      ... |
       12 |    5.37136   3.092932        .53      22.20        198
       13 |    5.59897   3.026567       2.00      15.38         39
      ... |
       18 |    10.6789   5.913146       3.50      24.98         19
---------------------------------------------------------------------
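As a rough cross-check on the two table calls above, the same kind of summary can be sketched with tabulate and tabstat. The example below is only an illustration: it assumes STATA's bundled auto data set (loaded with sysuse) rather than WAGE1.DTA, so the variables rep78 and price stand in for educ and wage.

```stata
* Sketch only: uses the bundled auto data, not the WAGE1 file
sysuse auto, clear

* Frequencies of a categorical variable
tabulate rep78

* Mean, sd, min, max, and N of price for each value of rep78
tabstat price, by(rep78) statistics(mean sd min max n)
```

With the wage data, the corresponding calls would presumably be tabulate educ and tabstat wage, by(educ) statistics(mean sd min max n).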
Creating Variables
You can create variables by using generate, or gen for short:
. gen educsq=educ*educ
or
. gen educsq=educ^2
If you want to modify an existing variable, you need to use replace.
. replace female=2 if female==0
Here I have replaced the zeros in female with 2. So now, female is one for female workers and two for male workers, instead of zero for male workers. In STATA, you need to type = twice (==) to test whether the value of a variable is equal to something; a single = assigns a value. Other comparisons are >, >=, <=, and <. These are, respectively, larger than, larger than or equal to, smaller than or equal to, and smaller than.
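The comparison operators described above can be tried out with count, which reports how many observations satisfy a condition. This is a minimal sketch on STATA's bundled auto data (an assumption; it is not the wage file), and it also uses &, |, and !=, which the text does not cover.

```stata
* Sketch only: uses the bundled auto data, not the WAGE1 file
sysuse auto, clear

* == tests equality, while a single = assigns a value
count if foreign == 1
count if price >= 10000

* & means "and", | means "or", != means "not equal to"
count if foreign == 1 & mpg > 25
count if rep78 != 3 & !missing(rep78)
```

The !missing(rep78) condition is there because STATA treats a missing value as larger than any number, so comparisons can pick up missing observations unless you exclude them.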
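Now, because female is no longer a 0/1 dummy variable, I create a new dummy variable by typing:
. gen women=0
. replace women=1 if female == 1
Neat, isn't it?

OLS estimations
It is very easy to estimate OLS models in STATA. The first variable after regress is the dependent variable, and the rest are the explanatory variables.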
. regress y x1 x2 x3 x4 x5
Here is an example:

. regress lwage educ exper expersq female married northcen south west

      Source |       SS       df       MS              Number of obs =     526
                                                        F(  8,   517) =   45.95
                                                        Prob > F      =  0.0000
                                                        R-squared     =  0.4156
------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0808207   .0070101    11.53   0.000      .067049    .0945925
       exper |   .0363615   .0052269     6.96   0.000     .0260929    .0466301
     expersq |   -.000645   .0001128    -5.72   0.000    -.0008665   -.0004235
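To see where the numbers in such a table end up, here is a small sketch that runs a regression and then displays a few stored results. It assumes the bundled auto data instead of WAGE1.DTA, and lprice is a hypothetical variable created only for this illustration.

```stata
* Sketch only: uses the bundled auto data, not the WAGE1 file
sysuse auto, clear
gen lprice = ln(price)

* Dependent variable first, then the explanatory variables
regress lprice mpg weight foreign

* Coefficients, standard errors, and summary results stay in memory
display _b[mpg]
display _se[mpg]
display e(N)
display e(r2)
```

Stored results like these are handy in do-files because you can reuse an estimate without retyping it from the output.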
Graphs (this section is for version 8)
It is sometimes a good idea to examine the data visually. Here, I explain just two types of graphs: histograms and two-way graphs. A histogram is useful for seeing frequencies, and a two-way graph is useful for examining the relationship between two variables.
. twoway histogram wage
[Figure: histogram of wage; vertical axis: Density]
When you have a discrete variable, you can say so in the command to get one column for each value of the variable:
[Figure: histogram with one column per value; horizontal axis: years of education]
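The command for this graph is not shown here. One way to draw such a histogram, as far as I can tell, is the discrete option of histogram. The sketch below assumes the bundled auto data, with rep78 standing in for educ; for the wage data the analogous call would presumably be histogram educ, discrete.

```stata
* Sketch only: uses the bundled auto data, not the WAGE1 file
sysuse auto, clear

* rep78 takes the integer values 1 to 5, so treat it as discrete
histogram rep78, discrete

* Same graph with frequencies instead of densities on the vertical axis
histogram rep78, discrete frequency
```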
When you want to examine a relationship between two variables, you can create a two-way graph by typing:
. graph twoway scatter wage educ
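[Figure: scatter plot of wage against years of education]
You can shorten the command to twoway scatter wage educ, or just scatter wage educ, to get the same graph. You can also include a fitted line by adding lfit wage educ. But because there are then two plot types, you need to specify them this way: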
. twoway (scatter wage educ) (lfit wage educ)
[Figure: scatter plot of average hourly earnings with fitted values, against years of education]
You can learn more about graphs in a STATA manual called Graphics.
Do-files execute commands recorded in them. By recording all of your commands in a do-file, you can keep a history of your work. This way, you can execute the exact same commands days or years later. You do not need to remember what you have done; you just need to remember the file names. (Actually this is not easy either. Occasionally, I spend many hours looking for old do-files. I recommend descriptive file names.)

Why do you need to use do-files?
Even though the advantages of using do-files become clear as you get used to them, you may think do-files are cumbersome at the beginning because you have to type every single command in them. There are three major reasons for using do-files: (i) it is easy to use do-files, (ii) you will be able to reproduce your results (even after many years), and (iii) you can communicate with your colleagues by exchanging do-files.
(i) You may not like typing all of your commands in do-files instead of using drag-and-click in STATA. However, once you remember some of the important commands, you can do most of your work. When necessary, you can look up the manuals or use the help command in STATA to learn about commands.
(ii) You will need to reproduce your results even after many months. For instance, your adviser may want you to modify your models. With do-files you can just make small changes and produce results according to your adviser's comments; you do not need to start from scratch every time you change specifications.
(iii) When you work with your colleagues, it is useful to share the same data sets and exchange do-files. As long as the data sets are the same, the same do-files will produce the same results. This way, your colleagues can check your work and make adjustments.
So let's start using do-files!

How to open a do-file
Just click File-Do. You can open existing do-files. If you click the icon with a pencil, a new do-file will show up.
How to execute do-files
After typing commands in a do-file, you can just click an icon with a lined-note. For instance, type the following commands in a do-file:
clear
use c:\docs\fasid\econometrics\homework\wage1.dta
sum wage
sum wage, d
table female
table female, c(mean wage)
Then click an icon with a lined-note. You will probably see an error message:
file c:\docs\fasid\econometrics\homework\wage1.dta not found
This is because you don't have the wage1.dta data file in the specified directory. But at least you know that the do-file has tried to execute your commands. Now, correct the directory and execute the do-file again. If you did not face any problems, you should find:

. sum wage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        wage |     526    5.896103   3.693086        .53      24.98
You have run a do-file. We will learn these two commands (sum and table) later. But for now, you should save the do-file by clicking File-Save As.
Commands You Need to Know
There is a note made by Wooldridge called Rudiments of STATA. That note explains most of the important commands, so I do not repeat them here. Instead, I will show you an example of a do-file:
Example 6-1
*This is a do-file, called how_to_STATA, for Lecture 6
clear
use c:\docs\fasid\econometrics\homework\wage1.dta
*log close
log using c:\docs\fasid\econometrics\homework\wage1.log, replace
*Describe the data
des
sum wage
sum wage, d
table female
table female, c(mean wage min wage max wage)
*Run OLS, predict logwage, and do F-test
reg logwage female educ exper expersq
predict yhat
test exper expersq
End of Example 6-1

One very useful command is this: *. This is called a star. It is not exactly a command because a star (*) does not execute any work. Instead, a star (*) prevents a command from executing. For instance, in the above do-file, the second star (*) is preventing the command log close from executing. I have left a star in front of log close because I do not want to execute this command yet. At this point there is no log file open. If I try to close a log-file (by saying log close), STATA will give me an error message and will not execute the other commands. Thus I leave the second star. After running this do-file once, a log-file will be open and will keep recording all the results in the STATA Results window. Thus from the second time on, I will delete the second star in front of log close. As you can see, the star (*) is very useful for temporarily preventing some commands from executing.

Another way of using a star (*) is to put notes in do-files. Sometimes, you want to leave notes in do-files to remind yourself or to explain things to your colleagues. Remember that you may need to open your do-files after many months or years, and you may not remember all the details about them at that time. From my experience, it is a good idea to leave some notes in your do-files, as I have done in this do-file.

Using log-files
As I mentioned above, a log-file records all the results displayed on the STATA screen. You can open a log-file in a word processor, such as Word. A font called Courier works best with STATA output.
When you need to replace an old log-file under the same name, you need to add replace after a comma:
log using c:\docs\fasid\econometrics\homework\wage1.log, replace
If you want to add new results at the end of an old log-file, you need to add append after a comma:
log using c:\docs\fasid\econometrics\homework\wage1.log, append
As I mentioned before, you can close a log-file by using
log close
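An alternative to starring out log close, which these notes do not use, is to put capture in front of it. capture suppresses the error when no log is open, so the same do-file can be run the first time and every time after. This is only a sketch: the log name mylog.log is hypothetical and the bundled auto data stands in for the wage file.

```stata
* Sketch of a do-file header; mylog.log is a hypothetical file name
* capture swallows the error that log close gives when no log is open
capture log close
log using mylog.log, replace

* Sketch only: bundled auto data, not the WAGE1 file
sysuse auto, clear
summarize price

* Close the log so the next run starts cleanly
log close
```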
gsort - income
This will sort observations from the largest to the smallest. You can also use more than one variable.
gsort female_head - income
This will sort observations from the largest to the smallest, separately for male- and female-headed households.
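To see what the two-variable sort does, here is a small sketch with four made-up households (the income values are hypothetical and typed in with input). Within each value of female_head, the largest income comes first.

```stata
* Sketch with hypothetical data: four households
clear
input female_head income
0 302
1 189
0 150
1 410
end

* Ascending in female_head, descending in income within each group
gsort female_head -income
list
```

After the sort, the two male-headed households (302, then 150) are listed before the two female-headed households (410, then 189).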
Aggregating the data
In surveys, information is collected at different units. For instance, a typical household survey collects information not only at the household level (e.g., How much does this household use?) but also at the individual level (e.g., How old is this person?). To combine information collected at different units, we need to either aggregate data up to a higher unit or merge data from a higher unit into data at a lower unit. For instance, we may need to aggregate data from the individual level up to the household level. In STATA, we can use collapse to create an aggregated data set. For instance, assume that we have demographic information at the individual level:

HHID  PersonID  Age  Gender
   1         1   42  Male
   1         2   37  Female
   1         3   10  Female
   2         1   28  Male
   2         2   24  Female
HHID is the ID number of the household to which each individual belongs; PersonID is the ID number of each individual; and Age and Gender are personal information. Suppose that we want to create a variable called HHsize that indicates the household size. To do so, I would first create HHsize as a variable that is one for all individuals:
gen HHsize = 1

HHID  PersonID  Age  Gender  HHsize
   1         1   42  Male         1
   1         2   37  Female       1
   1         3   10  Female       1
   2         1   28  Male         1
   2         2   24  Female       1
Then, I would aggregate the data up to the household level.
collapse (sum) HHsize, by(HHID)
collapse aggregates the data up to the level identified by the variable specified in by( ). In this example, I am aggregating the data up to the HHID level. We will get an aggregated data set that looks like this:

HHID  HHsize
   1       3
   2       2
Notice that all the other variables are eliminated. In addition to sums, you can also calculate means, standard deviations, maximums, minimums, medians, etc. For instance, you can calculate the average age and find the maximum age within each household by typing:
collapse (sum) HHsize (mean) Age (max) Agemax = Age, by(HHID)

HHID  HHsize   Age  Agemax
   1       3  29.7      42
   2       2    26      28
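The whole aggregation can be run as one short do-file. The sketch below types the five example individuals in with input, so it does not depend on any data file, and then applies the same gen and collapse commands.

```stata
* The five individuals from the example, typed in directly
clear
input HHID PersonID Age str6 Gender
1 1 42 Male
1 2 37 Female
1 3 10 Female
2 1 28 Male
2 2 24 Female
end

* One for every individual, so that the sum is the household size
gen HHsize = 1

* One row per household: size, mean age, and maximum age
collapse (sum) HHsize (mean) Age (max) Agemax = Age, by(HHID)
list
```

Gender disappears because it is not named in collapse, which matches the remark above that all other variables are eliminated.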
After creating an aggregated data set, you can combine it with another data set using an identifying variable. In the example, the identifying variable is HHID. Before merging this file with other data files at the household level, you need to sort the data according to the identifying variable. Thus,
sort HHID
save c:/data/tmp/hhsize, replace
Merging data files
To combine data from different files, we need to merge the files. Files must be sorted by the same identifying variable in the same order before merging. For instance, suppose that we have a data set of household income at the household level and want to bring in HHsize from a different file to create a per capita income variable, called PCincome. First, we need to open a base file. In this example, this is a file with household income:

HHID  income
   1     302
   2     189
sort HHID
merge HHID using c:/data/tmp/hhsize

HHID  income  HHsize   Age  Agemax  _merge
   1     302       3  29.7      42       3
   2     189       2    26      28       3
Thus, we have merged two data files at the household level (HHID).
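The merge command above uses the syntax of older STATA versions. From version 11 on, merge asks you to state the type of match (for example 1:1) and does not need the files to be sorted first. The sketch below uses that newer syntax as an assumption about your installation; the temporary file name hhsize is hypothetical, and the data are the two households from the example. It also creates PCincome, which the text mentions but does not show.

```stata
* Household sizes (the "using" file), saved to a temporary file
clear
input HHID HHsize
1 3
2 2
end
tempfile hhsize
save `hhsize'

* Household income (the base file)
clear
input HHID income
1 302
2 189
end

* Newer syntax (an assumption: STATA 11 or later); 1:1 = one row per HHID in both files
merge 1:1 HHID using `hhsize'
tabulate _merge

* Per capita income, as described in the text
gen PCincome = income / HHsize
list
```

A value of 3 in _merge means the household was found in both files; values of 1 or 2 would flag households that appear in only one of them.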