DEPARTMENT OF BIOSTATISTICS

UNIVERSITY OF COPENHAGEN

Reading and Manipulating data

Thomas Scheike

September 22, 2014

Data manipulation in Stata

Importing, exporting
Reading csv, fixed format files
Listing, describing, viewing the data
Labels for working with the data
Computing new variables and groupings
Transposing/Merging
Missing data in Stata
Deleting variables/cases

Data examples

sperm.asc data set:

obsnr = observation number
year = year of sperm sample
n = number of subjects in sample
meanage = mean age of subjects - 30 years
meanabs = mean number of days of abstinence - 7 days
meanvol = mean volume of ejaculate, ml
meanct = mean sperm concentration, mill/ml
medct = median sperm concentration, mill/ml
usa = 1 if sample is from the US, 0 if sample is from Europe

saseko.csv data set:

obs = observation number
abstid = days of abstinence
alder = age of subject
s1e2 = 1 for sas, 2 for ecological farmer
konc = sperm concentration, mill/ml
vol = volume of ejaculate, ml

Import data

Stata can read certain formats:

Excel sheets (xls, xlsx) [import excel ...]
text data (delimited, csv, ...) [insheet]
text data (fixed columns) [infix]
text data (fixed columns, with a dictionary) [infile]
unformatted text data
SAS xport [import sasxport]
ODBC data
XML data
Stata files (dta) [use]

Here we illustrate Excel and text (csv, fixed width) data. For other formats, programs like Stat/Transfer can convert most formats to most formats.

Typically, data come from some existing database. csv is often an import/export possibility from most systems.

In reality one sometimes needs to take data to a different statistical program because the analysis is only available there.

SPSS can read/write Stata files
R can read/write Stata files (write.dta)
Stat/Transfer (program for moving between formats)
// to read from R
saveold oldsperm

Windows:

ssc install usespss
ssc install usesas
cd data
usespss using mydata.sav
usesas using mydata.sas7bdat

cd data
import excel using mini-data.xlsx
describe
list

/home/ifsv/bhd252/undervis/stata-course/thomas/data

Contains data
  obs:             4
 vars:             5
 size:            68
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
A               byte    %10.0g
B               double  %10.0g
C               byte    %10.0g
D               str6    %9s
E               byte    %10.0g
------------------------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved

     +---------------------------+
     | A     B    C        D   E |
     |---------------------------|
  1. | 1     2   99   anders   1 |
  2. | 2   2.5   30   anders   1 |
  3. | 3     5    5   thomas   2 |
  4. | 4   4.5   25   thomas   2 |
     +---------------------------+
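The import above uses the default column names A to E. A minimal sketch of a round trip that instead takes variable names from the first row of the sheet, assuming Stata 12 or newer; the file name mini-example.xlsx and the variables a and b are hypothetical and only serve the illustration:

```stata
* Build a small data set (hypothetical values) and write it to Excel
clear
input a b
1 2
2 2.5
3 5
end
export excel using mini-example.xlsx, firstrow(variables) replace

* Read it back, taking variable names from the first row of the sheet
import excel using mini-example.xlsx, firstrow clear
describe
list
```

Without the firstrow option the names would be read as data, and the columns would become string variables called A and B.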

import excel using mini-data.xlsx, sheet("Ark 1")

import sasxport sperma.xpt
insheet using saseko.csv, comma clear
describe

(7 vars, 196 obs)

Contains data
  obs:           196
 vars:             7
 size:         3,528
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
v1              int     %8.0g
obs             int     %8.0g
abstid          float   %9.0g
alder           byte    %8.0g
s1e2            byte    %8.0g
konc            float   %9.0g
vol             float   %9.0g
------------------------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved

list in 1/5

     +-----------------------------------------------+
     | v1   obs   abstid   alder   s1e2   konc   vol |
     |-----------------------------------------------|
  1. |  1     1        .      26      1      0   2.7 |
  2. |  2     2        4      44      1     26     5 |
  3. |  3     3        4      36      1     12   5.5 |
  4. |  4     4        4      40      1     83   4.5 |
  5. |  5     5        3      37      1     36     2 |
     +-----------------------------------------------+

For a file with another delimiter, e.g. semicolon:

clear
insheet using your_file.csv, delimiter(";")
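The delimiter() option can be tried without an external file. A minimal sketch, assuming a hypothetical file name tiny-semicolon.csv and made-up values in the style of saseko.csv:

```stata
* Write a small semicolon-separated file (hypothetical data)
clear
input obs alder konc
1 26 0
2 44 26
3 36 12
end
outsheet using tiny-semicolon.csv, delimiter(";") replace

* Read it back, telling insheet which delimiter to expect
insheet using tiny-semicolon.csv, delimiter(";") clear
describe
list
```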

clear
infix obsnr 1-5 year 6-11 n 12-17 meanage 18-26 meanabs ///
    27-35 meanvol 36-44 meanct 45-53 medct 54-59 usa ///
    60-63 2 first using sperm.asc
format meanage meanabs meanvol meanct medct %4.2f
drop obsnr
list in 1/5

(62 observations read)

     +-----------------------------------------------------------------+
     | year     n   meanage   meanabs   meanvol   meanct   medct   usa |
     |-----------------------------------------------------------------|
  1. | 1938   200     -1.00     -2.00      3.00   120.63       .     1 |
  2. | 1941    22     31.50     -2.00      2.98   107.00       .     1 |
  3. | 1943    25     -1.00     -2.00      4.50    66.90   66.40     1 |
  4. | 1944    50     -1.00     -2.00      3.19    85.70       .     0 |
  5. | 1945   100     22.00     -2.00      3.40   134.00       .     1 |
     +-----------------------------------------------------------------+
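The column specification of infix can be tried on a tiny file. A minimal sketch, assuming a hypothetical file tiny.asc with a header line followed by two fixed-width records (year in columns 1-4, n in columns 5-9):

```stata
* Write a small fixed-format text file (hypothetical content)
tempname fh
file open `fh' using tiny.asc, write text replace
file write `fh' "year    n" _n
file write `fh' "1938  200" _n
file write `fh' "1941   22" _n
file close `fh'

* "2 first" tells infix that the data start on line 2 of the file
clear
infix 2 first year 1-4 n 5-9 using tiny.asc
list
```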

infix using sperm.dct, clear
des

infix dictionary using sperm.asc {
    2 first
    obsnr 1-5
    year 6-11
    n 12-17
    meanage 18-26
    meanabs 27-35
    meanvol 36-44
    meanct 45-53
    medct 54-59
    usa 60-63
}
(62 observations read)

Contains data
  obs:            62
 vars:             9
 size:         2,232
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
obsnr           float   %9.0g
year            float   %9.0g
n               float   %9.0g
meanage         float   %9.0g
meanabs         float   %9.0g
meanvol         float   %9.0g
meanct          float   %9.0g
medct           float   %9.0g
usa             float   %9.0g
------------------------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved

use sperm.dta, clear

Dictionaries can be quite elaborate, and may contain more information.

If usa were a string with information about region, the storage type goes before the variable name:

medct 54-59
str usa 60-63

Export data

Stata data can be saved in different formats.

To save in a new file (NewVersion.dta):

save NewVersion

To replace the old version:

save, replace

Looking at the data

Browse, edit, list, describe

use sperm.dta, clear
list in 1/3
sort age
list if usa==1
list in -3/l
list in -3/-1
describe
edit
browse

Setting up the data

Output formats

format age %6.2g
format age %6.2f
format age %7.5e

Labels

label data "Sperm concentration data"
label variable age "age in years"
label variable catage "Age in age groups"
// only extended missing values (.a, .b, ...) can carry a value label, not "."
label define labcatage 1 "Young" 2 "Medium" 3 "Old" .a "Missing"
label values catage labcatage
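The label commands above can be tried on a small data set. A minimal sketch with made-up ages and hypothetical cut points at 30 and 60 years, where the extended missing value .a carries the "Missing" label:

```stata
* Hypothetical ages; the last one is an extended missing value
clear
input age
19
35
70
.a
end

* Group the ages; the cut points 30 and 60 are assumptions for illustration
generate byte catage = 1 if age < 30
replace catage = 2 if age >= 30 & age < 60
replace catage = 3 if age >= 60 & age < .
replace catage = .a if age == .a

* Attach variable and value labels
label variable catage "Age in age groups"
label define labcatage 1 "Young" 2 "Medium" 3 "Old" .a "Missing"
label values catage labcatage
tabulate catage, missing
```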

Missing values

Missing values are generally represented by "." or, more generally, by ".a", ".b" and so forth. This gives a way of having different types of missing values.

mvdecode age year income, mv(97=. \ 98=.a \ 99=.b)

The missing() function returns 1 if any of its arguments is missing.

replace age=0 if missing(age)
replace age=0 if missing(age,sex,land)
Sorting

sort land age income
sort age
gsort +age
gsort -age

Changing the order of variables:

order age income

Computing new variables

generate loginc= ln(income)
label variable loginc "log-income"
egen mincome = mean(income)
egen sdin=std(income)
generate caseid=_n
replace income=income/1000
egen agesex= group(agecat sex)
replace income=100 in 3

loginc is a new variable in the data. egen creates a "global" variable, such as the mean over all observations, that can be used for computation and in definitions of new variables (attached to the data).

Egen functions

help egen
use sperm.dta, clear
egen agg=rowmean(meanct meanvol)
egen tot=rowtotal(meanct meanvol)
gen tots=meanct+meanvol
list meanct meanvol tot tots agg in 1/5

(1 missing value generated)
(1 missing value generated)

     +---------------------------------------------+
     | meanct   meanvol      tot     tots      agg |
     |---------------------------------------------|
  1. |      .         .        0        .        . |
  2. | 120.63      3.00   123.63   123.63   61.815 |
  3. | 107.00      2.98   109.98   109.98    54.99 |
  4. |  66.90      4.50     71.4     71.4     35.7 |
  5. |  85.70      3.19    88.89    88.89   44.445 |
     +---------------------------------------------+
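The first row of the listing shows that rowtotal() treats missing values as zero, while the + operator returns missing. A minimal sketch of the difference on made-up data (the variable names a, b, tot, totm, avg and plus are hypothetical):

```stata
* Hypothetical data with partly and fully missing rows
clear
input a b
1 2
3 .
. .
end

egen tot  = rowtotal(a b)            // missing counted as 0:     3, 3, 0
egen totm = rowtotal(a b), missing   // all-missing rows stay .:  3, 3, .
egen avg  = rowmean(a b)             // mean of nonmissing:       1.5, 3, .
generate plus = a + b                // any missing gives .:      3, ., .
list
```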

Egen by

help egen
use sperm.dta, clear
by usa, sort: egen meanreg=mean(meanct)
sort year usa
list meanct usa meanreg in 1/5

(1 missing value generated)

     +-------------------------+
     | meanct   usa    meanreg |
     |-------------------------|
  1. | 120.63     1   85.32429 |
  2. | 107.00     1   85.32429 |
  3. |  66.90     1   85.32429 |
  4. |  85.70     0   80.05758 |
  5. | 134.00     1   85.32429 |
     +-------------------------+

Recoding of continuous variables into groups

generate byte agecat=21 if age<=21
replace agecat=38 if age>21 & age<=38
replace agecat=64 if age>38 & age<=64
replace agecat=75 if age>64 & age<.

gen agecat2=recode(age,21,38,64,100)

generate agecat3=autocode(age,4,0,1000)
tabulate agecat3, missing

generate age21b = age>=21 if age<.
Income in groups of 10000:

gen lb = floor(inc/10000) * 10000
tabulate lb

recode age (15/19=1) (20/24=2) (25/29=3) (30/34=4) ///
    (35/39=5) (40/44=6) (45/49=7), gen(age5)

recode effort (0/4=1 Weak) (5/14=2 Moderate) (15/max=3 Strong), ///
    generate(effortg) label(effortg)

use sperm.dta, clear
forvalues bot = 20(10)40 {
    local top = `bot' + 10
    gen cat`bot' = meanct >= `bot' & meanct <= `top'
}
list meanct cat20 cat30 in 1/5

     +------------------------+
     | meanct   cat20   cat30 |
     |------------------------|
  1. |      .       0       0 |
  2. | 120.63       0       0 |
  3. | 107.00       0       0 |
  4. |  66.90       0       0 |
  5. |  85.70       0       0 |
     +------------------------+
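Inside the loop the macro must be written with macro quotes, a left quote and an apostrophe around bot. A minimal sketch of the same loop on made-up concentrations; the extra condition that keeps missing values missing is an addition for illustration, not part of the loop above:

```stata
* Hypothetical concentrations, including a boundary value and a missing value
clear
input meanct
25
30
45
.
end

* One indicator per interval [bot, bot+10]; missing meanct stays missing
forvalues bot = 20(10)40 {
    local top = `bot' + 10
    generate byte cat`bot' = meanct >= `bot' & meanct <= `top' if !missing(meanct)
}
list
```

Because both limits are inclusive, the boundary value 30 falls in cat20 as well as cat30.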

use sperm.dta, clear
egen agecat1 = cut(meanage), at(-1,0,3,10,20)
egen agecat2 = cut(age), group(4)
list meanage agecat1 agecat2 in 1/5

(37 missing values generated)
(37 missing values generated)

     +-----------------------------+
     | meanage   agecat1   agecat2 |
     |-----------------------------|
  1. |       .         .         . |
  2. |   -1.00        -1         3 |
  3. |   31.50         .         . |
  4. |   -1.00        -1         3 |
  5. |   -1.00        -1         3 |
     +-----------------------------+
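In the listing, meanage 31.50 gets a missing agecat1 because it lies outside the break points given in at(). A minimal sketch of how egen cut() assigns groups, on made-up values with hypothetical variable names x, xcat and xcode:

```stata
* Hypothetical values on and between the break points 0, 10, 20
clear
input x
0
5
10
19
20
end

* Intervals are closed on the left: [0,10) and [10,20)
egen xcat  = cut(x), at(0,10,20)          // 0, 0, 10, 10, .
egen xcode = cut(x), at(0,10,20) icodes   // 0, 0, 1, 1, .
list
```

The value 20 equals the last break point and therefore falls outside every interval.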

Recoding of factors.

// siblings string "1","2","3","4 or more"
generate sibnum=real(siblings)
// sibnum=1,2,3,.

destring siblings, generate(sibnum2) force

encode string, gen(stringnum)
decode stringnum, gen(stringS)
After reading data, some manipulations are sometimes needed.

Generating a numeric variable, where strings that are not real numbers become missing ".":

generate newvar= real(siblings)

Recoding

generate effortg = effort
recode effortg 0/4=1 5/14=2 15/max=3
label define effortg 1 "Weak" 2 "Moderate" 3 "Strong"
label values effortg effortg
label variable effortg "Family Planning Effort (Grouped)"

Replace

replace income= income/1000
replace age = 29 if age > 30 & land=="danmark"
replace age = 29 if missing(age) | cohort > 2014

Deleting variables and cases

drop age land
drop in 12/13
keep age land

Using scalars to store results for session

scalar q1=r(r1)
scalar q2=r(r2)
scalar list
scalar drop q1 q2
help scalar

Transposing/Reshaping the data

Long to wide

use long-data.dta, clear
sort id
list in 1/3

(Written by R.)

     +-------------------+
     | id   age   height |
     |-------------------|
  1. |  1     9      130 |
  2. |  1    10      140 |
  3. |  1    11      148 |
     +-------------------+

gen index=1
gen testid=(id==id[_n-1])
replace index=index[_n-1]+1 if id==id[_n-1]
list in 1/3

(4 real changes made)

     +------------------------------------+
     | id   age   height   index   testid |
     |------------------------------------|
  1. |  1     9      130       1        0 |
  2. |  1    10      140       2        1 |
  3. |  1    11      148       3        1 |
     +------------------------------------+

drop testid
reshape wide age height, i(id) j(index)
list

(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                        6   ->       2
Number of variables                   4   ->       7
j variable (3 values)             index   ->   (dropped)
xij variables:
                                    age   ->   age1 age2 age3
                                 height   ->   height1 height2 height3
-----------------------------------------------------------------------------

     +---------------------------------------------------------+
     | id   age1   height1    age2   height2    age3   height3 |
     |---------------------------------------------------------|
  1. |  1      9       130      10       140      11       148 |
  2. |  2    9.5       120   10.23       125   10.78       130 |
     +---------------------------------------------------------+
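The within-subject index can also be built with a by-group counter instead of the lagged replace. A minimal sketch that recreates the six observations shown in the reshape output and goes from long to wide and back; using bysort with _n is a swapped-in technique, not the approach of the listing above:

```stata
* The six long-format observations shown in the output above
clear
input id age height
1 9 130
1 10 140
1 11 148
2 9.5 120
2 10.23 125
2 10.78 130
end

* Number the measurements within each subject, ordered by age
bysort id (age): generate index = _n

* Long to wide: one row per subject, columns age1-age3 and height1-height3
reshape wide age height, i(id) j(index)
list

* Wide to long again: one row per subject and measurement
reshape long age height, i(id) j(index)
list
```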

Wide to Long

reshape long age height, i(id) j(index)
list in 1/3

(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        2   ->       6
Number of variables                   7   ->       4
j variable (3 values)                     ->   index
xij variables:
                         age1 age2 age3   ->   age
                height1 height2 height3   ->   height
-----------------------------------------------------------------------------

     +---------------------------+
     | id   index   age   height |
     |---------------------------|
  1. |  1       1     9      130 |
  2. |  1       2    10      140 |
  3. |  1       3    11      148 |
     +---------------------------+

use wide-data.dta, clear
list
reshape long age height, i(id) j(index)
* variables need to be on specific form
rename (Jan-Dec) temp#, addnumber

Merging files

Can append a file with the same structure.

merge 1:1 (alternatively merge 1:m or m:1)

Add additional variables based on the "id" merge variable.

use merge1.dta, clear
list

(Written by R.)

     +--------------------+
     | id   region    bmi |
     |--------------------|
  1. |  1        1   22.3 |
  2. |  2        1   28.3 |
     +--------------------+

use merge2.dta, clear
list

(Written by R.)

     +-------------------+
     | id   X_by   alder |
     |-------------------|
  1. |  1      2      27 |
  2. |  2      .      17 |
  3. |  3     28      17 |
     +-------------------+

use merge3.dta, clear
list

(Written by R.)

     +----------------------------+
     | id   X_by   alder   region |
     |----------------------------|
  1. |  1      2       2        1 |
  2. |  2      .      17        2 |
  3. |  3     28      17        1 |
     +----------------------------+

use merge4-addid.dta, clear
list

(Written by R.)

     +--------------------+
     | id   region    bmi |
     |--------------------|
  1. |  5        2   12.3 |
  2. |  7        2   18.3 |
     +--------------------+

use merge1.dta, clear
merge 1:1 id using merge2.dta
list

(Written by R.)

    Result                           # of obs.
    -----------------------------------------
    not matched                             1
        from master                         0  (_merge==1)
        from using                          1  (_merge==2)

    matched                                 2  (_merge==3)
    -----------------------------------------

     +----------------------------------------------------+
     | id   region    bmi   X_by   alder           _merge |
     |----------------------------------------------------|
  1. |  1        1   22.3      2      27      matched (3) |
  2. |  2        1   28.3      .      17      matched (3) |
  3. |  3        .      .     28      17   using only (2) |
     +----------------------------------------------------+
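The same 1:1 merge can be reproduced without the .dta files. A minimal sketch that types in the values listed for merge1 and merge2 and keeps the second data set in a temporary file (the macro name extra is hypothetical); it is meant to be run as a do-file, since tempfile names are local macros:

```stata
* Values listed for merge2, stored in a temporary file
clear
input id X_by alder
1 2 27
2 . 17
3 28 17
end
tempfile extra
save `extra'

* Values listed for merge1 act as the master data
clear
input id region bmi
1 1 22.3
2 1 28.3
end

* Match on id; _merge records where each observation came from
merge 1:1 id using `extra'
tabulate _merge
list
```

Observations with _merge==2 exist only in the using file, so their master variables (region, bmi) are missing.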
use merge2.dta, clear
merge 1:1 id using merge1.dta
list

(Written by R.)

    Result                           # of obs.
    -----------------------------------------
    not matched                             1
        from master                         1  (_merge==1)
        from using                          0  (_merge==2)

    matched                                 2  (_merge==3)
    -----------------------------------------

     +-----------------------------------------------------+
     | id   X_by   alder   region    bmi            _merge |
     |-----------------------------------------------------|
  1. |  1      2      27        1   22.3       matched (3) |
  2. |  2      .      17        1   28.3       matched (3) |
  3. |  3     28      17        .      .   master only (1) |
     +-----------------------------------------------------+

use merge1, clear
append using merge4-addid, gen(oldnew)
list

(Written by R.)

     +-----------------------------+
     | id   region    bmi   oldnew |
     |-----------------------------|
  1. |  1        1   22.3        0 |
  2. |  2        1   28.3        0 |
  3. |  5        2   12.3        1 |
  4. |  7        2   18.3        1 |
     +-----------------------------+
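The append step can likewise be reproduced from typed-in values. A minimal sketch using the values listed for merge1 and merge4-addid, with the second data set kept in a temporary file (the macro name newids is hypothetical); run it as a do-file:

```stata
* Values listed for merge4-addid, stored in a temporary file
clear
input id region bmi
5 2 12.3
7 2 18.3
end
tempfile newids
save `newids'

* Values listed for merge1 are the data in memory
clear
input id region bmi
1 1 22.3
2 1 28.3
end

* Stack the new observations below; oldnew is 0 for the data in memory, 1 for appended rows
append using `newids', generate(oldnew)
list
```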

webuse dollars, clear
list
webuse sforce
list
merge m:1 region using http://www.stata-press.com/data/r13/dollars
list

(Regional Sales & Costs)

     +-----------------------------+
     |  region     sales      cost |
     |-----------------------------|
  1. | N Cntrl   419,472   227,677 |
  2. |      NE   360,523   138,097 |
  3. |   South   532,399   330,499 |
  4. |    West   310,565   165,348 |
     +-----------------------------+

(Sales Force)

     +--------------------+
     |  region       name |
     |--------------------|
  1. | N Cntrl     Krantz |
  2. | N Cntrl     Phipps |
  3. | N Cntrl     Willis |
  4. |      NE    Ecklund |
  5. |      NE     Franks |
     |--------------------|
  6. |   South   Anderson |
  7. |   South    Dubnoff |
  8. |   South        Lee |
  9. |   South     McNeil |
 10. |    West    Charles |
     |--------------------|
 11. |    West       Cobb |
 12. |    West      Grant |
     +--------------------+

file http://www.stata-press.com/data/r13/dollars.dta not Stata format
r(610);

     +--------------------+
     |  region       name |
     |--------------------|
  1. | N Cntrl     Krantz |
  2. | N Cntrl     Phipps |
  3. | N Cntrl     Willis |
  4. |      NE    Ecklund |
  5. |      NE     Franks |
     |--------------------|
  6. |   South   Anderson |
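The many-to-one merge can be tried without downloading anything. A minimal sketch that types in a few of the rows listed above and keeps the regional data in a temporary file (the macro name dollars is hypothetical); run it as a do-file:

```stata
* A few regional rows from the listing, one observation per region
clear
input str8 region sales
"NE" 360523
"West" 310565
end
tempfile dollars
save `dollars'

* Sales force: several people per region
clear
input str8 region str8 name
"NE" "Ecklund"
"NE" "Franks"
"West" "Cobb"
end

* m:1 because many master rows share one region row in the using data
merge m:1 region using `dollars'
list
```

Every person in a region receives that region's sales figure, so all three observations are matched.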