Sperm - Asc Data Set: Saseko - CSV Data Set

DEPARTMENT OF BIOSTATISTICS
UNIVERSITY OF COPENHAGEN
Reading and Manipulating data
Thomas Scheike
September 22, 2014

Data manipulation in Stata
Importing, exporting
Reading csv, fixed format files
listing, describing, viewing the data
labels for working with the data
computing new variables and groupings
Transposing/Merging
Missing data in Stata
deleting variables/cases
Data examples
sperm.asc data set:

obsnr = observation number
year = year of sperm sample
n = number of subjects in sample
meanage = mean age of subjects - 30 years
meanabs = mean number of days of abstinence - 7 days
meanvol = mean volume of ejaculate, ml
meanct = mean sperm concentration, mill/ml
medct = median sperm concentration, mill/ml
usa = 1 if sample is from the US, 0 if sample is from Europe

saseko.csv data set:

obs = observation number
abstid = days of abstinence
alder = age of subject
s1e2 = sas=1, ecological farmer=2
konc = sperm concentration, mill/ml
vol = volume of ejaculate, ml
Import data
Stata can read certain formats:

Excel sheets (xls, xlsx) [import excel ...]
text data (delimited, csv, ...) [insheet]
text data (fixed columns) [infix]
text data (fixed columns, with a dictionary) [infile]
Unformatted text data
SAS xport [import sasxport]
ODBC data
XML data
Stata files (dta) [use]

Here we illustrate Excel and text (csv, fixed width) data. For other
formats, programs like Stat/Transfer can convert most formats to most
other formats.
Typically, data come from some existing database. csv is often an
import/export option in most systems.
In practice one sometimes needs to move data to a different statistical
program because the analysis is only available there.

SPSS can read/write Stata files
R can read/write Stata files (write.dta)
Stat/Transfer (program for moving between formats)
// save in an older Stata format, to read from R
saveold oldsperm

Windows:
ssc install usespss
ssc install usesas
cd data
usespss using mydata.sav
usesas using mydata.sas7bdat
Import data
cd data
import excel using mini-data.xlsx
describe
list

/home/ifsv/bhd252/undervis/stata-course/thomas/data

Contains data
  obs:             4
 vars:             5
 size:            68
------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
A               byte    %10.0g
B               double  %10.0g
C               byte    %10.0g
D               str6    %9s
E               byte    %10.0g
------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved

     +---------------------------+
     | A     B    C        D   E |
     |---------------------------|
  1. | 1     2   99   anders   1 |
  2. | 2   2.5   30   anders   1 |
  3. | 3     5    5   thomas   2 |
  4. | 4   4.5   25   thomas   2 |
     +---------------------------+
Import data
import excel using mini-data.xlsx, sheet("Ark 1")
import sasxport sperma.xpt

insheet using saseko.csv, comma clear
describe
list in 1/5

(7 vars, 196 obs)

Contains data
  obs:           196
 vars:             7
 size:         3,528
------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
v1              int     %8.0g
obs             int     %8.0g
abstid          float   %9.0g
alder           byte    %8.0g
s1e2            byte    %8.0g
konc            float   %9.0g
vol             float   %9.0g
------------------------------------------------------------
Sorted by:

     +-----------------------------------------------+
     | v1   obs   abstid   alder   s1e2   konc   vol |
     |-----------------------------------------------|
  1. |  1     1        .      26      1      0   2.7 |
  2. |  2     2        4      44      1     26     5 |
  3. |  3     3        4      36      1     12   5.5 |
  4. |  4     4        4      40      1     83   4.5 |
  5. |  5     5        3      37      1     36     2 |
     +-----------------------------------------------+

Import data
For a file delimited by semicolons:
clear
insheet using your_file.csv, delimiter(";")
clear
infix obsnr 1-5 year 6-11 n 12-17 meanage 18-26 meanabs ///
    27-35 meanvol 36-44 meanct 45-53 medct 54-59 usa ///
    60-63 2 first using sperm.asc
format meanage meanabs meanvol meanct medct %4.2f
drop obsnr
list in 1/5

(62 observations read)

     +-----------------------------------------------------------------+
     | year     n   meanage   meanabs   meanvol   meanct   medct   usa |
     |-----------------------------------------------------------------|
  1. | 1938   200     -1.00     -2.00      3.00   120.63       .     1 |
  2. | 1941    22     31.50     -2.00      2.98   107.00       .     1 |
  3. | 1943    25     -1.00     -2.00      4.50    66.90   66.40     1 |
  4. | 1944    50     -1.00     -2.00      3.19    85.70       .     0 |
  5. | 1945   100     22.00     -2.00      3.40   134.00       .     1 |
     +-----------------------------------------------------------------+
Import data
infix using sperm.dct, clear
des

Contents of the dictionary file sperm.dct:
infix dictionary using sperm.asc {
2 first
obsnr 1-5
year 6-11
n 12-17
meanage 18-26
meanabs 27-35
meanvol 36-44
meanct 45-53
medct 54-59
usa 60-63
}

(62 observations read)

Contains data
  obs:            62
 vars:             9
 size:         2,232
------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------
obsnr           float   %9.0g
year            float   %9.0g
n               float   %9.0g
meanage         float   %9.0g
meanabs         float   %9.0g
meanvol         float   %9.0g
meanct          float   %9.0g
medct           float   %9.0g
usa             float   %9.0g
------------------------------------------------------------
Sorted by:
Import data
use sperm.dta, clear

Dictionaries can be quite elaborate, and may contain more information.
If usa was a string with information about region, the storage type
goes before the variable name:
medct 54-59
str usa 60-63
Export data
Stata can save data in different formats.

To save the data in a new file (NewVersion.dta):
save NewVersion

To replace the old version of the file currently in use:
save, replace
Looking at the data
Browse, Edit, list, describe

use sperm.dta, clear
list in 1/3
sort age
list if usa==1
list in -3/l
list in -3/-1
describe
edit
browse

Setting up the data
Output formats

format age %6.2g
format age %6.2f
format age %7.5e
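Labels
label data "Sperm concentration data"
label variable age "age in years"
label variable catage "Age in age groups"
// value labels can be attached to integers and to extended missing
// values (.a, .b, ...), but not to the system missing value "."
label define labcatage 1 "Young" 2 "Medium" 3 "Old" .a "Missing"
label values catage labcatage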
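
As a minimal sketch of how value labels behave, the following defines the label set from the slide on a small made-up variable and tabulates it. The four data values are invented for illustration; only the label names come from the slide.

```stata
* Hypothetical toy data: three age groups and one extended missing value
clear
input catage
1
2
3
.a
end

* Define the value label once, then attach it to the variable
label define labcatage 1 "Young" 2 "Medium" 3 "Old" .a "Missing"
label values catage labcatage

* The option "missing" makes tabulate show the labelled missing category too
tabulate catage, missing
```

The variable still holds the numbers 1, 2, 3 and .a; the labels only change how the values are displayed.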
Missing values
Missing values are generally represented by "." or, more generally, by
".a", ".b" and so forth. This gives a way of having different types
of missing values.

mvdecode age year income, mv(97=. \ 98=.a \ 99=.b)

The missing() function returns 1 if any of its arguments is missing.

replace age=0 if missing(age)
replace age=0 if missing(age,sex,land)
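
A small sketch of how these pieces fit together, on made-up data. The codes 97 and 98 are assumed here to mean "not answered" and "not applicable"; the variable names are hypothetical.

```stata
* Hypothetical toy data where 97 and 98 are codes for missing age
clear
input id age
1 34
2 97
3 98
4 51
end

* Turn the numeric codes into the system missing value and an extended one
mvdecode age, mv(97=. \ 98=.a)

* missing() is 1 both for "." and for extended missing values such as ".a"
generate byte agemiss = missing(age)
list

* Missing values are larger than any number, so a condition like
* "age > 40" would also select them unless they are excluded explicitly
count if age > 40 & !missing(age)
```

Excluding missing values explicitly in conditions such as `age > 40` is a common safeguard, since Stata treats missing as larger than every number.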
Sorting
sort land age income
sort age
gsort +age
gsort -age

Changing order of variables.
order age income
Computing new variables
generate loginc= ln(income)
label variable loginc "log-income"
egen mincome = mean(income)
egen sdin=std(income)   // std() gives standardized values; sd() gives the standard deviation
generate caseid=_n
replace income=income/1000
egen agesex= group(agecat sex)
replace income=100 in 3

loginc is a new variable in the data set. egen ("extended generate")
creates a new variable from a function computed over all observations
or across several variables. The result is attached to the data and can
be used for computation and definitions of new variables.

Egen functions
help egen

use sperm.dta, clear
egen agg=rowmean(meanct meanvol)
egen tot=rowtotal(meanct meanvol)
gen tots=meanct+meanvol
list meanct meanvol tot tots agg in 1/5

(1 missing value generated)

     +---------------------------------------------+
     | meanct   meanvol      tot     tots      agg |
     |---------------------------------------------|
  1. |      .         .        0        .        . |
  2. | 120.63      3.00   123.63   123.63   61.815 |
  3. | 107.00      2.98   109.98   109.98    54.99 |
  4. |  66.90      4.50     71.4     71.4     35.7 |
  5. |  85.70      3.19    88.89    88.89   44.445 |
     +---------------------------------------------+
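
The first row of the listing shows that rowtotal() and "+" treat missing values differently. A minimal sketch on made-up data (the variables a and b are hypothetical) makes the difference explicit:

```stata
* Hypothetical toy data with missing values in different positions
clear
input a b
1 2
3 .
. .
end

egen tot = rowtotal(a b)   // missing values count as 0
egen avg = rowmean(a b)    // mean of the nonmissing values in the row
generate tots = a + b      // missing as soon as one term is missing
list
```

In the second row tot is 3 while tots is missing, and in the third row tot is 0 although nothing was observed, so the choice between the two depends on how missing values should be read.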
Egen by
help egen

use sperm.dta, clear
by usa, sort: egen meanreg=mean(meanct)
sort year usa
list meanct usa meanreg in 1/5

     +-----------------------------+
     | meanct   usa       meanreg |
     |-----------------------------|
  1. | 120.63     1      85.32429 |
  2. | 107.00     1      85.32429 |
  3. |  66.90     1      85.32429 |
  4. |  85.70     0      80.05758 |
  5. | 134.00     1      85.32429 |
     +-----------------------------+

Recoding of continuous variables into groups
generate byte agecat=21 if age<=21
replace agecat=38 if age>21 & age<=38
replace agecat=64 if age>38 & age<=64
replace agecat=75 if age>64 & age<.

gen agecat2=recode(age,21,38,64,100)

// a new name is needed here because agecat already exists
generate agecat3=autocode(age,4,0,1000)
tabulate agecat, missing

// without the condition, a missing age would be coded as 1
generate age21b = age>=21 if age<.

Income in groups of 10000:
gen lb = floor(inc/10000) * 10000
tabulate lb

recode age (15/19=1) (20/24=2) (25/29=3) (30/34=4) ///
    (35/39=5) (40/44=6) (45/49=7), gen(age5)

recode effort (0/4=1 Weak) (5/14=2 Moderate) (15/max=3 Strong), ///
    generate(effortg) label(effortg)

Recoding of continuous variables into groups
use sperm.dta, clear
forvalues bot = 20(10)40 {
    local top = `bot' + 10
    gen cat`bot' = meanct >= `bot' & meanct <= `top'
}
list meanct cat20 cat30 in 1/5

     +------------------------+
     | meanct   cat20   cat30 |
     |------------------------|
  1. |      .       0       0 |
  2. | 120.63       0       0 |
  3. | 107.00       0       0 |
  4. |  66.90       0       0 |
  5. |  85.70       0       0 |
     +------------------------+

egen agecat1 = cut(meanage), at(-1,0,3,10,20)
egen agecat2 = cut(age), group(4)
list meanage agecat1 agecat2 in 1/5

(37 missing values generated)
(37 missing values generated)

     +-----------------------------+
     | meanage   agecat1   agecat2 |
     |-----------------------------|
  1. |       .         .         . |
  2. |   -1.00        -1         3 |
  3. |   31.50         .         . |
  4. |   -1.00        -1         3 |
  5. |   -1.00        -1         3 |
     +-----------------------------+
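
The loop over lower bounds relies on local macros: inside the braces, `bot' is replaced by the current value (20, 30, 40), so one variable is created per pass. The sketch below runs the same idea on made-up data. The variable x is hypothetical, and the intervals are taken as half-open (lower bound included, upper bound excluded) so that no value falls in two groups, which is an assumption and differs from the "<=" used on the slide.

```stata
* Hypothetical toy data
clear
input x
15
25
35
45
end

* One indicator per interval [bot, bot+10); `bot' and `top' are local macros
forvalues bot = 20(10)40 {
    local top = `bot' + 10
    generate byte cat`bot' = x >= `bot' & x < `top' if !missing(x)
}
list
```

This creates cat20, cat30 and cat40. The condition `if !missing(x)` leaves the indicators missing, rather than 0, when x itself is missing.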
Recoding of factors.
After reading data, some manipulation is sometimes needed.
Generating a numeric variable from a string; strings that are not
numbers become missing ".":

// siblings is a string: "1","2","3","4 or more"
generate sibnum=real(siblings)
// sibnum=1,2,3,.

// a new name is needed here because sibnum already exists
destring siblings, generate(sibnum2) force

encode string, gen(stringnum)
decode stringnum, gen(stringS)

Recoding
generate effortg = effort
recode effortg 0/4=1 5/14=2 15/max=3
label define effortg 1 "Weak" 2 "Moderate" 3 "Strong"
label values effortg effortg
label variable effortg "Family Planning Effort (Grouped)"
Replace
replace income= income/1000
replace age = 29 if age > 30 & land=="danmark"
replace age = 29 if missing(age) | cohort > 2014
Deleting variables and cases
drop age land
drop in 12/13
keep age land
Using scalars to store results for session
scalar q1=r(r1)
scalar q2=r(r2)
scalar list
scalar drop q1 q2
help scalar

Transposing/Reshaping the data
Long to wide
use long-data.dta, clear
sort id
list in 1/3

(Written by R. )

     +-------------------+
     | id   age   height |
     |-------------------|
  1. |  1     9      130 |
  2. |  1    10      140 |
  3. |  1    11      148 |
     +-------------------+
Long to wide
gen index=1
gen testid=(id==id[_n-1])
replace index=index[_n-1]+1 if id==id[_n-1]
list in 1/3

(4 real changes made)

     +------------------------------------+
     | id   age   height   index   testid |
     |------------------------------------|
  1. |  1     9      130       1        0 |
  2. |  1    10      140       2        1 |
  3. |  1    11      148       3        1 |
     +------------------------------------+

Long to wide
drop testid
reshape wide age height, i(id) j(index)
list

(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                        6   ->       2
Number of variables                   4   ->       7
j variable (3 values)             index   ->   (dropped)
xij variables:
                                    age   ->   age1 age2 age3
                                 height   ->   height1 height2 height3
-----------------------------------------------------------------------------

     +---------------------------------------------------------+
     | id   age1   height1    age2   height2    age3   height3 |
     |---------------------------------------------------------|
  1. |  1      9       130      10       140      11       148 |
  2. |  2    9.5       120   10.23       125   10.78       130 |
     +---------------------------------------------------------+
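
The whole round trip can be tried without the course files. The sketch below types in the six observations shown in the listings above and numbers the measurements within each id with bysort, which is an alternative (not the slide's method) to building index step by step.

```stata
* The six observations from the listings above, in long format
clear
input id age height
1 9     130
1 10    140
1 11    148
2 9.5   120
2 10.23 125
2 10.78 130
end

* Number the measurements within each id, ordered by age
bysort id (age): generate index = _n

* Long to wide: one row per id, variables age1-age3 and height1-height3
reshape wide age height, i(id) j(index)
list

* Wide to long: back to one row per id and index
reshape long age height, i(id) j(index)
list
```

i() names the variable that identifies a row in wide format and j() the variable whose values become the suffixes of the wide variable names.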
Wide to Long
use wide-data.dta, clear
reshape long age height, i(id) j(index)
list in 1/3

* variables need to be on specific form
rename (Jan-Dec) temp#, addnumber

(Written by R. )
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        2   ->       6
Number of variables                   7   ->       4
j variable (3 values)                     ->   index
xij variables:
                         age1 age2 age3   ->   age
                height1 height2 height3   ->   height
-----------------------------------------------------------------------------

     +---------------------------+
     | id   index   age   height |
     |---------------------------|
  1. |  1       1     9      130 |
  2. |  1       2    10      140 |
  3. |  1       3    11      148 |
     +---------------------------+

Merging files
Can append a file with same structure.
merge 1:1 (alternatively merge 1:m or m:1)
Add additional variables based on "id" merge variable.

use merge1.dta, clear
list

(Written by R. )

     +--------------------+
     | id   region    bmi |
     |--------------------|
  1. |  1        1   22.3 |
  2. |  2        1   28.3 |
     +--------------------+

Merging files
use merge2.dta, clear
list

(Written by R. )

     +-------------------+
     | id   X_by   alder |
     |-------------------|
  1. |  1      2      27 |
  2. |  2      .      17 |
  3. |  3     28      17 |
     +-------------------+

list

(Written by R. )

     +----------------------------+
     | id   X_by   alder   region |
     |----------------------------|
  1. |  1      2       2        1 |
  2. |  2      .      17        2 |
  3. |  3     28      17        1 |
     +----------------------------+
Merging files
use merge4-addid.dta, clear
list

(Written by R. )

     +--------------------+
     | id   region    bmi |
     |--------------------|
  1. |  5        2   12.3 |
  2. |  7        2   18.3 |
     +--------------------+

Merging files
use merge1.dta, clear
merge 1:1 id using merge2.dta
list

(Written by R. )

    Result                           # of obs.
    -----------------------------------------
    not matched                             1
        from master                         0  (_merge==1)
        from using                          1  (_merge==2)

    matched                                 2  (_merge==3)
    -----------------------------------------

     +----------------------------------------------------+
     | id   region    bmi   X_by   alder           _merge |
     |----------------------------------------------------|
  1. |  1        1   22.3      2      27      matched (3) |
  2. |  2        1   28.3      .      17      matched (3) |
  3. |  3        .      .     28      17   using only (2) |
     +----------------------------------------------------+
Merging files
use merge2.dta, clear
merge 1:1 id using merge1.dta
list

(Written by R. )

    Result                           # of obs.
    -----------------------------------------
    not matched                             1
        from master                         1  (_merge==1)
        from using                          0  (_merge==2)

    matched                                 2  (_merge==3)
    -----------------------------------------

     +-----------------------------------------------------+
     | id   X_by   alder   region    bmi            _merge |
     |-----------------------------------------------------|
  1. |  1      2      27        1   22.3       matched (3) |
  2. |  2      .      17        1   28.3       matched (3) |
  3. |  3     28      17        .      .   master only (1) |
     +-----------------------------------------------------+

Merging files
use merge1, clear
append using merge4-addid, gen(oldnew)
list

(Written by R. )

     +-----------------------------+
     | id   region    bmi   oldnew |
     |-----------------------------|
  1. |  1        1   22.3        0 |
  2. |  2        1   28.3        0 |
  3. |  5        2   12.3        1 |
  4. |  7        2   18.3        1 |
     +-----------------------------+
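
The 1:1 merge can be reproduced without the course files. The sketch below types in the values shown in the listings above and uses a temporary file in place of the second data set; the macro name extra is hypothetical, and the block is meant to be run as a whole (for example from a do-file) so that the temporary file name stays defined.

```stata
* Second data set: values as in the listing with X_by and alder
clear
input id X_by alder
1 2  27
2 .  17
3 28 17
end
tempfile extra
save `extra'

* Master data set: values as in the listing with region and bmi
clear
input id region bmi
1 1 22.3
2 1 28.3
end

* Add the variables from the second file, matching on id
merge 1:1 id using `extra'
tabulate _merge
list
```

The variable _merge records where each row came from: 1 for master only, 2 for using only, 3 for matched. Checking it after every merge is a cheap way to catch unexpected non-matches.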
Merging files
webuse dollars, clear
list
webuse sforce
list
merge m:1 region using http://www.stata-press.com/data/r13/dollars
list

(Regional Sales & Costs)

     +-----------------------------+
     |  region     sales      cost |
     |-----------------------------|
  1. | N Cntrl   419,472   227,677 |
  2. |      NE   360,523   138,097 |
  3. |   South   532,399   330,499 |
  4. |    West   310,565   165,348 |
     +-----------------------------+

(Sales Force)

     +--------------------+
     |  region       name |
     |--------------------|
  1. | N Cntrl     Krantz |
  2. | N Cntrl     Phipps |
  3. | N Cntrl     Willis |
  4. |      NE    Ecklund |
  5. |      NE     Franks |
     |--------------------|
  6. |   South   Anderson |
  7. |   South    Dubnoff |
  8. |   South        Lee |
  9. |   South     McNeil |
 10. |    West    Charles |
     |--------------------|
 11. |    West       Cobb |
 12. |    West      Grant |
     +--------------------+

file http://www.stata-press.com/data/r13/dollars.dta not Stata format
r(610);

     +--------------------+
     |  region       name |
     |--------------------|
  1. | N Cntrl     Krantz |
  2. | N Cntrl     Phipps |
  3. | N Cntrl     Willis |
  4. |      NE    Ecklund |
  5. |      NE     Franks |
     |--------------------|
  6. |   South   Anderson |