
INTRODUCTION TO GENE MAPPING: Association Mapping

Leilani Nora
Assistant Scientist

ASSOCIATION MAPPING

• Seeks to identify specific functional variants (i.e. loci, alleles) linked to phenotypic differences in a trait, to facilitate detection of trait-causing DNA sequence polymorphisms and/or selection of genotypes that closely resemble the phenotype.
• Also known as Linkage Disequilibrium (LD) mapping or Association Genetics, it is a population-based survey used to identify trait-marker relationships based on LD.
ASSOCIATION VS. QTL MAPPING

Attribute | QTL mapping | Association Genetics
Detection goal | Quantitative trait locus, wide region within specific pedigrees | Quantitative Trait Nucleotide, physically close as possible to causative sequences
Resolution of causative trait polymorphism | Low – moderate density linkage maps only required | High – disequilibrium within small physical regions requiring many markers
Experimental populations | Defined pedigrees, e.g. Backcross, F2, RI, three and two generation pedigrees | LD experiments; unrelated individuals. Unstructured populations, large numbers of small unrelated families
Marker discovery | Moderate | Moderate for few traits, high costs for many traits
Extent of inference | Pedigree specific, except where species has high extant LD | Species or sub species wide
Number of markers required for genome coverage | 10^2 – low 10^3 | 10^5 for small genomes, ~10^9 for large genomes

WHY ASSOCIATION GENETICS?

• The higher resolution afforded by use of unstructured populations allows the intriguing possibility of identifying the genes or even specific nucleotides underpinning trait variation.
• The opportunity to use molecular markers to enhance rates of genetic gain, including utilization of specific genes from non-elite germplasm in a more directed and efficient manner.

HARDY WEINBERG EQUILIBRIUM

HARDY WEINBERG EQUILIBRIUM (HWE)

• The Hardy-Weinberg model describes and predicts genotype and allele frequencies in a non-evolving population.
• It is an expression of the notion of a population in "genetic equilibrium" and is a basic principle of population genetics.

HWE ASSUMPTIONS

Assumption | Exception | Effect
Random mating | Inbreeding / Outbreeding | Decrease or increase in heterozygosity
No migration | Migration | Homogenize different populations
No mutation | Mutation (source of all variation) | Increase heterozygosity
No selection | Directional selection | Reduce variation
No selection | Disruptive selection | Increase variation
No selection | Stabilizing selection | Reduce variation
Infinite population | Genetic drift | Reduce variation

HARDY WEINBERG EQUILIBRIUM
1st Generation: Genotype and Allele Frequencies
• Consider a locus with two alleles: A and a
• We can use a Punnett square to produce all possible combinations of these gametes (Table 1)
• Assume in the first generation the alleles are not in HWE and the genotype frequencies are as shown in Table 2.

Table 1
Female \ Male | A | a
A | AA | Aa
a | Aa | aa

Table 2
Genotype | Freq
AA | P(AA)
Aa | P(Aa)
aa | P(aa)
where: P(AA) + P(Aa) + P(aa) = 1

HARDY WEINBERG EQUILIBRIUM
1st Generation: Genotype and Allele Frequencies
• Allele frequencies for a population not in HWE are obtained by allele counting:
$p = P(A) = P(AA) + \tfrac{1}{2}P(Aa)$
$q = P(a) = P(aa) + \tfrac{1}{2}P(Aa)$

Next Generation
• When a population is in HWE, the next generation will result in the same genotype frequency as well as the same allele frequency.
• Genotype frequencies for a population in HWE:
$P(AA) = P(A)P(A) = p^2$
$P(Aa) = 2P(A)P(a) = 2pq$
$P(aa) = P(a)P(a) = q^2$
where: $p^2 + 2pq + q^2 = 1$


HARDY WEINBERG EQUILIBRIUM

• For example, consider a diallelic locus with alleles "A" and "a" with frequencies 0.85 and 0.15, respectively. If the locus is in HWE, calculate the genotype frequencies.

HARDY WEINBERG EQUILIBRIUM
Violation in HWE Assumption
• When a locus is not in HWE, this suggests that one or more of the Hardy-Weinberg assumptions is false.
• Departure from HWE has been used to infer the existence of natural selection, argue for the existence of assortative (non-random) mating, and infer genotyping errors.
• It is therefore of interest to test whether a population is in HWE at a locus.
• Two most popular ways of testing HWE
- Chi-Square test
- Exact test
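
The diallelic example above (alleles "A" and "a" with frequencies 0.85 and 0.15) can be worked through in a few lines of base R. This is only a sketch; the object names `p`, `q`, `hwe_freq` and `p_next` are illustrative and do not come from the slides.

```r
# Allele frequencies from the example (p for "A", q for "a").
p <- 0.85
q <- 1 - p

# Expected genotype frequencies under HWE: p^2, 2pq, q^2.
hwe_freq <- c(AA = p^2, Aa = 2 * p * q, aa = q^2)
print(round(hwe_freq, 4))

# Allele counting on these genotype frequencies recovers the same allele
# frequency, which is why the population stays at equilibrium.
p_next <- hwe_freq[["AA"]] + 0.5 * hwe_freq[["Aa"]]
```

The three genotype frequencies are 0.7225, 0.255 and 0.0225, and they sum to 1.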
CHI-SQUARE GOODNESS OF FIT
• Compares observed genotype counts with the values expected under Hardy-Weinberg
• For a locus with two alleles, we might construct a table as follows:

Genotype | Observed | Expected under HWE
AA | nAA | np^2
Aa | nAa | 2npq
aa | naa | nq^2

where: n – is the number of individuals in the sample
p – is the probability that a random allele in a population is of type "A"
q – is the probability that a random allele in a population is of type "a"

CHI-SQUARE GOODNESS OF FIT
• Test Statistic for Allelic Association is:

$$\chi^2 = \sum_{\text{genotypes}} \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}} \approx \chi^2_{(1\,\mathrm{df})} \text{ under } H_0$$

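
As a rough illustration of the goodness-of-fit calculation, the sketch below applies the formula by hand to the genotype counts that this document reports for snp4 (109, 105 and 29). It uses the asymptotic 1 df chi-square distribution; the variable names are illustrative, and this is not the code used inside the genetics package.

```r
# Observed genotype counts (the snp4 counts reported in this document).
obs <- c(AA = 109, AB = 105, BB = 29)
n <- sum(obs)

# Estimate allele frequencies by allele counting.
p <- (2 * obs[["AA"]] + obs[["AB"]]) / (2 * n)
q <- 1 - p

# Expected counts under HWE and the goodness-of-fit statistic.
expd <- n * c(AA = p^2, AB = 2 * p * q, BB = q^2)
chi2 <- sum((obs - expd)^2 / expd)

# One degree of freedom: 3 genotype classes - 1 - 1 estimated allele frequency.
p_value <- pchisq(chi2, df = 1, lower.tail = FALSE)
```

The statistic comes out at about 0.2298, which agrees with the X-squared value shown in the HWE.chisq() output; the p-value here is asymptotic rather than simulated, so it need not match the simulated one exactly.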

DATAFRAME: ge03d1p1.csv

• Dataframe with 250 observations and 7 variables.
Read data file ge03d1p1.csv
> assoc1 <- read.table("ge03d1p1.csv",
header=TRUE, sep=",", stringsAsFactors=TRUE)
> str(assoc1)
'data.frame': 250 obs. of 7 variables:
$ subj: int 144 965 374 2103 1533 2466 2425 198...
$ sex : int 0 0 0 0 0 1 0 0 1 1 ...
$ aff : int 0 1 1 0 0 0 0 0 0 0 ...
$ qt : num 1.365 -2.464 -1.332 -1.127 0.101 ...
$ snp4: Factor w/ 3 levels "A/A","A/B","B/B": ...
$ snp5: Factor w/ 3 levels "A/A","B/A","B/B": ...
$ snp6: Factor w/ 3 levels "A/A","B/A","B/B": ...
HWE USING PACKAGE genetics
• Written by Gregory R. Warnes to facilitate analysis of genetic data
• Implements genetic analysis tests such as HWE and LD
• Implements new data classes such as genotype, haplotype, and LD.data.frame
• Facilitates export of data from R to the formats supported by other genetic analysis software

BASIC FUNCTIONS OF PACKAGE genetics
• table() is used to obtain a contingency table of counts at each combination of factors
> table(x)
# x – an arbitrary R object
• genotype() creates a genotype object
> genotype(a1, …)
# a1,a2 – vector(s) or matrix containing two alleles for each individual

BASIC FUNCTIONS OF PACKAGE genetics

• Illustration
> library(genetics)
> Snp4 <- genotype(assoc1$snp4)
> str(Snp4)
Factor w/ 3 levels "A/A","A/B","B/B": 2 3...
- attr(*, "allele.names")= chr [1:2] "A" "B"
- attr(*, "allele.map")= chr[1:3, 1:2] "A"...
- attr(*, "genotypeOrder")= chr[1:4] "A/A" "A/B"...
> table(Snp4)
A/A A/B B/B
109 105 29

PACKAGE genetics : summary()

• summary() is used to obtain a table of allelic and genotype frequencies and other summary statistics
> library(genetics)
> summary(Snp4)
Number of samples typed: 243 (97.2%)

Allele Frequency: (2 alleles)
   Count Proportion
A    323       0.66
B    163       0.34
NA    14         NA

Genotype Frequency:
    Count Proportion
A/A   109       0.45
A/B   105       0.43
B/B    29       0.12
NA      7         NA

Heterozygosity (Hu) = 0.4467269
Poly. Inf. Content = 0.3464355
PACKAGE genetics : HWE.chisq()
• Test the null hypothesis (Ho) that Hardy-Weinberg equilibrium holds using chi-square method
> HWE.chisq(x, …)
# x – genotype or haplotype object

• Illustration
> HWE.chisq(Snp4)
Pearson's Chi-squared test with simulated p-value
(based on 10000 replicates)
data: tab
X-squared = 0.2298, df = NA, p-value = 0.6657

PACKAGE genetics : HWE.exact()
• Exact test of HWE for 2 Allele Markers
> HWE.exact(x, …)
# x – genotype or haplotype object

• Illustration
> HWE.exact(Snp4)
Exact Test for Hardy-Weinberg Equilibrium
data: snp4
N11 = 109, N12 = 105, N22 = 29, N1 = 323, N2 = 163,
p-value = 0.666

LINKAGE DISEQUILIBRIUM

• Also known as gametic phase disequilibrium, gametic disequilibrium and allelic association.
• Non-random association of alleles at different loci
• It is the correlation between polymorphisms (SNPs) that is caused by their shared history of mutation and recombination.
• LD and Linkage are related but they are distinctly different.
LINKAGE DISEQUILIBRIUM
• Two loci, A and B, are said to be in linkage (or gametic) disequilibrium if their respective alleles do not associate independently in the studied population.
• Occurs when genotypes at the two loci are not independent of one another. If all polymorphisms were independent at the population level, association studies would have to examine every one of them.
• Linkage disequilibrium makes tightly linked variants strongly correlated, producing cost savings for association studies.

MEASURE OF LINKAGE DISEQUILIBRIUM
• Consider two loci (A and B), each segregating for two alleles (A, a, B, b)
• There are four possible gametes (or haplotypes) present in the population:

Locus B \ Locus A | A | a | Totals
B | XAB | XaB | pB
b | XAb | Xab | qb
Total | pA | qa | 1.0

• Gamete : AB, Ab, aB, ab
• Frequency : XAB, XAb, XaB, Xab
• Allele frequencies can be expressed in terms of gamete frequencies, e.g. pA = XAB + XAb and pB = XAB + XaB, with qa = 1 − pA and qb = 1 − pB


MEASURE OF LINKAGE DISEQUILIBRIUM

• If the alleles at the two loci are randomly associated with one another, then the frequencies of the four gametes are equal to the product of the frequencies of alleles.

Locus B \ Locus A | A | a | Totals
B | pAB = pA pB | paB = qa pB = (1 − pA) pB | pB
b | pAb = pA qb = pA (1 − pB) | pab = qa qb = (1 − pA)(1 − pB) | qb
Total | pA | qa | 1.0

• In this situation there is no linkage disequilibrium and gamete frequencies can be accurately followed using allele frequencies.

COEFFICIENT OF LD

• If alleles at the two loci are not randomly associated then there will be a deviation (D) in the expected frequencies

Locus B \ Locus A | A | a | Totals
B | pAB = pA pB + DAB | paB = qa pB − DAB | pB
b | pAb = pA qb − DAB | pab = qa qb + DAB | qb
Total | pA | qa | 1.0

• This parameter D is the Coefficient of Linkage Disequilibrium first proposed by Lewontin and Kojima (1960).
• The most common expression of D is:
$D_{ij} = p_{ij} - p_i p_j$ or $D_{AB} = p_{AB} - p_A p_B$
MEASURE OF LINKAGE DISEQUILIBRIUM
Normalized Measure of Lewontin, D'

$$D' = \frac{D}{D_{max}}$$

Where:
Dmax = min[pA pB, qa qb], if DAB < 0
Dmax = min[pA qb, qa pB], if DAB > 0
• Its absolute value varies between 0 and 1, which allows the extent of linkage disequilibrium to be assessed relative to the maximum possible value it can take.
• D' will only be less than one if all four possible haplotypes are observed.

TEST OF LD
Chi-square Test of Linkage Disequilibrium (D)

$$\chi^2 = \frac{2nD^2}{p_A(1-p_A)\,p_B(1-p_B)} \sim \chi^2_{(1)}$$

• Compared with the threshold value obtained from the chi-square table with 1 df at a certain level of significance.
• n is the number of individuals in the sample (2n gametes).
• If significant, this means that D is significantly different from 0 and that the population under study is in linkage disequilibrium.
• If not significant, this means that D is not significantly different from 0 and that the population under study is in linkage equilibrium.

MEASURE OF LINKAGE DISEQUILIBRIUM
Correlation between A and B alleles, ∆2 or r2

$$\Delta^2 = r^2 = \frac{D_{AB}^2}{p_A(1-p_A)\,p_B(1-p_B)} = \frac{\chi^2}{2n}$$

• If allele frequencies are equal, then r2 varies between 0 and 1:
• 1 when the two markers provide identical information
• 0 when they are in perfect equilibrium
• As for D, the maximum value of r2 depends on the allele frequencies, and one can determine an r' value in a manner analogous to D'.

ILLUSTRATION OF LD
Figure 1: Completely Correlated
[Figure 1: haplotype diagram]
• Shows an example of LD where the two polymorphisms are completely correlated with one another
• Two linked mutations occur at a similar point in time and no recombination has occurred between sites.
• In this case, the history of mutation and recombination for the sites is the same.
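
The LD measures defined on these slides (D, D', r2 and the 1 df chi-square statistic) can be computed directly from the four haplotype frequencies. The sketch below uses made-up haplotype frequencies and a made-up sample size, so every number in it is an assumption chosen only for illustration.

```r
# Hypothetical haplotype (gamete) frequencies for loci A and B; they sum to 1.
x <- c(AB = 0.5, Ab = 0.1, aB = 0.1, ab = 0.3)
n <- 100  # hypothetical number of individuals (2n gametes)

# Allele frequencies are marginal sums of the haplotype frequencies.
pA <- x[["AB"]] + x[["Ab"]]
pB <- x[["AB"]] + x[["aB"]]
qa <- 1 - pA
qb <- 1 - pB

# Coefficient of LD: D_AB = p_AB - p_A * p_B.
D <- x[["AB"]] - pA * pB

# Lewontin's normalisation: D' = D / Dmax, with Dmax chosen by the sign of D.
Dmax <- if (D < 0) min(pA * pB, qa * qb) else min(pA * qb, qa * pB)
Dprime <- D / Dmax

# Squared correlation and the 1-df chi-square statistic (chi2 = 2n * r2).
r2 <- D^2 / (pA * qa * pB * qb)
chi2 <- 2 * n * r2
p_value <- pchisq(chi2, df = 1, lower.tail = FALSE)
```

With these frequencies pA = pB = 0.6 and D = 0.14, so D' = 0.14 / 0.24 and r2 = 0.0196 / 0.0576.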
ILLUSTRATION OF LD
Figure 2: Not Completely Correlated
[Figure 2: haplotype diagram]
• Polymorphisms are not completely correlated, but there is no evidence of recombination.
• This type of LD structure develops when mutations occur on different allelic lineages.
• This is the situation in which r2 and D' act differently, with D' still equal to 1, but where r2 can be much smaller.

LD ANALYSIS IN R: LD()
• Computes pairwise linkage disequilibrium between genetic markers

Usage
> LD(g1, g2, …)
# g1 – genotype object or dataframe containing genotype objects
# g2 – genotype object (ignored if g1 is a dataframe)
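
The Figure 2 situation, where only three of the four haplotypes exist, is the case in which D' and r2 disagree. A small numerical sketch, using invented haplotype frequencies, shows D' reaching 1 while r2 stays well below 1.

```r
# Hypothetical case with only three of the four haplotypes present (Ab is absent).
x <- c(AB = 0.4, Ab = 0.0, aB = 0.3, ab = 0.3)

pA <- x[["AB"]] + x[["Ab"]]
pB <- x[["AB"]] + x[["aB"]]
D  <- x[["AB"]] - pA * pB

# Same definitions of Dmax, D' and r2 as on the slides.
Dmax <- if (D < 0) min(pA * pB, (1 - pA) * (1 - pB)) else min(pA * (1 - pB), (1 - pA) * pB)
Dprime <- D / Dmax
r2 <- D^2 / (pA * (1 - pA) * pB * (1 - pB))
```

Here D = Dmax = 0.12, so D' = 1, whereas r2 = 0.0144 / 0.0504, roughly 0.29, because the allele frequencies at the two loci differ (0.4 versus 0.7).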

Sample: LD()

> library(genetics)
> Snp4 <- genotype(assoc1$snp4)
> Snp5 <- genotype(assoc1$snp5)
> Snp6 <- genotype(assoc1$snp6)
> AllSNPs <- data.frame(Snp4, Snp5, Snp6)
> All.LD <- LD(AllSNPs)

Sample: LD()

> All.LD
Pairwise LD
-----------
                      Snp5          Snp6
Snp4 D           0.2009042     0.1735026
Snp4 D'          0.9997352     0.8039577
Snp4 Corr.       0.8683117     0.7672419
Snp4 X^2       354.3636283   270.7836866
Snp4 P-value  < 2.2204e-16  < 2.2204e-16
Snp4 n                 235           230

Snp5 D                         0.2135702
Snp5 D'                        0.9997231
Snp5 Corr.                     0.9098532
Snp5 X^2                     379.1474204
Snp5 P-value                < 2.2204e-16
Snp5 n                               229
FACTORS CREATING LINKAGE DISEQUILIBRIUM

5 Processes that can produce linkage disequilibrium
1. Mutation – provides the raw material for producing polymorphisms that will be in LD
2. Random Drift
3. Gene Flow
4. Inbreeding slows the decay of disequilibrium
5. Evolution of supergenes

THE RATE OF RECOMBINATION, r
• Let r be the rate of recombination between A and B loci.
• r ranges in value between 0 and 0.5. The maximum value of r is 0.5 because, with independent assortment of the loci, 0.5 of the gametes produced will be of parental type.
• Recombination in the two types of double heterozygotes will produce different new genotypes at rate r.
• Double heterozygotes will produce offspring with gametes like themselves at rate (1-r)


THE RATE OF RECOMBINATION, r

• Disequilibrium will decay each generation
• After t generations: $D_t = (1-r)^t D_0$
• Therefore, linkage disequilibrium decays each generation at a rate determined by the degree of recombination.

LINKAGE DISEQUILIBRIUM DECAY PLOT
Figure 3: LD Decay Plot measured as r2

• Used to visualize the rate at which LD declines with genetic or physical distance
• Scatterplot of r2 values versus genetic/physical distances between all pairs of alleles within a gene, along a chromosome, or across the genome.
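
The decay relation D_t = (1 − r)^t D_0 is easy to tabulate. The sketch below uses an arbitrary starting value D_0 = 0.25 and three arbitrary recombination rates; these values, and the object names, are assumptions for illustration only.

```r
# Hypothetical starting disequilibrium and recombination rates.
D0 <- 0.25
r_values <- c(tight = 0.01, loose = 0.1, unlinked = 0.5)
generations <- 0:50

# D_t = (1 - r)^t * D_0 for each recombination rate (one column per rate).
decay <- sapply(r_values, function(r) (1 - r)^generations * D0)
rownames(decay) <- generations

# Generations needed for D to halve: solve (1 - r)^t = 0.5 for t.
half_life <- log(0.5) / log(1 - r_values)
```

For unlinked loci (r = 0.5) D halves every generation, while at r = 0.01 it takes about 69 generations to halve. The columns of `decay` could be drawn with `matplot(generations, decay, type = "l")` to see the curves.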
LINKAGE DISEQUILIBRIUM DECAY PLOT
Figure 4: Disequilibrium Matrix for Polymorphic Sites

• Effective for visualizing the linear arrangement of LD between polymorphic sites within a gene or loci along a chromosome.

WHY ARE LINKAGE DISEQUILIBRIUM AND RECOMBINATION IMPORTANT?

• Recombination is intimately associated with sex
• Sex is synonymous with mixis – mixing of genes between individuals
• Evolution of sex has thus been equated with the evolution of genetic recombination.
• Mixis is needed to repair damaged DNA (Bernstein et al 1985)
• Many of the enzymes involved in repairing damaged DNA also function in recombination.

ASSOCIATION ANALYSIS
Linear Model

LINEAR MODEL: lm()

> modelQT <- lm(qt~snp4, data=assoc1)
> summary(modelQT)
Call:
lm(formula = qt ~ snp4)

Residuals:
     Min       1Q   Median       3Q      Max
-2.63700 -0.62291 -0.01225  0.58922  3.05561

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.081114   0.092517  -0.877    0.382
snp4A/B     -0.108366   0.132079  -0.820    0.413
snp4B/B     -0.006041   0.201820  -0.030    0.976

Residual standard error: 0.9659 on 240 degrees of freedom
(7 observations deleted due to missingness)
Multiple R-squared: 0.003049, Adjusted R-squared: -0.005259
F-statistic: 0.367 on 2 and 240 DF, p-value: 0.6932
ASSOCIATION WITH BINARY OUTCOME: glm()
• Used to fit generalized linear models

Usage
> glm(formula, family="binomial",…)
# formula – a symbolic description of the model to be fitted.
# family – character string which describes the error distribution. For a binary outcome family="binomial"

ASSOCIATION WITH BINARY OUTCOME: glm()
> modelQT2 <- glm(aff~snp4, family="binomial", data=assoc1)
> summary(modelQT2)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.7433 -0.7204 -0.6715 -0.6715  1.7890

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.3749     0.2386  -5.761 8.35e-09 ***
snp4A/B       0.1585     0.3331   0.476    0.634
snp4B/B       0.2297     0.4952   0.464    0.643
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 254.91 on 242 degrees of freedom
Residual deviance: 254.58 on 240 degrees of freedom
(7 observations deleted due to missingness)
AIC: 260.58
Number of Fisher Scoring iterations: 4
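
The two association models above can be tried without the course data file. The sketch below simulates a three-genotype SNP and two traits that are unrelated to it, then fits the same kind of lm() and glm() models. The data, the seed and all object names are invented for illustration, so the estimates will not match the output shown above.

```r
set.seed(1)  # simulated data only; not the course dataset

# Simulate a SNP with genotype frequencies close to those reported for snp4.
n <- 243
snp <- factor(sample(c("A/A", "A/B", "B/B"), n, replace = TRUE,
                     prob = c(0.45, 0.43, 0.12)))
toy <- data.frame(
  snp = snp,
  qt  = rnorm(n),           # quantitative trait, unrelated to the SNP
  aff = rbinom(n, 1, 0.2)   # binary affection status, unrelated to the SNP
)

# Quantitative trait: genotype enters as a 3-level factor (2 df F-test).
fit_qt <- lm(qt ~ snp, data = toy)
f_test <- anova(fit_qt)

# Binary trait: logistic regression with a likelihood-ratio test for the SNP.
fit_aff <- glm(aff ~ snp, family = binomial, data = toy)
lr_test <- anova(fit_aff, test = "Chisq")
```

Treating the genotype as a factor, as the slides do, spends 2 degrees of freedom on the SNP and makes no assumption about the mode of inheritance.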

DETECTING POPULATION STRUCTURE

Non-Parametric method
• Cluster Analysis
• Multi-Dimensional Scaling
• Principal Component Analysis

Parametric method
• STRUCTURE
CLUSTER ANALYSIS
• Exploratory technique which may be used to search for category structure based on natural groupings in the data, or to reduce a very large body of data to a relatively compact description
• No assumptions are made concerning the number of groups or the group structure.
• Grouping is done on the basis of similarities or distances (dissimilarities).
• There are various techniques for doing this which may give different results. Thus the researcher should consider the validity of the clusters found.

HIERARCHICAL CLUSTER ANALYSIS
Steps in Performing Agglomerative Hierarchical Clustering
1. Obtain the Data Matrix
2. Standardize the data matrix if need be
3. Generate the resemblance or distance matrix
4. Execute the Clustering Method

DISTANCE / DISSIMILARITY MATRIX

CLUSTERING METHOD

Types of Hierarchical Agglomerative Clustering


1. Single Linkage (SLINK)
2. Complete Linkage (CLINK)
3. Average Linkage (ALINK)
4. Ward’s Method - minimize the error SS
5. Centroid Method

Partitioning Method
1. K-means clustering
2. K-centroids
DATAFRAME: AMP2009.csv
Read data file AMP2009.csv
> AMP09 <- read.table("AMP2009.csv",
header=TRUE, sep=",", row.names="Gen", stringsAsFactors=TRUE)
> str(AMP09)
'data.frame': 207 obs. of 200 variables:
$ SNP1 : Factor w/ 3 levels "AA","AB","BB": 2 2 2 2 ...
$ SNP2 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 ...
.......
$ SNP209 : Factor w/ 2 levels "AB","BB": 2 2 1 2 2 2 ...
$ SNP210 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 2 ...
$ ADF : num 21.3 22.8 22.2 17.8 25 ...
.......

DATA FRAME: AMP09
Select SNPs data
> SNP09 <- subset(AMP09, select=SNP1:SNP210)
> str(SNP09)
'data.frame': 207 obs. of 184 variables:
$ SNP1 : Factor w/ 3 levels "AA","AB","BB": 3 3 3 2 ...
$ SNP2 : Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 ...
$ SNP3 : Factor w/ 3 levels "AA","AB","BB": 2 2 3 2 ...
…


DISSIMILARITY MATRIX : daisy()

> daisy(x, metric="euclidean", …)
# x – a numeric matrix, dataframe, or "dist" object
# metric – character string specifying the metric to be used
> library(cluster)
> diss <- daisy(SNP09, metric="gower")
> dMatrix <- as.matrix(diss)
> dMatrix[1:5, 1:5]
          AMP001    AMP002    AMP003    AMP004    AMP005
AMP001 0.0000000 0.4476744 0.5773810 0.4415584 0.4913295
AMP002 0.4476744 0.0000000 0.7005988 0.5424837 0.4411765
AMP003 0.5773810 0.7005988 0.0000000 0.6666667 0.5976331
AMP004 0.4415584 0.5424837 0.6666667 0.0000000 0.5294118
AMP005 0.4913295 0.4411765 0.5976331 0.5294118 0.0000000

CLUSTERING METHOD : hclust()

> hclust(d, method="complete", …)
# d – a dissimilarity structure as produced by dist() or daisy()
# method – the agglomeration method to be used
> fit <- hclust(diss, method="ward.D")
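
The dissimilarity and clustering steps can be tried on a toy data set in base R. In the sketch below the six lines and five SNPs are invented, and the dissimilarity is a hand-written simple-matching coefficient (the proportion of SNPs at which two lines differ) rather than a call to daisy(). For complete nominal data this is what the Gower coefficient reduces to, but treat that equivalence, and all object names, as assumptions of this sketch.

```r
# Toy genotype data: 6 hypothetical lines scored at 5 SNPs (not the AMP2009 data).
geno <- data.frame(
  SNP1 = c("AA", "AA", "AB", "BB", "BB", "BB"),
  SNP2 = c("AA", "AA", "AA", "BB", "BB", "AB"),
  SNP3 = c("BB", "BB", "BB", "AA", "AA", "AA"),
  SNP4 = c("AB", "AB", "AB", "BB", "BB", "BB"),
  SNP5 = c("AA", "AB", "AA", "BB", "BB", "BB"),
  row.names = paste0("L", 1:6),
  stringsAsFactors = TRUE
)

# Simple-matching dissimilarity: proportion of SNPs at which two lines differ.
g <- as.matrix(geno)
n <- nrow(g)
d <- matrix(0, n, n, dimnames = list(rownames(g), rownames(g)))
for (i in seq_len(n)) for (j in seq_len(n)) d[i, j] <- mean(g[i, ] != g[j, ])
diss <- as.dist(d)

# Ward clustering, then cut the tree into two groups.
fit <- hclust(diss, method = "ward.D")
groups <- cutree(fit, k = 2)
```

Lines L1 to L3 differ from L4 to L6 at every SNP, so cutting the tree at k = 2 separates these two sets. `plot(fit)` followed by `rect.hclust(fit, k = 2)` would draw the corresponding dendrogram.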
DENDROGRAM : plot() and rect.hclust()

> plot(fit)
> rect.hclust(tree=fit, k=6, border="red")
# tree – an object produced by hclust
# k – Cut the dendrogram at k clusters

Metric Multi Dimensional Scaling

• Visualize the relative distances between individuals in a low dimensional structure.
• Classical multidimensional scaling of a data matrix. Also known as principal coordinates analysis (Gower, 1966)
• Multidimensional scaling takes a set of dissimilarities as input and returns a set of points such that the distances between points are approximately equal to the dissimilarities.

Classical (Metric) Multi Dimensional Scaling: cmdscale()

> cmdscale(d, k=2, eig=FALSE, …)
# d – a distance structure such as returned by dist() or a full symmetric matrix containing the dissimilarities
# k – the dimension of the space which the data are to be represented in; must be in {1,2,…, n-1}
# eig – indicates whether eigenvalues should be returned.

Classical (Metric) Multi Dimensional Scaling: cmdscale()

> Mds <- cmdscale(dMatrix)
> plot(Mds, xlab="C1", ylab="C2")
> abline(v=0,lty=2)
> abline(h=0,lty=2)
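
A quick way to see what cmdscale() returns is to feed it distances between points whose layout is already known. In the sketch below, four made-up points on a unit square are converted to Euclidean distances, and the recovered configuration is checked against those distances.

```r
# Four hypothetical points on a unit square and their Euclidean distances.
pts <- rbind(a = c(0, 0), b = c(1, 0), c = c(1, 1), d = c(0, 1))
d <- dist(pts)

# Classical MDS in two dimensions; eig = TRUE also returns the eigenvalues.
mds <- cmdscale(d, k = 2, eig = TRUE)
coords <- mds$points

# The configuration may be rotated or reflected, but distances are preserved.
max_error <- max(abs(as.vector(dist(coords)) - as.vector(d)))
```

Because the input distances are exactly Euclidean in two dimensions, the recovered coordinates reproduce them up to rounding error. With genotype-based dissimilarities the reproduction is only approximate.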
PRINCIPAL COMPONENT ANALYSIS

• Data analytic method which provides a specific set of projections which represent a given data set in fewer dimensions.
• Used to transform correlated variables into uncorrelated ones, in other words to sphere the data.
• The final rationale for this technique is that it finds the linear combinations of data which have relatively large (or relatively small) variability.

PRINCIPAL COMPONENT ANALYSIS : prcomp()

• Performs a principal component analysis on the given data matrix and returns the results as an object of class prcomp
> prcomp(x, center=TRUE, scale.=FALSE, …)
# x – a numeric or complex matrix (or dataframe) which provides the data for PCA
# center – a logical value indicating whether the variables should be shifted to be 0 centered.
# scale. – a logical value indicating whether the variables should be scaled to have unit variance before the analysis.
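
The behaviour of prcomp() and predict() can be checked on simulated data before applying them to marker data. The three variables below are invented: two are strongly correlated and one is independent. The seed and object names are arbitrary.

```r
set.seed(42)  # simulated data only

# Two correlated hypothetical variables plus one independent variable.
x1 <- rnorm(100)
dat <- cbind(x1 = x1, x2 = x1 + rnorm(100, sd = 0.3), x3 = rnorm(100))

# PCA on centred, unit-variance variables.
pca <- prcomp(dat, center = TRUE, scale. = TRUE)

# Component scores: predict() on the fitted object returns the same values as pca$x.
scores <- predict(pca)

# The scores are uncorrelated; component variances are the squared sdev values.
score_cor <- cor(scores)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The off-diagonal entries of `score_cor` are zero up to rounding, which is the "correlated variables into uncorrelated ones" property described above, and `var_explained` gives the share of total variance carried by each component.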

PRINCIPAL COMPONENT ANALYSIS : predict()

• A generic function for predictions from the results of various model fitting functions.
> predict(object, …)
# object – a model object for which prediction is desired.

DATA FRAME: GenoNum

Convert SNPs data to numeric
> GenoNum <- data.matrix(SNP09)
> str(GenoNum)
int [1:207, 1:184] 3 3 3 2 3 2 2 2 2 2 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:207] "AMP001" "AMP002" "AMP003" "AMP004" ...
..$ : chr [1:184] "SNP1" "SNP2" "SNP3" "SNP4" ...
PRINCIPAL COMPONENT ANALYSIS : prcomp()
> PCAMP <- prcomp(GenoNum, scale.=TRUE)
> scores <- predict(PCAMP)
> plot(PCAMP$x[,1], PCAMP$x[,2],
xlab="PC1", ylab="PC2", type="p")

STRUCTURE VERSION 2.3
• Implements a model-based clustering method for inferring population structure using genotype data of unlinked markers.
• Based on Bayesian statistics, which can accommodate prior knowledge about the population structure.
• Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed.
• To download the software use the link below:
http://pritch.bsd.uchicago.edu/structure.html
• For more details on how to use the software you can contact Dr. Ken McNally ☺

SAMPLE OUTPUT USING STRUCTURE

• Figure 1. Bar plot of estimates of Q. Each individual is represented by a single vertical line broken into K colored segments, with lengths proportional to each of the K inferred clusters.
• The numbers 1 to 4 correspond to the predefined populations.

SAMPLE OUTPUT USING STRUCTURE

• Figure 2. Triangle plot of the Q-matrix. Each individual is represented by a colored point. The colors correspond to the prior population labels.
• When K=3 the ancestry vectors can be plotted onto a triangle, as shown.
• For a given point, each of the three components is given by the distance to one edge of the triangle.
• Individuals who are in one of the corners are therefore assigned completely to one population or another.
ASSOCIATION ANALYSIS
Using GenABEL

FEATURES OF PACKAGE GenABEL
• Specifically designed for GWAS
• Provides specific facilities for storage and manipulation of large data
• Very fast tests for GWAS
• Specific functions to analyze and display the results.
• More efficient than the package 'genetics'

DESCRIPTION OF gwaa.data-class

• In GenABEL, a special data class, gwaa.data-class, is used to store GWA data.
• Includes the phenotypic and genotypic data, chromosome, and location of every SNP.
• An object of some class has "slots" which may contain actual data or objects of other classes.
• At the first level, a gwaa.data-class object has the slot phdata, which contains all the phenotypic information in a dataframe.
• The other slot is gtdata, which contains all GWA genetic information in an object of class snp.data
• For every SNP it is desirable to know the details of coding and strand (+ , -, top, bot)

EXPLORING gwaa.data-class

> library(GenABEL)
> data(ge03d2ex)
> str(ge03d2ex)
Formal class 'gwaa.data' [package "GenABEL"] with 2 slots
..@ phdata:'data.frame': 136 obs. of 8 variables:
.. ..$ id : chr [1:136] "id199" "id287" "id300"...
.. ..$ sex : int [1:136] 1 0 1 0 0 1 1 0 0 1 ...
.....
..@ gtdata:Formal class 'snp.data' [package "GenABEL"] with 11 slots
.. .. ..@ nbytes : num 34
.. .. ..@ nids : int 136
.. .. ..@ nsnps : int 4000
.. .. ..@ idnames : chr [1:136] "id199" "id287"...
.. .. ..@ snpnames : chr [1:4000] "rs7435137"...
.. .. ..@ chromosome: Factor w/ 4 levels "1","2","3","X" ....
STRUCTURE OF gwaa.data-class

EXPLORING gwaa.data-class
# Summary of Phenotype data
> summary(ge03d2ex@phdata)
# No. of people in a study
> ge03d2ex@gtdata@nids
# No. of SNPs
> ge03d2ex@gtdata@nsnps
# SNP Names
> ge03d2ex@gtdata@snpnames[1:10]
# Chromosome labels
> ge03d2ex@gtdata@chromosome[1:10]
# SNPs map position/location
> ge03d2ex@gtdata@map[1:10]

IMPORT DATA TO GenABEL
• To import data to GenABEL, two files need to be prepared:
- Phenotypic data
- Genotypic data
• Description of Phenotypic data file
- First line must consist of variable names
- First column must contain the unique ID, named "id".
- Second column should be named "sex" (0=female, 1=male)
- Other columns in the file should contain phenotypic information.
- Missing values should be coded as NA

IMPORT DATA TO GenABEL
• Example of a phenotypic file : Pheno.csv
• Save this file as Pheno.dat
IMPORT DATA TO GenABEL
• Description of Genotypic data file
- For every SNP, information on map position, chromosome, and strand should be provided.
- For every individual, every SNP genotype should be provided.
- GenABEL provides a number of functions to convert these data from different formats to the internal GenABEL raw format.
> convert.snp.illumina()
> convert.snp.tped()
> convert.snp.ped()
> convert.snp.text()

IMPORT DATA TO GenABEL: convert.snp.text()
• Converts a genotypic data file to a file in the raw internal data format
> convert.snp.text(infile, outfile,..)
# infile – input data file
1st line – contains IDs
2nd line – names of all SNPs
3rd line – list of chromosomes the SNPs belong to
4th line – genomic position of the SNPs
5th line – genetic data.
# outfile – output data file

IMPORT DATA TO GenABEL: convert.snp.text()

• Sample genotypic data file : Geno.csv
• Save this file as Geno.dat

IMPORT DATA TO GenABEL: load.gwaa.data()

• Load data (genotypes and phenotypes) from files to a gwaa.data object
> load.gwaa.data(phenofile="pheno.dat",
genofile="geno.raw", sort=T)
# phenofile – data table with phenotypes
# genofile – internally formatted genotypic data file produced by convert.snp.text
# sort – logical value indicating whether SNPs should be sorted in ascending order according to chromosome and position
IMPORT DATA TO GenABEL: load.gwaa.data()
> convert.snp.text("Geno.dat","Geno.raw")
> genphen <- load.gwaa.data(
phenofile="Pheno.dat",
genofile="Geno.raw", sort=F)

DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
• Generate descriptive summary tables for phenotypic data
> descriptives.trait(data, by.var…)
# data – an object of snp.data-class or gwaa.data-class
# by.var – a binary trait; the analysis is done separately for each group

DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(ge03d2ex)
        No    Mean     SD
id     136      NA     NA
sex    136   0.529  0.501
age    136  49.069 12.926
dm2    136   0.632  0.484
height 135 169.440  9.814
weight 135  87.397 25.510
diet   136   0.059  0.236
bmi    135  30.301  8.082

DESCRIPTIVE STATISTICS OF PHENOTYPE: descriptives.trait()
> descriptives.trait(ge03d2ex,
by=ge03d2ex@phdata$dm2)
       No(by.var=0)    Mean     SD No(by.var=1)    Mean     SD   Ptt   Pkw
id               50      NA     NA           86      NA     NA    NA    NA
sex              50   0.420  0.499           86   0.593  0.494 0.053 0.052
age              50  47.038 13.971           86  50.250 12.206 0.179 0.205
dm2              50      NA     NA           86      NA     NA    NA    NA
height           49 167.671  8.586           86 170.448 10.362 0.097 0.141
weight           49  76.534 17.441           86  93.587 27.337 0.000 0.000
diet             50   0.060  0.240           86   0.058  0.235 0.965 0.965
bmi              49  27.304  6.463           86  32.008  8.441 0.000 0.001
       Pexact
id         NA
sex     0.074
age        NA
dm2        NA
height     NA
weight     NA
diet    1.000
bmi        NA
DESCRIPTIVE STATISTICS OF MARKERS: descriptives.marker()
• Generate descriptive summary tables for genotypic data
> descriptives.marker(data, digits…)
# data – an object of snp.data-class or gwaa.data-class
# digits – number of digits to be printed

DESCRIPTIVE STATISTICS OF MARKERS: descriptives.marker()
> descriptives.marker(ge03d2ex)
$'Cumulative distr. of number of SNPs out of HWE, at different alpha'
     X<=1e-04 X<=0.001 X<=0.01 X<=0.05 all X
No     46.000   71.000 125.000 275.000  4000
Prop    0.011    0.018   0.031   0.069     1

DESCRIPTIVE STATISTICS OF MARKERS: descriptives.marker()
> descriptives.marker(ge03d2ex,
ids=(dm2==0))[2]
$'Cumulative distr. of number of SNPs out of HWE, at different alpha'
     X<=1e-04 X<=0.001 X<=0.01 X<=0.05 all X
No          0    3.000  14.000  98.000  4000
Prop        0    0.001   0.003   0.025     1

> descriptives.marker(ge03d2ex,
ids=(dm2==1))[2]
$'Cumulative distr. of number of SNPs out of HWE, at different alpha'
     X<=1e-04 X<=0.001 X<=0.01 X<=0.05 all X
No     45.000    79.00 136.000 268.000  4000
Prop    0.011     0.02   0.034   0.067     1

ESTIMATE OF INFLATION FACTOR: estlambda()
• Estimate the inflation factor of p-values or 1 df chi-square tests.
• The major use of this procedure is Genomic Control, but it can also be used to visualize the distribution of p-values coming from other tests.
> estlambda(data, …)
# data – A vector of p-values or a vector of chi-squares with 1 df.
ESTIMATE OF INFLATION FACTOR: estlambda()
# Test of HWE in control group (Population 1)
> s1 <- summary(ge03d2ex@gtdata[
(ge03d2ex@phdata$dm2==0),])
> pexcas <- s1[,"Pexact"]
> estlambda(pexcas)
$estimate
[1] 0.6807053
$se
[1] 0.002574679

ESTIMATE OF INFLATION FACTOR: estlambda()
# Test of HWE in case group (Population 2)
> s2 <- summary(ge03d2ex@gtdata[
(ge03d2ex@phdata$dm2==1),])
> pexcas <- s2[,"Pexact"]
> estlambda(pexcas)
$estimate
[1] 1.605193
$se
[1] 0.0117107
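
To make the idea of an inflation factor concrete, the sketch below computes a median-based lambda in base R: the median of the observed 1 df chi-square statistics divided by the median expected under the null. This is a commonly used estimator, offered here as an assumption; it is not taken from the GenABEL source and may differ from what estlambda() computes by default. The helper `lambda_median()` is hypothetical.

```r
# Chi-square statistics (1 df) that follow the null distribution exactly:
# evenly spaced quantiles rather than real test results.
chi2_null <- qchisq(ppoints(1000), df = 1)

# Median-based inflation factor: observed median over the expected median.
lambda_median <- function(chi2) median(chi2, na.rm = TRUE) / qchisq(0.5, df = 1)

lambda_null     <- lambda_median(chi2_null)        # close to 1
lambda_inflated <- lambda_median(1.5 * chi2_null)  # close to 1.5

# P-values are first converted to 1-df chi-square statistics.
p_values <- pchisq(chi2_null, df = 1, lower.tail = FALSE)
lambda_from_p <- lambda_median(qchisq(p_values, df = 1, lower.tail = FALSE))
```

A value near 1 indicates that the test statistics behave as expected under the null, values above 1 indicate inflation, and values below 1 indicate deflation, which is how the estimates of about 0.68 and 1.61 above would be read.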

ESTIMATE OF INFLATION FACTOR: estlambda()
[Figure: estlambda() plots for Population 1 and Population 2]

DATA CHECKING PROCEDURE: check.marker()
• Function to do genotypic quality control by selecting the markers which should enter into GWA analysis based on call rate, minor allele frequency, value of the chi-square test for HWE, and redundancies defined as concordance between the distributions of the genotypes

> check.marker(data, digits…)
# data – an object of snp.data-class or gwaa.data-class
DATA CHECKING PROCEDURE: check.marker()
> qc1 <- check.marker(ge03d2ex)
Excluding people/markers with extremely low call rate...
4000 markers and 136 people in total
0 people excluded because of call rate < 0.1
6 markers excluded because of call rate < 0.1
Passed: 3994 markers and 136 people
Running sex chromosome checks...
197 heterozygous X-linked male genotypes found
1 X-linked markers are likely to be autosomal (odds > 1000 )
2 male are likely to be female (odds > 1000 )
0 female are likely to be male (odds > 1000 )
If these people/markers are removed, 0 heterozygous male genotypes are left
Passed: 3993 markers and 134 people
no X/Y/mtDNA-errors to fix

DATA CHECKING PROCEDURE: check.marker()
RUN 1
3993 markers and 134 people in total
304 (7.613323%) markers excluded as having low (<1.865672%) minor allele frequency
36 (0.9015778%) markers excluded because of low (<95%) call rate
0 (0%) markers excluded because they are out of HWE (P <0)
1 (0.7462687%) people excluded because of low (<95%) call rate
Mean autosomal HET is 0.2747262 (s.e. 0.03721277)
3 (2.238806%) people excluded because too high autosomal heterozygosity (FDR <1%)
Excluded people had HET >= 0.4856887
Mean IBS is 0.7710946 (s.e. 0.02053383), as based on 2000 autosomal markers
1 (0.7462687%) people excluded because of too high IBS (>=0.95)
In total, 3653 (91.4851%) markers passed all criteria
In total, 129 (96.26866%) people passed all criteria

DATA CHECKING PROCEDURE: check.marker()
# Generate a short summary of QC
> summary(qc1)
$`Per-SNP fails statistics`
          NoCall NoMAF NoHWE Redundant Xsnpfail
NoCall        42     0     0         0        0
NoMAF         NA   376     0         0        0
NoHWE         NA    NA     0         0        0
Redundant     NA    NA    NA         0        0
Xsnpfail      NA    NA    NA        NA        1

$`Per-person fails statistics`
         IDnoCall HetFail IBSFail isfemale ismale isXXY
IDnoCall        1       0       0        0      0     0
HetFail        NA       3       0        0      0     0
IBSFail        NA      NA       1        0      0     0
isfemale       NA      NA      NA        2      0     0
ismale         NA      NA      NA       NA      0     0
isXXY          NA      NA      NA       NA     NA     0

DATA CHECKING PROCEDURE: check.marker()
• Object returned by check.marker()
> names(qc1)
[1] "nofreq" "nocall" "nohwe" "Xmrkfail" "hetfail" "idnocall"
[7] "ibsfail" "isfemale" "ismale" "snpok" "idok" "call"

• qc1 provides the list of individuals (idok) and SNPs (snpok) that pass all QC criteria.
GENERATE NEW DATA SET
• Obtain a new data set containing only the people and SNPs that pass all QC criteria.
> data1 <- ge03d2ex[qc1$idok, qc1$snpok]
• idok lists the people who pass all QC criteria
• snpok lists the SNPs that pass all QC criteria

ESTIMATE OF INFLATION FACTOR AFTER CLEANING
# Test of HWE in case group (Population2)
> s3 <- summary(data1@gtdata[(data1@phdata$dm2==1),])
> pexcas <- s3[,"Pexact"]
> estlambda(pexcas)
$estimate
[1] 1.615536

$se
[1] 0.01230754
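As an illustration of what an inflation factor measures, the sketch below computes a median-based lambda from a vector of P-values in base R. This is one common estimator and is an assumption here: estlambda() may use a different method, so the numbers need not match. The function name lambda_median and the test vectors are hypothetical.

```r
# Median-based inflation factor for P-values from 1-df tests.
# Illustrative estimator only; not necessarily what estlambda() computes.
lambda_median <- function(p) {
  p <- p[!is.na(p) & p > 0 & p <= 1]
  chi2 <- qchisq(p, df = 1, lower.tail = FALSE)   # P-values back to chi-square
  median(chi2) / qchisq(0.5, df = 1)              # observed vs expected median
}

# Evenly spread P-values mimic the null distribution: lambda close to 1
p_null <- ppoints(1000)
lambda_null <- lambda_median(p_null)

# Squaring pushes P-values toward 0, mimicking inflation: lambda above 1
p_inflated <- p_null^2
lambda_infl <- lambda_median(p_inflated)

print(c(null = lambda_null, inflated = lambda_infl))
```

A value near 1 suggests the test statistics follow their null distribution; values well above 1 point to residual problems such as genotyping error or population structure.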

ESTIMATE OF INFLATION FACTOR AFTER CLEANING
# Test of HWE in case group (Population2)
[Figure: two panels, "Before" and "After"]

FINDING GENETIC SUBSTRUCTURE: ibs()
Step 1: Given a set of SNPs, obtain a matrix of average identity-by-state (IBS) for a set of subjects and markers using ibs()

> ibs(data, weight…)

# data – an object of snp.data-class
# weight – "no" for direct IBS computations
           "freq" to weight by allelic frequency

• This function facilitates quality control of genomic data.

IDENTITY-BY-STATE (IBS) : ibs()
> data1.gkin <- ibs(data1[,data1@gtdata@chromosome!="X"], weight="freq")
> data1.gkin[1:3, 1:3]
            id199         id300       id403
id199  0.48965582 3262.00000000 3261.000000
id300 -0.01226800    0.48794278 3268.000000
id403 -0.01246468   -0.01262302    0.511605

FINDING GENETIC SUBSTRUCTURE
Step 2: Transform matrix in Step 1 to distance matrix
> data1.dist <- as.dist(0.5-data1.gkin)

Step 3: Perform Classical Multidimensional Scaling
> data1.mds <- cmdscale(data1.dist)

Step 4: Plot the first two components
> plot(data1.mds)
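Steps 1 to 4 can be tried without GenABEL on simulated data. The base R sketch below is a rough stand-in under stated assumptions: the frequency-weighted similarity K is a simple approximation written for this example (not the ibs() code), and the sample sizes, allele frequencies and object names are hypothetical.

```r
set.seed(1)
n1 <- 40; n2 <- 10; m <- 500           # hypothetical sample and marker counts
pop <- rep(c(1, 2), c(n1, n2))         # true subpopulation of each person

# Each subpopulation gets its own allele frequencies (strong differentiation)
p1 <- runif(m, 0.1, 0.9)
p2 <- runif(m, 0.1, 0.9)
G <- rbind(
  matrix(rbinom(n1 * m, 2, rep(p1, each = n1)), n1, m),
  matrix(rbinom(n2 * m, 2, rep(p2, each = n2)), n2, m)
)
rownames(G) <- paste0("id", seq_len(n1 + n2))

# Step 1 (approximation): frequency-weighted similarity, genotypes coded 0, 0.5, 1
x    <- G / 2
p    <- colMeans(x)
keep <- p > 0 & p < 1                               # drop monomorphic markers
z    <- sweep(x[, keep], 2, p[keep])                # centre by allele frequency
z    <- sweep(z, 2, sqrt(p[keep] * (1 - p[keep])), "/")
K    <- tcrossprod(z) / sum(keep)                   # diagonal is roughly 0.5

# Step 2: similarity to distance
d <- as.dist(0.5 - K)

# Step 3: classical multidimensional scaling, two components
mds <- cmdscale(d, k = 2)

# Step 4: plot the first two components, coloured by true subpopulation
plot(mds, col = pop, pch = 19, xlab = "Component 1", ylab = "Component 2")
```

With allele frequencies this different, the smaller group should appear as a separate cloud along the first component, which is the pattern the slides use to spot outliers.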

FINDING GENETIC SUBSTRUCTURE
[Figure: points annotated "Must exclude these"]

REMOVE OUTLIERS: kmeans()
• Perform k-means clustering on a data matrix

> kmeans(x, centers, algorithm…)

# x – a numeric matrix of data, or an object that can be coerced to such a matrix.
# centers – number of clusters, or a set of initial distinct cluster centers
# algorithm – character string specifying the algorithm used for clustering. The default is "Hartigan-Wong"
REMOVE OUTLIERS: kmeans()
> km <- kmeans(data1.mds, centers=2)
> names(km)
[1] "cluster" "centers" "withinss" "size"

• Select cluster 1 : Non-outliers
> cl1 <- names(which(km$cluster==1))

• Select cluster 2 : Outliers
> cl2 <- names(which(km$cluster==2))

• New data set without outliers
> data2 <- data1[cl1,]

DATA CHECKING PROCEDURE: data2
> qc2 <- check.marker(data2, hweids=(data2@phdata$dm2==0), fdr=0.2)
> summary(qc2)
$`Per-SNP fails statistics`
          NoCall NoMAF NoHWE Redundant Xsnpfail
NoCall         0     0     0         0        0
NoMAF         NA    40     0         0        0
NoHWE         NA    NA     0         0        0
Redundant     NA    NA    NA         0        0
Xsnpfail      NA    NA    NA        NA        0

$`Per-person fails statistics`
         IDnoCall HetFail IBSFail isfemale ismale isXXY
IDnoCall        0       0       0        0      0     0
HetFail        NA       0       0        0      0     0
IBSFail        NA      NA       0        0      0     0
isfemale       NA      NA      NA        0      0     0
ismale         NA      NA      NA       NA      0     0
isXXY          NA      NA      NA       NA     NA     0
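One practical caveat: the cluster numbers returned by kmeans() are arbitrary, so cluster 1 is not guaranteed to be the main group on every run. The base R sketch below, on hypothetical MDS-like coordinates, picks the main group by cluster size instead. The coordinates and object names are made up for illustration.

```r
set.seed(2)
# Hypothetical MDS-like coordinates: a main cloud plus a few distant outliers
main     <- matrix(rnorm(2 * 45, mean = 0,   sd = 0.02), ncol = 2)
outliers <- matrix(rnorm(2 * 5,  mean = 0.5, sd = 0.02), ncol = 2)
coords   <- rbind(main, outliers)
rownames(coords) <- paste0("id", seq_len(nrow(coords)))

# Several random starts make the result less dependent on initial centers
km <- kmeans(coords, centers = 2, nstart = 10)

# Labels are arbitrary, so identify the main group as the larger cluster
main_label  <- which.max(km$size)
keep_ids    <- names(which(km$cluster == main_label))
outlier_ids <- names(which(km$cluster != main_label))

print(outlier_ids)
```

The vector keep_ids plays the role of cl1 and outlier_ids the role of cl2. Checking km$size (or the plot) before subsetting avoids accidentally keeping the outliers.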

GENERATE FINAL DATA SET : clean
• Obtain the new data set containing only the people and SNPs that pass all QC criteria.
> clean <- data2[qc2$idok, qc2$snpok]
• This data set will be used for the final analysis

GWA ASSOCIATION ANALYSIS: qtscore()
• Fast score test for association between a trait and genetic polymorphism

> qtscore(formula, data, trait…)

# formula – formula describing fixed effects (y ~ a + b means the outcome y depends on two covariates, a and b)
# data – an object of gwaa.data-class
# trait – "gaussian" or "binomial"
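For intuition, a 1-df score test for a single SNP and a quantitative trait can be written in a few lines of base R. This is a minimal sketch without covariates, not the qtscore() implementation. The function name score_test and the simulated data are hypothetical.

```r
# Score test for one SNP (coded 0/1/2) and a quantitative trait, no covariates.
score_test <- function(y, g) {
  ok <- !is.na(y) & !is.na(g)
  y <- y[ok]; g <- g[ok]
  n  <- length(y)
  gc <- g - mean(g)
  yc <- y - mean(y)
  U  <- sum(gc * yc)                    # score: genotype-trait covariation
  V  <- sum(gc^2) * sum(yc^2) / n       # variance of the score under no association
  chi2 <- U^2 / V                       # equals n times the squared correlation
  c(chi2 = chi2, p = pchisq(chi2, df = 1, lower.tail = FALSE))
}

set.seed(3)
g        <- rbinom(200, 2, 0.3)         # simulated genotypes
y_null   <- rnorm(200)                  # trait unrelated to the SNP
y_effect <- 0.8 * g + rnorm(200)        # trait with an additive SNP effect

res_null   <- score_test(y_null, g)
res_effect <- score_test(y_effect, g)
print(rbind(null = res_null, effect = res_effect))
```

Because the statistic needs only sums over the data and no model fitting per SNP, it is cheap to repeat across many markers, which is why score tests suit genome-wide scans.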


GWA SCAN OF CLEANED DATA
> cleanqt <- qtscore(dm2, clean)
> cleanqt$lambda
$estimate
[1] 1.040969

$se
[1] 0.0007325815

$iz0
[1] 1.016119

$iz2
[1] 1

GWA SCAN OF CLEANED DATA
> plot(cleanqt)

GWA ASSOCIATION ANALYSIS: descriptives.scan()
• Function that describes the "top" hits in a GWA scan

> descriptives.scan(data)

# data – an object of scan.gwaa-class (e.g. the result of qtscore())

GWA ASSOCIATION ANALYSIS: descriptives.scan()
> descriptives.scan(cleanqt, digits=5)
          Chromosome Position   N     effB    P1df   Pc1df    effAB    effBB    P2df
rs1719133          1  4495479 125 -0.27097 0.00036 0.00047 -0.21539 -0.72603 0.00093
rs8835506          2  6010852 122  0.23464 0.00085 0.00107  0.32015  0.32015 0.00130
rs4804634          1  2807417 122 -0.21614 0.00109 0.00137 -0.09717 -0.39474 0.00265
rs3925525          2  6008501 125  0.22808 0.00111 0.00139  0.30462  0.31876 0.00201
rs3224311          2  6009769 125  0.22808 0.00111 0.00139  0.30462  0.31876 0.00201
rs2975760          3 10518480 124  0.22449 0.00125 0.00157  0.25704  0.39593 0.00479
rs4534929          1  4474374 124 -0.18643 0.00200 0.00245 -0.14876 -0.39683 0.00743
rs6079246          2  7048058 124 -0.47305 0.00211 0.00258 -0.47305       NA 0.00211
rs5308595          3 10543128 123  0.27318 0.00237 0.00289  0.28441  0.47191 0.00955
rs1013473          1  4487262 125  0.18645 0.00257 0.00312  0.27148  0.37827 0.00678
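The Pc1df column can be read as P1df after genomic control, that is, after dividing the 1-df test statistic by the inflation factor lambda. The base R sketch below assumes exactly that relationship and uses the rounded values shown on the slides, so small differences from the table are expected. The function name gc_correct is hypothetical.

```r
# Genomic-control correction of a 1-df P-value:
# convert to chi-square, divide by lambda, convert back.
gc_correct <- function(p, lambda) {
  chi2 <- qchisq(p, df = 1, lower.tail = FALSE)
  pchisq(chi2 / lambda, df = 1, lower.tail = FALSE)
}

lambda_clean <- 1.040969   # lambda estimate reported for the cleaned scan
p_top        <- 0.00036    # P1df of the top hit, as rounded in the table

p_corrected <- gc_correct(p_top, lambda_clean)
print(signif(p_corrected, 2))   # compare with the Pc1df column
```

With lambda above 1 the corrected P-value is always larger than the uncorrected one, and with lambda equal to 1 it is unchanged.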
COMPARISON OF TWO SCANS: ORIGINAL VS. CLEANED
> origdata <- qtscore(dm2, ge03d2ex)
> plot(cleanqt, col="green")
> add.plot(origdata, col="red")
GWA IN THE PRESENCE
OF GENETIC STRATIFICATION

STRUCTURED ASSOCIATION ANALYSIS
> pop <- as.numeric(data1@phdata$id %in% cl2)
> pop
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 [26] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [51] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [76] 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[101] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[126] 0 0 0 0

STRATIFIED ASSOCIATION
> data1.sa <- qtscore(dm2, data=data1, strata=pop)
# strata – stratification variable. Scores are computed within strata and then added up

• Plot results and compare with the analysis of cleaned data
> plot(cleanqt, cex=0.5, pch=19, ylim=c(1,4))
> add.plot(data1.sa, col="green", cex=1.2)
> add.plot(origdata, col="red", cex=1.2)
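The idea that scores are computed within strata and then added up can be sketched in base R as follows. This is a simplified illustration without covariates, not the qtscore() code. The function name stratified_score and the toy data are hypothetical.

```r
# Stratified 1-df score test: sum the within-stratum scores and variances.
stratified_score <- function(y, g, strata) {
  U <- 0; V <- 0
  for (s in unique(strata)) {
    i  <- strata == s
    ys <- y[i] - mean(y[i])                    # trait centred within stratum
    gs <- g[i] - mean(g[i])                    # genotype centred within stratum
    U  <- U + sum(gs * ys)                     # add up the scores
    V  <- V + sum(gs^2) * sum(ys^2) / sum(i)   # add up their variances
  }
  chi2 <- U^2 / V
  c(chi2 = chi2, p = pchisq(chi2, df = 1, lower.tail = FALSE))
}

# Toy data: genotype and trait are unrelated within each group,
# but both are higher in the second group
g      <- c(0, 0, 1, 1,  1, 1, 2, 2)
y      <- c(1, 2, 1, 2,  3, 4, 3, 4)
strata <- c(0, 0, 0, 0,  1, 1, 1, 1)

res_strat  <- stratified_score(y, g, strata)     # chi2 = 0
res_pooled <- stratified_score(y, g, rep(0, 8))  # one stratum: chi2 = 3.2
print(rbind(stratified = res_strat, pooled = res_pooled))
```

The pooled analysis picks up the between-group difference as if it were a SNP effect, while the stratified analysis only uses within-group information and finds nothing.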
STRATIFIED ASSOCIATION
• Comparison of Structured Association analysis
[Figure]

PCA USING PRICE’S METHOD: egscore()
• Fast score test for association (FASTA) between a trait and genetic polymorphism, adjusted for possible stratification by principal components.

> egscore(formula, data, kin…)

# formula – formula describing fixed effects (y ~ a + b means the outcome y depends on two covariates, a and b)
# data – an object of gwaa.data-class
# kin – kinship matrix as returned by ibs

PCA USING PRICE’S METHOD: egscore()
> data1.eg <- egscore(dm2, data=data1, kin=data1.gkin)

• Plot results and compare with the analysis of cleaned data
> plot(cleanqt, cex=0.5, pch=19, ylim=c(1,5))
> add.plot(data1.sa, col="green", cex=1.2)
> add.plot(data1.eg, col="red", cex=1.3)

PCA USING PRICE’S METHOD: egscore()
[Figure: plot titled "qtscore(dm2, clean)"; x-axis: Chromosome (1, 2, 3, X); y-axis: −log10(P-value)]
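The principle behind principal-component adjustment can be sketched in base R: take the leading eigenvectors of a kinship matrix, regress them out of both trait and genotype, and test the residuals. The code below follows that general recipe under stated assumptions and is not the egscore() implementation. The function name pc_adjusted_test, the toy kinship matrix and the choice k = 1 are hypothetical.

```r
# Association test adjusted for the top k principal components of K.
pc_adjusted_test <- function(y, g, K, k = 1) {
  pcs <- eigen(K, symmetric = TRUE)$vectors[, seq_len(k), drop = FALSE]
  X   <- cbind(1, pcs)                 # intercept plus top k components
  qrX <- qr(X)
  ry  <- qr.resid(qrX, y)              # trait with the components regressed out
  rg  <- qr.resid(qrX, g)              # genotype with the components regressed out
  r2  <- sum(ry * rg)^2 / (sum(ry^2) * sum(rg^2))
  chi2 <- (length(y) - k - 1) * r2     # scaled squared correlation of residuals
  c(chi2 = chi2, p = pchisq(chi2, df = 1, lower.tail = FALSE))
}

# Toy data: two groups that differ in both genotype and trait,
# with no association inside either group
g <- c(0, 0, 1, 1,  1, 1, 2, 2)
y <- c(1, 2, 1, 2,  3, 4, 3, 4)

# Hypothetical kinship matrix with a two-block structure matching the groups
s <- c(-1, -1, -1, -1, 1, 1, 1, 1)
K <- 0.1 * outer(s, s)

res_pc <- pc_adjusted_test(y, g, K, k = 1)
print(res_pc)
```

Here the first component captures group membership, so after adjustment the spurious genotype-trait association disappears (the statistic is essentially zero). In real data the number of components to use is a judgment call.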
REFERENCES
• Zhao JH. Use of R in Genome-wide Association Studies (GWASs)
• Aulchenko, Yurii. GenABEL Tutorial (March 14, 2008)

THANK YOU! ☺
