Leilani Nora
Assistant Scientist
• When a population is in HWE, the next generation will have the same genotype frequencies as well as the same allele frequencies.
• Genotype frequencies for a population in HWE, where p = P(A) and q = P(a):
$P(AA) = P(A)\,P(A) = p^2$
$P(Aa) = 2\,P(A)\,P(a) = 2pq$
$P(aa) = P(a)\,P(a) = q^2$
• For example, consider a diallelic locus with alleles “A” and “a” with frequencies 0.85 and 0.15, respectively. If the locus is in HWE, calculate the genotype frequencies.

Violation in HWE Assumption
• When a locus is not in HWE, this suggests that one or more of the Hardy-Weinberg assumptions is false.
• Departure from HWE has been used to infer the existence of natural selection, argue for the existence of assortative (non-random) mating, and infer genotyping errors.
• It is therefore of interest to test whether a population is in HWE at a locus.
• Two most popular ways of testing HWE
- Chi-Square test
- Exact test
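For the diallelic example with allele frequencies 0.85 and 0.15, the expected genotype frequencies under HWE can be worked out directly. The following is a minimal sketch in base R; the variable names are illustrative only and do not come from the slides.

```r
# Allele frequencies from the example: P(A) = 0.85, P(a) = 0.15
p <- 0.85
q <- 0.15

# Expected genotype frequencies under HWE: p^2, 2pq, q^2
geno_freq <- c(AA = p^2, Aa = 2 * p * q, aa = q^2)
print(geno_freq)

# Allele frequency of A in the next generation: P(AA) + 0.5 * P(Aa)
p_next <- unname(geno_freq["AA"] + 0.5 * geno_freq["Aa"])
print(p_next)
```

The three genotype frequencies sum to 1, and the allele frequency recovered from them equals the starting value of p, which is the sense in which frequencies are unchanged in the next generation.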
CHI-SQUARE GOODNESS OF FIT
• Compares observed genotype counts with the values expected under Hardy-Weinberg
• For a locus with two alleles, we might construct a table as follows:

Genotype   Observed   Expected Under HWE
AA         $n_{AA}$   $np^2$
Aa         $n_{Aa}$   $2npq$
aa         $n_{aa}$   $nq^2$

• Test Statistic for Allelic Association is:

$$\chi^2 = \sum_{\text{Genotypes}} \frac{(\text{Observed count} - \text{Expected count})^2}{\text{Expected count}} \approx \chi^2_{1\,\mathrm{df}} \quad \text{under } H_0$$
• Illustration (chi-square test)
> HWE.chisq(Snp4)
Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
data: tab
X-squared = 0.2298, df = NA, p-value = 0.6657

• Illustration (exact test)
> HWE.exact(Snp4)
Exact Test for Hardy-Weinberg Equilibrium
data: snp4
N11 = 109, N12 = 105, N22 = 29, N1 = 323, N2 = 163,
p-value = 0.666
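The chi-square statistic reported for this SNP can be reproduced by hand from the genotype counts shown in the exact-test output (N11 = 109, N12 = 105, N22 = 29). The sketch below uses base R only; it assumes N11, N12 and N22 are the counts of the two homozygotes and the heterozygote as labelled in the code, and the variable names are illustrative.

```r
# Observed genotype counts (from the output: N11, N12, N22)
n_AA <- 109
n_Aa <- 105
n_aa <- 29
n <- n_AA + n_Aa + n_aa

# Allele frequency estimated from the genotype counts
p <- (2 * n_AA + n_Aa) / (2 * n)
q <- 1 - p

# Observed counts and counts expected under HWE
observed <- c(AA = n_AA, Aa = n_Aa, aa = n_aa)
expected <- c(AA = n * p^2, Aa = 2 * n * p * q, aa = n * q^2)

# Goodness-of-fit statistic, compared against a chi-square with 1 df
chisq <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

print(round(chisq, 4))
print(p_value)
```

The statistic agrees with the reported X-squared of 0.2298. The p-value here comes from the asymptotic 1-df distribution, so it need not match the simulated p-value in the output exactly.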
LINKAGE DISEQUILIBRIUM
• Also known as gametic phase disequilibrium, gametic disequilibrium and allelic association.
• Non-random association of alleles at different loci

• Polymorphisms are not completely correlated, but there is no evidence of recombination.
• This type of LD structure develops when mutations occur on different allelic lineages.
• This is the situation in which r2 and D’ act differently, with D’ still equal to 1, but where r2 can be much smaller.

Usage
> LD(g1, g2, …)
# g1 – genotype object or dataframe containing genotype objects
# g2 – genotype object (ignored if g1 is a dataframe)
Snp5 D 0.2135702
Snp5 D' 0.9997231
Snp5 Corr. 0.9098532
Snp5 X^2 379.1474204
Snp5 P-value < 2.2204e-16
Snp5 n 229
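The case where D’ equals 1 while r2 is much smaller can be made concrete with a small calculation from haplotype frequencies. The sketch below is base R; the function `ld_stats` and the haplotype frequencies are hypothetical and chosen only for illustration, with one of the four haplotypes absent (no evidence of recombination).

```r
# Compute D, D' and r^2 from the four haplotype frequencies at two
# diallelic loci (alleles A/a and B/b). Hypothetical helper, not part
# of any package.
ld_stats <- function(pAB, pAb, paB, pab) {
  stopifnot(abs(pAB + pAb + paB + pab - 1) < 1e-8)
  pA <- pAB + pAb            # frequency of allele A
  pB <- pAB + paB            # frequency of allele B
  D <- pAB - pA * pB         # disequilibrium coefficient
  # Largest |D| possible given the allele frequencies
  Dmax <- if (D >= 0) {
    min(pA * (1 - pB), (1 - pA) * pB)
  } else {
    min(pA * pB, (1 - pA) * (1 - pB))
  }
  c(D = D,
    Dprime = abs(D) / Dmax,
    r2 = D^2 / (pA * (1 - pA) * pB * (1 - pB)))
}

# Haplotype Ab is absent, and the allele frequencies at the two loci differ
res <- ld_stats(pAB = 0.1, pAb = 0.0, paB = 0.4, pab = 0.5)
print(round(res, 4))
```

With these assumed frequencies D’ is 1 because only three of the four haplotypes occur, while r2 is about 0.11 because the allele frequencies at the two loci are far apart.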
FACTORS CREATING LINKAGE DISEQUILIBRIUM
5 Processes that can produce linkage disequilibrium
1. Mutation – provides the raw material for producing polymorphisms that will be in LD
2. Random Drift

THE RATE OF RECOMBINATION, r
• Let r be the rate of recombination between the A and B loci.
• r ranges in value between 0 and 0.5. The maximum value of r is 0.5 because, with independent assortment of the loci, 0.5 of the gametes produced will be of the parental type.
ASSOCIATION ANALYSIS
Residuals:
     Min       1Q   Median       3Q      Max
-2.63700 -0.62291 -0.01225  0.58922  3.05561
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.081114 0.092517 -0.877 0.382
Parametric method
• STRUCTURE
CLUSTER ANALYSIS
• Exploratory technique which may be used to search for category structure based on natural groupings in the data, or to reduce a very large body of data to a relatively compact description.
• No assumptions are made concerning the number of groups or the group structure.
• Grouping is done on the basis of similarities or distances (dissimilarities).
• There are various techniques for doing this, which may give different results. Thus, the researcher should consider the validity of the clusters found.

HIERARCHICAL CLUSTER ANALYSIS
Steps in Performing Agglomerative Hierarchical Clustering
1. Obtain the Data Matrix
2. Standardize the data matrix if need be
3. Generate the resemblance or distance matrix
4. Execute the Clustering Method
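The four steps of agglomerative hierarchical clustering can be sketched in base R as follows. The data matrix is a small made-up example, and the choices of Euclidean distance and average linkage are assumptions for illustration; other distances and linkage methods may give different groupings.

```r
# Step 1: obtain the data matrix (hypothetical: 6 entries x 2 traits)
x <- matrix(c(1.0, 1.2, 0.9, 5.1, 4.8, 5.0,
              2.0, 2.1, 1.9, 8.0, 8.2, 7.9), ncol = 2)
rownames(x) <- paste0("G", 1:6)

# Step 2: standardize each column to mean 0 and standard deviation 1
x_std <- scale(x)

# Step 3: generate the distance matrix (Euclidean, assumed here)
d <- dist(x_std, method = "euclidean")

# Step 4: execute the clustering method (average linkage, assumed here)
hc <- hclust(d, method = "average")

# Cut the tree into two groups and show the membership of each entry
groups <- cutree(hc, k = 2)
print(groups)
```

Calling `plot(hc)` afterwards draws the dendrogram, which is a convenient way to judge how many groups the data support.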
Partitioning Method
1. K-means clustering
2. K-centroids
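For the first partitioning method listed, a minimal base R sketch with `kmeans()` is shown below. The data matrix and the choice of two clusters are assumptions made for illustration, since a partitioning method needs the number of groups to be specified in advance.

```r
# Hypothetical data matrix: 6 entries x 2 traits
x <- matrix(c(1.0, 1.2, 0.9, 5.1, 4.8, 5.0,
              2.0, 2.1, 1.9, 8.0, 8.2, 7.9), ncol = 2)
rownames(x) <- paste0("G", 1:6)

# K-means starts from randomly chosen centers, so fix the seed and
# try several starts, keeping the best solution
set.seed(42)
km <- kmeans(x, centers = 2, nstart = 10)

print(km$cluster)   # cluster membership of each entry
print(km$centers)   # cluster means
```

The cluster labels themselves are arbitrary; only the grouping of entries is meaningful.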
DATAFRAME: AMP2009.csv
Read data file AMP2009.csv
> AMP09 <- read.table("AMP2009.csv",
header=TRUE, sep=',', row.names="Gen")
> str(AMP09)
'data.frame': 207 obs. of 200 variables:
$ SNP1 : Factor w/ 3 levels "AA","AB","BB": 2 2 2 2 ...
$ SNP2 : Factor w/ 3 levels "AA","AB","BB": 1 1 1 1 ...
.......
$ SNP209 : Factor w/ 2 levels "AB","BB": 2 2 1 2 2 2 ...
$ SNP210 : Factor w/ 2 levels "AA","AB": 1 1 1 1 1 2 ...
$ ADF : num 21.3 22.8 22.2 17.8 25 ...
.......

DATA FRAME: AMP09
Select SNPs data
> SNP09 <- subset(AMP09, select=SNP1:SNP210)
> str(SNP09)
'data.frame': 207 obs. of 184 variables:
$ SNP1 : Factor w/ 3 levels "AA","AB","BB": 3 3 3 2 ...
$ SNP2 : Factor w/ 3 levels "AA","AB","BB": 1 1 2 1 ...
$ SNP3 : Factor w/ 3 levels "AA","AB","BB": 2 2 3 2 ...
…
Classical (Metric) Multidimensional Scaling: cmdscale()
> plot(data1.mds)
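A minimal base R sketch of classical multidimensional scaling is given below. The data matrix is hypothetical and Euclidean distance is an assumption; in the slides the input would instead be a distance matrix derived from the genotype data.

```r
# Hypothetical data matrix: 6 entries x 2 traits
x <- matrix(c(1.0, 1.2, 0.9, 5.1, 4.8, 5.0,
              2.0, 2.1, 1.9, 8.0, 8.2, 7.9), ncol = 2)
rownames(x) <- paste0("G", 1:6)

# Distance matrix between entries (Euclidean, assumed here)
d <- dist(x)

# Classical MDS: coordinates in k = 2 dimensions whose pairwise
# distances approximate the entries of d
mds <- cmdscale(d, k = 2)
print(mds)
```

Calling `plot(mds)` then draws the two coordinates against each other, in the same way as `plot(data1.mds)` on the slide. Because the example data are two-dimensional to begin with, the distances between the MDS coordinates reproduce the original distances.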
$se
[1] 0.0007325815
$iz0
[1] 1.016119
$iz2
[1] 1
PCA USING PRICE’S METHOD: egscore()
> data1.eg <- egscore(dm2, data=data1,
kin=data1.gkin)

• Plot results and compare with the analysis of cleaned data
> plot(cleanqt, cex=0.5, pch=19,
ylim=c(1,5))
> add.plot(data1.sa, col="green",
cex=1.2)
> add.plot(data1.eg, col="red",
cex=1.3)

[Figure: qtscore(dm2, clean); x-axis: Chromosome (1, 2, 3, X); y-axis: -log10(P-value)]
REFERENCES
• Zhao, JH. Use of R in Genome-wide Association Studies (GWASs)
• Aulchenko, Yurii. GenABEL Tutorial (March 14, 2008)
THANK YOU! ☺