R Packages for Genome-Wide Association Studies

Qunyuan Zhang
Division of Statistical Genomics
Statistical Genetics Forum
March 10, 2008

What is R?

R is a free software environment for statistical computing and graphics.
Runs on a wide variety of UNIX platforms, Windows and MacOS (interactive or batch mode)
Free and open source, can be downloaded from cran.r-project.org
Wide range of packages (base & contributed), novel methods available
Concise grammar & good structure (function, data object, methods and class)
Help from manuals and email group
Slow, time and memory consuming (can be overcome by parallel computation, and/or integration with C)
Popular, used by 70~80% of statisticians

R Task Views

http://cran.r-project.org/web/views/

Statistical Genetics Packages in R


http://cran.r-project.org/web/views/Genetics.html
Population Genetics: genetics (basic), Geneland (spatial structures of genetic data), rmetasim (population genetics simulations), hapsim (simulation), popgen (clustering SNP genotype data and SNP simulation), hierfstat (hierarchical F-statistics of genetic data), hwde (modeling genotypic disequilibria), Biodem (biodemographical analysis), kinship (pedigree analysis), adegenet (population structure), ape & apTreeshape (phylogenetic and evolution analyses), ouch (Ornstein-Uhlenbeck models), PHYLOGR (simulation and GLS model), stepwise (recombination breakpoints)

Linkage and Association: gap (both population and family data, sample size calculations, probability of familial disease aggregation, kinship calculation, linkage and association analyses, haplotype frequencies), tdthap (haplotype transmission/disequilibrium tests, TDT), powerpkg (power analyses for the affected sib pair and the TDT design), hapassoc (likelihood inference of trait associations with haplotypes in GLMs), haplo.ccs (haplotype and covariate relative risks in case-control data by weighted logistic regression), haplo.stats (haplotype analysis for unrelated subjects), ldDesign (experiment design for association and LD studies), LDheatmap (heatmap of pairwise LD), mapLD (LD and haplotype blocks), pbatR (R version of PBAT), GenABEL & SNPassoc (GWAS)

QTL mapping for data from experimental crosses: bqtl (inbred crosses and recombinant inbred lines), qtl (genome-wide scans), qtlDesign (designing QTL experiments & power computations), qtlbim (Bayesian interval QTL mapping)

Sequence & Array Data Processing: seqinr, Bioconductor packages

GenABEL

Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. GenABEL: an R package for genome-wide association analysis. Bioinformatics. 2007, 23(10):1294-6.

GenABEL: genome-wide SNP association analysis
A package for genome-wide association analysis between quantitative or binary traits and single-nucleotide polymorphisms (SNPs).
Version: 1.3-5
Depends: R (≥ 2.4.0), methods, genetics, haplo.stats, qvalue, MASS
Date: 2008-02-17
Author: Yurii Aulchenko, with contributions from Maksim Struchalin, Stephan Ripke and Toby Johnson
Maintainer: Yurii Aulchenko <i.aoultchenko at erasmusmc.nl>
License: GPL (≥ 2)
In views: Genetics
CRAN checks: GenABEL results

GenABEL: Data Objects


gwaa.data-class
phdata: phenotypic data (data frame)
gtdata: genotypic data (snp.data-class)
    nbytes: number of bytes used to store data on a SNP
    nids: number of people
    male: male code
    idnames: ID names
    nsnps: number of SNPs
    snpnames: list of SNP names
    chromosome: list of chromosomes corresponding to SNPs
    coding: list of nucleotide coding for SNP names
    strand: strands of the SNPs
    map: list of SNP positions
    gtps: genotypes (snp.mx-class)

Loading data
load.gwaa.data(phenofile = "pheno.dat", genofile = "geno.raw")
snp.data()

2-bit storage of genotype codes (saves 75%)
0 -> 00
1 -> 01
2 -> 10
3 -> 11

Converting genotype files
convert.snp.text(): from text file (GenABEL default format)
convert.snp.ped(): from Linkage, Merlin, Mach, and similar files
convert.snp.mach(): from Mach format
convert.snp.tped(): from PLINK TPED format
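The 2-bit storage idea on this slide can be illustrated with a minimal sketch. It is written in Python with only the standard library and is not GenABEL's implementation: the function names and the bit order within a byte are assumptions made for illustration. It packs four genotype codes (0 to 3) into each byte, which is consistent with the 75% saving quoted above if the comparison is against one byte per genotype (also an assumption).

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (each 0..3) four to a byte, two bits per code."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError("genotype codes must be in 0..3")
        # Assumed layout: the first genotype of each group of four
        # occupies the two highest bits of the byte.
        shift = 6 - 2 * (i % 4)
        packed[i // 4] |= g << shift
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11 for i in range(n)]


codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_genotypes(codes)      # 8 genotypes stored in 2 bytes
restored = unpack_genotypes(packed, len(codes))
```

Eight genotypes occupy two bytes here instead of eight, and unpacking returns the original codes.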

GenABEL: Data Manipulation


snp.subset(): subset data by SNP names or by QC criteria
add.phdata(): merge extra phenotypic data to the gwaa.data-class
ztransform(): standard normalization of phenotypes
rntransform(): rank-normalization of phenotypes
npsubtreated(): non-parametric adjustment of phenotypes for medicated subjects
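The two phenotype normalizations named above can be sketched as follows. This is a Python illustration of the general techniques, not the code behind ztransform() or rntransform(): the function names are hypothetical, covariate adjustment is ignored, and the rank offset of 0.5 is one common convention that is assumed here.

```python
from statistics import NormalDist, mean, stdev


def z_transform(values):
    """Standard normalization: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def rank_normal_transform(values):
    """Rank-based inverse normal transform with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average 1-based rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    # Assumed convention: map rank r to the normal quantile at (r - 0.5) / n.
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]


z = z_transform([1.0, 2.0, 3.0])
rn = rank_normal_transform([3.0, 1.0, 2.0])
```

The rank-based version depends only on the ordering of the phenotype values, so a single extreme outlier has no more influence than the next-largest value.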

GenABEL: QC & Summarization


summary.snp.data(): summary of SNP data (number of observed genotypes, call rate, allelic frequency, genotypic distribution, P-value of HWE test)
check.trait(): summary of phenotypic data and outlier check based on a specified p/FDR cut-off
check.marker(): SNP selection based on call rate, allele frequency and deviation from HWE
HWE.show(): showing HWE tables, Chi2 and exact HWE P-values
perid.summary(): call rate and heterozygosity per person
ibs(): matrix of average IBS for a group of people & a given set of SNPs
hom(): average homozygosity (inbreeding) for a set of people, across multiple markers
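Several of the QC functions above rely on a Hardy-Weinberg equilibrium (HWE) test. As a rough illustration of the chi-square version of that test, here is a small Python sketch. It is not GenABEL code: the function name is hypothetical, and the exact HWE test that the slide also mentions is not covered.

```python
from math import erfc, sqrt


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square HWE test (1 d.f.) from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)        # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("monomorphic SNP: HWE test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail probability of a chi-square variable with 1 d.f.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


chi2_eq, p_eq = hwe_chi2(25, 50, 25)       # counts exactly at HWE proportions
chi2_dev, p_dev = hwe_chi2(30, 40, 30)     # deficit of heterozygotes
```

A QC step of the kind described for check.marker() would then drop SNPs whose HWE p-value falls below a chosen cut-off.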

GenABEL: SNP Association Scans


scan.glm(): SNP association test using GLM in R library
scan.glm("y~x1+x2+...+CRSNP", family = gaussian(), data, snpsubset, idsubset)
scan.glm("y~x1+x2+...+CRSNP", family = binomial(), data, snpsubset, idsubset)
scan.glm.2D(): 2-SNP interaction scan

Fast Scan (call C language)
ccfast(): case-control association analysis by computing chi-square test from 2x2 (allelic) or 2x3 (genotypic) tables
emp.ccfast(): genome-wide significance (permutation) for ccfast() scan
qtscore(): association test (GLM) for a trait (quantitative or categorical)
emp.qtscore(): genome-wide significance (permutation) for qtscore() scan
mmscore(): score test for association between a trait and genetic polymorphism, in samples of related individuals (needs stratification variable, scores are computed within strata and then added up)
egscore(): association test, adjusted for possible stratification by principal components of genomic kinship matrix (SNP correlation matrix)
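The 2x2 allelic test described for ccfast() can be sketched for a single SNP as below. This is an illustrative Python version of the textbook chi-square test and makes no claim about how ccfast() is implemented in C. The function name and the genotype coding (number of B alleles, None for a missing call) are assumptions.

```python
from math import erfc, sqrt


def allelic_chi2(case_genotypes, control_genotypes):
    """Allelic 2x2 chi-square test (1 d.f.) for one SNP.

    Genotypes are coded as the number of B alleles (0, 1 or 2);
    None marks a missing call and is skipped.
    """
    def allele_counts(genotypes):
        called = [g for g in genotypes if g is not None]
        b = sum(called)
        return 2 * len(called) - b, b       # (A alleles, B alleles)

    a_case, b_case = allele_counts(case_genotypes)
    a_ctrl, b_ctrl = allele_counts(control_genotypes)
    row_case, row_ctrl = a_case + b_case, a_ctrl + b_ctrl
    col_a, col_b = a_case + a_ctrl, b_case + b_ctrl
    if min(row_case, row_ctrl, col_a, col_b) == 0:
        raise ValueError("empty row or column in the 2x2 allele table")
    n = row_case + row_ctrl
    chi2 = (n * (a_case * b_ctrl - b_case * a_ctrl) ** 2
            / (row_case * row_ctrl * col_a * col_b))
    return chi2, erfc(sqrt(chi2 / 2))       # statistic and 1-d.f. p-value


chi2, p_value = allelic_chi2([2, 2, 1, 1, None], [0, 0, 1, 1])
```

A genome-wide scan repeats this calculation for every SNP. The permutation variants listed above would additionally shuffle the case/control labels many times, which is why they are far more expensive.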

GenABEL: Haplotype Association Scans


scan.haplo(): haplotype association test using GLM in R library
scan.haplo.2D(): 2-haplotype interaction scan

(haplo.stats package required)
Sliding window strategy
Posterior prob. of haplotypes via EM algorithm
GLM-based score test for haplotype-trait association (Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. 2002. Score tests for association of traits with haplotypes when linkage phase is ambiguous. Am J Hum Genet 70: 425-434.)

GenABEL: GWAS results


scan.gwaa-class: returned from scan.glm, scan.haplo, ccfast, qtscore, emp.ccfast, emp.qtscore

Names:
snpnames: list of names of SNPs tested
P1df: p-values of 1-d.f. (additive or allelic) test for association
P2df: p-values of 2-d.f. (genotypic) test for association
Pc1df: p-values from the 1-d.f. test for association between SNP and trait; the statistic is corrected for possible inflation
effB: effect of the B allele in allelic test
effAB: effect of the AB genotype in genotypic test
effBB: effect of the BB genotype in genotypic test
Map: list of map positions of the SNPs
Chromosome: list of chromosomes the SNPs belong to
Idnames: list of subjects used in analysis
Lambda: inflation factor estimate, as computed using lower portion (say, 90%) of the distribution, and standard error of the estimate
Formula: formula/function used to compute p-values
Family: family of the link function / nature of the test
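The relationship between the inflation factor (Lambda) and the corrected p-values (Pc1df) can be illustrated with a small genomic-control sketch in Python. Note the assumption: this uses the simple median-based estimator, which is a different and cruder estimator than the one the slide describes (based on the lower portion of the distribution), and it does not compute a standard error. All names are hypothetical.

```python
from math import erfc, sqrt
from statistics import median

# Median of the chi-square distribution with 1 d.f.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Estimate the inflation factor and return corrected 1-d.f. p-values."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)                     # common convention: never deflate
    corrected = [x / lam for x in chi2_stats]
    p_corrected = [erfc(sqrt(x / 2)) for x in corrected]
    return lam, corrected, p_corrected


lam, corrected, p_corrected = genomic_control([0.1, 0.9098728, 5.0])
```

With these three toy statistics the estimated inflation factor is about 2, so each test statistic is halved before the p-value is computed.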

GenABEL: Functions

Table & Graphic

descriptives.marker(): table of marker info.
descriptives.trait(): table of trait info.
descriptives.scan(): table of scan results
plot.scan.gwaa(): plot of scan results
plot.check.marker(): plot of marker data (QC etc.)

GenABEL: Computer Efficiency


2000 subjects x 500K chip
Memory: ~3.2 G
Loading time: ~4 min.
SNP summary: ~1 min.
Call ccfast: ~0.5 min.
Call qtscore: ~2 min.
Total: < 10 min.
Permutation test (N = 10,000): 73~120 hrs (3~5 days)
Intel Xeon 2.8GHz processor, SuSE Linux 9.2, R

SNPassoc
An R package to perform whole genome association studies, Juan R. González et al. Bioinformatics, 2007 23(5):654-655

SNPassoc: SNPs-based whole genome association studies
This package carries out the most common analyses when performing whole genome association studies. These analyses include descriptive statistics and exploratory analysis of missing values, calculation of Hardy-Weinberg equilibrium, analysis of association based on generalized linear models (either for quantitative or binary traits), and analysis of multiple SNPs (haplotype and epistasis analysis). Permutation test and related tests (sum statistic and truncated product) are also implemented.
Version: 1.4-9
Depends: R (≥ 2.4.0), haplo.stats, survival, mvtnorm
Date: 2007-Oct-16
Author: Juan R González, Lluís Armengol, Elisabet Guinó, Xavier Solé, and Víctor Moreno
Maintainer: Juan R González <jrgonzalez at imim.es>
License: GPL version 2 or newer
URL: http://www.r-project.org and http://davinci.crg.es/estivill_lab/snpassoc
In views: Genetics
CRAN checks: SNPassoc results

SNPassoc: Data & Summary


setupSNP(data=snp-pheno.table, info=map.table, colSNPs=, sep = "/", ...)

summary(): allele frequencies, percentage of missing values, HWE test

SNPassoc: Association Tests


WGassociation(y~x1+x2, data=, model = (codominant, dominant, recessive, overdominant, log-additive or all), quantitative = , level = 0.95)
scanWGassociation(): only p values
association(): only for selected SNPs, can do stratified, GxE interaction analyses

Results
Summary: a summary table by genes/chromosomes
WGstats: detailed output (case-control numbers, percentages, odds ratios / mean differences, 95% confidence intervals, P-value for the likelihood ratio test of association, and AIC, etc.)
Pvalues: a table of p-values for each genetic model for each SNP
Plot: p values in the -log scale for plot.WGassociation()
Labels: returns the names of the SNPs analyzed
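The genetic models listed for the model argument differ only in how the three genotypes are grouped before the regression is fitted. The Python sketch below shows the usual textbook recodings. It is not SNPassoc code: the names are hypothetical, and it assumes A is the reference allele and B the variant allele.

```python
# How each genetic model groups the genotypes AA, AB and BB.
# Assumption: A is the reference allele, B the variant allele.
MODELS = {
    # Three separate categories, one effect per non-reference genotype.
    "codominant":   {"AA": "AA", "AB": "AB", "BB": "BB"},
    # At least one B allele versus none.
    "dominant":     {"AA": 0, "AB": 1, "BB": 1},
    # Two B alleles versus fewer than two.
    "recessive":    {"AA": 0, "AB": 0, "BB": 1},
    # Heterozygotes versus both homozygotes.
    "overdominant": {"AA": 0, "AB": 1, "BB": 0},
    # Count of B alleles, entered as a linear term.
    "log-additive": {"AA": 0, "AB": 1, "BB": 2},
}


def recode(genotypes, model):
    """Recode a list of genotype labels under the chosen genetic model."""
    table = MODELS[model]
    return [table[g] for g in genotypes]


sample = ["AA", "AB", "BB", "AB"]
recoded = {model: recode(sample, model) for model in MODELS}
```

Fitting the same regression on each recoded predictor gives one p-value per model, which corresponds to the per-model p-value table described above.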

SNPassoc: Multiple-SNP Analysis


SNP-SNP Interaction
interactionPval(): epistasis analysis between all pairs of SNPs (and covariates)

Haplotype Analysis
haplo.glm(): using the R package haplo.stats: association analysis of haplotypes with a response via GLM
haplo.interaction(): interactions between haplotypes (and covariates)

SNPassoc: Computer Efficiency


1000 subjects x 3000 SNPs
5 min.: import data
40 min.: setupSNP()
30 min.: scanWGassociation(), only p values (including permutation test)
Memory usage: 750 MB
