
ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles

Team members: 李祥豪, 謝紹陽, 江建霖
Outline
 Introduction
 Mixed Factors Model
 Analytic Tools
 Summary
 Demo
Introduction
 This task can be addressed by grouping the gene expression patterns of a large number of genes
 Typical microarray data have a fairly small sample size, fewer than 100, whereas the number of genes involved is more than several thousand
Introduction
 One major difficulty in this problem is that the number of samples to be clustered is much smaller than the dimension of the data
 Most clustering techniques, e.g. k-means, Gaussian mixture clustering, hierarchical clustering and so on, are limited by over-learning in this setting
Introduction
 In statistics, overfitting is fitting a statistical model that has too many parameters.
 When the degrees of freedom in parameter selection exceed what the data can determine, the final (fitted) model parameters become arbitrary, which reduces or destroys the ability of the model to generalize beyond the fitting data.
Introduction
 In machine learning, a learning algorithm is usually trained on a set of training examples. Especially when learning is performed for too long or training examples are rare, the learner may adjust to very specific random features of the training data that have no causal relation to the target function.
Introduction
 In both statistics and machine learning, in order to avoid overfitting, it is necessary to use additional techniques (e.g. cross-validation, early stopping, Bayesian priors on parameters or model comparison) that can indicate when further training is not resulting in better generalization.
Mixed Factors Model
 The mixed factors model presents a parsimonious parameterization of the Gaussian mixture model
 Our primary intention is to describe the group structure of the data parsimoniously, based on the factor variables. To this end, we devise mixed factors that follow a G-component Gaussian mixture as
$$p(f_j) = \sum_{g=1}^{G} \alpha_g \, \phi(f_j; \mu_g, \Sigma_g)$$
Mixed Factors Model
 With the mixed factors model, we can potentially avoid overfitting of the Gaussian mixture by choosing an appropriate factor dimension, regardless of the high dimensionality of the data.
 Once the model has been fitted to a given dataset, clustering can be addressed by the Bayes rule.
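A minimal sketch of clustering by the Bayes rule, assuming the mixture has already been fitted: each sample is assigned to the component with the largest posterior probability. The function names, the argument names (`weights`, `means`, `covs`) and the toy numbers are hypothetical and only illustrate the rule; they are not taken from the ArrayCluster software.

```python
import numpy as np

def gaussian_logpdf(f, mean, cov):
    """Log density of a multivariate normal, evaluated at each row of f."""
    q = mean.shape[0]
    diff = f - mean
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of every row from the mean.
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (q * np.log(2.0 * np.pi) + logdet + maha)

def bayes_rule_clusters(f, weights, means, covs):
    """Return posterior membership probabilities and hard cluster labels."""
    log_joint = np.column_stack([
        np.log(w) + gaussian_logpdf(f, m, c)
        for w, m, c in zip(weights, means, covs)
    ])
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    posterior = np.exp(log_joint)
    posterior /= posterior.sum(axis=1, keepdims=True)
    return posterior, posterior.argmax(axis=1)

# Toy factor scores (3 samples, q = 2) and a made-up two-component mixture.
f = np.array([[0.1, -0.2], [4.9, 5.2], [0.3, 0.1]])
weights = np.array([0.5, 0.5])
means = [np.zeros(2), np.full(2, 5.0)]
covs = [np.eye(2), np.eye(2)]

posterior, labels = bayes_rule_clusters(f, weights, means, covs)
print(labels)
```

The posterior matrix also gives a soft view of the clustering: rows close to 0.5/0.5 mark samples whose assignment is uncertain.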
Mixed Factors Model
 To avoid it, we impose orthogonality on the q columns of the factor loading matrix A
 This imposition leads to a canonical representation of the mixed factors model as
$$A^T x_j = f_j + A^T \epsilon_j$$
 From this equation, it follows that the q canonical variates $A^T x_j \in \mathbb{R}^q$ are distributed according to
$$p(A^T x_j) = \sum_{g=1}^{G} \alpha_g \, \phi(A^T x_j; \mu_g, \Sigma_g + \lambda I)$$
Mixed Factors Model
 The canonical variates can be considered as the q modules of genes which are relevant to the existing molecular subtypes.
 This process yields a feature selection that constructs good discriminators for the existing groups as linear combinations of genes.
Analytic Tools
 File format of data file
Analytic Tools
 Model selection based on the BIC curve
Analytic Tools
 In this plot, the horizontal and vertical axes correspond to the factor dimension and the BIC scores, respectively. Each line represents the curve of BIC scores against varying factor dimensions (q) for a fixed number of clusters (G)
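The selection described above can be sketched as a search over (G, q) pairs. This sketch assumes the "smaller is better" form of BIC, -2 log L + k log n; the slides do not state which sign convention ArrayCluster uses. The log-likelihoods and parameter counts below are made-up illustrative numbers, not results from any dataset.

```python
import numpy as np

def bic(log_likelihood, n_params, n_samples):
    """BIC in the 'smaller is better' form: -2 log L + k log n."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

# Hypothetical fitted models, keyed by (G, q):
# (maximized log-likelihood, number of free parameters).
n_samples = 60
candidates = {
    (2, 1): (-520.0, 10),
    (2, 2): (-480.0, 18),
    (3, 2): (-470.0, 26),
    (3, 3): (-465.0, 40),
}

scores = {key: bic(ll, k, n_samples) for key, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

# One line per number of clusters G, as in the BIC curve plot.
for g in sorted({key[0] for key in scores}):
    curve = {q: round(scores[(gg, q)], 1) for (gg, q) in scores if gg == g}
    print(f"G={g}: {curve}")
print("selected (G, q):", best)
```

Under the opposite sign convention the same code applies with `max` in place of `min`.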
Analytic Tools
 File format of mixed_factors
Analytic Tools
 Box plot of the computed factor scores
Analytic Tools
 Clusters are separated by blank lines. All samples in one cluster are ordered according to their degree of membership, measured by the Mahalanobis distance between each sample point and the corresponding group centroid. The calculated distances are indicated next to the sample identifiers
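A minimal sketch of this ordering for a single cluster, assuming the distance is computed on the factor scores with that cluster's own centroid and covariance matrix (the slides do not say which covariance is used). The sample identifiers and numbers are hypothetical.

```python
import numpy as np

def order_by_mahalanobis(scores, centroid, cov):
    """Return sample indices sorted by Mahalanobis distance, and the distances."""
    diff = scores - centroid
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    dist = np.sqrt(d2)
    return np.argsort(dist), dist

# Toy cluster: three samples with q = 2 factor scores.
sample_ids = ["s1", "s2", "s3"]
scores = np.array([[2.0, 0.0], [0.5, 0.5], [0.0, 3.0]])
centroid = np.zeros(2)
cov = np.array([[4.0, 0.0], [0.0, 1.0]])

order, dist = order_by_mahalanobis(scores, centroid, cov)
for i in order:
    # Nearest to the centroid first, with the distance next to the identifier.
    print(sample_ids[i], round(float(dist[i]), 3))
```

Samples near the top of each cluster are its most typical members; samples at the bottom are borderline.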
Analytic Tools
 File format of relevant_set
Analytic Tools
 Relevant module profiling
 After selecting rows (genes) of interest, the enlarged expression image will be displayed in the right window
Analytic Tools
 The ArrayCluster provides users with a usable environment to perform the following tasks:
 Parameter estimation of the mixed factors model: The ArrayCluster computes the maximum likelihood estimators by using the EM algorithm
 Determination of the number of clusters and the factor dimension (the number of group-related modules): These are selected based on the Bayesian information criterion (BIC)
 Clustering based on the Bayes rule
Analytic Tools
 Dimension reduction of data: This task is addressed in the same way as in classical factor analysis. However, the mixed factors analysis explicitly reflects the existing group structure of the original data, while classical factor analysis ignores it during the dimension reduction
 Identification of the group-related genes: In the ArrayCluster, the relevant genes in each module are selected as the top L genes (L is user-specified) with the highest positive (negative) correlation with each element of the factor vector
Analytic Tools
 Identification of the modules: By separating the genes that are positively and negatively correlated with each element of the factor vector, we identify 2q modules in total
 Missing data imputation
 Data preprocessing: The methods include normalization and gene filtering
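The identification of group-related genes and modules can be sketched as follows, assuming Pearson correlation between each gene's expression profile and each factor score across samples. The function name `find_modules`, the `top_l` argument and the toy matrix are hypothetical; the sketch also assumes that no gene or factor is constant across samples.

```python
import numpy as np

def find_modules(expr, factors, top_l):
    """expr: genes x samples, factors: samples x q. Returns 2q gene-index lists."""
    n_samples = expr.shape[1]
    # Standardize across samples so that a scaled dot product is the Pearson correlation.
    ez = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    fz = (factors - factors.mean(axis=0)) / factors.std(axis=0)
    corr = ez @ fz / n_samples  # genes x q
    modules = {}
    for k in range(factors.shape[1]):
        order = np.argsort(corr[:, k])  # ascending correlation
        # Top L most positively and most negatively correlated genes for factor k.
        modules[(k, "+")] = [int(i) for i in order[::-1][:top_l] if corr[i, k] > 0]
        modules[(k, "-")] = [int(i) for i in order[:top_l] if corr[i, k] < 0]
    return modules, corr

# Toy data: 4 genes, 5 samples, q = 1 factor.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # rises with the factor
    [5.0, 4.0, 3.0, 2.0, 1.0],   # falls with the factor
    [1.0, 2.0, 3.0, 4.0, 6.0],   # rises, slightly less linearly
    [3.0, 1.0, 3.0, 1.0, 3.0],   # unrelated to the factor
])
factors = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])

modules, corr = find_modules(expr, factors, top_l=1)
print(modules)
```

With q factors the dictionary holds 2q entries, one positive and one negative module per factor element.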
Summary
 The ArrayCluster visualizes the computed factor scores using the box plot matrix
 This enhances the graphical understanding of the group structure.
 A causal link from the calibrated clusters to biological knowledge can be elucidated through the inspection of the group-related modules.
Summary
 The ArrayCluster displays the expression patterns of these modules.
 The genes in these modules and their visualization give us scope to ask where the calibrated clusters come from.
Thanks for your attention

Next->DEMO
