Professional Documents
Culture Documents
Gene expression
The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org 2413
Y.Wang et al.
Sparse Structure
Gene Expression
Gene Expression Data Matrix X5
Data Matrix X1 Lambda L1 SVD
SVD L1
Weight 5
Weight 1 General Solution 5
General Solution 1
Weight 4
Gene Expression Weight 2
Data Matrix X2
Consistent Structure Gene Expression
SVD L1 Data Matrix X4
Weight 3 L1 SVD
General Solution 2
General Solution 4
Gene Expression
Data Matrix X3
SVD L1
General Solution 3
This paper proposes a novel method to combine a wide variety 2.1 Gene regulatory network
of microarray datasets from different experiments (different Generally, a genetic network can be expressed by a set of non-linear dif-
environmental conditions or perturbations) for inferring GN with ferential equations with each gene expression level as variables
the consideration of sparsity of connections. GNR (Gene Network
x_ t f xt 1
Reconstruction tool), based on LP (linear programming) and a
T n T n n
decomposition procedure, is developed by exploiting the general where x(t) (x1(t), . . . , xn(t)) 2 R , and f (f1, . . . , fn) : R 7! R . xi(t) is
solution form of arbitrary connectivity matrix for GN. The proposed the expression level (mRNA concentrations) of gene i at time instance t.
method GNR theoretically ensures the derivation of the most con- Assume that there are a total of m time points for a given experimental
sistent or invariant network structure with respect to all the used condition from microarray, i.e. t1, . . . , tm. fi is a C1 class non-linear function.
Although gene regulations are often non-linear, most of the existing
datasets, thereby not only significantly alleviating the problem of
approaches for GN inference use linear or additive models owing to unclear
data scarcity but also remarkably improving the reliability. Specif- structures of biological systems and scarcity of data (DHaeseleer et al.,
ically, inferring GN is formulated as an optimization problem with 1999; Gustafsson et al., 2005). From the viewpoint of dynamical systems,
an objective function of forced matching and sparsity terms, so that linear equations can at least capture the main features of the network or the
a consistent and sparse structure that is also considered to be bio- function, in particular around a specific state of the system. The linear form
logically plausible can be expected. An efficient algorithm has been of Equation (1) with appropriate normalization is
developed to solve such a large-scale LP in an iterative manner. x_ t Jxt bt t t1 . . . tm 2
Both simulated examples and experimental data are used to dem-
onstrate the effectiveness of GNR, which also leads to predictions of where J (Jij)nn @f(x)/@x is an n n Jacobian matrix or connectivity
new gene regulation relationships for yeast and Arabidopsis. GNR is matrix, and b (b1, . . . , bn)T 2 Rn is a vector representing the external
stimuli or environment conditions, which is set to zero when there is no
implemented in the Fortran programming language and the software
external input.
is available from http://zhangorup.aporc.org/bioinfo/grninfer/,
http://digbio.missouri.edu/grninfer/ and http://intelligent.eic.osaka-
sandai.ac.jp or upon request from the authors. 2.2 General solution for a single dataset
To overcome the difficulty because of scarce data, many techniques, such as
clustering of genes, SVD, interpolation of data (van Someren et al., 2001)
2 METHODS have been developed. We first adopt the SVD technique to derive a particular
Figure 1 illustrates the schematic of the proposed method. In this section, we solution and further the general solution of Equation (2), using a single time-
first describe a GN as a set of differential equations, and then derive a special course dataset. By rewriting Equation (2), we have
solution of the GN based on singular value decomposition (SVD) for a single
X_ JX B 3
dataset (time-course data). By constructing the general solution of the GN for
each single dataset, we formulate the GN reconstruction problem as an where X (x(t1), . . . , x(tm)), B (b(t1), . . . , b(tm)) and X_
optimization problem which is to find the most consistent network structure x_ t1 . . . x_ tm are all n m matrices with x_ i tj xi tj1 xi tj /
with respect to all the used datasets. The optimal solution can be viewed as a tj1 tj for i 1, . . . , n; j 1, . . . , m. By adopting SVD, i.e.
special solution for the multiple datasets with the minimal connections or XT mn Umn Enn V Tnn , where U is a unitary m n matrix of left
edges. We show that such an optimization problem is equivalent to a linear eigenvectors, E diag(e1, . . . , en) is a diagonal n n matrix containing
programming, and an efficient algorithm is developed to solve such an LP the n eigenvalues and VT is the transpose of a unitary n n matrix of right
based on the decomposition technique. eigenvectors. Without loss of generality, let all non-zero elements of ek
2414
Inferring gene regulatory networks
be listed at the end, i.e. e1 el 0 and el+1, . . . , en 6 0. Then we can Since (9) is a large-scale linear programming (LP) problem owing to a large
have a particular solution with the smallest L2 norm for the connectivity number of variables, we adopt an iterative technique to solve (9). Specif-
matrix J^ J^ij nn as ically, first we fix J to solve N small-size matching subproblems Y, and
then update J based on the results of Y for N subproblems. Such iteration
J^ X_ BUE1 V T 4 continues until converged.
1
where E diag(1/ei) and 1/ei is set to be zero if ei 0. Thus, the network Therefore, we have the following algorithm for deriving gene network.
family, or the general solution of the connectivity matrix J (Jij)nn is STEP-0: Initialization. Obtain all of the particular solution J^k by SVD
J X_ BUE1 V T YV T J^ YV T 5 from (4), and vk from (8). Set initial value Jij(0) 0, Y kij 0 0 and
J kij 0 J^k , and positive values l, . Let iteration index be q and set q 1.
Y (yij) is an n n matrix, where yij is zero if ej 6 0 and is otherwise an STEP-1: Set J k q J k q 1 Y k qV Tk and solve ykij q at iteration
arbitrary scalar coefficient. Solutions of (5) represent all of the possible q by LP for each subproblem from (9) with J (q 1) fixed, i.e. solve
networks that are consistent with the single microarray dataset, depending Y k q ykij qmm of the following subproblem for k 1, . . . , N with
on arbitrary Y. Notice that m + 1 points are required in (5) owing to the J(q 1) given
estimation of X_ .
X
n X
n
forces J sparse owing to L1 norm. l is a positive parameter, which balances vk J *ij J kij Y *k 2
k1
the matching and sparsity terms in the objective function. The variables in vij 12
N
(7) are Jij and all
Pof non-zero ykij . vk is a positive weight coefficient for the p
N k sij vij 13
dataset-k with k1 v 1. Since different datasets may have different
qualities (e.g. different technologies, number of repeats in measurements, By computing their average, we have
etc.), a weight coefficient is used to represent the reliability of each dataset.
Xn Xn
sij
Assume that the number of the repeated experiments for the dataset-k is Nk
s 14
by using the same type of microarray. Then vk can be set as i1 j1
n2
2415
Y.Wang et al.
where xi reflects the expression level of the gene-i and ji(t) -1.00 -3.02 -1.00 -3.01
x3 x3
represents noise for i 1, 2, 3, 4, 5. Clearly, the system is a negative
gene regulation loop with genes 2, 3, 4, 5 repressing genes 1, 2, 3, 4 (c) (d)
respectively, and with gene 1 in turn enhancing gene 5.
To test GNR, we randomly choose the initial condition of the Fig. 2. The simulated example with l 0 and without noise. Arrows and
system and take several points of x as a measured time-course arcs denote activation and repression, respectively. (a) The true network.
(b) Using one dataset. (c) Using two datasets. (d) Using three datasets.
dataset. With three different initial conditions, we obtained three
different datasets with 4, 4 and 3 time points respectively, and
applied GNR to reconstruct the connectivity matrix or the Jacobian x1 x1
1.68
x5 1.76
x5
matrix J. To measure the discrepancies between the true network
and the inferred network with n genes, we adopt the simple criterion -1.99 -1.50 -1.99 -0.20 -1.50
0.23
in Yeung et al. (2002) as E0 to assess the basic recovering ability: x2
0.17
x4 x2 0.22
x4
X
n X
n
E0 : eij 15 -1.00 -3.00
-0.98
-3.01
i1 j1
x3 x3
(a) (b)
where eij takes 1 if jjJ Tij J Rij jj > d, otherwise 0. d is a prescribed
small value for error tolerance related to noise level of the system.
x1 1.96
x5 x1 1.68
x5
J Tij and J Rij are interaction strength from gene-j to gene-i for the true
and inferred networks, respectively. -1.99 -1.50 -1.97 -1.48
E1 : jJ Tij J Rij j 16 x3 x3
i1 j1
(c) (d)
X
n X
n
E2 : J Tij J Rij 2 17 Fig. 3. The simulated example with noise. Arrows and arcs denote activation
i1 j1 and repression respectively. (a) Using one dataset with l 0.0. (b) Using two
datasets with l 0.0. (c) Using three datasets with l 0.0. (d) Using three
which are L1 norm and L2 norm errors respectively for all of datasets with l 0.3.
interaction strengths.
The numerical results are depicted in Figures 2 and 3, which
show the reconstructed networks without and with noises respec- multiple solutions (which are typical) due to the under-determined
tively. As indicated in Figure 2, clearly the more the datasets, the nature, GNR prefers to infer a network with a sparse structure.
more accurate the inferred network. When using one dataset Figure 3 shows the results when noises are added to the dynamics.
(Fig. 2b), it contains a wrong relation between x5 and x3. As two As indicated in Tu et al. (2002), the distribution function of the
datasets are used, the topology of the network becomes correct noise in microarray is more like a Gaussian distribution. Therefore
(Fig. 2c). After using all three datasets, the predicted connectivity we set all of noises ji(t), i 1, 2, 3, 4, 5 obeying normal distribution
matrix, which represents the strengths among gene interactions, is in the simulated example. With gradual increase of noise level to
very close to the true one (Fig. 2c). Such results imply that GNR is N(0, 0.005), the network eventually cannot be correctly inferred
able to infer the solution of the highly under-determined problem even using all three datasets (Fig. 3c) due to the effect of noises. For
in an accurate manner when a sufficient number of datasets (or such an under-determined case, GNR can reconstruct the network
experiments) are available even though each dataset has only a by an additional constraint of sparsity, i.e. introducing a positive
few time points and starts from different initial conditions. In parameter l, as shown in Figure 3d. With such a constraint, gen-
GNR, we also introduce a scalar parameter l to control the sparsity erally there is a better chance to construct a biologically plausible
of the inferred network (see Methods for details). When there are structure (Yeung et al., 2002) but at the expense of accuracy of
2416
Inferring gene regulatory networks
l E0 E1 E2 s
2417
Y.Wang et al.
FET3 SUC2
encodes a member of the F-box family, whose members involved in
AGA2 HLR1
YDR regulating diverse cellular processes including cell cycle transition,
344C
transcriptional regulation and signal transduction.
ASH1
DSE2
YGP1
EBP2
PIR3
4 DISCUSSION
TFA2 MGA1 Microarray gene expression data has increasingly become the com-
AGA1
YPL YKR mon data source that can provide insights into biological processes
158C 077W
at a system-wide level. As indicated in Soinov (2003), although a
CLB6
large amounts of data are increasingly accumulated, one of the
HHF2 PRY1
major problems with microarrays is that data often come from
different platforms, laboratories, etc. It is often difficult to compare
PHO85 DIP5 TSL1
A
or combine results of experiments done by different research groups
for biological inference. In contrast to the conventional methods
(a) which require more time points in a single dataset to infer more
accurate network owing to the dimensionality problem, the main
contribution of this paper is that we developed a methodology to
BAR1 reconstruct GN using multiple datasets from different sources with-
B MFA1
CHA1
CIS3 out normalization among the datasets. In other words, we provide a
general framework to handle the microarray data by fully exploiting
C
POL30
YDR HTB2 HTA1 all available microarray data for a given species, so as to alleviate
214W
the problem of dimensionality or data scarcity. As a byproduct of
YCR D
FIG1 007C the new method, it provides a new way to compare hypotheses
HTA2 ACS1
generated from different datasets, and also a new way to derive
YNL
PRY2 NCE3 058C
PIR1 E a common substructure not from network alignment but from the
raw microarray datasets. In particular, it is very effective to find an
(b) invariant structure when multiple datasets with different conditions
or perturbations are used.
Fig. 5. Partial representation (with two connected sub-networks) of the 64- We have tested our approach on both simulated problems and
link inferred network in yeast based on cell cycle microarray experimental experimental biological data, which verified the efficiency and
datasets. The isolated genes without interaction with others are not shown. effectiveness of the algorithm. Depending on the trade-off parame-
The red arrows in the figure indicate repression while the blue arrows indicate ter l, we can derive either a global structure with dense connection
activation. The circles in the same color indicate the same biological function.
for a small l or local substructure with sparse connection for a
large l. Furthermore, the role of parameter l in the inference algo-
root and shoot experiments. Supplementary Table A6 lists the rithm is discussed and tested by comparing the inferred network
experimental details and the time points used. We used the log structures for different ls. Also we discuss how to specify the proper
ratios of the expression values for a treatment condition against value and the searching strategy in the parameter space of l (see the
the mock condition. We narrowed down the list of genes to 226 details in Supplementary Material).
genes for the root experiments and 246 genes for the shoot There is an important assumption for the proposed method in this
experiments based on two-fold change (either up or down) in at paper, i.e. the structure of the regulatory network is stationary, and
least 70% of the ratios of a gene. This list represents the most does not rewire under the environmental conditions for those
statistically significant genes differentially expressed under stress different datasets. This means that the change of environmental
in root and shoot based on all experiments. conditions alters the level of gene expression instead of the network
GNR was applied to the 226 genes with the above nine datasets. structure. Another assumption is that high resolution time-course
Using different thresholds, we can predict various networks microarray datasets are required so as to accurately infer the net-
with different edge density, which are consistent with respect to work structure because a genetic network is expressed by a set of
all datasets. Figure 6 represents a 35-link sub-network in shoots. differential equations with each gene expression level as a variable
The network shows that the genes AT1G56600 and ATERF6 shown in Equation (1). Here high resolution data mean high quality
(AT4G17490) control the neighboring genes by either activating time-course microarray data which are expected to capture the
or suppressing them. We found that our network relates to some dynamic behavior of the gene regulatory network.
knowledge while predicting novel regulations. ATERF6 is a The linear differential equation model in this paper is used to
member of the ERF (ethylene response factor) subfamily B-3 of identify gene regulation between RNA transcripts (Gardner and
ERF/AP2 transcription factor family (Fujimoto et al., 2000). It is Faith, 2005). An advantage of such a strategy is that the model
predicted to activate three genes encoding known or putative tran- can implicitly capture regulatory mechanisms at the protein and
scription factors, i.e. AT2G12940, AT3G49760 and AT2G40750. metabolite levels that are not physically measured. That is, it is
AT2G12940 is similar to transcription factor VSF-1; AT3G49760 not restricted to describe only transcription factor/DNA inter-
is a Bzip transcription factor; AT2G40750 is a member of WRKY actions. By construction, the inferred model may accurately reflect
transcription factor family. Other genes have functions related to a physical interaction if the regulator transcripts encode the tran-
stress response. AT1G52560 is a heat shock protein. AT1G36030 scription factors that directly regulate transcription. On the other
2418
Inferring gene regulatory networks
Fig. 6. Inferred network in A.thaliana based on 35 links generated from stress response datasets in shoots. The red arrows represent activation while the blue
arrows represent suppression. Putative transcription factors are shown in purple. F-box family proteins are shown in yellow.
hand, the implicit description of hidden regulatory factors by this direct and indirect regulations (i.e. via hidden variables). Typically
approach may lead to prediction errors. Generally, the mRNA one can interpret an edge in a GN model as the net effect if the gene
levels measured in a microarray experiment are the results of from the source is deleted. For example, if an arrow pointing from
a variety of complex events including gene transcription and gene A to gene B for activation, it is expected that deleting gene A
mRNA degradation (Gardner and Faith, 2005). With such events, will lead to an increased expression of gene B. Notice that the
dynamic Bayesian networks can be used to derive the regulations inferred results by GNR are only valid on the assumption that
among biomolecules (Nachman et al., 2004; Rangel et al., 2004; the dynamics of the system can be captured by the time intervals
Beal et al., 2005; Li and Zhan, 2006). between the data points. Nevertheless, our predicted regulatory
To examine causal relation among genes, a major source of errors network is testable through a comparison in microarray data
comes from the noises of the gene expression data intrinsic to between wild type and mutant with specific deletion.
microarray technologies (Thattai and van Oudenaarden, 2001). Currently, GNR is aimed to infer the consistent structure from a
To reduce the defect of unreliable data, GNR is able to assess variety of datasets but for the same species or organism. With the
the quality of microarray datasets by comparing their inferred similar mechanism, GNR can be extended to identify the conserved
results, and remove the inconsistent dataset using a clustering network patterns or motifs (Kelly et al., 2003) from the datasets of
method according to their degree of inconsistency. As a result, either the same species or different species, by adjusting the
GNR can alleviate the impact of noises to improve the prediction parameter l, i.e. a higher l results in a more consistent or conserved
accuracy. In addition, GNR is also effective for reducing the effects network.
of stochastic fluctuations by introducing the sparsity leverage l and
combining the several datasets together (see the details in the Sup- ACKNOWLEDGEMENTS
plementary Material). Depending on the prior information of the
data (e.g. reliability of experiments or number of experiments), we The authors are grateful to the associate editor and anonymous refer-
can also allocate different weights for the datasets to maximally ees for comments and helping to improve the earlier version. This
utilize the information of reliable gene expression data. work is partly supported by Grant sponsor: project Bioinformatics,
Our method also has some limitations owing to the nature of Bureau of Basic Science, CAS. The effort of T.J. and D.X. was
microarray gene expression data, like other existing methods for supported by a USDA grant CSREES 200425,604-14,708.
GN inference. In particular, although GNR can provide a relation- Conflict of Interest: none declared.
ship in which the expression of one gene can lead to an increased
(or decreased) expression of another gene, such a relationship does
not show the exact mechanism. A predicted regulatory relation-
REFERENCES
ship does not always mean genetic regulation by a transcriptional Akutsu,T. et al. (2000) Inferring qualitative relations in genetic networks and metabolic
pathways. Bioinformatics, 16, 727734.
factor. Some regulation can be at the post-transcriptional or post-
Alonso-Monge,R. et al. (2001) Hyperosmotic stress response and regulation of cell
translational level, which are often not reflected in mRNA expres- wall integrity in saccharomyces cerevisiae share common functional aspects.
sion levels detected by microarrays. Therefore, there is a need for Mol. Microbiol., 41, 717730.
integration with other information sources to derive regulatory net- Basso,K. et al. (2005) Reverse engineering of regulatory networks in human B cells.
works in an accurate manner. In other cases, the transcriptional Nat. Genet., 37, 382390.
Beal,M.J. et al. (2005) A Bayesian approach to reconstructing genetic regulatory
factors for direct regulation are not selected for GN construction networks with hidden factors. Bioinformatics, 21, 349356.
due to their low expression levels or statistically insignificant Chen,L. and Aihara,K. (2001) Stability and bifurcation analysis of differential-
changes. Hence, the GN models that we predicted include both difference-algebraic equations. IEEE Trans. Circuits Syst. I, 48, 308326.
2419
Y.Wang et al.
Chen,L. and Aihara,K. (2002) Stability of genetic regulatory networks with time delay. Kelley,B.P. et al. (2003) Conserved pathways within bacteria and yeast as revealed
IEEE Trans. Circuits Syst. I, 49, 602608. by global protein network alignment. Proc. Natl Acad. Sci. USA, 100,
Destruelle,M. et al. (1994) Identification and characterization of a novel yeast gene: 1139411399.
the ygp1 gene product is a highly glycosylated secreted protein that is synthesized Kornberg,R.D. (1998) Mechanism and regulation of yeast rna polymerase ii
in response to nutrient limitation. Mol. Cell Biol., 14, 27402754. transcription. Cold Spring Harb. Symp. Quant. Biol., 63, 229232.
Dewey,T.G. and Galas,D.J. (2001) Dynamic models of gene expression and classi- Levine,M. and Davidson,E.H. (2005) Gene regulatory networks for development. Proc.
fication. Funct. Integr. Genomics, 1, 269278. Natl Acad. Sci. USA, 102, 49364942.
Dhaeseleer,P. et al. (2000) Genetic network inference: from co-expression clustering Li,H. and Zhan,M. (2006) Systematic intervention of transcription for identifying
to reverse engineering. Bioinformatics, 16, 707726. network response to disease and cellular phenotypes. Bioinformatics, 22, 96102.
DHaeseleer,P., Wen,X., Fuhrman,S. and Somogyi,R. (1999) Linear modeling of Mrsa,V. and Tanner,W. (1999) Role of naoh-extractable cell wall proteins ccw5p,
mRNA expression levels during cns development. In Altman,R.B., Dunker,A.K., ccw6p, ccw7p and ccw8p (members of the pir protein family) in stability of the
Hunter,L., Klein,T.E. and Lauderdaule,K. (eds), Pacific Symposium on Biocom- Saccharomyces cerevisiae cell wall. Yeast, 15, 813820.
puting Vol. 4, pp. 4152. Nachman,I. et al. (2004) Inferring quantitative models of regulatory networks from
Doolin,M.T. et al. (2001) Overlapping and distinct roles of the duplicated yeast expression data. Bioinformatics, 20 (Suppl. 1), i248i256.
transcription factors ace2p and swi5p. Mol. Microbiol., 40, 422432. Rangel,C. et al. (2004) Modeling T-cell activation using gene expression profiling and
Fujimoto,S.Y. et al. (2000) Arabidopsis ethylene-responsive element binding factors state-space models. Bioinformatics, 20, 13611372.
act as transcriptional activators or repressors of gcc box-mediated gene expression. Soinov,L. (2003) Supervised classification for gene network reconstruction. Biochem.
Plant Cell, 12, 393404. Soc. Trans., 31, 14971502.
Gardner,T.S. and Faith,J. (2005) Reverse-engineering transcription control networks. Tegner,J. et al. (2003) Reverse engineering gene networks: integrating genetic
Phys. Life Rev., 2, 6588. perturbations with dynamical modeling. Proc. Natl Acad. Sci. USA, 100,
Gustafsson,M. et al. (2005) Constructing and analyzing a large-scale gene-to-gene 59445949.
regulatory network-lasso-constrained inference and biological validation. IEEE/ Thattai,M. and van Oudenaarden,A. (2001) Intrinsic noise in gene regulatory networks.
ACM Trans. Comput. Biol. Bioinform., 2, 254261. Proc. Natl Acad. Sci. USA, 98, 86148619.
H,D.J. (2002) Modeling and simulation of genetic regulatory systems: a literature Tu,Y. et al. (2002) Statistics quantitative noise analysis for gene expression
review. J. Comput. Biol., 9, 67103. microarray experiments. Proc. Natl Acad. Sci. USA, 99, 1403114036.
Hartemink,A.J. (2005) Reverse engineering gene regulatory networks. Nat. van Someren,E.P. et al. (2001) Robust genetic network modeling by adding noisy data.
Biotechnol., 23, 554555. In Proceedings of 2001 IEEE-EURASIP Workshop on Nonlinear Signal and Image
Holter,N.S. et al. (2001) Dynamic modeling of gene expression data. Proc. Natl Acad. Processing.
Sci. USA, 98, 16931698. Yeung,M.K.S. et al. (2002) Reverse engineering gene networks using singular value
Hughes,T.R. et al. (2000) Functional discovery via a compendium of expression decomposition and robust regression. Proc. Natl Acad. Sci. USA, 99, 61636168.
profiles. Cell, 102, 109126. Zak,D.E. et al. (2003) Importance of input perturbations and stochastic gene
Husmeier,D. (2003) Sensitivity and specificity of inferring genetic regulatory expression in the reverse engineering of genetic regulatory networks: insights
interactions from microarray experiments with dynamic Bayesian networks. from an identifiability analysis of an in silico network. Genome Res., 13,
Bioinformatics, 19, 22712282. 23962405.
2420