Methods
Inferring gene regulatory networks from gene expression data at the whole-genome level is still an arduous challenge, especially in higher organisms, where the number of genes is large but the number of experimental samples is small. The accuracy of current methods at genome scale is reported to drop significantly from Escherichia coli to Saccharomyces cerevisiae due to the increase in the number of genes, which limits the applicability of current methods to more complex genomes such as human and mouse. The least absolute shrinkage and selection operator (LASSO) is widely used for gene regulatory network inference from gene expression profiles, but its accuracy on large genomes is not satisfactory. In this study, we apply two extensions of LASSO, the L0 and L1/2 regularization models, to infer gene regulatory networks from both high-throughput gene expression data and transcription factor binding data in mouse embryonic stem cells (mESCs). We find that both the L0 and L1/2 regularization models significantly outperform LASSO in network inference. Incorporating interactions between transcription factors and their targets remarkably improves the prediction accuracy. The current study demonstrates the efficiency and applicability of these two models for gene regulatory network inference from integrative omics data in large genomes. Their application will help biologists study the gene regulation of higher model organisms at a genome-wide scale.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction
Inferring gene regulatory networks from high-throughput genome-wide data is still a major challenge in systems biology. The transcriptome, measured by microarray or RNA-seq, describes the expression of all the genes in a genome.
Various methods have been developed to infer gene regulatory networks from such transcriptome data ( reviewed in [ 1 ] ) .
Even though most of the methods perform very well on smaller genomes such as Escherichia coli , very few of them can accurately handle larger genomes , such as human and mouse .
In large genomes, the complexity of the gene regulatory system dramatically increases.
⇑ Corresponding author at: Department of Biochemistry, LKS Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China. Fax: +852 2855 1254. E-mail address: junwen@hku.hk (J. Wang).
1 These authors contributed equally to this work.
Thousands of regulators, such as transcription factors (TFs), communicate in different ways to regulate tens of thousands of target genes in various tissues or biological processes.
However , for a specific gene , only a few key TFs collaborate and control its expression change in a specific cell type or developmental stage .
Thus , the gene regulatory network inference for such large genomes becomes a sparse optimization problem , which is to search a small number of key TFs from a pool of thousands of TFs for tens of thousands of targets based on the dependencies between the expression of TFs and the targets .
The sparsity levels of the gene regulatory networks in large genomes are much higher than those in small genomes .
One of the most popular approaches is to estimate the pair-wise correlation between genes using metrics like Pearson 's correlation coefficient , Spearman 's correlation coefficient , mutual information , partial correlation coefficient or expression alignment , and then filter for causal relationships to infer TF-target pairs [ 2 -- 6 ] .
For example, ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) has been shown to achieve a low error rate and to scale up to mammalian systems [2,7,8].
Another popular approach is to use the regression-based models to select TFs with target gene-specific sparse linear-regression [ 1,9,10 ] .
Among regression-based models, the least absolute shrinkage and selection operator (LASSO) is most commonly used for gene regulatory network inference.
In the field of optimization , LASSO is also called the L1 regularization model [ 11 ] .
Many efficient algorithms, such as ISTA (Iterative Soft Thresholding Algorithm), LAR (Least Angle Regression) and YALL1 (Your ALgorithms for L1), have been developed for this model, and some of them have been applied to large datasets [11–20].
Recently developed methods , such as NARROMI , take advantages of both correlation-and regression-based approaches and achieve improved accuracy [ 10 ] .
However , both approaches suffer from several limitations when dealing with large genomes .
Marbach et al. [1] have shown that all 35 methods assessed, covering both approaches, achieve much lower precision for gene regulatory network inference in Saccharomyces cerevisiae than in E. coli. Because the genome of S. cerevisiae is larger than that of E. coli and its gene regulatory network is much more complex, these methods achieve only about 2.5% area under the precision-recall curve (AUPR) in S. cerevisiae, which is close to random.
The correlation-based approaches must compute correlations for all gene pairs, so their computational cost grows quadratically with the number of genes. The sparsity level of gene regulatory networks in large genomes is much higher than in small genomes; hence false positives also increase remarkably when a similar correlation cutoff is used to predict the networks [2,7,21].
The accuracy of LASSO in large-scale problems with high sparsity is also reported to be not satisfactory [ 22 -- 24 ] .
To perform well, current methods require a large number of transcriptome profiles, usually at least as many as the number of regulators [1].
However , in most biological studies , sample size is much smaller than the number of regulators due to high experimental cost .
The limitation in sample size impedes performance of both approaches .
In correlation-based methods, limited sample size makes the correlations between genes sensitive to noise, and thus a highly correlated gene pair need not imply a true regulatory relationship.
In large genomes , it is even more difficult to infer true regulatory links from a larger pool of highly correlated gene pairs with smaller sample size .
In LASSO-based methods , when sample size is smaller than the number of regulators , multiple solutions are available , which makes it difficult to determine which solution is more biologically meaningful .
To counter the small-sample-size problem, heterogeneous datasets from different tissues, biological processes or experimental conditions are usually pooled together before modeling, which increases prediction accuracy.
However , the inferred network will lose its cell-type or condition specificity [ 1,25 ] .
Further , heterogeneous datasets may weaken dependencies between the expression of TFs and their targets , since gene regulatory network topology varies among different biological processes , and one TF may regulate different gene sets in different cell states .
Chromatin immunoprecipitation (ChIP) coupled with high-throughput techniques such as sequencing or microarray (ChIP-seq/chip, hereafter referred to as ChIP-X), also called the cistrome, has also been widely used to construct gene regulatory networks in recent years [26–28]. However, TF binding sites detected by ChIP-X reveal only the genomic positions of TF binding; they cannot tell which gene is the target, or whether and how the binding affects the transcription of its targets. The recently developed web server ChIP-Array, which integrates ChIP-X and transcriptome data to construct gene regulatory networks, takes advantage of both technologies and provides high accuracy; however, it handles only single-TF-centered networks and requires the transcriptome data to be generated under perturbation of the same TF as in the ChIP-X data.
Genome-wide gene regulatory networks, in contrast, contain multiple TFs, but only a limited number of TFs have both types of omics data.
In summary , although several methods have been proposed to infer genome-wide gene regulation networks from transcriptome profiles either alone [ 1,29 ] , or in combination with predicted TF binding data [ 30 ] , they are all limited by high computation cost and low accuracy in large genomes .
To tackle these limitations, here we propose to integrate ChIP-X data with transcriptome profiles for gene regulatory network inference, and to use Lp (p < 1) regularization models to improve upon the accuracy of LASSO (hereafter referred to as the L1 regularization model). Lp (p < 1) regularization models are reported to achieve sparser and more accurate solutions, even from a small number of samples [22,23,31]. However, the Lp (p < 1) regularization model suffers from non-convexity, which makes it difficult in general to design efficient algorithms for its solution.
Fortunately , the iterative hard thresholding algorithm [ 32 ] and iterative half thresholding algorithm [ 23 ] have been developed to solve the L0 and L1/2 regularization models respectively , but they have not been applied to gene regulatory network inference .
Due to their low computational cost and fast convergence rates, these algorithms are well suited to the gene regulatory network inference problem.
Thus in this study , we apply the L0 and L1/2 regularization models to infer gene regulatory networks from ChIP-X and transcriptome data in mouse embryonic stem cells ( mESCs ) .
We compare their performance with the L1 regularization model and find that ChIP-X data dramatically improve the accuracy of all three models, and that the L0 and L1/2 regularization models significantly outperform the L1 regularization model in the presence of ChIP-X data. The proposed models enable biologists to efficiently infer gene regulatory networks in higher model organisms using integrative omics data.
All the datasets and codes are available at : http://jjwanglab.org/LpRGNI .
2. Materials and Methods
2.1. Lp regularization models
The regulatory relationship between TFs and targets can be represented approximately by a linear system (Fig. 1A):

$$AX = B + \varepsilon,$$

where $A \in \mathbb{R}^{m \times r}$ denotes the expression matrix of candidate TFs, $B \in \mathbb{R}^{m \times n}$ the expression matrix of all target genes, $\varepsilon \in \mathbb{R}^{m \times n}$ an error matrix, and $X \in \mathbb{R}^{r \times n}$ the regulation matrix describing the regulatory relationships between all TFs and the targets; $m$ denotes the number of samples, $r$ the number of factors and $n$ the number of target genes.
In gene regulatory network inference, for each target gene $j$ we want to minimize the difference between $AX_{:,j}$ and $B_{:,j}$ using a small number of selected TFs, which is a sparse optimization problem:

$$\min_{X_{:,j}} \|AX_{:,j} - B_{:,j}\|_2 \quad \text{s.t.} \quad \|X_{:,j}\|_0 \le K,$$

where $\|\cdot\|_2$ denotes the Euclidean norm, $\|X_{:,j}\|_2 = \sqrt{\sum_{i=1}^{r} X_{i,j}^2}$, and $\|X_{:,j}\|_0$ denotes the number of non-zero elements in $X_{:,j}$.
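As a concrete illustration of this formulation, the toy instance below builds $A$, $X$ and $B$ with made-up sizes and a planted 3-TF support per target (all dimensions and values here are illustrative, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n = 10, 30, 5                      # samples, candidate TFs, targets (toy sizes)
A = rng.normal(size=(m, r))              # expression matrix of candidate TFs
X_true = np.zeros((r, n))                # regulation matrix, sparse by construction
X_true[:3, :] = rng.normal(size=(3, n))  # only 3 TFs regulate each target here
B = A @ X_true + 0.01 * rng.normal(size=(m, n))  # AX = B + noise

# For target gene j, the constrained problem seeks X[:, j] with at most K
# non-zero entries while keeping ||A X[:, j] - B[:, j]||_2 small.
j = 0
sparsity = int(np.count_nonzero(X_true[:, j]))               # ||X_true[:, j]||_0 = 3
residual = float(np.linalg.norm(A @ X_true[:, j] - B[:, j]))  # only the noise remains
```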
A smaller $\|X_{:,j}\|_0$ means a higher sparsity of $X_{:,j}$; it indicates how many TFs are found to regulate target gene $j$. For this problem, a popular and practical technique is to transform the sparse optimization problem into an unconstrained optimization problem, called a regularization problem (Fig. 1B). For example, given gene $j$ and its expression profile $B_{:,j}$, the L0 regularization model minimizes the difference between $AX_{:,j}$ and $B_{:,j}$ while maximizing the sparsity of $X_{:,j}$:

$$\min_{X_{:,j} \in \mathbb{R}^r} \|AX_{:,j} - B_{:,j}\|_2^2 + \lambda \|X_{:,j}\|_0,$$

where $\lambda > 0$ is the regularization parameter, providing a tradeoff between accuracy and sparsity.
Even though the L0 regularization model is close to the original problem we want to solve , it is NP-hard to achieve a global optimal solution [ 33 ] .
Thus the L1 regularization model (LASSO), a popular relaxation of the L0 regularization model, is introduced to solve the following problem:

$$\min_{X_{:,j} \in \mathbb{R}^r} \|AX_{:,j} - B_{:,j}\|_2^2 + \lambda \|X_{:,j}\|_1,$$

where $\|X_{:,j}\|_1 = \sum_{i=1}^{r} |X_{i,j}|$.
However , in many practical applications , the solutions yielded from the L1 regularization model are less sparse than those of the L0 regularization model [ 22 -- 24 ] .
Recently , the L1/2 regularization model is proposed and proved to perform better than the L1 regularization model [ 23 ] .
This model is described as

$$\min_{X_{:,j} \in \mathbb{R}^r} \|AX_{:,j} - B_{:,j}\|_2^2 + \lambda \|X_{:,j}\|_{1/2}^{1/2},$$

where $\|X_{:,j}\|_{1/2} = \left(\sum_{i=1}^{r} \sqrt{|X_{i,j}|}\right)^2$.
Neither L0 nor L1/2 regularization model has been used in gene regulatory network inference .
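To see why the choice of penalty matters, the three regularizers can be compared on two illustrative vectors with the same L1 norm; the L0 and L1/2 penalties discriminate between many small coefficients and one larger coefficient, while L1 cannot (a sketch with made-up values):

```python
import numpy as np

def l0(x):      # ||x||_0: number of non-zero entries
    return int(np.count_nonzero(x))

def l1(x):      # ||x||_1: sum of absolute values
    return float(np.sum(np.abs(x)))

def l_half(x):  # ||x||_{1/2}^{1/2}: sum of square roots of absolute values
    return float(np.sum(np.sqrt(np.abs(x))))

dense_small = np.full(10, 0.01)            # ten tiny coefficients
sparse_big = np.array([0.1] + [0.0] * 9)   # one coefficient, same L1 norm

# L1 assigns both vectors the same penalty (0.1), so it cannot prefer the
# sparse one; L1/2 penalizes the dense vector 1.0 vs ~0.316, and L0 10 vs 1.
penalties = {
    "l0": (l0(dense_small), l0(sparse_big)),
    "l1": (l1(dense_small), l1(sparse_big)),
    "l1/2": (l_half(dense_small), l_half(sparse_big)),
}
```

This is the intuition behind the sparser solutions of the L0 and L1/2 models: small non-zero coefficients are relatively more expensive, so they are driven to exactly zero.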
2.2. Algorithms
In this study, we apply iterative thresholding algorithms to solve the Lp (p = 1, 1/2, 0) regularization models for gene regulatory network inference from omics data.
The iterative thresholding algorithm is the most widely studied class of the first-order methods for the sparse optimization problem .
It is convergent and of very low computational complexity [ 11,23,32 ] .
Benefiting from its simple formulation and low storage requirements, it is very efficient and applicable even to large-scale sparse optimization problems. In particular, the iterative soft thresholding algorithm was introduced and developed to solve the L1 regularization problem, the iterative hard thresholding algorithm was proposed to solve the L0 regularization problem, and the iterative half thresholding algorithm was designed for the L1/2 regularization problem.
Briefly, in each iteration the three algorithms first take the same gradient step,

$$Z^k = X^k + \nu A^{T}(B - AX^{k}),$$

and then apply their respective thresholding operators to $Z^k$; here the superscripts of $X$ and $Z$ denote the iteration number, $\nu$ denotes the stepsize, which we always choose as 1/2, and $\mathrm{sign}(\cdot)$ denotes the sign function used in the soft thresholding operator. The solutions of the three algorithms obtained after 200 iterations, except where indicated otherwise, are used for further evaluation and comparison.
For all three algorithms, the regularization parameter $\lambda$ is updated iteratively so as to maintain the sparsity of $X^k$; for details, one can refer to [23,32].
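To make the iteration concrete, here is a minimal sketch of iterative hard thresholding for a single target gene, using the fixed-sparsity (top-K) form of the hard thresholding operator that matches the fixed $\|X_{:,j}\|_0$ setting above. The toy data and the stepsize $1/\|A\|_2^2$ (chosen so the gradient step converges on unnormalized data) are assumptions of this sketch, not the paper's exact setup:

```python
import numpy as np

def hard_threshold_topk(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def iht(A, b, k, iters=200):
    """Iterative hard thresholding: gradient step, then top-k hard thresholding."""
    nu = 1.0 / np.linalg.norm(A, 2) ** 2   # stepsize (assumption; the paper uses 1/2)
    x = np.zeros(A.shape[1])               # start from 0 when no prior is given
    for _ in range(iters):
        z = x + nu * A.T @ (b - A @ x)     # shared gradient step
        x = hard_threshold_topk(z, k)      # model-specific thresholding
    return x

# Toy recovery problem: 3 "TFs" out of 20 drive the "target" profile b
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [3.0, -2.0, 2.5]
b = A @ x_true                             # noiseless toy observations
x_hat = iht(A, b, k=3)                     # recovers the planted support here
```

The soft and half thresholding algorithms differ only in the operator applied to `z`; their closed forms are given in [11,23].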
Since the number of TFs that regulate a particular gene (the sparsity of $X_{:,j}$) is usually unknown, and biologists need to select a small number of TFs for experimental verification, we make this parameter adjustable by the user.
For further evaluation and comparison of the three models, we test a series of factor numbers ($\|X_{:,j}\|_0$) from 1 to 100 (sparsity level 0.1%–10%). In each test, we fix the same sparsity ($\|X_{:,j}\|_0$) for all three models. We assume that TFs detected at a higher sparsity are more important than those detected at a lower sparsity. Thus we score each factor according to the highest sparsity at which it obtains a non-zero entry in the final solution (Section 2.5).
2.3. Data collections
Transcriptome data were downloaded from Gene Expression Omnibus ( GEO ) .
245 experiments under perturbations in mESC were collected from three papers ( Table 1 ) [ 34 -- 36 ] .
Each experiment produced transcriptome data with or without overexpression or knockdown of a gene , two replicates for control and two replicates for treatment .
Gene expression fold changes between control and TF perturbation samples of 19978 genes in all experiments were log2 transformed and formed matrix B ( Fig. 1A ) .
Candidate regulators, including TFs, mediators, co-factors, chromatin modifiers and repressors, were collected from four TF databases (TRANSFAC, JASPAR, UniPROBE and TFCat) as well as from the literature.
Matrix A was made up of the expression profiles of 939 regulators ( Fig. 1A ) .
A literature-based golden standard TF-target pair set from biological studies ( Fig. 1C ) , including 97 TF-target interactions between 23 TFs and 48 target genes ( low-throughput golden standard ) , was downloaded from iScMiD ( Integrated Stem Cell Molecular Interactions Database ) .
Another golden standard mESC network was constructed from high-throughput ChIP-X and transcriptome data under TF perturbation ( high-throughput golden standard ) .
28 TFs with evidence from both high-throughput ChIP-X and transcriptome data under perturbation were collected from the literature (Tables 2 and 3).
TF binding sites were called with MACS [ 37 ] for ChIP-seq data and Cisgenome [ 38 ] for ChIP-chip data .
The distance cutoff between a TF binding site and a potential target gene was set to 10 kbp.
Differentially expressed genes under TF perturbation were defined as top 5 % up-regulated and top 5 % down-regulated genes , whose expression changes were significant with p-value < 0.05 .
A single TF-centered network was constructed for each TF by ChIP-Array [39] from both types of high-throughput data. The direct targets of all TFs were combined into a golden standard mESC network, which contains 40006 links between 13092 nodes (Fig. 1C). Essentially, each target in this network is supported by cell-type-specific binding sites on its promoter and by an expression change in the perturbation experiment of the TF, and is thus generally accepted as a true target.
2.4. Integration of ChIP-X and transcriptome data
ChIP-X identifies in vivo active and cell-specific TF binding sites of a particular TF .
A gene with an active TF binding site around its promoter is considered to be a potential target of the TF .
Thus , ChIP-X data provides possible direct TF-target connections and may help regularization models to approximate the biologically meaningful solutions for the whole genome .
Since matrix X describes the connections between TFs and targets , the TF-target connections defined by ChIP-X data were converted into an initial matrix X0 ( Fig. 1A and Table 2 ) .
Without ChIP-X data as a prior, the initial $X^0$ was artificially set to 0. When integrating ChIP-X data, if TF $i$ has a binding site within 10 kbp of the promoter of gene $j$ (except where indicated otherwise), the Pearson's correlation coefficient (PCC) between the expression profiles of TF $i$ and gene $j$ was calculated and assigned to $X^0_{i,j}$. The PCC can be positive or negative, indicating that the TF may activate or repress the expression of the target gene.
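The construction of the initial matrix $X^0$ from ChIP-X-derived binding pairs can be sketched as follows; the binding pairs, dimensions and expression values are all made up for illustration:

```python
import numpy as np

def pearson(u, v):
    """Pearson's correlation coefficient between two expression profiles."""
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

m, r, n = 20, 4, 3                         # samples, TFs, targets (toy sizes)
rng = np.random.default_rng(2)
tf_expr = rng.normal(size=(m, r))          # columns of A for the TFs
# Targets 0-2 track TFs 0-2 (positively, negatively, weakly), plus noise
target_expr = tf_expr[:, :3] * np.array([1.0, -0.8, 0.5]) \
    + 0.1 * rng.normal(size=(m, n))

# (TF i, target j) pairs with a binding site within the distance cutoff
binding_pairs = [(0, 0), (1, 1), (3, 2)]   # hypothetical ChIP-X calls

X0 = np.zeros((r, n))                      # no ChIP-X prior: all zeros
for i, j in binding_pairs:                 # PCC only where a binding site exists
    X0[i, j] = pearson(tf_expr[:, i], target_expr[:, j])
```

Entries of $X^0$ stay zero wherever ChIP-X reports no nearby binding site, so the prior is itself sparse.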
2.5. Evaluations
The area under the curve ( AUC ) of a receiver operating characteristic ( ROC ) curve is widely applied as an important index of the overall classification performance of an algorithm .
We applied AUC to evaluate the performance of these three regularization models .
For each pair of TF $i$ and target $j$, if $X_{i,j}$ is non-zero in the final solution matrix $X$, the TF is regarded as a potential regulator of the target. A series of factor numbers ($\|X_{:,j}\|_0$) from 1 to 100 was tested for each target. We assume that TFs detected at a higher sparsity (smaller $\|X_{:,j}\|_0$) are more important than those detected at a lower sparsity (larger $\|X_{:,j}\|_0$).
Thus, in the process of calculating the AUC, a score $S_{i,j}$ was used as the predictor for TF $i$ on target $j$:

$$S_{i,j} = \begin{cases} \max\left(1/\|X_{:,j}\|_0\right), & X_{i,j} \neq 0, \\ 0, & X_{i,j} = 0, \end{cases}$$

where the maximum is taken over the tested sparsity levels at which $X_{i,j}$ is non-zero. Either the high-throughput or the low-throughput evaluation dataset was used as the golden standard.
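A small sketch of this scoring rule (the dictionary of per-sparsity solutions and its values are hypothetical):

```python
def score_tf(solutions_by_k, i, j):
    """S_{i,j}: max of 1/k over tested sparsity levels k at which entry (i, j)
    of the solution matrix is non-zero; 0 if the TF is never selected."""
    hits = [1.0 / k for k, X in solutions_by_k.items() if X[i][j] != 0]
    return max(hits) if hits else 0.0

# Hypothetical solutions for one target at sparsity levels k = 1, 2, 5:
# TF 0 is picked from k = 2 onward, TF 1 only at k = 5, TF 2 never.
solutions = {
    1: [[0.0], [0.0], [0.0]],
    2: [[3.0], [0.0], [0.0]],
    5: [[2.5], [0.4], [0.0]],
}
scores = [score_tf(solutions, i, 0) for i in range(3)]  # [0.5, 0.2, 0.0]
```

A TF that survives at a stricter sparsity level thus receives a higher score, which is then ranked against the golden standard to compute the AUC.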
Furthermore, a 1000-fold bootstrap was used to test the stability of the AUCs, yielding 1000 AUCs for each model. We then used the Wilcoxon test to compare the performance of the three models.
3. Results and discussion
Since current methods are mostly designed for transcriptome data, we first inferred the gene regulatory network using current methods from transcriptome data alone.
To retain the cell-type specificity of the inferred gene regulatory networks , we incorporated data from only mESC .
Two TF-target datasets, from high-throughput and low-throughput studies respectively, are used as golden standards to evaluate accuracy.
As expected, their AUCs are all close to random on both evaluation datasets, because the number of samples is much smaller than the number of regulators in our mESC dataset (Fig. 2).
Thus we incorporated ChIP-X data to improve the accuracy .
We applied three Lp ( p = 1,1 / 2,0 ) regularization models on the integration of ChIP-X and transcriptome data , and compared their performance .
The L1 regularization model used the same algorithm as ISTA, but with a different initial $X^0$ derived from ChIP-X data.
Without the integration of ChIP-X data , the performance of the three models is similar to other methods and very poor when evaluated with either high-throughput ( HGS ) or low-throughput ( LGS ) golden standard ( Fig. 3A and B ) .
When ChIP-X data are integrated for network inference , the performance of all three regularization models dramatically improved , and the L0 and L1/2 regularization models significantly outperformed the L1 regularization model ( Fig. 3A and B , Table 4 ) .
The AUCs of all models are highly stable when evaluated on the high-throughput golden standard (Fig. 3C), while, due to the small number of known TF-target pairs, the AUCs calculated on the low-throughput golden standard are less stable (Fig. 3D). Nevertheless, the L0 and L1/2 regularization models on integrative data still showed significantly better performance when evaluated on this dataset (Table 4 and Fig. 3B).
ROCs describe the information of false positive rate ( FPR ) and true positive rate ( TPR ) of all models .
In biological studies , 0.05 is commonly used as the cutoff of FPR .
At an FPR of 0.05, when evaluated with the HGS, integration of ChIP-X data achieved TPRs of 0.637, 0.594 and 0.079 for the L0, L1/2 and L1 regularization models respectively, whereas calculation with transcriptome data alone yielded TPRs of 0.031, 0.034 and 0.044 respectively (Fig. 3A). The L0 and L1/2 regularization models thus achieved much higher sensitivity when integrating ChIP-X data, which better meets the demands of biological research.
Fig. 7 shows an example network known to be active in mESCs. A strict and identical cutoff (score $S_{i,j} \ge 0.1$) is used for all three models.
The L0 and L1/2 regularization models reported much more true targets than the L1 regularization model .
The advantages of the L0 and L1/2 regularization models are also demonstrated by the smaller error between $AX_{:,j}$ and $B_{:,j}$ compared with the L1 regularization model (Fig. 4).
The errors are calculated along different sparsity levels ( Fig. 4A ) or iterations ( Fig. 4B ) .
When more TFs are selected, a smaller error is achieved. For solutions with the same sparsity level, the L1 regularization model showed a larger error than the L0 and L1/2 regularization models, meaning that it yields less sparse solutions for a fixed error allowance.
This observation is consistent with several previous numerical experiments [ 22 -- 24 ] .
The error reduction along the iterations of the L0 and L1/2 regularization models was also faster than that of the L1 regularization model after 100 iterations (Fig. 4B).
The heatmap in Fig. 5B illustrates an example gene regulatory network inferred by the L0 regularization model in mESCs, in which only a small portion of TFs regulates most of the targets. The inferred TFs with more targets overlapped significantly with TFs intensively reported to be associated with ESCs (Fig. 5A; p-value = 9.95E-10, hypergeometric test).
The iterative thresholding algorithms we applied here only require low storage and computation cost [ 11,23,32 ] .
The computational complexity of the iterative hard and soft thresholding algorithms for the L0 and L1 regularization models, respectively, has been reported to be O(kr log m), where k is the number of iterations, r is the number of TFs and m is the number of targets [11,32,40]. The L1/2 regularization model requires similar computation time, since its computational complexity is similar to that of the L0 and L1 models (Fig. 6). Here, the three regularization models inferred the mouse gene regulatory network, with 939 TFs, 19978 targets and 245 samples, within one hour on a personal laptop with one Intel Core i7 (2.00 GHz, 8.00 GB of RAM); this is slower than YALL1, but much faster than ARACNE and NARROMI (Fig. 6).
The ChIP-Array web server we developed previously can also integrate ChIP-X and transcriptome data to construct gene regulatory network for a single TF [ 39 ] .
Even though it provides a more confident network, it can be used only when both ChIP-X and perturbation transcriptome data are available for the same TF.
However , only a limited number of TFs have both omics data .
Moreover, ChIP-Array constructs a network for only one TF, although in most cases target genes are regulated by multiple TFs. Unlike ChIP-Array, the transcriptome data used by the Lp (p = 1, 1/2, 0) regularization models need not be paired data obtained from perturbation of the same TF as in the ChIP-X experiment. The Lp (p = 1, 1/2, 0) regularization models also consider multiple TFs at the same time, inferring a more comprehensive gene regulatory network at a genome-wide scale.
To assess how strongly the three models rely on ChIP-X data, we tested their performance with different initial matrices $X^0$. A series of initial $X^0$s was made up of ChIP-X-defined TF-target relationships with different distance cutoffs between a TF binding site and a potential target gene, from 200 bp to 50 kbp, as commonly used in biological studies (Table 5).
After the TF binding sites are detected from ChIP-X data , a gene that locates closely to a TF binding site is usually considered as a potential target of the TF .
However, proximity does not always indicate a true target: the TF may bind to a distal enhancer, it may have other functions besides transcriptional regulation, or multiple genes may lie close to a single TF binding site.
Thus a large portion of potential targets defined by only ChIP-X data are false positives .
Table 5 shows that only 12.468–16.277% of potential targets defined by ChIP-X data alone are true targets verified by the TF knockdown/overexpression experiments (HGS).
When a shorter distance cutoff is used, fewer genes are defined as potential targets in the initial $X^0$, giving fewer false targets, but some true targets at longer distances may be lost. When a longer distance cutoff is chosen, more true targets are included, but false targets also increase.
With different initial $X^0$s, the L0 and L1/2 regularization models consistently outperformed the L1 regularization model (Table 5). Even though the high-throughput golden standard shares the ChIP-X data with the initial $X^0$s, the large proportion of false targets in the initial $X^0$s shows that ChIP-X data alone cannot infer true targets accurately. PCC values between TF and target expression profiles in the initial $X^0$s were used to indicate the possible regulatory effect of a TF on its target (activation or repression); however, using the PCC value in the initial $X^0$ as a predictor to classify true targets among the potential targets defined by ChIP-X data resulted in very low AUCs (0.530–0.538, Table 5).
Besides, least-norm minimization provides the solution with the smallest L2 norm among all possible solutions, and is commonly used as the initial point for algorithms that solve regularization problems in sparse optimization [22]. We created initial $X^0$s via least-norm minimization from transcriptome data, then ran the iterative thresholding algorithms for the Lp (p = 0, 1/2, 1) regularization models and evaluated the results with the same methods. The three regularization models, L0, L1/2 and L1, obtained AUCs of 0.529, 0.531 and 0.498 with the high-throughput golden standard, and AUCs of 0.681, 0.672 and 0.629 with the low-throughput golden standard, respectively. Consistent with [22], the L0 and L1/2 regularization models are slightly better than the L1 regularization model. However, the results starting from the least-L2-norm solution are still much worse than those starting from the ChIP-X data.
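The least-norm initialization discussed above can be sketched in a few lines; the sizes are illustrative. The point is that the minimum-L2-norm solution fits the data exactly but is dense, unlike the sparse, ChIP-X-informed starting points:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 12))        # fewer samples (rows) than TFs: underdetermined
b = rng.normal(size=5)

# The pseudoinverse yields the solution of Ax = b with the smallest L2 norm
x_min = np.linalg.pinv(A) @ b

fits = bool(np.allclose(A @ x_min, b))                   # satisfies the system...
nonzeros = int(np.count_nonzero(np.abs(x_min) > 1e-9))   # ...but every entry is non-zero
```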
Thus, neither ChIP-X nor transcriptome data alone can achieve satisfactory accuracy, and the L0 and L1/2 regularization models do improve the performance of gene regulatory network inference when ChIP-X and transcriptome data are integrated.
Recursive optimization of these three models iteratively moves the initial $X^0_{:,j}$ toward the solution minimizing the error between $AX_{:,j}$ and $B_{:,j}$. When the sample size is smaller than the number of factors, the solution is not unique. Without prior knowledge, $X_{:,j}$ reaches one of the solutions close to the artificially assigned initial $X^0_{:,j}$, such as 0. Thus, even though the models achieve a small error, which is mathematically sound, the non-uniqueness of the solution makes it biologically ambiguous, i.e. it may not be the true biological solution we want. Integrating ChIP-X data provides partial knowledge of the gene regulatory network, and solutions close to the initial $X^0_{:,j}$ defined by ChIP-X are biologically more meaningful. Thus the performance of all three models improves after ChIP-X data are incorporated.
Since the L1 regularization model is convex , its local minimizer is also the global minimizer .
Thus different initial $X^0$s have less influence on the solution of the L1 regularization model (Table 5). However, the L0 and L1/2 regularization models are non-convex, so the corresponding algorithms converge only to local minimizers [41]. Here we have shown that the local minimizers of the L0 and L1/2 regularization models obtained from integrative data are much closer to the biological solutions we expect than those of the L1 regularization model (Table 5).
4. Conclusion
In this study, we apply the L0 and L1/2 regularization models to gene regulatory network inference from integrative omics data in large genomes with a small number of samples.
Integrating ChIP-X data with transcriptome profiles significantly improves the performance of network inference .
Compared with the commonly used L1 regularization model , the L0 and L1/2 regularization models have much higher accuracy for integrative omics data .
We evaluated the inferred networks with both high-throughput and low-throughput golden standards .
The L0 and L1/2 regularization models consistently outperformed the L1 regularization model for integrative omics data .
Besides, the algorithms applied here are computationally efficient and can be executed on a personal computer within one hour.
In summary, we have demonstrated that the L0 and L1/2 regularization models are applicable to gene regulatory network inference in biological research that studies higher organisms but generates only a small number of omics samples, and that they help biologists analyze gene regulation at the whole-system level.
Funding
This work was supported by funding from the Research Grants Council, Hong Kong SAR, China (Grant number 781511M), and the National Natural Science Foundation of China (Grant numbers 91229105 and 11101186).
References
[1] D. Marbach, J.C. Costello, R. Kuffner, N.M. Vega, R.J. Prill, D.M. Camacho, K.R. Allison, M. Kellis, J.J. Collins, G. Stolovitzky, Nat. Methods 9 (2012) 796--804.
[2] A.A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. Dalla Favera, A. Califano, BMC Bioinf. 7 (Suppl 1) (2006) S7.
[3] A.J. Butte, I.S. Kohane, Pac. Symp. Biocomput. (2000) 418--429.
[4] A. de la Fuente, N. Bing, I. Hoeschele, P. Mendes, Bioinformatics 20 (2004) 3565--3574.
[5] X. Zhang, X.M. Zhao, K. He, L. Lu, Y. Cao, J. Liu, J.K. Hao, Z.P. Liu, L. Chen, Bioinformatics 28 (2012) 98--104.
[6] H.K. Yalamanchili, B. Yan, M.J. Li, J. Qin, Z. Zhao, F.Y. Chin, J. Wang, Bioinformatics 30 (2014) 377--383.
[7] K. Basso, A.A. Margolin, G. Stolovitzky, U. Klein, R. Dalla-Favera, A. Califano, Nat. Genet. 37 (2005) 382--390.
[8] A.A. Margolin, K. Wang, W.K. Lim, M. Kustagi, I. Nemenman, A. Califano, Nat. Protoc. 1 (2006) 662--671.
[9] E.P. van Someren, B.L. Vaes, W.T. Steegenga, A.M. Sijbers, K.J. Dechering, M.J. Reinders, Bioinformatics 22 (2006) 477--484.
[10] X. Zhang, K. Liu, Z.P. Liu, B. Duval, J.M. Richer, X.M. Zhao, J.K. Hao, L. Chen, Bioinformatics 29 (2013) 106--113.
[11] I. Daubechies, M. Defrise, C. De Mol, Commun. Pur. Appl. Math. 57 (2004) 1413--1457.
[12] M.A.T. Figueiredo, R.D. Nowak, S.J. Wright, IEEE J. Sel. Top. Signal Process. 1 (2007) 586--597.
[13] J.F. Yang, Y. Zhang, SIAM J. Sci. Comput. 33 (2011) 250--278.
[14] A.C. Haury, F. Mordelet, P. Vera-Licona, J.P. Vert, BMC Syst. Biol. 6 (2012) 145.
[15] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Ann. Stat. 32 (2004) 407--451.
[16] R. Tibshirani, J. R. Stat. Soc. B 58 (1996) 267--288.
[17] R. Tibshirani, J. R. Stat. Soc. B 73 (2011) 273--282.
[18] M.K.S. Yeung, J. Tegner, J.J. Collins, Proc. Natl. Acad. Sci. USA 99 (2002) 6163--6168.
[19] Y. Wang, T. Joshi, X.S. Zhang, D. Xu, L.N. Chen, Bioinformatics 22 (2006) 2413--2420.
[20] R. Bonneau, D.J. Reiss, P. Shannon, M. Facciotti, L. Hood, N.S. Baliga, V. Thorsson, Genome Biol. 7 (2006).
[21] V. Belcastro, V. Siciliano, F. Gregoretti, P. Mithbaokar, G. Dharmalingam, S. Berlingieri, F. Iorio, G. Oliva, R. Polishchuck, N. Brunetti-Pierri, D. di Bernardo, Nucleic Acids Res. 39 (2011) 8677--8688.
[22] R. Chartrand, V. Staneva, Inverse Prob. 24 (2008).
[23] Z.B. Xu, X.Y. Chang, F.M. Xu, H. Zhang, IEEE Trans. Neural Netw. Learn. Syst. 23 (2012) 1013--1027.
[24] T. Zhang, J. Mach. Learn. Res. 11 (2010) 1081--1107.
[25] D. Marbach, S. Roy, F. Ay, P.E. Meyer, R. Candeias, T. Kahveci, C.A. Bristow, M. Kellis, Genome Res. 22 (2012) 1334--1349.
[26] C.G. de Boer, T.R. Hughes, Nucleic Acids Res. 40 (2012) D169--D179.
[27] X. Chen, H. Xu, P. Yuan, F. Fang, M. Huss, V.B. Vega, E. Wong, Y.L. Orlov, W. Zhang, J. Jiang, Y.H. Loh, H.C. Yeo, Z.X. Yeo, V. Narang, K.R. Govindarajan, B. Leong, A. Shahab, Y. Ruan, G. Bourque, W.K. Sung, N.D. Clarke, C.L. Wei, H.H. Ng, Cell 133 (2008) 1106--1117.
[28] A. Marson, S.S. Levine, M.F. Cole, G.M. Frampton, T. Brambrink, S. Johnstone, M.G. Guenther, W.K. Johnston, M. Wernig, J. Newman, J.M. Calabrese, L.M. Dennis, T.L. Volkert, S. Gupta, J. Love, N. Hannett, P.A. Sharp, D.P. Bartel, R. Jaenisch, R.A. Young, Cell 134 (2008) 521--533.
[29] L. Zhang, X. Ju, Y. Cheng, X. Guo, T. Wen, BMC Syst. Biol. 5 (2011) 152.
[30] N. Novershtern, A. Regev, N. Friedman, Bioinformatics 27 (2011) i177--i185.
[31] R. Chartrand, IEEE Signal Process. Lett. 14 (2007) 707--710.
[32] T. Blumensath, M.E. Davies, J. Fourier Anal. Appl. 14 (2008) 629--654.
[34] A. Nishiyama, L. Xin, A.A. Sharov, M. Thomas, G. Mowrer, E. Meyers, Y. Piao, S. Mehta, S. Yee, Y. Nakatake, C. Stagg, L. Sharova, L.S. Correa-Cerro, U. Bassey, H. Hoang, E. Kim, R. Tapnio, Y. Qian, D. Dudekula, M. Zalzman, M. Li, G. Falco, H.T. Yang, S.L. Lee, M. Monti, I. Stanghellini, M.N. Islam, R. Nagaraja, I. Goldberg, W. Wang, D.L. Longo, D. Schlessinger, M.S. Ko, Cell Stem Cell 5 (2009) 420--433.
[35] L.S. Correa-Cerro, Y. Piao, A.A. Sharov, A. Nishiyama, J.S. Cadet, H. Yu, L.V. Sharova, L. Xin, H.G. Hoang, M. Thomas, Y. Qian, D.B. Dudekula, E. Meyers, B.Y. Binder, G. Mowrer, U. Bassey, D.L. Longo, D. Schlessinger, M.S. Ko, Sci. Rep. 1 (2011) 167.
[36] A. Nishiyama, A.A. Sharov, Y. Piao, M. Amano, T. Amano, H.G. Hoang, B.Y. Binder, R. Tapnio, U. Bassey, J.N. Malinou, L.S. Correa-Cerro, H. Yu, L. Xin, E. Meyers, M. Zalzman, Y. Nakatake, C. Stagg, L. Sharova, Y. Qian, D. Dudekula, S. Sheer, J.S. Cadet, T. Hirata, H.T. Yang, I. Goldberg, M.K. Evans, D.L. Longo, D. Schlessinger, M.S. Ko, Sci. Rep. 3 (2013) 1390.
[37] J. Feng, T. Liu, B. Qin, Y. Zhang, X.S. Liu, Nat. Protoc. 7 (2012) 1728--1740.
[38] H. Ji, H. Jiang, W. Ma, W.H. Wong, Curr. Protoc. Bioinformatics, Chapter 2 (2011) Unit 2.13.
[39] J. Qin, M.J. Li, P. Wang, M.Q. Zhang, J. Wang, Nucleic Acids Res. 39 (2011) W430--W436.
[40] T. Blumensath, M.E. Davies, Appl. Comput. Harmon. Anal. 27 (2009) 265--274.
[41] Y.H. Hu, C. Li, X.Q. Yang, J. Mach. Learn. Res. (submitted for publication), http://www.acad.polyu.edu.hk/~mayangxq/GPA-SO.pdf