Efficient transcription initiation in bacteria: an interplay of protein–DNA interaction parameters†
As the first, and usually rate-limiting, step of transcription initiation, bacterial RNA polymerase (RNAP) binds to double-stranded DNA (dsDNA) and subsequently opens the two strands of DNA (the open complex formation).
The rate-determining step in the open complex formation is opening of a short (6 bp) DNA segment called the −10 region, which interacts with RNAP in both dsDNA and single-stranded (ssDNA) forms.
Accordingly, formation of the open complex depends on (physically independent) domains of RNAP that interact with ssDNA and dsDNA, as well as on parameters of DNA melting and sequences of
Institute of Physiology and Biochemistry, Faculty of Biology, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia. E-mail: dmarko@bio.bg.ac.rs; Fax: +381 11 2639 882; Tel: +381 63 1312 976
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c3ib20221f
1 Introduction
Transcription initiation is both the first step and a major control point in gene expression.
Transcription cannot be initiated by core RNA polymerase alone, so a complex between core RNA polymerase and a σ factor, which is called RNA polymerase holoenzyme (RNAP), is formed.¹ Different σ factors interact with double-stranded DNA (dsDNA) and single-stranded DNA (ssDNA) in a sequence-specific manner, and they are responsible for transcription under different conditions.² In this work we concentrate on σ70 (the major σ factor in E. coli), which is responsible for transcribing housekeeping genes.³ Transcription is initiated from the sequences called core promoters.
The main elements of core promoters in bacteria are the −35 element and the −10 element, where −35 and −10 refer to typical distances of these elements from transcription start sites.⁴ As the first step of transcription initiation, RNAP reversibly binds to dsDNA of promoter elements, which is called the closed complex formation, and is described by the binding affinity KB.
The binding affinity is, therefore, determined by interactions of σ70 with dsDNA, which are exhibited through interactions of σ70 domain 4.2 with the −35 box, and σ70 domain 2.4 with the −10 box in the dsDNA form.² This binding of RNAP leads to opening of the two DNA strands (promoter melting), so that a transcription bubble is formed.
This transcription bubble extends from the upstream edge of the −10 element to about two bases downstream of the transcription start site, which roughly corresponds to positions −12 to +2 (+1 is the transcription start site).⁵ The (inverse) time needed to form the transcription bubble (i.e. to open the two DNA strands) is described by the transition rate from the closed to open complex (kf).
The transition rate, therefore, crucially depends on interactions of σ70 with −10 element ssDNA, which are exhibited through σ70 domain 2.3.⁶ Since almost the entire −10 element is a part of the transcription bubble, this element interacts with σ70 in both dsDNA and ssDNA forms.
While sequences from the downstream edge of the −10 element to the transcription start site are also part of the transcription bubble, mutating these sequences does not affect the bubble formation,⁷ and it is considered that these sequences do not interact with σ70 in a sequence-specific manner.
Furthermore, both theoretical studies⁸ and single-molecule experiments⁹ show that opening of the −10 element is the rate-limiting step in the transcription bubble formation.
Since the −10 box is a part of both the closed and the open complex, there is a complex interplay of biophysical interactions associated with this element: (i) DNA melting energies,¹⁰ since the −10 box dsDNA is opened (melted) in the open complex, (ii) interaction energies of σ70 with dsDNA through σ70 subdomain 2.4,¹¹ and (iii) interaction energies of σ70 with ssDNA through σ70 subdomain 2.3.
These three types of interactions are physically independent, since they either correspond to intrinsic DNA properties (for melting energies) or are exhibited through physically distinct σ70 binding domains (for σ70–dsDNA and σ70–ssDNA interactions).⁶ Given the complex set of physically independent interactions at the −10 element described above, there is a question of how their mutual relationship leads to efficient transcription.
In particular, the RNAP binding affinity (KB) depends on interactions of −10 box dsDNA with σ70 subdomain 2.4,⁶ where the stronger interaction leads to larger binding affinity.
On the other hand, a stronger interaction of σ2.4 with dsDNA of the −10 element leads to a slower transition from the closed to open complex.⁸ The transition rate (kf) also depends on interactions of −10 box ssDNA with σ70 subdomain 2.3 and on the −10 element melting energy, both of which are physically independent of σ2.4.⁸,¹² Due to this, KB and kf should a priori be negatively correlated, and there may be a large number of sequences in the genome that correspond to high KB but low kf.
We call such sequences, where RNAP is strongly bound to dsDNA (high KB) but proceeds to the open complex too slowly to achieve functional transcription (due to small kf), poised promoters; more generally, the term poised promoter is used for all instances where RNAP is bound strongly to DNA, but fails to proceed to functional transcription.¹³ Naively, RNAP poising appears particularly detrimental for sequences that should be transcriptionally active (functional promoters), since these sequences should result in efficient transcription.
Given the kinetic issues discussed above, we here aim to understand the following questions: (i) What is the extent of RNAP poising in the genome?
(ii) Are binding specificities of σ70 interaction domains, and/or sequences of E. coli intergenic regions, designed to minimize the number of poised promoters?
(iii) Do sequences of functional σ70 promoters (additionally) suppress RNAP poising?
We here concentrate on the intergenic regions, rather than on the whole genome, since these regions are relevant for transcription regulation, i.e. both transcription start sites and regulatory elements are located in the intergenic regions.
The questions posed above are important not only from the point of view of the design of σ70–promoter DNA interactions, but also from the point of view of searches for functional promoters in the genome.
In particular, the most common experimental method to search for core promoters on a genome-wide scale is ChIP-chip¹⁴ and its alternatives (e.g. ChIP-seq¹⁵).
However, immunoprecipitation (ChIP) detects DNA sequences that are strongly bound by the protein (RNAP), rather than sequences with a high rate of transcription initiation, which is the parameter that defines a functional promoter.
Consequently, the high number of false positives, which is commonly associated with ChIP-chip experiments aimed at promoter detection,¹⁶ may indicate extensive RNAP poising in the genome.
The goal of this paper is to investigate the relationship between physical interactions at the −10 element and RNAP poising, which provides a basis for better understanding of the nature of false positives in ChIP-chip experiments.
Along the same lines, DNA footprinting experiments detected sequences that are strongly bound by RNAP, but which result in transcriptionally inactive complexes; these inactive complexes were shown to be due to inefficient formation of the open complex (i.e. due to RNAP poising).¹⁷ Such observations seem particularly important from the point of view of computational searches of transcription start sites (core promoters) in the genome, which typically lead to a very high number of false positives.
It was consequently proposed that kinetic effects, an extreme example of which are poised promoters, can significantly contribute to accuracy of the weight matrix (computational) searches of promoters.¹⁸ Furthermore, an understanding of the kinetic effects, which we will achieve in this paper, will motivate their inclusion within more physical methods of TSS recognition.
With regard to this, it was frequently observed that coupling biophysical models with sequence statistics provides a significantly better prediction accuracy compared to simple statistical models.¹⁹ In order to analyze how the interplay of different interaction parameters leads to efficient transcription, one must be able to investigate kinetics of transcription initiation on a genome-wide scale.
This analysis cannot be done through experiments, since KB and kf have to be measured through work-intensive τ-plot measurements,²⁰ individually for each sequence of interest.
We here instead approach the problem computationally, where we use a recently developed biophysical model of the open complex formation,⁸ which allows the calculation of the kinetic parameters (KB and kf) for each sequence of interest.
This model showed a very good agreement with both biochemical and genomics data, with no free parameters used in comparing the model with the experimental data.⁸ We will here show that binding specificities of σ70 DNA interaction domains are designed to prevent extensive RNAP poising in the intergenic regions, but that the number of poised promoters is still sufficient to significantly affect accuracy of core promoter searches.
Surprisingly, we will find that sequences of functional −10 elements increase the extent of RNAP poising; on the other hand, overall, the sequences in the intergenic regions have no tendency to affect RNAP poising.
Though seemingly counter-intuitive, we will argue that this result fits well within the recently proposed mix-and-match model of promoter recognition.²¹
2 Results
2.1 Design of in silico experiments
Our goal is to investigate how the interplay of physical interactions at the −10 promoter region provides for efficient transcription.
We, consequently, systematically investigate relations between the kinetic parameters as the −10 element sequence is varied.
To achieve this, we design a number of in silico experiments, where we start from a sequence of the lacUV5 promoter.
This promoter has a consensus −10 element, which is convenient as a reference for calculating kinetic parameters, but has an imperfect −35 element, as is characteristic for most functional promoters.⁵ In the in silico experiments presented in the following subsections, we will substitute the consensus −10 element of the lacUV5 promoter with different sets of DNA segments.
The biophysical model of transcription initiation⁸ allows the calculation of the relevant kinetic parameters for sets of DNA segments at the scale of the entire genome (see Methods and ESI†).
In particular, in the analysis below, we will substitute the consensus −10 element of the lacUV5 promoter with: (i) all 6 bp long segments from E. coli intergenic regions, (ii) all −10 elements that correspond to experimentally detected E. coli transcription start sites, (iii) segments that correspond to randomized intergenic regions and randomized −10 elements of experimentally detected promoters; the computational procedure allows randomizing DNA sequences multiple times, so that statistics of the relevant quantities can be calculated.
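The substitution procedure described above can be illustrated with a minimal sketch. It assumes that the 6 bp segments are taken as overlapping windows along one strand, which is an assumption rather than a detail stated in the text; the function names and the toy sequences are hypothetical and are not the actual lacUV5 or E. coli sequences.

```python
def six_bp_segments(sequence, length=6):
    """Return every overlapping window of `length` bases from `sequence`."""
    seq = sequence.upper()
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]


def substitute_minus10(promoter, start, segment):
    """Replace the element that begins at index `start` of `promoter` with `segment`."""
    return promoter[:start] + segment + promoter[start + len(segment):]


# Hypothetical toy data (not real genomic sequence).
toy_region = "GGCTATAATGC"
toy_promoter = "AAATATAATCCC"   # consensus TATAAT placed at index 3

segments = six_bp_segments(toy_region)
variants = [substitute_minus10(toy_promoter, 3, s) for s in segments]
print(len(segments), variants[0])
```

Each variant would then be passed to the biophysical model to obtain its kinetic parameters; that model is not reproduced here.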
In the analysis below, we will also address how relevant σ70 DNA-interaction domains contribute to the kinetic properties that we investigate.
Experimentally, contributions of different protein domains to the properties of interest would be assessed by mutating amino-acid sequences of these domains.
We will computationally assess contributions of σ70 domains by randomizing interaction specificities of these domains; as with DNA sequences, we can perform multiple randomizations in order to calculate statistics of the relevant quantities.
Finally, we will also substitute binding specificities of σ70 domains with binding specificities of different E. coli transcription factors, in order to ensure that the reported relationships are not a consequence of generic properties of protein–DNA interactions.
2.2 Kinetic properties of E. coli intergenic regions
We start from the sequence of the lacUV5 promoter, and substitute its consensus −10 element with all 6 bp long segments from E. coli intergenic regions.
For all these substitutions we calculate the relative binding affinity (KB) and the relative transcription initiation rate (j), by using eqn (1) and (3) (see Methods).
The relationship between logarithms of KB and j is shown in Fig. 1A, so that the quantities on the two axes correspond to the appropriate interaction energies that determine the relevant kinetic parameters.
Specifically, the horizontal axis (log(KB)) corresponds to the σ70–dsDNA binding energy, while the vertical axis corresponds to a combination of the energy terms that we refer to as the effective energy and which directly determines the transcription initiation rate (see eqn (3) and (4) in Methods).
Both KB and j, which are shown in Fig. 1A, are calculated relative to the binding affinity and the transcription initiation rate of the lacUV5 promoter.
Note that we substitute (vary) only the −10 element of the lacUV5 promoter, and that the −10 element of this promoter corresponds to the consensus sequence (TATAAT at positions −12 to −7).
Consequently, zeros on the horizontal and the vertical axes correspond to the consensus −10 element, and stronger interaction energies correspond to larger (less negative) values on the two axes.
The horizontal line in Fig. 1 (transcription rate threshold) indicates the transcription rate below which transcript levels cannot be detected, while the vertical line (binding threshold) indicates the binding affinity above which a sequence is considered to be strongly bound by RNAP.
The transcription rate threshold is set based on the estimate that the minimal rate of transcription is 1/400 per second, while the transcription rate of the reference lacUV5 can be estimated at 1/3 per second.²² The binding threshold is set so that it corresponds to the binding affinity of a weak Plac promoter, with sequences of the −35 element and the −10 element that correspond, respectively, to TTTACA (positions −36 to −31) and TATGTT (positions −12 to −7);²³ this definition is in accordance with an intuitive notion that strongly bound sequences should have a larger binding affinity than a weak promoter.
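The two rate estimates quoted above fix the position of the transcription rate threshold relative to lacUV5. The short sketch below works through that arithmetic; the use of the natural logarithm is an assumption, since the base of the logarithm on the figure axes is not stated in this section, and the variable names are illustrative.

```python
import math
from fractions import Fraction

min_rate = Fraction(1, 400)      # minimal detectable transcription rate, per second
lacuv5_rate = Fraction(1, 3)     # estimated lacUV5 transcription rate, per second

# Threshold expressed relative to the lacUV5 reference (dimensionless).
relative_threshold = min_rate / lacuv5_rate

# Position on a logarithmic axis; natural log is assumed here.
log_threshold = math.log(float(relative_threshold))

print(relative_threshold, round(log_threshold, 2))
```

Under these estimates a sequence must retain at least 3/400 (0.75%) of the lacUV5 initiation rate to count as detectably transcribed, i.e. about −4.89 on a natural-log axis.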
Fig. 1A shows that there is a high positive correlation (with a Pearson correlation coefficient of R = 0.85) between the transcription activity and the binding affinity for −10 elements derived from E. coli intergenic regions.
One should note that the determinants of binding affinity and transcription activity are physically independent (see the previous section), so the good correlation has to be due to the design of σ70 interaction domains or due to the sequence of DNA intergenic segments, which is further explored in the next subsection.
However, despite this high correlation, a significant fraction of the strongly bound sequences corresponds to poised promoters: in Fig. 1, the green dots mark strongly bound DNA segments that correspond to the functional promoters (i.e. to sequences that are above both the binding and the transcription activity threshold), while the red dots mark the sequences that correspond to the poised promoters (i.e. to sequences that are above the binding, but below the transcription activity threshold).
One can see that a significant fraction of the strongly bound sequences (~30%) corresponds to poised promoters.
Such poised promoters can be falsely identified as targets by computational and experimental searches of core promoters, which we will further discuss in the next section.
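The threshold-based classification used in Fig. 1 can be written down compactly. The following is a minimal sketch, assuming that "above the threshold" includes equality and using made-up data points and threshold values; none of the numbers are taken from the paper.

```python
def classify(log_kb, log_rate, binding_threshold, rate_threshold):
    """Label one sequence by the two thresholds described in the text."""
    if log_kb < binding_threshold:
        return "weakly bound"
    # Strongly bound: functional if transcription is detectable, poised otherwise.
    return "functional" if log_rate >= rate_threshold else "poised"


def poised_fraction(points, binding_threshold, rate_threshold):
    """Fraction of strongly bound sequences that are poised."""
    labels = [classify(kb, rate, binding_threshold, rate_threshold)
              for kb, rate in points]
    strongly_bound = [label for label in labels if label != "weakly bound"]
    if not strongly_bound:
        return 0.0
    return strongly_bound.count("poised") / len(strongly_bound)


# Hypothetical (log KB, log rate) pairs, relative to the reference promoter.
toy_points = [(0.0, 0.0), (-1.0, -2.0), (-1.5, -6.0), (-5.0, -7.0)]
fraction = poised_fraction(toy_points, binding_threshold=-2.0, rate_threshold=-4.9)
print(fraction)
```

In this toy set, three sequences are strongly bound and one of them falls below the rate threshold, so one third of the strongly bound sequences are poised.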
2.3 Analyzing the good correlation between the transcription rate and the binding affinity
In this subsection, we concentrate on the properties of σ70–DNA interactions that lead to the good correlation between the transcription activity and the binding affinity, which is observed in Fig. 1A.
As discussed above, KB depends on σ70 interactions with −10 element dsDNA, while j depends on interactions of σ70 with −10 box ssDNA and on DNA melting energies.⁸ Since KB and j are physically independent of each other, the question arises of why the transcription rate and the binding affinity are well correlated in Fig. 1A.
The first possibility is that this good correlation is due to the sequence of E. coli intergenic regions, i.e. the presence of poised promoters is suppressed in these sequences.
This possibility might be reasonable, since existence of a large number of poised promoters could be detrimental for efficient transcription initiation (see also Discussion).
The second possibility is that the good correlation is due to the design of σ70 DNA interaction domains (specifically due to the binding specificities of σ70 subdomains 2.3 and 2.4).
We test these two possibilities below.
In order to generate an appropriate ensemble to test the possibility that the good correlation is due to the DNA sequence, we next randomize the DNA sequence of E. coli intergenic regions 50 times.
The randomizations are performed so that frequencies of the nucleotides are preserved (see Methods).
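One simple way to randomize a sequence while preserving nucleotide frequencies is to shuffle its bases. The sketch below illustrates this under the assumption that only single-nucleotide frequencies need to be preserved; the actual procedure is described in Methods and may differ. The sequence and function name are hypothetical.

```python
import random
from collections import Counter


def shuffle_sequence(sequence, rng):
    """Return a permutation of `sequence`; base counts are unchanged."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)


rng = random.Random(0)            # fixed seed so the example is repeatable
original = "ATTATAATGCGCATTA"     # hypothetical toy sequence
randomized = [shuffle_sequence(original, rng) for _ in range(50)]

# Every randomized copy has exactly the same base composition as the original.
print(all(Counter(s) == Counter(original) for s in randomized))
```

Because a shuffle only reorders bases, the composition is preserved exactly rather than on average, while any higher-order sequence structure is destroyed.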
We next re-calculate the correlation coefficient between the transcription rate and the binding affinity for each of the 50 randomized sequences, and obtain the mean for these 50 randomizations as R̄ = 0.84 (the relationship between the transcription rate and the binding affinity for one such randomization is shown in ESI,† Fig. S1).
This value (R̄ = 0.84) is only somewhat smaller compared to the correlation coefficient for the actual E. coli intergenic regions (R = 0.85).
Consequently, the design of the DNA sequence of the intergenic regions is not a reason for the high correlation between the transcription rate and the binding affinity.
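For reference, the Pearson correlation coefficient compared above can be computed as follows. This is a generic sketch of the standard formula applied to hypothetical log-affinity and log-rate values, not a reproduction of the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical values for a handful of substituted -10 elements.
log_kb = [0.0, -1.0, -2.0, -3.0]
log_rate = [0.0, -1.5, -2.5, -5.0]
print(pearson_r(log_kb, log_rate))
```

Repeating this calculation for each of the 50 randomized sequence sets and averaging the results gives the mean correlation that is compared with the value for the actual intergenic regions.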
As the second possibility, we analyze if the high correlation is due to the design of the binding specificities of σ70 DNA interaction domains.
To test this possibility, we randomize the binding specificities that correspond to σ70 subdomain 2.3 (σ70–ssDNA interactions) and 2.4 (σ70–dsDNA interactions) and DNA melting energies (see Methods).
We first permute the two parameters that, in the single nucleotide approximation, characterize DNA melting (melting energies of A:T and G:C pairs, see Methods); the effect of this permutation is shown in Fig. 1B.
In Fig. 1C and D we show the effect of randomization of, respectively, σ70 binding domains 2.3 and 2.4.
Fig. 1B–D show that (separately) randomizing each of the interaction energies leads to a large decrease in the correlation coefficient, and to a consequent large increase in the fraction of poised promoters (the red dots in Fig. 1B–D).
In particular, note that not only randomizations of the interaction domain specificities (Fig. 1C and D), but also the permutation of the melting energies (Fig. 1B) lead to a significant decrease in the correlation coefficient.
This indicates that the reduction of RNAP poising in the genome depends on an interplay of all the relevant parameters (i.e. on the mutual relation between ssDNA, dsDNA and melting energy parameters).
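The permutation of the two melting parameters can be pictured with a toy calculation. The sketch below assumes that, in the single nucleotide approximation, the melting energy of a segment is a sum of per-base-pair terms with one value for A:T and one for G:C pairs; the numerical values and units are hypothetical and are not the parameters used in the paper.

```python
def melting_energy(segment, pair_energy):
    """Sum of per-base-pair melting energies over the segment."""
    return sum(pair_energy[base] for base in segment)


def permute_melting(pair_energy):
    """Swap the A:T and G:C melting parameters."""
    at_value, gc_value = pair_energy["A"], pair_energy["G"]
    return {"A": gc_value, "T": gc_value, "G": at_value, "C": at_value}


# Hypothetical melting costs in arbitrary units (G:C pairs cost more to open).
wild_type = {"A": 1.0, "T": 1.0, "G": 2.0, "C": 2.0}
permuted = permute_melting(wild_type)

for segment in ("TATAAT", "TATGTT"):
    print(segment, melting_energy(segment, wild_type), melting_energy(segment, permuted))
```

With the wild-type values the AT-rich consensus is the easiest segment to melt; after the swap it becomes the hardest, which is why permuting only two numbers can strongly change the relation between binding and opening.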
To test statistical significance of the results in Fig. 1C and D, we calculate correlation coefficients for 50 randomizations of ssDNA interaction parameters (σ70 subdomain 2.3), and for 50 randomizations of dsDNA interaction parameters (σ70 subdomain 2.4).
The mean values and 95% confidence intervals for these randomizations are shown in the histogram (see Fig. 2).
For comparison, the correlation coefficient for the actual (wild type) interaction parameters and for the permutation of the melting parameters are also indicated.
We see that all the randomizations indeed lead to a statistically significant (and large) decrease in the correlation coefficient.
Consequently, the reduction in the number of poised promoters in the intergenic regions depends on the mutual relationship of all physical parameters that are relevant for opening the −10 element.
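A mean and a 95% confidence interval over a set of randomizations can be obtained in several ways. The sketch below uses a normal approximation (mean ± 1.96 standard errors), which is an assumption; the text does not state how the intervals in Fig. 2 were computed, and the input values are made up.

```python
import math
import statistics


def mean_with_ci95(values):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * standard_error
    return mean, mean - half_width, mean + half_width


# Hypothetical correlation coefficients from a few randomizations.
toy_correlations = [0.31, 0.42, 0.25, 0.38, 0.36]
mean_r, lower, upper = mean_with_ci95(toy_correlations)
print(round(mean_r, 3), round(lower, 3), round(upper, 3))
```

A wild-type correlation lying well outside such an interval would indicate that the decrease caused by randomization is statistically significant.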
Finally, from Fig. 2 one can also note that randomization of dsDNA interaction parameters (σ70 domain 2.4) leads to an almost complete loss of the correlation.
The reason for this loss is that the binding affinity depends exclusively on dsDNA interactions, while the transcription rate depends on dsDNA interactions through only one out of six bases of the −10 element (base −12) (see eqn (1), (3) and (4)).
Consequently, randomization of dsDNA interactions leads to an almost complete loss of the relation between the binding affinity and the transcription rate.
2.4 Substitutions of σ70 DNA interaction domains
In this subsection, we provide further evidence that the binding specificities of σ70 interaction domains are designed to prevent extensive RNAP poising.
Specifically, while we established that the good correlation is due to the specificities of σ70 DNA-binding domains, it remains to be confirmed that the effect is not an artificial consequence of some generic property of protein–DNA interactions.
For example, such an artifact would arise if protein–DNA binding domains had a general tendency to recognize similar AT-rich sequences.
To test this, we substitute specificities of binding domains 2.3 and 2.4 with specificities of different E. coli DNA binding proteins.
Parameters of protein–DNA interactions are inferred from binding sequences assembled in the DPInteract database,²⁴ by using the QPMEME algorithm.¹⁹ᵇ From the DPInteract database we can infer, with a high reliability, interaction specificities of 8 E. coli transcription factors (see Methods).
We then substitute specificities of RNAP binding domains 2.3 and 2.4 with these inferred specificities, which makes a total of 56 substitution pairs; note that we do not allow for the same E. coli transcription factor specificity to substitute both σ70 domains 2.3 and 2.4.
For each of these substitutions we calculate the correlation between the rates of transcription and binding affinities, as described in the previous subsection.
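The count of 56 substitution pairs follows from choosing an ordered pair of distinct transcription factors out of 8 (one specificity for domain 2.3, a different one for domain 2.4). A short sketch with placeholder factor names, which are hypothetical labels rather than the factors actually used:

```python
from itertools import permutations

# Placeholder labels for the 8 transcription factors (names are hypothetical).
factors = [f"TF{i}" for i in range(1, 9)]

# Ordered pairs (substitute for domain 2.3, substitute for domain 2.4),
# never using the same factor for both domains.
substitution_pairs = list(permutations(factors, 2))

print(len(substitution_pairs))  # 8 * 7 = 56
```

The pairs are ordered because assigning a given specificity to domain 2.3 and another to domain 2.4 is a different substitution from the reverse assignment.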
The distribution of the correlation coefficients for the substitutions is shown in Fig. 3, and the correlation for the actual σ70 binding domains is also indicated in the figure for comparison.
We see that the correlation in the case of the actual σ70 binding domains is significantly larger compared to all the substitutions, with a very high statistical significance (P value of ~10⁻²⁴).
Therefore, the good correlation is not an artificial consequence of some generic property of protein–DNA interactions, and interaction domains of RNAP are indeed "hardwired" so as to reduce RNAP poising in the genome.
2.5 Kinetic properties of experimentally detected σ70 promoters
We next investigate kinetic properties of −10 elements associated with 342 experimentally confirmed transcription start sites.
Selection of the transcription start sites with experimentally confirmed transcription activity from the RegulonDB database,²⁵ and alignment of −10 elements associated with these transcription start sites, are described in Methods.
We substitute the consensus −10 element of the lacUV5 promoter with these aligned −10 elements, and for each of these substitutions we calculate the transcription rate and the binding affinity; the obtained relationship between these two quantities is shown in Fig. 4A.
One may expect that RNAP poising at the transcriptionally active sequences should be suppressed to a larger extent compared to the generic segments from the intergenic regions.
However, in contrast to this expectation, we find that the correlation in the case of the transcriptionally active −10 elements is notably smaller than the correlation for the intergenic segments (0.75 vs. 0.84, compare Fig. 4A with Fig. 1A); to further assess this result, we analyze how the correlation changes when functional −10 elements are randomized.
To obtain appropriate statistics, we randomize the set of aligned −10 elements 50 times, and then calculate the correlation coefficient for each randomization.
Consistent with the result obtained above, the mean of the correlation coefficients for these randomizations is notably larger compared to the correlation for the actual −10 elements (0.85 vs. 0.75), with a very high statistical significance (P ~ 10⁻³⁹).
Therefore, the DNA sequences of the transcriptionally active −10 elements indeed significantly decrease the correlation between the transcription rate and the binding affinity, and consequently increase the extent of RNAP poising.
Finally, to visualize the effect of −10 element randomization, we show the relationship between the transcription rate and the binding affinity, for one instance of −10 element randomization
2.6 Extension of the mix-and-match model to kinetic parameters
We here establish a connection between the surprising decrease in the correlation coefficient for functional −10 elements and a recently proposed mix-and-match model of promoter recognition.²¹ The mix-and-match model initially proposed that the strengths of the promoter elements that interact with dsDNA complement each other so as to achieve a necessary level of overall binding affinity.
Subsequently, a more detailed statistical analysis showed that promoter elements match each other to achieve a necessary level of total promoter strength.²⁶ We here consider an extension of this model to the kinetic parameters, where we propose that the binding affinity and the transition rate match each other to achieve a necessary level of transcription activity.
To test such an extension of the mix-and-match model, we start from the intergenic segments (analyzed in Fig. 1A), and from the transcriptionally active −10 elements (analyzed in Fig. 4A).
From each of these two sets of sequences, we select the following two subsets: (i) 30% of the sequences with the highest value of the transition rate from the closed to open complex (kf) and (ii) 30% of the sequences with the lowest value of the transition rate.
The transition rates from the closed to open complex (kf) are calculated according to eqn (2) (see Methods).
We next calculate the distribution of the binding affinities for these two subsets, i.e. for the sequences with the high and the low values of the transition rate, by using eqn (1) (see Methods).
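The subset selection described above can be sketched as follows. The record layout (name, log kf, log KB), the rounding rule for the 30% cut-off, and the toy values are all assumptions made for illustration; the kinetic parameters themselves would come from eqn (1) and (2) of the Methods, which are not reproduced here.

```python
import statistics


def split_by_kf(records, fraction=0.3):
    """Return (lowest-kf subset, highest-kf subset), each `fraction` of the records.

    Each record is a tuple (name, log_kf, log_kb).
    """
    ranked = sorted(records, key=lambda record: record[1])  # ascending in kf
    n = int(round(fraction * len(ranked)))
    return ranked[:n], ranked[len(ranked) - n:]


# Hypothetical records: (name, log kf, log KB).
toy_records = [(f"seq{i}", float(i), 0.5 * i) for i in range(10)]
low_kf, high_kf = split_by_kf(toy_records)

# Compare the binding affinities of the two subsets.
print(statistics.mean(r[2] for r in low_kf), statistics.mean(r[2] for r in high_kf))
```

Comparing the binding-affinity distributions of the two subsets then shows whether low kf tends to come with low KB (as for the intergenic segments) or with high KB (as reported for the functional −10 elements).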
For the intergenic segments, the distributions for the two subsets are shown together in Fig. 5A.
Similarly, the two distributions for transcriptionally active −10 elements are shown together in Fig. 5B.
In Fig. 5A, we see that, for the intergenic segments, the mean binding affinity is significantly smaller for the group with small kf values than for the group with high kf values (P < 10⁻¹⁰⁰).
This property decreases the extent of RNAP poising for the intergenic segments, i.e. sequences with low values of the transition rates are generally not characterized by high values of the binding affinities.
Note that this result is directly related to the high value of the correlation between the binding affinity and the transcription rate for the intergenic segments.
On the other hand, for the transcriptionally active −10 elements, the distribution of the binding affinities for the group with low kf is shifted towards the stronger binding affinities, relative to the same distribution for the intergenic segments.
As a consequence, for transcriptionally active −10 elements, the group of promoters with high kf values has smaller mean binding affinities compared to the group with low kf values (with P < 0.05).
This result is a consequence of the decrease in the correlation coefficient between the transcription rate and the binding affinity for the transcriptionally active −10 elements relative to the intergenic segments (Fig. 4A vs. Fig. 1A), and is analyzed below in terms of the mix-and-match model for promoter recognition.
Though unexpected, the result in Fig. 5B is straightforward to interpret in terms of the extension of the mix-and-match model to kinetic parameters.
This figure shows that KB and kf complement each other, so that lower kf is accompanied by higher KB; this is notably different from the intergenic regions, where sequences with low kf have a tendency to have low KB.
This match of the kinetic parameters for the transcriptionally active −10 elements allows a sufficient level of transcription activity (which is proportional to the product of KB and kf) to be achieved.
This result, and the extension of the mix-and-match model to kinetic parameters, is further discussed in the next section.
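The compensation between the two kinetic parameters can be made explicit by writing the stated proportionality on a logarithmic scale. This is a brief sketch that assumes only that proportionality; the symbol A for the transcription activity and the constant c are introduced here for illustration and do not appear in the paper.

```latex
A \propto K_B \, k_f
\quad\Longrightarrow\quad
\log A = \log K_B + \log k_f + c
```

Under this assumption, a −10 element whose log kf is lower by some amount reaches the same activity if its log KB is higher by the same amount, which is the kind of matching seen in Fig. 5B.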
3 Discussion
Interactions of σ70 with −10 promoter elements are crucial for initiation of transcription.
These interactions involve σ70 binding domains that interact with dsDNA and ssDNA, as well as DNA melting energies.
We here analyzed how the interplay of these interactions affects kinetics of transcription initiation.
A prominent example of such kinetic effects are poised promoters, which are sequences where RNAP strongly binds to dsDNA, but transitions from the closed to open complex too slowly to achieve detectable transcription levels.
Extensive RNAP poising could be detrimental for efficient transcription, since unproductively bound RNAP can disrupt normal transcription regulation; e.g. note that the bound RNAP molecule protects ~75 bp of DNA, which is often comparable to the size of E. coli intergenic regions.²⁷ Such unproductive binding can also require a significantly larger RNAP production, in order to achieve a sufficiently high RNAP concentration for function of transcriptionally active promoters.
Consequently, it seems plausible that specificities of different interactions and DNA sequences, which are involved in transcription initiation, are somehow tuned to prevent RNAP poising.
We here investigated this possibility and showed that σ70–DNA interaction domains, though physically independent, are designed to reduce the extent of RNAP poising in the intergenic regions.
This reduction depends on a mutual relationship between all three types of the interaction parameters (ssDNA, dsDNA and melting energies), which strongly suggests that binding specificities of σ70–DNA interaction domains are tuned to avoid a large number of poised promoters in the intergenic regions.
As further evidence that reduction of RNAP poising is a major 'design' constraint on specificities of σ70–DNA interaction domains, we found that the actual σ70 binding specificities lead to a much larger correlation between binding affinity and transcription rate compared with substitutions of these domains with specificities of other E. coli transcription factors.
It is interesting that the reduction in the number of poised promoters depends on the binding specificities of s70 interaction domains , rather than on the sequence of the intergenic regions . 
Such design may allow modularity in reduction of RNAP binding through different bacterial species : while binding specificities of s70 interaction domains are known to be well conserved across different bacteria ,5 DNA sequences of the intergenic regions are widely different . 
Therefore , imposing the reduction in the number of poised promoters at the level of ( conserved ) interaction domains , rather than at the level of ( variable ) DNA sequence , provides a straightforward strategy to impose reduction of RNAP poising in diverse bacterial sequences . 
Furthermore , there are likely numerous simultaneous constraints on bacterial regulatory ( intergenic ) regions , since these regions must accommodate a number of functional motifs ( e.g. core promoters , transcription factor binding sites , terminators ) . 
Due to this , tuning the binding specificities of σ70 interaction domains may be easier than imposing the absence of poised promoters at the level of DNA sequence . 
The fact that σ70 interaction domains are designed to reduce the number of poised promoters implies that any DNA sequence will have a tendency for high correlation between the binding affinity and the transcription activity . 
Such high correlation was also observed for DNA sequences of transcriptionally active promoters . 
However , we found that DNA sequences of these promoters have a tendency to decrease this correlation , i.e. to increase the extent of RNAP poising . 
This finding is surprising , since one may expect that transcriptionally active sequences should evade RNAP poising . 
To better understand this result , it is useful to discuss it from the point of view of the recently proposed mix-and-match model of promoter recognition . 
This model proposes that strengths of promoter elements mix with each other , and match each other's strengths , so as to achieve the necessary level of promoter strength .21,26 For example , a weaker −10 element may be complemented by a stronger −35 element , so that a necessary level of transcription activity is achieved .28 Indeed , Fig. 4A shows that many substitutions of the −10 element of the lacUV5 promoter with −10 elements that correspond to the experimentally detected TSS fall below either the binding affinity or the transcription rate threshold . 
It is likely that , for a substantial number of such −10 elements , the strengths of the other elements within the promoter ( −35 element , spacer ) are adjusted ( ` matched ' ) so that the kinetic parameters for the entire promoter are above the thresholds . 
Furthermore , one should note that some of the known promoters depend on transcription factors in order to achieve sufficient binding affinity and transcription rate , so that their basal values of the kinetic parameters are below the relevant thresholds . 
We here propose to extend the mix-and-match model to the kinetic parameters ; consequently , the observed decrease in the correlation between the binding affinity and the transcription activity can be explained by the need to match the lower transition rate from the closed to open complex with higher binding affinity . 
Our results show that , though statistically significant , this decrease in the correlation is still small enough as not to turn a transcriptionally active promoter into a poised promoter . 
That is , the observed increase of RNAP poising at functional promoters is large enough to allow matching of the kinetic parameters , but not so large as to cause dysfunctional transcription . 
We here predicted that a significant fraction of the strongly bound sequences correspond to poised promoters . 
This prediction may have a direct consequence on experiments that identify transcription start sites by detecting sequences to which RNAP strongly binds , such as ChIP-chip or ChIP-seq experiments ; such measurements provide an experimental strategy to detect transcription start sites on a genome-wide scale . 
Indeed , it is interesting that the number of poised promoters estimated here ( ~30% of the strongly bound sequences ) roughly matches the reported number of false positives in ChIP-chip experiments .16 However , care must be taken when directly comparing false positives in ChIP-chip experiments with our in silico results , due to possibly different choices of the binding thresholds . 
That is , the binding threshold is to a good degree provisional in ChIP-chip experiments , i.e. it depends on the signal intensity above which the sequences are considered to be targets . 
Therefore , the binding threshold is likely different from one ChIP-chip experiment to the other , and may also be different from the choice of binding threshold in our study . 
Consequently , it is likely that false positives in ChIP-chip experiments come from both sequences that are poised promoters and from technical issues such as biases in DNA amplification or imperfect immunoprecipitations of DNA fragments cross-linked to protein . 
Furthermore , the importance of the kinetic effects strongly suggests that they should be incorporated in bioinformatic methods for TSS detection . 
In fact , TSS detection in bacteria is a classic bioinformatic problem , where available methods show poor accuracy .18b,24,29 An alternative to current methods , which are based on information theory , is a biophysical method that would detect promoters based on the calculated transcription rate . 
A major difficulty in developing such a method is that interactions of σ70 with the −35 element have ( to our knowledge ) not been measured until now . 
Note that in our in silico experiments we varied the −10 element , while the sequence of the −35 element remained constant . 
While such a design is evidently useful for studying the interplay of physical interactions at the −10 element , it is not convenient for promoter detection , since promoters sample sequences with variable −35 elements . 
A solution to this problem may be a mixed bioinformatic and biochemical parameterization , which is the subject of our work currently in progress . 
In this work , we investigated kinetic effects of transcription initiation on a genome-wide scale . 
Such analysis is , to our knowledge , the first of its kind , since there is currently no high-throughput method for measuring kinetic parameters of transcription initiation for sequences of interest . 
Consequently , the kinetic parameters have to be experimentally measured through classical , but time-consuming , τ-plot measurements , individually for each sequence of interest . 
To overcome this difficulty , we here used a quantitative model of transcription initiation , which showed a very good agreement with experimental data , and which allows efficient calculation of the kinetic parameters . 
The computational procedure also allowed repeatedly altering both the specificities of σ70-DNA interaction domains and the relevant DNA sequences , which is not experimentally feasible . 
We consequently designed a set of in silico experiments , which use a model of the specific biochemical process ( transcription initiation ) , in order to study kinetics of transcription initiation on a much larger ( whole genome ) scale . 
Through the in silico experiments we found that the extent of RNAP poising in the genome is highly suppressed , where this suppression is at the level of σ70 interaction domains , rather than the DNA sequence . 
However , despite this suppression , a significant fraction of the sequences that are strongly bound by RNAP correspond to poised promoters . 
This significant fraction of poised promoters is directly relevant for interpreting results of experimental and computational searches of transcription start sites . 
Furthermore , we surprisingly found that sequences of the functional promoters increase the extent of RNAP poising , which we interpreted in terms of the mix-and-match model of promoter recognition . 
Overall , the analysis presented here strongly suggests that the kinetic effects are important , and that they should be incorporated in methods for core promoter detection . 
It is likely that this will allow both increasing the accuracy of computational predictions and better understanding the results of the experimental searches . 
5 Methods
5.1 Calculation of the kinetic parameters
To calculate the relevant kinetic parameters , we use a biophysical model of transcription initiation .8 For completeness , in ESI , † we summarize elements of this model that are directly relevant for the analysis presented here . 
Briefly , the model is used to express the rate by which RNAP opens the two DNA strands , in terms of the interactions of σ70 with ssDNA and dsDNA , and DNA melting energies . 
To parameterize the model , we use a widely used independent nucleotide approximation ,30 according to which the interaction energies are given by the sum of the terms that correspond to different bases at different positions . 
Also , in this study we vary only the sequence of the −10 element , so that the energy terms that are associated with −35 element interactions and spacer lengths do not enter the relevant equations . 
Consequently , the binding affinity $K_B$ , the rate of transition from the closed to open complex $k_f$ , and the rate of transcription initiation $\varphi$ are given below , respectively , by eqn ( 1 ) , ( 2 ) and ( 3 ) ( see ref . 8 and ESI † ) : 

$$\log K_B \sim -\sum_{i=1}^{6} \sum_{\alpha=1}^{4} \frac{\Delta G^{(ds)}_{i,\alpha}}{k_B T} \, S_{i,\alpha} \tag{1}$$

$$\log k_f \sim -\sum_{i=2}^{6} \sum_{\alpha=1}^{4} \frac{\Delta G^{(ss)}_{i,\alpha} + \Delta G^{(m)}_{\alpha} - \Delta G^{(ds)}_{i,\alpha}}{k_B T} \, S_{i,\alpha} \tag{2}$$

$$\log \varphi \sim -\sum_{i=1}^{6} \sum_{\alpha=1}^{4} \Delta G^{(eff)}_{i,\alpha} \, S_{i,\alpha} \tag{3}$$

where in the last equation we introduced the effective binding energy $\Delta G^{(eff)}_{i,\alpha}$ : 

$$\Delta G^{(eff)}_{i,\alpha} = \begin{cases} \left( \Delta G^{(ss)}_{i,\alpha} + \Delta G^{(m)}_{\alpha} \right) / k_B T & \text{for } i \in (2, 6) \\ \Delta G^{(ds)}_{i,\alpha} / k_B T & \text{for } i = 1 \end{cases} \tag{4}$$

In the equations above , the index $i$ denotes different positions within the −10 box , so that $i = 1$ corresponds to the position −12 , while $i = 6$ corresponds to the position −7 , relative to the transcription start site . 
Further , $\alpha$ denotes the four different bases ( A , T , C or G ) , while $S_{i,\alpha}$ is equal to one if base $\alpha$ is present at position $i$ in sequence $S$ , and is equal to zero otherwise . 
Furthermore , $\Delta G^{(m)}_{\alpha}$ denotes the melting energies of different bases , $\Delta G^{(ss)}_{i,\alpha}$ denotes the interaction energies of σ with different bases at different positions of the non-template strand in the open complex , and $\Delta G^{(ds)}_{i,\alpha}$ denotes the interaction energies of σ with different bases at different positions of duplex DNA for the −10 box . 
Note that the base −12 ( $i = 1$ ) appears asymmetrically in the expression for the effective energy ( see eqn ( 4 ) ) , since this is the only base of the −10 element that remains double stranded in the open complex .6 Also , note that due to the symmetry of the two DNA strands $\Delta G^{(m)}_{A} = \Delta G^{(m)}_{T}$ and $\Delta G^{(m)}_{C} = \Delta G^{(m)}_{G}$ , so that there are effectively two parameters that determine melting energy in the single nucleotide approximation . 
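
As an illustration of how these expressions can be evaluated for a single −10 element, a minimal Python sketch follows. It assumes that all energies are already expressed in units of k_BT, that the logarithms are defined up to additive constants, that log k_f is minus the free energy difference between the open and closed complexes over positions −11 to −7, and that the initiation rate is proportional to the product of the binding affinity and the opening rate. The function names and all numerical values are hypothetical placeholders, not the measured interaction parameters.

```python
import random

BASES = "ATCG"  # column order assumed for every energy matrix below


def effective_energy(i, a, dg_ds, dg_ss, dg_m):
    """Effective energy: duplex term at position -12 (i == 0), ssDNA + melting elsewhere."""
    return dg_ds[i][a] if i == 0 else dg_ss[i][a] + dg_m[a]


def kinetic_parameters(seq, dg_ds, dg_ss, dg_m):
    """Return (log K_B, log k_f, log rate) for a 6-bp -10 element, up to constants.

    dg_ds[i][a]: duplex-DNA interaction energy at position i for base a
    dg_ss[i][a]: ssDNA interaction energy (row 0, position -12, is never used)
    dg_m[a]:     melting energy of base a
    """
    log_kb = 0.0
    log_kf = 0.0
    for i, base in enumerate(seq):
        a = BASES.index(base)
        log_kb -= dg_ds[i][a]
        if i >= 1:  # positions -11..-7 melt; position -12 stays double stranded
            log_kf -= dg_ss[i][a] + dg_m[a] - dg_ds[i][a]
    # Assumption: initiation rate ~ K_B * k_f, so the logarithms add.
    return log_kb, log_kf, log_kb + log_kf


# Placeholder energies (random numbers, NOT measured values).
rng = random.Random(0)
dg_ds = [[rng.uniform(-1, 1) for _ in BASES] for _ in range(6)]
dg_ss = [[rng.uniform(-1, 1) for _ in BASES] for _ in range(6)]
dg_m = [1.0, 1.0, 2.0, 2.0]  # A = T and C = G by strand symmetry

print(kinetic_parameters("TATAAT", dg_ds, dg_ss, dg_m))
```

Under these assumptions the third returned value equals minus the sum of the effective energies over the six positions, which is why position −12 contributes only through its duplex term.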
5.2 Alignment of −10 promoter elements
To align −10 elements , we use the assembly of transcription start sites from the RegulonDB database .25 This assembly includes both experimentally verified promoters and computational predictions , and corresponds to both σ70 and alternative σ factors . 
For our alignment , we select only experimentally verified σ70 transcription start sites , i.e. we disregard all transcription start sites that are either not experimentally validated , or correspond to alternative σ factors . 
This selection results in a total of 342 σ70 transcription start sites , and we use the obtained start sites in order to extract DNA segments that correspond to positions −17 to −2 , relative to the transcription start sites . 
These positions were chosen having in mind that the position of the −10 element can deviate by up to 5 bps relative to its canonical position ( −12 to −7 ) .31 
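
The segment extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the transcription start site is taken as position +1 with no position 0, coordinates are 0-based indices on the forward strand, and the function names and the toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def upstream_segment(genome, tss_index, strand="+", first=-17, last=-2):
    """Return promoter positions first..last (inclusive) relative to the TSS (+1).

    genome    : forward-strand sequence
    tss_index : 0-based index of the TSS on the forward strand
    strand    : "+" or "-" (direction of transcription)
    """
    if strand == "+":
        start, end = tss_index + first, tss_index + last + 1
    else:
        # On the reverse strand, upstream positions lie at higher coordinates.
        start, end = tss_index - last, tss_index - first + 1
    if start < 0 or end > len(genome):
        raise ValueError("segment runs off the end of the sequence")
    segment = genome[start:end]
    if strand == "+":
        return segment
    return segment.translate(COMPLEMENT)[::-1]  # reverse complement


def candidate_windows(segment, width=6):
    """All contiguous windows of the given width within the segment."""
    return [segment[k:k + width] for k in range(len(segment) - width + 1)]


# Toy sequence with a consensus-like hexamer at the canonical position.
toy = "C" * 10 + "TATAAT" + "G" * 6 + "A" + "C" * 10
segment = upstream_segment(toy, tss_index=22)
print(segment, len(candidate_windows(segment)))
```

A 16 bp segment contains 11 candidate 6 bp windows, corresponding to shifts of −5 to +5 bps around the canonical −12 to −7 position.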
To identify the 6 bp long −10 elements within the selected DNA segments , we used the Gibbs sampler .32 The Gibbs sampler implements a version of the Gibbs search algorithm ,33 which is used to perform unsupervised motif alignment . 
Only the DNA strand defined by the direction of transcription was searched , since neither the −10 box nor the −35 box motif is palindromic . 
The search was done with the initial assumption that one motif element is present in each DNA segment ; however , at the end of the Gibbs sampler search , individual motif elements are added in or taken out , in a single pass of the algorithm , depending upon whether or not their inclusion improves the value of the alignment score . 
The last step allows excluding from the alignment those sequences that do not contain −10 box motifs , e.g. due to database misassignments . 
The search resulted in the identification of 322 aligned −10 boxes that correspond to the experimentally confirmed σ70 transcription start sites in E. coli ; these aligned −10 elements were used in the further analysis . 
5.3 Randomization of interaction specificities and DNA segments 
We aim to randomize the interaction specificities , without changing the overall strength of σ70-DNA interactions . 
To achieve this , it is useful to visualize the interaction parameters in the form of a matrix , where index $i$ corresponds to different positions within the −10 element , while index $\alpha$ corresponds to four different bases . 
Overall interaction strength for energy matrix $\varepsilon_{i,\alpha}$ can be defined as $\sum_{i,\alpha} \varepsilon_{i,\alpha}^{2}$ .19b Consequently , to randomize the interaction specificities , we randomly permute elements of the interaction matrix , which randomizes the interaction specificity but does not change $\sum_{i,\alpha} \varepsilon_{i,\alpha}^{2}$ . 
In order to obtain statistics for quantities of interest , we randomize a given matrix 50 times , according to the procedure described above . 
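
A minimal sketch of this randomization is given below, assuming the energy matrix is stored as a list of rows (one row per position, one column per base). The helper names and the example values are hypothetical. Since the procedure only rearranges the entries, the sum of their squares is unchanged.

```python
import random


def overall_strength(matrix):
    """Sum of squared matrix elements, used here as the overall interaction strength."""
    return sum(e * e for row in matrix for e in row)


def permute_matrix(matrix, rng):
    """Randomly permute all entries of a positions-by-bases energy matrix."""
    n_cols = len(matrix[0])
    flat = [e for row in matrix for e in row]
    rng.shuffle(flat)  # in-place permutation of all entries
    return [flat[r * n_cols:(r + 1) * n_cols] for r in range(len(matrix))]


# Placeholder 6x4 matrix (random numbers, NOT measured energies).
rng = random.Random(0)
energies = [[rng.uniform(-2, 2) for _ in range(4)] for _ in range(6)]

# 50 randomized matrices, as in the procedure described in the text.
randomized = [permute_matrix(energies, rng) for _ in range(50)]
print(overall_strength(energies), overall_strength(randomized[0]))
```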
In order to randomize the interactions corresponding to DNA melting , we simply permute energies that correspond to AT ( $\Delta G^{(m)}_{A} = \Delta G^{(m)}_{T}$ ) and GC ( $\Delta G^{(m)}_{C} = \Delta G^{(m)}_{G}$ ) base pairs . 
This procedure results in a single randomization , and is a consequence of the fact that in the single nucleotide approximation there are only two parameters that describe DNA melting ( see above ) . 
We randomize DNA sequences , i.e. intergenic regions and −10 elements that correspond to the experimentally confirmed transcription start sites ( see above ) , by randomly permuting the bases within the sequences . 
Note that such randomization preserves nucleotide ( GC ) content of the sequences . 
Similar to σ70-DNA interaction domains , to obtain appropriate statistics we randomize a given DNA sequence 50 times . 
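
The sequence randomization can be sketched in the same way. The snippet below is an illustration only, with hypothetical function names and a toy input sequence. It permutes the bases of a sequence, so the counts of A, T, C and G, and hence the GC content, are the same in every randomized copy.

```python
import random
from collections import Counter


def shuffle_sequence(seq, rng):
    """Return a random permutation of the bases in seq."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def randomized_ensemble(seq, n=50, seed=0):
    """Generate n base-permuted copies of seq (n = 50 follows the text)."""
    rng = random.Random(seed)
    return [shuffle_sequence(seq, rng) for _ in range(n)]


toy_sequence = "ATGCGCTATAATGGCATTTACG"  # toy input, not a real intergenic region
ensemble = randomized_ensemble(toy_sequence)
print(Counter(toy_sequence) == Counter(ensemble[0]))
```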
5.4 Interaction parameters for E. coli transcription factors
We use protein-DNA interaction parameters that were obtained in ref . 19b . 
These interaction parameters were inferred from E. coli transcription factor binding sites which were assembled in DPInteract database .24 The interaction parameters were inferred from the example binding sites by using the QPMEME ( Quadratic Programming Method of Energy Matrix Estimation ) algorithm . 
To ensure a high accuracy of the inferred protein-DNA interaction parameters , we select those transcription factors ( i.e. their corresponding interaction parameters ) for which the following two conditions are satisfied : ( i ) the number of the example binding sites assembled in DPInteract database is larger than 10 , ( ii ) overrepresentation for the transcription factor is also larger than 10 . 
The first condition ensures that too few example binding sites do not lead to overfitting of the interaction parameters . 
The second condition ( overrepresentation ) is related to a measure of significance/functionality of the inferred interaction parameters .19b This procedure results in selection of the interaction parameters for eight E. coli transcription factors . 
We then use the inferred interaction parameters for the selected E. coli transcription factors in order to substitute interaction specificities of σ2.3 ( σ70-ssDNA interactions ) and σ2.4 ( σ70-dsDNA interactions ) binding domains . 
A technical difficulty is that the length of σ2.3 and σ2.4 binding sites ( 5 bps and 6 bps , respectively ) is generally different ( shorter ) than the length of binding sites of the selected E. coli transcription factors . 
To resolve this difficulty , we select a subset of adjacent positions that correspond to maximal binding specificity within the interaction domain of each transcription factor ; the length of the selected adjacent positions corresponds to the length of σ2.3 or σ2.4 binding positions ( i.e. 5 bps or 6 bps ) . 
To select the adjacent positions with maximal specificity , we use a definition of the binding specificity $s_i$ at position $i$ of the energy matrix $\varepsilon_{i,\alpha}$ : $s_i = \sum_{\alpha} \varepsilon_{i,\alpha}^{2}$ . 
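
A possible implementation of this selection is sketched below. It takes the specificity of a position to be the sum of the squared energy-matrix elements over the four bases at that position, and it assumes that the specificity of a block of adjacent positions is the sum of the per-position values, with ties resolved in favor of the leftmost block. Both the scoring of a block and the function names are assumptions made for illustration.

```python
def position_specificity(matrix):
    """Per-position specificity: sum over bases of the squared energy elements."""
    return [sum(e * e for e in row) for row in matrix]


def most_specific_window(matrix, width):
    """Return (start, submatrix) for the adjacent positions with maximal summed specificity.

    matrix : list of rows, one row of four energies per binding-site position
    width  : number of adjacent positions to keep (e.g. 5 or 6)
    """
    spec = position_specificity(matrix)
    if width > len(spec):
        raise ValueError("binding site is shorter than the requested window")
    starts = range(len(spec) - width + 1)
    # max() returns the first maximum, so ties go to the leftmost window.
    best = max(starts, key=lambda k: sum(spec[k:k + width]))
    return best, matrix[best:best + width]


# Toy 4-position matrix (placeholder values).
toy_matrix = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, -1.0, 0.0, 0.0],
    [2.0, -2.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
]
start, window = most_specific_window(toy_matrix, width=2)
print(start, window)
```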