Structure, function, and evolution of genome-wide regulatory networks

Most projects that we pursue concern the functioning and evolution of genome-wide regulatory systems in organisms ranging from bacteria to humans. The type of large-scale questions that we aim to answer include understanding how regulatory systems are integrated on a genome-wide scale, how regulatory networks are structured, how these systems handle and potentially exploit the inherent noise in gene regulatory processes, how stable cell types are defined and maintained, understanding how gene regulation evolves, and understanding under what conditions regulatory complexity can be expected to increase in evolution. Another major topic of interest in our group is to use large-scale comparative genomic analysis to develop quantitative theories of genome evolution. We are particularly interested in attempting to put theories of genome evolution on a strictly empirical footing.

Our group pursues both theoretical/computational and experimental approaches, and our projects can roughly be divided into `dry lab’ projects that mainly concern the structure and function of gene regulatory networks in higher eukaryotes, and `wet lab’ projects that mainly concern the evolution of gene regulation at the single cell level in E. coli.

From regulatory site constellations to dynamic gene regulatory programs

To help understand how regulatory networks function on a genome-wide scale in higher eukaryotic organisms, we have been developing Bayesian probabilistic methods that combine information from high-throughput experiments (e.g. RNA-seq, ChIP-seq) with comparative genomic sequence analysis. The development of these methods is often done in collaboration with experimental groups that provide high-throughput data. Very roughly speaking, our projects concern identifying regulatory sites genome-wide in DNA and RNA, understanding how constellations of regulatory sites determine binding patterns of transcription factors and, ultimately, gene expression patterns. Finally, we aim to develop quantitative and predictive models that describe how dynamic interactions between transcriptional and post-transcriptional regulators implement gene regulatory programs that define cellular states and the transitions between them.

Regulatory site prediction

We have been developing methods for identifying regulatory sites in DNA and RNA sequences for over a decade and continue to work on improving these methods. In one current project we are working on extending the well-known position-specific weight matrix models of transcription factor (TF) binding specificity into dinucleotide weight tensor models that take arbitrary dependencies between pairs of positions into account. We are also developing completely automated procedures for analysis of ChIP-seq data, including comprehensive downstream motif analysis of binding regions. In a recent collaboration with the group of Mihaela Zavolan, we have developed a novel biophysical model or miRNA target site prediction, called MIRZA (Fig. 1). MIRZA allows prediction of miRNA target sites without resorting to ad hoc seed-based methods. In addition, application of MIRZA has shown that a substantial fraction of miRNA target sites is non-canonical, i.e. not containing a seed match. All our genome-wide regulatory site predictions are available in various formats through our SwissRegulon database and genome browser (

Fig. 1: biophysical model of miRNA-target interaction. (a) Sketch of a miRNA-mRNA hybrid illustrating the way MIRZA assigns a binding energy to the interaction. Nucleotides involved in base-pairing are indicated in orange, symmetric loops in red, bulges in the miRNA in blue and dangling ends in cyan. Arrows point from the independent energy terms (defined in Online Methods) to the corresponding structural elements (base pairs, loop openings and extensions). (b) Summary of energy parameters inferred from 100 independent optimization runs on the Ago2-CLIP data. Green boxes show interquartile ranges, the 5th and 95th percentiles are indicated by whiskers, and black dots indicate median values of fitted parameters across the runs. (c) Summary of the predicted hybrid structures; colors indicate the fraction of hybrids in which a given nucleotide is involved in a base pair (orange), symmetric loop (red), bulge (blue) or dangling end (cyan).

Motif Activity Response Analysis

To take a first step toward modeling how constellations of regulatory sites determine genome-wide expression patterns we developed an approach, called Motif Activity Response Analysis (MARA), that models the expression of each gene as a linear function of the binding sites that occur in its promoter and unknown ‘motif activities’ that represent the condition-dependent activities of the regulators binding to these sites. Since the original presentation of this approach, in the FANTOM4 collaboration with the RIKEN Institute in Yokohama, Japan, we have been working both on completely automating the MARA approach and on extending it in a number of ways.
One of our key current interests is to understand how the interplay between chromatin state and the actions of TFs controls gene regulation in higher eukaryotes. In the context of the CellPlasticity project, we have extended MARA to model how local constellations of regulatory sites ultimately guide the local epigenetic state of the chromatin, and successfully applied this to a number of model systems including the epithelial-to-mesenchyme transition in cancer, and T cell development. For example, in an initial proof of concept we showed that this epi-MARA approach can successfully predict TFs that are involved in the recruitment of Polycomb repression and tri-methylation of histone 3 at lysine 27 in mammals.

In another project we are developing models for the ways in which TFs and nucleosomes interact to determine genome-wide DNA accessibility and nucleosome positioning patterns. In a recent study in Saccharomyces cerevisiae we showed that, whereas the phasing of nucleosomes over gene bodies is mainly determined by the sequence binding preferences of nucleosomes, larger nucleosome-free regions in promoters are predominantly explained by competition with TF binding. Interestingly, we found that only a relatively small subset of yeast TFs that are known to interact with chromatin modifiers are the key determinants of nucleosome exclusion. We are currently working on extending this approach to modeling genome-wide DNA accessibility patterns in mammals. In a related project, we are extending our MARA approach to include the effects of distal enhancers on gene expression patterns.


Over the last years our group has invested a significant amount of work on completely automating our MARA approach and this has now resulted in a fully functional webserver, called ISMARA (integrated system for motif activity response analysis), available at, where users can perform automated motif activity analysis of their micro-array, RNA-seq, or ChIP-seq data, simply by uploading raw data (Fig. 2). The system has already been successfully used to predict key regulatory interactions in almost a dozen of recent studies in mammals, and we are working on various further improvements and extensions of the system. This includes extension to additional model organisms such asDrosophila and E. coli, significantly extending the set of regulatory motifs that it uses, and incorporation of distal cis-regulatory modules. In addition, several of ISMARA’s recent applications involve systems that are highly medically relevant and we plan to adapt ISMARA in ways that aim to increase its medical relevance. In particular, we want to extend ISMARA to allow it to infer the effects of single nucleotide polymorphisms in predicted regulatory sites on gene expression and regulatory programs genome-wide.

Fig. 2: Outline of the Integrated System for Motif Activity Response Analysis. A: ISMARA starts from a curated genome-wide collection of promoters and their associated transcripts. Transcription factor binding sites (TFBSs) are predicted in proximal promoters and miRNA target sites are annotated in the 3’ UTRs of transcripts associated with each promoter. B: Users provide measurements of gene expression (micro-array, RNA-seq) or chromatin state (ChIP-seq). The raw data are processed automatically and a signal is calculated for each promoter and each sample. C: The site predictions and measured signals are summarized in two large matrices. The matrix N contains the total number of sites for each motif m in each promoter p and the matrix E contains the signal associated with each promoter p in each sample s. D: The linear MARA model is used to explain the signal levels in terms of regulatory sites and unknown motif activities, which are inferred by the model. E: As output, ISMARA provides the inferred motif activity profiles of all motifs across the samples sorted by the significance of the motifs. A sorted list of all predicted target promoters is provided for each motif, together with the network of known interactions between these targets and a list of Gene Ontology categories that are enriched among the predicted targets. Finally, for each motif, a local network of predicted direct regulatory interactions with other motifs is provided.

Evolving synthetic E. coli promoters: The role of noise in the evolution of gene regulation

Since 2010 our group also includes a wet lab component where we study gene regulation at the single-cell level in E. coli. In order to learn more about how natural selection may have shaped the characteristics of E. coli promoters, we set out to compare ‘native’ E. coli promoters with synthetic promoters that we evolved in the lab under carefully controlled experimental conditions, starting from entirely random sequences. To this end we use reporter constructs in which short random sequence fragments are fused upstream of GFP and these promoter sequences are then evolved using repeated rounds of selection of single cells using fluorescence-activated cell sorting (FACS), and mutation using error-prone PCR. The FACS not only allows us to precisely quantitate both the mean and noise in expression of individual promoter sequences, it also allows us to precisely control selective conditions. Moreover, using next-generation sequencing of the evolving populations, we can precisely study the effects of selection on promoter sequences.
One highly surprising finding from our studies on promoter evolution is that our synthetic promoters have noise levels that are as low as the least noisy `native’ promoters from E. coli, and that a substantial fraction of native E. coli promoters have higher noise levels than any of the synthetic promoters. To explain these observations we developed a new theory for the evolution of gene regulation that calculates the ‘fitness’ of a promoter as a function of its coupling to transcriptional regulators and the noise levels of these regulators (Fig. 3). This analysis shows that, whenever existing transcription regulators can only attain limited accuracy in implementing a promoter’s desired expression levels, selection favors noisy gene regulation and may even favor the coupling of promoters to regulators whose activities are not correlated at all to the promoter’s desired expression. The theory provides a novel framework for understanding when and how gene regulation will evolve, suggesting that noise may facilitate the evolution of gene regulatory interactions.

Fig. 3: A model of the evolution of gene expression regulation in a variable environment. (A) Illustration of a promoter’s fitness function. Expression from a given promoter genotype leads to a Gaussian distribution of expression levels across individual cells (blue curve) with mean μ and standard-deviation σ. We modeled the fitness of a cell as a Gaussian function of its expression level with optimum μe and standard-deviation τ (dotted brown curve) such that the fitness of the promoter corresponds approximately to the brown shaded area. (B) The same expression distribution, but now subjected to selection across three environments with three different desired expression levels (red, brown, and green dotted curves). The promoter’s fitness is proportional to the product of the areas under the intersections with the three selection curves, which is very small in the case illustrated here. (C) Contour plot of the log-fitness of a promoter that is optimally coupled to a regulator with signal-to-noise ratio S and correlation R. The three colored dots correspond to the parameter settings illustrated in panels D-F. (D) Expression distributions in each of the three environments (red, brown, and green solid curves) for a promoter that is optimally coupled to a regulator with a high signal-to-noise ratio S=3.3 whose activity is highly correlated (R=0.95) to the desired expression levels. The total noise, σtot, of the regulated promoter is indicated above the curves. (E) Expression distributions for a promoter that is optimally coupled to a moderately correlated regulator (R=0.64). Note that whereas the activity of the regulator results in the mean expression level of the promoter being close to the desired level in the red condition, mean expression is substantially too high in the brown environment, and too low in the green environment. (F) Expression distributions for a promoter coupled to a completely uncorrelated regulator (R=0).

A microfluidic framework for studying single-cell gene expression and growth dynamics

To allow us to track both the growth and expression dynamics of single cells we have been working on establishing a micro-fluidic setup in our lab that allows us to track growth of single cells and expression of fluorescent reporters in these cells using time-lapse microscopy. We are aiming to use this setup to pursue a number of projects on single-cell gene regulation in E. coli. First, we want to investigate the coupling between fluctuations in gene expression and the instantaneous growth rates of single cells. Second, using a collection of wild E. coli isolates we are investigating how single-cell gene regulatory dynamics of orthologous promoters has evolved across these strains. Finally, in the context of the StoNets project we are using this microfluidic system in combination with multi-color fluorescent reporters that measure the regulatory activity of different E. coli TFs to study how the joint regulatory activities of multiple TFs in E. coli are varying across time in single-cells. We are particularly interested in investigating to what extent interactions among regulators are used to attenuate noise, and to what extent they cause cells to stochastically diversify into phenotypically distinct states. In addition, we want to characterize which aspects of the joint gene expression state are determined by the environment, and which are guided by internal regulatory interactions.

Integrated Genotype/Phenotype evolution in E. coli

The availability of large numbers of complete genome sequences has led, over the last 15 years, to a revolution in our understanding of genome evolution and the identification of a number of surprising ‘quantitative laws’ of genome evolution. However, whereas the insights gained from analysis of genomic data have been impressive, they have taught us surprisingly little about what selective pressures in the wild are driving genotype dynamics. In this project we aim to learn about selection pressures that are acting in the wild by combining information on genotype evolution in closely related bacterial strains with extensive quantitative characterization of their phenotypes. In particular, using next-generation sequencing we have determined complete genomes of 95 wild E. coli isolates that were all obtained from a common location at the shore of lake Superior (Minnesota, USA). In parallel we have been characterizing the phenotypes of these strains by assessing their growth in a wide variety of conditions using a combination of automated image-analysis of cultures on agar plates, and high-throughput photo-spectroscopy to obtain quantitative growth kinetics. We have started developing theoretical models to describe the joint evolution of genotypes and phenotypes along the phylogeny relating the strains. We are aiming in particular to develop rigorous quantitative measures of the extent to which different phenotypic traits have been under natural selection in the history of these strains, and to infer how this has impacted their genomes. As part of this project we have also recently developed a new method, called REALPHY, for automatically inferring phylogenies from raw next-generation sequencing data.