Abstract
The Encyclopedia of DNA Elements (ENCODE) Project is currently the largest collection of functional information on any vertebrate genome, with extensive ChIP-seq, RNA-seq, DNase-seq and FAIRE-seq data on over 140 different transcription factors, 14 different histone modifications, 4 different chromatin accessibility assays and over 25 different RNA preparations, distributed primarily over 6 cell types (both laboratory lines and primary cell cultures). A number of these assays, in particular the chromatin accessibility assays, are also performed on a far broader range of over 200 different cell types, including many primary cell lines or tissue preparations. Careful, dedicated data production, coupled with extensive quality control and community-based standards, has ensured high-quality information across all of ENCODE.
The Analysis working group of ENCODE has developed robust, statistically sound methods to analyze these data both individually and jointly, with both hypothesis-based and hypothesis-generating methods. A surprisingly large fraction of the genome is annotated with functional, biochemical events, with an observed 9% of genomic bases in specific DNA:protein contacts and a lower-bound estimate of ~20% of bases with this property over all cell types. Even larger portions of the genome are included when considering specific histone modifications (42%) or RNA production (80%), with 99% of the genome within 1 kb of a specific biochemical event. Comparison to mammalian and population-level views of selection shows that although a substantial number of these elements are not conserved across mammals, there is evidence that a non-trivial portion of the primate-specific elements is under negative selection in the human population.
Joint analysis of these experiments shows that one can accurately predict RNA expression levels from chromatin modification patterns and, more surprisingly, from the limited TF repertoire studied here. Integrative analysis using multi-dimensional segmentation methods provides a cell-line-invariant set of functional regions which are distributed in a cell-line-specific manner, and a number of enhancers have been tested in mouse and fish systems. Analysis of transcription factor co-occupancy shows a complex network of transcription factor behavior, with different roles performed by different factors. Genome-wide association study regions are enriched in ENCODE functional regions, and there are specific associations between phenotypes and cell-type-specific or transcription factor regions, allowing novel hypotheses to be proposed about the molecular aetiology of diseases.
About the speaker
Dr Ewan Birney received his PhD from the University of Cambridge, working at the then Sanger Centre (now Wellcome Trust Sanger Institute) to develop algorithms for understanding genomes. He joined the European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI) in 2000 as a Team Leader and became a Senior Scientist at EMBL in 2003. He is one of the founders of the Ensembl Genome Browser and other databases, and has played a key role in many large-scale genomics projects, notably the sequencing of the Human Genome in 2000 and the analysis of genome function in the ENCODE project. He has been Lead Analysis Coordinator for ENCODE since 2007, and coordinated data analysis in the "1% Pilot". He is currently Associate Director of the EMBL-EBI.
Dr Birney has played a vital role in annotating the genome sequences of the human, mouse, chicken and several other organisms; this work has had a profound impact on our understanding of genomic biology. His research group currently focuses on genomic algorithms and inter-individual differences in humans and other species.
In 2012, Dr Birney was elected Fellow of the European Molecular Biology Organization. He was awarded the inaugural Francis Crick Lecture by the Royal Society in 2003 and the Chris Overton Prize and the Benjamin Franklin Award in 2005.