EPDnew databases

EPDnew is a series of species-specific databases of experimentally validated promoters. At the moment 10 species are supported: 6 animals (H. sapiens, M. musculus, D. melanogaster, A. mellifera, C. elegans and D. rerio), 2 plants (A. thaliana and Z. mays) and 2 fungi (S. cerevisiae and S. pombe). Evidence comes from TSS mapping in high-throughput experiments such as CAGE and Oligocapping.
The number of promoters for each organism is the following:
  • Animals:
    • Homo sapiens: 25503 promoters,
    • Mus musculus: 21239 promoters,
    • Drosophila melanogaster: 17452 promoters,
    • Apis mellifera: 6493 promoters,
    • Danio rerio: 10728 promoters,
    • Caenorhabditis elegans: 7120 promoters;
  • Plants:
    • Arabidopsis thaliana: 21233 promoters,
    • Zea mays: 17081 promoters;
  • Fungi:
    • Saccharomyces cerevisiae: 5117 promoters,
    • Schizosaccharomyces pombe: 3440 promoters.

Collection accessibility

EPDnew databases are accessible in three ways: (1) using the input form in the header to search for a single gene symbol, gene description or ENSEMBL / RefSeq gene ID; (2) using the Select / Download tool to search for multiple gene symbols, ENSEMBL or RefSeq gene IDs and/or to select promoters based on their genomic context (core promoter elements, CpG island, expression, ...) and download them in various formats (SGA, BED, sequences); or (3) through an FTP site for bulk download of the whole database.
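
As an illustration of working with a downloaded promoter selection, the following minimal Python sketch reads a BED file. It assumes the export follows the standard six-column BED layout (chromosome, 0-based start, exclusive end, name, score, strand); this layout is an assumption and is not stated on this page. The names `BedRecord` and `read_bed` are hypothetical helpers.

```python
from typing import Iterable, Iterator, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (standard BED convention)
    end: int     # exclusive (standard BED convention)
    name: str    # assumed to hold the promoter ID
    score: str   # kept as text; its meaning is not documented here
    strand: str  # '+' or '-'


def read_bed(lines: Iterable[str]) -> Iterator[BedRecord]:
    """Yield one record per data line, skipping blanks and header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 BED columns: {line!r}")
        chrom, start, end, name, score, strand = fields[:6]
        yield BedRecord(chrom, int(start), int(end), name, score, strand)
```

Any iterable of lines works as input, for example an open file handle. The sketch should be adjusted if the downloaded file turns out to use a different column layout.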

Viewer Page

The viewer page contains information about a single entry in the database and provides various tools for the analysis of a promoter region. It is divided into several sections, each devoted to a single task.

General Information

This section provides information for the entry:
  • Promoter ID: internal EPDnew unique promoter ID. It is composed of two parts separated by an underscore symbol ('_'). In general, the first part is the gene symbol / gene ID associated with the promoter; the second part is a number that indicates the hierarchy of promoter usage for that gene. For genes with multiple promoters, the number one (_1) marks the promoter with the highest usage (primary promoter) and is followed by all the others in decreasing order of usage.
  • Promoter type: three types of promoters are distinguished in order to account for the variety of transcription initiation patterns in eukaryotes. The type is based on the Dispersion Index (a statistic defined in Dreos et al., 2016 that is conceptually similar to the standard deviation of the observed initiation sites around the annotated TSS) and is defined as follows:
    • Single initiation site: dispersion index values between 0 and 3;
    • Multiple initiation sites: dispersion index values between 3 and 10;
    • Initiation region: dispersion index values > 10;
  • Organism: scientific and common name of the organism
  • Gene Symbol: short unique symbol that identifies the gene
  • Description of the gene: short description of the gene
  • Sequence: short sequence segment corresponding to the -49 to +10 region of the promoter. Transcribed and untranscribed nucleotides are represented by upper and lower case characters, respectively. This segment is not meant as a source of sequence data; it serves as a control string for sequence extraction.
  • Position in the genome: a eukaryotic promoter is defined as a DNA sequence around a transcription initiation site. The position of the initiation site is therefore the central part of a promoter entry. Its assignment is based directly on experimental data. A transcription initiation site may be reassigned upon analysis of new data (new database version). Chromosomes are identified by NCBI Reference Sequence (RefSeq) IDs.
  • References: references to external databases that provide additional information on the Gene or Transcript. These are often species-specific.
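
The Promoter ID and Promoter type rules listed above can be sketched in Python as follows. This is an illustration only, not the code used to build EPDnew. The function names are hypothetical, the ID is split at the last underscore on the assumption that a gene ID may itself contain underscores, and the boundary value 3 is assigned to the lower class because the text does not say which class it belongs to.

```python
from typing import NamedTuple


class PromoterId(NamedTuple):
    gene: str  # gene symbol or gene ID (first part of the ID)
    rank: int  # usage rank; 1 marks the primary promoter


def parse_promoter_id(promoter_id: str) -> PromoterId:
    """Split an ID such as 'MAPK1_1' into its gene part and usage rank."""
    # Assumption: split at the LAST underscore, since the gene part
    # may itself contain underscores.
    gene, sep, rank = promoter_id.rpartition("_")
    if not sep or not gene or not (rank.isascii() and rank.isdigit()):
        raise ValueError(f"not a valid promoter ID: {promoter_id!r}")
    return PromoterId(gene, int(rank))


def promoter_type(dispersion_index: float) -> str:
    """Map a Dispersion Index value to one of the three promoter types."""
    if dispersion_index < 0:
        raise ValueError("dispersion index cannot be negative")
    # Assumption: a value of exactly 3 counts as 'Single initiation site'.
    if dispersion_index <= 3:
        return "Single initiation site"
    # 'Initiation region' is defined as > 10, so 10 itself falls here.
    if dispersion_index <= 10:
        return "Multiple initiation sites"
    return "Initiation region"
```

For example, `parse_promoter_id("MAPK1_1")` yields the gene part `MAPK1` with rank 1, which marks the primary promoter of that gene.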

Promoter Image

This section provides visual information about the genomic context of a promoter. The image is derived from the UCSC Genome Browser and can be reproduced by loading the EPDnew Hub. It is designed to help scientists judge the quality of the annotated promoter.

This is an example of a promoter image from the human promoter MAPK1_1:

[Image: promoter image for MAPK1_1]

The image is conceptually divided into three sections:

  • Experimental evidence of promoter activity in the form of histone modifications (H3K4me3 and H3K4me1), Pol-II and RNA-seq data from 5'-end mapping techniques (CAGE, GRO-cap, etc.). Care is taken to represent each track in an optimal fashion. For instance, different bin sizes are used to convert different ChIP-seq datasets into wiggle files, taking into account their variation in tag density in promoter regions. To visualize the nucleosome architecture of promoters, we exclusively selected ChIP-seq data generated with MNase digestion rather than sonication, as only the former type of data achieves single-nucleosome resolution. Importantly, the RNA-seq tracks are at single base-pair resolution and display all experimental evidence that has been used for defining the reference TSS position (All CAGE tracks) and a selection of representative cell types or tissues in a composite track.
  • Annotation tracks for Genes and EPDnew promoters. The EPDnew track is a representation of the sequence provided in the 'General Information' section. The narrow segment represents the promoter section from base -49 to -1, whereas the thick section represents the region 0 to 10. The Gene track corresponds to the annotation used during the first step of making an EPDnew database.
  • Sequence-derived tracks such as CpG islands, conserved transcription factor binding sites and conservation scores.
Note that for some organisms some experimental and / or sequence-derived data might be missing.

Sequence Retrieval Tool

The sequence retrieval tool allows the extraction of a sequence of any length around the annotated promoter. To extract a sequence, simply select the sequence range (in base pairs) and click 'GET SEQUENCE'. The option 'lower case upstream TSS' outputs lower-case letters in the upstream region to facilitate the identification of the TSS within the sequence (the first uppercase letter in the sequence represents the TSS).
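
When the 'lower case upstream TSS' option is used, the TSS can be located programmatically from the letter case alone. The sketch below assumes the input is the bare sequence string (no FASTA header line, no whitespace); `tss_index` and `split_at_tss` are hypothetical helper names.

```python
from typing import Tuple


def tss_index(sequence: str) -> int:
    """Return the 0-based index of the first uppercase letter (the TSS)."""
    for i, base in enumerate(sequence):
        if base.isupper():
            return i
    raise ValueError("no uppercase letter found; TSS cannot be located")


def split_at_tss(sequence: str) -> Tuple[str, str]:
    """Split a sequence into (upstream, TSS and downstream) parts."""
    i = tss_index(sequence)
    return sequence[:i], sequence[i:]
```

For the toy string `acgtACGT`, `tss_index` returns 4 and `split_at_tss` returns `("acgt", "ACGT")`.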

Search Motif Tool

The search motif tool allows scientists to scan promoter regions with Position Weight Matrices (PWMs) of several transcription factors and core promoter elements to find putative binding sites. Once a PWM library and a TF have been selected, the region can be scanned with that PWM. Hits are marked as red rectangles in the plot and their exact positions relative to the TSS are reported below the plot. Motif libraries are from the JASPAR database and the EPD Promoter Elements. The scan is performed on-the-fly using the FindM tool from the SSA toolkit.
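
To make the idea of a PWM scan concrete, here is a minimal Python sketch of the general technique. It is not the FindM implementation and does not reproduce its options. The matrix `TOY_PWM` is a made-up toy example, not a JASPAR or EPD matrix; the sketch scans the forward strand only and reports positions as offsets from the TSS with the TSS itself at 0, which is an assumed convention.

```python
from typing import Dict, List, Tuple

Pwm = List[Dict[str, float]]  # one {base: weight} column per motif position

# Toy matrix favouring the word TATA; the weights are invented for illustration.
TOY_PWM: Pwm = [
    {"A": -2, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -2},
    {"A": -2, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -2},
]


def scan_pwm(sequence: str, pwm: Pwm, tss_index: int,
             cutoff: float) -> List[Tuple[int, float]]:
    """Return (offset from TSS, score) for every window scoring >= cutoff."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        try:
            # The window score is the sum of the per-position weights.
            score = sum(column[base] for column, base in zip(pwm, window))
        except KeyError:
            continue  # skip windows containing non-ACGT characters such as N
        if score >= cutoff:
            hits.append((start - tss_index, score))
    return hits
```

With the sequence `cctataggA`, whose TSS is the uppercase letter at index 8, `scan_pwm("cctataggA", TOY_PWM, 8, 10)` reports a single hit starting 6 bases upstream of the TSS.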

External Resources

Links to external genome browsers.

References

EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Dreos, R., Ambrosini, G., Périer, R., Bucher, P. Nucleic Acids Res. (2013) 41(Database issue):D157-64; PUBMED 23193273
Last update May 2017