EPDnew databases

EPDnew is a set of species-specific databases of experimentally validated promoters. Currently, 15 organisms are supported: 10 animals (H. sapiens, M. mulatta, M. musculus, R. norvegicus, G. gallus, C. familiaris, D. melanogaster, A. mellifera, C. elegans and D. rerio), 2 plants (A. thaliana and Z. mays), 2 fungi (S. cerevisiae and S. pombe) and 1 invertebrate (P. falciparum). Evidence comes from TSS mapping data generated from high-throughput experiments such as CAGE and Oligocapping.
The number of promoters for each organism is as follows:
  • Animals:
    • Homo sapiens: 29598 promoters,
    • Macaca mulatta: 9575 promoters,
    • Mus musculus: 25111 promoters,
    • Rattus norvegicus: 12601 promoters,
    • Gallus gallus: 6127 promoters,
    • Canis familiaris: 7545 promoters,
    • Drosophila melanogaster: 16972 promoters,
    • Apis mellifera: 6493 promoters,
    • Danio rerio: 10728 promoters,
    • Caenorhabditis elegans: 7120 promoters;
  • Plants:
    • Arabidopsis thaliana: 22703 promoters,
    • Zea mays: 17081 promoters;
  • Fungi:
    • Saccharomyces cerevisiae: 5117 promoters,
    • Schizosaccharomyces pombe: 4802 promoters;
  • Invertebrates:
    • Plasmodium falciparum: 5597 promoters.

Collection accessibility

EPDnew databases are accessible in different ways:
  • using the input form in the header to search for a single gene symbol, gene description or ENSEMBL/RefSeq gene ID,
  • using the Select / Download tool to search for multiple EPDnew IDs or ENSEMBL/RefSeq gene IDs, and/or to select promoters based on their genomic context (core promoter elements, CpG island, expression, etc.) and download them in various formats (SGA, BED, sequences), or
  • through an FTP website for bulk downloads.

Viewer Page

The viewer page contains information about a single entry in the database and provides various tools for the analysis of a promoter region. It is divided into several sections, each devoted to a single task.

General Information

This section provides information for the entry:
  • Promoter ID: internal EPDnew unique promoter ID. It is composed of two parts separated by an underscore symbol ('_'). In general, the first part is the gene symbol/ID associated with the promoter; the second part is a number indicating the hierarchy of promoter usage for that gene. For genes with multiple promoters, "_1" marks the promoter with the highest usage (primary promoter) and is followed by all the others in decreasing order of usage.
  • Promoter type: three types of promoters are distinguished, reflecting the variety of transcription initiation patterns in Eukaryotes. They are based on the Dispersion Index (a statistic defined in Dreos et al., 2016 that is conceptually similar to the standard deviation of the observed initiation sites around the annotated TSS) and are defined as follows:
    • Single initiation site: dispersion index value between 0 and 3;
    • Multiple initiation sites: dispersion index value between 3 and 10;
    • Initiation region: dispersion index value > 10;
  • Organism: scientific and common name of the organism
  • Gene Symbol: short unique symbol that identifies the gene
  • Description of the gene: short description of the gene
  • Sequence: short sequence segment corresponding to the -49 to +10 region of the promoter. Transcribed and untranscribed nucleotides are represented by upper and lower case characters, respectively. This segment is not meant to be used as sequence data; it serves as a control string for checking sequence extraction.
  • Position in the genome: a eukaryotic promoter is defined as a DNA sequence around a transcription initiation site. The position of the initiation site is therefore the central part of a promoter entry. Its assignment is based directly on experimental data. A transcription initiation site may be reassigned upon analysis of new data (new database version). Chromosomes are defined by NCBI Reference Sequence (RefSeq) IDs.
  • References: references to external databases that provide additional information on the Gene or Transcript. These are often species-specific.
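
The Promoter ID and Promoter type conventions described above can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of any EPDnew tool. The text does not say which class the boundary value 3 belongs to, so assigning it to the lower class is an assumption; the value 10 is placed in "Multiple initiation sites" because "Initiation region" is defined as strictly greater than 10.

```python
def split_promoter_id(promoter_id):
    """Split an EPDnew promoter ID such as 'MAPK1_1' into (gene, usage_rank)."""
    # Split on the LAST underscore, in case the gene symbol/ID contains one itself.
    gene, _, rank = promoter_id.rpartition("_")
    if not gene or not rank.isdigit():
        raise ValueError(f"unexpected promoter ID: {promoter_id!r}")
    return gene, int(rank)


def promoter_type(dispersion_index):
    """Map a dispersion index to one of the three promoter types."""
    if dispersion_index < 0:
        raise ValueError("dispersion index cannot be negative")
    # Assumption: a value of exactly 3 is counted as 'Single initiation site'.
    if dispersion_index <= 3:
        return "Single initiation site"
    if dispersion_index <= 10:
        return "Multiple initiation sites"
    return "Initiation region"
```

For example, `split_promoter_id("MAPK1_1")` gives the gene symbol `MAPK1` and rank `1`, which marks the primary promoter of that gene.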

Promoter Image

This section provides visual information about the genomic context of a promoter. The image is derived from the UCSC Genome Browser and can be reproduced by loading the EPDnew hub. It is designed to help scientists judge the quality of the annotated promoter.

This is an example of a promoter image from the human promoter MAPK1_1:

[Image: promoter image for MAPK1_1]

The image is conceptually divided into three sections:

  • Experimental evidence of the promoter activity in the form of histone modifications (H3K4me3 and H3K4me1), RNA polymerase II and RNA-seq data from 5'-end mapping techniques (CAGE, GRO-cap, etc). Care is taken to represent each track in an optimal fashion. For instance, different bin sizes are used to convert different ChIP-seq datasets into wiggle files, taking into account their variation in tag density in promoter regions. To visualize the nucleosome architecture of promoters, we exclusively selected ChIP-Seq data generated with MNase digestion rather than sonication, as only the former type of data achieves single-nucleosome resolution. Importantly, the RNA-seq tracks are at single base-pair resolution and display all experimental evidence that has been used for defining the reference TSS position (All CAGE tracks) and a selection of representative cell types or tissues in a composite track.
  • Annotation tracks for Genes and EPDnew promoters. The EPDnew track is a representation of the sequence provided in the 'General Information' section. The narrow segment represents the promoter section from base -49 to -1 whereas the thick section represents the region 0 to 10. The Gene track corresponds to the annotation used during the first step of making an EPDnew database.
  • Sequence-derived tracks such as CpG islands, conserved transcription factor binding sites and conservation scores.
Note that depending on the organism, some experimental and / or sequence-derived data might be missing.
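
As an illustration of the narrow and thick segments of the EPDnew track, the following Python sketch computes the corresponding genomic intervals for a given TSS. It is only a sketch under stated assumptions: the function name is hypothetical, the TSS is taken as a 1-based genomic coordinate at relative position 0, and the output uses 0-based, half-open intervals in the style of the BED format. It does not claim to reproduce how the EPDnew hub files are actually generated.

```python
def promoter_track_interval(tss, strand):
    """Return (start, end, thick_start, thick_end) as 0-based, half-open coordinates.

    tss: 1-based genomic position of the annotated TSS (relative position 0).
    The whole feature spans -49..+10; the thick part spans 0..+10.
    """
    if strand == "+":
        start, end = tss - 50, tss + 10          # relative -49 .. +10
        thick_start, thick_end = tss - 1, tss + 10  # relative 0 .. +10
    elif strand == "-":
        # On the minus strand, upstream lies at higher genomic coordinates.
        start, end = tss - 11, tss + 49
        thick_start, thick_end = tss - 11, tss
    else:
        raise ValueError("strand must be '+' or '-'")
    return start, end, thick_start, thick_end
```

On either strand the full interval is 60 bp long and the thick part covers 11 bp, matching the -49 to -1 and 0 to 10 segments described above.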

Sequence Retrieval Tool

The sequence retrieval tool allows the extraction of a sequence of any length around the annotated promoter. To extract a sequence, simply select the sequence range (in base pairs) and click 'GET SEQUENCE'. The option 'lower case upstream TSS' outputs lowercase letters in the upstream region to facilitate the identification of the TSS within the sequence (the first uppercase letter in the sequence represents the TSS).
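
When a sequence has been retrieved with the 'lower case upstream TSS' option, the TSS can be located programmatically as the first uppercase letter. A minimal Python sketch (the function name is hypothetical):

```python
def tss_index(sequence):
    """Return the 0-based index of the first uppercase letter, or None if there is none."""
    for i, base in enumerate(sequence):
        if base.isupper():
            return i
    return None


# Toy sequence, not taken from the database.
seq = "acgtacgtACGT"
i = tss_index(seq)
upstream, downstream = seq[:i], seq[i:]  # downstream starts at the TSS
```

Here `i` is 8, `upstream` is `acgtacgt` and `downstream` is `ACGT`.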

Search Motif Tool

The search motif tool scans promoter regions with position weight matrices (PWM) of several transcription factors and core promoter elements to find putative binding sites. Once a PWM library and TF have been selected, it is possible to scan the region with that PWM. Hits are marked as red rectangles in the plot and exact positions relative to the TSS are reported below the plot. Motif libraries are from the JASPAR database and the EPD Promoter Elements and can be downloaded as a text file. A description of the motifs and the conversion rules can be found on our Motif Database homepage. The scan is performed on-the-fly using the FindM tool from the SSA toolkit.
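
The general idea of scanning a region with a PWM can be sketched in a few lines of Python. This is a generic illustration, not the FindM implementation: the function name, the toy matrix and the cut-off are all assumptions, only the forward strand is scanned, and real matrices from JASPAR or the EPD Promoter Elements library use different scores.

```python
# Hypothetical toy matrix for the word TATA: +1 for the consensus base, -1 otherwise.
TOY_TATA = [
    {"A": -1, "C": -1, "G": -1, "T": 1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 1},
    {"A": 1, "C": -1, "G": -1, "T": -1},
]


def scan_pwm(sequence, pwm, threshold, tss_index):
    """Return (position relative to the TSS, score) for each window scoring >= threshold.

    pwm is a list of dicts, one per motif column, mapping A/C/G/T to a score.
    tss_index is the 0-based index of the TSS within the sequence.
    """
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start - tss_index, score))
    return hits
```

With the toy matrix, `scan_pwm("ggTATAaagc", TOY_TATA, 4, 9)` reports a single hit starting 7 bases upstream of the TSS.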

Expression Profile Tool

The expression profile tool shows the number of samples in which the promoter is active, its average expression level (number of tags in a 100-bp region centered on the TSS and normalized to 10M total tags) and a plot showing the distribution of sample-specific TSSs around the annotated TSS. Clicking on the histogram bars will show the number of samples that have the TSS located in that position (relative to the EPDnew annotated TSS) with their names and expression values.
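
The normalization of a single sample described above can be written out as a short Python sketch. The function name and the input layout (a mapping from genomic position to tag count) are hypothetical, and the exact window boundaries (50 bases upstream of the TSS through 49 bases downstream) are an assumption, since the text only states that the 100-bp region is centered on the TSS.

```python
def normalized_expression(tag_counts, tss, total_tags, window=100, scale=10_000_000):
    """Tags in a window centered on the TSS, normalized to `scale` total tags.

    tag_counts: dict mapping genomic position -> number of tags in one sample.
    total_tags: total number of tags in that sample.
    """
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    half = window // 2
    # Assumed window: positions tss-50 .. tss+49 for the default 100-bp window.
    in_window = sum(
        count for pos, count in tag_counts.items()
        if tss - half <= pos < tss + half
    )
    return in_window * scale / total_tags
```

For instance, 10 tags in the window of a sample with 20 million tags in total correspond to a normalized value of 5.0. Averaging this value over the samples in which the promoter is active would then give an average expression level.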

External Resources

Links to external genome browsers.

References

The eukaryotic promoter database in its 30th year: focus on non-vertebrate organisms. Dreos, R., Ambrosini, G., Groux, R., Périer, R., Bucher, P. Nucleic Acids Res. (2017) 45:D51-55; PUBMED 27899657
Last update October 2019