EUKARYOTIC PROMOTER DATABASE USER MANUAL
Written by: Philipp Bucher, Rouaida Cavin Périer, Viviane Praz and Christoph Schmid

EPFL School of Life Sciences - SV
and Swiss Institute for Experimental Cancer Research - ISREC
Computational Cancer Genomics Group
EPFL SV ISREC GR-BUCHER Station 15
CH-1015 Lausanne
Switzerland

Electronic mail:

This manual and the database it accompanies may be copied and redistributed freely, without advance permission, provided that this statement is reproduced with each copy.
Published research assisted by the Eukaryotic Promoter Database should cite:
EPD in its twentieth year: towards complete promoter coverage of selected model organisms
Schmid, C.D., Perier, R., Praz, V. and Bucher, P. (2006) Nucleic Acids Res, 34, D82-85.

EPD RELEASE 128, July 2016

WHAT IS NEW IN RELEASE 128: Update to new EMBL Release
EPD release history

INTRODUCTION
PROMOTER SELECTION
ASSIGNMENT OF INITIATION SITE
FORMAT CONVENTIONS

The title line
Promoter entries

The ID line
The AC line
The DT line
The DE line
The OS line
The HG line
The AP line
The NP line
The DR line
The RN, RX, RA, RT and RL lines
The ME line
The SE line
The FL line
The IF line
The TX line
The KW line
The FP, DO and RF lines
The // line

Line types retained from the old format

The FP line
Documentation
Literature references
Miscellaneous

Distinct format of 'preliminary' entries in epd_bulk.dat

CLASSIFICATION
HOMOLOGOUS PROMOTERS
PROMOTER SEQUENCE RETRIEVAL
REFERENCES

APPENDIX A : SURVEY OF RELEASE
APPENDIX B : CODES AND ABBREVIATIONS

SPECIES CODES
JOURNAL CODES
ABBREVIATIONS

1 INTRODUCTION

The Eukaryotic Promoter Database (EPD) was designed and developed at the Weizmann Institute of Science in Rehovot (Israel) and is currently maintained at ISREC in Epalinges s/Lausanne (Switzerland). EPD is a specialized annotation database of the EMBL Data Library. It provides information about eukaryotic promoters available in the EMBL Data Library and is intended to assist experimental researchers, as well as computer analysts, in the investigation of eukaryotic transcription signals. The present version originated from a previous compilation published in an article (1) and is organized as a hierarchically ordered and documented "functional position set" (2) pointing to transcription initiation sites. All information is either extracted directly from the scientific literature or, starting from release 73, compiled by a new in silico primer extension method (16). Promoter information in EPD is thus independent of the EMBL sequence entry descriptions. As a consequence, many of the initiation sites referred to in EPD do not appear in the corresponding EMBL feature tables.
A coordinated updating procedure has been set up by the two laboratories to ensure future compatibility between the position references in EPD and the sequence data in the main data library. Investigators who access EMBL via publicly available programs should be aware that software producers occasionally modify the sequence data in ways that render position references inaccurate. EPD is generally not compatible with sequence data of another release because EMBL sequence entries are not designed as stable data units.
The completeness and accuracy of EPD benefit greatly from user feedback. Any report of mistakes or omissions would be very much appreciated. Direct communication of newly published transcript mapping or gene expression data is also welcome. Please forward all correspondence to the address given at the top of this document. Use electronic mail if possible.

2 PROMOTER SELECTION

EPD is a rigorously selected database. In order to be included in EPD, a promoter must be:

recognized by eukaryotic RNA POL II,
active in a higher eukaryote,
experimentally defined, or homologous and sufficiently similar to an experimentally defined promoter,
biologically functional,
available in the current EMBL release,
distinct from other promoters in the database.

Explanations:

Transcription by RNA POL II is bona fide assumed for protein coding genes but must be supported by alpha-amanitin data if the end product is an RNA.
All eukaryotes except phycophyta, fungi, myxomycetes, and protozoa are considered higher eukaryotes. Note that the expression "active in" does not always refer to the source organism of the promoter (e.g. in viruses). EPD currently contains promoter sequences from 139 different species.
A promoter is experimentally determined if a corresponding transcription initiation site is mapped with a precision of +/- 5 bp or higher. Any technique that characterizes the 5' terminus of an in vivo or in vitro generated RNA is acceptable. Single nuclease-protection or primer-extension data must be accompanied by additional evidence unless the gene's intron-exon organization is well established. Similarity is considered "sufficient" if percent identity (as defined in Section 6) is >=60% between -79 and +20 or >=75% between -49 and +10.
A promoter is biologically functional if it contributes to the source organism's survival and/or reproduction. This is bona fide assumed except for promoters of pseudogenes, minor transcription initiation sites (<20% of total gene transcripts), promoters giving rise to an unstable RNA product, and mutant promoters.
The minimum sequence requirement is 45 bp between -49 and +10.
Promoters are considered distinct if they originate from different gene loci or different species. Identity is assumed if two promoters from the same species exhibit >95% similarity between -79 and +20 while their genetic relationship is unknown. Multiple isolates of viruses or transposable elements are considered distinct if at least one promoter region fails to fulfill the above similarity criterion.

3 ASSIGNMENT OF TRANSCRIPTION INITIATION SITE

A eukaryotic promoter is defined as a DNA sequence around a transcription initiation site. The position reference to the initiation site is therefore the central part of a promoter entry. Its assignment is based directly on experimental data shown in an article; adjustments proposed on the basis of consensus sequence considerations are ignored. In the case of minor discrepancies between different publications, averaged positions are given. Position references are subject to continual re-evaluation. A transcription initiation site may be reassigned upon publication of new data. Position references are replaced if longer upstream sequences of the same promoter become available in a new EMBL sequence entry.
Several initiation sites preceding the same gene appear as alternative promoters if they are clearly separated from each other or differentially regulated. The minimum distance required between two alternative initiation sites is 20 bp. Otherwise, they are considered a single promoter region.
Four types of promoters are distinguished by one-letter codes in order to account for the variety of transcription initiation patterns in eukaryotes:

S: Single initiation site: >90% of all reported transcripts initiate within 10 bp (the experimental data usually do not allow distinction between a single cap-site and small mRNA 5' heterogeneity).
M: Multiple initiation sites: >75% of all reported transcripts initiate within 20 bp.
R: Initiation region: >75% of all reported transcripts initiate within 100 bp.
U: Undefined transcription initiation pattern, exclusively in 'preliminary' entries in epd_bulk.dat (see next section).
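
The thresholds above can be illustrated with a minimal sketch. It assumes that "within N bp" means a window of N consecutive positions, that the input is a mapping from start position to transcript count, and that anything failing all three tests is reported as "U"; the function names are hypothetical and not part of any EPD software.

```python
def max_window_fraction(counts, width):
    """Largest fraction of transcripts that start inside any window of `width` bp."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    best = 0
    # The best window can always be chosen to begin at an observed start site.
    for start in counts:
        inside = sum(n for pos, n in counts.items() if start <= pos < start + width)
        best = max(best, inside)
    return best / total


def classify_initiation_pattern(counts):
    """Return 'S', 'M', 'R' or 'U' following the thresholds listed above."""
    if max_window_fraction(counts, 10) > 0.90:
        return "S"
    if max_window_fraction(counts, 20) > 0.75:
        return "M"
    if max_window_fraction(counts, 100) > 0.75:
        return "R"
    return "U"  # assumption: no pattern recognised


print(classify_initiation_pattern({100: 95, 150: 5}))
```

The tests are applied from the narrowest window to the widest, so a promoter that satisfies the S criterion is never reported as M or R.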

Note that in addition to true alternative promoter activity, variability in the position of the transcription initiation site might also be due to experimental constraints, a biological variability in the activity of RNA polymerase II, or the presence of highly similar (pseudo-) genes with distinct transcription initiation sites.
In sequence entries that contain a complete RNA or DNA genome of a retrovirus or a retrovirus-like transposable element, the position reference points to the U3/R boundary of the 3' terminal LTR.

4 FORMAT CONVENTIONS

EPD is distributed as two ASCII flatfiles (epd.dat, epd_bulk.dat) in essentially identical format. Differences in the format of 'preliminary' entries in 'epd_bulk.dat' are described in paragraph 4.4. EPD files contain a title line followed by a number of promoter entries. Interspersed are group headings whose function and format are described in the next section. The title line and parts of the promoter entries are rigidly formatted so that the entire database conforms to the standards of an FPS file (functional position set) of our current signal search analysis (1,2) software.

4.1. The title line

The title line of EPD is shown below:

TI   EPD83     Eukaryotic Promoter Database / Release 83              EP

The TI line contains the following fields:


columns	data type
1- 2	"TI"
3- 5	(blank)
6-15	FPS name
16-70	title
71-72	FPS code

Explanations:

FPS name and FPS code are used by our data extraction software to generate default names for output files.
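
As an illustration of the fixed-column layout, the following sketch slices a TI line according to the column table above. The function name and the returned dictionary keys are hypothetical; only the column boundaries come from this manual.

```python
def parse_title_line(line):
    """Split a TI line into its fixed-width fields (1-based columns in comments)."""
    if not line.startswith("TI"):
        raise ValueError("not a TI line")
    padded = line.rstrip("\n").ljust(72)
    return {
        "fps_name": padded[5:15].strip(),   # columns 6-15
        "title": padded[15:70].strip(),     # columns 16-70
        "fps_code": padded[70:72].strip(),  # columns 71-72
    }


# Build the example line with explicit padding so the columns are unambiguous.
example = ("TI   "
           + "EPD83".ljust(10)
           + "Eukaryotic Promoter Database / Release 83".ljust(55)
           + "EP")
print(parse_title_line(example))
```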

4.2. Promoter entries

An EPD entry contains the following types of information:

Promoter identification and description.
Machine-readable pointers to the transcription initiation site in corresponding sequence entries.
Description of the experimental evidence defining the transcription start site.
Various kinds of promoter classifications useful for extraction of biologically meaningful promoter subsets.
Information on regulatory properties.
Cross-references to other databases.
Bibliographic references.

Promoter entries are presented in a format similar to that of EMBL and SWISS-PROT sequence entries. Each line starts with a line code identifying the type of information presented. The current line types and line codes, and the order in which they appear in an entry, are shown below:

    ID  - IDentification.
    AC  - ACcession number(s).
    DT  - DaTe.
    DE  - DEscription.
    OS  - Organism Species.
    HG  - Homology Group.
    AP  - Alternative Promoter.
    NP  - Neighbouring Promoter.
    DR  - Database cross-References.
    RN  - Reference Number.
    RX  - Reference cross-references.
    RA  - Reference Authors.
    RT  - Reference Title.
    RL  - Reference Location.
    ME  - MEthods.
    SE  - SEquence.
    FL  - Full Length.
    IF  - Initiation Frequency.
    TX  - TaXonomy.
    KW  - KeyWords.
    FP  - Functional Position.
    DO  - DOcumentation.
    RF  - literature ReFerence.
    //  - Termination line.

Spacer lines (XX) are inserted in order to make the promoter database easier to read by eye. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). Text does not exceed column 72. Below is an example of a promoter entry:

      ID   HS_MYC_2     standard; single; VRT.
      XX
      AC   EP11148;
      XX
      DT   ??-APR-1987 (Rel. 11, created)
      DT   07-MAR-2005 (Rel. 82, Last annotation update).
      XX
      DE   c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
      DE   promoter 2.
      OS   Homo sapiens (human)
      XX
      HG   Homology group 53; Mammalian c-myc proto-oncogene, promoter 2
      AP   Alternative promoter #2 of 2; exon 1; site 2; major promoter.
      NP   none.
      XX
      DR   GENOME; NT_008046.15; NT_008046; [-41966656, 15188617].
      DR   EPD; EP11146; HS_MYC_1; alternative promoter; [-162; +].
      DR   CLEANEX; HS_MYC.
      DR   EMBL; AC103819.3; [-87815, 60206].
      DR   EMBL; X00364.2; [-2489, 8507].
      DR   EMBL; D10493.1; [-2487, 5569].
      DR   EMBL; K01910.1; [-2451, 49].
      DR   EMBL; M16261.1; [-1843, 1048].
      DR   EMBL; J03253.1; [-1759, 461].
      DR   EMBL; L00057.1; [-810, 2795].
      DR   EMBL; K03015.1; [-555, 458].
      DR   EMBL; X00196.1; [-532, 2792].
      DR   EMBL; M12026.1; [-511, 678].
      DR   EMBL; K01708.1; [-410, 500].
      DR   EMBL; K00559.1; [-345, 1020].
      DR   EMBL; K02280.1; [-302, 178].
      DR   EMBL; K01909.1; [-266, 1365].
      DR   EMBL; S65124.1; [-266, 1023].
      DR   EMBL; M14206.1; [-266, 446].
      DR   EMBL; M20013.1; [-240, 982].
      DR   EMBL; AF111270.1; [-142, 264].
      DR   EMBL; K02275.1; [-96, 780].
      DR   EMBL; X00675.1; [-96, 404].
      DR   EMBL; K02277.1; [-96, 157].
      DR   SWISS-PROT; P01106; MYC_HUMAN.
      DR   TRANSFAC; R01157; HS$CMYC_01; [-211, -189]; by position.
      DR   TRANSFAC; R01158; HS$CMYC_02; [-168, -145]; by position.
      DR   TRANSFAC; R01804; HS$CMYC_04; [-300, -283]; by position.
      DR   TRANSFAC; R01851; HS$CMYC_05; [-65, -57]; by position.
      DR   TRANSFAC; R01852; HS$CMYC_06; [-42, -34]; by position.
      DR   TRANSFAC; R04076; HS$CMYC_12; [-251, -228]; by position.
      DR   TRANSFAC; R04076; HS$CMYC_12; [-252, -229]; by position.
      DR   TRANSFAC; R04076; HS$CMYC_12; [-253, -230]; by position.
      DR   TRANSFAC; R04621; HS$CMYC_17; [-313, -262]; by position.
      DR   TRANSFAC; R08503; HS$CMYC_18; [-50, -41]; by position.
      DR   TRANSFAC; R16688; HS$CMYC_24; [-7, 41]; by position.
      DR   TRANSFAC; R16689; HS$CMYC_25; [-7, 41]; by position.
      DR   TRANSFAC; R17051; HS$CMYC_30; [-510, -480]; by position.
      DR   TRANSFAC; R18503; HS$CMYC_31; [-185, -170]; by position.
      DR   TRANSFAC; R18504; HS$CMYC_32; [-153, -168]; by position.
      DR   RefSeq; NM_002467.
      DR   MIM; 190080.
      XX
      RN   [1]
      RX   MEDLINE; 84026482.
      RA   Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
      RA   Lenoir G., Leder P.;
      RT   "The human c-myc oncogene: structural consequences of
      RT   translocation into the IgH locus in Burkitt lymphoma";
      RL   Cell 34:779-787(1983).
      RN   [2]
      RX   MEDLINE; 84131953.
      RA   Bernard O.D., Cory S., Gerondakis S., Webb E., Adams J.M.;
      RT   "Sequence of the murine and human cellular myc oncogenes and two
      RT   modes of myc transcription resulting from chromosome translocation
      RT   in B lymphoid tumours";
      RL   EMBO J. 2:2375-2383(1983).
      RN   [3]
      RX   MEDLINE; 87257828.
      RA   Lipp M., Schilling R., Wiest S., Laux G., Bornkamm G.W.;
      RT   "Target sequences for cis-acting regulation within the dual
      RT   promoter of the human c-myc gene.";
      RL   Mol. Cell. Biol. 7:1393-1400(1987).
      RN   [4]
      RX   MEDLINE; 88038843.
      RA   Broome H.E., Reed J.C., Godillot E.P., Hoover R.G.;
      RT   "Differential promoter utilization by the c-myc gene in mitogen-
      RT   and interleukin-2-stimulated human lymphocytes.";
      RL   Mol. Cell. Biol. 7:2988-2993(1987).
      XX
      ME   Nuclease protection [1,4].
      ME   Nuclease protection; transfected or transformed cells [3].
      ME   Length measurement of an RNA product; low-precision data [1].
      XX
      SE   agggagggatcgcgctgagtataaaagccggttttcggggctttatctaACTCGCTGTAG
      XX
      TX   6. Vertebrate promoters
      TX   6.1. Chromosomal genes
      TX   6.1.5. Hormones, growth factors, regulatory proteins
      TX   6.1.5.16. Various cellular protooncogenes
      XX
      KW   Proto-oncogene, Nuclear protein, DNA-binding, Glycoprotein,
      KW   Transcription regulation.
      XX
      FP   Hs c-myc         P2+:+S  EU:NC_000008.9       1+ 128817660; 11148.053 010*2
      XX
      DO        Experimental evidence: 4,4#,2l
      DO        Expression/Regulation: +mitogen
      RF        Cell34:779     EMBOJ2:2375    MCB7:1393      MCB7:2988
      //
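
A flat file organised like the example above can be read with a few lines of code. The sketch below is one possible reader, not the software used to maintain EPD: it groups the content of each entry by line code, skips XX spacer lines and the TI title line, and ends an entry at the // terminator. Group headings are not handled specially, because their format is described in a later section.

```python
from collections import defaultdict


def read_entries(lines):
    """Yield one dict per entry, mapping each line code to its list of line contents."""
    entry = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("XX") or line.startswith("TI"):
            continue  # spacer, blank or title line
        if line.startswith("//"):
            if entry:
                yield dict(entry)
            entry = defaultdict(list)
            continue
        code, content = line[:2], line[2:].strip()
        entry[code].append(content)


sample = """\
ID   HS_MYC_2     standard; single; VRT.
XX
AC   EP11148;
XX
DE   c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE   promoter 2.
//
"""
for record in read_entries(sample.splitlines()):
    print(record["AC"], " ".join(record["DE"]))
```

Line types that may occur several times in an entry (DE, DR, ME and so on) end up as lists with one element per line, so continuation lines can be joined by the caller.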

A detailed description of each line type is given below.

4.2.1. The ID line

The identification line is always the first line of an entry. The general form of the ID line is:

ID   ENTRY_NAME data class; initiation site type; TAXONOMIC DIVISION.

ENTRY_NAME is a unique entry identifier (e.g. "HS_MYC_2") that obeys rigorous naming conventions. It contains 2 or 3 fields. The first is the species identification code, at most 4 alphanumeric characters, representing the biological source of the promoter. The second field identifies the gene, using the protein code of the SWISS-PROT ID (if available). For human EPD entries, the official gene symbol approved by the HUGO nomenclature committee (if available) is used instead of the SWISS-PROT ID. The third field is optional; it is either a number that represents alternative promoters or a letter for promoters of duplicated genes. The `_' sign serves as a separator.
The data class field relates to the quality of the information: "standard" means that the information is complete and correct according to the standards laid down in this document; "preliminary" means that the entry has not yet undergone all quality checks necessary for being classified as "standard".
The initiation site type is either "single", "multiple" or "region", as defined in Section 3.
The TAXONOMIC DIVISION is one of:

PLN for plant
NEM for nematode
ART for arthropod
MLS for mollusc
ECH for echinoderm
VRT for vertebrate.

The ID line is terminated by a period.
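
The ID line conventions described above can be checked with a regular expression. This is a sketch under the assumption that the three fields after the entry name are single words separated by semicolons, as in the example entry; the function name and dictionary keys are hypothetical.

```python
import re

ID_LINE = re.compile(r"^ID\s+(\S+)\s+(\w+);\s*(\w+);\s*(\w+)\.$")


def parse_id_line(line):
    """Split an ID line into entry name parts, data class, site type and division."""
    match = ID_LINE.match(line.strip())
    if match is None:
        raise ValueError("malformed ID line: %r" % line)
    entry_name, data_class, site_type, division = match.groups()
    parts = entry_name.split("_")  # species code, gene, optional third field
    return {
        "entry_name": entry_name,
        "species_code": parts[0],
        "gene": parts[1] if len(parts) > 1 else None,
        "suffix": parts[2] if len(parts) > 2 else None,
        "data_class": data_class,
        "site_type": site_type,
        "division": division,
    }


print(parse_id_line("ID   HS_MYC_2     standard; single; VRT."))
```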

4.2.2. The AC line

AC   EP11148;

The accession number consists of the character string "EP" followed by 5 digits representing the EMBL release number followed by the EPD entry order. Most EPD entries currently have only one accession number. If necessary, more than one accession number is given; the numbers are separated by semicolons and the list is terminated by a semicolon.

4.2.3. The DT line

The date lines show the date of entry or last modification of the entry.

DT   DD-MMM-YEAR (Rel. XX, Comment)

where `DD' is the day, `MMM' the month, `YEAR' the year, and `XX' the EPD release number. The comment portion of the line indicates the action taken on that date.

The first DT line indicates when the entry first appeared in the database.
The second DT line indicates when the promoter data was last modified. It is terminated by a period.
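
A DT line can be taken apart as sketched below. The pattern is an assumption derived from the general form above and from the example entry, where the day may be given as "??" and only the second DT line ends with a period; the function name is hypothetical.

```python
import re

DT_LINE = re.compile(r"^DT\s+(\S{2})-([A-Z]{3})-(\d{4})\s+\(Rel\.\s*(\d+),\s*(.+?)\)\.?$")


def parse_dt_line(line):
    """Return day, month, year, release number and comment of a DT line."""
    match = DT_LINE.match(line.strip())
    if match is None:
        raise ValueError("malformed DT line: %r" % line)
    day, month, year, release, comment = match.groups()
    return {
        "day": day,  # kept as text because it may be "??"
        "month": month,
        "year": int(year),
        "release": int(release),
        "comment": comment,
    }


print(parse_dt_line("DT   07-MAR-2005 (Rel. 82, Last annotation update)."))
```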

4.2.4. The DE line

DE   c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE   promoter 2.

The description lines contain general descriptive information about the promoter. The description is given in ordinary English and is free-format. It contains the SWISS-PROT gene names when known. In some cases, more than one DE line is required; in this case, the text is divided only between words. The last DE line is terminated by a period.

4.2.5. The OS line

OS   Mus musculus (house mouse)

The species line specifies the source organism(s) of the promoter. The species names are based on NCBI's taxonomy and can thus be automatically hyperlinked to NCBI's taxonomy web pages.

4.2.6. The HG line

HG   Homology group 53; Mammalian c-myc proto-oncogene, promoter 2

The homology group line is optional. It contains 2 fields: a homology group number that allows identification of all sequence-wise similar promoters in EPD, and a homology group name.

4.2.7. The AP line

AP   Alternative promoter #2 of 2; 5' exon 1; site 2; major promoter.

The AP line is optional and provides information on alternative promoters of the same gene (for more details, see Section 4.3.1.). It contains 3 or 4 fields, separated by semicolons. The fields combine descriptive text with the following types of information:

Two numbers indicating, respectively, the promoter's relative position along the gene and the total number of alternative promoters of the gene. Promoters are numbered in the 5' to 3' direction starting with one.
A number referring to the exon preceded by the promoter. Note that multiple promoters may be associated with the same (3'-coterminal) exon or with different exons. Known exons are numbered in the 5' to 3' direction starting with one.

Note that the nomenclature of 5'-exons in EPD may differ from the usage in the literature.

A number indicating the promoter's relative position among the subset of promoters preceding the same exon.
An optional keyword indicating major promoters.

The AP line is terminated by a period.

4.2.8. The NP line

NP   Neighbouring Promoter; EP23008; MM_H2B1; [-209; -].

The NP line is optional and provides information on promoters which are physically closer to each other than 1000 bp. It contains 3 fields, separated by semicolons, providing the following types of information:

The EPD accession number of the neighbouring promoter.
The EPD identifier of the neighbouring promoter.
The last field indicates, respectively, the position and the direction of the neighbouring promoter relative to the transcription initiation site given in the promoter entry.

Negative numbers indicate the upstream region of this entry and positive ones indicate the downstream region.
The sign indicates the transcription direction of the neighbouring promoter relative to the promoter entry:

4.2.9. The DR line

The DR lines contain cross-references to other EPD entries (if there are alternative promoters of the same gene), or to entries from other databases. So far, we have incorporated links to CLEANEX, EMBL (3), GenBank (4), DDBJ (5), SWISS-PROT (6), TRANSFAC (7), Flybase (8), MIM (9) and MGD (10). The precise format of these lines depends on the target database. Note that some cross-references include numbers enclosed in square brackets indicating the relative position of a linked sequence object, or keywords characterising the nature of the relationship between the entries. For instance, the ranges associated with cross-references to EMBL entries define the extensions of the EMBL sequences relative to the initiation site described by the EPD entry. The multiplicity of EMBL cross-references in some entries mirrors the redundancy of the sequence database. The first of these references corresponds to the longest promoter region, except when the sequences have been cancelled from the EMBL database but still exist in GenBank or DDBJ.
The format of the DR line is shown by the following example lines:

     DR   GENOME; NT_037436.1; NT_037436; [-14139754, 9212459].
     DR   EPD; EP11146; HS_MYC_1; alternative promoter; [-162; +].
     DR   EMBL; J00120.1; [-2489, 8507].
     DR   SWISS-PROT; P01106; MYC_HUMAN.
     DR   SPTREMBL; Q8IQL1.
     DR   FLYBASE; FBgn0013718; nuf.
     DR   TRANSFAC; R01804; HS$CMYC_04; [-300, -283]; by position.
     DR   MIM; 190080.
     DR   RefSeq; NM_003529.
     DR   MGD; MGI:88468; Cola2.
     DR   ENSEMBL; CG32140.
     DR   TRANSCRIPTOME; DMe000571.

Explanations (for detailed information, see the Guidelines):

The first item on the DR line is the abbreviated name of the data collection to which reference is made. The currently defined data bank identifiers are the following:


GENOME	NCBI Reference Sequence (RefSeq) of genomic sequence contigs
EPD	Eukaryotic Promoter Database: alternative promoters of the same gene
CLEANEX	Gene expression database for human EPD promoters
EMBL	Nucleotide sequence database of the EMBL
SWISS-PROT	Protein sequence database
SPTREMBL	Subset of protein sequence database TrEMBL. It contains the entries which should be eventually incorporated into SWISS-PROT. SWISS-PROT accession numbers have been assigned for all SP-TrEMBL entries
FLYBASE	Drosophila genome database
TRANSFAC	Transcription factor (TF) database
MIM	Mendelian Inheritance in Man Database
RefSeq	Reference Sequence Database
MGD	Mouse Genome Database
ENSEMBL	Metazoan genome annotation
TRANSCRIPTOME	Catalog of transcripts and their mapping onto the genome (LICR Lausanne branch)
TIGR	'gene identifiers' from the 'Rice Genome Annotation' project at TIGR

The second item is the primary accession number (or an equivalent unique identifier in another data bank) of the entry to which reference is made.
The third item (if it exists) is a secondary identifier or name for the cross-referenced database entry.
The fourth item for EMBL and TRANSFAC indicates the location and extension of the sequences given in these entries relative to the transcription initiation site given in the promoter entry. Negative numbers indicate the upstream region of this site and positive ones indicate the downstream part.
The fifth item

in the EPD line, indicates the position and the direction of the alternative promoter, in the same way as the last field of the NP line does for the neighbouring promoter
in the TRANSFAC line, designates the criterion used to collect the TF entry (e.g. "by position")

NB: The TRANSFAC cross-reference lines should not outnumber the binding sites actually found in the "TRANSFAC Site Table". Thus the position given in this DR line is related to the longest EMBL entry common to both the EPD and TRANSFAC (version 6.3) databases.
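
Because the bracketed range of an EPD cross-reference itself contains a semicolon ("[-162; +]"), a DR line cannot simply be split on semicolons. The sketch below splits only on semicolons outside square brackets. The function name and the keys "fields", "range" and "notes" are hypothetical, chosen for this illustration.

```python
import re

# A semicolon counts as a separator only when it is not inside [...].
SEPARATOR = re.compile(r";\s*(?![^\[]*\])")


def parse_dr_line(line):
    """Split a DR line into database name, plain fields, bracketed range and notes."""
    body = line.strip()
    if not body.startswith("DR"):
        raise ValueError("not a DR line")
    body = body[2:].strip().rstrip(".")
    items = [item.strip() for item in SEPARATOR.split(body)]
    result = {"database": items[0], "fields": [], "range": None, "notes": []}
    for item in items[1:]:
        if item.startswith("[") and item.endswith("]"):
            result["range"] = [part.strip() for part in re.split(r"[,;]", item[1:-1])]
        elif result["range"] is None:
            result["fields"].append(item)
        else:
            result["notes"].append(item)  # e.g. "by position" in TRANSFAC lines
    return result


print(parse_dr_line("DR   EPD; EP11146; HS_MYC_1; alternative promoter; [-162; +]."))
```

The range is returned as a list of strings because its second element is a coordinate in EMBL and TRANSFAC lines but a direction sign in EPD lines.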

4.2.10. The RN, RX, RA, RT and RL lines

These lines comprise the literature citations within EPD. The citations indicate the papers from which the data have been abstracted. The reference lines for a given citation occur in a block, and are always in the order RN, RX, RA, RT, RL. Within each such reference block the RN line occurs once, the RX line occurs zero or more times, and the RA, RT and RL lines each occur one or more times. If several references are given, there will be a reference block for each. An example of a complete reference is:

RN   [1]
RX   MEDLINE; 84026482.
RA   Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
RA   Lenoir G., Leder P.;
RT   "The human c-myc oncogene: structural consequences of
RT   translocation into the IgH locus in Burkitt lymphoma";
RL   Cell 34:779-787(1983).

The formats of the individual lines are explained below.

4.2.10.1. The RN line

The RN line gives a sequential number to each reference citation in an entry. This number is used to indicate the reference in the ME lines.

4.2.10.2 The RX line

The RX line is an optional line that gives the identifier assigned to a specific reference in PubMed (PMID, from the National Library of Medicine (NLM)).

4.2.10.3 The RA line

The RA lines list the authors of the paper (or other work) cited. The authors are listed in the order given in the paper. Each name is given surname first, followed by a blank and the initial(s) with periods. The authors' names are separated by commas and terminated by a semicolon. Author names are not split between lines.

4.2.10.4 The RT line

The RT lines contain the title of the reference citation.

4.2.10.5 The RL line

The RL lines contain the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question. It includes the journal abbreviation, the volume number, the page range, and the year. Journal names are abbreviated according to the conventions used by the National Library of Medicine (NLM) and are based on the existing ISO and ANSI standards.
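
For journal citations of the form shown in the example entry, the RL line can be decomposed as sketched below. The pattern is an assumption based only on those examples (journal abbreviation, volume, page range, year); lines that do not fit it are returned as None rather than guessed at. The function name is hypothetical.

```python
import re

RL_LINE = re.compile(r"^RL\s+(.+?)\s+(\d+):(\d+)-(\d+)\((\d{4})\)\.$")


def parse_rl_line(line):
    """Return journal, volume, pages and year, or None if the line has another form."""
    match = RL_LINE.match(line.strip())
    if match is None:
        return None
    journal, volume, first_page, last_page, year = match.groups()
    return {
        "journal": journal,
        "volume": int(volume),
        "first_page": int(first_page),
        "last_page": int(last_page),
        "year": int(year),
    }


print(parse_rl_line("RL   Mol. Cell. Biol. 7:1393-1400(1987)."))
```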

4.2.11. The ME line

The method lines describe experiments defining the transcription initiation site. The format of the ME line is as follows:

ME   Method_description [; Qualifier...] [n,...].

A complete list of method descriptions is given in Section 4.3.2. Qualifiers may indicate that an experimental gene transcription system was used, that the data are of low precision (error greater than +/- 5 bp), or that the experiments were done with a closely related gene. The number(s) enclosed in square brackets link the method descriptions to the bibliographic references included in the promoter entry. The method lines from the example are:

ME   Nuclease protection [1,4].
ME   Nuclease protection; transfected or transformed cells [3].
ME   Length measurement of an RNA product; low-precision data [1].
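
The ME format can be parsed as follows. This is a minimal sketch that assumes the method description and its qualifiers are separated by semicolons and that the reference numbers are a comma-separated list, as in the three lines above; the function name is hypothetical.

```python
import re

ME_LINE = re.compile(r"^ME\s+(.+?)\s*\[([0-9,\s]+)\]\.$")


def parse_me_line(line):
    """Return the method description, its qualifiers and the cited reference numbers."""
    match = ME_LINE.match(line.strip())
    if match is None:
        raise ValueError("malformed ME line: %r" % line)
    description, numbers = match.groups()
    method, *qualifiers = [part.strip() for part in description.split(";")]
    references = [int(n) for n in numbers.split(",")]
    return {"method": method, "qualifiers": qualifiers, "references": references}


print(parse_me_line("ME   Nuclease protection; transfected or transformed cells [3]."))
```

The reference numbers returned here correspond to the RN lines of the same entry.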

4.2.12. The SE line

The sequence line shows a short sequence segment corresponding to the -49 to +10 region of the promoter. Transcribed and untranscribed nucleotides are represented by upper and lower case characters, respectively. This line type is not meant to provide sequence data but serves as a control string for sequence extraction.
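
The use of the SE line as a control string can be sketched as follows: the line is split at the case boundary into untranscribed and transcribed parts, and a sequence extracted from a sequence database is then compared with it, ignoring case. Both function names are hypothetical, and the comparison rule is an assumption about how such a check might be done.

```python
def split_control_string(se_line):
    """Split an SE line into (untranscribed, transcribed) parts at the case boundary."""
    code, sequence = se_line.split(None, 1)
    if code != "SE":
        raise ValueError("not an SE line")
    sequence = sequence.strip()
    boundary = next((i for i, ch in enumerate(sequence) if ch.isupper()), len(sequence))
    upstream, transcribed = sequence[:boundary], sequence[boundary:]
    if upstream != upstream.lower() or transcribed != transcribed.upper():
        raise ValueError("mixed case on one side of the initiation site")
    return upstream, transcribed


def matches_control_string(se_line, extracted):
    """True if an extracted sequence agrees with the SE control string, ignoring case."""
    upstream, transcribed = split_control_string(se_line)
    return extracted.lower() == (upstream + transcribed).lower()


SE_EXAMPLE = "SE   agggagggatcgcgctgagtataaaagccggttttcggggctttatctaACTCGCTGTAG"
upstream, transcribed = split_control_string(SE_EXAMPLE)
print(len(upstream), transcribed)
```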

4.2.13. The FL line

The Full Length line designates the large-scale cDNA sequencing projects: NEDO (11), MGC (12), and BDGP (15).

4.2.14. The IF line

The Initiation Frequency lines reflect the frequency at which each nucleotide within the initiation region is found at the 5' end of bona fide full-length cDNA clone inserts.

4.2.15. The TX line

The TX (TaXonomy) lines define a promoter's location within EPD's hierarchical classification system (see Section 5). Note that starting from release 72, the classification system is no longer maintained.

4.2.16. The KW line

The KW lines define a number of keywords describing an entry.

4.2.17. The FP, DO and RF lines

These lines belong to the old EPD format; see the next section.

4.2.18. The // line

The // (terminator) line contains no data or comments. It designates the end of an entry.

4.3. Line types retained from the old format

The last six lines of an entry present essential information in the more concise old format. The original description of the old format follows. Each entry starts with an FP line that contains a position reference to a transcription initiation site, and ends with a terminator (//). Below is an example of a promoter entry:

FP   Hs c-myc         P2+:+S  EU:NC_000008.9       1+ 128817660; 11148.053 010*2
XX
DO        Experimental evidence: 4,4#,<2>
DO        Expression/Regulation: +mitogen
RF        Cell34:779     EMBOJ2:2375    MCB7:1393      MCB7:2988
//

4.3.1. The FP line

The FP line contains the following fields and subfields:


columns	data type
1- 2	"FP"
3- 5	(blank)
6-30	description:
6-25	promoter name
26-26	":"
27-27	independent subset status (see section 6)
28-28	type of initiation site (see section 3)
29-30	(blank)
31-55	functional position reference:
31-51	sequence reference:
31-32	genome db code
33-33	":"
34-51	genome db entry accession number
52-52	sequence type (0 = circular, 1 = linear)
53-53	strand (+ or -)
54-63	position number
64-64	";"
65-70	entry code
71-71	"."
72-74	homology group number (see section 6)
75-75	(blank)
76-80	alternative promoter identification code:
76-78	gene number
79-79	"*"
80-80	initiation site number

Explanations:

The promoter name begins with a species code usually followed by a gene locus or gene product name. Species codes consist of the initials of genus and species name. Occasionally, three characters are required to generate unique codes. Standard abbreviations identify viruses. The full names of the organisms are given in appendix B.1. Subspecies or strains are specified in parentheses. Chromosomal locations (genetic or cytogenetic loci, genomic map units, etc.) may appear in square brackets immediately following species codes. Many gene products are referred to by abbreviations explained in appendix B.3. Alternative promoters are identified by a right-justified "P" and a digit indicating the corresponding initiation site, numbered sequentially from 5' to 3'. An optional "E" and digit refer to the corresponding 5' exons, if known. Identical numbers indicate 3' co-terminal exons. The strongest initiation site is marked by a trailing +, if known (see also List of alternative promoters).
Genome db codes currently used are 'EM' for the EMBL database, and 'EU' for genome contigs or chromosomal genome assemblies of the RefSeq database.
The EMBL accession number always relates to the first EMBL cross-reference. This one is usually the longest promoter region, except when the entry has been cancelled from the EMBL database but is still present in GenBank or DDBJ.
The sequence type indicates whether the sequence is circular or linear. A sequence comprising exactly one repeat unit of a tandem repeat cluster is also considered circular. Note that the annotation as circular or linear sequences in EPD is not always in agreement with the corresponding annotation in EMBL.
The entry code is a five-digit number which is the only part of a promoter entry that is stable from release to release.
Alternative promoter identification code: Genes represented by multiple promoter entries in EPD are assigned a promoter group number. The corresponding initiation sites are numbered sequentially from 5' to 3'.
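
The fixed-column layout of the FP line lends itself to slicing by column numbers. The sketch below uses the 1-based, inclusive column ranges of the field table in this section; the field names and the function name are hypothetical, and all values are returned as text.

```python
# 1-based, inclusive column ranges of the leaf fields of the FP line.
FP_FIELDS = {
    "promoter_name": (6, 25),
    "subset_status": (27, 27),
    "site_type": (28, 28),
    "db_code": (31, 32),
    "accession": (34, 51),
    "sequence_type": (52, 52),
    "strand": (53, 53),
    "position": (54, 63),
    "entry_code": (65, 70),
    "homology_group": (72, 74),
    "gene_number": (76, 78),
    "site_number": (80, 80),
}


def parse_fp_line(line):
    """Slice an FP line into its fixed-width fields; values are stripped strings."""
    if not line.startswith("FP"):
        raise ValueError("not an FP line")
    padded = line.rstrip("\n").ljust(80)
    return {name: padded[start - 1:end].strip() for name, (start, end) in FP_FIELDS.items()}


FP_EXAMPLE = "FP   Hs c-myc         P2+:+S  EU:NC_000008.9       1+ 128817660; 11148.053 010*2"
for name, value in parse_fp_line(FP_EXAMPLE).items():
    print(name, "=", value)
```

Converting "position" or "entry_code" to integers is left to the caller, since a blank field would otherwise raise an error.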

4.3.2. DO lines: Documentation

Documentation of promoter entries is presented on lines starting with "DO". They are essentially free-format and so far are not processed by specific programs. In the present release, there are two DO lines per entry, the first referring to the transcript mapping experiments that define the promoter, the second giving information about expression and regulation. The various experimental techniques are identified by number codes. The MEDLINE number and/or "example" given in brackets is linked, respectively, to the abstract and/or to the full-text article describing the related experiment.


codes	experiments
1	Direct RNA sequencing (1634116)
2	Length measurement of an RNA product (1989694)
3	Nuclease protection : Length measurement of a nuclease-protected complementary RNA or DNA fragment (2845126) (8294473)
4	RNA sequencing by primer extension : by dideoxy-terminated primer extension (3396543)
5	Sequencing of a full-length cDNA (8294473)
6	Primer extension : Length measurement of a primer extension product (10187799 , example) (9880555 , example)
7	DNA sequencing of a full-length processed pseudogene (3584116)
8	Reverse direction primer extension with homologous sequence ladder : Length measurement of an in vitro synthesised DNA primed upstream of the initiation site and blocked by the 5'end of the RNA hybridized to the template (2451027)
9	Rapid amplification of cDNA ends (RACE) (9116864)
10	RNA sequencing, type not specified
11	Oligo-capping : artificial capping of mRNA followed by sequencing of the 5' end of cDNA (11375929, 11337467 and examples)
12	Mammalian gene collection (MGC) full-length cDNA cloning (10521335 and example)
13	5' end confirmed by alignment of first 100 downstream nucleotides to EST database.
14	Oligo-capping: Berkeley Drosophila Genome Project (12537569)
15	Oligo-capping: Rice full-length cDNA cloning (12869764)

Special characters appended to the number codes designate an experimental gene expression system where the RNA for the corresponding experiments was synthesized.


*	RNA POL II in vitro system
o	injected amphibian oocytes
#	transfected or transformed cells, injected neurons
!	transgenic organisms

r	experiments performed with closely related gene
h	homologous sequence ladder used for length measurement of nuclease protection or primer extension product
l	low-precision data (error > +/- 5 bp)
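The two code tables above can be combined into a small lookup routine. The following Python sketch is an illustration only: it assumes that a single documentation token consists of a number code immediately followed by zero or more of the special characters listed above (for example "3h" or "6#"). This token shape, the function name decode_experiment and the shortened descriptions are assumptions made for the example and are not part of the EPD distribution.

```python
import re

# Number codes from the table above (descriptions shortened).
EXPERIMENTS = {
    1: "Direct RNA sequencing",
    2: "Length measurement of an RNA product",
    3: "Nuclease protection",
    4: "RNA sequencing by primer extension",
    5: "Sequencing of a full-length cDNA",
    6: "Primer extension",
    7: "DNA sequencing of a full-length processed pseudogene",
    8: "Reverse direction primer extension with homologous sequence ladder",
    9: "Rapid amplification of cDNA ends (RACE)",
    10: "RNA sequencing, type not specified",
    11: "Oligo-capping",
    12: "Mammalian gene collection (MGC) full-length cDNA cloning",
    13: "5' end confirmed by alignment to EST database",
    14: "Oligo-capping: Berkeley Drosophila Genome Project",
    15: "Oligo-capping: Rice full-length cDNA cloning",
}

# Special characters appended to the number codes.
MODIFIERS = {
    "*": "RNA POL II in vitro system",
    "o": "injected amphibian oocytes",
    "#": "transfected or transformed cells, injected neurons",
    "!": "transgenic organisms",
    "r": "experiments performed with closely related gene",
    "h": "homologous sequence ladder used for length measurement",
    "l": "low-precision data",
}

# Assumed token shape: digits followed by any number of modifier characters.
TOKEN = re.compile(r"^(\d+)([*o#!rhl]*)$")


def decode_experiment(token):
    """Return (experiment description, list of modifier descriptions)."""
    match = TOKEN.match(token)
    if match is None:
        raise ValueError("not a recognised experiment token: %r" % token)
    code = int(match.group(1))
    if code not in EXPERIMENTS:
        raise ValueError("unknown experiment code: %d" % code)
    return EXPERIMENTS[code], [MODIFIERS[c] for c in match.group(2)]
```

Under these assumptions, decode_experiment("6#") yields the primer extension description together with the "transfected or transformed cells" modifier. Real DO lines are free format, so any such routine should be checked against the entries actually being processed.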

Explanations and additional conventions:

The full-length assumption of a cDNA clone or a processed pseudogene is based on consistency with accompanying nuclease-protection or primer extension data or, alternatively, the existence of multiple 5' co-terminal clones or pseudogenes.

The information on expression/regulation may include indication of developmental stages, tissues, cell types, cell cycle stages, and various regulatory features. Conventions:

Semicolon delimits the two fields : expression and regulation.
Comma delimits alternative keywords (e.g. liver, kidney)
"+" means "induced by" or "strongly expressed in".
"-" means "repressed by" or "weakly expressed in".
"~" means "modulated by".
Cell cycle stages are given in square brackets.

4.3.3. RF line: Literature references

The first four references from the RN, RX, RA, RT and RL lines are repeated in a highly condensed form. Each reference occupies a field of 15 characters and indicates journal, volume, and starting page of the cited article (at most 14 characters). The journal codes are explained in Appendix B.2.

They primarily point to the articles where the experimental promoter evidence is presented. Additional potential subjects are homology to other promoters, gene expression and regulation, nomenclature. Papers containing only sequence data are usually not referred to because they are easy to find via the corresponding EMBL sequence entry descriptions.
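Because the RF line uses fixed-width fields, it can be split without any knowledge of the journal codes. The Python sketch below is a minimal illustration. It assumes that the line starts with the two-letter line code "RF" followed by three blanks, as in the ID and AC examples of this manual, and that the condensed references follow in consecutive 15-character fields. The function name split_rf_line and the reference strings used in the example are hypothetical.

```python
FIELD_WIDTH = 15   # each reference occupies 15 characters
PREFIX_LENGTH = 5  # assumed: "RF" plus three blanks


def split_rf_line(line):
    """Split an RF line into its condensed references (at most four)."""
    if not line.startswith("RF"):
        raise ValueError("not an RF line")
    payload = line[PREFIX_LENGTH:].rstrip("\n")
    fields = [
        payload[start:start + FIELD_WIDTH].strip()
        for start in range(0, len(payload), FIELD_WIDTH)
    ]
    # Drop empty trailing fields of entries with fewer than four references.
    return [field for field in fields if field]


# Hypothetical line built for illustration; the layout inside a field
# (journal code, volume, page) is an assumption.
example = "RF   " + "JBC259:5857".ljust(15) + "NAR14:10009".ljust(15)
references = split_rf_line(example)
```

Splitting on fixed offsets rather than on whitespace keeps the routine independent of whatever separators appear inside a field.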

4.3.4. Miscellaneous

Greek letters are sometimes represented by the corresponding Latin letters followed by an apostrophe:


a' = alpha	b' = beta	g' = gamma	d' = delta	e' = epsilon
z' = zeta	h' = eta	th'= theta	k' = kappa	l' = lambda
n' = nu	r' = rho

Sub- and superscripts are sometimes indicated by preceding "_" and "^", respectively.
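The Greek-letter convention above can be reversed mechanically when names are prepared for display. The following Python sketch is a heuristic illustration, not a tool shipped with EPD: it expands a listed letter followed by an apostrophe only when that letter is not preceded by another ASCII letter, which is an assumption introduced here to avoid touching ordinary English possessives. The function name expand_greek is hypothetical.

```python
import re

# Latin representations listed in the table above.
GREEK = {
    "a": "alpha", "b": "beta", "g": "gamma", "d": "delta", "e": "epsilon",
    "z": "zeta", "h": "eta", "th": "theta", "k": "kappa", "l": "lambda",
    "n": "nu", "r": "rho",
}

# "th" is tried before the single letters so that th' is read as theta.
# The look-behind (an assumption of this sketch) skips letters that are
# part of a longer word, e.g. the r in "promoter's".
_GREEK_TOKEN = re.compile(r"(?<![A-Za-z])(th|[abgdezhklnr])'")


def expand_greek(text):
    """Replace a', b', th', ... by the spelled-out Greek letter names."""
    return _GREEK_TOKEN.sub(lambda m: GREEK[m.group(1)], text)
```

Applied to abbreviations from appendix B.3, this turns TGF-b' into TGF-beta and g'GT into gammaGT, while position markers such as 5' are left unchanged because they contain no letter before the apostrophe.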

4.4. Distinct format of 'preliminary' entries in epd_bulk.dat

4.4.1. The title line:

TI   epd83     Bulk Section Eukaryotic Promoter Database / Release 83 EP

4.4.2. The ID line

The identification line is always the first line of an entry. The form of the ID line in 'epd_bulk.dat' is:

ID   OS_bAAAA     preliminary; undefined; TAXONOMIC DIVISION.

A unique entry identifier "OS_bAAAA" is constructed from the species identification code ('OS'), with at most 4 alphanumeric characters representing the biological source of the promoter, and a 'b' (for bulk) followed by an arbitrary 4-letter code.
The "preliminary" data class field indicates that the entry has not (yet) undergone all quality checks necessary for being classified as "standard".
The initiation site type is given as "undefined" because the data are insufficient to define transcription initiation patterns (Section 3).
The TAXONOMIC DIVISION codes are:

PLN for plants
NEM for nematodes
ART for arthropods
MLS for molluscs
ECH for echinoderms
VRT for vertebrates

The ID line is terminated by a period.
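The ID line layout described above is regular enough to be checked with a single pattern. The Python sketch below illustrates this under stated assumptions: the fields are separated by one or more blanks, the species code and the 4-letter code are matched as alphanumeric characters, and the example identifier "Hs_bAAAA" is invented for the test. The function name parse_bulk_id is hypothetical.

```python
import re

# Taxonomic division codes listed above.
DIVISIONS = {
    "PLN": "plants",
    "NEM": "nematodes",
    "ART": "arthropods",
    "MLS": "molluscs",
    "ECH": "echinoderms",
    "VRT": "vertebrates",
}

# ID   OS_bAAAA     preliminary; undefined; TAXONOMIC DIVISION.
ID_LINE = re.compile(
    r"^ID\s+(?P<species>[A-Za-z0-9]{1,4})_b(?P<code>[A-Za-z0-9]{4})"
    r"\s+preliminary;\s+undefined;\s+(?P<division>[A-Z]{3})\.$"
)


def parse_bulk_id(line):
    """Parse the ID line of a 'preliminary' entry from epd_bulk.dat."""
    match = ID_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a bulk ID line: %r" % line)
    fields = match.groupdict()
    if fields["division"] not in DIVISIONS:
        raise ValueError("unknown taxonomic division: " + fields["division"])
    fields["division_name"] = DIVISIONS[fields["division"]]
    return fields
```

A line that lacks the terminating period, or that carries a data class other than "preliminary", is rejected rather than silently accepted, which makes the routine usable as a simple format check.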

4.4.3. The AC line

AC   EP00001;

The accession number consists of the character string "EP" followed by 5 digits. Previously the first two digits of the AC designated the release number of initial appearance of the specific entry followed by the EPD entry order. AC numbers in 'epd_bulk.dat' are continuous numbers, excluding ACs already used for entries in the main file 'epd.dat'.

5 CLASSIFICATION

Starting from release 72, the classification system is no longer maintained. New entries are presently added by default to an '?Unclassified' category. The classification system might still provide valuable information for entries added before release 72. However, for any category, users should consider that additional, potentially corresponding EPD entries may exist in the default categories.

The entries of the Eukaryotic Promoter Database are embedded in a hierarchical classification system. A promoter's taxonomic location is made clear by interspersed group headings. The example shown below is taken from top of the database. A contrasting format has been chosen to emphasize the very different nature of this information.

*----------------------------------------------------------------------*
*    1. Plant promoters                                                *
*----------------------------------------------------------------------*
*    1.1. Chromosomal genes                                            *
*----------------------------------------------------------------------*
*    1.1.1. Small nuclear RNAs                                         *
*----------------------------------------------------------------------*

A group heading consists of a series of node numbers and a title. The highest classification level distinguishes between promoters active in major eukaryotic taxa (phyla). Further below, grouping considers replicon type and functional properties of gene products. On the lowest level, homology (as defined in section 6) is the criterion. A survey of the upper part of the classification pyramid is presented in appendix A. The proposed classification system has a highly tentative character as it is often unclear how a new promoter should be classified, especially if the gene product is a multifunctional protein. Users should therefore not be surprised or discouraged if they don't find a promoter at the initially expected place.
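Group headings of the form shown in the example can be separated from promoter entries and split into node numbers and title. The Python sketch below is an illustration based only on the three headings reproduced above. It assumes that every heading line starts and ends with an asterisk and that each node number is followed by a period. The function name parse_group_heading is hypothetical.

```python
import re

# "*    1.1.1. Small nuclear RNAs      *" -> node numbers and title.
HEADING = re.compile(r"^\*\s+((?:\d+\.)+)\s+(.*?)\s*\*$")


def parse_group_heading(line):
    """Return ((node numbers), title) or None if the line is no heading."""
    match = HEADING.match(line.rstrip())
    if match is None:
        return None  # separator lines and promoter entries end up here
    nodes = tuple(int(n) for n in match.group(1).rstrip(".").split("."))
    return nodes, match.group(2)


heading = parse_group_heading(
    "*    1.1.1. Small nuclear RNAs                                         *"
)
```

The number of node numbers gives the depth of a group in the classification hierarchy, so the parent of a group is obtained by dropping the last element of the tuple.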

6 HOMOLOGOUS PROMOTERS

Homology is defined as sequence similarity due to common phylogenetic origin. In EPD, two promoters are considered homologous if they exhibit >=50% sequence similarity between -79 and +20. Similarity is calculated from optimal alignments generated with the aid of the UWGCG subroutine ShiftAlign (13) using the following symbol comparison table:


A	C	G	N	T
1.0	0.0	0.0	0.5	0.0	A
	1.0	0.0	0.5	0.0	C
		1.0	0.5	0.0	G
			0.5	0.5	N
				1.0	T

Gap weight and gap length weight are specified as 3 and 0, respectively. Terminal gaps are ignored. Percent similarity is understood as alignment score divided by segment length, times 100. Groups of homologous promoters are identified by homology group numbers (see 4.2.1.). Definition of these groups is based on similarity scores as defined above and a tree generation method called UPGMA (14). In a few cases, similarities between 50% and 56% were ignored if the protein sequences of the corresponding genes were not related. Similarities were also ignored between alternative promoter sequences that are spaced by less than 50 bp. A subset of "independent" promoters is marked by "+" in column 27 of the FP line. This set contains only one member per homology group (usually, the promoter with the longest upstream sequence available) and is intended to be used for statistical analysis of functional patterns where it is important to avoid bias by multiples of closely related sequences.
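The scoring rules of this section can be illustrated on a pair of sequences that has already been aligned. The Python sketch below does not reproduce the UWGCG alignment routine (13) and does not search for the optimal alignment; it only scores a given alignment in which gaps are written as "-". It assumes that every internal gap costs the gap weight once plus the gap length weight per gap position, and it leaves the segment length to the caller. All function names are hypothetical.

```python
GAP_WEIGHT = 3.0         # charged once per internal gap
GAP_LENGTH_WEIGHT = 0.0  # charged per gap position


def symbol_score(a, b):
    """Symbol comparison table: identity 1.0, any pair with N 0.5, else 0.0."""
    a, b = a.upper(), b.upper()
    if a == "N" or b == "N":
        return 0.5
    return 1.0 if a == b else 0.0


def alignment_score(x, y):
    """Score two aligned, equally long strings; terminal gaps are ignored."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(x, y))
    start, end = 0, len(columns)
    while start < end and "-" in columns[start]:
        start += 1
    while end > start and "-" in columns[end - 1]:
        end -= 1
    score = 0.0
    gapped_side = None  # which sequence carries the gap currently open
    for a, b in columns[start:end]:
        if a == "-" or b == "-":
            side = 0 if a == "-" else 1
            if side != gapped_side:
                score -= GAP_WEIGHT
                gapped_side = side
            score -= GAP_LENGTH_WEIGHT
        else:
            score += symbol_score(a, b)
            gapped_side = None
    return score


def percent_similarity(x, y, segment_length):
    """Alignment score divided by segment length, times 100."""
    return 100.0 * alignment_score(x, y) / segment_length
```

For example, two aligned segments of length 5 that agree in four positions and pair an N with an A in the fifth reach a score of 4.5 and therefore 90% similarity, well above the 50% threshold used for homology groups.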

7 PROMOTER SEQUENCE RETRIEVAL

Promoter sequence listings have not been incorporated into EPD for two reasons: (i) to avoid duplication of data already existing elsewhere in the EMBL data library, and (ii) to encourage usage of FPS-dependent sequence retrieval programs, which enable users to specify suitable 5' and 3' boundaries of the requested sequence segments themselves. Effort is under way to motivate producers of standard nucleotide sequence analysis packages to provide such tools in the future. In the meantime, users with some programming experience will find it easy to write their own routines. Our local sequence extraction programs run in a UWGCG environment (13) and have been implemented at several sites in Europe and the United States. They are documented and freely available on request.

8 REFERENCES

1. Bucher, P. & Trifonov, E.N., Compilation and analysis of eukaryotic POL II promoter sequences, Nucl. Acids Res. 14, 10009-10026 (1986). (3808945)

2. Bucher, P. & Bryan, B., Signal search analysis: a new method to localize and characterize functionally important DNA sequences, Nucl. Acids Res. 12, 287-305 (1984). (6546421)

3. Stoesser, G., Tuli, M.A., Lopez, R. and Sterk, P., The EMBL nucleotide sequence database, Nucleic Acids Res., 27, 18-24 (1999). (9847133)

4. Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., Ouellette, B.F.F., Rapp, B.A. and Wheeler, D.L., GenBank, Nucleic Acids Res., 27, 12-17 (1999). (9847132)

5. Sugawara, H., Miyazaki, S., Gojobori, T. and Tateno, Y., DNA Data Bank of Japan dealing with large-scale data submission, Nucleic Acids Res., 27, 25-28 (1999). (9847134)

6. Bairoch, A. and Apweiler, R., The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999, Nucleic Acids Res., 27, 49-54 (1999). (9847139)

7. Heinemeyer, T., Chen, X., Karas, H., Kel, A.E., Kel, O.V., Liebich, I., Meinhardt, T., Reuter, I., Schacherer, F. and Wingender, E., Expanding the TRANSFAC database towards an expert system of regulatory molecular mechanisms, Nucleic Acids Res., 27, 318-322 (1999). (9847216)

8. The FlyBase consortium, The FlyBase database of the Drosophila genome projects and community literature, Nucleic Acids Res., 27, 85-88 (1999). (9847148)

9. Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P. and McKusick, V., The status of online Mendelian inheritance in man (OMIM) medio 1994, Nucleic Acids Res., 22, 3470-3473 (1994). (7937048)

10. Blake, J.A., Richardson, J.E., Davisson, M.T., Eppig, J.T. and the Mouse Genome Database Group, The Mouse Genome Database (MGD): genetic and genomic information about the laboratory mouse, Nucleic Acids Res., 27, 95-98 (1999). (9847150)

11. Suzuki, Y., Yamashita, R., Nakai, K. and Sugano, S., DBTSS: database of human transcriptional start sites and full-length cDNAs, Nucleic Acids Res., 30, 328-331 (2002). (11752328)

12. Strausberg, R.L., Feingold, E.A., Klausner, R.D. and Collins, F.S., The Mammalian Gene Collection, Science, 286, 455-457 (1999). (10521335)

13. Devereux, J., Haeberli, P. & Smithies, O., A comprehensive set of sequence analysis programs for the VAX, Nucl. Acids Res. 12, 387-395 (1984). (6546423)

14. Sneath, P.H.A. & Sokal, R.R., Numerical taxonomy, W.H. Freeman, San Francisco, London (1973).

15. Stapleton, M., Liao, G.C., Brokstein, P., Hong, L., Carninci, P., Shiraki, T., Hayashizaki, Y., Champe, M., Pacleb, J., Wan, K., Yu, C., Carlson, J., George, R., Celniker, S. and Rubin, G.M., The Drosophila Gene Collection: Identification of Putative Full-Length cDNAs for 70% of D. melanogaster Genes, Genome Res., 12, 1294-1300 (2002). (12176937)

16. Schmid, C.D., Praz, V., Delorenzi, M., Périer, R. and Bucher, P., The Eukaryotic Promoter Database EPD: the impact of in silico primer extension, Nucleic Acids Res., 32, D82-85 (2004). (14681364)

A. APPENDIX A : SURVEY OF RELEASE

B. APPENDIX B : CODES AND ABBREVIATIONS

B.1. SPECIES CODES


Code	Scientific name (English name)
AAV2	Adeno-associated virus 2
Ac	Aplysia californica (California sea hare)
AcNPV	Autographa californica nuclear polyhedrosis virus
Ad2	Human adenovirus type 2
Ad5	Human adenovirus type 5
Ad7	Human adenovirus type 7
Ad12	Human adenovirus type 12
Ag	Ateles geoffroyi (black-handed spider monkey)
ALV	Avian leukosis virus
Am	Antirrhinum majus (snapdragon)
Ab-MLV	Abelson murine leukemia virus
Apo	Antheraea polyphemus (polyphemus moth)
Ap	Anas platyrhynchos (mallard, domestic duck)
As	Avena sativa (oat)
At	Agrobacterium tumefaciens
Ath	Arabidopsis thaliana (thale cress)
Atr	Aotus trivirgatus (douroucouli)
Ay	Antheraea yamamai
B19	Human parvovirus B19
Be	Bertholletia excelsa (Brazil nut)
BKV	Papovavirus BKV
BLV	Bovine leukemia virus
Bm	Bombyx mori (silkworm)
Bn	Brassica napus (rape)
BPV1	Bovine papillomavirus type 1
Bt	Bos taurus (cattle)
CaMV	Cauliflower mosaic virus
Cco	Coturnix coturnix (quail)
Ce	Caenorhabditis elegans
Cg	Canavalia gladiata (sword bean)
Cgr	Cricetulus griseus (Chinese hamster)
Ch	Capra hircus (goat)
Cl	Canis lupus (gray wolf)
Cm	Cairina moschata (muscovy duck)
Cp	Cavia porcellus (domestic guinea pig)
Cpe	Cucurbita pepo (zucchini)
Ct	Chironomus thummi (midge)
Cte	Chironomus tentans
Dc	Daucus carota (carrot)
Df	Drosophila funebris (fruit fly)
Dh	Drosophila hydei (fruit fly)
DHBV	Duck hepatitis B virus
Dm	Drosophila melanogaster (fruit fly)
Dma	Drosophila mauritiana (fruit fly)
Dmo	Drosophila mojavensis (fruit fly)
Dmu	Drosophila mulleri (fruit fly)
Do	Drosophila orena (fruit fly)
Dp	Drosophila pseudoobscura (fruit fly)
Ds	Drosophila simulans (fruit fly)
Dse	Drosophila sechellia (fruit fly)
Dv	Drosophila virilis (fruit fly)
EBV	Human herpesvirus 4 (Epstein-Barr virus)
Ec	Equus caballus (horse)
FBJ-MSV	Murine osteosarcoma virus (Finkel-Biskis-Jinkins)
FBR-MSV	Murine osteosarcoma virus (Finkel-Biskis-Reilly)
F-MCF	Friend mink cell focus-forming virus (Murine)
Fs	Felis silvestris (wild cat)
F-SFFV	Friend spleen focus-forming virus
Ft	Flaveria trinervia
GA-FeLV	Gardner-Arnstein feline leukemia oncovirus B
GALV	Gibbon ape leukemia virus
Gg	Gallus gallus (chicken)
Ggo	Gorilla gorilla (gorilla)
Gm	Glycine max (soybean)
GSHV	Ground squirrel hepatitis virus
H-1	Parvovirus H1 (Murine)
Ha	Helianthus annuus (common sunflower)
Hb	Hevea brasiliensis (para rubber tree)
HBV	Human hepatitis B virus
HCMV	Human cytomegalovirus
Hg	Halichoerus grypus (grey seal)
HIV-1	Human immunodeficiency virus type 1
HIV-2	Human immunodeficiency virus type 2
HPV16	Human papillomavirus type 16
HPV18	Human papillomavirus type 18
Hs	Homo sapiens (human)
HSV-1	Human herpesvirus 1
HSV-2	Human herpesvirus 2
HTLV-I	Human T-cell leukemia virus type I
HTLV-II	Human T-cell leukemia virus type II
Hv	Hordeum vulgare (barley)
HVS	Herpesvirus saimiri
JCV	Human polyomavirus JCV
Le	Lycopersicon esculentum (tomato)
Leu	Lepus europaeus (European hare)
Lm	Locusta migratoria (migratory locust)
Lp	Lytechinus pictus (painted urchin)
Lpe	Lycopersicon peruvianum (Peruvian tomato)
Lv	Lytechinus variegatus (green urchin)
Ma	Mesocricetus auratus (golden hamster)
Mc	Macaca fascicularis (crab-eating macaque)
MCMV	Murine cytomegalovirus
MLV_AKV	AKV murine leukemia virus
MLVxeno	Xenotropic murine leukemia virus
Mm	Mus musculus (house mouse)
M-MLV	Moloney murine leukemia virus
M-MSV	Moloney murine sarcoma virus
MMTV	Mouse mammary tumor virus
Ms	Medicago sativa (alfalfa)
MSV	Maize streak virus
Np	Nicotiana plumbaginifolia (curled-leaved tobacco)
Ns	Nicotiana sylvestris (wood tobacco)
Nt	Nicotiana tabacum (common tobacco)
Nto	Nicotiana tomentosiformis
Oa	Ovis aries (sheep)
Oc	Oryctolagus cuniculus (rabbit)
Os	Oryza sativa (rice)
Ph	Petunia hybrida (e.g. Petunia strain Mitchell)
Pa	Papio anubis (olive baboon)
Pc	Petroselinum crispum (parsley)
Pl	Paracentrotus lividus (common urchin)
Pm	Psammechinus miliaris (sand urchin)
Polyoma	Mouse polyomavirus
Ppy	Photinus pyralis (North American firefly)
Pp	Pongo pygmaeus (orangutan)
Ps	Pisum sativum (pea)
Pt	Pan troglodytes (chimpanzee)
Pth	Pinus thunbergii (Japanese black pine)
Pv	Phaseolus vulgaris (kidney bean)
RAV2	Rous associated virus type 2 (Avian)
Rc	Ricinus communis (castor bean)
R-MCF	Rauscher mink cell focus-forming virus
Rn	Rattus norvegicus (Norway rat)
RSV	Rous sarcoma virus (Avian)
Sa	Sinapis alba (white mustard)
SA7P	Simian adenovirus (7P)
Sd	Strongylocentrotus droebachiensis
Se	Nannospalax ehrenbergi (Ehrenberg's mole-rat)
Sg	Oncorhynchus mykiss (rainbow trout)
SIV	Simian immunodeficiency virus
SNV	Spleen necrosis virus
So	Spinacia oleracea
Sp	Strongylocentrotus purpuratus
Spe	Sarcophaga peregrina
Sr	Sesbania rostrata
SRV-1	Simian AIDS retrovirus SRV-1
Ss	Sus scrofa (pig)
SSV	Simian sarcoma virus
St	Solanum tuberosum (potato)
Sv	Sorghum bicolor (sorghum)
SV40	Simian virus 40
Ta	Triticum aestivum (wheat)
Visna	Visna lentivirus
Xb	Xenopus borealis (Kenyan clawed frog)
Xl	Xenopus laevis (African clawed frog)
Xt	Xenopus tropicalis (western clawed frog)
Zm	Zea mays (maize)

B.2. JOURNAL CODES


Code	Journal name
ARB	Annual Review of Biochemistry
ARP	Annual Review of Physiology
BBA	Biochimica et Biophysica Acta
BBRC	Biochemical and Biophysical Research Communications
Bch	Biochemistry
Bchi	Biochimie
BchJ	Biochemical Journal
BCHS	Biological Chemistry Hoppe-Seyler
BrJR	British Journal of Rheumatology
BrainR	Brain Research
Btech	Biotechnology
CanR	Cancer Research
Cell	Cell
CGD	Cell Growth & Differentiation
Chrom	Chromosoma
CSHS	Cold Spring Harbor Symposia on Quantitative Biology
CTMI	Current Topics in Microbiology and Immunology
CurG	Current Genetics
DCB	DNA and Cell Biology
DevB	Developmental Biology
Diab	Diabetes
DNA	DNA
ECR	Experimental Cell Research
EJBc	European Journal of Biochemistry
EJCB	European Journal of Cell Biology
EMBOJ	EMBO Journal
EMBOR	EMBO Reports
Evo	Evolution
FEBS	FEBS Letters
GDev	Genes and Development
Gene	Gene
GChC	Genes Chromosomes Cancer
GnmR	Genome Research
Gnms	Genomics
Gnts	Genetics
HGEN	Human Genetics
IJCa	International Journal of Cancer
ImTo	Immunology Today
JBC	Journal of Biological Chemistry
JBch	Journal of Biochemistry
JCB	Journal of Cell Biology
JEM	Journal of Experimental Medicine
JGV	Journal of General Virology
JI	Journal of Immunology
JMAG	Journal of Molecular and Applied Genetics
JMB	Journal of Molecular Biology
JME	Journal of Molecular Evolution
JMEnd	Journal of Molecular Endocrinology
JNeSc	Journal of Neuroscience
JVir	Journal of Virology
MB	Molecular Biology
MBE	Molecular Biology and Evolution
MBM	Molecular Biology and Medicine
MBR	Molecular Biology Reports
MCB	Molecular and Cellular Biology
MCEnd	Molecular and Cellular Endocrinology
MEnd	Molecular Endocrinology
MImm	Molecular Immunology
MEnz	Methods in Enzymology
MGG	Molecular and General Genetics
MNeub	Molecular Neurobiology
MPMI	Molecular Plant-Microbe Interactions
NAR	Nucleic Acids Research
Nat	Nature
Oncg	Oncogene
OncR	Oncogene Research
Pla	Planta
PlJ	Plant Journal
PMB	Plant Molecular Biology
PSL	Plant Science Letters
RPHR	Recent Progress in Hormone Research
PNAS	Proceedings of the National Academy of Sciences of the United States of America
Sci	Science
SCMG	Somatic Cell and Molecular Genetics
TiG	Trends in Genetics
Vir	Virology
VirR	Virus Research

B.3. ABBREVIATIONS


1-25OH2D3	1,25-(OH)_2 vitamin D_3
20-OHE	20-Hydroxyecdysone
4CL	4-coumarate coenzyme A ligase
a1	Gene locus 1 involved in anthocyanin biosynthesis
abd-g.	Abdominal ganglion
abl	Abelson murine leukemia virus oncogene
ACC	1-aminocyclopropane-1-carboxylic acid
AChR	Acetylcholine receptor
ACP	b'-ketoacyl-acyl carrier protein of fatty acid synthase
ACTH	Adrenocorticotropic hormone
ADA	Adenosine deaminase
ADH	Alcohol dehydrogenase
ADPg-s GT	ADPglucose-starch glucosyltransferase
adult-HA	Adult hermaphrodite
AFW1	Adult fast-white (myosin heavy chain) 1
Ag	Antigen
(AGM)	"from african green monkey"
AGP	Acid glycoprotein
AGPP	ADP glucose pyrophosphorylase
AIRS	Aminoimidazole ribonucleotide synthase
ALA-synt.	5-Aminolevulinate synthase
ALDH_2	Aldehyde dehydrogenase 2
AlkExo	Alkaline exonuclease
Amy	Amylase
antp	"antennapedia" locus
aP2	Adipocyte homologue of myelin P2
apolipop.	Apolipoprotein
apoVLDLII	Very low density apolipoprotein II
APRT	Adenine phosphoribosyltransferase
AR	Adrenergic receptor
ARF	ADP-ribosylation factor
arg	Arginine
AS	Argininosuccinate synthetase
AS-C	"achaete-scute" complex locus
AspAT	Aspartate aminotransferase
ass.	Associated
AT	Antitrypsin
ATIII	Antithrombin III
ATCase	Aspartate transcarbamylase
ATP	Adenosine triphosphate
awd	"abnormal wing disk" locus
BB	Bowman-Birk (protease inhibitor)
BCKDHA	Branched-chain alpha-keto acid dehydrogenase complex
Bcl-2	B-cell leukemia/lymphoma 2 proto-oncogene
BMMC	Bone marrow-derived mast cell
BPTI	Bovine pancreatic trypsin inhibitor
BSF	B-cell stimulating factor
bsg25D	Blastoderm specific locus 25D
c-	Cellular protooncogene ..
c1	Regulatory locus of anthocyanin synthesis (maize)
C4BP	Complement component C4-binding protein
CA	Carbonic anhydrase
CAD	Carbamoyl-phosphate synthetase (glutamine-hydrolysing)/aspartate carbamoyl transferase/dihydroorotase
cab	Chlorophyll a/b-binding protein
cAMP	Cyclic AMP (adenosine monophosphate)
card-m.	Cardiac muscle
cc-ind.	Cell cycle-independent
CD3	T-cell differentiation antigen CD3
CD4	T-cell differentiation antigen CD4
CD8	T-cell differentiation antigen CD8
CEA	Carcinoembryonic antigen
CG	Chorionic gonadotropin
CNS	Central nervous system
CNTF	Ciliary neurotrophic factor
car.	Cartilage
col.	Collagen
conglyc.	Conglycinin
cor.	Cornea
cotyl.	Cotyledon
cp	Cytoplasm(ic)
CPS	Carbamyl-phosphate synthetase
CRF	Corticotropin-releasing factor
CRP	C-reactive protein
cs	Cytosol(ic)
CSF	Colony stimulating factor
cyt	Cytokinin gene (coding for isopentenyltransferase)
DAF	Decay-accelerating factor
dbp	DNA binding protein
DDC	DOPA decarboxylase
DDH	Dihydrodiol dehydrogenase
dep.	dependent
dev.	Development(ally)
DHFR	Dihydrofolate reductase
diff.	differentiation, differentiated
DL/R	Left and right duplicated region
dnc	"dunce" locus
dUTPase	Deoxyuridinetriphosphatase
E	1. Early, 2. Erythroid cell-specific
E8	Ethylene inducible gene during fruit ripening 8
EAS	5-epi-aristolochene synthase (sesquiterpene cyclase)
EBNA	Epstein-Barr virus nuclear antigens
ecd-ind.	Ecdysone-inducible
EDF	Eosinophil differentiation factor
EFW1	Embryonic fast-white (myosin heavy chain) 1
EGF	Epidermal growth factor
EIa	Adenovirus early Ia region (transactivating element)
Eip	Ecdysone-induced protein
ELH	Egg-laying hormone
em	Embryo, embryonic
epithel	epithelial or epithelium
EPSP	5-Enolpyruvylshikimate-3-phosphate
erbA,B	(Avian) erythroblastosis virus oncogene A,B
E-resp.	Estrogen-responsive
ERV3	Endogenous retrovirus 3
E.Tn	Early transposon
et-hypocot.	Etiolated hypocotyl
ev1	(Avian) endogenous virus 1
eve	"even-skipped" locus
exch.	Exchanger
f.	Factor
fib.	Fibers
fibrob.	Fibroblasts
FMRFamide	Phe-Met-Arg-Phe-NH(2) neuropeptide
FNR	Ferredoxin-(NADP+)-oxidoreductase
FBP	Folate Binding Protein
fos	FBJ (Finkel-Biskis-Jinkins) osteosarcoma virus oncogene
FSH	Follicle stimulating hormone
ftz	"fushi tarazu" locus
g.	Gene
G0S..	G0/G1 switch regulatory gene ..
G6PD	Glucose-6-phosphate dehydrogenase
GA	Gibberellic acid
GADPH	Glyceraldehyde-3-phosphate dehydrogenase
GARS	Glycinamide ribonucleotide synthase
Gart	"Gart" locus (-> GARS, AIRS, GART)
GART	Glycinamide ribonucleotide transformylase
gC	Glycoprotein C
G-CSF	Granulocyte colony stimulating factor
gD	Glycoprotein D
GdX	X-linked gene downstream of G6PD gene
gE	Glycoprotein E
GFAP	Glial fibrillary acidic protein
g'GT	g'-Glutamyl transpeptidase
gln	Glutamine
globul-12s	12s globulin (oat seed storage protein)
glucc	Glucocorticoid
GLUT1	Glucose transporter type 1
GM-CSF	Granulocyte/Macrophage colony stimulating factor
GnRH	Gonadotropin-releasing hormone
gp	Glycoprotein
GPD	Glycerol-3-phosphate dehydrogenase
GPT	UDP-GlcNAc:dolichol phosphate N-acetylglucosamine-1-phosphate transferase
granulo-c	Granulocyte
GRF	Growth hormone-releasing factor
GRP	Glycine-rich (cell wall) protein
GS17	Gastrula-specific transcript 17
GSHPx	Glutathione peroxidase
G-spec.	Gastrula-specific
GST	Glutathione S-transferase
H	1. Heavy chain, 2. Housekeeping-type promoter
Ha-ras	Rat-derived Harvey murine sarcoma virus oncogene
haptoblob	haptoglobin
hb	"hunchback" locus
Hc	High-cysteine (chorion protein)
HDC	L-histidine decarboxylase
hematop.	hematopoietic
HGT	High-(glycine+tyrosine) keratin
hist.	Histone
HMG-	High mobility group chromosomal protein
HMG-CoA	3-Hydroxy-3-methylglutaryl coenzyme A
HPRT	Hypoxanthine phosphoribosyltransferase
hs	Heatshock
hsc	Constitutive analogue of heatshock gene/protein
HSF	Hepatocyte-stimulating factor
hsp	Heatshock protein
Ht	Testicular histone
HTF	Restriction endonuclease HpaII tiny fragments
I-FABP	Intestinal fatty-acid binding protein
IAA	Indoleacetic acid
IAP	Intracisternal A-particles
ICP	Infected cell protein
IE	Immediate early (gene, RNA)
IF	Intermediate filament
IFI	Interferon-induced gene/protein
IFN	Interferon
Ig	Immunoglobulin
IGF	Insulin-like growth factor
IL	Interleukin
inf.	Infected
inh.	Inhibitor
iNOS	Inducible nitric oxide synthase
IRF	Interferon regulatory factor
ISG	Interferon-stimulated gene
k.	Kinase
keratino-c	Keratinocyte
Ki-ras	Rat-derived Kirsten murine sarcoma virus oncogene
L	1. Light chain; 2. Late
larva-1,2,..	First, second, .. instar larva
LAT..	Lycopersicon anther-specific gene ..
LCAT	Lecithin-cholesterol acyltransferase
lck	T-cell- or lymphocyte-specific tyrosine kinase
LDH	Lactate dehydrogenase
leghem.	Leghemoglobin
LeIF	Leukocyte interferon
leuko-c	Leukocyte
LH	Luteinizing hormone
LHC	Light-harvesting complex
LHRH	Luteinizing hormone-releasing factor
liv.	liver
LMW	Low molecular weight
LPH	Lipotropic hormone
LPS	Lipopolysaccharide
LTR	Long Terminal Repeat
lympho-c	Lymphocyte
lys	Lysosomal
MBP	Myelin basic protein
(MAC)	Macaque
MC	Methylcholanthrene
MCK	Muscle-specific creatine kinase
mGK	Submaxillary gland kallikrein
MHCI/MHCII	Class I/II transplantation antigens of major histocompatibility complex
MIF	Macrophage migration inhibitory factor
minipara	Miniparamyosin
mit	Mitochondrial
mono-c	Monocyte
mononuc-c.	Mononuclear cells
MOPC..	Mineral oil-induced plasmacytoma
mos	Moloney murine sarcoma virus oncogene
MP	Macrophage
MPC..	Mouse plasma cell tumor
MRP	MIF-related protein (see MIF)
MSF	Megakaryocyte stimulating factor
msp	Major sperm protein gene
MT	Metallothionein
mst	Male-specific transcript
MUP	Major urinary protein
myb	(Avian) myeloblastosis virus oncogene
myc	Myelocytomatosis virus 29 oncogene
NCA	nonspecific cross-reacting (with -> CEA) antigen
nerv. sys	Nervous system
neu	Ethylnitrosourea-induced rat neuroblastoma oncogene
neuropep.	Neuropeptide
NGF	Nerve growth factor
ninaE	"neither inactivation nor afterpotential" locus E
NMDH	NADP-malate dehydrogenase
NOS	Nitric oxide synthase
nos	Nopaline synthetase
NR	Nitrate reductase
N-ras	Neuroblastoma ras-like (-> Ha-ras) oncogene
NS	Nervous system
OAT	Ornithine aminotransferase
ocs	Octopine synthetase
ODC	Ornithine decarboxylase
Ori	Origin of replication
OTC	Ornithine transcarbamylase
ovalb.	Ovalbumin
p.	Protein
P-450	Cytochrome P-450
p53	53K phosphoprotein
panc.	pancreas, pancreatic
parath.	Parathyroid
PB	Phenobarbital
PBGD	Porphobilinogen deaminase
PCNA	Proliferating cell nuclear antigen
PDEase	cAMP phosphodiesterase
PDGF	Platelet-derived growth factor
PEPCase	Phosphoenolpyruvate carboxylase
PEPCK	Phosphoenolpyruvate carboxykinase
PG	Prostaglandin
PGK	3-Phosphoglycerate kinase
PHA	Phytohemagglutinin
PK	Protein kinase
P_L	Late promoter
PLP	Proteolipid protein
POL	Polymerase
POMC	Proopiomelanocortin
pp..	Phosphoprotein ..
PR1a	Pathogenesis-related protein 1a
PRBP	Plasma retinol-binding protein
PRL	Prolactin
prog.	Progesterone
prolyl 4-hydr.	Prolyl 4-hydroxylase
PrP	Prion protein
PSG1,PSG2,.	Pregnancy-specific glycoproteins 1,2,.
PSBP	Prostatic steroid binding protein
PSP	Parotid secretory protein
PTH	Parathyroid hormone
pTiN	Nopaline type tumor inducing plasmid
pTiO	Octopine type tumor inducing plasmid
r	"rudimentary" locus
R	1. Regulatory subunit, 2. Erythroid cell-specific
RAB	Gene responsive to ABA
ras	Homologue of -> Ha-ras, Ki-ras, etc.
rec.	Receptor
red.	Reductase
reg.	Regulated
rep-dep.	Replication-dependent
rig	Rat insulinoma gene
RnBP	Renin-binding protein
RNR1,RNR2	Ribonucleotide reductase large, small subunit
rp	Ribosomal protein
rTn	Retrotransposon
RuBPCss	Ribulose-1,5-bisphosphate carboxylase small subunit
RuBPCA	Ribulose-1,5-bisphosphate carboxylase/oxygenase activase
s.	Small
saliv-g.	Salivary gland
SBP	Spermine-binding protein
SC	Stem cells
sem-v.	Seminal vesicle
ser.	Serum
sgs	Salivary gland secretion protein
sis	Simian sarcoma virus oncogene
sk-m.	Skeletal muscle
skel-m.	Skeletal muscle
smooth-m.	Smooth muscle
snRNA	Small nuclear RNA
snRNP	Small nuclear ribonucleoprotein
SOD	Superoxide dismutase
som	Somatic
spat-reg.	Spatially regulated
Spec	Strongylocentrotus purpuratus ectoderm enriched RNA
SPI	Serine protease inhibitor
sry	"serendipity" locus
SV40T	Tumor antigen of simian virus 40 (SV40)
SVS	Seminal vesicle secretory protein
synt.	Synthase
T3d'	T-cell antigen receptor-associated T3-complex delta chain
TAT	Tyrosine aminotransferase
TCDD	2,3,7,8-Tetrachlorodibenzo-p-dioxin
TCGF	T-cell growth factor
TCR	T-cell receptor
TdT	Terminal deoxynucleotidyltransferase
test.	testis
TF	Transcription factor
TGA1a	TGACG-specific DNA-binding protein 1a
TGF-b'	Transforming growth factor beta
TH	Tyrosine hydroxylase
thyr.	Thyroxine
Thy-1.2	Thy-1 (thymocyte) antigen/glycoprotein allotype 2
TIF	Trans-inducing factor
TIM	Triosephosphate isomerase
tis.	Tissue
TM	Tropomyosin
tmr	"tumor morphology root" locus
TNF	Tumor necrosis factor
TnI	Troponin I (inhibitory subunit)
TnT	Troponin T (tropomyosin-binding subunit)
TO	Tryptophan oxygenase
TP1,TP2,.	Transition protein 1,2,.
TPA	12-O-tetradecanoyl-phorbol-13-acetate
TPI	Triosephosphate isomerase
tr.,tr-	Transcript
TRF	T-cell replacing factor
TRH	Thyrotropin-releasing hormone
TS	Thymidylate synthetase
TSH	Thyroid stimulating hormone
T/t	Large/small T(tumor) antigen
Ubx	"ultrabithorax" locus
uPA	Urine plasminogen activator
URO-D	Uroporphyrinogen decarboxylase
Vg1	Vegetal hemisphere-specific mRNA 1
vir-inf.	Viral infection
VL30	Retrovirus-like 30s RNA
VLDL	Very low density lipoprotein
V_NP	(Immunoglobulin heavy chain) variable region specific for 4-hydroxyl-3-nitrophenacetyl
VP5	Virion protein 5 (HSV-1/2: =major capsid protein)
VSP	Virion stimulatory protein
vWf	von Willebrand factor
Zen	"zerknuellt" protein
