TSS assembly pipeline for Hs_EPDnew_005
This document provides a technical description of the transcription
start site (TSS) assembly pipeline that was used to generate EPDnew
version 005 for H. sapiens (genome assembly Dec 2013 GRCh38/hg38).
Assembly pipeline overview
Description of procedures and intermediate data files
1. UCSC Download
Data were downloaded from the UCSC Table Browser (30-06-2017) selecting
the following attributes:
Then, transcripts were filtered according to the following rules:
- Transcripts of protein coding genes only (Ensembl
- Transcripts must have a RefSeq protein ID
Gene names were taken from the field "Associated Gene Name". Since
the EPD format does not allow gene names longer than 18 characters,
we checked whether the names respected this limitation.
Transcripts with the same TSS position were merged under a common
ID. As a consequence of this the total number of TSS in the list
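The name check and the merging of transcripts that share a TSS position can be sketched as follows. This is a minimal illustration, not the code used for EPDnew: the record layout (keys `id`, `gene`, `chrom`, `strand`, `tss`) and the function names are hypothetical.

```python
from collections import defaultdict

MAX_GENE_NAME_LEN = 18  # limit imposed by the EPD format (see text above)


def names_too_long(transcripts):
    """Return the gene names that exceed the EPD length limit."""
    return sorted({t["gene"] for t in transcripts
                   if len(t["gene"]) > MAX_GENE_NAME_LEN})


def merge_by_tss(transcripts):
    """Group transcripts with identical (chrom, strand, tss) into one entry."""
    groups = defaultdict(list)
    for t in transcripts:
        groups[(t["chrom"], t["strand"], t["tss"])].append(t["id"])
    return [
        {"chrom": c, "strand": s, "tss": p, "ids": sorted(ids)}
        for (c, s, p), ids in sorted(groups.items())
    ]


# Hypothetical input records, for illustration only.
example = [
    {"id": "tx1", "gene": "GENE_A", "chrom": "chr1", "strand": "+", "tss": 1000},
    {"id": "tx2", "gene": "GENE_A", "chrom": "chr1", "strand": "+", "tss": 1000},
    {"id": "tx3", "gene": "GENE_A", "chrom": "chr1", "strand": "+", "tss": 1450},
]
merged = merge_by_tss(example)
```

Here `tx1` and `tx2` collapse into a single entry, so the merged list is shorter than the input list.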
2. Gencode TSS collection
The Gencode TSS collection is stored as a tab-delimited text file
conforming to the SGA format.
The six fields in the file contain the following kinds of
information:
- NCBI/RefSeq chromosome id
- strand ("+" or "-")
Note that the second and fourth fields are invariant.
3. Data import from ENCODE and FANTOM5 CAGE
CAGE tag data were downloaded from the UCSC ftp site and the FANTOM5
http site (see links above). The source files are in BAM format,
mapped on the hg19 genome assembly. Samples were lifted over to the
hg38 genome assembly using the liftOver tool. The complete list of
files can be found here for ENCODE. BAM files were converted into
BED files with the bamToBed program. Files were kept and analysed
individually.
4. CAGE tags
The compressed versions of these files are available from the MGA
archive (see links above).
5. mRNA 5' tags peak calling
Peak calling for each individual CAGE data file was carried out
using the ChIP-Peak on-line tool with the following parameters:
- Window width = 1
- Vicinity range = 200
- Peak refine = N
- Count cutoff = 9999999
- Threshold = 5
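To make the role of these parameters concrete, here is a simplified sketch of threshold-based peak calling on one chromosome and strand. It is not the ChIP-Peak implementation: the greedy suppression of weaker positions within the vicinity range is an assumption about how such a tool behaves, and the function name and input layout are hypothetical.

```python
def call_peaks(tag_counts, window=1, vicinity=200,
               count_cutoff=9999999, threshold=5):
    """Return peak positions from a {position: tag count} mapping.

    Assumed behaviour (not taken from ChIP-Peak itself):
    - per-position counts are capped at count_cutoff,
    - counts are summed in a window of `window` bp starting at each position,
    - positions scoring >= threshold are candidates,
    - the best candidate suppresses weaker ones within `vicinity` bp.
    """
    capped = {p: min(c, count_cutoff) for p, c in tag_counts.items()}
    scores = {
        p: sum(capped.get(q, 0) for q in range(p, p + window))
        for p in capped
    }
    candidates = [p for p in scores if scores[p] >= threshold]
    # Strongest first; ties broken by the lower coordinate.
    candidates.sort(key=lambda p: (-scores[p], p))
    peaks = []
    for p in candidates:
        if all(abs(p - q) > vicinity for q in peaks):
            peaks.append(p)
    return sorted(peaks)


# Hypothetical tag counts, for illustration only.
counts = {100: 10, 150: 7, 400: 6, 900: 3}
peaks = call_peaks(counts)
```

With a window width of 1, each position is scored on its own tag count, so peaks are called at single-base resolution. The very large count cutoff effectively disables capping.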
6. TSS validation and shifting
Each sample in the collection (mRNA peaks and Gencode TSS) was then
processed in a pipeline aimed at validating transcription start
sites with mRNA peaks. A Gencode TSS was considered experimentally
confirmed if a CAGE peak lay within a window of 300 bp around it, or
mapped in the 5'UTR region, and had a maximum height of at least 5
tags (50 tags for peaks in the 5'UTR). The validated TSS was then
shifted to the nearest base with the highest tag density.
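The validation rule can be illustrated with the sketch below. Several details are assumptions not stated in the text: the 300 bp window is taken as centred on the TSS (150 bp on each side), the 5'UTR is represented by its last base `utr_end`, and when several peaks qualify the highest one is chosen. The function name and argument layout are hypothetical.

```python
def validate_tss(tss, strand, utr_end, peaks,
                 window=300, min_tags=5, min_utr_tags=50):
    """Return the shifted TSS position, or None if the TSS is not validated.

    peaks: iterable of (position, height) pairs on the same chromosome
    and strand as the annotated TSS.
    """
    half = window // 2  # assumption: window centred on the annotated TSS
    best = None
    for pos, height in peaks:
        near = abs(pos - tss) <= half
        if strand == "+":
            in_utr = tss < pos <= utr_end
        else:
            in_utr = utr_end <= pos < tss
        accepted = (near and height >= min_tags) or \
                   (in_utr and height >= min_utr_tags)
        if accepted and (best is None or height > best[1]):
            best = (pos, height)
    return best[0] if best else None


# Hypothetical example: annotated TSS at 1000 on the plus strand.
shifted = validate_tss(
    tss=1000, strand="+", utr_end=1500,
    peaks=[(1100, 8), (1300, 20), (1400, 60), (2000, 100)],
)
```

In this example the peak at 1300 is rejected (inside the 5'UTR but below 50 tags), the peak at 2000 lies outside both regions, and the TSS is moved to 1400, the highest accepted peak.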
7. Gencode not-validated TSS
The total number (summing up all samples) of TSS that were not
experimentally validated was around 3000.
8. Promoter collection for each sample
Each sample in the dataset was used to generate a separate
promoter collection. Potentially, the same transcript could be
validated by multiple samples and it could have different start
sites in different samples. To avoid redundancy, the individual
collections were used as input for an additional step in the
analysis (Assembly pipeline part B).
9. Quality controls of sample-specific promoter collections
The quality of the promoter collection derived from each sample was
tested in order to exclude low-quality samples from the final
collection. To achieve this, each promoter collection was scored
according to the distribution of the TATA-box and Initiator (Inr)
motifs in the expected positions (-29 bp from the TSS and at the
TSS, respectively). Samples with very low motif frequencies (Inr
frequency < 10% and TATA-box < 5%) were discarded (9 samples in
total) from further analyses.
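The sample filter can be written as a small predicate. This sketch reads the rule literally, discarding a sample only when both frequencies fall below their cutoffs; the motif scan that decides whether a promoter carries a motif at the expected position is not shown, and the function names are hypothetical.

```python
def motif_frequency(flags):
    """Fraction of promoters flagged as carrying the motif."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def keep_sample(inr_freq, tata_freq, min_inr=0.10, min_tata=0.05):
    """Return False for samples with very low Inr and TATA-box frequencies."""
    return not (inr_freq < min_inr and tata_freq < min_tata)


# Hypothetical per-promoter flags for one sample.
inr = motif_frequency([True, False, False, False])
tata = motif_frequency([False, False, False, True])
sample_ok = keep_sample(inr, tata)
```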
10. Merging collections and second TSS selection
The good-quality promoter collections were merged into a single
file and further analysed. The promoter of a transcript was
maintained in the list only if it was validated by at least two
samples. Transcripts validated by multiple samples could
potentially have their TSS spread over a broader region rather than
set on a single position. To avoid such inconsistency, for each
transcript we selected the position that was validated by the
largest number of samples as the true TSS.
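A minimal sketch of this selection step is given below. Each sample-specific collection is modelled as a mapping from transcript ID to validated TSS position; that layout, the function name, and the tie-break (lowest coordinate) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def select_tss(sample_collections, min_samples=2):
    """Pick, per transcript, the position supported by the most samples.

    Returns {transcript_id: (position, number_of_supporting_samples)}.
    Transcripts validated by fewer than min_samples samples are dropped.
    """
    support = defaultdict(Counter)
    for collection in sample_collections:
        for transcript_id, pos in collection.items():
            support[transcript_id][pos] += 1
    selected = {}
    for transcript_id, counts in support.items():
        if sum(counts.values()) < min_samples:
            continue
        # Most supported position; ties go to the lower coordinate (assumption).
        pos, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[transcript_id] = (pos, n)
    return selected


# Hypothetical collections from three samples.
samples = [
    {"T1": 100, "T2": 500, "T3": 900},
    {"T1": 100, "T2": 505},
    {"T1": 102},
]
selected = select_tss(samples)
```

`T3` is validated by a single sample and is therefore dropped, while `T1` keeps position 100, which two of its three validating samples agree on.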
Transcription start sites that mapped close to other TSS
belonging to the same gene (500 bp window) were merged into a
single promoter following the same rule: the promoter that was
validated by the highest number of samples was kept.
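The merging of nearby TSS can be sketched as a greedy pass over the promoters of each gene, strongest first. The interpretation of the 500 bp window as a maximum distance between two TSS, the record layout (keys `gene`, `pos`, `support`), and the function name are assumptions; all promoters of a gene are taken to lie on the same chromosome and strand.

```python
def merge_nearby(promoters, window=500):
    """Keep, among TSS of the same gene within `window` bp, the best supported."""
    kept = []
    # Process the best supported promoters first; ties by lower coordinate.
    for p in sorted(promoters, key=lambda p: (-p["support"], p["pos"])):
        clash = any(
            k["gene"] == p["gene"] and abs(k["pos"] - p["pos"]) <= window
            for k in kept
        )
        if not clash:
            kept.append(p)
    return sorted(kept, key=lambda p: (p["gene"], p["pos"]))


# Hypothetical promoters, for illustration only.
promoters = [
    {"gene": "GENE_A", "pos": 1000, "support": 12},
    {"gene": "GENE_A", "pos": 1300, "support": 30},
    {"gene": "GENE_A", "pos": 5000, "support": 4},
    {"gene": "GENE_B", "pos": 1100, "support": 7},
]
final = merge_nearby(promoters)
```

The two `GENE_A` promoters at 1000 and 1300 collapse into the better supported one at 1300, while the distant promoter at 5000 and the `GENE_B` promoter are retained.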
12. Final EPDnew collection
The 25503 experimentally validated promoters were stored in the
EPDnew database, which can be downloaded from our ftp
site. Scientists are welcome to use our other tools, ChIP-Seq
(for correlation analysis) and SSA (for motif analysis
around promoters), to analyse the EPDnew database.