TSS assembly pipeline for Mm_EPDnew_002
Introduction
This document provides a technical description
of the transcription start site assembly pipeline that was used to
generate EPDnew version 002 for
M. musculus.
Source Data
Promoter collection:
Name | Genome Assembly | Promoters | Genes | PMID | Access data
UCSC Known Genes | July 2007 NCBI37/mm9 | 25221 | 19378 | 26590259 | SOURCE, DOC, DATA
Experimental data:
Assembly pipeline overview
Description of procedures and intermediate data files
1. UCSC Download
Data were downloaded from the
UCSC Table Browser on 15 May 2014.
Transcripts were then filtered according to the following rules:
- Transcripts of protein-coding genes only
- Transcript lies on a full chromosome
- Gene must be annotated [Associated Gene Name present]
- Gene and transcript status known
Gene names were taken from the field "Associated Gene Name". Since the
EPD format doesn't allow gene names longer than 18 characters,
we checked whether the names respected this limitation.
Transcripts with the same TSS position were merged under a common
ID. As a consequence of this and of the filters, from the 55420
transcripts originally present in the UCSC database, ~30000 were
merged, leaving 67440 uniquely mapped promoters in the input list.
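The name check and the merging of transcripts that share a TSS can be sketched as follows. This is a minimal illustration, not the script used for EPDnew: the record layout (transcript ID, gene name, chromosome, strand, TSS position) and the identifiers `tx1`, `tx2`, `GeneA` are hypothetical.

```python
from collections import defaultdict

MAX_GENE_NAME_LEN = 18  # EPD format limit stated in the text


def merge_by_tss(transcripts):
    """Group transcript IDs that share chromosome, strand and TSS position.

    `transcripts` is an iterable of
    (transcript_id, gene_name, chrom, strand, tss) tuples; this layout is
    an assumption made for the example.
    """
    groups = defaultdict(list)
    for transcript_id, gene_name, chrom, strand, tss in transcripts:
        if len(gene_name) > MAX_GENE_NAME_LEN:
            raise ValueError(f"gene name too long for EPD: {gene_name}")
        groups[(chrom, strand, tss)].append(transcript_id)
    return dict(groups)


# Made-up records: tx1 and tx2 start at the same base and are merged.
example = [
    ("tx1", "GeneA", "chr1", "-", 1000),
    ("tx2", "GeneA", "chr1", "-", 1000),
    ("tx3", "GeneB", "chr1", "-", 4000),
]
merged = merge_by_tss(example)
```

Each key of `merged` corresponds to one promoter in the input list, and its value lists the transcripts collapsed under it.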
2. UCSC TSS collection
The UCSC TSS collection is stored as a tab-delimited text file
conforming to the SGA format under the name:
The six fields contain the following kinds of information:
- NCBI/RefSeq chromosome id
- "TSS"
- position
- strand ("+" or "-")
- "1"
- gene name
Note that the second and fifth fields are invariant.
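As an illustration of the six-field layout listed above, the following sketch writes and reads one such line. The function names are hypothetical, and the chromosome accession, position and gene name in the example are placeholder values, not records from the collection.

```python
def format_sga_line(chrom_id, position, strand, gene_name):
    """Build one six-field, tab-delimited line for a TSS."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # The literal fields "TSS" and "1" are the constant ones in this file.
    return "\t".join([chrom_id, "TSS", str(position), strand, "1", gene_name])


def parse_sga_line(line):
    """Split a line back into typed fields."""
    chrom_id, feature, position, strand, count, gene_name = (
        line.rstrip("\n").split("\t")
    )
    return chrom_id, feature, int(position), strand, int(count), gene_name


# Placeholder values for illustration only.
line = format_sga_line("NC_000067.5", 1000, "+", "GeneA")
fields = parse_sga_line(line)
```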
3. Data import from DBTSS7
Solexa tag data were downloaded from the DBTSS ftp site (see link above).
The source files are the following:
- 3t3_data.tab.gz: Mouse 3T3 Solexa tag mapping data;
According to the readme file included in the ftp archive, the 5' end
tags were mapped to the mouse genome assembly mm9. The source format is a
non-standard tab-delimited format that was converted to SGA with
an ad hoc Perl script. All tissues were merged into a single
file.
4. Data import from FANTOM5
BAM files for high-quality CAGE samples (hCAGE) were downloaded from the FANTOM5 ftp site (see link above).
Files were then converted into SGA format using in-house software. This collection contains a total of 339 samples. Individual SGA files can be downloaded from our ftp website (link above).
5. mRNA 5' tags peak calling
For each of the 340 individual samples, peak calling was
carried out using the
ChIP-Peak
on-line tool with the following parameters:
- Window width = 200
- Vicinity range = 200
- Peak refine = Y
- Count cutoff = 9999999
- Threshold = 5
6. TSS validation and shifting
Each sample in the collection (mRNA peaks and UCSC TSS) was then
processed separately in a pipeline aimed at validating transcription
start sites with mRNA peaks. A UCSC TSS was considered experimentally
confirmed if an mRNA peak lay in a window of 500 bp around it. The validated
TSS was then shifted to the nearest base with the highest tag
density.
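A minimal sketch of the validation and shifting rule is given below. It is not the pipeline code: it assumes that the 500 bp window is centred on the annotated TSS (250 bp on each side), that peaks are given as (position, tag count) pairs on the same chromosome and strand, and that ties in tag count go to the peak closest to the annotated TSS. None of these details are specified in the text.

```python
def validate_and_shift(tss, peaks, window=500):
    """Return the shifted TSS position, or None if no peak supports the TSS.

    `peaks` is a list of (position, tag_count) pairs. The window is
    assumed to be centred on `tss` (window // 2 bases on each side).
    """
    half = window // 2
    nearby = [(pos, count) for pos, count in peaks if abs(pos - tss) <= half]
    if not nearby:
        return None  # TSS not validated in this sample
    # Highest tag count wins; ties go to the peak closest to the TSS.
    best_pos, _ = max(nearby, key=lambda p: (p[1], -abs(p[0] - tss)))
    return best_pos


# Made-up peaks: two fall inside the window, one far outside.
shifted = validate_and_shift(1000, [(990, 12), (1100, 30), (2000, 99)])
not_validated = validate_and_shift(1000, [(2000, 99)])
```

Here `shifted` is the position of the strongest peak inside the window, and `not_validated` is `None` because the only peak lies outside it.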
7. UCSC not-validated TSS
The total number of TSSs that were not experimentally validated (summing over all samples) was around 3000.
8. Promoter collection for each sample
Each sample in the dataset was used to generate a separate
promoter collection. Potentially, the same transcript could be
validated by multiple samples and it could have different start
sites in different samples. To avoid redundancy, the individual
collections were used as input for an additional step in the
analysis (Assembly pipeline part B).
9. Merging collections and second TSS selection
The 340 promoter collections were merged into a single file and
further analysed. The promoter of a transcript was maintained in
the list only if validated by at least two samples. Transcripts
validated by multiple samples could potentially have the TSS set
on a broader region rather than at a single position. To avoid such
inconsistency, for each transcript we selected the position that
was validated by the largest number of samples as the true TSS.
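The selection rule can be sketched as a vote across samples. The in-memory layout (one dictionary per sample, mapping a transcript ID to its validated TSS position) and the tie-breaking rule are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def select_tss(sample_collections, min_samples=2):
    """Pick one TSS per transcript from per-sample promoter collections.

    `sample_collections` is a list of dicts, one per sample, mapping a
    transcript ID to the TSS position validated in that sample
    (hypothetical layout).
    """
    votes = defaultdict(Counter)
    for collection in sample_collections:
        for transcript_id, position in collection.items():
            votes[transcript_id][position] += 1

    selected = {}
    for transcript_id, counter in votes.items():
        if sum(counter.values()) < min_samples:
            continue  # validated by fewer than two samples: dropped
        # Position supported by the most samples; ties go to the lowest
        # coordinate (an arbitrary choice, not stated in the text).
        position, _ = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[transcript_id] = position
    return selected


# Made-up data: tx1 is seen in three samples, tx2 in only one.
samples = [{"tx1": 100, "tx2": 500}, {"tx1": 102}, {"tx1": 100}]
result = select_tss(samples)
```

In this example `tx1` keeps position 100, which two of its three samples agree on, while `tx2` is removed because a single sample supports it.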
10. Filtering
Transcription start sites that mapped close to other TSSs
belonging to the same gene (500 bp window) were merged into a
single promoter following the same rule: the promoter that was
validated by the highest number of samples was kept.
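One way to implement this filter is a greedy pass over the promoters of each gene, visiting the best-supported sites first. The sketch below is an assumption about how such a filter could work, not the actual EPDnew code; the tuple layout and the interpretation of the 500 bp window as a minimum distance between retained promoters are hypothetical.

```python
from collections import defaultdict


def filter_close_promoters(promoters, window=500):
    """Keep the best-supported promoter among same-gene TSSs closer than `window`.

    `promoters` is a list of (gene, chrom, position, n_samples) tuples
    (hypothetical layout).
    """
    by_gene = defaultdict(list)
    for gene, chrom, position, n_samples in promoters:
        by_gene[(gene, chrom)].append((position, n_samples))

    kept = []
    for (gene, chrom), sites in by_gene.items():
        accepted = []
        # Visit the best-supported sites first so they win over neighbours.
        for position, n_samples in sorted(sites, key=lambda s: (-s[1], s[0])):
            if all(abs(position - other) >= window for other, _ in accepted):
                accepted.append((position, n_samples))
        kept.extend((gene, chrom, pos, n) for pos, n in sorted(accepted))
    return kept


# Made-up promoters: the two GeneA sites 200 bp apart collapse into one.
promoters = [
    ("GeneA", "chr1", 1000, 10),
    ("GeneA", "chr1", 1200, 40),
    ("GeneA", "chr1", 5000, 3),
    ("GeneB", "chr1", 1100, 2),
]
filtered = filter_close_promoters(promoters)
```

Distant promoters of the same gene, such as the one at position 5000 in the example, are retained as alternative promoters.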
11. Final EPDnew collection
The 21201 experimentally validated promoters were stored in the
EPDnew database, which can be downloaded from our ftp
site. Scientists are welcome to use our other tools,
ChIP-Seq (for
correlation analysis) and
SSA (for motif analysis
around promoters), to analyse the EPDnew database.