TSS assembly pipeline for Am_EPDnew_001

Introduction

This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 001 for A. mellifera.

Source Data

Promoter collection:

Name          Genome Assembly          Promoters  Genes  PMID      Access data
RefSeq Genes  Apr 2011 Amel_4.5/amel5  17735      10727  22121212  SOURCE DOC DATA

Experimental data:

Name                 Type      Samples  Tags        PMID      Access data
Khamis et al., 2015  CAGEscan  16       70,802,351  26073445  SOURCE DOC DATA

Assembly pipeline overview

Description of procedures and intermediate data files

1. Download annotated TSS

Data were downloaded from RefSeq on 20-07-2016. Transcripts were filtered according to the following rules:
  1. Transcripts of protein coding genes only
  2. Transcripts have a non-empty description field
Gene names were taken from the field "Locus ID". Since the EPD format does not allow gene names longer than 18 characters, we checked whether the names respected this limit.
A total of 17735 promoters were selected.
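
The two filtering rules and the name-length check can be sketched as follows. This is a minimal illustration, not the code used to build EPDnew: the record layout (dictionaries with the keys `locus_id`, `biotype` and `description`) and the function names are hypothetical.

```python
from typing import Dict, List

MAX_NAME_LEN = 18  # EPD format limit on gene name length (from the text)

def select_transcripts(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep protein-coding transcripts that have a non-empty description."""
    kept = []
    for rec in records:
        # Rule 1: transcripts of protein-coding genes only
        # (the "biotype" key and its value are assumptions).
        if rec.get("biotype") != "protein_coding":
            continue
        # Rule 2: the description field must not be empty.
        if not rec.get("description", "").strip():
            continue
        kept.append(rec)
    return kept

def names_too_long(records: List[Dict[str, str]]) -> List[str]:
    """Return the gene names that exceed the EPD length limit."""
    return [r["locus_id"] for r in records if len(r["locus_id"]) > MAX_NAME_LEN]
```

An empty list returned by `names_too_long` means that every selected name fits the EPD format.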

2. RefSeq TSS collection

The RefSeq TSS collection is stored as a tab-delimited text file conforming to the SGA format under the name:
    amel5TssFromRefSeq.sga
The six fields contain the following kinds of information:
  • NCBI/RefSeq chromosome id
  • "TSS"
  • position
  • strand ("+" or "-")
  • "1"
  • Locus ID
Note that the second and fifth fields are invariant.
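
As an illustration of the six-field layout, the following sketch parses one line of such a file. The names `SgaRecord` and `parse_sga_line` are hypothetical, and the sequence id and gene name in the usage line are made-up placeholders rather than entries from amel5TssFromRefSeq.sga.

```python
from typing import NamedTuple

class SgaRecord(NamedTuple):
    seq_id: str    # NCBI/RefSeq chromosome id
    feature: str   # "TSS" in this collection
    position: int
    strand: str    # "+" or "-"
    count: int     # "1" in this collection
    locus_id: str

def parse_sga_line(line: str) -> SgaRecord:
    """Split one tab-delimited SGA line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    seq_id, feature, position, strand, count, locus_id = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(seq_id, feature, int(position), strand, int(count), locus_id)

# Placeholder values for illustration only.
record = parse_sga_line("NC_000000.0\tTSS\t12345\t+\t1\tGeneA\n")
```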

3. Import CAGE data

Data were imported from GEO in SRA format. Raw sequence files were mapped to the Amel_4.5 genome assembly using Bowtie. The resulting BAM files were converted to SGA format using ChIP-Convert.

5. mRNA 5' tags peak calling

For each of the 8 individual samples, peak calling on the merged file was carried out using the ChIP-Peak on-line tool with the following parameters:
  • Window width = 200
  • Vicinity range = 200
  • Peak refine = Y
  • Count cutoff = 9999999
  • Threshold = 5
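
To give an idea of what window-based peak calling does with parameters of this kind, here is a simplified sketch. It is not the ChIP-Peak implementation: the function `call_peaks` is hypothetical, peak refinement is omitted, and the way each parameter is interpreted (tags summed in a window centred on each position, a per-position cap on counts, one local maximum per vicinity range) is an assumption that should be checked against the ChIP-Peak documentation.

```python
from typing import Dict, List, Tuple

def call_peaks(counts: Dict[int, int], window: int = 200, vicinity: int = 200,
               threshold: int = 5, cutoff: int = 9999999) -> List[Tuple[int, int]]:
    """counts maps position -> tag count on one strand of one sequence.

    Returns (position, window_score) pairs for positions whose window score
    reaches the threshold and is the local maximum within the vicinity range.
    """
    # Cap the contribution of any single position (assumed role of the cutoff).
    capped = {p: min(c, cutoff) for p, c in counts.items()}
    half = window // 2
    positions = sorted(capped)
    # Score of a position = capped tags within half a window on either side.
    scores = {p: sum(c for q, c in capped.items() if abs(q - p) <= half)
              for p in positions}
    peaks = []
    for p in positions:
        s = scores[p]
        if s < threshold:
            continue
        # Local maximum test; ties are resolved in favour of the lower position.
        if all((scores[q], -q) <= (s, -p)
               for q in positions if abs(q - p) <= vicinity):
            peaks.append((p, s))
    return peaks
```

The quadratic scan keeps the sketch short; a real implementation would use a sliding window over sorted positions.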

6. TSS validation and shifting

Each sample in the collection (mRNA peaks and RefSeq TSS) was then processed separately in a pipeline that validates transcription start sites with mRNA peaks. A RefSeq TSS was considered experimentally confirmed if an mRNA peak lay within a 300 bp window around it. The validated TSS was then shifted to the nearest base with the highest tag density.
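
The validation and shifting rule can be sketched as below. The function `validate_and_shift` is hypothetical, and two points are assumptions: the 300 bp window is taken as centred on the annotated TSS (150 bp on each side), and ties in tag count are resolved in favour of the peak closest to the annotated position.

```python
from typing import List, Optional, Tuple

def validate_and_shift(tss: int, peaks: List[Tuple[int, int]],
                       window: int = 300) -> Optional[int]:
    """peaks is a list of (position, tag_count) on the same strand as the TSS.

    Returns the shifted TSS position, or None if no peak validates the TSS.
    """
    half = window // 2
    nearby = [(pos, n) for pos, n in peaks if abs(pos - tss) <= half]
    if not nearby:
        return None  # not experimentally validated in this sample
    # Highest tag count wins; among equals, the peak nearest to the TSS.
    best_pos, _ = max(nearby, key=lambda pn: (pn[1], -abs(pn[0] - tss)))
    return best_pos
```

For example, an annotated TSS at position 1000 with peaks at 980 (12 tags) and 1040 (30 tags) would be shifted to 1040, while a peak at 1300 would be ignored because it lies outside the window.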

7. RefSeq not-validated TSS

The total number of TSSs that were not experimentally validated, summed over all samples, was around 10000.

8. Promoter collection for each sample

Each sample in the dataset was used to generate a separate promoter collection. Potentially, the same transcript could be validated by multiple samples and it could have different start sites in different samples. To avoid redundancy, the individual collections were used as input for an additional step in the analysis (Assembly pipeline part B).

9. Merging collections and second TSS selection

The 8 promoter collections were merged into a single file and further analysed. A transcript validated by multiple samples could have its TSS assigned to a broader region rather than to a single position. To avoid such inconsistency, for each transcript we selected the position that was validated by the largest number of samples as the true TSS.
Different TSSs that belong to the same gene were ranked according to their global expression level. The primary TSS of a gene (marked with '_1' at the end of the ID) always has the highest expression level, followed by all the others in decreasing order of expression (marked '_2', '_3', etc.).
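
Both selection rules can be sketched in a few lines. The function names and input layouts are hypothetical, and the tie-breaking rule (the lowest coordinate wins when two positions have the same support) is an assumption, since the text does not state how ties are handled.

```python
from collections import Counter
from typing import Dict, List, Tuple

def consensus_tss(position_by_sample: Dict[str, int]) -> int:
    """Pick the position validated by the largest number of samples."""
    votes = Counter(position_by_sample.values())
    # Most votes first; on a tie, the lowest coordinate (assumed rule).
    return max(votes, key=lambda p: (votes[p], -p))

def rank_promoters(gene: str, expression_by_tss: Dict[int, float]) -> List[Tuple[str, int]]:
    """Label the TSSs of one gene '_1', '_2', ... by decreasing expression."""
    ordered = sorted(expression_by_tss.items(), key=lambda kv: -kv[1])
    return [(f"{gene}_{i}", pos) for i, (pos, _) in enumerate(ordered, start=1)]
```

With three samples placing a transcript's TSS at 100, 102 and 100, `consensus_tss` returns 100.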

10. Filtering

Transcription start sites that mapped close to another TSS of the same gene (within a 500 bp window) were merged into a single promoter following the same rule: the promoter that was validated by the highest number of samples was kept.
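
A greedy version of this filter is sketched below. It assumes that "500 bp window" means that two TSSs of the same gene less than 500 bp apart are merged; the function `filter_close_tss` and its input layout are hypothetical.

```python
from typing import List, Tuple

def filter_close_tss(tss_list: List[Tuple[int, int]],
                     window: int = 500) -> List[Tuple[int, int]]:
    """tss_list holds (position, n_samples) pairs for one gene on one strand.

    TSSs are visited from best to least supported; a TSS is kept only if no
    already-kept TSS lies closer than `window` bp.
    """
    kept: List[Tuple[int, int]] = []
    for pos, n in sorted(tss_list, key=lambda t: (-t[1], t[0])):
        if all(abs(pos - k) >= window for k, _ in kept):
            kept.append((pos, n))
    return sorted(kept)
```

Given TSSs at 1000 (3 samples), 1200 (7 samples) and 2500 (2 samples), the first is merged into the second and the result is the TSSs at 1200 and 2500.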

11. Final EPDnew collection

The 17000 experimentally validated promoters were stored in the EPDnew database, which can be downloaded from our ftp site. Scientists are welcome to use our other tools, ChIP-Seq (for correlation analysis) and SSA (for motif analysis around promoters), to analyse the EPDnew database.

Last update Dec. 2016