TSS assembly pipeline for At_EPDnew_001

Introduction

This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 002 for A. thaliana genome assembly araTha1 (TAIR10).

Source Data

Description    URLs
TAIR genes     Source URL: ftp://ftp.arabidopsis.org/home/tair/Genes/
               MGA data: /ftp/mga/araTha1/tair/arabidopsisTair10Genes.sga.gz
Morton 14      Source URL: http://megraw.cgrb.oregonstate.edu/suppmats/3PEAT/
               MGA doc: /ftp/mga/araTha1/morton14/morton14.html
               MGA data: /ftp/mga/araTha1/morton14/

Assembly pipeline overview

Description of procedures and intermediate data files

1. Download annotated TSS

Primary annotation data were downloaded from TAIR on 06-02-2015.

Gene annotations downloaded from TAIR did not contain direct links to RefSeq IDs. For this reason, RefSeq IDs were parsed from NCBI RefSeq files.

A total of 31615 promoters were selected.

2. TAIR10 TSS collection

The TAIR10 TSS collection is stored as a tab-delimited text file conforming to the SGA format under the name:
    arabidopsisTair10Genes.sga
The six fields contain the following kinds of information:
  • NCBI/RefSeq chromosome id
  • "TSS"
  • position
  • strand ("+" or "-")
  • "1"
  • TAIR ID
Note that the second and fifth fields are invariant.
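The six-field layout can be read with a few lines of Python. This is a minimal sketch, not part of the EPDnew pipeline: the names SgaRecord and parse_sga_line are hypothetical, and the example line is made up for illustration rather than copied from arabidopsisTair10Genes.sga.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    """One line of the TAIR10 TSS collection (hypothetical helper type)."""
    chrom: str      # NCBI/RefSeq chromosome id
    feature: str    # "TSS" in this collection
    position: int
    strand: str     # "+" or "-"
    count: int      # "1" in this collection
    tair_id: str


def parse_sga_line(line: str) -> SgaRecord:
    """Split one tab-delimited line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    chrom, feature, position, strand, count, tair_id = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), tair_id)


# Made-up example line, for illustration only.
example = "NC_003070.9\tTSS\t3631\t+\t1\tAT1G01010"
record = parse_sga_line(example)
print(record.chrom, record.position, record.strand, record.tair_id)
```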

3. Import CAGE data

Data were imported from GEO in BAM format. BAM files were converted to SGA format using ChIP-Convert.
A step-by-step guide on how to import, map and convert these samples can be found here.

5. mRNA 5' tags peak calling

For the only sample present, peak calling on the merged file was carried out using the ChIP-Peak on-line tool with the following parameters:
  • Window width = 200
  • Vicinity range = 200
  • Peak refine = Y
  • Count cutoff = 9999999
  • Threshold = 5
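To give an intuition for how these parameters interact, the sketch below implements a simplified window-based peak caller in Python. It is not the ChIP-Peak implementation. It assumes that the window width is the span over which tags are summed, that the threshold is the minimum summed count, that the vicinity range is the distance within which only the strongest candidate is kept, and that the count cutoff caps the tag count at a single position. Peak refinement is omitted, and the function name call_peaks is hypothetical.

```python
from typing import Dict, List, Tuple


def call_peaks(
    tags: Dict[int, int],
    window: int = 200,
    vicinity: int = 200,
    threshold: int = 5,
    count_cutoff: int = 9999999,
) -> List[Tuple[int, int]]:
    """Return (position, windowed tag count) for each called peak.

    `tags` maps a position on one chromosome strand to its tag count.
    """
    # Cap the contribution of any single position (assumed meaning of the cutoff).
    capped = {pos: min(count, count_cutoff) for pos, count in tags.items()}
    half = window // 2

    # Sum tags in a window centred on every position that carries tags.
    scores = {
        p: sum(c for q, c in capped.items() if abs(q - p) <= half)
        for p in capped
    }

    # Keep positions that reach the threshold ...
    candidates = sorted(p for p in scores if scores[p] >= threshold)

    # ... and are the strongest candidate within the vicinity range
    # (ties are resolved in favour of the leftmost position).
    peaks = []
    for p in candidates:
        is_local_max = all(
            scores[p] > scores[q] or (scores[p] == scores[q] and p <= q)
            for q in candidates
            if abs(q - p) <= vicinity
        )
        if is_local_max:
            peaks.append((p, scores[p]))
    return peaks


# Toy tag counts, for illustration only.
print(call_peaks({100: 1, 150: 6, 240: 2, 1000: 2}))
```

The quadratic loops keep the sketch short; a production tool would scan sorted positions in a single pass.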

6. TSS validation and shifting

The sample in the collection (mRNA peaks and TAIR10 TSSs) was then processed in a pipeline that validates transcription start sites against mRNA peaks. A TAIR10 TSS was considered experimentally confirmed if an mRNA peak lay within a 100 bp window around it. The validated TSS was then shifted to the nearest base with the highest tag density.
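The validation and shifting rule can be sketched as follows. This is an illustrative reading, not the pipeline's code. It assumes that the 100 bp window is centred on the annotated TSS (50 bp on each side), that peaks are given as (position, tag count) pairs on the same chromosome and strand, and that ties in tag count are resolved by distance to the annotated TSS. The function name validate_and_shift is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Peak = Tuple[int, int]  # (position, tag count)


def validate_and_shift(
    tss: int, peaks: Sequence[Peak], window: int = 100
) -> Optional[int]:
    """Return the shifted TSS position, or None if the TSS is not validated."""
    half = window // 2  # assumption: window is centred on the annotated TSS
    nearby = [(pos, count) for pos, count in peaks if abs(pos - tss) <= half]
    if not nearby:
        return None  # no mRNA peak in the window: not experimentally confirmed
    # Highest tag count first; among equal counts, the peak nearest to the TSS.
    best_pos, _ = min(nearby, key=lambda p: (-p[1], abs(p[0] - tss)))
    return best_pos


# Toy values, for illustration only.
print(validate_and_shift(1000, [(1020, 8), (990, 12), (1200, 50)]))
print(validate_and_shift(1000, [(1200, 50)]))
```

In the first call the peak at 1200 is ignored because it lies outside the window, so the TSS moves to the stronger of the two remaining peaks; in the second call no peak qualifies and the TSS stays unvalidated.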

7. TAIR not-validated TSS

The total number of TSSs that were not experimentally validated (summed over all samples) was around 15000.

8. Filtering

Transcription start sites that mapped close to other TSSs belonging to the same gene (within a 100 bp window) were merged into a single promoter following this rule: the promoter validated by the highest number of samples was kept.
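A possible implementation of this merging rule is sketched below. It is an assumption-laden illustration, not the pipeline's code: TSSs of one gene are sorted by position and chained into a cluster whenever the gap to the previous TSS is at most 100 bp, and within each cluster the TSS supported by the most samples is kept (the lowest position wins ties). The names Tss and merge_close_tss are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Tss(NamedTuple):
    gene: str
    position: int
    n_samples: int  # number of samples that validated this TSS


def merge_close_tss(tss_list: Iterable[Tss], window: int = 100) -> List[Tss]:
    """Keep one TSS per cluster of nearby TSSs belonging to the same gene."""
    by_gene: Dict[str, List[Tss]] = defaultdict(list)
    for tss in tss_list:
        by_gene[tss.gene].append(tss)

    kept: List[Tss] = []
    for gene in sorted(by_gene):
        cluster: List[Tss] = []
        for tss in sorted(by_gene[gene], key=lambda t: t.position):
            # A gap larger than the window closes the current cluster.
            if cluster and tss.position - cluster[-1].position > window:
                kept.append(max(cluster, key=lambda t: t.n_samples))
                cluster = []
            cluster.append(tss)
        kept.append(max(cluster, key=lambda t: t.n_samples))
    return kept


# Toy records, for illustration only.
promoters = merge_close_tss([
    Tss("G1", 100, 1), Tss("G1", 150, 3), Tss("G1", 500, 1), Tss("G2", 120, 2),
])
for promoter in promoters:
    print(promoter)
```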

9. Final EPDnew collection

The 17000 experimentally validated promoters were stored in the EPDnew database, which can be downloaded from our ftp site. Scientists are welcome to use our other tools, ChIP-Seq (for correlation analysis) and SSA (for motif analysis around promoters), to analyse the EPDnew database.

Last update October 2019