TSS assembly pipeline for Ce_EPDnew_001

Introduction

This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 001 for C. elegans.

Source Data

Promoter collection:

Name	Genome Assembly	Promoters	Genes	PMID	Access data
UCSC Genes	May 2008 WS190/ce6	20531	11786	26590259	SOURCE	DOC	DATA

Experimental data:

Name	Type	Samples	Tags	PMID	Access data
Kruesi et al., 2013	GRO-cap	9	236,210,104	23795297	SOURCE	DOC	DATA

1. Download annotated TSS

Data were downloaded from the UCSC Table Browser. Transcripts were filtered according to the following rules:

Transcripts of protein coding genes only
Transcript lies on full chromosomes
Genes must be annotated [Associated Gene Name present]
Gene and transcript status known

Gene names were taken from the field "Associated Gene Name". Since the EPD format does not allow gene names longer than 18 characters, we checked that all names respected this limit.
A total of 20531 promoters were selected.
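
The name-length check described above can be illustrated with a short Python sketch. It is not the script used in the pipeline; the function name and the example gene names are hypothetical, and only the 18-character limit comes from the text.

```python
MAX_GENE_NAME_LEN = 18  # limit imposed by the EPD format, as stated above


def find_long_gene_names(names, max_len=MAX_GENE_NAME_LEN):
    """Return the sorted, de-duplicated gene names longer than max_len."""
    return sorted({name for name in names if len(name) > max_len})


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    example = ["lin-4", "unc-22", "a_hypothetical_very_long_gene_name"]
    for name in find_long_gene_names(example):
        print(f"{name}\t{len(name)} characters")
```

An empty result means that every name fits the EPD format.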

2. UCSC TSS collection

The UCSC TSS collection is stored as a tab-delimited text file conforming to the SGA format under the name:

ucsc_promoter_list.sga

The six fields contain the following kinds of information:

NCBI/RefSeq chromosome id
"TSS"
position
strand ("+" or "-")
"1"
gene name / ID

Note that the second and fifth fields are invariant.
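
As an illustration of the six-field layout listed above, the following Python sketch parses and re-writes one SGA line. The class and function names are hypothetical, and the values in the example line are placeholders rather than an entry from ucsc_promoter_list.sga.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    chrom: str     # field 1: NCBI/RefSeq chromosome id
    feature: str   # field 2: feature name, "TSS" in this file
    position: int  # field 3: position
    strand: str    # field 4: "+" or "-"
    count: int     # field 5: tag count, "1" in this file
    gene: str      # field 6: gene name / ID


def parse_sga_line(line):
    """Split one tab-delimited SGA line into a typed record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    chrom, feature, position, strand, count, gene = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), gene)


def format_sga_line(record):
    """Write a record back as a tab-delimited SGA line (no newline)."""
    return "\t".join(str(value) for value in record)


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    line = "NC_000000.0\tTSS\t12345\t+\t1\tgene-1\n"
    record = parse_sga_line(line)
    print(record.gene, record.position, record.strand)
```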

3. Import GRO-cap data

Data were imported from GEO in SRA format. Raw sequence files were mapped to the ce6 genome using Bowtie. The resulting BAM files were converted to SGA format using ChIP-Convert.
A step-by-step guide on how to import, map and convert these samples can be found here

4. Download annotated TSS file from eLife

The list of promoters published by Kruesi et al. was downloaded from eLife. The XLS file was converted to a tab-delimited flat file using OpenOffice and then to a BED file using in-house scripts. (Note that one line in the input data file contains up to 4 TSS coordinates.)
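
The in-house scripts are not shown here. The Python sketch below only illustrates how a row carrying up to 4 TSS coordinates could be expanded into one BED record per coordinate. The column layout (gene, chromosome, strand, then four coordinate columns) and the assumption that the coordinates are 1-based are hypothetical and would need to be checked against the actual flat file.

```python
def row_to_bed(row, n_tss_cols=4):
    """Expand one flat-file row into one BED tuple per non-empty TSS column.

    Assumed (hypothetical) layout: gene, chrom, strand, tss1 .. tss4.
    Coordinates are assumed to be 1-based; BED start is 0-based, end exclusive.
    """
    gene, chrom, strand = row[0], row[1], row[2]
    records = []
    for idx, cell in enumerate(row[3:3 + n_tss_cols], start=1):
        cell = cell.strip()
        if not cell:
            continue  # this row has fewer than four TSSs
        pos = int(cell)
        # BED6: chrom, start, end, name, score, strand
        records.append((chrom, pos - 1, pos, f"{gene}_{idx}", 0, strand))
    return records


if __name__ == "__main__":
    # Placeholder row, for illustration only.
    row = "gene-1\tchrI\t+\t100\t\t250\t".split("\t")
    for record in row_to_bed(row):
        print("\t".join(str(value) for value in record))
```

The column index is appended to the gene name so that each TSS keeps a distinct identifier in the BED name field.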

5. LiftOver ce10 to ce6 and generate an SGA file

The Kruesi et al. promoter list was lifted over from ce10 to ce6 using the liftOver tool from the UCSC Genome Browser.
The resulting BED file was converted to SGA using ChIP-Convert.
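
The coordinate change involved in a BED-to-SGA conversion can be sketched in Python as follows. This is not the ChIP-Convert implementation: the function name and the optional chromosome-name mapping are hypothetical, and the sketch assumes that the TSS is the 5' end of the BED feature (the first base on the "+" strand, the last base on the "-" strand).

```python
def bed_to_sga(bed_fields, feature="TSS", chrom_ids=None):
    """Convert one BED6 record (list of strings) into an SGA line.

    BED start is 0-based and end is exclusive; the SGA position is 1-based.
    chrom_ids is an optional, hypothetical mapping from UCSC chromosome
    names to NCBI/RefSeq ids; unmapped names are passed through unchanged.
    """
    chrom = bed_fields[0]
    start, end = int(bed_fields[1]), int(bed_fields[2])
    name, strand = bed_fields[3], bed_fields[5]
    position = start + 1 if strand == "+" else end
    chrom = (chrom_ids or {}).get(chrom, chrom)
    return "\t".join([chrom, feature, str(position), strand, "1", name])


if __name__ == "__main__":
    # Placeholder record, for illustration only.
    print(bed_to_sga(["chrI", "99", "100", "gene-1", "0", "+"]))
```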

6. Annotate kruesi13 SGA file with GRO-cap counts

The published promoter collection was annotated using the GRO-cap raw data. This step was done to get the total number of GRO-cap reads that mapped at the annotated TSSs.
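
A minimal Python sketch of such an annotation step is given below. It sums the GRO-cap tag counts found on the same strand within a window around each annotated TSS. The function names and the window parameter are assumptions; the window size actually used in the pipeline is not stated here, so the default only counts tags at the exact TSS position.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from itertools import accumulate


def build_index(tags):
    """Index tags, given as (chrom, pos, strand, count), by chromosome and strand.

    For each (chrom, strand) pair, store the sorted positions and the
    cumulative tag counts so that window sums are fast to compute.
    """
    grouped = defaultdict(list)
    for chrom, pos, strand, count in tags:
        grouped[(chrom, strand)].append((pos, count))
    index = {}
    for key, items in grouped.items():
        items.sort()
        positions = [pos for pos, _ in items]
        cumulative = [0] + list(accumulate(count for _, count in items))
        index[key] = (positions, cumulative)
    return index


def count_tags(index, chrom, pos, strand, window=0):
    """Sum tag counts on the given strand within pos +/- window (inclusive)."""
    positions, cumulative = index.get((chrom, strand), ([], [0]))
    lo = bisect_left(positions, pos - window)
    hi = bisect_right(positions, pos + window)
    return cumulative[hi] - cumulative[lo]


if __name__ == "__main__":
    # Placeholder tags, for illustration only.
    tags = [("chrI", 100, "+", 5), ("chrI", 102, "+", 3), ("chrI", 100, "-", 7)]
    index = build_index(tags)
    print(count_tags(index, "chrI", 100, "+", window=2))
```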

7. Select TSS with maximal GRO-cap

Promoters belonging to the same gene were merged if the distance between them was shorter than 100 bp. The site with the highest tag count was then selected as the EPD promoter.
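
The selection rule can be sketched in Python as follows. The sketch assumes that merging is done by chaining: sites of the same gene are sorted by position, and a site joins the current cluster when it lies less than 100 bp from the previous site. The text does not specify the exact clustering procedure or how ties are resolved, so both are assumptions (here the lowest position wins a tie), and the function name is hypothetical.

```python
from collections import defaultdict


def select_promoters(sites, max_dist=100):
    """Keep the site with the highest tag count in each cluster.

    sites: iterable of (gene, chrom, strand, pos, count) tuples.
    Returns a list of tuples in the same layout, one per cluster.
    """
    by_gene = defaultdict(list)
    for gene, chrom, strand, pos, count in sites:
        by_gene[(gene, chrom, strand)].append((pos, count))

    selected = []
    for key, items in sorted(by_gene.items()):
        items.sort()
        cluster = [items[0]]
        for site in items[1:]:
            if site[0] - cluster[-1][0] < max_dist:
                cluster.append(site)  # close enough: same cluster
            else:
                selected.append(key + max(cluster, key=lambda s: s[1]))
                cluster = [site]      # start a new cluster
        selected.append(key + max(cluster, key=lambda s: s[1]))
    return selected


if __name__ == "__main__":
    # Placeholder sites, for illustration only.
    sites = [
        ("g1", "chrI", "+", 1000, 10),
        ("g1", "chrI", "+", 1050, 40),
        ("g1", "chrI", "+", 1300, 5),
    ]
    for record in select_promoters(sites):
        print("\t".join(str(value) for value in record))
```

With chaining, a cluster can span more than 100 bp when intermediate sites link its ends. A fixed-window rule would behave differently in that case.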

Last update October 2019
