CLIP-seq Toolkit

home-made toolkit

Pre

Thank you for visiting, whether or not you came here through my CV. During my application period (Jan 13 to April), this repository will be temporarily public, although it has been private since Oct 13, 2013. I am writing an R package based on this repository, which was previously a collection of my home-made R snippets.

Assembling these sketches into a whole picture is a small wish of mine, and it grew stronger during my days at MSKCC-cbio. Once I bring it onto solid ground, this page will present a real R package, hopefully a Bioconductor package.

Repository Purpose

CLIP-seq can globally report the binding positions of RNA-binding proteins such as PTB and FOX. The key step is to identify the sites where protein and RNA are bound with high affinity and strength, a.k.a. peak finding. Many ChIP-seq tools have been used to find peaks in CLIP-seq data, since both ChIP-seq and CLIP-seq scan protein-nucleic acid interaction sites [1]. This repository provides the bioinformatics Perl scripts to do the job. They incorporate the in silico CLIP-seq method (by Darnell) [2] and the modified FDR method from the PTB CLIP-seq study (by Xue) [3].

In the future, I will push more home-made code for CLIP-seq analysis.

Brief Introduction to the Pipeline

Peak Finding

Cubic Spline Interpolation

Cubic spline interpolation is used in the Darnell paper to identify the positions of CLIP-seq signal maxima, which potentially indicate binding peaks. In the figure, the black dots denote the raw signal values (heights), the blue curve is the fitted spline, and the red curve with dots is its derivative. Where the derivative equals zero, the site is a local maximum or minimum, and the maxima can then be selected from these candidates.

Cubic spline interpolation for peak finding is implemented by PeakFindingBySpline.pl.
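The idea can be sketched in a few lines of Python. This is only an illustration and not the code in PeakFindingBySpline.pl, which is Perl calling R. It assumes a natural cubic spline (second derivative zero at both ends), and the function names `natural_spline_second_derivs` and `spline_maxima` as well as the sample heights are made up for this example. On each interval the spline's derivative is a quadratic, so its zeros can be found in closed form, and a zero is kept as a maximum only when the second derivative is negative there.

```python
import math


def natural_spline_second_derivs(xs, ys):
    """Second derivatives M_i of the natural cubic spline through (xs, ys).

    Assumes xs is strictly increasing; M_0 = M_n = 0 (natural boundary).
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)
    if n < 2:
        return m
    # Tridiagonal system for the interior unknowns M_1..M_{n-1}.
    diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
    rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1]
                  - (ys[k + 1] - ys[k]) / h[k]) for k in range(n - 1)]
    # Forward elimination (Thomas algorithm).
    for k in range(1, n - 1):
        w = h[k] / diag[k - 1]
        diag[k] -= w * h[k]
        rhs[k] -= w * rhs[k - 1]
    # Back substitution.
    for k in range(n - 2, -1, -1):
        m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]
    return m


def spline_maxima(xs, ys, eps=1e-9):
    """Return (position, height) of every local maximum of the spline."""
    m = natural_spline_second_derivs(xs, ys)
    peaks = []
    for i in range(len(xs) - 1):
        h = xs[i + 1] - xs[i]
        # On this interval S'(t) = a*t^2 + b*t + c with t = x - xs[i].
        a = (m[i + 1] - m[i]) / (2.0 * h)
        b = m[i]
        c = ((ys[i + 1] - ys[i]) / h
             - (m[i + 1] - m[i]) * h / 6.0 - m[i] * h / 2.0)
        if abs(a) < eps:
            roots = [-c / b] if abs(b) > eps else []
        else:
            disc = b * b - 4.0 * a * c
            if disc < 0:
                roots = []
            else:
                s = math.sqrt(disc)
                roots = [(-b - s) / (2.0 * a), (-b + s) / (2.0 * a)]
        for t in roots:
            # Keep zeros inside the interval where S''(t) = b + 2*a*t < 0.
            if not (-eps <= t <= h + eps) or b + 2.0 * a * t >= 0:
                continue
            x = xs[i] + t
            if peaks and abs(x - peaks[-1][0]) < 1e-6:
                continue  # same maximum found again at a shared knot
            u = h - t
            y = ((m[i] * u ** 3 + m[i + 1] * t ** 3) / (6.0 * h)
                 + (ys[i] / h - m[i] * h / 6.0) * u
                 + (ys[i + 1] / h - m[i + 1] * h / 6.0) * t)
            peaks.append((x, y))
    return peaks


# Hypothetical per-position read heights for one small region.
positions = [0, 1, 2, 3, 4, 5, 6]
heights = [1, 3, 7, 9, 6, 2, 1]
print(spline_maxima(positions, heights))
```

Because the maximum is located on the fitted curve rather than on the raw points, the reported position can fall between two observed positions.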

Threshold before peak finding

The principle is the same as in the Darnell method. For each gene, I assign all of its reads to random positions in order to find the maximum height in a random scenario. An observed (experimental) peak whose height is lower than this random (dummy) peak is expected to be noise and should be excluded.

Thresholding before peak finding is implemented by threshold_each_gene_multiple_expriments.pl.
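A minimal Python sketch of this per-gene random threshold follows. It is not the Perl script itself, and several details are assumptions made for illustration: reads are placed uniformly at random, every read has the same length (`read_length`), the shuffle is repeated `n_iter` times, and the threshold is the largest random height seen. The function names are hypothetical.

```python
import random


def max_random_height(gene_length, n_reads, read_length, rng):
    """Scatter n_reads reads uniformly over one gene and return the
    maximum per-base coverage of that random arrangement."""
    diff = [0] * (gene_length + 1)          # difference array of coverage
    last_start = max(gene_length - read_length, 0)
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        end = min(start + read_length, gene_length)
        diff[start] += 1
        diff[end] -= 1
    height = best = 0
    for delta in diff[:gene_length]:
        height += delta
        best = max(best, height)
    return best


def random_threshold(gene_length, n_reads, read_length=30, n_iter=100, seed=0):
    """Largest random peak height over n_iter shuffles (assumed rule)."""
    rng = random.Random(seed)
    return max(max_random_height(gene_length, n_reads, read_length, rng)
               for _ in range(n_iter))


def filter_peaks(peaks, threshold):
    """Drop observed (position, height) peaks lower than the random peak."""
    return [(pos, h) for pos, h in peaks if h >= threshold]


# Hypothetical gene: 1000 nt long with 50 mapped reads.
threshold = random_threshold(gene_length=1000, n_reads=50)
observed = [(120, 2), (480, 25), (910, 1)]
print(threshold, filter_peaks(observed, threshold))
```

Computing the threshold gene by gene means that a highly expressed gene, which carries more reads, also has to clear a higher random peak.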

Dependencies

PeakFindingBySpline.pl requires the Perl package Statistics::R, because the Perl script calls R functions.

The funny thing is that the R script runs for 104.948 s, while the Perl script runs for only 5.028 s on exactly the same test genes. The principle and core idea of the R script (I may push it another day) are the same as those of this Perl script. In addition, I avoided explicit for and while loops, which are performance killers in R code. Perhaps I should try some Bioconductor packages when dealing with large numbers of genes. It feels as if my R script still works like a foreach loop in Perl, a pattern I try hard to avoid, and is therefore inefficient. Nevertheless, this Perl script calling R functions really works.

Refs

  1. Althammer, Sonja, et al. "Pyicos: A versatile toolkit for the analysis of high-throughput sequencing data." Bioinformatics 27.24 (2011): 3333-3340.
  2. Chi, Sung Wook, et al. "Argonaute HITS-CLIP decodes microRNA–mRNA interaction maps." Nature 460.7254 (2009): 479-486.
  3. Xue, Yuanchao, et al. "Genome-wide analysis of PTB-RNA interactions reveals a strategy used by the general splicing repressor to modulate exon inclusion or skipping." Molecular Cell 36.6 (2009): 996-1006.