Progress in science depends on new techniques, new discoveries, and new ideas, probably in that order.[1]

ChIPseeker: an R package for ChIP peak Annotation, Comparison and Visualization

Introduction: ChIPseeker, a popular Bioconductor package, retrieves the genes nearest to each peak, annotates the genomic region of the peak, and uses statistical methods to estimate the significance of overlap among ChIP peak data sets, among other features. It is developed and actively maintained by Guangchuang Yu [1], and I am a contributor.

My role in ChIPseeker is to use bootstrap techniques to estimate the confidence interval of the average ChIP-seq signal within genomic regions of interest (by default, TSS regions), and to visualize the results. The official page for ChIPseeker at Bioconductor is here and the active development repository is here.
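The bootstrap idea can be sketched in a few lines. The following is a minimal illustration in Python, not ChIPseeker's R implementation: it assumes the signal is available as a list of equally long rows (one row per region, one column per position), and the function names `mean_profile`, `percentile` and `bootstrap_ci` are hypothetical.

```python
import random

def mean_profile(rows):
    """Column-wise mean of equally long signal rows (one row per region)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in [0, 1]) of a pre-sorted list."""
    idx = int(round(q * (len(sorted_vals) - 1)))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]

def bootstrap_ci(rows, n_boot=500, conf=0.95, seed=0):
    """Percentile bootstrap CI of the average signal at each position.

    Regions (rows) are resampled with replacement; the mean profile is
    recomputed for every resample, and the CI bounds are read off the
    per-position distribution of those resampled means.
    """
    rng = random.Random(seed)
    n = len(rows)
    boot_means = []
    for _ in range(n_boot):
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        boot_means.append(mean_profile(sample))
    alpha = (1 - conf) / 2
    lower, upper = [], []
    for col in zip(*boot_means):
        vals = sorted(col)
        lower.append(percentile(vals, alpha))
        upper.append(percentile(vals, 1 - alpha))
    return mean_profile(rows), lower, upper

# Toy signal matrix: 3 regions x 3 positions (made-up numbers).
signal = [[0, 1, 2], [2, 3, 4], [4, 5, 6]]
mean, lower, upper = bootstrap_ci(signal)
```

The returned `lower` and `upper` lists can be drawn as a ribbon around the mean profile.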

Language: R

  GitHub Repo

Identifying binding peaks of RBP from CLIP-seq data

Introduction: CLIP-seq can globally report the binding positions of RNA-binding proteins such as PTB and FOX. The key step is to identify the sites where protein and RNA are bound with high affinity and strength, a.k.a. peak finding. Many ChIP-seq tools have been used to find peaks in CLIP-seq data, since both ChIP-seq and CLIP-seq scan for protein-nucleic acid interaction sites [1]. Presented here is a bioinformatics Perl script to do the job. It incorporates methods from Ago CLIP-seq (by Darnell) [2] and a modified FDR method from PTB CLIP-seq (by Xue) [3].
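To give a feel for height-based peak calling with an empirical FDR, here is a simplified sketch in Python. It is not the Perl script's algorithm, nor the published methods cited above: it assumes reads are 0-based half-open intervals that fit inside a single region, builds the background by re-placing each read uniformly at random, and uses hypothetical names (`coverage`, `height_fdr`, `call_peaks`).

```python
import random

def coverage(reads, length):
    """Per-base read depth over a region (reads are 0-based, half-open)."""
    depth = [0] * length
    for start, end in reads:
        for pos in range(max(0, start), min(length, end)):
            depth[pos] += 1
    return depth

def count_at_least(depth, h):
    """Number of positions whose depth is at least h."""
    return sum(1 for d in depth if d >= h)

def height_fdr(reads, length, n_perm=100, seed=0):
    """Empirical FDR per height: expected random positions / observed positions."""
    rng = random.Random(seed)
    observed = coverage(reads, length)
    max_h = max(observed)
    background = [0.0] * (max_h + 1)
    for _ in range(n_perm):
        shuffled = []
        for start, end in reads:
            size = end - start
            new_start = rng.randrange(0, length - size + 1)
            shuffled.append((new_start, new_start + size))
        depth = coverage(shuffled, length)
        for h in range(1, max_h + 1):
            background[h] += count_at_least(depth, h) / n_perm
    return {h: min(1.0, background[h] / count_at_least(observed, h))
            for h in range(1, max_h + 1)}

def call_peaks(reads, length, fdr_cutoff=0.05, **kwargs):
    """Return (start, end, summit_height) for runs above the lowest passing height."""
    fdr = height_fdr(reads, length, **kwargs)
    passing = [h for h, q in fdr.items() if q <= fdr_cutoff]
    if not passing:
        return []
    min_height = min(passing)
    depth = coverage(reads, length)
    peaks, start = [], None
    for pos, d in enumerate(depth + [0]):  # trailing 0 closes a run at the end
        if d >= min_height and start is None:
            start = pos
        elif d < min_height and start is not None:
            peaks.append((start, pos, max(depth[start:pos])))
            start = None
    return peaks

# Toy data: 20 stacked reads plus two isolated reads in a 1000-base region.
reads = [(100, 130)] * 20 + [(500, 530), (700, 730)]
peaks = call_peaks(reads, 1000)
```

A real pipeline would run this per transcript and would likely model the background more carefully, for example by accounting for expression level.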

Language: Perl, R.

  GitHub Repo

Cassette Exon Finder

Introduction: Alternative splicing is a major contributor to mammalian transcriptome dynamics and, in turn, to proteomic complexity. Cassette exons are the largest category (over 60%) of alternative exons. Cassette exons are related to disease, for example SMA (Spinal Muscular Atrophy), a neurodegenerative disorder directly caused by SMN1 deficiency. Exon 7 is included in SMN1 transcripts in healthy people, whereas it is mostly excluded (skipped) in transcripts of the paralog SMN2, so SMN2 cannot fully compensate for the loss of SMN1, which in severe cases results in infantile death. Presented here are the scripts to extract exon trios from the gene annotation available at the UCSC Genome Browser.
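As an illustration of what an exon trio is, the sketch below (in Python, not the repository's Perl/Bash code) parses the comma-separated `exonStarts` and `exonEnds` columns found in UCSC gene tables and emits every internal exon together with its two flanking exons. The function names are hypothetical, and treating every internal exon as a candidate is a simplifying assumption.

```python
def parse_exons(exon_starts, exon_ends):
    """Turn UCSC-style 'a,b,c,' strings into a list of (start, end) pairs."""
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return list(zip(starts, ends))

def exon_trios(exons):
    """Yield (upstream, middle, downstream) for every internal exon."""
    for i in range(1, len(exons) - 1):
        yield exons[i - 1], exons[i], exons[i + 1]

# Made-up transcript with four exons; UCSC lists keep a trailing comma.
exons = parse_exons("100,300,500,700,", "200,400,600,800,")
for upstream, middle, downstream in exon_trios(exons):
    print(upstream, middle, downstream)
```

Each trio defines two inclusion junctions (upstream to middle, middle to downstream) and one skipping junction (upstream to downstream).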

It fetches exon trios to feed downstream scripts (not included) that quantify exon inclusion ratios and then detect differentially spliced exons via Fisher's exact test.
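The downstream step is not part of the repository, but its logic can be sketched as follows, assuming each condition provides a count of inclusion reads and a count of skipping reads for a given exon. This is a plain-Python illustration with hypothetical function names and made-up counts; the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb

def inclusion_ratio(inclusion_reads, skipping_reads):
    """Fraction of reads supporting exon inclusion."""
    total = inclusion_reads + skipping_reads
    return inclusion_reads / total if total else float("nan")

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Rows: condition A and condition B; columns: inclusion and skipping reads.
ratio_a = inclusion_ratio(30, 10)
ratio_b = inclusion_ratio(12, 28)
p_value = fisher_exact_two_sided(30, 10, 12, 28)
```

When many exons are tested, the resulting p-values would normally be corrected for multiple testing.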

Language: Perl, Bash

  GitHub Repo

Plugin for Pysam to decode MD tag

>>> ## Demo MD tag
>>> demoMD = "31^AT3T14"
>>> ## Demo read.cigar
>>> cigarList = [(0, 31), (2, 2), (3, 94), (0, 18)]
>>> ## cigar list with match/mismatch information:
>>> moreCigarList = moreCigar(cigarList, demoMD)
>>> print(moreCigarList)
[(7, 31), (2, 2), (3, 94), (7, 3), (8, 1), (7, 14)]

Introduction: This plugin enables read.cigar in pysam to take into account the MD tag information stored in BAM/SAM files. The pysam syntax can be found here: 0 denotes an alignment match, while 7 and 8 are documented as denoting sequence match (=) and mismatch, respectively. Although the MD tag is available in BAM/SAM alignment files, and even when identical bases are manually changed to = in the CIGAR string, matched and mismatched bases are both still marked ambiguously as 0 and cannot be distinguished. This makes it less straightforward to locate base differences, e.g. SNPs.
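One way to combine the two sources of information is sketched below. This is an illustrative re-implementation rather than the plugin's actual code: `more_cigar` and `md_to_ops` are hypothetical names, and the sketch assumes a well-formed MD tag that agrees with the CIGAR list. MD runs are consumed only by alignment-match (0) and deletion (2) operations; every other operation is passed through unchanged.

```python
import re

# MD tokens: a match count, a deletion (^ followed by bases), or one mismatched base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")

def md_to_ops(md):
    """Expand an MD string into [op, length] runs: 7 match, 8 mismatch, 2 deletion."""
    ops = []
    for num, deleted, mismatch in MD_TOKEN.findall(md):
        if num:
            if int(num) > 0:
                ops.append([7, int(num)])
        elif deleted:
            ops.append([2, len(deleted)])
        elif ops and ops[-1][0] == 8:
            ops[-1][1] += 1  # adjacent mismatches (separated by "0") form one run
        else:
            ops.append([8, 1])
    return ops

def more_cigar(cigar, md):
    """Split every alignment-match (0) operation into 7/8 runs using the MD tag."""
    md_ops = md_to_ops(md)
    out, i = [], 0
    for op, length in cigar:
        if op == 0:
            remaining = length
            while remaining > 0:
                if i >= len(md_ops) or md_ops[i][0] == 2:
                    raise ValueError("MD tag does not agree with CIGAR")
                md_op, md_len = md_ops[i]
                take = min(remaining, md_len)
                out.append((md_op, take))
                remaining -= take
                if take == md_len:
                    i += 1
                else:
                    md_ops[i][1] -= take  # rest of this run belongs to a later M
        elif op == 2:
            if i >= len(md_ops) or md_ops[i] != [2, length]:
                raise ValueError("MD tag does not agree with CIGAR")
            out.append((op, length))
            i += 1
        else:
            out.append((op, length))  # I, N, S, H, P do not consume MD
    return out

print(more_cigar([(0, 31), (2, 2), (3, 94), (0, 18)], "31^AT3T14"))
```

With the demo input above, the sketch reproduces the list shown in the demo session.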

Language: Python

  GitHub Repo

Minery, a web server for Chemical In Vitro In Vivo Profiling

CIIP (Chemical In Vitro In Vivo Profiling) is a toolkit whose ultimate goal is to model the similarities between the bioactivity of small molecules and high-throughput screening data. It is promising for research in cheminformatics and drug discovery.

Fed with compounds of interest, Minery fetches bioassay data from available databases such as PubChem.

Minery is implemented in R and deployed at Shinyapps, thanks to RStudio.

It will be open-source once it is finished.


aRtist, a repository dedicated to data visualization in bioinformatics

The aRtist repository holds reproducible code for producing elegant, publication-quality figures for bioinformatics research. In addition to the above three figures, there are more classic cases in my gallery. Two strong beliefs underlie aRtist.

  • Data never lies but data visualization probably could.
  • Reviewers are indeed kids who love colorful cartoons.
  Gallery   Source

    ggHMM, visualize Hidden Markov Model

    Using ggplot2 and d3.js to implement elegant visualization of Hidden Markov Model with known parameters.