Available Software



1. FATHMM-MKL and CScape

The FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:

  • Hashem A. Shihab, Julian Gough, David N. Cooper, Peter D. Stenson, Gary L.A. Barker, Keith J. Edwards, Ian N.M. Day, Tom R. Gaunt. Predicting the Functional, Molecular and Phenotypic Consequences of Amino Acid Substitutions using Hidden Markov Models. Hum. Mutat. (2013), 34:57-65

    Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, these methods build on the observation that SNVs in regions of the genome that are highly conserved across species are more likely to be deleterious than variants in regions that vary widely across species.

    As an improvement we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources for predicting the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source. The method uses multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of Learning with Support Vector Machines): the algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:

  • Hashem Shihab, Mark Rogers, Julian Gough, Matthew Mort, David Cooper, Ian Day, Tom Gaunt and Colin Campbell. An Integrative Approach to Predicting the Functional Effects of Non-Coding and Coding Sequence Variation. Bioinformatics Vol. 31, No. 10, 2015, pages 1536-1543.

    We later made a modest improvement to the method:

  • Mark Rogers, Hashem Shihab, Tom Gaunt, Matthew Mort, David Cooper, and Colin Campbell, Sequential Data Selection for Predicting the Pathogenic Effects of Sequence Variation, Proceedings, 2015 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM 2015, B394)

    FATHMM-MKL has been found to be state-of-the-art in comparative surveys by other groups.
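
The idea of weighting several data sources by their informativeness can be illustrated with a small sketch. This is not the published FATHMM-MKL procedure: it uses kernel-target alignment as a simple stand-in for learning the kernel weights, and the synthetic data, the function names and the RBF kernel choice are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def alignment(K, y):
    """Kernel-target alignment: <K, yy^T>_F / (||K||_F * ||yy^T||_F)."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

def combine_kernels(kernels, y):
    """Weight each kernel by its (non-negative) alignment with the labels."""
    a = np.array([max(alignment(K, y), 0.0) for K in kernels])
    weights = a / a.sum()
    K = sum(w * Km for w, Km in zip(weights, kernels))
    return weights, K

# Synthetic example (hypothetical data): one informative source, one pure noise.
rng = np.random.default_rng(0)
n = 200
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
informative = y[:, None] + 0.5 * rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 2))

kernels = [rbf_kernel(informative), rbf_kernel(noise)]
weights, K = combine_kernels(kernels, y)
print(weights)  # the informative source receives the larger weight
```

The combined matrix K could then be passed to any kernel classifier that accepts a precomputed kernel. A full multiple kernel learning method would optimise the weights jointly with the classifier rather than fixing them beforehand.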

    We have also devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome. In the depicted plots, peaks near unity indicate probable pathogenic SNVs and values near zero indicate neutral variants. Predictions from other methods, e.g. CADD, are also shown, some as optional tracks.

    Reference:

    Hashem A. Shihab, Mark F. Rogers, Michael Ferlaino, Colin Campbell and Tom R. Gaunt. GTB - an online genome tolerance browser. BMC Bioinformatics 2017, 18:20. DOI: 10.1186/s12859-016-1436-4.
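
As a rough illustration of how per-position scores of this kind might be read, the sketch below labels positions by score. The cut-offs of 0.9 and 0.1, the function name and the example positions are assumptions for illustration only; they are not values taken from the browser.

```python
def label_positions(scores, high=0.9, low=0.1):
    """Label (position, score) pairs; the thresholds are illustrative assumptions."""
    labelled = []
    for pos, s in scores:
        if s >= high:
            label = "likely pathogenic"
        elif s <= low:
            label = "likely neutral"
        else:
            label = "uncertain"
        labelled.append((pos, s, label))
    return labelled

# Hypothetical scores at three genomic positions.
example_scores = [(1001, 0.97), (1002, 0.52), (1003, 0.04)]
labelled = label_positions(example_scores)
for pos, s, label in labelled:
    print(pos, s, label)
```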

    Subsequent to this we developed an indel predictor (for estimating the pathogenic effects of short insertions or deletions of genetic code). This predictor can handle indels in non-coding regions of the human genome:

  • Michael Ferlaino, Mark F Rogers, Hashem A Shihab, Tom R Gaunt, Matthew Mort, David N Cooper, Colin Campbell. An integrative approach to predicting the functional effects of small indels in non-coding regions of the human genome. Journal submission

    A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. We are therefore devising a suite of predictors in the context of cancer under the generic title of CScape. Our first generic cancer predictor, CScape, uses a wide variety of data sources to predict whether a single nucleotide variant is potentially a disease-driver for cancer:

  • Mark F. Rogers, Hashem A. Shihab, Tom R. Gaunt and Colin Campbell. CScape: a tool for predicting oncogenic single-point mutations in the cancer genome. Journal submission

    Our baseline predictor appears more accurate than competitors and was based on data from COSMIC and up to 30 different types of genomic data sources. The method was benchmarked on independent data from The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), in addition to other databases. It is able to make predictions in both the coding and non-coding regions of the cancer genome, though it is much more accurate in coding regions.

    We furthermore introduced a confidence measure for the predicted class label. By restricting prediction to the highest-confidence instances, the resultant classifier can perform at approximately 90% test accuracy in coding regions, though it achieves this level of accuracy at only a minority of nucleotide positions (about 17%). These high-confidence predicted potential disease-driver variants are typically clustered by location in the cancer genome, and the method highlights exons in 191 autosomal genes in which mutational change could act as a disease-driver.
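
The trade-off between accuracy and the fraction of positions at which a prediction is made can be sketched as follows. This is a minimal illustration, assuming the classifier outputs a probability for the positive class and that confidence is measured as max(p, 1 - p); the function name, the threshold and the toy data are hypothetical and do not reproduce the CScape confidence measure.

```python
import numpy as np

def accuracy_at_confidence(p, y, tau):
    """Return (coverage, accuracy) when predicting only where confidence >= tau.

    p   : predicted probability of the positive class
    y   : true labels (0 or 1)
    tau : minimum confidence, between 0.5 and 1
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=int)
    confidence = np.maximum(p, 1.0 - p)
    keep = confidence >= tau
    coverage = float(keep.mean())
    if not keep.any():
        return coverage, float("nan")
    pred = (p[keep] >= 0.5).astype(int)
    return coverage, float((pred == y[keep]).mean())

# Toy data (hypothetical): eight positions with predicted probabilities and labels.
p = [0.95, 0.90, 0.60, 0.55, 0.45, 0.10, 0.05, 0.40]
y = [1, 1, 0, 1, 1, 0, 0, 0]

cov_all, acc_all = accuracy_at_confidence(p, y, 0.5)    # predict everywhere
cov_hi, acc_hi = accuracy_at_confidence(p, y, 0.85)     # high-confidence subset only
print(cov_all, acc_all)
print(cov_hi, acc_hi)
```

Raising the threshold typically increases accuracy on the retained positions while reducing coverage, which is the pattern described above.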

    Finally, we have developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:

    Hashem Shihab, Mark Rogers, Colin Campbell and Tom Gaunt. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics (2017) 33 (12): 1751-1757. https://doi.org/10.1093/bioinformatics/btx028

    Most human cells are diploid: their nuclei contain two complete sets of chromosomes, one from each parent (in humans, gametes are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of the gene product, leading to a disease trait.

    This research programme is ongoing. Interested researchers are welcome to get in contact.

2. Learning with indefinite kernels (SVM classification)


    Reference:

  • Yiming Ying, Colin Campbell and Mark Girolami. Analysis of SVM with Indefinite Kernels. Advances in Neural Information Processing Systems (NIPS) 22, 2009, p. 2214-2222.
  • Download the MATLAB code
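
For readers unfamiliar with the term, a kernel matrix is indefinite when it has at least one negative eigenvalue, so the usual SVM training problem is no longer convex. The sketch below checks this for a small similarity matrix and applies eigenvalue clipping, a common simple baseline. It does not reproduce the method analysed in the paper above or the MATLAB code; the matrix and function names are assumptions for illustration.

```python
import numpy as np

def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrised matrix K."""
    K = np.asarray(K, dtype=float)
    return float(np.linalg.eigvalsh((K + K.T) / 2.0)[0])

def clip_spectrum(K):
    """Set negative eigenvalues to zero, giving a positive semidefinite matrix."""
    K = np.asarray(K, dtype=float)
    K = (K + K.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(K)
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T

# A hypothetical similarity matrix that is symmetric but not positive semidefinite.
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.9],
              [0.1, 0.9, 1.0]])

lam_before = min_eigenvalue(S)                  # negative: S is indefinite
lam_after = min_eigenvalue(clip_spectrum(S))    # approximately zero after clipping
print(lam_before, lam_after)
```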

3. Variational Bayesian approach to LPD cluster analysis


    References:

  • Y. Ying, Peng Li and C. Campbell, A marginalized variational Bayesian approach to the analysis of array data, BMC Proceedings 2008, 2 (suppl 4):S7.

  • S. Rogers, M. Girolami, C. Campbell, and R. Breitling, The Latent Process Decomposition of cDNA Microarray Datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2 (2005) 143-156.
  • Download the MATLAB code