1. FATHMM-MKL and CScape
The FATHMM family of predictors originally predicted the pathogenic status of amino acid substitutions:
Variations of this predictor were created for disease-specific contexts, including cancer. For predicting the pathogenic status of single nucleotide variants (SNVs) in the human genome, the approach built on the observation that SNVs in regions of the genome that are highly conserved across species are more likely to be deleterious than variants in regions that vary widely across species.
As an improvement, we later devised a predictor for single nucleotide variants in both the coding and non-coding regions of the human genome. This predictor (FATHMM-MKL) used a wide variety of data sources to predict the pathogenic impact of individual SNVs, including sequence conservation across species, which remained the most informative source. The method used multiple kernel learning (see my webpages on machine learning, or Chapter 3.6 of C. Campbell and Y. Ying. Learning with Support Vector Machines, Morgan and Claypool, 2011): the algorithm learns to weight the different types of data according to their relative informativeness. This method is available at the FATHMM webserver site and was published here:
We have also devised a Genome Tolerance Browser to better visualise the locations of pathogenic single nucleotide variants in the human genome. In the depicted plots, peaks near unity indicate probable pathogenic SNVs and peaks near zero indicate neutral SNVs. Other prediction methods, e.g. CADD, are also presented, some as optional tracks. Reference:
Hashem A. Shihab, Mark F. Rogers, Michael Ferlaino, Colin Campbell and Tom R. Gaunt. GTB - an online genome tolerance browser. BMC Bioinformatics 2017, 18:20. DOI: 10.1186/s12859-016-1436-4.
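The kernel-weighting idea behind the multiple kernel approach described for FATHMM-MKL above can be illustrated with a minimal sketch. This is not the FATHMM-MKL algorithm: it swaps in a simple kernel-target alignment heuristic in place of a full multiple kernel learning optimisation, and the synthetic data, the function names and the value of gamma are all hypothetical choices made for illustration. One feature group is built to be informative (standing in for something like a conservation score) and one is pure noise; the informative kernel should receive the larger weight.

```python
import math
import random


def rbf_kernel(X, gamma=1.0):
    """Gram matrix of the Gaussian (RBF) kernel for a list of feature vectors."""
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = K[j][i] = math.exp(-gamma * sq)
    return K


def frobenius(A, B):
    """Frobenius inner product of two square matrices of equal size."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def alignment(K, y):
    """Kernel-target alignment between K and the ideal kernel y y^T."""
    n = len(y)
    Y = [[y[i] * y[j] for j in range(n)] for i in range(n)]
    return frobenius(K, Y) / math.sqrt(frobenius(K, K) * frobenius(Y, Y))


def kernel_weights(kernels, y):
    """Non-negative weights summing to one, proportional to each alignment.

    Assumption: this heuristic stands in for a learned MKL weighting.
    """
    scores = [max(alignment(K, y), 0.0) for K in kernels]
    total = sum(scores)
    if total == 0.0:
        # No kernel aligns with the labels: fall back to uniform weights.
        return [1.0 / len(kernels)] * len(kernels)
    return [s / total for s in scores]


def combine(kernels, weights):
    """Weighted sum of Gram matrices (itself a valid kernel for weights >= 0)."""
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: 20 "pathogenic" (+1) and 20 "neutral" (-1) variants.
random.seed(0)
n = 40
y = [1 if i < n // 2 else -1 for i in range(n)]
informative = [[random.gauss(2.0 * label, 0.5)] for label in y]  # tracks the label
noise = [[random.gauss(0.0, 1.0)] for _ in y]                    # ignores the label

kernels = [rbf_kernel(informative), rbf_kernel(noise)]
weights = kernel_weights(kernels, y)
combined = combine(kernels, weights)
print("weights (informative, noise):", weights)
```

The combined Gram matrix could then be passed to any support vector machine implementation that accepts a precomputed kernel. A full multiple kernel learning method would instead optimise the weights jointly with the classifier.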
Subsequently, we developed an indel predictor for estimating the pathogenic effects of short insertions or deletions in the DNA sequence. This predictor can handle indels in non-coding regions of the human genome:
A further area of interest has been disease-specific predictors, which are generally more accurate than generic predictors. We are therefore devising a suite of cancer-specific predictors under the generic title of CScape. Our first generic cancer predictor, CScape, uses a wide variety of data sources to predict whether a single nucleotide variant is potentially oncogenic:
Finally, we have developed a state-of-the-art integrative classifier for predicting haploinsufficient genes:
Hashem Shihab, Mark Rogers, Colin Campbell and Tom Gaunt. HIPred: an integrative approach to predicting haploinsufficient genes. Bioinformatics, https://doi.org/10.1093/bioinformatics/btx028 (2017).
The nuclei of many human cells are diploid: they contain two complete sets of chromosomes, one from each parent (in humans, germ cells are haploid). Haploinsufficiency occurs when a diploid organism has only a single functional copy of a gene (with the other copy inactivated by mutation) and this single functional copy does not produce enough of the gene product, leading to a disease trait.
This research programme is ongoing. Interested researchers are welcome to get in contact.
2. Learning with indefinite kernels (SVM classification)
Download the MATLAB code
3. Variational Bayesian approach to LPD cluster analysis
Download the MATLAB code