Methods
PhD-SNPg is a binary classifier based on the Gradient Boosting algorithm implemented in the scikit-learn package (http://scikit-learn.org/).
PhD-SNPg was trained and tested on a set of ~37,000 Pathogenic and Benign SNVs extracted from the ClinVar dataset.
The dataset, which is composed of ~2/3 Pathogenic and ~1/3 Benign SNVs, is distributed across all human chromosomes.
It is well known that damaging variants tend to occur in conserved regions of the genome.
Thus, as a preliminary test of the discriminating power of the conservation features, we analyzed the distribution of pre-calculated PhyloP scores from the UCSC repository.
The analysis revealed that PhyloP100 is the most discriminative input feature.
This observation is confirmed by the distribution of the PhyloP100 score at the mutated position for Pathogenic and Benign SNVs, which show median values of 5.7 and 0.4, respectively.
The final version of PhD-SNPg takes as input a 35-element vector (see figure) encoding a 5-nucleotide window around the mutated position.
- 25 elements of the input vector encode the sequence information and the mutation.
- 10 elements are conservation scores from the alignments of 7 (PhyloP7) and 100 (PhyloP100) species.
The size of the input window and other parameters were optimized by performing a 10-fold cross-validation test on ~35,000 SNVs.
On this subset, PhD-SNPg reaches an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.93.
To further assess the predictions of PhD-SNPg, we extracted a set of ~1,400 newly annotated SNVs from a more recent version of ClinVar.
On this testing set, PhD-SNPg reaches an AUC of 0.92.
In the table below we report the average performance of PhD-SNPg in cross-validation on the training set (Clinvar012016) and on the testing set (NewClinvar032016).
The performance scores (ACC, TPR, PPV, NPR, NPV, MCC and AUC) are defined in Wikipedia.
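The evaluation procedure can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins generated with `make_classification` (not ClinVar variants), the classifier uses default hyperparameters rather than the optimized ones, and the scores are computed on pooled out-of-fold predictions rather than averaged per fold. The sketch reports the true negative rate as TNR, on the assumption that this is the measure the text lists as NPR.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the 35-element vectors: NOT ClinVar data.
# Class 1 plays the role of Pathogenic (~2/3), class 0 of Benign (~1/3).
X, y = make_classification(n_samples=300, n_features=35, n_informative=10,
                           weights=[1 / 3, 2 / 3], random_state=0)

# Default hyperparameters are an assumption; the source optimized its own.
model = GradientBoostingClassifier(random_state=0)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold probability of the positive class for every sample.
proba = cross_val_predict(model, X, y, cv=folds, method="predict_proba")[:, 1]
pred = (proba > 0.5).astype(int)

# Confusion-matrix counts, then the threshold-dependent scores.
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
scores = {
    "ACC": (tp + tn) / (tp + tn + fp + fn),
    "TPR": tp / (tp + fn),
    "PPV": tp / (tp + fp),
    "TNR": tn / (tn + fp),
    "NPV": tn / (tn + fn),
    "MCC": matthews_corrcoef(y, pred),
    "AUC": roc_auc_score(y, proba),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The stratified split keeps the Pathogenic/Benign ratio roughly constant across folds, which matters for an unbalanced dataset such as the one described here.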
This performance is similar to or better than that obtained by CADD and FATHMM-MKL on the same dataset.
A similar trend was observed on the subsets of mutations in coding and non-coding regions.
This surprising result shows that our approach, based on a few input features, reaches similar or better accuracy than methods that rely on more complex input features.
PhD-SNPg, which is available on GitHub, can be installed by running a Python script that automatically downloads the required programs and data from the UCSC repository.
Running the program requires the scikit-learn package. PhD-SNPg can predict the effect of a single variant or of multiple SNVs listed in an input file.
A Variant Call Format (VCF) file is also accepted as input. Our scripts accept genomic coordinates from both the hg19 and hg38 human genome assemblies.
When the input is provided, PhD-SNPg internally runs the twoBitToFa program to extract the 5-nucleotide sequence window centered on the mutated position, and bigWigToBedGraph to extract the PhyloP7 and PhyloP100 scores at the corresponding positions.
When coordinates from the hg19 assembly are provided, the PhyloP7 conservation score is replaced with PhyloP46 calculated over primate species.
All the extracted information is assembled into the 35-element vector processed by the Gradient Boosting algorithm.
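The extraction step can be sketched as the construction of two command lines for the UCSC tools. The sketch below only builds the argument lists and does not run anything. The file names (`hg38.2bit`, `hg38.phyloP100way.bw`, `window.fa`, `window.bedGraph`) and the helper names are hypothetical, and the way PhD-SNPg actually invokes the tools may differ. The `-seq`, `-start`, `-end` and `-chrom` options are those documented in the usage messages of twoBitToFa and bigWigToBedGraph, which expect 0-based, half-open coordinates.

```python
def window_bounds(pos, flank=2):
    """Convert a 1-based position into a 0-based, half-open window.

    With flank=2 the window spans 5 nucleotides centered on `pos`.
    """
    start = pos - 1 - flank
    if start < 0:
        raise ValueError("window extends before the start of the chromosome")
    return start, pos + flank


def build_commands(chrom, pos,
                   two_bit="hg38.2bit",              # hypothetical file name
                   big_wig="hg38.phyloP100way.bw",   # hypothetical file name
                   fasta_out="window.fa",            # hypothetical output path
                   bedgraph_out="window.bedGraph"):  # hypothetical output path
    """Return the argument lists for the sequence and conservation lookups."""
    start, end = window_bounds(pos)
    seq_cmd = ["twoBitToFa", two_bit, fasta_out,
               f"-seq={chrom}", f"-start={start}", f"-end={end}"]
    cons_cmd = ["bigWigToBedGraph", big_wig, bedgraph_out,
                f"-chrom={chrom}", f"-start={start}", f"-end={end}"]
    return seq_cmd, cons_cmd


seq_cmd, cons_cmd = build_commands("chr1", 100)
print(" ".join(seq_cmd))
print(" ".join(cons_cmd))
```

Once the tools and data files are installed, each list could be passed to `subprocess.run(cmd, check=True)`. The same pattern would apply to the PhyloP7 track, or to PhyloP46 for hg19 coordinates, by changing the bigWig file.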
The main output of PhD-SNPg is the probability that an SNV is pathogenic. If the probability is >0.5, the SNV is predicted to be Pathogenic, otherwise Benign.
For each prediction, PhD-SNPg calculates the false discovery rate associated with Pathogenic and Benign SNVs.
In addition, our script returns the PhyloP100 score of the mutated site and its average value over the 5-nucleotide window centered on the mutated position.
When a VCF file is provided as input, the output values are appended after the last column of each row.
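The output rules above can be sketched in a few lines. The 0.5 threshold and the appending of values after the last VCF column come from the text. The order of the appended columns, the number formatting, and the function names are assumptions. The false discovery rate is left out because the text does not say how it is computed.

```python
from statistics import mean


def classify(probability, threshold=0.5):
    """Label an SNV Pathogenic when its probability exceeds the threshold."""
    return "Pathogenic" if probability > threshold else "Benign"


def annotate_vcf_line(line, probability, phylop100_window):
    """Append prediction values after the last column of a VCF data row.

    The column order and the 3-decimal formatting are hypothetical.
    """
    if len(phylop100_window) != 5:
        raise ValueError("expected PhyloP100 scores for a 5-nucleotide window")
    center_score = phylop100_window[2]  # score at the mutated position
    extra = [
        classify(probability),
        f"{probability:.3f}",
        f"{center_score:.3f}",
        f"{mean(phylop100_window):.3f}",  # average over the window
    ]
    return line.rstrip("\n") + "\t" + "\t".join(extra)


row = "chr1\t100\t.\tG\tT"
annotated = annotate_vcf_line(row, 0.9, [1, 2, 3, 4, 5])
print(annotated)
```

A probability of exactly 0.5 is labeled Benign here, following the strict ">0.5" rule stated above.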
PhD-SNPg, running on an Intel Xeon 2.40GHz machine, predicts the effect of 1,000 SNVs in less than 2 minutes.