Methods
PhD-SNPg is a binary classifier that implements Gradient Boosting-based algorithm from scikit-learn package (http://scikit-learn.org/).
The new version of PhD-SNPg has been trained and tested using a set of ~104,000 Pathogenic and Benign SNVs extracted from
Clinvar dataset.
The new dataset is composed by and equal fraction of Pathogenic and Benign SNVs, distributed across all the human chromosomes.
It is well known that damaging variants corresponds on average to conserved regions of the genome.
Thus, as preliminary test for estimating the discriminating power of the conservation features, we analyzed the distribution of pre-calculated PhyloP scores from the UCSC repository.
The analysis revealed that PhyloP100 is the most discriminative input feature.
This observation is confirmed by plotting the distribution of the PhyloP100 score in the mutated position for
Pathogenic and
Benign SNVs,
which show median values of 6.2 and 0.2 respectively. In the new version of PhD-SNPg we take advantage of the new PhyloP470 score
which was derived from the alignment of 470 species.
Thus, in the new version of PhD-SNPg takes in input a 35-elements vector (see figure) encoding for a 5-nucleotides window around the mutated position.
-
25 elements of the input vector encodes for the sequence information and the mutation.
-
10 elements are conservation scores from the alignments of 100 (PhyloP100) and 470 (PhyloP470) species.
A similar approach was implemented for predicting the pathogenic InDels.
In particular, we assume that the effect of an InDel corresponds to the
effect of the closest SNV that is obtained by deleting and/or inserting
a set of nucleotides in a given region of the genome.
Using this assumption, we developed a second version of PhD-SNPg
for predicting the impact of the InDels which takes in input 38 values.
In detail, the input is composed by 35 values used for predicting the impact of SNVs
and three new features encoding for the size and location of the InDel.
They represent the lengths of the reference and alternative alleles and a boolean
variable corresponding to the location of the mutated loci in coding or noncoding regions.
In the figure below, we represented the example of the deletion chr13:g.77000728 CAGGA>C
which, in the closets loci, corresponds to the change of G (Guanine) to A (Adenine)
in position 77,000,730 of chromosome 13. In the second part of the figure is reported
a representation of the input features.
Initially, PhD-SNPg performance was evaluated using a 10-fold cross-validation test on ~104,000 SNVs. On this subset PhD-SNPg reaches Area Under the Receiver Operating Characteristics Curve (AUC) of 0.95. To further assess the prediction of PhD-SNPg, we extracted a set of ~43,600 newly annotated SNVs from a more recent version of Clinvar. On this testing set, PhD-SNPg reaches an AUC of 0.96. In the table below we report the average performance of PhD-SNPg in cross-validation on training and testing sets. The performance scores (ACC, TPR, PPV, NPR, NPV, MCC and AUC) are defied in Wikipedia. Both datasets Clinvar122020-SNV and NewClinvar122022-SNV are available at this link.
The performance of PhD-SNPg in the prediction of pathogenic InDels was evaluated
using a 10-fold cross-validation test on ~34,000 InDels.
On this subset PhD-SNPg reaches Area Under the Receiver Operating Characteristics
Curve (AUC) of 0.95.
To further assess the prediction of PhD-SNPg, we extracted a set of ~9,000 annotated
InDels from a previous version of Clinvar. On this
testing set, PhD-SNPg reaches an AUC of 0.96. In the table below
we report the average performance of PhD-SNPg in cross-validation on
training and testing sets.
Both datasets Clinvar122020-InDel and NewClinvar122022-InDel
are available at this link.
These performances are similar or better than the scores obtained by
CADD on the same dataset.
Similar trend has been observed on the subset of mutations in coding and non-coding regions.
This surprising results shows that our approach, based on few input features, reaches similar of better accuracy than methods that rely
on more complex input features.
PhD-SNPg, which is available on GitHub, can be installed running a python2 script that automatically downloads the programs and data from the UCSC repository.
For running the program scikit-learn package needs to be installed. PhD-SNPg can predict the effect of single variant or multiple SNVs from an input file.
Variant Calling Format (VCF) file is also accepted as input. Our scripts accept in input genomic coordinates from both human genome assemblies hg19 and hg38.
When the input is provided, PhD-SNPg internally runs twoBitToFa
program to extract the 5-nucleotides window sequence centered on the mutated position and
bigWigToBedGraph
to extract the PhyloP100 and PhyloP470 scores in the corresponding positions. When coordinates from hg19 assembly are provided, the
server internally executes a liftOver and return the prediction using the same PhyloP conservation scores. All the extracted information generates the 35-elements vector processed by the
Gradient Boosting algorithm.
The main output of PhD-SNPg represents the probability that a SNV is pathogenic. If the probability is >0.5 then the SNV is predicted to be
Pathogenic otherwise Benign.
For each prediction PhD-SNPg calculates the false discovery rate associated to
Pathogenic and Benign
SNVs. In addition our script returns the PhyloP470 score of the mutated site and its average value on the 5-nucleotides window centered
on the mutated position. When a VCF file is provided in input, the output values are after the last column of each row.
PhD-SNPg, running on an Intel Xeon 2.40GHz machine, predicts the effect of 1,000 SNVs in less then 2 minutes.