Methods
Fido-SNP is a binary classifier based on the Gradient Boosting algorithm implemented in the scikit-learn package. It reuses the model previously developed for human variants (PhD-SNPg), with the classification threshold fine-tuned to discriminate between potentially pathogenic variants and dbSNP variants in the dog genome. The original human predictor, PhD-SNPg, was trained on a set of ~37,000 Pathogenic and Benign variants from the ClinVar database. The threshold optimization was performed on a subset of 1,479 well-conserved pathogenic human variants. This subset was obtained from the UCSC 100-way alignment by selecting the pathogenic variants that meet the following criteria:
- conserved locus and 5-nucleotide window between human and dog
- frequency of the reference allele greater than 95% at the mutated locus
- average conservation over the 5-nucleotide window around the mutated locus greater than 95%
- no occurrence of the alternative allele in the multiple alignment at the mutated site
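The four criteria can be sketched as a filter over alignment columns. The data layout (one string of aligned bases per column, with `-` for gaps), the function names, and the reading of "conservation" as the fraction of aligned species carrying the human reference base are all assumptions made for illustration; the actual pipeline may compute these quantities differently.

```python
def ref_fraction(column: str, ref: str) -> float:
    """Fraction of aligned (non-gap) bases in a column that equal ref."""
    bases = [b for b in column.upper() if b in "ACGT"]
    return sum(b == ref for b in bases) / len(bases) if bases else 0.0


def passes_criteria(human_win, dog_win, columns, alt, cutoff=0.95):
    """Apply the four selection criteria to one variant (hypothetical helper).

    human_win, dog_win: 5-nucleotide windows centered on the variant.
    columns: five alignment columns (strings of bases, '-' for gaps).
    alt: alternative allele at the central position.
    """
    center = 2
    human = human_win.upper()
    ref = human[center]
    # 1. locus and window identical between human and dog
    if human != dog_win.upper():
        return False
    # 2. reference allele present in more than 95% of aligned species
    if ref_fraction(columns[center], ref) <= cutoff:
        return False
    # 3. average conservation over the window above 95%
    fractions = [ref_fraction(c, h) for c, h in zip(columns, human)]
    if sum(fractions) / len(fractions) <= cutoff:
        return False
    # 4. alternative allele never observed at the mutated site
    return alt.upper() not in columns[center].upper()
```

A variant is kept only when `passes_criteria` returns `True`; a single species carrying the alternative allele at the central column is enough to reject it.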
The negative set of potentially benign variants was randomly selected from the dog variants in the dbSNP database. The performance of Fido-SNP was validated on a set of 75 pathogenic dog variants from the OMIA database, complemented by the same number of potentially benign variants from dbSNP. Like PhD-SNPg, Fido-SNP takes as input a 35-element vector (see figure) encoding a 5-nucleotide window around the mutated position.
- 25 elements of the input vector encode the sequence information and the mutation.
- 10 elements are conservation scores from the alignments of 4 (PhyloP4) and 11 (PhyloP11) species.
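A minimal sketch of how such a 35-element vector could be assembled. The layout is an assumption made for illustration: five symbols (A, C, G, T, N) per position, with the reference allele marked -1 and the alternative allele +1 at the central position, followed by five PhyloP4 and five PhyloP11 scores. The function name and the ordering of the elements are hypothetical.

```python
ALPHABET = "ACGTN"  # assumed 5-symbol alphabet: 5 positions x 5 symbols = 25


def encode_variant(window, alt, phylop4, phylop11):
    """Build a 35-element feature vector for one SNV (illustrative layout).

    window: 5-nucleotide sequence centered on the mutated position.
    alt: alternative allele.
    phylop4, phylop11: five conservation scores each, one per window position.
    """
    if len(window) != 5 or len(phylop4) != 5 or len(phylop11) != 5:
        raise ValueError("window and score lists must all have length 5")
    vector = []
    for i, base in enumerate(window.upper()):
        block = [0.0] * len(ALPHABET)
        # unknown symbols are mapped to N
        idx = ALPHABET.index(base) if base in ALPHABET else ALPHABET.index("N")
        if i == 2:
            # central position: reference allele -1, alternative allele +1
            block[idx] = -1.0
            block[ALPHABET.index(alt.upper())] = 1.0
        else:
            block[idx] = 1.0
        vector.extend(block)
    vector.extend(float(s) for s in phylop4)   # elements 25-29
    vector.extend(float(s) for s in phylop11)  # elements 30-34
    return vector
```

For example, `encode_variant("ACGTA", "T", [0.1] * 5, [0.2] * 5)` yields a list of length 35 whose central block (elements 10 to 14) marks G as the reference and T as the alternative allele.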
A new multiple alignment was built with the dog genome as reference to develop
Fido-SNP, following the recipe defined by the Genome Browser at UCSC. Because the
human and dog multiple alignments differ, the prediction threshold must be adjusted
when moving from the human to the dog genome in order to achieve the best performance
in discriminating between Pathogenic and Benign variants.
For consistency with standard prediction methods, the output of Fido-SNP is rescaled
to make the output space symmetric around 0.5 as follows:
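The rescaling formula itself is not reproduced in this text. As an illustration only, one way to obtain an output space symmetric around 0.5 is a piecewise-linear map that sends the tuned decision threshold to 0.5 while keeping 0 and 1 fixed. Both this functional form and the threshold value used in the example are assumptions, not the mapping actually implemented in Fido-SNP.

```python
def rescale(p: float, threshold: float) -> float:
    """Piecewise-linear map of [0, 1] onto itself that sends threshold to 0.5.

    Illustrative assumption: scores at or below the threshold are mapped
    linearly onto [0, 0.5], scores above it onto (0.5, 1].
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if p <= threshold:
        return 0.5 * p / threshold
    return 0.5 + 0.5 * (p - threshold) / (1.0 - threshold)
```

With a hypothetical threshold of 0.2, a raw score equal to 0.2 maps to 0.5, so the usual ">0.5 means Pathogenic" rule can be applied to the rescaled value.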
On the subset from the OMIA database, Fido-SNP reaches an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.91. In the table below we report the average performance of Fido-SNP on the optimization set (Mapped Human) and on the validation set (OMIA Dog). The performance scores (ACC, TPR, PPV, NPR, NPV, MCC and AUC) are defined in Wikipedia.
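For reference, the count-based scores can be computed from a confusion matrix with the standard definitions, as in the sketch below. The function name is hypothetical; TNR (true negative rate) is included as the usual counterpart of TPR, and AUC is omitted because it requires the ranked prediction scores rather than counts.

```python
import math


def binary_scores(tp, fp, tn, fn):
    """Standard confusion-matrix scores for a binary classifier."""
    def ratio(num, den):
        # undefined ratios are reported as NaN instead of raising
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, tp + fp + tn + fn),  # overall accuracy
        "TPR": ratio(tp, tp + fn),                 # sensitivity
        "TNR": ratio(tn, tn + fp),                 # specificity
        "PPV": ratio(tp, tp + fp),                 # precision on Pathogenic
        "NPV": ratio(tn, tn + fn),                 # precision on Benign
        "MCC": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation
    }
```

Here `tp` and `fn` count pathogenic variants predicted as Pathogenic and Benign respectively, while `tn` and `fp` count benign variants predicted as Benign and Pathogenic.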
The following table summarizes the composition of the datasets used for the optimization and testing of Fido-SNP.
Other datasets used for testing the performance of Fido-SNP are
Fido-SNP, which is available on GitHub, can be installed by running a Python script that automatically downloads the programs and data from the UCSC repository.
Running the program requires the scikit-learn package. Fido-SNP can predict the effect of a single variant or of multiple SNVs listed in an input file.
A Variant Call Format (VCF) file is also accepted as input. Our scripts accept genomic coordinates from both dog genome assemblies, canFam2 and canFam3.
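As a rough illustration of reading SNVs from VCF input, the sketch below extracts the first five standard VCF columns (CHROM, POS, ID, REF, ALT) and keeps only single-nucleotide substitutions. The function name is hypothetical, and skipping indels and multi-allelic records is an assumption; how Fido-SNP itself treats such records is not stated here.

```python
def parse_vcf_line(line):
    """Return (chrom, pos, ref, alt) for a single-nucleotide VCF record, else None."""
    if not line.strip() or line.startswith("#"):
        return None  # blank line or header
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return None  # not a complete VCF record
    chrom, pos, _id, ref, alt = fields[:5]
    ref, alt = ref.upper(), alt.upper()
    # keep only one-base substitutions between unambiguous nucleotides
    if len(ref) != 1 or len(alt) != 1 or ref not in "ACGT" or alt not in "ACGT":
        return None
    return chrom, int(pos), ref, alt
```

VCF positions are 1-based, so the returned `pos` refers directly to the mutated nucleotide on the given chromosome.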
When the input is provided, Fido-SNP internally runs the twoBitToFa program to extract
the 5-nucleotide window sequence centered on the mutated position and bigWigToBedGraph
to extract the PhyloP4 and PhyloP11 scores at the corresponding positions. When coordinates from the canFam2 assembly are provided, the same PhyloP scores are used.
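The two extraction steps can be sketched as command lines built for a 1-based variant position. The sketch only constructs the argument lists and does not run them; the function name and all file names are hypothetical, and the way Fido-SNP actually invokes these tools may differ. The `-seq`, `-chrom`, `-start` and `-end` options are those documented for the UCSC utilities, which use 0-based, half-open coordinates.

```python
def window_commands(chrom, pos, two_bit, bigwig, fasta_out, bedgraph_out):
    """Build command lines for the 5-nucleotide window around a 1-based position."""
    start = pos - 3  # 0-based start: two bases upstream of the variant
    end = pos + 2    # half-open end: two bases downstream of the variant
    if start < 0:
        raise ValueError("window extends past the start of the chromosome")
    region = [f"-start={start}", f"-end={end}"]
    # sequence of the window from the .2bit genome file
    seq_cmd = ["twoBitToFa", f"-seq={chrom}", *region, two_bit, fasta_out]
    # per-base conservation scores from a PhyloP bigWig file
    score_cmd = ["bigWigToBedGraph", f"-chrom={chrom}", *region, bigwig, bedgraph_out]
    return seq_cmd, score_cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; the score command would be issued once per conservation track (PhyloP4 and PhyloP11).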
All the extracted information is assembled into the 35-element vector processed by the
Gradient Boosting algorithm.
The main output of Fido-SNP is the probability that an SNV is pathogenic. If the probability is >0.5, the SNV is predicted to be
Pathogenic, otherwise Benign.
For each prediction, Fido-SNP calculates the false discovery rate associated with
Pathogenic and Benign
SNVs. In addition, our script returns the PhyloP11 score of the mutated site and its average value over the 5-nucleotide window centered
on the mutated position. When a VCF file is provided as input, the output values are appended after the last column of each row.
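The decision rule and the VCF annotation step can be sketched as follows. The function name, the order of the appended columns and the three-decimal formatting are assumptions for illustration; the false discovery rate is left out because its computation is not described here.

```python
def annotate_vcf_row(row, probability, phylop11_site, phylop11_window):
    """Append a label and scores after the last column of a tab-separated VCF row."""
    # probabilities strictly above 0.5 are called Pathogenic
    label = "Pathogenic" if probability > 0.5 else "Benign"
    # average PhyloP11 over the 5-nucleotide window
    avg = sum(phylop11_window) / len(phylop11_window)
    extra = [label, f"{probability:.3f}", f"{phylop11_site:.3f}", f"{avg:.3f}"]
    return row.rstrip("\n") + "\t" + "\t".join(extra)
```

The original columns of the row are left unchanged, so the annotated line can still be split on tabs and read alongside the input file.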
Fido-SNP, running on an Intel Xeon 2.40GHz machine, predicts the effect of 1,000 SNVs in less than 2 minutes.