Fido-SNP - Method

Methods

Fido-SNP is a binary classifier based on the Gradient Boosting algorithm implemented in the scikit-learn package. It was built on the model previously developed for human variants (PhD-SNP^g), and its classification threshold has been fine-tuned to discriminate between potentially pathogenic variants and dbSNP variants in the dog genome. The original human predictor, PhD-SNP^g, was trained on a set of ~37,000 Pathogenic and Benign variants from the ClinVar database. The optimization procedure was performed on a subset of 1,479 well-conserved pathogenic human variants. This subset was obtained from the UCSC 100-way alignment by selecting the pathogenic variants that meet the following criteria:

the mutated locus and its 5-nucleotide window are conserved between human and dog
frequency of the reference allele greater than 95% at the mutated locus
average conservation over the 5-nucleotide window around the mutated locus greater than 95%
absence of the alternative allele from the multiple alignment at the mutated site
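
The four criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' selection script: it assumes each alignment column is given as a string with one base per species, that "conservation" at a position means the fraction of aligned bases equal to the human reference, and that gaps are written as `-` or `.`. The function names `ref_frequency` and `passes_filter` are hypothetical.

```python
from typing import Sequence


def ref_frequency(column: str, ref: str) -> float:
    """Fraction of aligned (non-gap) bases in one alignment column equal to ref."""
    bases = [b.upper() for b in column if b not in "-."]
    if not bases:
        return 0.0
    return sum(b == ref.upper() for b in bases) / len(bases)


def passes_filter(columns: Sequence[str], human_window: str, dog_window: str,
                  alt: str, cutoff: float = 0.95) -> bool:
    """Apply the four selection criteria to one pathogenic human variant.

    columns      -- five alignment columns centred on the mutated locus
    human_window -- human 5-nucleotide window (reference alleles)
    dog_window   -- dog 5-nucleotide window aligned to the human one
    alt          -- alternative allele of the variant
    """
    centre = len(columns) // 2
    freqs = [ref_frequency(col, ref) for col, ref in zip(columns, human_window)]
    return (
        human_window.upper() == dog_window.upper()        # locus and window conserved
        and freqs[centre] > cutoff                        # reference allele frequency > 95%
        and sum(freqs) / len(freqs) > cutoff              # window average > 95%
        and alt.upper() not in columns[centre].upper()    # alternative allele never observed
    )
```

A variant is kept only when all four conditions hold, so a single species carrying the alternative allele at the mutated site is enough to discard it.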

The negative set of potentially benign variants was randomly selected from the dog variants in the dbSNP database. The performance of Fido-SNP has been validated on a set of 75 pathogenic dog variants from the OMIA database, complemented by the same number of potentially benign variants from dbSNP. Like PhD-SNP^g, the Fido-SNP algorithm takes as input a 35-element vector (see figure) encoding a 5-nucleotide window around the mutated position:

25 elements of the input vector encode the sequence information and the mutation.
10 elements are conservation scores from the alignments of 4 (PhyloP4) and 11 (PhyloP11) species.
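
To make the 25 + 10 layout concrete, the sketch below builds one such vector. The exact encoding is not given in the text, so everything specific here is an assumption made for illustration: a 5-symbol alphabet (A, C, G, T, N) per position, the reference allele marked -1 and the alternative allele +1 at the central position, and the five PhyloP4 scores placed before the five PhyloP11 scores. The name `encode_variant` is hypothetical.

```python
ALPHABET = "ACGTN"  # assumed alphabet: 5 positions x 5 symbols = 25 elements


def encode_variant(window, alt, phylop4, phylop11):
    """Return a 35-element list for a 5-nucleotide window and its mutation.

    window   -- reference 5-nucleotide sequence centred on the variant
    alt      -- alternative allele at the central position
    phylop4  -- five PhyloP4 scores, one per window position
    phylop11 -- five PhyloP11 scores, one per window position
    """
    if len(window) != 5 or len(phylop4) != 5 or len(phylop11) != 5:
        raise ValueError("window and score lists must all have length 5")
    sequence_part = []
    for pos, base in enumerate(window.upper()):
        block = [0.0] * len(ALPHABET)
        idx = ALPHABET.index(base if base in ALPHABET else "N")
        block[idx] = 1.0
        if pos == 2:
            # Assumed mutation encoding: reference = -1, alternative = +1.
            block[idx] = -1.0
            block[ALPHABET.index(alt.upper())] = 1.0
        sequence_part.extend(block)
    return sequence_part + list(phylop4) + list(phylop11)
```

Whatever the real encoding is, the resulting vector has a fixed length of 35, which is what the Gradient Boosting model expects for every variant.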

A new multiple alignment, with the dog genome as reference, has been built to develop Fido-SNP using the recipe defined by the UCSC Genome Browser. Because the human and dog multiple alignments differ, the prediction threshold must be adjusted when moving from the human to the dog genome to achieve the best performance in discriminating between Pathogenic and Benign variants. For consistency with standard prediction methods, the output of Fido-SNP is rescaled so that the output space is symmetric around 0.5.
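
The rescaling formula itself is not reproduced in this text. As an illustration only, one common way to move a tuned threshold to 0.5 is a piecewise-linear map; the sketch below assumes that form and a hypothetical threshold `t`, neither of which is confirmed by the source.

```python
def rescale(p: float, t: float) -> float:
    """Map a raw score p in [0, 1] so that the tuned threshold t lands on 0.5.

    Assumed piecewise-linear form: [0, t] is stretched onto [0, 0.5]
    and [t, 1] onto [0.5, 1]. Requires 0 < t < 1.
    """
    if not 0.0 < t < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if p >= t:
        return 0.5 + 0.5 * (p - t) / (1.0 - t)
    return 0.5 * p / t
```

With such a map, a raw score equal to the tuned threshold becomes exactly 0.5, so the usual "greater than 0.5 means Pathogenic" rule applies to the rescaled output.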

On the subset from the OMIA database, Fido-SNP reaches an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.91. The table below reports the average performance of Fido-SNP on the optimization set (Mapped Human) and on the validation set (OMIA Dog). The performance scores (ACC, TPR, PPV, TNR, NPV, MCC and AUC) are defined in Wikipedia.
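
For reference, the threshold-dependent scores can be computed from a confusion matrix with their standard definitions, as in the sketch below. The counts in the usage note are invented for illustration and are not Fido-SNP results; the function name `binary_scores` is hypothetical.

```python
from math import sqrt


def binary_scores(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification scores, with Pathogenic as the positive class."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),   # overall accuracy
        "TPR": tp / (tp + fn),                    # true positive rate (sensitivity)
        "TNR": tn / (tn + fp),                    # true negative rate (specificity)
        "PPV": tp / (tp + fp),                    # positive predictive value
        "NPV": tn / (tn + fn),                    # negative predictive value
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }
```

For example, `binary_scores(tp=8, fp=1, tn=9, fn=2)` gives an accuracy of 0.85 and a true positive rate of 0.8. The AUC is not included because it is computed from the ranked scores rather than from a single confusion matrix.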

The following table summarizes the composition of the datasets used for the optimization and testing of Fido-SNP.

Other datasets were also used for testing the performance of Fido-SNP.

Fido-SNP, which is available on GitHub, can be installed by running a Python script that automatically downloads the programs and data from the UCSC repository. Running the program requires the scikit-learn package. Fido-SNP can predict the effect of a single variant or of multiple SNVs listed in an input file; a Variant Call Format (VCF) file is also accepted as input. The scripts accept genomic coordinates from both dog genome assemblies, canFam2 and canFam3. When the input is provided, Fido-SNP internally runs the twoBitToFa program to extract the 5-nucleotide window centered on the mutated position, and bigWigToBedGraph to extract the PhyloP4 and PhyloP11 scores at the corresponding positions. When coordinates from the canFam2 assembly are provided, the same PhyloP scores are used. The extracted information is assembled into the 35-element vector processed by the Gradient Boosting algorithm.
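
The two extraction steps can be sketched as command lines built in Python. This is not the code used by Fido-SNP: it assumes 1-based input positions (as in VCF), relies on the `-seq`, `-start`, `-end` options of twoBitToFa and the `-chrom`, `-start`, `-end` options of bigWigToBedGraph, and assumes that both tools accept `stdout` as the output file name. The helper names are hypothetical.

```python
import subprocess


def window_bounds(pos: int, flank: int = 2):
    """Convert a 1-based position to the 0-based, half-open range of its window."""
    return pos - 1 - flank, pos + flank


def twobit_cmd(twobit: str, chrom: str, pos: int, out: str = "stdout"):
    """Command extracting the 5-nucleotide window from a .2bit genome file."""
    start, end = window_bounds(pos)
    return ["twoBitToFa", f"-seq={chrom}", f"-start={start}", f"-end={end}", twobit, out]


def bigwig_cmd(bigwig: str, chrom: str, pos: int, out: str = "stdout"):
    """Command extracting the PhyloP scores of the window from a .bw file."""
    start, end = window_bounds(pos)
    return ["bigWigToBedGraph", f"-chrom={chrom}", f"-start={start}", f"-end={end}", bigwig, out]


def run(cmd):
    """Run one command and return its standard output (requires the UCSC tools)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

For a variant at position 100 of chr1, `twobit_cmd("canfam3.2bit", "chr1", 100)` requests the range 97 to 102, which covers the two flanking bases on each side of the mutated position.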

The main output of Fido-SNP is the probability that an SNV is pathogenic. If the probability is greater than 0.5, the SNV is predicted to be Pathogenic; otherwise it is predicted to be Benign. For each prediction, Fido-SNP calculates the false discovery rate associated with Pathogenic and Benign SNVs. In addition, the script returns the PhyloP11 score of the mutated site and its average value over the 5-nucleotide window centered on the mutated position. When a VCF file is provided as input, the output values are appended after the last column of each row.
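
The decision rule and the VCF output convention can be illustrated as follows. The order and formatting of the appended columns are assumptions made for this sketch and may differ from the actual Fido-SNP output; `classify` and `annotate_vcf_line` are hypothetical names.

```python
def classify(prob: float, threshold: float = 0.5) -> str:
    """Label an SNV from its predicted probability of being pathogenic."""
    return "Pathogenic" if prob > threshold else "Benign"


def annotate_vcf_line(line: str, prob: float, fdr: float,
                      phylop11: float, avg_phylop11: float) -> str:
    """Append the prediction values after the last column of one VCF row.

    Assumed column order: label, probability, FDR, PhyloP11, window-average PhyloP11.
    """
    fields = line.rstrip("\n").split("\t")
    fields += [classify(prob), f"{prob:.3f}", f"{fdr:.3f}",
               f"{phylop11:.3f}", f"{avg_phylop11:.3f}"]
    return "\t".join(fields)
```

Because the comparison is strict, a probability of exactly 0.5 is labelled Benign, in line with the rule stated above.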

Running on an Intel Xeon 2.40GHz machine, Fido-SNP predicts the effect of 1,000 SNVs in less than 2 minutes.

References

For more details about Fido-SNP and PhD-SNP^g please refer to the publications below and their supplementary material.

Capriotti E, Montanucci L, Profiti G, Rossi I, Giannuzzi D, Aresu L, Fariselli P. (2019). Fido-SNP: The first webserver for scoring the impact of single nucleotide variants in the dog genome. Nucleic Acids Research. DOI:10.1093/nar/gkz420.

Capriotti E, Fariselli P. (2017). PhD-SNP^g: A webserver and lightweight tool for scoring single nucleotide variants. Nucleic Acids Research. DOI:10.1093/nar/gkx369.

Standalone Package Installation

Fido-SNP is available for download on GitHub. After cloning the repository on your own machine, execute the following instructions for the installation. The minimum requirements for the installation are wget, zcat, curl and scikit-learn. To provide the correct version of scikit-learn, the installer automatically runs a local installation.


      Run:
        git clone https://github.com/biofold/Fido-SNP
        cd Fido-SNP
        python setup.py install arch_type

      arch_type is a Linux 64-bit architecture for which UCSC
      executable files are available. The standard version is:
        - linux.x86_64

      Installation time depends on the network speed.
      About 35 GB of UCSC-like files need to be downloaded.

      For light installation:
          python setup.py install arch_type --web

      With the install --web option Fido-SNP
      runs without downloading the UCSC data.
      In this mode the performance of the program
      depends on the network speed.

To test the program, run the following commands.


      Test:
        python setup.py test

      For web installation:
        python setup.py test --web

The manual installation of Fido-SNP can be performed following the steps reported below.


      1) Download the Fido-SNP scripts from GitHub
        - git clone https://github.com/biofold/Fido-SNP

      2) Required Python library: scikit-learn-0.17
          It is already available in the tools directory.

        - Untar the scikit-learn-0.17.tar.gz archive and run
          python setup.py install --install-lib=../
          The package is also available from
          https://pypi.python.org/simple/scikit-learn/

      3) Required UCSC tools and data:
        - bigWigToBedGraph and twoBitToFa from
          http://hgdownload.cse.ucsc.edu/admin/exe
          to be placed in the ucsc/exe directory

        - For canfam2-based predictions:
          canfam2.2bit: http://snps.biofold.org/Fido-SNP/ucsc/canfam2/canfam2.2bit
          canfam2.phyloP4way.bw: http://snps.biofold.org/Fido-SNP/ucsc/canfam2/canfam2.phyloP4way.bw
          canfam2.phyloP10way.bw: http://snps.biofold.org/Fido-SNP/ucsc/canfam2/canfam2.phyloP10way.bw

        - For canfam3-based predictions:
          canfam3.2bit: http://snps.biofold.org/Fido-SNP/ucsc/canfam3/canfam3.2bit
          canfam3.phyloP4way.bw: http://snps.biofold.org/Fido-SNP/ucsc/canfam3/canfam3.phyloP4way.bw
          canfam3.phyloP10way.bw: http://snps.biofold.org/Fido-SNP/ucsc/canfam3/canfam3.phyloP10way.bw
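
The data files listed in step 3 can also be fetched with a short Python script. This is a sketch rather than part of the Fido-SNP installer: the URLs are the ones listed above, while the destination layout (`ucsc/<assembly>/`) and the function names are assumptions made for illustration.

```python
import os
from urllib.request import urlretrieve

BASE_URL = "http://snps.biofold.org/Fido-SNP/ucsc"
SUFFIXES = ("2bit", "phyloP4way.bw", "phyloP10way.bw")


def data_files(assembly: str, dest_root: str = "ucsc"):
    """Return (url, local_path) pairs for one assembly (canfam2 or canfam3)."""
    return [
        (f"{BASE_URL}/{assembly}/{assembly}.{suffix}",
         os.path.join(dest_root, assembly, f"{assembly}.{suffix}"))
        for suffix in SUFFIXES
    ]


def download(assembly: str, dest_root: str = "ucsc") -> None:
    """Download the files that are not already present locally."""
    for url, path in data_files(assembly, dest_root):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if not os.path.exists(path):
            urlretrieve(url, path)
```

Calling `download("canfam3")` would fetch the three canfam3 files; since the full data set is large, the script skips files that already exist so that an interrupted installation can be resumed.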