PhD-SNP^g

Predicting human Deleterious SNPs in human genome
A binary classifier for predicting pathogenic variants in coding and non-coding regions.

Methods

PhD-SNP^g is a binary classifier that implements a Gradient Boosting-based algorithm from the scikit-learn package (http://scikit-learn.org/). PhD-SNP^g has been trained and tested on a set of ~37,000 Pathogenic and Benign SNVs extracted from the Clinvar dataset. The dataset, which is composed of ~2/3 Pathogenic and 1/3 Benign SNVs, is distributed across all the human chromosomes. It is well known that damaging variants occur on average in conserved regions of the genome. Thus, as a preliminary test for estimating the discriminating power of the conservation features, we analyzed the distribution of pre-calculated PhyloP scores from the UCSC repository. The analysis revealed that PhyloP100 is the most discriminative input feature. This observation is confirmed by plotting the distribution of the PhyloP100 score at the mutated position for Pathogenic and Benign SNVs, which shows median values of 5.7 and 0.4 respectively.

The final version of PhD-SNP^g takes as input a 35-element vector (see figure) encoding a 5-nucleotide window around the mutated position.

25 elements of the input vector encode the sequence information and the mutation.
10 elements are conservation scores from the alignments of 7 (PhyloP7) and 100 (PhyloP100) species.
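
The layout of the input vector can be sketched as follows. This is a minimal illustration, not the code of the package: the 5-symbol alphabet (A, C, G, T, N), the -1/+1 marking of the reference and alternate alleles at the central position, and the function name encode_variant are assumptions made for the example. Only the sizes (25 + 10 = 35 elements, a 5-nucleotide window, PhyloP7 and PhyloP100 scores) come from the description above.

```python
# Hypothetical encoding sketch: the alphabet and the allele marking are assumptions.
ALPHABET = "ACGTN"  # assumed 5-symbol alphabet, so 5 positions x 5 symbols = 25 elements
WINDOW = 5          # 5-nucleotide window centered on the mutated position


def encode_variant(window_seq, alt, phylop7, phylop100):
    """Return a 35-element list: 25 sequence/mutation values + 10 conservation scores."""
    if len(window_seq) != WINDOW:
        raise ValueError("window_seq must contain 5 nucleotides")
    if len(phylop7) != WINDOW or len(phylop100) != WINDOW:
        raise ValueError("one conservation score per window position is required")

    center = WINDOW // 2
    seq_part = []
    for i, nuc in enumerate(window_seq.upper()):
        block = [0.0] * len(ALPHABET)
        if i == center:
            # Assumed convention: reference allele -1, alternate allele +1.
            block[ALPHABET.index(nuc)] = -1.0
            block[ALPHABET.index(alt.upper())] = 1.0
        else:
            block[ALPHABET.index(nuc)] = 1.0
        seq_part.extend(block)

    # 5 PhyloP7 scores followed by 5 PhyloP100 scores (the order is an assumption).
    return seq_part + list(phylop7) + list(phylop100)


vector = encode_variant("ACGTA", "T", [0.1] * 5, [5.7] * 5)
print(len(vector))
```

The real feature order and allele encoding are described in the publication and its supplementary material.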

The window size and other parameters have been optimized by 10-fold cross-validation on ~35,000 SNVs. On this subset PhD-SNP^g reaches an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.93. To further assess the predictions of PhD-SNP^g, we extracted a set of ~1,400 newly annotated SNVs from a more recent version of Clinvar. On this testing set, PhD-SNP^g reaches an AUC of 0.92. In the table below we report the average performance of PhD-SNP^g in cross-validation on the training set (Clinvar012016) and on the testing set (NewClinvar032016). The performance scores (ACC, TPR, PPV, NPR, NPV, MCC and AUC) are defined in Wikipedia.
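
For readers who prefer formulas to a link, the threshold-dependent scores can be computed from the confusion-matrix counts as in the sketch below. The counts are invented for illustration and are not results of PhD-SNP^g. The true negative rate (TNR) is included on the assumption that it corresponds to the NPR listed above. AUC is not covered, because it requires the ranked prediction scores rather than four counts.

```python
import math


def binary_scores(tp, fp, tn, fn):
    """Compute standard scores from confusion-matrix counts.

    Assumes every row and column of the confusion matrix is non-empty,
    otherwise the ratios below would divide by zero.
    """
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / total,          # overall accuracy
        "TPR": tp / (tp + fn),             # true positive rate (sensitivity)
        "TNR": tn / (tn + fp),             # true negative rate (specificity)
        "PPV": tp / (tp + fp),             # positive predictive value
        "NPV": tn / (tn + fn),             # negative predictive value
        "MCC": (tp * tn - fp * fn) / denom,  # Matthews correlation coefficient
    }


# Invented counts, with Pathogenic as the positive class.
scores = binary_scores(tp=40, fp=20, tn=30, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```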

These performances are similar to or better than the scores obtained by CADD and FATHMM-MKL on the same dataset. A similar trend has been observed on the subsets of mutations in coding and non-coding regions. This surprising result shows that our approach, based on few input features, reaches similar or better accuracy than methods that rely on more complex input features.

PhD-SNP^g, which is available on GitHub, can be installed by running a python script that automatically downloads the programs and data from the UCSC repository. To run the program, the scikit-learn package needs to be installed. PhD-SNP^g can predict the effect of a single variant or of multiple SNVs listed in an input file. A Variant Call Format (VCF) file is also accepted as input. Our scripts accept genomic coordinates from both human genome assemblies, hg19 and hg38. When the input is provided, PhD-SNP^g internally runs the twoBitToFa program to extract the 5-nucleotide window sequence centered on the mutated position and bigWigToBedGraph to extract the PhyloP7 and PhyloP100 scores at the corresponding positions. When coordinates from the hg19 assembly are provided, the PhyloP7 conservation score is replaced with PhyloP46 calculated over primate species. All the extracted information is combined into the 35-element vector processed by the Gradient Boosting algorithm.
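
The extraction step can be sketched as follows. This is not the code used by the package: the function names are hypothetical, and the way the scripts actually call the two UCSC programs is an assumption. The sketch converts a 1-based genomic position into the 0-based, half-open interval used by the UCSC tools and builds the command lines without executing them. The -seq, -chrom, -start and -end options and the special output name stdout are taken from the usage messages of the UCSC tools as I understand them, so check them against your installed versions.

```python
def window_interval(position, flank=2):
    """Convert a 1-based position into a 0-based, half-open window [start, end)."""
    start = position - 1 - flank
    end = position + flank
    if start < 0:
        raise ValueError("window extends before the start of the chromosome")
    return start, end


def build_commands(chrom, position, two_bit, bigwig):
    """Return argument lists for twoBitToFa and bigWigToBedGraph (not executed)."""
    start, end = window_interval(position)
    fasta_cmd = ["twoBitToFa", two_bit, "stdout",
                 f"-seq={chrom}", f"-start={start}", f"-end={end}"]
    score_cmd = ["bigWigToBedGraph", bigwig, "stdout",
                 f"-chrom={chrom}", f"-start={start}", f"-end={end}"]
    return fasta_cmd, score_cmd


# Paths follow the ucsc/hg38 layout of the manual installation; adjust as needed.
fasta_cmd, score_cmd = build_commands(
    "chr1", 100, "ucsc/hg38/hg38.2bit", "ucsc/hg38/hg38.phyloP100way.bw")
print(" ".join(fasta_cmd))
print(" ".join(score_cmd))
```

The argument lists can be passed to subprocess.run once the tools and data files are in place.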

The main output of PhD-SNP^g is the probability that a SNV is pathogenic. If the probability is >0.5 the SNV is predicted to be Pathogenic, otherwise Benign. For each prediction PhD-SNP^g calculates the false discovery rate associated with Pathogenic and Benign SNVs. In addition, our script returns the PhyloP100 score of the mutated site and its average value over the 5-nucleotide window centered on the mutated position. When a VCF file is provided as input, the output values are appended after the last column of each row.
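
The decision rule and the window average can be illustrated with a short sketch. The function names, the column order and the number formatting are assumptions made for the example and do not reproduce the actual output format of the scripts. The false discovery rate is left out because its calculation is not described here.

```python
def classify(probability, threshold=0.5):
    """Pathogenic if the probability is strictly greater than the threshold."""
    return "Pathogenic" if probability > threshold else "Benign"


def window_mean(scores):
    """Average conservation score over the window."""
    return sum(scores) / len(scores)


def append_to_row(row, probability, phylop100_window):
    """Append the prediction fields after the last column of a tab-separated row."""
    center = phylop100_window[len(phylop100_window) // 2]
    extra = [
        classify(probability),
        f"{probability:.3f}",
        f"{center:.3f}",                        # PhyloP100 at the mutated site
        f"{window_mean(phylop100_window):.3f}",  # average over the 5-nt window
    ]
    return row.rstrip("\n") + "\t" + "\t".join(extra)


# Invented example row and scores.
row = "chr1\t100\t.\tA\tG"
print(append_to_row(row, 0.91, [1.0, 2.0, 6.0, 3.0, 0.5]))
```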

PhD-SNP^g, running on an Intel Xeon 2.40GHz machine, predicts the effect of 1,000 SNVs in less than 2 minutes.

Reference

For more details about PhD-SNP^g please refer to the publication below and its supplementary material.

Capriotti E, Fariselli P. (2017). PhD-SNP^g: A webserver and lightweight tool for scoring single nucleotide variants. Nucleic Acids Research. DOI:10.1093/nar/gkx369.

Standalone Package Installation

PhD-SNP^g is available for download on GitHub. After cloning the repository on your own machine, execute the following instructions to install it. The minimum requirements for the installation are wget, zcat, curl and scikit-learn. To provide the correct version of scikit-learn, the installer automatically runs a local installation.


      Run:
        git clone https://github.com/biofold/PhD-SNPg
        cd PhD-SNPg
        python setup.py install arch_type

      The available values of arch_type are listed below.
      For Linux 64bit architecture there are two compiled versions:
        - linux.x86_64
        - linux.x86_64.v287
        - macOSX.x86_64

      Installation time depends on the network speed.
      About 30G UCSC files need to be downloaded.

      For light installation:
          python setup.py install arch_type --web

      With the install --web option PhD-SNPg 
      runs without downloading the UCSC data.
      The functionality of the program depends 
      on the network speed.

PhD-SNPg can also run on the docker platform.


      For other operating systems, 2 docker images are available:
         PhD-SNPg Full version: It contains all the scripts and data (~30 Gb). 
         PhD-SNPg Light version: It runs only in web mode (--web option).

      Warning: Before installing the image make sure that your docker environment
      has more than 30 Gb disk space.

      To load the image execute the following command:
         docker load < phd-snpg-docker-[web or full].tar

      To run the image:
         docker run -v /home/your_home:/home/your_home -it phd-snpg:[web or full]

      To execute the command:
         docker run -v /home/your_home:/home/your_home  phd-snpg:[web or full] \
                /home/bass/PhD-SNPg/predict_variants.py /home/bass/PhD-SNPg/test/test_short_variants_hg19.tsv -g hg19 [--web]

      The option -v mounts your local home on the docker image directory.
      Warning: The command should be modified if your local home is named "bass".

To test the program, run the following command.


      Test:
        python setup.py test

      For web installation:
        python setup.py test --web

The manual installation of PhD-SNP^g can be performed following the steps reported below.


      1) Download PhD-SNPg script from github
        - git clone https://github.com/biofold/PhD-SNPg.git

      2) Required python library: scikit-learn-0.17
          It is already available in the tools directory.

        - Untar the scikit-learn-0.17.tar.gz archive and run
          python setup.py install --install-lib=../
          The package is also available at
          https://pypi.python.org/simple/scikit-learn/

      3) Required UCSC tools and data:
        - bigWigToBedGraph and twoBitToFa from
          http://hgdownload.cse.ucsc.edu/admin/exe
          in ucsc/exe directory

        - For hg19 based predictions:
          hg19.2bit: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
          hg19.phyloP46way.primate.bw: http://snps.biofold.org/PhD-SNPg/ucsc/hg19/hg19.phyloP46way.primate.bw
          hg19.100way.phyloP100way.bw: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP100way/hg19.100way.phyloP100way.bw
          in ucsc/hg19 directory
          Alternatively, the hg19 bundle is available at http://snps.biofold.org/PhD-SNPg/ucsc/hg19.tar.gz

        - For hg38 based predictions:
          hg38.2bit: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
          hg38.phyloP7way.bw: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP7way/hg38.phyloP7way.bw
          hg38.phyloP100way.bw: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP100way/hg38.phyloP100way.bw
          in ucsc/hg38 directory
          Alternatively, the hg38 bundle is available at http://snps.biofold.org/PhD-SNPg/ucsc/hg38.tar.gz
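
After a manual installation it can be useful to verify that the files listed above are in place. The sketch below is not part of the package: the function name missing_files is hypothetical, and it assumes the flat ucsc/exe, ucsc/hg19 and ucsc/hg38 layout described in the steps above. The automatic installer may organize the directories differently.

```python
from pathlib import Path

# File names taken from the manual installation steps; the layout is an assumption.
REQUIRED = {
    "exe": ["bigWigToBedGraph", "twoBitToFa"],
    "hg19": ["hg19.2bit", "hg19.phyloP46way.primate.bw",
             "hg19.100way.phyloP100way.bw"],
    "hg38": ["hg38.2bit", "hg38.phyloP7way.bw", "hg38.phyloP100way.bw"],
}


def missing_files(ucsc_dir, groups=("exe", "hg19", "hg38")):
    """Return the relative paths of required files not found under ucsc_dir."""
    root = Path(ucsc_dir)
    return [
        str(Path(group) / name)
        for group in groups
        for name in REQUIRED[group]
        if not (root / group / name).is_file()
    ]


# Check only what is needed for hg38-based predictions.
for path in missing_files("ucsc", groups=("exe", "hg38")):
    print("missing:", path)
```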