PhD-SNP^g

Predicting human Deleterious SNPs in human genome
A binary classifier for predicting pathogenic variants in coding and non-coding regions.

Methods

PhD-SNP^g is a binary classifier that implements a Gradient Boosting-based algorithm from the scikit-learn package (http://scikit-learn.org/). The new version of PhD-SNP^g has been trained and tested using a set of ~104,000 Pathogenic and Benign SNVs extracted from the Clinvar dataset. The new dataset is composed of an equal fraction of Pathogenic and Benign SNVs, distributed across all the human chromosomes. It is well known that damaging variants correspond on average to conserved regions of the genome. Thus, as a preliminary test for estimating the discriminating power of the conservation features, we analyzed the distribution of pre-calculated PhyloP scores from the UCSC repository. The analysis revealed that PhyloP100 is the most discriminative input feature. This observation is confirmed by plotting the distribution of the PhyloP100 score in the mutated position for Pathogenic and Benign SNVs, which show median values of 6.2 and 0.2 respectively. In the new version of PhD-SNP^g we take advantage of the new PhyloP470 score, which was derived from the alignment of 470 species.

Thus, the new version of PhD-SNP^g takes as input a 35-element vector (see figure) encoding a 5-nucleotide window around the mutated position.

25 elements of the input vector encode the sequence information and the mutation.
10 elements are conservation scores from the alignments of 100 (PhyloP100) and 470 (PhyloP470) species.
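
The layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text gives the sizes (25 + 10) but not the exact encoding, so the five-symbol alphabet (A, C, G, T, N), the -1/+1 marking of the reference and alternative bases at the central position, the ordering of the two score tracks, and the function name encode_snv are all hypothetical.

```python
NUCLEOTIDES = "ACGTN"  # assumed alphabet: four bases plus N for unknown


def encode_snv(window, alt, phylop100, phylop470):
    """Build a 35-element vector for a 5-nucleotide window (hypothetical layout)."""
    if len(window) != 5 or len(phylop100) != 5 or len(phylop470) != 5:
        raise ValueError("expected a 5-nucleotide window and 5 scores per track")
    features = []
    for i, base in enumerate(window.upper()):
        block = [0.0] * len(NUCLEOTIDES)
        if i == 2:
            # Central (mutated) position: assumed marking of reference as -1
            # and alternative as +1.
            block[NUCLEOTIDES.index(base)] = -1.0
            block[NUCLEOTIDES.index(alt.upper())] = 1.0
        else:
            block[NUCLEOTIDES.index(base)] = 1.0
        features.extend(block)  # 5 positions x 5 symbols = 25 elements
    # 5 + 5 conservation scores = 10 elements
    features.extend(float(s) for s in phylop100)
    features.extend(float(s) for s in phylop470)
    return features
```

Whatever the real encoding is, the useful property to check is the length: 5 positions times 5 symbols plus 2 score tracks times 5 positions gives 35 values.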

A similar approach was implemented for predicting pathogenic InDels. In particular, we assume that the effect of an InDel corresponds to the effect of the closest SNV that is obtained by deleting and/or inserting a set of nucleotides in a given region of the genome.
Using this assumption, we developed a second version of PhD-SNP^g for predicting the impact of InDels, which takes 38 values as input. In detail, the input is composed of the 35 values used for predicting the impact of SNVs and three new features encoding the size and location of the InDel. They represent the lengths of the reference and alternative alleles and a boolean variable indicating whether the mutated locus lies in a coding or noncoding region. In the figure below, we show the example of the deletion chr13:g.77000728 CAGGA>C which, at the closest locus, corresponds to the change of G (Guanine) to A (Adenine) in position 77,000,730 of chromosome 13. The second part of the figure shows a representation of the input features.
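
One possible reading of the "closest SNV" idea is sketched below: apply the InDel to the reference sequence and report the first position where the reference and the mutated sequence differ. This is an assumption about the procedure, not the authors' code. The function names, the order of the three extra features, and the flanking bases "AAT" after CAGGA are hypothetical and chosen only so that the example reproduces the G>A change at position 77,000,730 described above; the coding flag passed in the example is a placeholder.

```python
def closest_snv(ref_context, start, ref_allele, alt_allele):
    """Return (position, ref_base, alt_base) of the first mismatch after the InDel.

    ref_context is the reference sequence beginning at the 1-based coordinate
    `start`; it must begin with ref_allele.
    """
    if not ref_context.startswith(ref_allele):
        raise ValueError("ref_context must start with the reference allele")
    # Sequence obtained by replacing the reference allele with the alternative.
    mutated = alt_allele + ref_context[len(ref_allele):]
    for offset, (ref_base, alt_base) in enumerate(zip(ref_context, mutated)):
        if ref_base != alt_base:
            return start + offset, ref_base, alt_base
    return None  # no difference inside the supplied context


def indel_extra_features(ref_allele, alt_allele, is_coding):
    """Three extra values: allele lengths and a coding flag (assumed order)."""
    return [len(ref_allele), len(alt_allele), 1 if is_coding else 0]


# Hypothetical flanking bases "AAT"; is_coding=False is a placeholder value.
print(closest_snv("CAGGAAAT", 77000728, "CAGGA", "C"))  # (77000730, 'G', 'A')
print(indel_extra_features("CAGGA", "C", False))        # [5, 1, 0]
```

Appending the three extra values to the 35 SNV features of the reported locus would give the 38-value input.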

Initially, the performance of PhD-SNP^g was evaluated using a 10-fold cross-validation test on ~104,000 SNVs. On this subset PhD-SNP^g reaches an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.95. To further assess the prediction of PhD-SNP^g, we extracted a set of ~43,600 newly annotated SNVs from a more recent version of Clinvar. On this testing set, PhD-SNP^g reaches an AUC of 0.96. In the table below we report the average performance of PhD-SNP^g in cross-validation on training and testing sets. The performance scores (ACC, TPR, PPV, TNR, NPV, MCC and AUC) are defined in Wikipedia. Both datasets Clinvar122020-SNV and NewClinvar122022-SNV are available at this link.
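
For readers who want the threshold-based scores spelled out, the sketch below computes them from the four confusion-matrix counts using their standard definitions (the specificity is labeled TNR, true negative rate, here). It is written for Python 3, the function name binary_metrics is hypothetical, and the counts in the usage line are made up for illustration; they are not PhD-SNP^g results. AUC is omitted because it needs the full ranked list of probabilities rather than four counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification scores from confusion-matrix counts.

    Assumes both classes and both predicted labels occur at least once,
    otherwise some ratios are undefined.
    """
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / total,   # overall accuracy
        "TPR": tp / (tp + fn),      # sensitivity / recall
        "PPV": tp / (tp + fp),      # precision
        "TNR": tn / (tn + fp),      # specificity
        "NPV": tn / (tn + fn),      # negative predictive value
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Illustrative counts only.
scores = binary_metrics(tp=90, fp=10, tn=80, fn=20)
print(round(scores["ACC"], 2), round(scores["PPV"], 2))
```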

The performance of PhD-SNP^g in the prediction of pathogenic InDels was evaluated using a 10-fold cross-validation test on ~34,000 InDels. On this subset PhD-SNP^g reaches an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.95. To further assess the prediction of PhD-SNP^g, we extracted a set of ~9,000 annotated InDels from a more recent version of Clinvar. On this testing set, PhD-SNP^g reaches an AUC of 0.96. In the table below we report the average performance of PhD-SNP^g in cross-validation on training and testing sets. Both datasets Clinvar122020-InDel and NewClinvar122022-InDel are available at this link.

These performances are similar to or better than the scores obtained by CADD on the same dataset. A similar trend has been observed on the subsets of mutations in coding and non-coding regions. This surprising result shows that our approach, based on few input features, reaches similar or better accuracy than methods that rely on more complex input features.

PhD-SNP^g, which is available on GitHub, can be installed by running a python2 script that automatically downloads the programs and data from the UCSC repository. To run the program, the scikit-learn package needs to be installed. PhD-SNP^g can predict the effect of a single variant or of multiple SNVs from an input file. A Variant Call Format (VCF) file is also accepted as input. Our scripts accept as input genomic coordinates from both human genome assemblies hg19 and hg38. When the input is provided, PhD-SNP^g internally runs the twoBitToFa program to extract the 5-nucleotide window sequence centered on the mutated position and bigWigToBedGraph to extract the PhyloP100 and PhyloP470 scores in the corresponding positions. When coordinates from the hg19 assembly are provided, the server internally executes a liftOver and returns the prediction using the same PhyloP conservation scores. All the extracted information generates the 35-element vector processed by the Gradient Boosting algorithm.
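
As a rough illustration of the extraction step, the sketch below builds the command lines for the two UCSC tools for a 5-nucleotide window centered on a 1-based position. Both tools take 0-based, half-open coordinates, hence the conversion. The helper names, the output file names and the hg38 file paths are assumptions for the example; the exact options that PhD-SNP^g passes internally are not stated in this document.

```python
def window_bounds(position, flank=2):
    """Convert a 1-based position to 0-based half-open bounds of the window."""
    return position - 1 - flank, position + flank


def two_bit_to_fa_cmd(two_bit, chrom, position, out_fa):
    """Argument list for twoBitToFa restricted to the window (not executed here)."""
    start, end = window_bounds(position)
    return ["twoBitToFa", two_bit, out_fa,
            "-seq=%s" % chrom, "-start=%d" % start, "-end=%d" % end]


def big_wig_to_bedgraph_cmd(big_wig, chrom, position, out_bedgraph):
    """Argument list for bigWigToBedGraph restricted to the window."""
    start, end = window_bounds(position)
    return ["bigWigToBedGraph", big_wig, out_bedgraph,
            "-chrom=%s" % chrom, "-start=%d" % start, "-end=%d" % end]


# Hypothetical paths and output names, for illustration only.
print(" ".join(two_bit_to_fa_cmd("ucsc/hg38/hg38.2bit", "chr13", 77000730, "window.fa")))
print(" ".join(big_wig_to_bedgraph_cmd("ucsc/hg38/hg38.phyloP470way.bw", "chr13",
                                       77000730, "window.bedGraph")))
```

The argument lists can be passed to subprocess.run once the UCSC binaries are on the PATH.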

The main output of PhD-SNP^g is the probability that an SNV is pathogenic. If the probability is >0.5 the SNV is predicted to be Pathogenic, otherwise Benign. For each prediction PhD-SNP^g calculates the false discovery rate associated with Pathogenic and Benign SNVs. In addition, our script returns the PhyloP470 score of the mutated site and its average value over the 5-nucleotide window centered on the mutated position. When a VCF file is provided as input, the output values are appended after the last column of each row.
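
The decision rule and the VCF behaviour described above amount to very little code. The sketch below is a minimal illustration, not the tool's implementation: the function names are hypothetical, and the order and formatting of the appended columns are assumptions.

```python
def classify(probability, threshold=0.5):
    """Pathogenic if the probability is strictly above the threshold."""
    return "Pathogenic" if probability > threshold else "Benign"


def window_average(scores):
    """Average conservation score over the window around the mutated site."""
    return sum(scores) / len(scores)


def append_to_vcf_row(row, values):
    """Append output values after the last tab-separated column of a VCF row."""
    return row.rstrip("\n") + "\t" + "\t".join(str(v) for v in values)


# Made-up probability and VCF row, for illustration only.
row = "chr13\t77000730\t.\tG\tA\t.\t.\t."
print(append_to_vcf_row(row, [classify(0.73), 0.73]))
```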

PhD-SNP^g, running on an Intel Xeon 2.40GHz machine, predicts the effect of 1,000 SNVs in less than 2 minutes.

Reference

More information about the previous version of PhD-SNP^g is available at the following link. Please cite the publication below and its supplementary material.

Capriotti E, Fariselli P. (2017). PhD-SNP^g: A webserver and lightweight tool for scoring single nucleotide variants. Nucleic Acids Research. DOI:10.1093/nar/gkx369.

Standalone Package Installation

The PhD-SNP^g package is written in python2. It is available for download on GitHub. After cloning the scripts on your own machine, please execute the following instructions for the installation. Minimum requirements for the installation are wget, zcat, curl, scikit-learn. To install the correct version of scikit-learn, the installer automatically runs a local installation.


      Run:
        git clone https://github.com/biofold/PhD-SNPg
        cd PhD-SNPg
        python2 setup.py install arch_type

      Compiled versions are available for the following arch_type values:
        - linux.x86_64
        - linux.x86_64.v369
        - linux.x86_64.v385
        - macOSX.arm64
        - macOSX.x86_64

      Installation time depends on the network speed.
      About 30G UCSC files need to be downloaded.

      For light installation:
          python2 setup.py install arch_type --web

      With the install --web option PhD-SNPg
      runs without downloading the UCSC data.
      In this mode the running time of the program
      depends on the network speed.

PhD-SNPg can also run on the docker platform.


      For other operating systems, 2 docker images are available at the following
      link:
         The full version of PhD-SNPg contains all the scripts and data (~26 Gb). 
         The light version  of PhD-SNPg runs only in web mode (--web option).

      Warning: Before installing the image make sure that your docker environment
      has more than 30 Gb disk space.

      To load the image execute the following command:
         docker pull biofold/phd-snpg:[web or full]

      To run the image:
         docker run -v /home/your_home:/home/your_home -it biofold/phd-snpg:[web or full]

      To execute the command:
         docker run -v /home/your_home:/home/your_home biofold/phd-snpg:[web or full] \
                /home/bass/PhD-SNPg/predict_variants.py /home/bass/PhD-SNPg/test/test_short_variants_hg19.tsv -g hg19 [--web]

      The option -v mounts your local home on the docker image directory.
      Warning: The command should be modified if your local home is named "bass".

The installation time depends on the network speed. About 30G UCSC files need to be downloaded. For testing the program run the following command.


      Test:
        python2 setup.py test

      For web installation:
        python2 setup.py test --web

The manual installation of PhD-SNP^g can be performed following the steps reported below.


      1) Download PhD-SNPg script from github
        - git clone https://github.com/biofold/PhD-SNPg.git

      2) Required python2 libraries: scikit-learn-0.17
          They are already available in tools directory.

        - Untar the scikit-learn-0.17.tar.gz archive and run
          python2 setup.py install --install-lib=../
          (scikit-learn packages: https://pypi.python.org/simple/scikit-learn/)

      3) Required UCSC tools and data:
        - bigWigToBedGraph, twoBitToFa and liftOver from
          http://hgdownload.cse.ucsc.edu/admin/exe
          in ucsc/exe directory

        - For hg19 based predictions:
          hg19.2bit: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit
          hg19.phyloP46way.primate.bw: http://snps.biofold.org/PhD-SNPg/ucsc/hg19/hg19.phyloP46way.primate.bw
          hg19.100way.phyloP100way.bw: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phyloP100way/hg19.100way.phyloP100way.bw
          hg19ToHg38.over.chain.gz: https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz
          in ucsc/hg19 directory
          Alternatively, the hg19 bundle is available at http://snps.biofold.org/PhD-SNPg/ucsc/hg19.tar.gz

        - For hg38 based predictions:
          hg38.2bit: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
          hg38.phyloP7way.bw: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP7way/hg38.phyloP7way.bw
          hg38.phyloP100way.bw: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP100way/hg38.phyloP100way.bw
          hg38.phyloP470way.bw: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/phyloP470way/hg38.phyloP470way.bw                 
          in ucsc/hg38 directory
          Alternatively, the hg38 bundle is available at http://snps.biofold.org/PhD-SNPg/ucsc/hg38.tar.gz
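
After a manual installation it can be handy to check that the files listed above ended up in the expected directories. The sketch below is one way to do that; it is not part of the PhD-SNPg package. The function name missing_files is hypothetical, and the check assumes that the files keep the names shown in the download URLs and sit under ucsc/exe, ucsc/hg19 and ucsc/hg38 inside the cloned repository.

```python
import os

# File names taken from the download list above.
REQUIRED_TOOLS = ["bigWigToBedGraph", "twoBitToFa", "liftOver"]
REQUIRED_FILES = {
    "hg19": ["hg19.2bit", "hg19.phyloP46way.primate.bw",
             "hg19.100way.phyloP100way.bw", "hg19ToHg38.over.chain.gz"],
    "hg38": ["hg38.2bit", "hg38.phyloP7way.bw",
             "hg38.phyloP100way.bw", "hg38.phyloP470way.bw"],
}


def missing_files(root, assembly):
    """Return the expected tool and data paths that do not exist under root."""
    expected = [os.path.join(root, "ucsc", "exe", t) for t in REQUIRED_TOOLS]
    expected += [os.path.join(root, "ucsc", assembly, f)
                 for f in REQUIRED_FILES[assembly]]
    return [path for path in expected if not os.path.exists(path)]


if __name__ == "__main__":
    # Run from the directory that contains the PhD-SNPg clone.
    for assembly in ("hg19", "hg38"):
        for path in missing_files("PhD-SNPg", assembly):
            print("missing:", path)
```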