PhD-SNPg

Predicting human Deleterious SNPs in the human genome
A binary classifier for predicting pathogenic variants in coding and non-coding regions.




Server Input


The PhD-SNPg server takes as input a list of variants in one of three formats. Two formats (CSV and VCF) require the genomic location of each variant and the nucleotide change. The MUT format, which requires a list of amino acid changes, allows the residue substitutions to be mapped to the corresponding nucleotide variants. The three input formats are summarized as follows:

  • CSV: The simplest input format uses comma-separated values that indicate the chromosome (chr), the position (pos), and the reference (ref) and alternative (alt) alleles, as follows: chr,pos,ref,alt (see example 1).

  • VCF: The variants can also be provided in a VCF-like format that requires at least 5 columns (chr,pos,id,ref,alt) separated by spaces. When the id is not available, it can be replaced by a dot character (see example 2).

  • MUT: The effect of single amino acid variants can be predicted by providing as input 2 columns containing the gene symbol and the mutation (gene,mutation), separated by a comma (see example 3).
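The column layouts of the three formats can be illustrated with a minimal Python sketch. This is not part of PhD-SNPg: the names GenomicVariant, parse_csv_line, parse_vcf_line and parse_mut_line are hypothetical helpers introduced here only for illustration. The mutation string is kept as opaque text because its exact notation is not specified in this section.

```python
from typing import NamedTuple, Optional, Tuple


class GenomicVariant(NamedTuple):
    """A variant given by genomic location (CSV and VCF-like formats)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    var_id: Optional[str] = None


def parse_csv_line(line: str) -> GenomicVariant:
    """Parse one 'chr,pos,ref,alt' row."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")
    chrom, pos, ref, alt = fields
    return GenomicVariant(chrom, int(pos), ref, alt)


def parse_vcf_line(line: str) -> GenomicVariant:
    """Parse one 'chr pos id ref alt' row; columns after the fifth are ignored."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    chrom, pos, var_id, ref, alt = fields[:5]
    # A dot in the id column means that no identifier is available.
    return GenomicVariant(chrom, int(pos), ref, alt,
                          None if var_id == "." else var_id)


def parse_mut_line(line: str) -> Tuple[str, str]:
    """Parse one 'gene,mutation' row; the mutation is returned unchanged."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 2:
        raise ValueError(f"expected 2 comma-separated fields, got {len(fields)}")
    return fields[0], fields[1]


if __name__ == "__main__":
    # The same variant written in the CSV and in the VCF-like format.
    print(parse_csv_line("1,10042376,C,G"))
    print(parse_vcf_line("1 10042376 . C G"))
```

Both calls describe the same variant; the only difference is the id column, which the VCF-like parser maps to None when it holds a dot.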

For all input formats, each variant is provided on a separate row of the textarea box. In addition, all the input variants can be uploaded as a single file in text or zipped format. For formatting reasons, it is recommended to avoid copying and pasting VCF input data, which can easily be uploaded as a zip file. Before submitting the job, please select the assembly of the human genome on which the genomic locations are expressed.

To prevent the submission of large jobs, a maximum of 1,000 variants per job is allowed. To predict the impact of a larger number of variants, please install PhD-SNPg on your local machine. Information on the local installation of PhD-SNPg is reported in the method page.
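Before submitting, the size of an input list can be checked against this limit with a short Python sketch. The function names are hypothetical and not part of PhD-SNPg. Skipping blank lines and lines starting with "#" is an assumption about how a typical input file is laid out, not a documented behaviour of the server.

```python
MAX_VARIANTS_PER_JOB = 1000  # limit stated on this page


def count_variants(text: str) -> int:
    """Count the rows that look like variants.

    Assumption: blank lines and lines starting with '#' are not variants.
    """
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def fits_server_limit(text: str, limit: int = MAX_VARIANTS_PER_JOB) -> bool:
    """Return True when the list is small enough for a single server job."""
    return count_variants(text) <= limit


if __name__ == "__main__":
    variants = "1,10042376,C,G\n2,31751295,G,A\n"
    if fits_server_limit(variants):
        print("The list can be submitted to the web server.")
    else:
        print("The list is too large; a local installation is needed.")
```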



Server Output


After the job is submitted, the PhD-SNPg server returns a link to a web page that is automatically refreshed every 20 seconds. A CGI script checks the status of the job in the queue system and, when the job has terminated, returns a static HTML page with the predictions. If the page is closed, the output of your job can be retrieved using the JobID and the form in the Job web page.
Regardless of the input format, PhD-SNPg displays the same information in the HTML output. For each variant, the prediction row includes the following data: the chromosome, the position, the reference/alternative alleles, the prediction, the score, the false discovery rate, the PhyloP100 score at the mutated site and the average PhyloP100 score over a 5-nucleotide window centred on the mutated site. An example of the output table is reported below.
In each row, a green button opens a window where more information about the variant is reported. In particular, if the variant is in a coding region, the code of the largest NCBI transcript, the UniProt ID of the gene, the strand and the effect of the nucleotide change are included. The annotation process is performed using the transvar package. A text version of the PhD-SNPg output can be downloaded through a web link. The output file is a tab-separated VCF-like file that includes the PhD-SNPg predictions and the annotation from the transvar output.
The standard output of the standalone PhD-SNPg program, which does not include the transvar annotation, is described in the next section.



Standalone Package Output


The output of the standalone package does not include the transvar annotation. An example of the PhD-SNPg output is provided below.


        PhD-SNPg returns the following output:

        PREDICTION: Pathogenic or Benign.
        SCORE: a probabilistic score between 0 and 1. If the score is >0.5 the variant is predicted to be Pathogenic.
        FDR: the false discovery rate associated with a higher/lower SCORE.
        PhyloP100: PhyloP100 score at the mutated position.
        AvgPhyloP100: average PhyloP100 score over a 5-nucleotide window around the mutated position.

        The scores are added as extra columns to the input file. An example of the output is reported below.

        #CHROM	POS	REF	ALT	CODING	PREDICTION	SCORE	FDR	PhyloP100	AvgPhyloP100
        1	10042376	C	G	Yes	Pathogenic	0.814	0.079	-0.159	3.412
        1	197094291	C	T	Yes	Pathogenic	0.988	0.023	7.304	4.071
        2	31751295	G	A	Yes	Pathogenic	0.913	0.053	1.810	2.674
        2	71797809	C	T	Yes	Pathogenic	0.998	0.023	1.181	3.699
        2	179577870	T	C	Yes	Benign	0.004	0.007	-6.363	2.997
        5	74046464	C	T	Yes	Benign	0.009	0.021	-0.070	5.860
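
An output table such as the one above can be read with a few lines of Python. The sketch below is only an illustration under stated assumptions: read_predictions and label_from_score are hypothetical names, not functions of the PhD-SNPg package. Columns are split on any whitespace, so tab-separated and space-aligned rows are both accepted, and stripping the leading "#" from the header is a convenience choice made here. The label_from_score function restates the rule given above (score >0.5 means Pathogenic).

```python
import io
from typing import Dict, List, TextIO


def read_predictions(handle: TextIO) -> List[Dict[str, str]]:
    """Read a PhD-SNPg-style table into a list of column-name -> value dicts."""
    header = None
    rows = []
    for line in handle:
        fields = line.split()  # splits on tabs and on runs of spaces
        if not fields:
            continue
        if header is None:
            # Skip everything before the '#CHROM ...' header line.
            if fields[0].startswith("#CHROM"):
                header = [name.lstrip("#") for name in fields]
            continue
        rows.append(dict(zip(header, fields)))
    return rows


def label_from_score(score: float) -> str:
    """Apply the threshold described above: score > 0.5 means Pathogenic."""
    return "Pathogenic" if score > 0.5 else "Benign"


if __name__ == "__main__":
    # Two rows copied from the example table.
    sample = (
        "#CHROM\tPOS\tREF\tALT\tCODING\tPREDICTION\tSCORE\tFDR\tPhyloP100\tAvgPhyloP100\n"
        "1\t10042376\tC\tG\tYes\tPathogenic\t0.814\t0.079\t-0.159\t3.412\n"
        "2\t179577870\tT\tC\tYes\tBenign\t0.004\t0.007\t-6.363\t2.997\n"
    )
    for row in read_predictions(io.StringIO(sample)):
        consistent = label_from_score(float(row["SCORE"])) == row["PREDICTION"]
        print(row["CHROM"], row["POS"], row["PREDICTION"], row["SCORE"], consistent)
```

All values are kept as strings; numeric columns such as SCORE, FDR, PhyloP100 and AvgPhyloP100 are converted with float() only where they are needed.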