PhD-SNP
Predictor of human Deleterious Single Nucleotide Polymorphisms

Last Update 18/01/10






PhD-SNP is an SVM-based classifier [1]. In the new version we developed a predictor based on a single SVM trained and tested on protein sequence and profile information (see figure below).
The PhD-SNP SVM input is built in the following steps:

  • for a given mutation, the substitution from the wild-type residue to the mutant residue is encoded in a 20-element vector that has -1 in the position of the wild-type residue, 1 in the position of the mutant residue and 0 in the remaining 18 positions.
  • a second 20-element vector encoding the sequence environment is built by reporting the occurrence of each residue type in a window of 19 residues centered on the mutated residue.
  • for a given protein, the sequence profile is built according to the procedure described for the SVM-Profile method. From the profile we evaluate the frequencies of the wild-type (Fi(WT)) and mutant (Fi(MUT)) residues at position i, the number of aligned sequences at that position (NAL) and the Conservation Index (CI).
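The first two steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the server's code: the function name encode_mutation, the alphabetical ordering of the 20 residue types, counting the central residue as part of the window, and truncating the window at the sequence ends are all choices made for this example.

```python
# Assumed ordering of the 20 residue types (the page does not specify one).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_mutation(sequence, position, new_residue, window=19):
    """Return 40 values: 20 mutation values followed by 20 window counts.

    position is 1-based, as in the web form.
    """
    wild_type = sequence[position - 1]
    if wild_type == new_residue:
        raise ValueError("new residue equals the wild-type residue")

    # First 20 values: -1 at the wild-type residue, 1 at the new residue.
    mutation = [0] * 20
    mutation[AMINO_ACIDS.index(wild_type)] = -1
    mutation[AMINO_ACIDS.index(new_residue)] = 1

    # Last 20 values: residue counts in the window centered on the position.
    # Assumption: the window is simply truncated near the sequence ends.
    half = window // 2
    start = max(0, position - 1 - half)
    end = min(len(sequence), position + half)
    environment = [0] * 20
    for residue in sequence[start:end]:
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # skip non-standard symbols such as X
            environment[index] += 1

    return mutation + environment


vector = encode_mutation("ACDEFGHIKLMNPQRSTVWY", 10, "A")
print(len(vector))
```

Calling the function with a 1-based position mirrors the Position field of the web form; the profile-based features of the third step are not part of this vector.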

The SVM-based method using sequence information (SVM-Sequence)
The first SVM classifies mutations into disease-related (desired output set to 0) and neutral polymorphism (desired output set to 1). The decision threshold is set equal to 0.5. The input vector consists of 40 values: the first 20 (the 20 residue types) explicitly define the mutation by setting to -1 the element corresponding to the wild-type residue and to 1 the element corresponding to the newly introduced residue (all the remaining elements are kept equal to 0). The last 20 input values encode the sequence environment of the mutation (again the 20 elements represent the 20 residue types). Each of these values is the number of residues of the corresponding type found inside a window that is centered on the residue that undergoes the mutation and spans the sequence symmetrically to the left (N-terminus) and to the right (C-terminus), for a total length of 19 residues [2,3]. For the SVM implementation we use LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/) with an RBF kernel function K(xi,xj) = exp(-G ||xi - xj||^2).
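As a small illustration of the kernel formula above, the RBF kernel can be written directly in Python. This is a plain re-statement of the formula, not a call into LIBSVM; the function name rbf_kernel is hypothetical, and the value of G (called gamma here) is not given on this page, so the value used below is arbitrary.

```python
import math


def rbf_kernel(xi, xj, gamma):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2) for two equal-length vectors."""
    if len(xi) != len(xj):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum kernel value of 1.0.
print(rbf_kernel([1, 0, -1], [1, 0, -1], gamma=0.5))
```

The kernel value decreases toward 0 as the two input vectors move apart, and larger values of G make that decrease faster.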

The SVM-based method using profile information (SVM-Profile)
The second SVM method (SVM-Profile) classifies mutations into disease-related and neutral polymorphism, taking as input only a vector of 2 elements derived from the sequence profile. The profile is computed from the output of the BLAST program [4], run on the uniref90 database (E-value threshold=10^-9, number of runs=1). The first input element is the ratio between the frequency of the wild-type residue and that of the mutant residue at the mutated position, and the second element is the number of sequences aligned at the mutated position. The software and the kernel used for this SVM implementation are as described above.

The SVM-based method using sequence and profile information (PhD-SNP2.0)
The latest version of PhD-SNP uses the same input described for the SVM-Sequence method plus 4 profile-based features. The sequence profile is calculated according to the procedure used for the SVM-Profile method, but in this case the profile-based part of the input vector is composed of the frequencies of the wild-type and mutant residues, the number of aligned sequences and the conservation index at the mutated position.
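The profile-derived quantities can be illustrated with a short Python sketch that works on a single alignment column. Everything here is an assumption made for the example: the function name profile_features, the representation of the column as a string with '-' for gaps, counting only non-gap residues as aligned sequences, and returning no ratio when the mutant residue never occurs. The conservation index is left out because its formula is not given on this page.

```python
def profile_features(column, wild_type, new_residue):
    """Return (F(WT), F(MUT), NAL, ratio) for one alignment column.

    column holds the residue of each aligned sequence at the mutated
    position, with '-' marking a gap (assumed representation).
    """
    residues = [c for c in column if c != "-"]
    nal = len(residues)  # number of sequences aligned at this position
    if nal == 0:
        raise ValueError("no sequence is aligned at this position")

    f_wt = residues.count(wild_type) / nal
    f_mut = residues.count(new_residue) / nal

    # Ratio used by SVM-Profile; undefined when the mutant is never observed.
    # How the server treats that case is not stated, so None is returned here.
    ratio = f_wt / f_mut if f_mut > 0 else None
    return f_wt, f_mut, nal, ratio


print(profile_features("AAAG-AV", "A", "G"))
```

In this hypothetical column six sequences are aligned, four carry the wild-type residue and one carries the mutant residue.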

Results
The list of the predictions of the PhD-SNP method is available in the OutPhD-SNP08.txt file. The table below lists some scoring indexes of the performance of the three methods.


Method          Q2      P(D)    Q(D)    P(N)    Q(N)    C
PhD-SNP         0.76    0.76    0.72    0.76    0.80    0.52
SVM-Sequence    0.68    0.68    0.64    0.69    0.72    0.36
SVM-Profile     0.74    0.71    0.41    0.75    0.91    0.39


The overall accuracy Q2 is:

Q2=p/N


where p is the total number of correctly predicted mutations and N is the total number of mutations.
The correlation coefficient C is defined as:

C(s) = [p(s)n(s) - u(s)o(s)] / D


where D is the normalization factor

D = [(p(s)+u(s))(p(s)+o(s))(n(s)+u(s))(n(s)+o(s))]^(1/2)


for each class s (D and N, for disease-related and neutral polymorphism, respectively); p(s) and n(s) are the total numbers of correct predictions and correctly rejected assignments, respectively, and u(s) and o(s) are the numbers of under- and over-predictions.

The coverage for each discriminated class s is evaluated as:

Q(s)=p(s)/[ p(s)+u(s)]


where p(s) and u(s) are as defined above. The probability of correct predictions P(s) (or accuracy for s) is computed as:

P(s)=p(s) / [p(s) + o(s)]


where p(s) and o(s) are as defined above. P(s) ranges from 0 to 1.
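The scoring indexes defined above can be computed from the four counts of a class, as in the following Python sketch. The function names and the counts in the example are hypothetical and are not taken from the PhD-SNP results; cases where a denominator is zero are not handled.

```python
import math


def class_scores(p, n, u, o):
    """Return Q(s), P(s) and C(s) for one class s.

    p: correct predictions, n: correctly rejected assignments,
    u: under-predictions, o: over-predictions.
    """
    d = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return {
        "Q": p / (p + u),            # coverage
        "P": p / (p + o),            # probability of correct prediction
        "C": (p * n - u * o) / d,    # correlation coefficient
    }


def overall_accuracy(correct, total):
    """Q2: fraction of all mutations that are predicted correctly."""
    return correct / total


# Hypothetical counts for one class, for illustration only.
scores = class_scores(p=40, n=45, u=10, o=5)
print(scores, overall_accuracy(40 + 45, 100))
```

For a two-class problem, the counts of the second class follow from the first by swapping p with n and u with o, which leaves C unchanged.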

Required Inputs
PhD-SNP is optimized to predict whether a given single point protein mutation can be classified as disease-related or as neutral polymorphism. The required inputs are:

  • Protein Sequence: the protein sequence can be provided in raw format, by giving its Swiss-Prot code or by uploading a text file containing the protein sequence;
  • Position: the position number in the sequence of the residue that undergoes mutation;
  • New Residue: to ask for a specific mutation, insert the symbol of the mutant residue;
  • Prediction: choose between Sequence-Based and Sequence and Profile-Based prediction;
  • Multi SVM: choose whether the prediction is performed using the 20 different SVM models from the cross-validation procedure or a single SVM model (fast option).
The results can be sent to your e-mail address, if you ask for it, or obtained interactively if you do not paste your e-mail address in the proper box.

Outputs
The output consists of a table listing the number of the mutated position in the protein sequence, the wild-type residue, the new residue and whether the related mutation is predicted as disease-related (Disease) or as neutral polymorphism (Neutral).
The RI value (Reliability Index) is evaluated from the output of the support vector machine O as

RI=20*abs(O-0.5).
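As an illustration, the mapping from the SVM output O to the reported label and RI can be sketched in Python. The function name is hypothetical. The labels follow the desired outputs stated for the classifier (0 for disease-related, 1 for neutral polymorphism) with the 0.5 decision threshold; assigning an output of exactly 0.5 to Disease and leaving RI unrounded are assumptions of this example.

```python
def label_and_reliability(output, threshold=0.5):
    """Map an SVM output O to (label, RI) with RI = 20 * abs(O - 0.5)."""
    # Outputs above the threshold are closer to 1 (neutral polymorphism).
    label = "Neutral" if output > threshold else "Disease"
    reliability = 20 * abs(output - 0.5)
    return label, reliability


for o in (0.1, 0.5, 0.9):
    print(o, label_and_reliability(o))
```

For outputs between 0 and 1 the RI lies between 0 (output at the threshold) and 10 (output at either extreme).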

The old help web page, where the datasets used in [1] are reported, is reachable at the following link.

 

[1] Capriotti, E., Calabrese, R. and Casadio, R. (2006) Predicting the insurgence of human genetic diseases associated to single point protein mutations with support vector machines and evolutionary information. Bioinformatics, 22, 2729-2734.
[2] Capriotti, E., Fariselli, P., Calabrese, R. and Casadio, R. (2005) Predicting protein stability changes from sequences using support vector machines. Bioinformatics, 21 (Suppl 2), ii54-ii58.
[3] Capriotti, E., Fariselli, P. and Casadio, R. (2005) I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res., 33 (Web server issue), W306-W310.
[4] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.