PhD-SNP:
Predictor of human Deleterious Single Nucleotide Polymorphisms
PhD-SNP is based on a decision tree that couples an SVM-based
classifier trained on sequence information (SVM-Sequence) with
a second SVM trained on sequence profile information
(SVM-Profile) (see figure below).
PhD-SNP comprises the following steps:
The SVM-based method using sequence information (SVM-Sequence)
The first SVM classifies mutations as disease-related (desired
output set to 0) or as neutral polymorphisms (desired output
set to 1). The decision threshold is 0.5. The input vector
consists of 40 values. The first 20 (one per residue type)
explicitly define the mutation: the element corresponding to
the wild-type residue is set to -1, the element corresponding
to the newly introduced residue is set to 1, and all remaining
elements are kept at 0. The last 20 input values encode the
sequence environment of the mutation (again, one element per
residue type). Each of these elements holds the number of
residues of the encoded type found inside a window centered at
the residue that undergoes the mutation and spanning the
sequence symmetrically to the left (N-terminus) and to the
right (C-terminus), with a length of 19 residues [1,2]. The SVM
is implemented with LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/)
using an RBF kernel function
$K(x_i, x_j) = \exp(-G \lVert x_i - x_j \rVert^2)$.
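The encoding described above can be sketched as follows. This is a minimal illustration rather than the PhD-SNP source code: the residue ordering in `AMINO_ACIDS`, the function names, the inclusion of the central residue in the window counts, and the truncation of the window at the sequence termini are all assumptions not stated in the text.

```python
import math

# Assumed ordering of the 20 residue types; the text does not specify one.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 19  # window length given in the text


def encode_mutation(sequence, position, new_residue, window=WINDOW):
    """Build a 40-value input vector for one mutation.

    `position` is 1-based, as on the input form.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    wild_type = sequence[position - 1]

    # First 20 values: -1 for the wild-type residue, +1 for the new one.
    mutation = [0] * 20
    mutation[index[wild_type]] = -1
    mutation[index[new_residue]] = 1

    # Last 20 values: residue-type counts inside the window.
    # Assumption: the window includes the mutated residue itself and is
    # simply truncated where it runs past either terminus.
    half = window // 2
    start = max(0, position - 1 - half)
    end = min(len(sequence), position + half)
    environment = [0] * 20
    for aa in sequence[start:end]:
        if aa in index:  # non-standard symbols are skipped (assumption)
            environment[index[aa]] += 1

    return mutation + environment


def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-G * ||x - y||^2), with G passed as `gamma`."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

For instance, `encode_mutation(seq, 10, "A")` returns the vector for replacing residue 10 of `seq` with alanine; two such vectors can then be compared with `rbf_kernel`.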
The SVM-based method using profile information (SVM-Profile)
The second SVM (SVM-Profile) classifies mutations as
disease-related or as neutral polymorphisms, taking as input
only a vector of 2 elements derived from the sequence profile.
The profile is computed from the output of the BLAST program
[3], run on the nr95 database (E-value threshold = $10^{-9}$,
number of runs = 1); nr95 is obtained with the cd-hit program,
available at http://bioinformatics.org/cd-hit/ [4]. The first
input element is the ratio between the frequency of the
wild-type residue and that of the mutated residue at the
mutated sequence position. The second element is the number of
aligned sequences at the position of the mutation at hand. The
software and the kernel used for this SVM are as described
above.
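As a rough illustration of the two profile-derived inputs, the sketch below takes one alignment column (the residues aligned at the mutated position, one per hit, with `-` for gaps) and returns the frequency ratio and the number of aligned sequences. The function name, the column representation, and the `pseudocount` used to avoid division by zero are assumptions; the text does not say how a mutated residue with zero frequency is handled.

```python
def profile_features(column, wild_type, new_residue, pseudocount=1e-3):
    """Return (frequency ratio, number of aligned sequences) for one site.

    `column` holds the residues aligned at the mutated position,
    one character per aligned sequence, with '-' marking a gap.
    """
    residues = [aa for aa in column if aa != "-"]
    n_aligned = len(residues)

    # Frequencies of the wild-type and of the mutated residue in the column.
    f_wild = residues.count(wild_type) / n_aligned if n_aligned else 0.0
    f_new = residues.count(new_residue) / n_aligned if n_aligned else 0.0

    # The pseudocount is an assumption, added so the ratio stays finite
    # when the mutated residue never occurs in the column.
    ratio = (f_wild + pseudocount) / (f_new + pseudocount)
    return ratio, n_aligned


ratio, n_aligned = profile_features("LLLLAALL-L", "L", "A")
```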
Results
The table below lists performance indices for the three
methods.
| Method       | Q2   | P(D) | Q(D) | P(N) | Q(N) | C    |
|--------------|------|------|------|------|------|------|
| PhD-SNP      | 0.74 | 0.80 | 0.76 | 0.65 | 0.70 | 0.46 |
| SVM-Sequence | 0.70 | 0.71 | 0.84 | 0.65 | 0.46 | 0.34 |
| SVM-Profile  | 0.70 | 0.74 | 0.49 | 0.68 | 0.46 | 0.39 |
The overall accuracy Q2 is:
$Q2 = p / N$
where p is the total number of correctly predicted mutations
and N is the total number of mutations.
The correlation coefficient C is defined as:
$C(s) = [p(s)\,n(s) - u(s)\,o(s)] / D$
where D is the normalization factor
$D = [(p(s)+u(s))(p(s)+o(s))(n(s)+u(s))(n(s)+o(s))]^{1/2}$
for each class s (D and N, for disease-related and neutral
polymorphism, respectively); p(s) and n(s) are the total number
of correct predictions and correctly rejected assignments,
respectively, and u(s) and o(s) are the numbers of under- and
over-predictions.
The coverage for each discriminated class s is evaluated as:
$Q(s) = p(s) / [p(s) + u(s)]$
where p(s) and u(s) are as defined above. The probability
of correct predictions P(s) (or accuracy for s) is computed
as:
$P(s) = p(s) / [p(s) + o(s)]$
where p(s) and o(s) are as previously defined; P(s) ranges
from 0 to 1.
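The indices defined above can be computed directly from the four counts of each class. The sketch below is only an illustration: the function names and the counts in the usage lines are hypothetical and are not the data behind the table. It does not guard against empty denominators.

```python
import math


def class_scores(p, n, u, o):
    """Return (Q(s), P(s), C(s)) for one class s.

    p: correct predictions, n: correctly rejected assignments,
    u: under-predictions, o: over-predictions.
    """
    coverage = p / (p + u)   # Q(s)
    accuracy = p / (p + o)   # P(s)
    d = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    correlation = (p * n - u * o) / d  # C(s)
    return coverage, accuracy, correlation


def overall_accuracy(correct, total):
    """Q2: fraction of all cases that are predicted correctly."""
    return correct / total


# Hypothetical counts for 200 mutations (100 disease, 100 neutral).
q_d, p_d, c_d = class_scores(p=80, n=70, u=20, o=30)  # disease class
q_n, p_n, c_n = class_scores(p=70, n=80, u=30, o=20)  # neutral class
q2 = overall_accuracy(80 + 70, 200)
```

With two classes, the under-predictions of one class are the over-predictions of the other, so C(D) and C(N) coincide, which is consistent with the table reporting a single C per method.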
Required Inputs
PhD-SNP is optimized to predict whether a given single point
protein mutation can be classified as disease-related or as a
neutral polymorphism. The required inputs are:
- Protein Sequence: the protein sequence can be provided in raw
  format, by giving its Swiss-Prot code, or by uploading a text
  file containing the protein sequence;
- Position: the position number in the sequence of the residue
  that undergoes mutation;
- New Residue: if you want a prediction for a specific
  mutation, insert the symbol of the mutated residue;
- Prediction: choose between Sequence-Based or Sequence and
  Profile-Based prediction.
The results can be sent to your e-mail address on request, or
obtained interactively if you do not enter your e-mail address
in the box provided.
Outputs
The output consists of a table listing the number of the
mutated position in the protein sequence, the wild-type
residue, the new residue, and whether the related mutation is
predicted as disease-related (Disease) or as a neutral
polymorphism (Neutral).
The RI value (Reliability Index) is evaluated
from the output of the support vector machine O as
RI=20*abs(O-0.5).
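A small sketch of how an SVM output O could be turned into the reported label and reliability index. The function name is hypothetical. Assigning outputs above the 0.5 threshold to Neutral follows from the desired outputs given for SVM-Sequence (0 for disease, 1 for neutral); assigning an output of exactly 0.5 to Disease is an arbitrary assumption, as is leaving RI unrounded.

```python
def interpret_output(o, threshold=0.5):
    """Map an SVM output O to (label, RI)."""
    # Outputs near 1 correspond to neutral polymorphisms, near 0 to disease.
    # Assumption: an output exactly at the threshold is labelled Disease.
    label = "Neutral" if o > threshold else "Disease"
    # RI = 20 * abs(O - 0.5): 0 at the threshold, 10 for O equal to 0 or 1.
    ri = 20 * abs(o - 0.5)
    return label, ri


label, ri = interpret_output(0.25)
```

An output of 0.25 lies halfway between the threshold and the disease target of 0, so it yields the label Disease with RI = 5.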
[1] Capriotti, E., Fariselli, P., Calabrese, R. and Casadio,
R. (2005) Predicting protein stability changes from sequences
using support vector machines. Bioinformatics, 21 (Suppl 2),
ii54-ii58.
[2] Capriotti, E., Fariselli, P. and Casadio, R. (2005)
I-Mutant2.0: predicting stability changes upon mutation
from the protein sequence or structure. Nucleic Acids
Res., 33 (Web server issue), W306-W310.
[3] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang,
J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped
BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucleic Acids Res., 25, 3389-3402.
[4] Li, W., Jaroszewski, L. and Godzik, A. (2001) Clustering
of highly homologous sequences to reduce the size of large
protein databases. Bioinformatics, 17, 282-283.