SNPs&GO - Predicting disease associated SNPs using GO terms.

SNPs&GO
Predicting disease associated variations using GO terms

Introduction

The genetic basis of human variability is mainly due to Single Nucleotide Polymorphisms (SNPs). The most investigated SNPs are missense mutations resulting in residue substitutions in the protein. Here we propose SNPs&GO, an accurate method based on support vector machines that discriminates between disease-related and neutral variations in a protein sequence. SNPs&GO collects in a single framework information derived from protein sequence, protein sequence profile, and protein function.
The enormous number of human SNPs available in the databases raises the question of how to relate protein variations to diseases. We propose a new server that uses different pieces of information, including that derived from the Gene Ontology annotation, to predict whether a given variation can be classified as disease-related or neutral.

Methods

For the first time we present a GO-integrated predictor trained and tested with a stringent cross-validation procedure. SNPs&GO is a classifier [1] based on a single SVM that takes as input protein sequence, profile and functional information (see figure below).
The SNPs&GO input is built through the following steps:

For a given mutation, the substitution from the wild-type residue to the mutant is encoded in a 20-element vector that has -1 in the position of the wild-type residue, 1 in the position of the mutant residue and 0 in the remaining 18 positions.
A second 20-element vector encoding the sequence environment is built by reporting the occurrence of each residue type in a window of 19 residues around the mutated residue (Seq).
For a given protein, its sequence profile features (Prof) are extracted from a BLAST search output [2]. From the output we evaluate both the frequency of the wild-type (Fi(WT)) and mutated (Fi(MUT)) residues at position i. NAL is the number of sequences in the alignment at a given position and CI is the Conservation Index.
The functional input is composed of the sum of the log-odds scores of all the GO terms associated with the protein under consideration and their parents, and the number of all the GO terms associated with the protein undergoing variation (LGO).
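
The first three steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the server's code: the residue ordering, the 0-based position, the truncation of the window at the sequence ends, the inclusion of the central residue in the window, and the exclusion of gaps from NAL are all assumptions made here. The Conservation Index and the LGO score are left out because their formulas are not given on this page.

```python
# Assumed alphabetical ordering of the 20 residue types (one-letter code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def mutation_vector(wild_type, mutant):
    """20-element vector: -1 at the wild-type residue, 1 at the mutant, 0 elsewhere."""
    if wild_type == mutant:
        raise ValueError("wild-type and mutant residues must differ")
    vec = [0] * 20
    vec[INDEX[wild_type]] = -1
    vec[INDEX[mutant]] = 1
    return vec


def environment_vector(sequence, position, window=19):
    """20-element vector counting each residue type in a window centred on
    `position` (0-based). The window is truncated at the sequence ends and
    includes the central residue (both are assumptions of this sketch)."""
    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    vec = [0] * 20
    for aa in sequence[start:end]:
        if aa in INDEX:  # skip non-standard symbols
            vec[INDEX[aa]] += 1
    return vec


def profile_features(column, wild_type, mutant):
    """Return (Fi(WT), Fi(MUT), NAL) for one alignment column.
    `column` holds the residues aligned at position i; '-' marks a gap and
    is not counted in NAL (an assumption of this sketch)."""
    residues = [aa for aa in column if aa != "-"]
    nal = len(residues)
    if nal == 0:
        return 0.0, 0.0, 0
    f_wt = residues.count(wild_type) / nal
    f_mut = residues.count(mutant) / nal
    return f_wt, f_mut, nal
```

For example, mutation_vector("A", "V") contains exactly one -1, one 1 and 18 zeros, and environment_vector applied to the middle of a long sequence sums to 19.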

The server also implements the SEQ-PROF and SVM-GOS methods, which take as input different subsets of the SNPs&GO input features. The SEQ-PROF method takes as input the first 45-element vector encoding the sequence and profile information, and SVM-GOS a two-element vector encoding the functional information. When the option "All methods" is selected, the predictions of SEQ-PROF and SVM-GOS are also calculated and included in the output.

Results

SNPs&GO was trained on a set of 38460 mutations and tested with a cross-validation procedure over sets in which similar proteins were kept in the same dataset, also for the calculation of the LGO score, as derived from the GO database. Performance increases as more information is added to the input, suggesting that, on top of the sequence profile, LGO, derived from the protein GO annotation, adds crucial value for discriminating disease-related polymorphisms from neutral ones. This trend corroborates the notion that support vector machines can capture the correlations existing in complementary knowledge. Recently SNPs&GO was also tested by another laboratory and scored among the best predictors available [3]. Our in-house benchmark indicates that SNPs&GO is presently one of the best-scoring classifiers available for predicting whether a mutation at the protein level is or is not disease-related. In the table below, the performance of SNPs&GO is compared with that of SEQ-PROF and of a simple GO-based method (SVM-GOS).


Methods	Q2	P(D)	Q(D)	P(N)	Q(N)	C

SEQ-PROF	0.76	0.76	0.77	0.76	0.76	0.52
SVM-GOS	0.70	0.73	0.62	0.67	0.77	0.40
SNPs&GO	0.81	0.81	0.82	0.82	0.81	0.63

The overall accuracy Q2 is:

Q2=p/N

where p is the total number of correctly predicted variations and N is the total number of variations.
The correlation coefficient C is defined as:

C(s)=[p(s)n(s)-u(s)o(s)] / D

where D is the normalization factor

D=[(p(s)+u(s))(p(s)+o(s))(n(s)+u(s))(n(s)+o(s))]^(1/2)

for each class s (D and N, for disease-related and neutral polymorphisms, respectively); p(s) and n(s) are the total numbers of correct predictions and correctly rejected assignments, respectively, and u(s) and o(s) are the numbers of under- and over-predictions.
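
As a worked illustration of the correlation coefficient, the following Python sketch computes C(s) from the four counts defined above. The counts in the example are hypothetical and are not taken from the benchmark; returning 0 when the normalization factor is zero is a convention assumed here, not something stated on this page.

```python
import math


def correlation_coefficient(p, n, u, o):
    """C(s) = [p(s)n(s) - u(s)o(s)] / D for one class s.
    p: correct predictions, n: correctly rejected assignments,
    u: under-predictions, o: over-predictions."""
    d = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    if d == 0:
        return 0.0  # assumed convention when a marginal count is zero
    return (p * n - u * o) / d


# Hypothetical counts for the disease-related class D.
c_disease = correlation_coefficient(p=80, n=70, u=20, o=30)
# The neutral class N sees the same counts with the roles swapped.
c_neutral = correlation_coefficient(p=70, n=80, u=30, o=20)
```

With two classes, swapping the roles of the classes exchanges p with n and u with o, which leaves the formula unchanged. C(D) and C(N) therefore coincide, which is consistent with the single C column in the table.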

The coverage for each discriminated class s is evaluated as:

Q(s)=p(s)/[p(s)+u(s)]

where p(s) and u(s) are as defined above. The probability of correct predictions P(s) (or accuracy for s) is computed as:

P(s)=p(s) / [p(s) + o(s)]

where p(s) and o(s) are as defined above. P(s) ranges from 0 to 1.
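
The overall accuracy, coverage and probability of correct prediction can be computed from the same counts. The Python sketch below uses hypothetical counts chosen only for illustration; the function names are not part of SNPs&GO, and the denominators are assumed to be positive.

```python
def coverage(p, u):
    """Q(s) = p(s) / [p(s) + u(s)]; assumes p + u > 0."""
    return p / (p + u)


def probability_correct(p, o):
    """P(s) = p(s) / [p(s) + o(s)]; assumes p + o > 0."""
    return p / (p + o)


def overall_accuracy(correct_per_class, total):
    """Q2 = p / N, with p the sum of correct predictions over both classes."""
    return sum(correct_per_class) / total


# Hypothetical two-class outcome on 200 variations (100 per class).
# In the two-class case the under-predictions of one class are the
# over-predictions of the other.
disease = {"p": 80, "u": 20, "o": 30}
neutral = {"p": 70, "u": 30, "o": 20}

q2 = overall_accuracy([disease["p"], neutral["p"]], total=200)
q_d = coverage(disease["p"], disease["u"])
p_d = probability_correct(disease["p"], disease["o"])
q_n = coverage(neutral["p"], neutral["u"])
p_n = probability_correct(neutral["p"], neutral["o"])
```

With these counts Q2 is 0.75, Q(D) is 0.80 and Q(N) is 0.70, while P(D) = 80/110 and P(N) = 70/90.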

References

[1] Calabrese R, Capriotti E, Fariselli P, Martelli PL, Casadio R. (2009) Functional annotations improve the predictive score of human disease-related mutations in proteins. Human Mutation, 30, 1237-1244.
[2] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
[3] Thusberg J, Olatubosun A, Vihinen M. (2011) Performance of mutation pathogenicity prediction methods on missense variants. Human Mutation, 32, 358-368.