We developed a disease-specific machine learning approach
to predict if a non-synonymous SNP is related to cancer.
The implemented Support Vector Machine (SVM) method has been
trained on a set of 3,163 cancer-causing mutations from 74
proteins. As negative sets we used different types of
missense Single Nucleotide Variants (mSNVs): variants taken from SwissVar and
variants generated in silico that were previously used to train a method for
the discrimination between driver and passenger mutations.
In particular, the CNO dataset
was used for training and testing purposes and is composed of
1,583 cancer-causing mutations and the same number of randomly
selected polymorphisms from SwissVar with allele frequency higher than 0.01 and sample count higher than 49.
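
The selection of the neutral subset described above can be sketched as follows. This is a minimal illustration and not the original pipeline: the record fields (`id`, `allele_freq`, `sample_count`), the function name and the toy records are hypothetical, and only the two thresholds (allele frequency higher than 0.01, sample count higher than 49) come from the text.

```python
import random


def select_neutral_polymorphisms(variants, n, min_freq=0.01, min_samples=49, seed=0):
    """Filter polymorphisms by the two thresholds, then draw n of them at random.

    `variants` is a list of dicts with hypothetical keys
    'allele_freq' and 'sample_count'.
    """
    # Both conditions are strict ("higher than"), as stated in the text.
    eligible = [
        v for v in variants
        if v["allele_freq"] > min_freq and v["sample_count"] > min_samples
    ]
    if len(eligible) < n:
        raise ValueError(f"only {len(eligible)} eligible polymorphisms, {n} requested")
    # A seeded generator keeps the random selection reproducible.
    return random.Random(seed).sample(eligible, n)


# Toy records (made up for illustration only).
variants = [
    {"id": "v1", "allele_freq": 0.20, "sample_count": 120},
    {"id": "v2", "allele_freq": 0.005, "sample_count": 300},  # too rare
    {"id": "v3", "allele_freq": 0.05, "sample_count": 49},    # sample count not > 49
    {"id": "v4", "allele_freq": 0.02, "sample_count": 50},
    {"id": "v5", "allele_freq": 0.01, "sample_count": 80},    # frequency not > 0.01
]
negatives = select_neutral_polymorphisms(variants, n=2)
```

In a balanced set such as CNO, `n` would be set to the number of cancer-causing mutations so that both classes have the same size.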
For testing purposes only, we used the
CND dataset,
which is similar to the CNO dataset except that 1,582 polymorphisms were
replaced with disease-related mutations not associated with the
MeSH term neoplasm.
Finally, in the
Synthetic dataset, the negative
subset has been generated in silico and was used to test
the CHASM algorithm (Carter et al., Cancer Research 2009).
The SVM input features of the method are: i) the amino acid
substitution, ii) the sequence environment, iii) the sequence
profile information, and iv) a Gene Ontology (GO) based score.
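
To make the first two feature groups concrete, the sketch below encodes a substitution and its sequence environment as numeric vectors. The exact encoding is an assumption and is not specified in this text: a 20-element vector with -1 at the wild-type residue and +1 at the mutant residue, plus a 20-element vector of residue counts in a window centred on the mutated position. The function names, the window size and the example sequence are hypothetical. The sequence profile and the GO-based score need external data and are left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode_substitution(wild_type, mutant):
    """Assumed encoding: -1 at the wild-type residue, +1 at the mutant, 0 elsewhere."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(wild_type)] = -1
    vec[AMINO_ACIDS.index(mutant)] = 1
    return vec


def encode_environment(sequence, position, window=19):
    """Count each residue type in a window centred on `position` (0-based).

    The window is truncated at the sequence ends, and the central
    residue is included in the counts (an arbitrary choice here).
    """
    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    vec = [0] * len(AMINO_ACIDS)
    for residue in sequence[start:end]:
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # skip non-standard symbols such as 'X'
            vec[idx] += 1
    return vec


# Hypothetical example: alanine at index 3 replaced by valine.
sequence = "MKTAYIAKQR"
features = encode_substitution("A", "V") + encode_environment(sequence, 3, window=5)
```

The concatenated list has 40 elements here; in a full feature vector the profile-derived values and the GO-based score would be appended before passing it to the SVM.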
More details and the performance of the Dr. Cancer algorithm
have been described in a peer-reviewed paper
(Capriotti and
Altman, Genomics 2011).