SNPs&GO
Predicting disease associated variations using GO terms
Benchmark
SNPs&GO has been trained and tested using a 20-fold cross-validation procedure on a set of 38,460
variations from 9,067 proteins (SAP-SEQ) extracted from the
Swiss-Var database
(Oct. 2009).
The SAP-SEQ dataset is composed of 19,230 disease-related mutations and the same number
of randomly selected neutral polymorphisms.
In the cross-validation procedure, proteins are clustered using the blastclust algorithm from the
BLAST package, and all the variations belonging to the same cluster of similar sequences are kept in the same set.
The SAP-SEQ dataset can be downloaded from this
link.
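The cluster-aware split described above can be sketched as follows. This is only an illustration, not the authors' actual procedure: the function name `assign_folds`, the variant identifiers, the cluster labels, and the greedy size-balancing strategy are all assumptions. The only property taken from the text is that every variation from the same cluster of similar sequences ends up in the same cross-validation set.

```python
from collections import defaultdict
import heapq


def assign_folds(variant_clusters, n_folds=20):
    """Map each variant to a fold so that a cluster is never split.

    variant_clusters: dict of variant id -> cluster id (hypothetical
    input; in practice the clusters would come from blastclust output).
    Returns a dict of variant id -> fold index in [0, n_folds).
    """
    # Group the variants by the cluster of their protein.
    members = defaultdict(list)
    for variant, cluster in variant_clusters.items():
        members[cluster].append(variant)

    # Min-heap of (number of variants already in the fold, fold index).
    heap = [(0, fold) for fold in range(n_folds)]
    heapq.heapify(heap)

    folds = {}
    # Assumption: place the largest clusters first, each time into the
    # currently smallest fold, to keep the fold sizes roughly even.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        size, fold = heapq.heappop(heap)
        for variant in members[cluster]:
            folds[variant] = fold
        heapq.heappush(heap, (size + len(members[cluster]), fold))
    return folds


# Toy example with made-up identifiers.
example = {
    "P1:A10V": "c1",
    "P1:G20D": "c1",
    "P2:L5P": "c1",
    "P3:K7E": "c2",
    "P4:R9Q": "c3",
}
folds = assign_folds(example, n_folds=2)
```

Because folds are assigned per cluster rather than per variation, similar sequences cannot appear on both the training and the test side of a split.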
The structure-based SNPs&GO3d algorithm has been trained
and tested using a 20-fold cross-validation procedure
on a set of 6,630 mutations from 784 protein chains (SAP-3D) from the
PDB
(Oct. 2009).
The SAP-3D dataset is composed of 3,342 disease-associated variations and 1,644
neutral variations. To balance the composition of the dataset, the reverse
mutations of the neutral polymorphisms are also considered.
In the cross-validation procedure, proteins are clustered using the blastclust
algorithm from the BLAST package, and all the
mutations belonging to the same
cluster of sequences are kept in the same set.
The SAP-3D dataset can be downloaded from this
link.
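The SAP-3D counts above are consistent if each neutral polymorphism contributes both its forward and its reverse mutation. The sketch below checks that arithmetic and shows one way a reverse mutation could be written. The `A10V`-style notation and the `reverse_mutation` helper are assumptions for illustration, not the format of the distributed dataset.

```python
def reverse_mutation(mutation):
    """Swap wild-type and mutant residues, e.g. 'A10V' -> 'V10A'.

    Assumes a hypothetical one-letter notation:
    <wild-type residue><position><mutant residue>.
    """
    wild, position, mutant = mutation[0], mutation[1:-1], mutation[-1]
    return f"{mutant}{position}{wild}"


n_disease = 3342   # disease-associated variations
n_neutral = 1644   # neutral variations

# Each neutral polymorphism is counted twice: forward and reverse.
n_neutral_with_reverse = 2 * n_neutral
n_total = n_disease + n_neutral_with_reverse

print(reverse_mutation("A10V"))
print(n_neutral_with_reverse, n_total)
```

With the reverse mutations included, the neutral class grows to 3,288 entries, close to the 3,342 disease-associated ones, and the total matches the 6,630 mutations reported for SAP-3D.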
An additional dataset composed of 1,489 variants from 271 proteins (SAP-NEW) with known structures has been used to test both SNPs&GO and SNPs&GO3d.
The list of SAP-NEW variations is available here.
The Gene Ontology (GO) terms are extracted from
the gene_association.goa_human file, and their parents are retrieved using
the GO-TermFinder package.
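Retrieving the parents of a GO term amounts to walking the ontology graph upward from each annotated term. The sketch below illustrates that idea on a toy graph. It is not the GO-TermFinder API (GO-TermFinder is a separate package), and the term names and the `parents` mapping are made up for the example.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links.

    parents: dict of term -> list of direct parent terms (hypothetical
    toy structure standing in for the GO graph).
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_annotations(terms, parents):
    """Add all ancestors to a protein's set of annotated terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= ancestors(term, parents)
    return expanded


# Toy graph: a term may have more than one parent, as in GO.
toy_parents = {
    "leaf": ["mid1", "mid2"],
    "mid1": ["root"],
    "mid2": ["root"],
    "root": [],
}
expanded = expand_annotations({"leaf"}, toy_parents)
print(sorted(expanded))
```

The `seen` set ensures that a term reachable through several paths, such as `root` here, is visited only once.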