ContrastRank - A new method for tumor sample classification.

Benchmark

We used three whole-exome sequencing datasets made available by The Cancer Genome Atlas (TCGA) consortium. We selected three types of adenocarcinoma for which more than 200 pairs of normal/tumor samples are available, which we consider the minimum number of samples needed to perform a 2-fold cross-validation test. For each tumor type we selected the largest available dataset: the colon adenocarcinoma (COAD) samples produced by the Baylor College of Medicine, and the lung and prostate adenocarcinoma (LUAD and PRAD, respectively) samples produced by the Broad Institute of MIT and Harvard. The three selected datasets are composed of 220, 625 and 309 pairs of normal and tumor samples from patients affected by COAD, LUAD and PRAD, respectively.
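The 2-fold cross-validation mentioned above can be illustrated with a minimal sketch. The text does not describe how the folds were formed, so the random, seed-controlled assignment below is an assumption, and the function names `two_fold_split` and `two_fold_rounds` are hypothetical. The only design choice taken from the text is that each normal/tumor pair is treated as one unit, so both samples of a patient always land in the same fold.

```python
import random


def two_fold_split(pair_ids, seed=0):
    """Randomly split patient (normal/tumor pair) identifiers into two folds.

    Assumption: folds are drawn uniformly at random; the original
    procedure is not specified in the text.
    """
    ids = sorted(pair_ids)          # sort first so the result depends only on the seed
    rng = random.Random(seed)
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def two_fold_rounds(pair_ids, seed=0):
    """Return the two (train, test) rounds of a 2-fold cross-validation."""
    fold_a, fold_b = two_fold_split(pair_ids, seed)
    return [(fold_a, fold_b), (fold_b, fold_a)]


# Example with 220 hypothetical pair identifiers (the size of the COAD set).
coad_pairs = [f"pair_{i:03d}" for i in range(220)]
rounds = two_fold_rounds(coad_pairs, seed=42)
```

With 220 pairs each fold holds 110 patients, which is why a dataset much smaller than about 200 pairs would leave few samples for training in each round.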

We also analysed the genomes of 1092 individuals made available by the 1000 Genomes Consortium (see link). In our analysis, the variation data from 8 genes on chromosome Y were not considered, because data are available for fewer individuals and genotype data are missing for some of the alleles in the samples. We used ANNOVAR (Wang, et al., 2010) to annotate the effect of the genetic variants in each Variant Call Format (VCF) file from TCGA and 1000 Genomes, using human genome build 19 (hg19). For the tumor datasets, we adopted specific filtering procedures to extract the genetic variants from the VCF files. The filtering procedure applied to the COAD, LUAD and PRAD samples allowed us to select an average number of nonsynonymous single nucleotide variants (nsSNVs) per sample that is comparable with the recently estimated value (~10,000) (Bamshad, et al., Nature Reviews Genetics, 2011).

Average values of nsSNVs for the normal and tumor samples in COAD, LUAD and PRAD, and for the 1000 Genomes samples.

In our analysis, we focus only on putative deleterious variants (PDVs) with a minor allele frequency (MAF) lower than 0.5%. The MAF is derived from the genomes of the 1092 individuals in the 1000 Genomes data. All the nsSNVs found in TCGA samples but not in 1000 Genomes were considered to have even lower frequencies and were therefore assumed to be PDVs. After filtering, the PDVs in normal and tumor samples account for 10-16% of the whole set of nsSNVs. The average number of PDVs per individual in 1000 Genomes is 318, in agreement with the previously published result (1000 Genomes Project, et al., Nature, 2012). We mapped all the PDVs to their corresponding genes and calculated the average number of putative impaired genes (PIGs) for each sample. We found that, on average, the PDVs affect ~700 and ~900 PIGs in normal and tumor samples, respectively.
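The PDV selection and gene mapping described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in the analysis: the record layout (dictionaries with `id` and `gene` keys), the `maf_by_variant` lookup table and all function names are hypothetical. The 0.5% threshold and the rule that variants absent from 1000 Genomes count as PDVs come from the text.

```python
MAF_THRESHOLD = 0.005  # 0.5%, as stated in the text


def select_pdvs(variants, maf_by_variant, threshold=MAF_THRESHOLD):
    """Keep the nsSNVs whose MAF is strictly below the threshold.

    Variants missing from the lookup table (not seen in 1000 Genomes)
    are treated as having MAF 0 and are therefore kept as PDVs.
    """
    return [v for v in variants if maf_by_variant.get(v["id"], 0.0) < threshold]


def impaired_genes(pdvs):
    """Map PDVs to the set of genes they fall in (the PIGs of one sample)."""
    return {v["gene"] for v in pdvs}


def average_pig_count(samples, maf_by_variant):
    """Average number of PIGs per sample."""
    counts = [
        len(impaired_genes(select_pdvs(variants, maf_by_variant)))
        for variants in samples.values()
    ]
    return sum(counts) / len(counts)


# Toy data: identifiers, genes and frequencies are made up for illustration.
maf_by_variant = {"v1": 0.20, "v2": 0.001, "v3": 0.004}
samples = {
    "s1": [
        {"id": "v1", "gene": "GENE_A"},   # common variant, filtered out
        {"id": "v2", "gene": "GENE_B"},   # rare variant, kept
        {"id": "v4", "gene": "GENE_B"},   # absent from the table, kept
    ],
    "s2": [
        {"id": "v3", "gene": "GENE_C"},   # rare variant, kept
        {"id": "v5", "gene": "GENE_D"},   # absent from the table, kept
        {"id": "v1", "gene": "GENE_A"},   # common variant, filtered out
    ],
}
mean_pigs = average_pig_count(samples, maf_by_variant)
```

Because several PDVs can fall in the same gene, the number of PIGs in a sample is at most its number of PDVs; in the toy data, sample `s1` has two PDVs but a single impaired gene.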