14 September 2009
It is possible to detect an individual genotype in a DNA pool if a sufficiently large number of common SNPs is known. However, accurate identification also depends on the allele frequencies of the SNPs, the number of individuals in the DNA pool, and the method used to detect an individual in the pool. Consequently, it may be possible to prevent a subject’s identity from being revealed by limiting the SNP information in the dataset. A recent paper in Nature Genetics describes an approach for determining which SNPs are ‘safe’ to reveal, allowing datasets to be configured to increase anonymity.
Based on a study of statistical methods such as those developed by Homer et al., Sankararaman et al. have constructed a likelihood ratio test (LR test) to calculate the probability of identifying an individual genotype in a pooled data set [Sankararaman et al. (2009) Nat Genet. doi:10.1038/ng.436]. The LR test does this by taking into account the false-positive rate, the size of the pool and the number of exposed SNPs. Because altering the number of exposed SNPs changes the LR value, the formula allows the chances of identifying an individual genotype within the pool to be estimated for a given set of exposed SNPs, giving some guidance on how to alter the dataset to maintain anonymity.
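The idea behind such a test can be illustrated with a minimal sketch. This is not the formula from the paper: it assumes haploid genotypes coded as 0/1, independent SNPs, and known reference-population allele frequencies, and the function name `lr_statistic` and its parameters are hypothetical. It sums, over the exposed SNPs, the log ratio of how likely the observed allele is under the pool frequencies versus the reference frequencies.

```python
import math


def lr_statistic(genotype, pool_freqs, ref_freqs, eps=1e-6):
    """Log-likelihood ratio: pool membership versus reference population.

    genotype   -- list of 0/1 alleles, one per exposed SNP (haploid assumption)
    pool_freqs -- allele frequencies reported for the pool
    ref_freqs  -- allele frequencies in a reference population
    eps        -- clamp so that no frequency is exactly 0 or 1
    """
    total = 0.0
    for allele, p_pool, p_ref in zip(genotype, pool_freqs, ref_freqs):
        # Keep both frequencies strictly inside (0, 1) so the logs stay finite.
        p_pool = min(max(p_pool, eps), 1 - eps)
        p_ref = min(max(p_ref, eps), 1 - eps)
        if allele == 1:
            total += math.log(p_pool / p_ref)
        else:
            total += math.log((1 - p_pool) / (1 - p_ref))
    return total


# Illustrative values only: two exposed SNPs.
score = lr_statistic([1, 0], pool_freqs=[0.6, 0.4], ref_freqs=[0.5, 0.5])
```

A larger score means the genotype resembles the pool more than the reference population. Exposing more SNPs adds more terms to the sum, which is why the number of exposed SNPs affects how confidently an individual can be detected.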
Based on the mathematical formula, the team went on to develop a software tool called SecureGenome, which allows users to identify a limited set of SNPs that can be safely revealed from a genotype dataset. The approach was validated using simulated data, as well as data from the Wellcome Trust Case Control Consortium and the HapMap Project. The exposed SNPs in each data set were varied and the likelihood of identifying a specific test genotype was calculated; this allowed identification of SNPs that could safely be exposed.
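The effect of varying the number of exposed SNPs can be illustrated with a toy simulation. This is not SecureGenome and does not use the study's data: it assumes haploid 0/1 genotypes, independent SNPs with frequencies drawn uniformly between 0.1 and 0.9, and a simple log-likelihood-ratio score, and the function name `simulate_power` and all parameter values are hypothetical. It estimates the fraction of pool members detected at a fixed false-positive rate.

```python
import math
import random


def simulate_power(num_snps, pool_size, alpha=0.05, num_outsiders=200, seed=0):
    """Estimate the fraction of pool members detected at false-positive rate alpha."""
    rng = random.Random(seed)
    # Reference allele frequencies for common, independent SNPs (assumption).
    ref = [rng.uniform(0.1, 0.9) for _ in range(num_snps)]

    def draw_genotype():
        return [1 if rng.random() < p else 0 for p in ref]

    pool = [draw_genotype() for _ in range(pool_size)]
    outsiders = [draw_genotype() for _ in range(num_outsiders)]

    # Pool allele frequencies, with a pseudo-count so they never reach 0 or 1.
    pool_freq = [
        (sum(g[j] for g in pool) + 0.5) / (pool_size + 1) for j in range(num_snps)
    ]

    def score(genotype):
        return sum(
            math.log(q / p) if allele else math.log((1 - q) / (1 - p))
            for allele, q, p in zip(genotype, pool_freq, ref)
        )

    # Threshold near the empirical (1 - alpha) quantile of outsider scores.
    null_scores = sorted(score(g) for g in outsiders)
    index = min(int((1 - alpha) * num_outsiders), num_outsiders - 1)
    threshold = null_scores[index]

    detected = sum(score(g) > threshold for g in pool)
    return detected / pool_size


power_few = simulate_power(num_snps=10, pool_size=50)
power_many = simulate_power(num_snps=1000, pool_size=50)
print(f"10 SNPs exposed: {power_few:.2f}, 1000 SNPs exposed: {power_many:.2f}")
```

In this sketch, exposing only a handful of SNPs leaves pool members hard to distinguish from outsiders, while exposing many SNPs makes most of them detectable. That trade-off is what a tool for choosing a safe subset of SNPs has to manage.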
However, the method has some limitations: it assumes that the SNPs are independent (that is, not in linkage disequilibrium) and does not factor in rare SNPs, which make data much more identifiable. In addition, the set of SNPs that can be safely exposed may not be the set in which researchers are interested.