Secure way to share genetic data?

14 September 2009

In September last year, bodies such as the Wellcome Trust and the US National Institutes of Health (NIH) restricted access to databases containing genetic information following the publication of an article in PLoS Genetics (see previous news). The article described a statistical approach to the analysis of single nucleotide polymorphism (SNP) data sets to resolve individual genotypes [Homer N et al. (2008) PLoS Genet. 4(8), e1000167]. This technique was a major breakthrough for forensic science, but it could also reportedly be used to identify individual participants in data obtained from genome-wide association (GWA) studies, although identification would require prior knowledge of the SNP profile of the individual concerned or of a close relative. GWA studies rely on combining data from multiple studies in order to demonstrate genetic associations with appropriate statistical power, a process that is hampered by restricted access to data. This problem of data access and anonymity may now have been overcome with the development of a mathematical formula and an accompanying software program (reported by GenomeWeb).

It is possible to detect individual genotypes from a DNA pool if a sufficiently large number of common SNPs are known. However, accurate identification also depends on the allele frequencies of the SNPs, the number of individuals in the DNA pool, and the method used to detect an individual in the pool. Consequently, it may be possible to prevent a subject's identity from being revealed by limiting the SNP information in the dataset. A recent paper in Nature Genetics describes the development of an approach to determine which SNPs are 'safe' to reveal, thereby allowing datasets to be configured to increase anonymity.

Based on a study of statistical methods such as those developed by Homer et al., Sankararaman et al. have constructed a likelihood ratio test (LR test) to calculate the probability of identifying an individual genotype in a pooled data set [Sankararaman et al. (2009) Nat Genet. doi:10.1038/ng.436]. The LR test does this by taking into account the false-positive rate, the size of the pool and the number of exposed SNPs. Because altering the number of exposed SNPs changes the LR value, the formula allows the chances of identifying an individual genotype within the pool to be estimated in relation to the exposed SNPs, thereby giving some guidance on how to alter the dataset to maintain anonymity.
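The idea behind such a test can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions, not the formula published by Sankararaman et al.: each individual carries one 0/1 allele per SNP (a haploid simplification), SNPs are treated as independent, and all function names are hypothetical. It compares how well a genotype fits the pool's allele frequencies with how well it fits a reference population's frequencies.

```python
import math
import random


def simulate_genotypes(ref_freqs, n_individuals, rng):
    """Draw one 0/1 allele per SNP for each individual (haploid simplification)."""
    return [[1 if rng.random() < p else 0 for p in ref_freqs]
            for _ in range(n_individuals)]


def pool_frequencies(pool):
    """Per-SNP allele frequency in the pool, smoothed so it is never exactly 0 or 1."""
    n = len(pool)
    return [(sum(column) + 0.5) / (n + 1.0) for column in zip(*pool)]


def log_likelihood_ratio(alleles, pool_freqs, ref_freqs):
    """Log LR of 'drawn from the pool' versus 'drawn from the reference population'.

    Assumes independent SNPs, so the per-SNP terms simply add up.
    """
    total = 0.0
    for x, p_pool, p_ref in zip(alleles, pool_freqs, ref_freqs):
        if x == 1:
            total += math.log(p_pool / p_ref)
        else:
            total += math.log((1.0 - p_pool) / (1.0 - p_ref))
    return total


def mean_llr(individuals, pool_freqs, ref_freqs):
    """Average log LR over a group of individuals."""
    scores = [log_likelihood_ratio(ind, pool_freqs, ref_freqs) for ind in individuals]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rng = random.Random(0)
    ref_freqs = [rng.uniform(0.2, 0.8) for _ in range(1000)]  # 1000 exposed SNPs
    pool = simulate_genotypes(ref_freqs, 50, rng)             # 50 pooled individuals
    outsiders = simulate_genotypes(ref_freqs, 50, rng)        # 50 people not in the pool
    freqs = pool_frequencies(pool)
    print("mean log LR, pool members:", mean_llr(pool, freqs, ref_freqs))
    print("mean log LR, non-members: ", mean_llr(outsiders, freqs, ref_freqs))
```

In this toy setting pool members tend to score above zero and non-members below zero, and the gap widens as more SNPs are exposed or as the pool gets smaller, which is why the number of exposed SNPs and the pool size matter for anonymity.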

Based on the mathematical formula, the team went on to develop a software tool called SecureGenome, which allows users to identify a limited set of SNPs that can be safely revealed from a genotype dataset. The approach was validated using simulated data, as well as data from the Wellcome Trust Case Control Consortium and the HapMap Project. The exposed SNPs in each data set were varied and the likelihood of identifying a specific test genotype was calculated; this allowed the identification of SNPs that could safely be exposed.
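The trade-off between the false-positive rate, the pool size and the number of exposed SNPs can be sketched numerically. The example below is not SecureGenome and does not reproduce its interface. It is a rough normal approximation that assumes independent, common SNPs, under which the detection power of an LR-type test is approximately Phi(sqrt(m/n) - z), where m is the number of exposed SNPs, n the pool size and z the standard normal quantile for the chosen false-positive rate. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def approx_detection_power(num_snps, pool_size, false_positive_rate):
    """Approximate chance of detecting a pool member at a given false-positive rate.

    Assumption: normal approximation with independent SNPs.
    false_positive_rate must lie strictly between 0 and 1.
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - false_positive_rate)
    return STD_NORMAL.cdf(math.sqrt(num_snps / pool_size) - z_alpha)


def max_exposed_snps(pool_size, false_positive_rate, max_power):
    """Largest number of SNPs that keeps the approximate power at or below max_power.

    Returns 0 when even zero exposed SNPs cannot meet the target, which happens
    when max_power is not above the false-positive rate.
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - false_positive_rate)
    z_beta = STD_NORMAL.inv_cdf(max_power)
    bound = z_alpha + z_beta
    if bound <= 0.0:
        return 0
    return math.floor(pool_size * bound ** 2)


if __name__ == "__main__":
    # Vary the number of exposed SNPs for a pool of 1000 individuals.
    for m in (100, 1000, 5000, 20000):
        print(m, "SNPs -> approximate power", approx_detection_power(m, 1000, 0.001))
    print("SNP budget:", max_exposed_snps(1000, 0.001, 0.1))
```

Under this approximation the number of SNPs that can be exposed grows in proportion to the pool size, so larger pools can reveal more SNPs for the same level of risk. The sketch only gives a count; it does not say which SNPs to choose.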

However, the method has some limitations: it assumes that the SNPs are independent (that is, not in linkage disequilibrium), and it does not factor in rare SNPs, which make data much more identifiable. In addition, the set of SNPs that can be safely exposed may not necessarily be those in which researchers are interested.
