Privacy compromised in 'anonymous' genetic databases

22 January 2013

A study published in the journal Science has shown that it is possible to identify some individuals who have ‘anonymously’ made their genome sequences available for use by researchers.

The work demonstrated that male participants can potentially be identified by cross-referencing their genome sequence data, and other data attached to it, with information available from online ‘recreational’ genealogy databases.

The method exploited the fact that short tandem repeats on the Y chromosome (Y-STRs), carried only by males, are passed largely unchanged from father to son. Because surnames are usually inherited down the same paternal line, Y-STR profiles can be linked to surnames by analysing patterns of association in genealogy databases.

Once a Y-STR-surname link has been made, the name can be combined with other information sometimes attached to genome sequences in research databases – such as the age and general geographic region of the donor – to identify a specific individual. Using this method the researchers were ultimately able to identify around fifty members of Mormon families in Utah who had participated in the 1000 Genomes Project.
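The two-step linkage described above can be illustrated with a toy sketch. This is not the study's actual method or data: the marker values, surnames, directory entries and the function names `marker_distance`, `infer_surname` and `candidates` are all invented for illustration, and a simple majority vote over near-matching profiles stands in for whatever statistical matching the researchers used.

```python
from collections import Counter

# Hypothetical genealogy entries: (Y-STR repeat counts at four made-up markers, surname).
GENEALOGY = [
    ((13, 24, 14, 11), "Ashdown"),
    ((13, 24, 14, 11), "Ashdown"),
    ((13, 24, 15, 11), "Ashdown"),
    ((14, 22, 15, 10), "Birchall"),
    ((14, 22, 15, 10), "Birchall"),
]

# Hypothetical public directory: (name, surname, birth year, region).
DIRECTORY = [
    ("J. Ashdown", "Ashdown", 1950, "Region A"),
    ("T. Ashdown", "Ashdown", 1982, "Region A"),
    ("P. Ashdown", "Ashdown", 1951, "Region B"),
    ("R. Birchall", "Birchall", 1950, "Region A"),
]


def marker_distance(a, b):
    """Count the markers at which two Y-STR profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def infer_surname(profile, genealogy, max_distance=1):
    """Step 1: return the most common surname among near-matching profiles."""
    votes = Counter(
        surname
        for known_profile, surname in genealogy
        if marker_distance(profile, known_profile) <= max_distance
    )
    if not votes:
        return None  # no sufficiently close profile in the genealogy data
    return votes.most_common(1)[0][0]


def candidates(surname, birth_year, region, directory, year_tolerance=1):
    """Step 2: narrow the surname down using age and geographic region."""
    return [
        name
        for name, s, year, r in directory
        if s == surname and r == region and abs(year - birth_year) <= year_tolerance
    ]


# An 'anonymous' research record: Y-STR profile plus attached metadata.
record = {"profile": (13, 24, 14, 11), "birth_year": 1950, "region": "Region A"}

surname = infer_surname(record["profile"], GENEALOGY)
matches = candidates(surname, record["birth_year"], record["region"], DIRECTORY)
print(surname, matches)
```

In this toy data the surname alone leaves three possible people; it is the attached age and region that reduce the list to a single candidate, which is why the additional metadata matters as much as the genetic data itself.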

In discussing their findings, the study authors express the hope that the research community and the public will not respond by ceasing to share data or donate samples, to the detriment of science. They suggest that the appropriate response is to establish clear policies for data sharing, to ensure that study participants are fully informed about the potential loss of anonymity, and to enact legislation governing the proper use of genetic data.

Comment: This study highlights a potentially important issue. It is worth noting, however, that the researchers had to undertake considerable effort and analysis to identify specific individuals. Their method also relied on related individuals having participated in both research studies and public genetic genealogy databases, and on additional identifying information, specifically age and geographic location, being linked to the genome data.

Even if an individual can be identified from their DNA sequence, it is not clear that this will cause them harm. This work does, however, highlight the many questions about what can realistically be expected of anonymity for genomic data, or any kind of data, in the online world.