Personal Genome Project participants often identifiable

10 May 2013

A pre-publication study posted to arXiv last week showed that full names, addresses and medical details of a significant proportion of participants in the Personal Genome Project (PGP) could readily be obtained by cross-referencing their ‘anonymous’ profiles with other public records.

The PGP initiative aims to sequence the genomes of 100,000 volunteers and make them available in an online database along with phenotype information, to help researchers identify correlations between traits and genomic data. It operates an ‘open consent’ policy under which participants can disclose as much or as little personal information as they wish, including the option to upload additional data to their profiles from external DNA services, such as 23andMe.

Profiles appear in a ‘de-identified state’, without the individual’s name or address, which the study authors suggest could give participants the impression that they are unidentifiable. Of 1130 participants with public profiles, 579 (around half) had included their date of birth, gender and postal code. The researchers focused on these 579 profiles and set out to discover how many they could identify by name.

They compared the information in the PGP profiles against public records, and noted the profiles that matched just a single name. They also scanned for names appearing in files that participants had uploaded to their profiles. Combined, these two approaches enabled them to assign names to 241 (42%) of the 579 profiles. They submitted the names to PGP, which confirmed that at least 84% were correct, and that this could be as high as 97% if allowances were made for possible name variations (such as Jim instead of James).
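
The record-matching step described above can be illustrated with a short Python sketch. It is not the study authors' code: the field names (`dob`, `gender`, `postal_code`), the `link_unique` function and all sample records are hypothetical, and real public records would need far more careful handling. The idea is simply to index public records by the three attributes and keep only those profiles whose combination points to exactly one name.

```python
from collections import defaultdict


def link_unique(profiles, public_records):
    """Map profile id -> name for profiles matching exactly one name.

    Hypothetical sketch: both inputs are lists of dicts that share the
    keys 'dob', 'gender' and 'postal_code'.
    """
    # Index the public records by the three quasi-identifiers.
    names_by_key = defaultdict(set)
    for record in public_records:
        key = (record["dob"], record["gender"], record["postal_code"])
        names_by_key[key].add(record["name"])

    # Keep only profiles whose key corresponds to a single name.
    matches = {}
    for profile in profiles:
        key = (profile["dob"], profile["gender"], profile["postal_code"])
        candidates = names_by_key.get(key, set())
        if len(candidates) == 1:
            matches[profile["id"]] = next(iter(candidates))
    return matches


# Made-up sample data, for illustration only.
profiles = [
    {"id": "p1", "dob": "1970-01-01", "gender": "F", "postal_code": "02139"},
    {"id": "p2", "dob": "1980-05-05", "gender": "M", "postal_code": "02138"},
    {"id": "p3", "dob": "1990-09-09", "gender": "F", "postal_code": "02140"},
]
public_records = [
    {"name": "Alice Example", "dob": "1970-01-01", "gender": "F", "postal_code": "02139"},
    {"name": "Bob Sample", "dob": "1980-05-05", "gender": "M", "postal_code": "02138"},
    {"name": "Carl Sample", "dob": "1980-05-05", "gender": "M", "postal_code": "02138"},
]

matches = link_unique(profiles, public_records)
print(matches)
```

In this toy data only `p1` is assigned a name: `p2` matches two different people and is therefore ambiguous, and `p3` matches no record at all. The same logic explains why omitting date of birth or postal code from a profile makes a unique match much less likely.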

The researchers say their work demonstrates that PGP participants are vulnerable to identification, and point out that many profiles include potentially sensitive information, such as medical, sexual and drug use history. They conclude that participants could most effectively protect themselves by omitting their date of birth and postal code, and by removing names from documents they upload; the study authors built a software tool to help participants edit records to achieve this.

Comment: It is important to note that it was not genomic data that allowed individuals to be identified in this study, but other personal information that they had either voluntarily or inadvertently added to their profiles, as was the case in a study earlier this year that was able to identify ‘anonymised’ genomes (see previous news). The PGP website explicitly warns potential participants that by taking part they are sharing their data with the wider world, that they should consider themselves identifiable, and that they should not take part in the project if this is a concern.