New standards for genome sequence data quality

27 October 2009

Researchers often use DNA sequence databases to compare unknown sequences with known ones. Consequently, information on the completeness and accuracy of each sequence is important. Principal investigators participating in the Human Genome Project established world standards for genome sequence fidelity in 1997. Referred to as the Bermuda standards, they categorized sequences as either 'Finished' or 'Draft'. Finished sequences were those that were contiguous (with no gaps) and had fewer than one error per 10,000 bases; almost all other sequences were classified as Draft. However, developments in sequencing technologies have led to a proliferation of sequence data deposited in publicly accessible databases. Much of this data has been classified as 'Draft' sequence, although the quality of these sequences varies widely. Factors affecting sequence quality include the sequencing technology used, which can introduce errors inherent to the sequencing process itself, and the ability of software programs to assemble the resulting sequences.
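The Finished criteria described above (no gaps, and fewer than one error per 10,000 bases) can be illustrated with a minimal sketch. The function names and example counts below are hypothetical and are not taken from the Bermuda standards or any sequencing tool. The Phred-style quality score is a common convention added here for orientation only: an error rate of 1 in 10,000 corresponds to roughly Q40.

```python
import math

# Threshold stated in the article: fewer than one error per 10,000 bases.
FINISHED_MAX_ERROR_RATE = 1 / 10_000


def error_rate(errors: int, bases: int) -> float:
    """Return errors per base for a sequence of the given length."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    return errors / bases


def phred_quality(rate: float) -> float:
    """Convert an error rate to a Phred-style score, Q = -10 * log10(rate)."""
    if rate <= 0:
        return math.inf  # no observed errors
    return -10 * math.log10(rate)


def meets_finished_criteria(errors: int, bases: int, gaps: int) -> bool:
    """Hypothetical check: contiguous (no gaps) and below the error threshold."""
    return gaps == 0 and error_rate(errors, bases) < FINISHED_MAX_ERROR_RATE


if __name__ == "__main__":
    # Illustrative numbers only, not real project data.
    print(meets_finished_criteria(errors=3, bases=100_000, gaps=0))
    print(meets_finished_criteria(errors=3, bases=100_000, gaps=2))
    print(round(phred_quality(FINISHED_MAX_ERROR_RATE), 1))
```

In this sketch, a sequence with any gaps fails the check regardless of its error rate, which mirrors the two-part definition of Finished given above.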

A recent paper published in Science proposes more detailed standards and a finer classification of sequence data for researchers who generate and/or use this data [Chain et al. (2009) Science 326: 236-7]. The new standards have been compiled by an international team of researchers and classify genomes into six categories ranging from Standard Draft to Finished. Standard Draft is the minimum standard for submission to the public databases and comprises unfiltered or minimally filtered data. Although these sequences are of poor quality and may be incomplete, they still contain useful information. Finished refers to the current gold standard described above; genomes in this category can act as high-quality references for comparative purposes. The intermediate categories are high-quality draft, improved high-quality draft, annotation-directed improvement and non-contiguous finished.

The authors have tried to develop standards that apply to all types of whole-genome sequencing projects, independent of the technology used. They have also avoided rigid numerical thresholds in order to accommodate products obtained through combinations of technologies and/or finishing processes.

