
The value and limitations of big data on the human genome: probing function


Interpreting the human genome

Resolving exactly which areas of the human genome are functional is important for identifying sequences associated with our health. Although the first draft human genome sequence was completed over a decade ago, it remains unclear what functional roles, if any, most human DNA possesses. A surprising finding from early sequencing efforts is that less than 2% of the human genome codes for the production of proteins, the critical ‘nuts and bolts’ of cellular function and development. The remaining 98%, the non-coding portion of the genome, comprises some important regulatory sequences, but also a large amount of so-called ‘junk DNA’ that has no functional value.

The Encyclopedia of DNA Elements (ENCODE) project is a research consortium working to identify biochemically active regions of the human genome. A parallel initiative, modENCODE, is doing the same for selected model organism (fly and worm) genomes. Both consortia have performed large-scale biological assays across many different cell and tissue types. Building on their previous release of data in 2012, the consortia have identified new predicted regulatory elements for humans, flies and worms, with five accompanying papers published in Nature. These new papers focus on comparative analyses that provide a high-resolution, global picture of how genomic elements vary across different species. The analyses found that several key genomic features have preserved components across the divergent species.

Relevance to public health

These findings are not revolutionary, but they add incrementally to previous research and reinforce the value of model organisms to inform human biology. The shared fundamental developmental biology of humans, flies and worms illustrates that these model organisms can provide insights into human diseases, as they have done on many occasions.

The main utility of the data in a medical context is that these biochemical annotations can provide functional information for candidate human disease regions; such information is sparse outside of the protein-coding regions. However, caution should be taken with such approaches, since many biochemically active sequences are just by-products of cellular processes and not meaningfully functional. Furthermore, an improved understanding of the causes of disease is not in itself a medical benefit, although it may be the first step towards developing a diagnostic tool or potentially a novel therapy.

The term ‘big data’ is arguably overused, but it is no overstatement for the data produced by ENCODE, which is one of the largest biomedical resources produced by a single consortium. However, the data comes at a large cost: an estimated US $185 million as of 2012, with up to $123 million secured for subsequent work. Are further extensions of the ENCODE project a good use of financial resources? There is no clear-cut answer to this question, but considerable downstream translational work will be required, and should be prioritised, before any medical benefits are realised from the existing ENCODE data. Furthermore, smaller, targeted, hypothesis-driven pieces of work may be less visionary and exciting than projects like ENCODE, but have the potential to facilitate more applied and translational efforts to drag genomics further into the realm of public health.
