Entries tagged with “informatics” from O'Reilly Radar
Sequencing a Genome a Week
Radar Talks to OSCON Speaker David Dooling
by James Turner | comments: 3
Running time: 34:51
The Human Genome Project took 13 years to fully sequence a single human's genetic information. At Washington University's Genome Center, they can now do one in a week. But when you're generating that much data, just keeping track of it can become a major challenge in itself. David Dooling is in charge of managing the massive output of the Center's herd of gene sequencing machines, and making it available to researchers inside the Center and around the world. He'll be speaking at OSCON, the O'Reilly Open Source Convention. His talk, titled The Freedom to Cure Cancer: Open Source Software in Genomics, will be about how he uses open source tools to keep things under control, and he agreed to talk about how the field of genomics is evolving.
James Turner: Can you start by describing what it is you do and how you came to be doing it?
David Dooling: Sure. I work at the Genome Center at Washington University in St. Louis. We are one of the handful or so of large scale genome sequencing centers around the world. What that means is essentially we participate in large genome sequencing projects that some people may have heard of, like the Human Genome Project, the 1000 Genomes Project, things like that. And involved in that is a lot of data processing, laboratory processing, tracking and all sorts of things, so it's a rather large enterprise.
There are about 300 or so people that work here. And how I came to work here was that about eight years ago, I decided that I wanted to get more into programming and more into open science. So I took a job as a programmer here at the Genome Center and gradually worked my way around to where I am now, where I oversee all of the software development and IT infrastructure here at the Genome Center. And it's a fairly large IT infrastructure.
We have somewhere around three petabytes of storage online, and somewhere north of 3,000 cores in our computational cluster. And we're generating terabytes, tens of terabytes of data, per day with our current sequencing instruments. We're now transitioning away from more fundamental evolutionary types of projects, such as the Human Genome Project and subsequent projects like the Mouse Genome Project (we've also done things like corn and things of that nature), and doing more and more sequencing projects related to medicine and medical sequencing.
Last year, we published the full cancer genome sequence. In doing both the cancer and the normal [genomes from the same individual], we were able to determine the differences between those two genomes and begin to identify what might've possibly caused cancer in that individual. So projects like that. We're also doing projects with metabolic syndromes, like diabetes, and several other cancer projects as well. That's essentially what we're doing and how we're doing it and how I got here.
James Turner: Genomics is an area that seems to be on the steep part of the hockey stick curve right now. In just a decade, we've gone from sequencing one genome over a period of years to doing them routinely. Can you talk a bit about what's enabled this acceleration?
David Dooling: Well, a whole host of things. But I think really at the core was the changing fundamentals of sequencing itself. For a long time, DNA sequencing was based on a process invented by Sanger, sometimes called Sanger Sequencing, sometimes called capillary electrophoresis now because of the last revision of the instruments that were generated. But essentially with that approach, you did reactions in 96-well plates. You processed sequence in these 96-well-plate chunks. And you did reactions in there. You loaded them on the readers, and the readers read out sequence for each of those 96 wells. So that's sort of how you processed it. And at the height of that sort of sequencing, which was only a few years ago, we had about 130 or so of those instruments each churning out about 15 to 20 runs per day. Each run gave you about 100 pieces of sequence. You had 100 or so machines. And so you got on the order of a few thousand sequence reads, that's what we called them, because of the way the instrument read the information.
Now, since that time, 454 was first [of the new generation of sequencers] and then Solexa came, which was later bought out by Illumina, and ABI has the SOLiD platform. There's one from Helicos as well. And then several other third-generation sequencers, those first being the second generation, have come out. And what those do is greatly increase the parallelism with which you're able to process DNA and sequence it. So instead of a few thousand runs per day, or a few thousand reads per day, you may get a few million reads per run. And these runs, for some of the platforms, do take a little bit longer. But the parallelism of it increases your throughput tremendously. And so now we have about 35 to 40 of these highly parallel instruments in-house. And with that, we're able to sequence the human genome to complete coverage in less than a week.
So the main driver has been this change in the sequencing technology and the parallelism of it. It's a fundamentally different chemistry, different physics. The flipside of it is that we talked about the hockey stick, and so that hockey stick is the sequencing hockey stick, but it's brought several other hockey sticks along with it, mainly the amount of data that these things generate. And the amount of processing power that is required to process that data has increased greatly as well. Much faster than Moore's Law over the last two years or so. Whereas with those original instruments, you would generate on the order of megabytes per day, now we're doing tens of terabytes per day with these new instruments. And then processing that, instead of taking a single processor a few minutes, it can take a small cluster a few days to actually analyze the data from each of these runs.
Those are the main things. The enabling technology was the change in the sequencing chemistry itself. And then what had to come along with that was building these infrastructures to be able to track these things and process these things and store all of this data as the instruments increased in their abilities.
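The read counts mentioned in this answer can be multiplied out as a rough sanity check. The sketch below is only a back-of-envelope illustration: the instrument and run counts come from the interview, while the value standing in for "a few million reads per run" is an assumed placeholder, and the function names are made up for this example.

```python
# Back-of-envelope check of the sequencing throughput figures quoted above.
# Numbers marked ASSUMED are illustrative fill-ins, not from the interview.

READS_PER_CAPILLARY_RUN = 96      # one read per well of a 96-well plate
CAPILLARY_INSTRUMENTS = 130       # "about 130 or so of those instruments"
RUNS_PER_DAY_RANGE = (15, 20)     # "about 15 to 20 runs per day"

READS_PER_PARALLEL_RUN = 3_000_000  # ASSUMED stand-in for "a few million"


def capillary_reads_per_day(instruments, runs_per_day,
                            reads_per_run=READS_PER_CAPILLARY_RUN):
    """Reads produced per day by a fleet of capillary instruments."""
    return instruments * runs_per_day * reads_per_run


def per_run_gain(parallel_reads_per_run=READS_PER_PARALLEL_RUN,
                 capillary_reads_per_run=READS_PER_CAPILLARY_RUN):
    """How many capillary runs one highly parallel run is worth, by read count."""
    return parallel_reads_per_run / capillary_reads_per_run


if __name__ == "__main__":
    for runs in RUNS_PER_DAY_RANGE:
        one = capillary_reads_per_day(1, runs)
        fleet = capillary_reads_per_day(CAPILLARY_INSTRUMENTS, runs)
        print(f"{runs} runs/day: {one:,} reads per instrument, "
              f"{fleet:,} for the fleet")
    print(f"One parallel run ~ {per_run_gain():,.0f} capillary runs "
          f"(by read count)")
```

Multiplied out this way, a single capillary instrument lands at roughly one to two thousand reads per day and the whole fleet at a couple of hundred thousand, so one possible reading is that "a few thousand" is closer to a per-instrument figure. Read counts are also not the same thing as bases sequenced, since read lengths differ between platforms, so the per-run ratio should be taken as a loose indication of parallelism and nothing more.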
tags: genomics, informatics, interviews, open source, oscon
Challenges for the New Genomics
by Matt Wood | comments: 14
New guest blogger Matt Wood heads up the Production Software team at the Wellcome Trust Sanger Institute, where he builds tools and processes to manage tens of terabytes of data per day in support of genomic research. Matt will be exploring the intersection of data, computer technology, and science on Radar.
The original Human Genome Project was completed in 2003, after a 13-year worldwide effort and a billion-dollar budget. The quest to sequence all three billion letters of the human genome, which encodes a wide range of human characteristics including the risk of disease, has provided the foundation for modern biomedical research.
Through research built around the human genome, the scientific community aims to learn more about the interplay of genes, and the role of biologically active regions of the genome in maintaining health or causing disease. Since such active areas are often well conserved between species, and given the huge costs involved in sequencing a human genome, scientists have worked hard to sequence a wide range of organisms that span evolutionary history.
This has resulted in the publication of around 40 different species' genomes, ranging from C. elegans to the Chimpanzee, from the Opossum to the Orangutan. These genomic sequences have helped progress the state of the art of human genomic research, in part, by helping to identify biologically important genes.
Whilst there is great value in comparing genomes between species, the answers to key questions of an individual's genetic makeup can only be found by looking at individuals within the same species. Until recently, this has been prohibitively expensive. We needed a quantum leap in cost-effective, timely individual genome sequencing, a leap delivered by a new wave of technologies from companies such as Illumina, Roche and Applied Biosystems.
In the last 18 months, new horizons in genomic research have opened up, along with a number of new projects looking to make a big impact (the 1000 Genomes Project and International Cancer Genome Consortium to name but two). Despite the huge potential, these new technologies bring with them some tough challenges for modern biological research.
High throughput
For the first time, biology has become truly data driven. New short-read sequencing technologies offer orders of magnitude greater resolution when sequencing DNA, sufficient to detect the single-letter changes that could indicate an increased risk of disease. The cost of this enhanced resolution comes in the form of substantial data throughput requirements, with a single sequencing instrument generating terabytes of data a week, more than any biological protocol to date. The methods by which data of this scale can be efficiently moved, analyzed, and made available to scientific collaborators (not least the challenge of backing it up) are cause for intense activity and discussion in biomedical research institutes around the globe.
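To give a feel for what moving data of this scale involves, here is a minimal sketch that converts a weekly data volume into a sustained network rate and estimates how long a bulk transfer would take. The link speed, the efficiency factor, and the function names are all assumptions chosen for illustration; none of them describe any institute's actual network.

```python
# Rough data-movement arithmetic for "terabytes of data a week".
# Link speeds and the efficiency factor are ASSUMED values for illustration.

SECONDS_PER_WEEK = 7 * 24 * 3600
BITS_PER_TERABYTE = 8 * 10**12  # decimal terabytes


def sustained_mbit_per_s(terabytes_per_week):
    """Average rate (Mbit/s) needed to keep up with a weekly data volume."""
    return terabytes_per_week * BITS_PER_TERABYTE / SECONDS_PER_WEEK / 1e6


def transfer_hours(terabytes, link_gbit_per_s, efficiency=0.7):
    """Hours to move a data set over a link.

    efficiency is the ASSUMED fraction of nominal bandwidth actually
    achieved after protocol overhead and contention.
    """
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    usable_bits_per_s = link_gbit_per_s * 1e9 * efficiency
    return terabytes * BITS_PER_TERABYTE / usable_bits_per_s / 3600


if __name__ == "__main__":
    # Hypothetical scenario: one instrument producing 2 TB per week,
    # copied in one go over a 1 Gbit/s link.
    print(f"Sustained rate: {sustained_mbit_per_s(2):.1f} Mbit/s")
    print(f"Bulk copy time: {transfer_hours(2, 1):.1f} hours")
```

The figures scale linearly, so multiplying by the number of instruments gives a first estimate for a whole facility. Storage writes, analysis, and backups compete for the same bandwidth, which this sketch ignores.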
Very rapid change
Scientific research has always been a relatively dynamic realm to work in, but the novel requirements of these new technologies bring with them unprecedented levels of flux. Software tools built around these technologies are required to bend and flex with the same agility as the frequently updated and refined underlying laboratory protocols and analysis techniques. A new breed of development approaches, techniques and technologies is needed to help biological researchers add value to this data.
In a very short space of time the biological sciences have caught up with the data and analysis requirements of other large scale domains, such as high energy physics and astronomy. It is an exciting and challenging time to work in areas with such large scale requirements, and I look forward to discussing the role of distribution, architecture and the networked future of science here on Radar.
tags: genomics, informatics, science, software