UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

February 05, 2010
Newsletter: BioInform
By Vivien Marx

Scientists in the departments of human genetics and computer science at the University of California, Los Angeles, described the genomic sequence of the brain cancer cell line U87MG in a recent study. In their paper, which appeared in last week's PLoS Genetics, the team highlighted "enormous improvements in the throughput of data generation."

The scientists decided to rely mainly on open-source software for the project, putting in place an open-source analysis and data-management pipeline called SeqWare, which was developed in the lab. Bioinformatician Brian O'Connor, co-author of the PLoS study and a post-doctoral fellow in the Stan Nelson lab at UCLA, began developing the software two years ago, he told BioInform last week. He wanted to pick up where Illumina's software tools left off, he said. The platform now handles all data types and comprises a pipeline of analysis tools, a federated database structure, a LIMS, and a query engine. [BioInform 9/12/2008]

O'Connor said that the team is scaling up the software in several ways: it is being modularized so it can serve as a framework for other tools, and it is being deployed at other research centers that need second-generation sequence analysis and data management. He and a colleague in the lab are also porting the software to the Amazon Elastic Compute Cloud, or EC2, and are integrating an open-source database system so the tools and pipeline can scale from their current handling of scores of genomes to, potentially, hundreds or thousands of genomes. Separately, the lab is transitioning from a microarray core to a second-generation sequencing core, O'Connor said.

For the work in the paper, which relied on more than 30x genomic sequence coverage, the researchers applied a "novel" 50-base mate-paired strategy and 10 micrograms of input DNA to generate reads in five weeks of sequencing. The total reagent cost for the project was "under $30,000," which emboldened the researchers to call this genome "the least expensive published genome sequenced to date."

The study described the large amount of data generated for analysis in these types of whole-genome resequencing studies: the team generated many gigabases of raw color space data, a large fraction of which was mapped to the reference genome. The researchers used the Blat-like Fast Accurate Search Tool version 0.5.3, or BFAST 0.5.3, a tool developed in the Nelson lab, to align the two-and-a-half full sequencing runs from the ABI SOLiD, yielding slightly more than 1 billion 50-base-pair mate-paired reads that they used to identify SNVs, indels, structural variants, and translocations. A "fully gapped local alignment" on the two-base encoded data to maximize variant calling took four days on a 20-node, 8-core cluster, the team wrote. BFAST, a color- and nucleotide-space alignment tool, was in their view suited to obtaining "rapid and sensitive" alignment of the more than 1 billion resulting reads. Using an Agilent array and the Illumina Genome Analyzer, they also captured the exon sequence of more than 5,000 genes.

In large projects such as theirs, scientists have data files that may comprise 160-gigabyte sequence read files, or SRFs, and alignment files, O'Connor said. For 20x coverage of a human genome, the variant files run around 60 gigabytes in size, he said, and all of this needs to be efficiently processed, annotated, and easy to query.

To identify single nucleotide variants and small insertions and deletions, the team used the open-source Mapping and Assembly with Qualities, or MAQ, consensus model as implemented in the SAMtools software suite. For "the primary structural variation candidate search," the researchers used the "dtranslocations" utility from DNA Analysis, or DNAA, another set of tools from the Nelson lab. The team uploaded intensities, quality scores, and color space sequence for the U87 SOLiD runs to NCBI's Sequence Read Archive, and did the same for the intensities, quality scores, and nucleotide space sequence from the U87 exon-capture Illumina sequencing. SeqWare pipeline analysis programs were used to analyze the variant calls and store the data, and the team used the new SeqWare Query Engine web service to query both variant calls and annotations.

Beyond the Gap

O'Connor said he set out with SeqWare to address the gap in functionality left by the tools that ship with the sequencers, and to offer a combination of workflow management, sample tracking, data storage, and data-querying capabilities.
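For a rough sense of the data volumes behind figures like these, the minimal sketch below estimates fold coverage from read counts and read lengths. The genome size and the interpretation of the read count as read pairs are illustrative assumptions, not values taken from the paper.

# Rough coverage arithmetic for a whole-genome resequencing run.
# All inputs are illustrative assumptions, not figures from the U87MG paper.

HUMAN_GENOME_BP = 3.1e9  # approximate haploid human genome size


def fold_coverage(num_reads: float, read_length_bp: int,
                  genome_size_bp: float = HUMAN_GENOME_BP) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    return (num_reads * read_length_bp) / genome_size_bp


if __name__ == "__main__":
    # Hypothetical example: one billion 50-bp mate pairs (two reads per pair).
    pairs = 1.0e9
    reads = pairs * 2
    cov = fold_coverage(reads, 50)
    total_gb = reads * 50 / 1e9  # total bases sequenced, in gigabases
    print(f"~{total_gb:.0f} Gb sequenced, ~{cov:.0f}x average coverage")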

In particular, he said, he has been trying to find frameworks that are scalable and that can work beyond dozens of genomes. While sequencers have increased output 10-fold in the last two years, he explained, hardware and connectivity bandwidth are not scaling as quickly.

The Nelson lab, currently a microarray core, is becoming the sequencing center for a campus facility called the Center for High-Throughput Biology at UCLA and will offer sequencing and sequence-analysis services, O'Connor explained. The lab currently has two Illumina Genome Analyzers and one ABI SOLiD machine. The plan is to add two or three more ABI SOLiD machines, which will offer "quite a bit of capacity" to the community, even beyond the UCLA campus, he said. The new center targets whole-genome sequencing and exome sequencing, which O'Connor said are "the two protocols we want to offer to the community."

The PLoS Genetics work has put SeqWare to the test in a data-intensive production environment and is helping the software reach its next level of development, O'Connor explained. "We're in the process of replicating that production environment in multiple places," he said. For example, over the "next few weeks" he is installing SeqWare at Cedars-Sinai Medical Center. The software can now grow from being an "academic, single-install project to something that is more replicated across sites," he said. The pipeline is a system for running analytical workflows and includes standalone modules, XML workflows that define jobs, and an execution engine.

Cloud Bound

Another transition underway for SeqWare is computational. The Linux-based software currently must be installed on a local cluster, and O'Connor said the team is "trying to take that away and abstract that away and install it on the [Amazon cloud] EC2." O'Connor and UCLA programmer and analyst Jordan Mendler are working to port SeqWare to the cloud, and cloud computing is part of the Cedars-Sinai installation, which is slated to be completed by early April, he said. "We're looking at using the cloud as a means for bringing software like SeqWare and other applications to more people who do not have the resources that the Nelson lab has," he said.

A cloud demonstration project is up and running at UCLA but has not yet been made publicly available, O'Connor said. "It's developing pretty quickly," he said. Alignment with BFAST works on the Amazon EC2, but the web interface is not user-friendly, he said. SeqWare users begin by launching a master node. "In our demo case so far we have launched that master node in lab," O'Connor said, adding that it could be a single machine running either in the lab or on the cloud.
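The pipeline architecture described above — standalone modules, XML workflows that define jobs, and an execution engine — can be sketched in miniature. The XML schema, module names, and dispatch code below are invented for illustration and are not SeqWare's actual workflow format.

# Minimal sketch of an execution engine that runs standalone modules in the
# order given by an XML workflow definition. The schema and module names are
# hypothetical, not SeqWare's real format.
import xml.etree.ElementTree as ET

WORKFLOW_XML = """
<workflow name="demo-resequencing">
  <step module="align" input="reads.srf" output="aln.bam"/>
  <step module="call_variants" input="aln.bam" output="variants.db"/>
</workflow>
"""


def align(src: str, dst: str) -> None:
    print(f"[align] {src} -> {dst}")


def call_variants(src: str, dst: str) -> None:
    print(f"[call_variants] {src} -> {dst}")


# Registry of self-contained modules the engine knows how to run.
MODULES = {"align": align, "call_variants": call_variants}


def run_workflow(xml_text: str) -> None:
    """Parse the workflow definition and dispatch each step to its module."""
    root = ET.fromstring(xml_text)
    for step in root.findall("step"):
        module = MODULES[step.get("module")]
        module(step.get("input"), step.get("output"))


if __name__ == "__main__":
    run_workflow(WORKFLOW_XML)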

To port SeqWare to the cloud, he and Mendler are using a tool suite that is part of the Planning for Execution in Grids, or Pegasus, platform developed at the University of Southern California's Information Sciences Institute, he said. Machine images can be "fired up" with enough information to know that they should all talk to the master node, which would enable scientists to "set up a virtual cluster of three nodes or 300 nodes," O'Connor said. When SeqWare is launched on the cloud, it can either target the UCLA lab's cluster running Sun Grid Engine or target a new virtual cluster and run workflows on the virtual nodes. "The real reason we are doing this is [that], like [at] a lot of other places, UCLA is in a situation where we can't infinitely expand [our] infrastructure," O'Connor said. He said he believes that as the Nelson lab adds sequencers, it will be able to apply the same SeqWare workflow now in place, with administrative duties reduced to tasks such as load-balancing.

Adapting SeqWare for Pegasus over the last year has required the UCLA lab to "revamp the way we do workflows," he said. The software had been "pretty monolithic" with "homegrown code," and it was also rather "delicate," he said. Now it comprises individual "self-contained" modules that are more robust, O'Connor said. "What we get out of using Pegasus is the ability to target multiple clusters," he said. "It's a killer feature; it's just wonderful." For instance, scientists can move analysis to different computational locations when the need arises, he said.

Overall, SeqWare handles sequence read format files, or SRFs, a generic format for DNA sequence data developed by scientists at NCBI, the Broad Institute, the EBI, and other academic institutions, as well as at companies such as Illumina, Roche, Helicos, and Life Technologies/ABI. This spares researchers much of the concern over sequencer-specific file issues during analysis. "The idea is that since it's starting with the common file format, all of our code essentially works unchanged," O'Connor said. He added that the only exception is that BFAST has two modes: color space and nucleotide space. Alignments are stored in the BAM format, the compressed binary version of the Sequence Alignment/Map, or SAM, format, which "seems to be what most people are using," O'Connor said.

Start the Engine

For variant calling, standards are lacking, he said, but the SeqWare Query Engine can help handle that type of data and offers multiple types of querying. The engine has been in the works for six months and can support large databases containing more than 300 gigabytes of information, he said. It can also be distributed across a cluster, and researchers can query it using a representational state transfer, or RESTful, web interface.
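Because alignments land in the common SAM/BAM format regardless of which instrument produced the reads, downstream code can stay sequencer-agnostic. A minimal sketch using the pysam library, a Python wrapper around SAMtools, shows the idea; pysam and the file name are assumptions for illustration, not part of the toolchain described in the article.

# Skim a BAM file to show that analysis code can work against the common
# SAM/BAM format without caring which sequencer produced the reads.
# The file name is a placeholder; pysam is assumed to be installed.
import pysam


def summarize_bam(path: str, max_reads: int = 100_000) -> None:
    """Print simple mapping statistics for the first max_reads records."""
    total = mapped = proper_pairs = 0
    with pysam.AlignmentFile(path, "rb") as bam:
        for read in bam:  # sequential scan, no index required
            total += 1
            if not read.is_unmapped:
                mapped += 1
            if read.is_proper_pair:
                proper_pairs += 1
            if total >= max_reads:
                break
    print(f"{total} reads: {mapped} mapped, {proper_pairs} in proper pairs")


if __name__ == "__main__":
    summarize_bam("U87_alignments.bam")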

Variant calling in sequencing workflows leads to "massive files," O'Connor said. In the brain cancer cell line project, for example, the files ran to 150 gigabytes and describe all sequenced positions and the consensus calls. Performing analysis on that data meant scripting. "I spent a lot of time writing Perl scripts that were very custom," he said.

O'Connor developed the query engine in reaction to that experience and to the increasing number of genomes in experiments. "It's one way to get to the data instead of having to write a ton of different parsers for all my analysis components," he said. For the U87 work, he used the open-source Berkeley DB database system to create databases of genomic information such as SNVs, small indels, translocations, and coverage. In the SeqWare pipeline system, "basically one genome equals one database," he said. "If I had done it with the standard MySQL or Postgres databases, it would have been fine" up to around 100 genomes or so, but after that a single database "would implode, basically."

Now that he is porting SeqWare to the cloud, the challenge is again to avoid bottlenecks. "I ported the back end to something called HBase," which is part of Hadoop, an open-source project run under the Apache Software Foundation. Although similar to Berkeley DB, HBase has "no nice query engine like SQL, but you get scalability," he said. The key difference between Berkeley DB and HBase is less need for manual intervention, he said. "HBase itself knows how to distribute the database information and shred it across 10 different nodes," O'Connor said. "That's really nice because I don't have to think about where the database lives." Although the system is a "little rough around the edges," it is working and seems to be "a lot faster than Berkeley DB," he said.

As O'Connor works on the cloud-enabled SeqWare, he said he believes his system for sequence analysis and data management will be less fraught with database issues and will give researchers options to track metadata, run analytical workflows, and query data. As a SeqWare developer and user, he said, "it's really nice to be able to provide collaborators a URL and say 'Go crazy, you can query [as] much as you want.'" The alternative would require back-and-forth communication with collaborators, or custom scripts, to provide the data they might want, such as calls filtered for frameshift mutations.

SeqWare also has a "meta-database" to track analysis steps and experimental protocols, O'Connor said. Another challenge of working with second-gen sequencers is that research teams must run analyses many times and tweak the software as they work. "We did variant calling on U87 eight times," he said. "You need to keep track of that."
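The "hand collaborators a URL" model amounts to a simple REST call against the query engine. The sketch below illustrates that pattern; the endpoint, parameters, and response format are hypothetical stand-ins, not the actual SeqWare Query Engine API.

# Hedged sketch of querying a variant store over a RESTful interface.
# The URL, query parameters, and JSON response shape are hypothetical; they
# illustrate URL-based querying, not SeqWare's real API.
from typing import Optional

import requests

QUERY_ENGINE_URL = "http://example.org/seqware/queryengine/variants"  # placeholder


def fetch_variants(genome_id: str, chrom: str, start: int, stop: int,
                   annotation: Optional[str] = None) -> list:
    """Return variant records in a region, optionally filtered by annotation."""
    params = {"genome": genome_id, "chr": chrom, "start": start, "stop": stop}
    if annotation:
        params["annotation"] = annotation  # e.g. "frameshift"
    resp = requests.get(QUERY_ENGINE_URL, params=params, timeout=60)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # Hypothetical call: frameshift candidates on chromosome 17.
    for variant in fetch_variants("U87MG", "chr17", 1, 10_000_000,
                                  annotation="frameshift"):
        print(variant)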

He said that in the future researchers might all converge on a few vendor or open-source tools. Anticipating such a convergence, and because he does not have a large developer community, O'Connor said he decided not to try to make SeqWare an "all-encompassing" suite. Rather, he wanted it to act as "glue code" for a modularized package that lets scientists plug in other tools.

O'Connor said he has been shifting his focus "over the last six months or so" to accommodate this potential convergence. SeqWare is now "less about our own algorithms for calling variants or doing alignments" than it is about tracking metadata, experimental and computational methods, and archiving the results in a common format so they can be queried, O'Connor said. He chose that particular database focus "because I think that is something that isn't well-addressed by vendor tools," he said. "Regardless of scale, you have these issues, [and] as you scale up these issues become more and more critical." Slightly bad decisions at a small scale mean a task might run two hours instead of one, but for larger data analyses, those "bad" decisions become even more time-consuming and costly. According to O'Connor, another capability not well-addressed by vendor tools is how to provision jobs to a cluster, or to multiple cluster types, and how to handle submission engines.

As O'Connor wraps up his post-doctoral fellowship, and regardless of whether his next post is at a university or a company, he said he plans to continue developing SeqWare. "What I am looking forward to is starting up a really good core set of users in multiple locations who can give feedback," he said. "It's so important in this field right now to do collaborative development of software tools."

Cloud computing is part of that mix. Although some researchers shy away from the cloud's inherent costs, he said, its scalability pays off. To test what it would cost to perform alignments on the EC2 cloud, he did a "back of the napkin" calculation and found that a whole-genome alignment, including data transfer and computation time, "works out to be around $600," which, compared to reagent costs, "is not that bad at all," O'Connor said.

Researchers generally still face the challenge of getting data to the cloud. "At some point the pipe from UCLA to the cloud will become too small," he said, adding that the data transfer rate of five megabytes per second is not going to improve in the short term. And when he and his colleagues begin increasing current data generation tenfold by using Illumina's HiSeq 2000 or Life Technologies' SOLiD 4, bottlenecks will become acute.
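O'Connor's back-of-the-napkin numbers — roughly $600 per whole-genome alignment on EC2 and a five-megabyte-per-second pipe to the cloud — translate into concrete wait times. The short sketch below does that arithmetic; the 160-gigabyte file size is an illustrative assumption for one genome's worth of raw data, not a figure from the article.

# Back-of-the-napkin arithmetic on moving sequence data to the cloud.
# The 5 MB/s transfer rate and ~$600-per-genome figure come from the article;
# the 160 GB data volume is an illustrative assumption.

TRANSFER_RATE_MB_S = 5       # reported UCLA-to-cloud transfer rate
COST_PER_GENOME_USD = 600    # O'Connor's rough EC2 estimate (transfer + compute)


def transfer_hours(size_gb: float, rate_mb_s: float = TRANSFER_RATE_MB_S) -> float:
    """Hours needed to push size_gb gigabytes through the pipe."""
    return (size_gb * 1024) / rate_mb_s / 3600


if __name__ == "__main__":
    size_gb = 160  # hypothetical raw-data volume for one genome
    hours = transfer_hours(size_gb)
    print(f"{size_gb} GB at {TRANSFER_RATE_MB_S} MB/s is roughly {hours:.1f} hours of upload")
    print(f"Estimated alignment cost on EC2: ~${COST_PER_GENOME_USD} per genome")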
