UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production




February 5, 2010 | BioInform | By Vivien Marx

Scientists in the department of human genetics and the department of computer science at the University of California, Los Angeles, described the genomic sequence of the brain cancer cell line U87MG in a recent study. In their paper, which appeared in last week's PLoS Genetics, the team highlighted "enormous improvements in the throughput of data generation."

The scientists decided to rely mainly on open-source software for the project, putting in place an open-source analysis and data-management pipeline called SeqWare, which was developed in the lab.

Bioinformatician Brian O'Connor, co-author of the PLoS study and a post-doctoral fellow in the Stan Nelson lab at UCLA, began developing the software two years ago, he told BioInform last week. He wanted to pick up where Illumina's software tools left off, he said. The platform now handles all data types and comprises a pipeline of tools, a federated database structure, a LIMS, and a query engine. [BioInform 9/12/2008]

O'Connor said that the team is scaling up the software in several ways: it is being modularized so it can serve as a framework for other tools, and it is being deployed at other research centers that need second-gen sequence analysis and data management. He and a colleague in the lab are also porting the software to the Amazon Elastic Compute Cloud, or EC2, and are integrating an open-source database system so the tools and pipeline can scale from their current handling of scores of genomes to, potentially, hundreds or thousands of genomes.

Separately, the lab is transitioning from being a microarray core to a second-generation sequencing core, O'Connor said.

For the work in the paper, which relied on more than 30x genomic sequence coverage, the researchers applied a "novel" 50-base mate-paired strategy and 10 micrograms of input DNA to generate reads in five weeks of sequencing. The total reagent cost for the project was "under $30,000," which emboldened the researchers to call this genome "the least expensive published genome sequenced to date."

The study described the large amount of data generated for analysis in these types of whole-genome resequencing studies: the team generated 107.5 Gb of raw color space data, of which 55.51 Gb was mapped to the reference genome.

The researchers used the Blat-like Fast Accurate Search Tool version 0.5.3, or BFAST 0.5.3, a tool developed in the Nelson lab, to align the two-and-a-half full sequencing runs from the ABI SOLiD, yielding slightly more than 1 billion 50-base-pair mate-paired reads that they used to identify SNVs, indels, structural variants, and translocations. A "fully gapped local alignment" on the two-base encoded data to maximize variant calling took four days on a 20-node, 8-core cluster, the team wrote. BFAST, a color- and nucleotide-space alignment tool, was in their view suited to obtain "rapid and sensitive" alignment of the more than 1 billion resulting reads. Using an Agilent array, and applying the Illumina Genome Analyzer, they also captured the exon sequence of more than 5,000 genes.

In large projects such as theirs, scientists have data files that may comprise 160-gigabyte sequence read files, or SRFs, and alignment files, O'Connor said. For 20x coverage of a human genome, the variant files run around 60 gigabytes in size, he said, and all of this data needs to be efficiently processed, annotated, and easy to query.

To identify single nucleotide variants and small insertions and deletions, the team used the open-source assembly builder Mapping Assembly with Qualities, or MAQ, as implemented in the SAMtools software suite. For "the primary structural variation candidate search," the researchers used the "dtranslocations" utility from DNA Analysis, or DNAA, another set of tools from the Nelson lab.

The team uploaded intensities, quality scores, and color space sequence for the genomic sequence of the U87 SOLiD runs to NCBI's Sequence Read Archive, and did the same for intensities, quality scores, and nucleotide space sequence for the U87 exon capture Illumina sequence. The team used SeqWare pipeline analysis programs to analyze variant calls and to store the data, and used the new SeqWare Query Engine web service to query both variant calls and annotations.
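The data volumes quoted above lend themselves to a quick back-of-the-envelope check. A minimal Python sketch, assuming a roughly 3.1-gigabase human reference (an assumption for illustration; the paper's own coverage accounting may differ):

# Illustrative arithmetic only; the ~3.1 Gb genome size is an assumption.
GENOME_SIZE_BP = 3.1e9    # assumed haploid human genome size
RAW_BASES = 107.5e9       # raw color space data reported
MAPPED_BASES = 55.51e9    # bases mapped to the reference
READ_LENGTH_BP = 50       # 50-base mate-paired reads

# ~34.7x raw, in line with the ">30x" figure if it refers to raw sequence
print("raw coverage:    %.1fx" % (RAW_BASES / GENOME_SIZE_BP))
# ~17.9x of usable, mapped coverage
print("mapped coverage: %.1fx" % (MAPPED_BASES / GENOME_SIZE_BP))
# ~1.11 billion, matching the "slightly more than 1 billion" mapped reads
print("mapped reads:    %.2f billion" % (MAPPED_BASES / READ_LENGTH_BP / 1e9))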

Beyond the Gap

O'Connor said he set out with SeqWare to address a functionality gap that currently exists between vendor tools and those from sequencer manufacturers, and to offer a combination of workflow management, sample tracking, data storage, and data-querying capabilities.

In particular, he said, he has been trying to find frameworks that are scalable and that can work beyond dozens of genomes. Explaining this, he said that while sequencers have increased output 10-fold in the last two years, hardware and connectivity bandwidth are not scaling as quickly.

The Nelson lab, currently a microarray core, is going to be the sequencing center for the Center for High-Throughput Biology at UCLA and will offer sequencing and sequence-analysis services, O'Connor explained. The lab currently has two Illumina Genome Analyzers and one ABI SOLiD machine. The plan is to set up two or three more ABI SOLiD machines, which will offer "quite a bit of capacity" to the community, even beyond the UCLA campus, he said. The new center targets whole-genome sequencing and exome sequencing, which O'Connor said are "the two protocols we want to offer to the community."

The PLoS Genetics work has put SeqWare to the test in a data-intensive production environment and is helping SeqWare reach its next level of development, O'Connor explained. "We're in the process of replicating that production environment in multiple places," he said. For example, over the "next few weeks" he is installing SeqWare at Cedars-Sinai Medical Center. The software can now grow from being an "academic, single-install project to something that is more replicated across sites," he said.

The pipeline is a system for running analytical workflows and includes standalone modules, XML workflows that define jobs, and an execution engine.

Cloud Bound

Another transition underway for SeqWare is computational. The Linux-based software currently must be installed on a local cluster, but O'Connor said, "[W]e're trying to abstract that away and install it on the [Amazon cloud] EC2."

O'Connor and UCLA programmer and analyst Jordan Mendler are working to port SeqWare to the cloud, and cloud computing is part of the Cedars-Sinai installation, which is slated to be completed by early April, he said. "We're looking at using the cloud as a means for bringing software like SeqWare and other applications to more people who do not have the resources that the Nelson lab has," he said.

A cloud demonstration project is up and running at UCLA but has not yet been made publicly available, O'Connor said. "It's developing pretty quickly," he said. Alignment with BFAST works on the Amazon EC2, but the web interface is not yet user-friendly, he said.

SeqWare users begin by launching a master node. "In our demo case so far we have launched that master node in lab," O'Connor said, adding that it could be a single machine running either in the lab or on the cloud.
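The article does not detail the provisioning code, but launching a master node and a virtual cluster of workers on EC2 might look roughly like the following sketch using the boto library; the AMI ID, key pair, security group, and user-data handshake are hypothetical placeholders, not SeqWare's actual code:

import time
import boto.ec2

# Connect using AWS credentials from the environment or boto config.
conn = boto.ec2.connect_to_region("us-east-1")

# Launch a single master node that the workers will report to.
master = conn.run_instances(
    "ami-12345678",               # hypothetical machine image
    instance_type="m1.large",
    key_name="seqware-demo",      # hypothetical key pair
    security_groups=["seqware"],  # hypothetical security group
).instances[0]

# The private address is not assigned instantly; poll until it is.
while master.private_ip_address is None:
    time.sleep(5)
    master.update()

# Fire up the virtual cluster; "three nodes or 300 nodes" is just max_count.
# user_data carries enough information for each image to find the master.
workers = conn.run_instances(
    "ami-12345678",
    min_count=3,
    max_count=3,
    instance_type="m1.large",
    key_name="seqware-demo",
    security_groups=["seqware"],
    user_data="MASTER_HOST=%s" % master.private_ip_address,
).instances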

To port SeqWare to the cloud, he and Mendler are using a tool suite that is part of the Planning for Execution in Grids, or Pegasus, platform, developed at the University of Southern California's Information Sciences Institute, he said. Machine images can be "fired up" with enough information to know that they should all talk to the master node, which enables scientists to "set up a virtual cluster of three nodes or 300 nodes," O'Connor said. When SeqWare is launched on the cloud, it can either target the UCLA lab's cluster running Sun Grid Engine, or it can target a new virtual cluster and run workflows on the virtual nodes.

"The real reason we are doing this is [that], like [at] a lot of other places, UCLA is in a situation where we can't infinitely expand our infrastructure," O'Connor said. He said he believes that as the Nelson lab adds sequencers, it will be able to apply the same SeqWare workflow now in place, with administrative duties reduced to tasks such as load balancing.

Adapting SeqWare for Pegasus over the last year has required the UCLA lab to "revamp the way we do workflows," he said. The software had been "pretty monolithic" with "homegrown code," and it was also rather "delicate," he said. Now it comprises individual "self-contained" modules that are more robust, O'Connor said. "What we get out of using Pegasus is the ability to target multiple clusters," he said. "It's a killer feature; it's just wonderful." For instance, scientists can move analysis to different computational locations when the need arises, he said.

Overall, SeqWare handles sequence read format files, or SRFs, a generic format for DNA sequence data developed by scientists at NCBI, the Broad Institute, the EBI, and other academic institutions, as well as at companies such as Illumina, Roche, Helicos, and Life Technologies/ABI. Starting from a common format means researchers need not be as concerned about sequencer-specific file issues during analysis as they currently are. "The idea is that since it's starting with the common file format, all of our code essentially works unchanged," O'Connor said. He added that the only exception is that BFAST has two modes: color space and nucleotide space.

Alignments are stored in the BAM format, the compressed binary version of the Sequence Alignment/Map, or SAM, format, which "seems to be what most people are using," O'Connor said.

Start the Engine

For variant calling, standards are lacking, he said, but the SeqWare Query Engine can help handle that type of data and offers multiple types of querying. The engine has been in the works for six months and can support large databases containing more than 300 gigabytes of information, he said. It can also be distributed across a cluster, and researchers can query it through a representational state transfer, or RESTful, web interface.
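The article does not document the query engine's actual routes, but a RESTful variant query from a collaborator's side might look something like this sketch; the endpoint URL, filter parameters, and response fields are invented for illustration:

import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; the real SeqWare Query Engine's routes may differ.
BASE_URL = "http://queryengine.example.org/variants"

def fetch_variants(genome, chrom, start, end, min_coverage=10):
    """Fetch variant calls in a region, filtered by minimum coverage."""
    params = urllib.parse.urlencode({
        "genome": genome,
        "chrom": chrom,
        "start": start,
        "end": end,
        "min_coverage": min_coverage,
    })
    with urllib.request.urlopen("%s?%s" % (BASE_URL, params)) as resp:
        return json.load(resp)

# For example: variant calls in a slice of chromosome 17 of the U87MG genome.
for variant in fetch_variants("U87MG", "chr17", 7500000, 7600000):
    print(variant["position"], variant["ref"], variant["alt"])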

Variant calling in a sequencing workflow leads to "massive files," O'Connor said. In the brain cancer cell line project, for example, the files ran to 150 gigabytes and describe all sequenced positions and the consensus calls. Performing analysis on that data meant custom scripting: "I spent a lot of time writing Perl scripts that were very custom," he said. O'Connor developed the query engine in reaction to that experience and to the increasing number of genomes in experiments. "It's one way to get to the data instead of having to write a ton of different parsers for all my analysis components," he said.

For the U87 work, he used the open-source Berkeley DB database system to create databases of genomic information such as variants, SNVs, small indels, translocations, and coverage information. In the SeqWare pipeline system, "basically one genome equals one database," he said. "If I had done it with the standard MySQL or Postgres databases, it would have been fine" up to around 100 genomes or so, but beyond that a single database "would implode, basically."

Now that he is porting SeqWare to the cloud, the challenge is again to avoid bottlenecks. "I ported the back end to something called HBase," which is part of Hadoop, an open-source volunteer project run under the Apache Software Foundation. Although similar to Berkeley DB, HBase has "no nice query engine like SQL, but you get scalability," he said. The key difference between Berkeley DB and HBase is less need for manual intervention, he said. "HBase itself knows how to distribute the database information and shard it across 10 different nodes," O'Connor said. "That's really nice because I don't have to think about where the database lives." Although the system is "a little rough around the edges," it is working and seems to be "a lot faster than Berkeley DB," he said.
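A minimal sketch of the row-key idea behind such an HBase-backed variant store, using the happybase client; the host, table, and column names are illustrative assumptions rather than SeqWare's actual schema:

import happybase

# Thrift gateway host is a placeholder.
conn = happybase.Connection("hbase-master.example.org")
table = conn.table("variants")   # assumed table with a "call" column family

def row_key(genome, chrom, pos):
    # Zero-padded positions keep keys lexicographically ordered, so a
    # range scan over a genomic region hits one contiguous key range,
    # which HBase can split across region servers automatically.
    return ("%s.%s.%010d" % (genome, chrom, pos)).encode()

# Store one SNV call.
table.put(row_key("U87MG", "chr17", 7578406), {
    b"call:ref": b"C",
    b"call:alt": b"T",
    b"call:coverage": b"23",
})

# Range scan: all calls for U87MG on chr17 between two positions.
for key, data in table.scan(row_start=row_key("U87MG", "chr17", 7500000),
                            row_stop=row_key("U87MG", "chr17", 7600000)):
    print(key, data[b"call:ref"], data[b"call:alt"])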

As O'Connor works on the cloud computing-enabled SeqWare, he said he believes his system for sequence analysis and data management will be less fraught with database issues and will give researchers options to track metadata, run analytical workflows, and query data. As a SeqWare developer and user, he said, "it's really nice to be able to provide collaborators a URL and say 'Go crazy, you can query [as] much as you want.'" The alternative would require collaborators to request data over email, or would make custom scripting necessary to provide the data they might want, such as data filtered for frameshift mutations.

SeqWare also has a "meta-database" to track analysis steps and experimental protocols, O'Connor said. Another challenge of performing experiments with second-gen sequencers is that research teams must run them many times and tweak the software as they work. "We did variant calling on U87 eight times," he said. "You need to keep track of that."

He said that in the future researchers might all converge on a few vendor or open-source tools. Anticipating such a convergence, and because he does not have a large developer community, O'Connor said he decided not to try to make SeqWare an "all-encompassing" suite. Rather, he wanted it to act as "glue code" for a modularized package that lets scientists use other tools. O'Connor said he has been shifting his focus "over the last six months or so" to accommodate this potential convergence. SeqWare is now "less about our own algorithms for calling variants or doing alignments" than it is about tracking metadata, experimental and computational methods, and archiving the results in a common format so they can be queried, O'Connor said.

He said he chose this database focus "because I think that is something that isn't well-addressed by vendor tools. Regardless of scale, you have these issues, [and] as you scale up these issues become more and more critical." A slightly bad decision at small scale means a task might run two hours instead of one, but for larger data analyses, those bad decisions become far more time-consuming and costly. According to O'Connor, another capability that vendor tools do not currently address well is how to provision jobs to a cluster, or to multiple cluster types, and how to handle submission engines.

As O'Connor wraps up his post-doctoral fellowship, and regardless of whether his next post will be at a university or a company, he said he plans to continue developing SeqWare. "What I am looking forward to is starting up a really good core set of users in multiple locations who can give feedback," he said. "It's so important in this field right now to do collaborative development of software tools."

Cloud computing is part of that mix. Although some researchers shy away from the cloud's inherent costs, he said its scalability pays off. To estimate what it will cost to perform alignments on the EC2 cloud, he did a "back of the napkin" calculation and found that a whole-genome alignment, including data transfer and computation time, "works out to be around $600," which, compared to reagent costs, "is not that bad at all," O'Connor said.

Researchers still generally face the challenge of getting data to the cloud. "At some point the pipe from UCLA to the cloud will become too small," he said, adding that the data transfer rate of five megabytes per second is not going to improve in the short term. And when he and his colleagues begin increasing current data generation tenfold by using Illumina's HiSeq 2000 or Life Technologies' SOLiD 4, bottlenecks will become acute.
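That bottleneck is easy to quantify. A short calculation using the five-megabyte-per-second rate he cites and the 160-gigabyte SRF size quoted earlier in the article:

# Rough arithmetic behind the transfer bottleneck O'Connor describes.
TRANSFER_RATE_MB_S = 5.0   # quoted UCLA-to-cloud transfer rate
SRF_SIZE_GB = 160          # size of one sequence read file, per the article

def transfer_hours(size_gb, rate_mb_s=TRANSFER_RATE_MB_S):
    return size_gb * 1024 / rate_mb_s / 3600   # GB -> MB -> seconds -> hours

print("one 160 GB SRF:  %.1f hours" % transfer_hours(SRF_SIZE_GB))             # ~9.1 hours
print("10x data volume: %.1f days" % (transfer_hours(SRF_SIZE_GB * 10) / 24))  # ~3.8 days

At the quoted rate, a single SRF already takes most of a working day to move, and a tenfold increase in data generation pushes the transfer alone to several days per run.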