UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production
February 05, 2010 | BioInform | By Vivien Marx

Scientists at the department of human genetics and the department of computer science at the University of California, Los Angeles, described the genomic sequence of the brain cancer cell line U87MG in a recent study. In their paper, which appeared in last week's PLoS Genetics, the team highlighted "enormous improvements in the throughput of data generation."

The scientists decided to mainly use open source software for the project, putting in place an open-source analysis and data-management pipeline called SeqWare, which was developed in the lab.

Bioinformatician Brian O'Connor, co-author of the PLoS study and post-doctoral fellow in the Stan Nelson lab at UCLA, began developing the software two years ago, he told BioInform last week. He wanted to pick up where Illumina's software tools left off, he said. The platform now handles all data types and comprises a pipeline of tools, a federated database structure, a LIMS, and a query engine. [BioInform 9/12/2008]

O'Connor said that the team is scaling up the software in several ways: it is being modularized in order to be used as a framework for other tools, and it is being deployed at other research centers that need second-gen sequence analysis and data management. He and a colleague in the lab are also porting the software to the Amazon Elastic Compute Cloud, or EC2, and are integrating an open source database system so the tools and pipeline can scale from their current handling of scores of genomes to, potentially, hundreds or thousands of genomes.

Separately, the lab is transitioning from being a microarray core to a second-generation sequencing core, O'Connor said.
For the work in the paper, which relied on more than 30x genomic sequence coverage, the researchers applied a "novel" 50-base mate-paired strategy and 10 micrograms of input DNA to generate reads in five weeks of sequencing. The total reagent cost for the project was "under $30,000," which emboldened the researchers to call this genome "the least expensive published genome sequenced to date."

The study described the large amount of data generated for analysis in these types of whole-genome resequencing studies: the team generated Gb of raw color space data, of which Gb was mapped to the reference genome. The researchers used the BLAT-like Fast Accurate Search Tool version 0.5.3, or BFAST 0.5.3, a tool developed in the Nelson lab, to align the two-and-a-half full sequencing runs from the ABI SOLiD, yielding slightly more than 1 billion 50-base-pair mate-paired reads that they used to identify SNVs, indels, structural variants, and translocations. A "fully gapped local alignment" on the two-base encoded data to maximize variant calling took four days on a 20-node, 8-core cluster, the team wrote. BFAST, a color- and nucleotide-space alignment tool, was in their view suited to obtain "rapid and sensitive" alignment of the more than 1 billion resulting reads. Using an Agilent array, and applying the Illumina Genome Analyzer, they also captured the exon sequence of more than 5,000 genes.

In large projects such as theirs, scientists have data files that may comprise 160-gigabyte sequence read files, or SRFs, and alignment files, O'Connor said. For 20x coverage of a human genome, the variant files run around 60 gigabytes in size, he said, all of which needs to be efficiently processed, annotated, and easy to query.

To identify single nucleotide variants and small insertions and deletions, the team used the open-source assembly builder Mapping Assembly with Qualities, or MAQ, as implemented in the SAMtools software suite.
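As a rough sanity check on those figures, the coverage arithmetic can be worked through directly. Two assumptions the article does not state: a roughly 3.1 Gb human reference, and that the "1 billion mate-paired reads" figure counts pairs, i.e. two 50-base reads each.

```python
# Back-of-the-envelope sequencing coverage: total mapped bases / genome size.
# Assumptions (not from the article): a ~3.1 Gb human reference, and that the
# "1 billion" figure counts mate pairs, each contributing two 50-base reads.

GENOME_SIZE = 3.1e9     # approximate haploid human genome, in bases
READ_PAIRS = 1.0e9      # "slightly more than 1 billion" mate-paired reads
READ_LENGTH = 50        # bases per read
BASES_PER_PAIR = 2 * READ_LENGTH

total_bases = READ_PAIRS * BASES_PER_PAIR
coverage = total_bases / GENOME_SIZE

print(f"total bases: {total_bases:.2e}")  # 1.00e+11
print(f"coverage: {coverage:.1f}x")       # 32.3x
```

Under those assumptions the result lands slightly above 32x, consistent with the "more than 30x" coverage the paper reports.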
For "the primary structural variation candidate search," the researchers used the "dtranslocations" utility from DNA Analysis, or DNAA, another set of tools from the Nelson lab. The team uploaded intensities, quality scores, and color space sequence for the genomic sequence of U87 SOLiD runs to the NCBI's Sequence Read Archive, and did the same for intensities, quality scores, and nucleotide space sequence for the U87 exon capture Illumina sequence. The team used SeqWare pipeline analysis programs to analyze variant calls and store the data, and used the new SeqWare Query Engine web service to query both variant calls and annotations.

Beyond the Gap

O'Connor said he set out with SeqWare to address a functionality gap that currently exists between vendor tools and those from sequencer manufacturers, and to offer a combination of workflow management, sample tracking, data storage, and data-querying possibilities.
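Workflow management of the kind O'Connor describes typically boils down to self-contained steps plus an engine that runs them in dependency order. A minimal Python sketch of that pattern; the names and API here are hypothetical illustrations, not SeqWare's actual code:

```python
# Hypothetical sketch of a "standalone modules + execution engine" workflow
# pattern. Each module registers its dependencies; the engine runs modules
# in topological order so prerequisites always finish first.

class Engine:
    def __init__(self):
        self.modules = {}  # name -> (dependencies, callable)

    def module(self, name, deps=()):
        def register(fn):
            self.modules[name] = (deps, fn)
            return fn
        return register

    def run(self):
        done, order = set(), []
        def visit(name):
            if name in done:
                return
            deps, fn = self.modules[name]
            for dep in deps:
                visit(dep)       # run prerequisites first
            fn()
            done.add(name)
            order.append(name)
        for name in self.modules:
            visit(name)
        return order

engine = Engine()

@engine.module("align")
def align():
    pass  # placeholder: e.g. run an aligner such as BFAST

@engine.module("call_variants", deps=("align",))
def call_variants():
    pass  # placeholder: e.g. call SNVs and indels on the alignments

print(engine.run())  # ['align', 'call_variants']
```

The payoff of this structure is the one the article describes: each step is replaceable in isolation, so a pipeline built this way is less "monolithic" and "delicate" than hand-wired scripts.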
In particular, he said he has been trying to find frameworks that are scalable and can work beyond dozens of genomes. Explaining this, he said that while sequencers have increased output 10-fold in the last two years, hardware and connectivity bandwidth are not scaling as quickly.

The Nelson lab, a microarray core, is going to be the sequencing center for a campus center called the Center for High-Throughput Biology at UCLA, and will offer sequencing and sequence-analysis services, O'Connor explained. The lab currently has two Illumina Genome Analyzers and one ABI SOLiD machine. The plan is to set up two or three more ABI SOLiD machines that will offer "quite a bit of capacity" to the community, even beyond the UCLA campus, he said. The new center targets whole-genome sequencing and exome sequencing, which O'Connor said are "the two protocols we want to offer to the community."

The PLoS Genetics work has put SeqWare to the test in a data-intensive production environment and is helping SeqWare reach its next level of development, O'Connor explained. "We're in the process of replicating that production environment in multiple places," he said. For example, over the "next few weeks," he is installing SeqWare at Cedars-Sinai Medical Center. The software can now grow from being an "academic, single-install project to something that is more replicated across sites," he said. The pipeline is a system for running analytical workflows and includes standalone modules, XML workflows that define jobs, and an execution engine.

Cloud Bound

Another transition underway for SeqWare is computational. The Linux-based software must currently be installed on a local cluster, and O'Connor said "[w]e're trying to take that away and abstract that away and install it on the [Amazon cloud] EC2." O'Connor and UCLA programmer and analyst Jordan Mendler are working to port SeqWare to the cloud, and cloud computing is part of the Cedars-Sinai installation.
It is slated to be completed by early April, he said. "We're looking at using the cloud as a means for bringing software like SeqWare and other applications to more people who do not have the resources that the Nelson lab has," he said.

A cloud demonstration project is up and running at UCLA but has not yet been made publicly available, O'Connor said. "It's developing pretty quickly," he said. Alignment with BFAST works on the Amazon EC2, but the web interface is not user-friendly, he said. SeqWare users begin by launching a master node. "In our demo case so far we have launched that master node in lab," O'Connor said, adding that it could be a single machine running either in lab or on the cloud.
To port SeqWare to the cloud, he and Mendler are using a tool suite that is part of the Planning for Execution in Grids, or Pegasus, platform developed at the University of Southern California's Information Sciences Institute, he said. Machine images can be "fired up" with enough information to know that they should all talk to the master node, which would enable scientists to "set up a virtual cluster of three nodes or 300 nodes," O'Connor said. When SeqWare is launched on the cloud, it can either target the UCLA lab's cluster running Sun Grid Engine or it can target a new virtual cluster and enable workflows on the virtual nodes.

"The real reason we are doing this is [that], like [at] a lot of other places, UCLA is in a situation where we can't infinitely expand our infrastructure," O'Connor said. He said he believes that as the Nelson lab adds sequencers, it will be able to apply the same SeqWare workflow now in place, with administrative duties reduced to tasks such as load-balancing.

Adapting SeqWare for Pegasus over the last year has required the UCLA lab to "revamp the way we do workflows," he said. The software had been "pretty monolithic" with "homegrown code," and it was also rather "delicate," he said. Now it comprises individual "self-contained" modules that are more robust, O'Connor said. "What we get out of using Pegasus is the ability to target multiple clusters," he said. "It's a killer feature; it's just wonderful." For instance, scientists can move analysis to different computational locations when the need arises, he said.

Overall, SeqWare handles sequence read format, or SRF, files, a generic format for DNA sequence data developed by scientists at NCBI, the Broad Institute, the EBI, and other academic institutions, as well as at companies such as Illumina, Roche, Helicos, and Life Technologies/ABI.
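The multiple-cluster targeting O'Connor credits to Pegasus amounts to keeping the workflow fixed while swapping the execution site. A hedged sketch of that dispatch idea; the class, site names, and submit mechanics are illustrative, not Pegasus's real API (though `qsub` and `condor_submit` are the real submit commands for Sun Grid Engine and Condor):

```python
# Illustrative sketch of "target multiple clusters": the same workflow is
# dispatched unchanged to whichever execution site is selected. Site names
# and the submit mechanics are assumptions, not Pegasus's actual interface.

class ExecutionSite:
    def __init__(self, name, submit_cmd):
        self.name = name
        self.submit_cmd = submit_cmd

    def submit(self, job):
        # A real system would shell out to the scheduler here;
        # this sketch just returns the command line that would run.
        return f"{self.submit_cmd} {job}"

SITES = {
    "local_sge": ExecutionSite("lab cluster (Sun Grid Engine)", "qsub"),
    "ec2_virtual": ExecutionSite("EC2 virtual cluster (Condor)", "condor_submit"),
}

def run_workflow(jobs, site_name):
    site = SITES[site_name]
    return [site.submit(job) for job in jobs]

print(run_workflow(["align.sh", "call_variants.sh"], "local_sge"))
# ['qsub align.sh', 'qsub call_variants.sh']
```

Moving an analysis to a different computational location then means changing one argument, not rewriting the workflow.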
Working from a common format can help researchers by not requiring them to be as concerned about sequencer-specific file issues in analysis as they currently are. "The idea is that since it's starting with the common file format, all of our code essentially works unchanged," O'Connor said. He added that the only exception is that BFAST has two modes: color space and nucleotide space. Alignments are stored in the BAM format, the compressed binary version of the Sequence Alignment/Map, or SAM, format, which "seems to be what most people are using," O'Connor said.

Start the Engine

For variant calling, standards are lacking, he said, but added that the SeqWare Query Engine can help handle that type of data, and offers multiple types of querying. The engine has been in the works for six months and can support large databases containing more than 300 gigabytes of information, he said. It can also be distributed
across a cluster, and researchers can query it using a representational state transfer, or RESTful, web interface architecture.

Variant calling in a sequencing workflow leads to "massive files," O'Connor said. For example, in the brain cancer cell line project, the files ran 150 gigabytes in size and describe all sequenced positions and the consensus calls. However, performing analysis on that data meant scripting. "I spent a lot of time writing Perl scripts that were very custom," he said. O'Connor developed the query engine in reaction to that experience and the increasing number of genomes in experiments. "It's one way to get to the data instead of having to write a ton of different parsers for all my analysis components," he said.

For the U87 work, he used the Berkeley DB open-source developer database system to create databases of genomic information such as variants, SNVs, small indels, translocations, and coverage information. In the SeqWare pipeline system, "basically one genome equals one database," he said. "If I had done it with the standard MySQL or Postgres databases, it would have been fine" to around 100 genomes or so, but after that a single database "would implode, basically."

Now that he is porting SeqWare to the cloud, the challenge is again to avoid bottlenecks. "I ported the back end to something called HBase," which is part of the Hadoop project, an open-source volunteer project run under the Apache Software Foundation. Although similar to Berkeley DB, HBase has "no nice query engine like SQL, but you get scalability," he said. The key difference between Berkeley DB and HBase is less need for manual intervention, he said. "HBase itself knows how to distribute the database information and shard it across 10 different nodes," O'Connor said. "That's really nice because I don't have to think about where the database lives."
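The "one genome equals one database" layout is essentially a key-value design, which is what made the later swap from Berkeley DB to HBase possible. A sketch using Python's standard-library `dbm` module as a stand-in key-value engine; the chromosome:position key scheme and the variant fields are illustrative assumptions, not the SeqWare Query Engine's actual schema:

```python
# Sketch of a "one genome = one database" key-value layout, with Python's
# stdlib dbm module standing in for Berkeley DB. The key scheme (zero-padded
# chrom:position) and the record fields are illustrative assumptions only.

import dbm
import json
import os
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "u87_variants")  # one DB per genome

# Write a variant record keyed by a sortable, zero-padded coordinate.
with dbm.open(db_path, "c") as db:
    db[b"chr10:000043606"] = json.dumps(
        {"ref": "G", "alt": "T", "type": "SNV", "depth": 34}
    ).encode()

# Read it back: a point lookup needs no SQL engine, only the key.
with dbm.open(db_path, "r") as db:
    variant = json.loads(db[b"chr10:000043606"])

print(variant["type"], variant["depth"])  # SNV 34
```

Because every record is a plain key-value pair, the same layout maps onto HBase, which handles the distribution of those pairs across nodes on its own.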
Although the system is a "little rough around the edges," it is working, and it seems to be "a lot faster than Berkeley DB," he said.

As O'Connor works on the cloud computing-enabled SeqWare, he said he believes his system for sequence analysis and data management will be less fraught with database issues and will give researchers options to track metadata, run analytical workflows, and query data. As a SeqWare developer and user, he said, "it's really nice to be able to provide collaborators a URL and say 'Go crazy, you can query [as] much as you want.'" The alternative would require collaborators to go back and forth with him, or would make scripting necessary, to provide the data they might want, such as data filtered for frameshift mutations.

SeqWare also has a "meta-database" to track analysis steps and experimental protocols, O'Connor said. Another challenge with performing experiments with second-gen sequencers is that research teams must run them many times and tweak the software as they work. "We did variant calling on U87 eight times," he said. "You need to keep track of that."
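Keeping track of repeated runs like those eight rounds of variant calling is what a meta-database boils down to: one provenance record per execution, capturing the tool, version, and parameters used. A minimal sketch; the field names and parameter values are assumptions, not SeqWare's actual schema:

```python
# Minimal sketch of run/provenance tracking, in the spirit of the
# "meta-database" described in the article. Field names and the example
# parameters are illustrative assumptions, not SeqWare's schema.

import datetime

runs = []

def record_run(step, tool, version, params):
    runs.append({
        "run_id": len(runs) + 1,
        "step": step,
        "tool": tool,
        "version": version,
        "params": params,
        "timestamp": datetime.datetime.now().isoformat(),
    })

# Re-running a step with tweaked parameters produces a distinct record,
# so every repeat of the analysis remains recoverable later.
record_run("variant_calling", "MAQ", "0.7.1", {"min_depth": 3})
record_run("variant_calling", "MAQ", "0.7.1", {"min_depth": 5})

print(len(runs), runs[-1]["params"])  # 2 {'min_depth': 5}
```

With records like these, answering "which parameters produced this variant list?" is a lookup rather than an archaeology exercise.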
He said that in the future researchers might all converge on a few vendor or open-source tools. Anticipating such a convergence, and because he does not have a large developer community, O'Connor said he decided not to try to make SeqWare an "all-encompassing" suite. Rather, he wanted it to act as "glue code" for a modularized package that enables scientists to use other tools.

O'Connor said he has been shifting his focus "over the last six months or so" to accommodate this potential convergence. SeqWare is now "less about our own algorithms for calling variants or doing alignments" than it is about tracking metadata, experimental and computational methods, and archiving the results in a common format so they can be queried, O'Connor said. He said he chose a particular database focus "because I think that is something that isn't well-addressed by vendor tools. Regardless of scale, you have these issues, [and] as you scale up these issues become more and more critical." Slightly bad decisions at a small scale mean a task might run two hours instead of one, but for larger data analysis, those "bad" decisions can be even more time-consuming and costly.

According to O'Connor, another capability that is currently not addressed well by vendor tools is how to provision jobs to a cluster or multiple cluster types, and how to handle submission engines.

As O'Connor wraps up his post-doctoral fellowship, and regardless of whether his next post will be at a university or a company, he said he plans to continue developing SeqWare. "What I am looking forward to is starting up a really good core set of users in multiple locations who can give feedback," he said. "It's so important in this field right now to do collaborative development of software tools." Cloud computing is part of that mix. Although some researchers shy away from the cloud's inherent costs, he says its scalability pays off.
To test what it will cost to perform alignments on the EC2 cloud, he did a "back of the napkin" calculation and found that a whole-genome alignment, including data transfer and computation time, "works out to be around $600," which, compared to reagent costs, "is not that bad at all," O'Connor said.

Researchers generally still face the challenge of getting data to the cloud. "At some point the pipe from UCLA to the cloud will become too small," he said, adding that the data transfer rate of five megabytes per second is not going to improve in the short term. And when he and his colleagues begin increasing current data generation tenfold by using Illumina's HiSeq 2000 or Life Technologies' SOLiD 4, bottlenecks will become acute.
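That five-megabyte-per-second pipe translates directly into transfer time. Taking the 160-gigabyte SRF size mentioned earlier as a working assumption for one genome's worth of data, plus the tenfold scale-up the article anticipates:

```python
# How long a ~5 MB/s pipe takes to move sequencing files to the cloud.
# The 160 GB figure reuses the SRF size mentioned earlier in the article
# as an assumption; the 10x row models the HiSeq 2000 / SOLiD 4 scale-up.

RATE_MB_S = 5.0  # data transfer rate quoted in the article

def transfer_hours(size_gb):
    return size_gb * 1024 / RATE_MB_S / 3600

print(f"160 GB SRF: {transfer_hours(160):.1f} h")  # 9.1 h
print(f"10x data:   {transfer_hours(1600):.1f} h")  # 91.0 h
```

At roughly nine hours per genome today, and nearly four days after a tenfold increase in output, the arithmetic makes plain why O'Connor expects the network, not the compute, to become the acute bottleneck.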
More informationWell packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances
INSIGHT Oracle's All- Out Assault on the Big Data Market: Offering Hadoop, R, Cubes, and Scalable IMDB in Familiar Packages Carl W. Olofson IDC OPINION Global Headquarters: 5 Speen Street Framingham, MA
More informationWhy NoSQL? Your database options in the new non- relational world. 2015 IBM Cloudant 1
Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots
More informationChapter 4 Cloud Computing Applications and Paradigms. Cloud Computing: Theory and Practice. 1
Chapter 4 Cloud Computing Applications and Paradigms Chapter 4 1 Contents Challenges for cloud computing. Existing cloud applications and new opportunities. Architectural styles for cloud applications.
More informationUsing Illumina BaseSpace Apps to Analyze RNA Sequencing Data
Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data The Illumina TopHat Alignment and Cufflinks Assembly and Differential Expression apps make RNA data analysis accessible to any user, regardless
More informationAligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap
Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed
More informationSpeak<geek> Tech Brief. RichRelevance Distributed Computing: creating a scalable, reliable infrastructure
3 Speak Tech Brief RichRelevance Distributed Computing: creating a scalable, reliable infrastructure Overview Scaling a large database is not an overnight process, so it s difficult to plan and implement
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationCloud Computing with Microsoft Azure
Cloud Computing with Microsoft Azure Michael Stiefel www.reliablesoftware.com development@reliablesoftware.com http://www.reliablesoftware.com/dasblog/default.aspx Azure's Three Flavors Azure Operating
More informationData Management & Storage for NGS
Data Management & Storage for NGS 2009 Pre-Conference Workshop Chris Dagdigian BioTeam Inc. Independent Consulting Shop: Vendor/technology agnostic Staffed by: Scientists forced to learn High Performance
More informationDevelopment of Bio-Cloud Service for Genomic Analysis Based on Virtual
Development of Bio-Cloud Service for Genomic Analysis Based on Virtual Infrastructure 1 Jung-Ho Um, 2 Sang Bae Park, 3 Hoon Choi, 4 Hanmin Jung 1, First Author Korea Institute of Science and Technology
More informationHadoop-BAM and SeqPig
Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More information14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)
Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,
More informationBUILDING A SCALABLE BIG DATA INFRASTRUCTURE FOR DYNAMIC WORKFLOWS
BUILDING A SCALABLE BIG DATA INFRASTRUCTURE FOR DYNAMIC WORKFLOWS ESSENTIALS Executive Summary Big Data is placing new demands on IT infrastructures. The challenge is how to meet growing performance demands
More informationCustomer Case Study. Automatic Labs
Customer Case Study Automatic Labs Customer Case Study Automatic Labs Benefits Validated product in days Completed complex queries in minutes Freed up 1 full-time data scientist Infrastructure savings
More informationQLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering
QLIKVIEW INTEGRATION TION WITH AMAZON REDSHIFT John Park Partner Engineering June 2014 Page 1 Contents Introduction... 3 About Amazon Web Services (AWS)... 3 About Amazon Redshift... 3 QlikView on AWS...
More informationHIV NOMOGRAM USING BIG DATA ANALYTICS
HIV NOMOGRAM USING BIG DATA ANALYTICS S.Avudaiselvi and P.Tamizhchelvi Student Of Ayya Nadar Janaki Ammal College (Sivakasi) Head Of The Department Of Computer Science, Ayya Nadar Janaki Ammal College
More informationReal Time Big Data Processing
Real Time Big Data Processing Cloud Expo 2014 Ian Meyers Amazon Web Services Global Infrastructure Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
More informationNazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office
2013 Laboratory Accreditation Program Audioconferences and Webinars Implementing Next Generation Sequencing (NGS) as a Clinical Tool in the Laboratory Nazneen Aziz, PhD Director, Molecular Medicine Transformation
More informationCisco Data Preparation
Data Sheet Cisco Data Preparation Unleash your business analysts to develop the insights that drive better business outcomes, sooner, from all your data. As self-service business intelligence (BI) and
More informationAutomated Library Preparation for Next-Generation Sequencing
Buyer s Guide: Automated Library Preparation for Next-Generation Sequencing What to consider as you evaluate options for automating library preparation. Yes, success can be automated. Next-generation sequencing
More informationHow To Handle Big Data With A Data Scientist
III Big Data Technologies Today, new technologies make it possible to realize value from Big Data. Big data technologies can replace highly customized, expensive legacy systems with a standard solution
More informationOrganization and analysis of NGS variations. Alireza Hadj Khodabakhshi Research Investigator
Organization and analysis of NGS variations. Alireza Hadj Khodabakhshi Research Investigator Why is the NGS data processing a big challenge? Computation cannot keep up with the Biology. Source: illumina
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationBig Data and the Data Lake. February 2015
Big Data and the Data Lake February 2015 My Vision: Our Mission Data Intelligence is a broad term that describes the real, meaningful insights that can be extracted from your data truths that you can act
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationIntroduction to Big Data Training
Introduction to Big Data Training The quickest way to be introduce with NOSQL/BIG DATA offerings Learn and experience Big Data Solutions including Hadoop HDFS, Map Reduce, NoSQL DBs: Document Based DB
More informationIntegrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
More informationAchieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks
WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance
More informationNEXT GENERATION ARCHIVE MIGRATION TOOLS
NEXT GENERATION ARCHIVE MIGRATION TOOLS Cloud Ready, Scalable, & Highly Customizable - Migrate 6.0 Ensures Faster & Smarter Migrations EXECUTIVE SUMMARY Data migrations and the products used to perform
More informationEMBL Identity & Access Management
EMBL Identity & Access Management Rupert Lück EMBL Heidelberg e IRG Workshop Zürich Apr 24th 2008 Outline EMBL Overview Identity & Access Management for EMBL IT Requirements & Strategy Project Goal and
More informationNoSQL for SQL Professionals William McKnight
NoSQL for SQL Professionals William McKnight Session Code BD03 About your Speaker, William McKnight President, McKnight Consulting Group Frequent keynote speaker and trainer internationally Consulted to
More informationIn-Database Analytics
Embedding Analytics in Decision Management Systems In-database analytics offer a powerful tool for embedding advanced analytics in a critical component of IT infrastructure. James Taylor CEO CONTENTS Introducing
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More information<Insert Picture Here> Oracle and/or Hadoop And what you need to know
Oracle and/or Hadoop And what you need to know Jean-Pierre Dijcks Data Warehouse Product Management Agenda Business Context An overview of Hadoop and/or MapReduce Choices, choices,
More informationEverything you need to know about flash storage performance
Everything you need to know about flash storage performance The unique characteristics of flash make performance validation testing immensely challenging and critically important; follow these best practices
More informationModule 1. Sequence Formats and Retrieval. Charles Steward
The Open Door Workshop Module 1 Sequence Formats and Retrieval Charles Steward 1 Aims Acquaint you with different file formats and associated annotations. Introduce different nucleotide and protein databases.
More information