Accelerating Life Science Discovery using a High-Performance Analytics Platform in a Collaborative Environment Overview
1 Accelerating Life Science Discovery using a High-Performance Analytics Platform in a Collaborative Environment Overview October 7, 2015 Kathy Tzeng, PhD Worldwide Technical Lead Healthcare & Life Sciences IBM Systems Group
2 Genomic Solution Enablement Team. Mission: porting and optimization of genomics/translational applications on IBM solutions; developing solutions with partners; making IBM SW/HW available to software developers. Members: Independent Software Vendor (ISV) team, Toronto Compiler Lab, Boeblingen Development Lab, Tokyo Research Lab, Austin Research Lab.
3 GENOMIC MEDICINE: from Sequencing to Personalized Healthcare. NHGRI, a branch of NIH, has defined 5 steps for genomic medicine (source: E. Green et al., Nature 470). Next Generation Sequencing (or other ingestion): the focus is on very large data generation, mainly from $1000 whole genome sequencing, and on data processing and reduction; it includes human, plant, animal, and microbiome genomics. Translational Research/Early Discovery: the focus is on data integration, including genomic data, and on the analytics required to identify biomarkers, understand disease mechanisms, and identify new medical treatments. Personalized Healthcare/Clinical Genomics: the focus is on delivering genomic medicine to patients to improve outcomes by associating patients with known genome-specific treatments.
4 A Computationally Challenging Problem. Breakthroughs in genomic medicine require quantifying associations between known population traits, environmental factors, and biological responses. Known Traits or Environmental Features, F(t): quantities describing population traits or environmental factors at time t. Predictive Response Function, W(t): model of associations between features and responses as a function of time t. Measured Biological Response, R(t): quantities describing response events for an organism at time t. Computational challenges: feature combinatorics, large file sizes, large population sizes, unstructured data types.
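The "feature combinatorics" challenge can be made concrete with a small calculation. The following is a minimal sketch, not taken from the slides: the function name `interaction_count` and the feature counts are hypothetical, chosen only to show how quickly the number of candidate feature combinations grows.

```python
from math import comb


def interaction_count(n_features: int, order: int) -> int:
    """Number of distinct feature subsets of size `order` (n choose order)."""
    return comb(n_features, order)


# Hypothetical feature counts, used only to illustrate the growth rate.
for n in (1_000, 1_000_000):
    pairs = interaction_count(n, 2)
    triples = interaction_count(n, 3)
    print(f"{n} features: {pairs} pairs, {triples} triples")
```

Even pairwise associations scale quadratically with the number of features, which is one reason exhaustive testing of combinations becomes impractical at genome scale.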
5 Workload Challenge #1: Big Data Analytics. Variant information requires a computationally intensive analysis of raw sequence data across thousands of genomic samples. Processing time per genome: 1 to 100 hours* on 1 compute node (*duration depends on the selection of analytical tools and hardware). Pipeline stages: High-Throughput Sequencing; Assembly & Alignment; Variant Calling; Variant Annotations. Raw reads. De novo assembly: SOAPdenovo, Velvet. Reference-based mapping: BWA, Bowtie, SOAP. Reference genomes: TCGA, GEO, dbSNP. Variant calling: Picard, GATK, SAMtools, SOAPsnp. Annotation tools: ANNOVAR, Gene Ontology. File formats: FastQ, BAM, VCF. Sample annotation: intergenic SNP in IL23R associated with Crohn's disease. Whole human genome: 3 billion DNA base pairs, 30x coverage; file sizes: ~150 GB (compressed), ~150 GB, 100 to 200 MB, 500 MB. Each human genome can have a few million variants.
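The scale figures on this slide can be turned into a rough back-of-the-envelope estimate. This is a sketch under stated assumptions: the genome size, coverage, and the 1 to 100 hour range come from the slide, while the cohort size of 1,000 genomes and the function names are hypothetical.

```python
GENOME_BASES = 3_000_000_000  # from the slide: 3 billion DNA base pairs
COVERAGE = 30                 # from the slide: 30x coverage


def sequenced_bases(genome_bases: int, coverage: int) -> int:
    """Total bases read by the sequencer for one genome at the given coverage."""
    return genome_bases * coverage


def node_hours(n_genomes: int, hours_per_genome: float) -> float:
    """Single-node compute time needed to process a cohort sequentially."""
    return n_genomes * hours_per_genome


print(sequenced_bases(GENOME_BASES, COVERAGE))

# Hypothetical cohort of 1,000 genomes at the slide's lower and upper bounds.
for hours in (1, 100):
    print(hours, node_hours(1_000, hours))
```

At the upper bound, a thousand-genome cohort already needs on the order of a hundred thousand node-hours, which is why the choice of tools and hardware matters.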
6 Workload Challenge #2: Unstructured Information. Scientific data must be extracted from very large volumes of natural language content, biomedical images, and other unstructured data, and transformed into a structured format for analysis. Omics data (variant databases), for example: exonic NOD2 16 a frameshift SNP; exonic GJB2 13 associated with hearing loss; exonic CRYL1,GJB6 13 a 342kb deletion. Phenotypic data (e.g. clinical histories, medical images): "... was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, ...". Scientific literature: peer-reviewed articles, clinical guidelines, textbooks, patents. Information must be transformed into normalized structured data for statistical analysis and relationship visualization.
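As an illustration of turning semi-structured text into structured records, the sketch below parses the three variant lines quoted on the slide. It assumes the fields are, in order, region, gene list, chromosome, and a free-text note; that field order is an interpretation of the slide, and the `VariantNote` type and `parse_line` function are hypothetical.

```python
from typing import NamedTuple, Tuple


class VariantNote(NamedTuple):
    region: str
    genes: Tuple[str, ...]
    chromosome: str
    note: str


def parse_line(line: str) -> VariantNote:
    """Split one whitespace-separated line; the last field keeps its spaces."""
    region, genes, chromosome, note = line.split(maxsplit=3)
    return VariantNote(region, tuple(genes.split(",")), chromosome, note)


# The three example lines shown on the slide.
LINES = [
    "exonic NOD2 16 a frameshift SNP",
    "exonic GJB2 13 associated with hearing loss",
    "exonic CRYL1,GJB6 13 a 342kb deletion",
]

records = [parse_line(line) for line in LINES]
for record in records:
    print(record)
```

Clinical narratives and literature need real text analytics rather than a fixed split, but the target is the same: normalized fields that can be joined and counted.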
7 Workload Challenge #3: Big Data Integration. Discovery of genotype-phenotype associations requires an analysis of complex data types that must be integrated within a common analytical environment. Sources: (1) Omics data: variant calls and annotations (VCF excerpt: ##FORMAT=<ID=DP, ##FORMAT=<ID=HQ, #CHROM POS ID REF ALT rs G A); (2) Phenotypic data: clinical features, environmental factors, biological responses; (3) Knowledge base: electronic text and web sites. Big Data Warehouse Environment with a patient-centric logical data model (RDBMS and/or NoSQL): (1) Genotypic data: variant list (VCF) keyed by Patient ID, with Variant ID linking to detail on a single variant; (2) Phenotypic data: observed traits and responses for the patient population, keyed by Patient ID, with Phenotype ID linking to observation detail; (3) Knowledge base.
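A patient-centric model of this kind can be sketched with an in-memory SQLite database. This is a minimal illustration, not the schema used in the presentation: the table names, column names, and all rows are hypothetical, and only the keys (Patient ID, Variant ID, Phenotype ID) come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE variant_call (patient_id TEXT, variant_id TEXT);
    CREATE TABLE phenotype_obs (patient_id TEXT, phenotype_id TEXT);
    CREATE TABLE knowledge_base (variant_id TEXT, phenotype_id TEXT, note TEXT);
    """
)

# Hypothetical rows, only to exercise the joins.
conn.executemany(
    "INSERT INTO variant_call VALUES (?, ?)",
    [("P1", "V1"), ("P2", "V1"), ("P3", "V2")],
)
conn.executemany(
    "INSERT INTO phenotype_obs VALUES (?, ?)",
    [("P1", "PH1"), ("P2", "PH1"), ("P3", "PH2")],
)
conn.execute(
    "INSERT INTO knowledge_base VALUES (?, ?, ?)",
    ("V1", "PH1", "hypothetical curated association"),
)


def carriers_with_phenotype(db, variant_id, phenotype_id):
    """Patients who carry the variant and also show the phenotype."""
    rows = db.execute(
        """
        SELECT v.patient_id
        FROM variant_call AS v
        JOIN phenotype_obs AS p ON p.patient_id = v.patient_id
        WHERE v.variant_id = ? AND p.phenotype_id = ?
        ORDER BY v.patient_id
        """,
        (variant_id, phenotype_id),
    ).fetchall()
    return [row[0] for row in rows]


print(carriers_with_phenotype(conn, "V1", "PH1"))
```

Keying both genotypic and phenotypic tables on the patient identifier is what lets a single join answer a genotype-phenotype question; the knowledge base is joined on variant or phenotype identifiers when annotation is needed.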
8 Key Capabilities. Leading biomedical research organizations are asking for technology capabilities that will give them a low-cost solution to accelerate scientific discovery in genomic medicine: flexible, scalable, low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other complex life science data; seamless integration of complex life science data types on a common analytical platform; rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents; metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results; tools for scientific collaboration that enable data and workload sharing across organizations and geographic boundaries in a secure environment that ensures data privacy.
9 A Foundation for Computational Science. IBM's Reference Architecture for Genomic Medicine supports big data computational research on a foundation of HPC compute, storage, and workload management capabilities. Research applications (computational modeling, genomic analysis pipelines, text analytics/NLP with Apache UIMA and IBM System T, image analysis): performance optimization for open source and commercial analytics applications; text analytics for the conversion of natural language concepts into structured data entities (IBM Research, IBM Watson, IBM Business Partners). Big Data Foundation. Big Data Warehouse: RDBMS or NoSQL database environments enabling rapid processing of large volumes of complex high-dimensional data structures in a data warehouse (IBM BigInsights, IBM Business Partners). Workload Orchestration with Metadata Capture: intelligent resource allocation, sharing, and monitoring across parallel HPC workloads (IBM Platform Computing, IBM Business Partners). Data Management, File System & Storage / ILM: low-cost, low-latency, easy-access storage and archiving of data and metadata across heterogeneous environments (IBM Spectrum Scale / Elastic Storage Server).
10 IBM Systems Facilitate Scientific Collaboration. Data management and analytics tools can be accessed and shared across heterogeneous systems in on-premise and cloud environments. The Big Data foundation enables data access, data management, and HPC workload orchestration across heterogeneous on-premise, private cloud, public cloud, and hybrid cloud environments. Diagram labels: Local Data Center; On-Premise Users; External Collaborators (Heterogeneous Environments); Private Cloud Users; Public Cloud Users; 1/10 GbE; Applications; Big Data Warehouse; HPC Network; Workload Orchestration with Metadata Capture; Data Management: File System / Storage ILM; Workload Burst; WAN 10GbE or InfiniBand; On-Premise Cluster; Virtual Private Clouds; Encrypted VPN.
11 Workload Orchestration Platforms Genomics Translational Personalized Healthcare Access AppCenter (PAC, Galaxy, DataBiology, Lab7) Application & Workflow File & Database Visualization System & Log Compute Orchestrator (ASC/EGO, LSF, Symphony, PPM) HPC Cluster Big Data Spark Cluster OpenStack Docker Storage Datahub (Spectrum Scale, Zato, Nirvana) SSD/Flash FC/IB Attached Low-cost Storage HA/DR Storage Cloud Storage
12 A framework for NGS and HPC Systems Architecture Users HPC Platform Management Software Stack Suite Scale-out cluster Scale-up SMP Spectrum Scale ESS Active Archive TSM/LTFS/HPSS Devices
13 IBM Genomics Reference Architecture. The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers. Edico Genome
14 BioBuilds: Open Source Bioinformatics. Open source bioinformatics tools for research, commercial, and regulated environments. Turn-key: pre-built binaries and complete build scripts enable easy deployment. Optimized: POWER8 binaries provide the best performance for your hardware. Ready for the clinic: a single source for tools, streamlining verification and audit. Long-term support: community sponsorship and support contracts ensure ongoing support for tools.
15 Open Source Application Portfolio in BioBuilds: ALLPATHS-LG, Bedtools, Bfast, BLAST (NCBI), Bowtie, Bowtie2, BWA, Cufflinks, FastQC, Numpy, PICARD, PLINK, Python, SAMTools, SOAP3-DP, SOAPDenovo, SQLite, R, Bioconductor, FASTA, Trinity, SHRiMP. Updated tools: HMMER (LE), OpenSSL, IGV, iRODS, RNAStar, ISAAC, TMAP, SOAPaligner/soap2. Updated tools: Bowtie2, HMMER, Tabix, BWA, HTSeq, Mothur, TopHat, Velvet/Oases, OpenSSL.
16 Optimization of GATK from Broad Institute. IBM works with genomics leaders to improve the performance of analytical workflows like GATK on IBM Power 8 Systems.
17 Optimization of Broad's Best Practice Pipeline. ~65X whole human genome analysis done within a day; ~150X whole exome analysis done in 3.45 hours. Table columns: Steps, Intel Runtime*, IBM Runtime. Steps: BWA, Samtools, MarkDuplicates, RealignTargets, IndelRealigner, BaseRecalibrator, PrintReads+Index, Preprocessing Total, HaplotypeCaller 2.03, Total. Input dataset: G15512.HCC1954.1, coverage: 65x. Both IBM and Intel solutions: # of machines = 1, # of cores/machine = 24. IBM solution: GHz Power8 with GPFS. Note*:
18 Performance of L3 Bioinformatics BALSA on Power 8 with GPU. Power GHz, 2x k40 GPU and GPFS
19 I/O Cache Library to Optimize Performance of Genomics Applications. IBM uses a file cache library to improve I/O performance and reduce workflow runtimes. Application: Illumina's CASAVA v1.8 (BCL to FASTQ). Data set: 8 lanes of HiSeq data. Elapsed time without the cache library: 1730 min; with the cache library: 107 min.
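The two elapsed times on this slide imply a speedup that is easy to compute. The sketch below uses only the 1730 min and 107 min figures from the slide; the function names are hypothetical.

```python
def speedup(baseline_min: float, optimized_min: float) -> float:
    """Ratio of baseline runtime to optimized runtime."""
    return baseline_min / optimized_min


def minutes_saved(baseline_min: float, optimized_min: float) -> float:
    """Absolute reduction in elapsed time."""
    return baseline_min - optimized_min


WITHOUT_CACHE_MIN = 1730  # from the slide
WITH_CACHE_MIN = 107      # from the slide

print(round(speedup(WITHOUT_CACHE_MIN, WITH_CACHE_MIN), 1))
print(minutes_saved(WITHOUT_CACHE_MIN, WITH_CACHE_MIN))
```

A ratio of roughly sixteen from I/O changes alone suggests the BCL to FASTQ conversion was dominated by file access rather than computation in this configuration.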
20 Accelerating Genomics Applications using GPFS. IBM and BIOVIA's Pipeline Pilot scale genomic analysis from the desktop to the enterprise using IBM GPFS. Speed of the file system matters. Chart: Bowtie2 NGS benchmarks on 2.6 GHz iDataPlex with GPFS and NFS; elapsed time in minutes, lower is better; series: GPFS, NFS.
21 Genomic Workflow Optimization. Typical genomic sequencing workflow, command line:
bwa aln -t 12 -l 40 -n 3 -k 2
bwa sampe -a 700 -P -o 1000
samtools view -bt
samtools sort
Picard: java -Xmx8g -Djava.io.tmpdir MarkDuplicates.jar METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR
Picard: java -Xmx8g -Djava.io.tmpdir AddOrReplaceReadGroups.jar SORT_ORDER=coordinate RGID=sample_lane RGLB=sample RGPL=illumina RGPU=lane RGSM=sample RGCN=center_name CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT TMP_DIR
GATK Lite: java -Xmx8g -Djava.io.tmpdir -T RealignerTargetCreator -nt 1
GATK Lite: java -Xmx8g -Djava.io.tmpdir -T IndelRealigner -targetintervals -known 1000G_biallelic.indels.hg19.vcf
Picard: java -Xmx8g -Djava.io.tmpdir FixMateInformation.jar SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true TMP_DIR
GATK Lite: java -Xmx#{JAVA_REQMEM}g -Djava.io.tmpdir -T CountCovariates -recalfile -knownsites:dbsnp,vcf /gpfs/gpfs1/genome/snp_indel_vcf/dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate
GATK Lite: java -Xmx8g -Djava.io.tmpdir -T TableRecalibration -recalfile -smode SET_Q_ZERO -solid_nocall_strategy THROW_EXCEPTION -nback 7 --baq RECALCULATE
GATK Lite: java -Xmx4g -jar $GATK_BIN/GenomeAnalysisTK.jar -glm BOTH -R $REFERENCE -T UnifiedGenotyper -I recalibrated.bam
22 Genomic Workflow Optimization. IBM Platform Process Manager facilitates genomic workflow execution.
23 Genomic Workflow Optimization. The IBM Platform LSF workload scheduler is linked to the Process Manager and maximizes the utilization of HPC resources to improve workflow runtimes. Data set: 37x coverage of whole human genomes. Workflow input: 74 fastq.gz files; workflow output: recalibrated BAM file. Dependency steps: using the LSF bsub -w option. Table columns: Runs, 1st Set, 2nd Set, 3rd Set, 4th Set, Total Sets. 1 set on 8 nodes: hrs hrs. 4 sets on 8 nodes: hrs 20.9 hrs hrs hrs hrs.
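Step dependencies expressed with the LSF `bsub -w` option can be sketched as a small command generator. This is an illustration only: it builds command strings and does not submit anything, and the step names and script paths are hypothetical. The `-J` (job name) and `-w "done(job_name)"` (dependency expression) options are standard `bsub` options.

```python
import shlex


def bsub_command(job_name, command, depends_on=None):
    """Build one bsub submission line, optionally gated on an earlier job."""
    args = ["bsub", "-J", job_name]
    if depends_on:
        # done(<name>) starts this job only after the named job finishes successfully.
        args += ["-w", f"done({depends_on})"]
    args.append(command)
    return " ".join(shlex.quote(arg) for arg in args)


def chain(steps):
    """Turn an ordered list of (name, command) pairs into a linear dependency chain."""
    commands = []
    previous = None
    for name, command in steps:
        commands.append(bsub_command(name, command, previous))
        previous = name
    return commands


# Hypothetical step names and scripts standing in for the workflow stages.
STEPS = [("align", "./align.sh"), ("sort", "./sort.sh"), ("markdup", "./markdup.sh")]

for line in chain(STEPS):
    print(line)
```

Because each job only names its predecessor, the scheduler is free to interleave several sample sets on the same nodes, which is the effect the "4 sets on 8 nodes" run measures.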
24 Data Compression Appliance. Compression algorithms: gzip on Power 8 with FPGA board (available now); CRAM. Compression ratio (lossless): on average 1:3 for fastq files; 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (the FASTQ to compressed BAM ratio is 16X). Speed/throughput: 2.5 GB/s on average (a 200 GB fastq file can be compressed in 80 seconds); a speed-up of more than 10 times was achieved using 12 cores (approximately 0.5 GB/min); FPGA acceleration is ongoing. A Pistoia compression contest was held; James Bonfield of the Sanger Institute won with a 1:9 compression ratio at 0.1 GB/min. CRAM was released by EBI in late 2012 to compress BAM files and has been accepted by the Global Alliance for Genomics and Health. IBM is collaborating with the Sanger Institute and EBI on improving compression for genomics data. Tools: Samtools, Picard, CRAM. Source: Baker M., Nature Methods 7 (2010).
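The throughput and ratio figures can be checked with simple arithmetic. The sketch below uses the 2.5 GB/s, 200 GB, and 1:3 figures from the slide; the function names are hypothetical, and the reading of "1:3" as compressed size to original size is an assumption.

```python
def seconds_to_compress(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to stream a file through the compressor at a fixed throughput."""
    return size_gb / throughput_gb_per_s


def gb_per_minute(throughput_gb_per_s: float) -> float:
    """Convert GB/s to GB/min for comparison with the per-minute figures."""
    return throughput_gb_per_s * 60


def compressed_size_gb(size_gb: float, ratio: float) -> float:
    """Size after compression, assuming 1:ratio means compressed:original."""
    return size_gb / ratio


print(seconds_to_compress(200, 2.5))
print(gb_per_minute(2.5))
print(round(compressed_size_gb(200, 3), 1))
```

The first result reproduces the slide's 80 second figure, and the GB/min conversion puts the hardware throughput on the same scale as the software figures quoted per minute.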
25 Enterprise Data Management. IBM works with Lab7 to deliver data provenance with performance, reliability and security. Workflow stages: Experimental Design, Sample Prep, Sequencing, Mapping, Analysis, Reporting, Meta Analysis. Diagram labels: Sample LIMS; User Experience; Workflow Engine; Federated Data Engine; Pipeline Engine; Sample Data; Reference; Attribute Sheet; Pipeline; Visualization/EDA. Lab7 ESP: a comprehensive software platform that combines LIMS and informatics functionalities; data provenance: maintains continuous data provenance by tracking the history of samples, analyses, and results and by providing detailed audit trails; sequencing platform flexibility: manages data generated from any sequencing platform. IBM Power System Solution with GPFS and Platform LSF delivers: superior compute infrastructure (superior performance, scalability and maximum throughput); outstanding enterprise-grade reliability and security (Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime; IBM Power Security and Compliance (PowerSC) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)); total cost of ownership (very affordable compared to like-sized x86 systems).
26 Data Provenance with Performance, Reliability and Security. Databiology for Enterprise: functional architecture. Diagram labels: Scientific; Samples; Annotation; Ontologies; Shopping Basket; Social; Comments + Attachments + WF Integration; 3 C's (Configure, Command, Collaborate); Portal; API; Custom Web Apps via API; Compute and Storage; Softlayer; LSF; GPFS; Project Management; Roles + Access; Lifecycle Management; Meta Information; Financial + Resource Mgmt; Task Management; Transport; DBE Download Manager; DBE Multiprot (S3, SCP, RSync, SFTP, FTP, HTTP); Applications; Import; Analysis; Visualization; Configuration; Infrastructure; Compute; Storage; Network; Identity Management; Instruments; Logic; Everything as an app: Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines; Version Control + Reproducible Data Provenance; Interface; Information Management; Orchestration. IBM Power System Solution with GPFS and Platform LSF delivers: superior compute infrastructure (superior performance, scalability and maximum throughput); outstanding enterprise-grade reliability and security (Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime; IBM Power Security and Compliance (PowerSC) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)); total cost of ownership (very affordable compared to like-sized x86 systems). Databiology for Enterprise: SaaS + customer-specific instances; central hub to manage all omics data and to orchestrate all activities; functionally rich and oriented on key steps in the R&D life cycle; insight to instrument with best-in-class applications; easy integration with existing environments; automatic data provenance and reporting; cost-neutral deployment; gradual roll-out / low risk.
27 tranSMART: Optimized on Power8 and Spectrum Scale. tranSMART associates genotypic and phenotypic data for complex analytics. Watson Explorer extracts insight from scientific literature and data records and provides enrichment to tranSMART's analysis.
28 tranSMART Power8 Deployment Architecture Users Application Browser HTTP Web Server (Apache2) Watson Analytics Server HTTP i2b2 Application Server tranSMART Solr Full Text index Watson Analytics JDBC Application Server PLINK GPFS PostgreSQL tranSMART DB JDBC (Tomcat 7) Quartz Job Call R Analytics Tools Gene Patterns Power8
29 Accelerate tranSMART ETL with Power8/Spectrum Scale. Datasets: TCGA_OV, Simulation, GSE32583, GSE13168, GSE1456, GSE15258. No. Records: 5,789,632 40,774, ,724 1,203,282 3,600,555 4,702,050
30 Zato's Scalable Data Federation Solution for Healthcare and Genomics. Data spanning data centers in parallel, with a single pane of glass for clinical and research applications on Power 8 and GPFS. Diagram labels: Imaging Data; Lab Results; LAN; LAN; Electronic Health Record Data; LAN; Nursing Home Records; VPN; Microbiology Reports; LAN; Claims Data; Radiology Reports; LAN; VPN; VPN; Internet; Genomic Data; Accepted Medical Knowledge; NIH Data; CDC Data; NLM Data.
31 Thank You
33 Noblis BioVelocity is Developed and Optimized on IBM's Power 8
More informationAddressing Open Source Big Data, Hadoop, and MapReduce limitations
Addressing Open Source Big Data, Hadoop, and MapReduce limitations 1 Agenda What is Big Data / Hadoop? Limitations of the existing hadoop distributions Going enterprise with Hadoop 2 How Big are Data?
More informationInformation Architecture
The Bloor Group Actian and The Big Data Information Architecture WHITE PAPER The Actian Big Data Information Architecture Actian and The Big Data Information Architecture Originally founded in 2005 to
More informationWindows HPC Server 2008 R2 Service Pack 3 (V3 SP3)
Windows HPC Server 2008 R2 Service Pack 3 (V3 SP3) Greg Burgess, Principal Development Manager Windows Azure High Performance Computing Microsoft Corporation HPC Server Components Job Scheduler Distributed
More informationSCALABLE FILE SHARING AND DATA MANAGEMENT FOR INTERNET OF THINGS
Sean Lee Solution Architect, SDI, IBM Systems SCALABLE FILE SHARING AND DATA MANAGEMENT FOR INTERNET OF THINGS Agenda Converging Technology Forces New Generation Applications Data Management Challenges
More informationCloud-based Analytics and Map Reduce
1 Cloud-based Analytics and Map Reduce Datasets Many technologies converging around Big Data theme Cloud Computing, NoSQL, Graph Analytics Biology is becoming increasingly data intensive Sequencing, imaging,
More informationPersonalized Medicine and IT
Personalized Medicine and IT Data-driven Medicine in the Age of Genomics www.intel.com/healthcare/bigdata Ketan Paranjape General Manager, Life Sciences Intel Corp. @Portlandketan 1 The Central Dogma of
More informationIBM Platform Computing : infrastructure management for HPC solutions on OpenPOWER Jing Li, Software Development Manager IBM
IBM Platform Computing : infrastructure management for HPC solutions on OpenPOWER Jing Li, Software Development Manager IBM #OpenPOWERSummit Join the conversation at #OpenPOWERSummit 1 Scale-out and Cloud
More informationIBM Software Information Management Creating an Integrated, Optimized, and Secure Enterprise Data Platform:
Creating an Integrated, Optimized, and Secure Enterprise Data Platform: IBM PureData System for Transactions with SafeNet s ProtectDB and DataSecure Table of contents 1. Data, Data, Everywhere... 3 2.
More informationAutomated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer
Automated Data Ingestion Bernhard Disselhoff Enterprise Sales Engineer Agenda Pentaho Overview Templated dynamic ETL workflows Pentaho Data Integration (PDI) Use Cases Pentaho Overview Overview What we
More informationEoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille
Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille Journées SUCCES Stéphane Le Crom (UPMC IBENS) stephane.le_crom@upmc.fr Paris November 2013 The Sanger DNA sequencing method Sequencing
More informationScaling LS-DYNA on Rescale HPC Cloud Simulation Platform
Scaling LS-DYNA on Rescale HPC Cloud Simulation Platform Joris Poort, President & CEO, Rescale, Inc. Ilea Graedel, Manager, Rescale, Inc. 1 Cloud HPC on the Rise 1.1 Background Engineering and science
More informationCSE-E5430 Scalable Cloud Computing. Lecture 4
Lecture 4 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 5.10-2015 1/23 Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System
More informationRichmond, VA. Richmond, VA. 2 Department of Microbiology and Immunology, Virginia Commonwealth University,
Massive Multi-Omics Microbiome Database (M 3 DB): A Scalable Data Warehouse and Analytics Platform for Microbiome Datasets Shaun W. Norris 1 (norrissw@vcu.edu) Steven P. Bradley 2 (bradleysp@vcu.edu) Hardik
More informationChukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
More informationBig Workflow: More than Just Intelligent Workload Management for Big Data
Big Workflow: More than Just Intelligent Workload Management for Big Data Michael Feldman White Paper February 2014 EXECUTIVE SUMMARY Big data applications represent a fast-growing category of high-value
More informationASPERA HIGH-SPEED TRANSFER SOFTWARE. Moving the world s data at maximum speed
ASPERA HIGH-SPEED TRANSFER SOFTWARE Moving the world s data at maximum speed PRESENTERS AND AGENDA PRESENTER John Heaton Aspera Director of Sales Engineering john@asperasoft.com AGENDA How Cloud is used
More informationIBM Smart Business Storage Cloud
GTS Systems Services IBM Smart Business Storage Cloud Reduce costs and improve performance with a scalable storage virtualization solution SoNAS Gerardo Kató Cloud Computing Solutions 2010 IBM Corporation
More informationWhite Paper. Version 1.2 May 2015 RAID Incorporated
White Paper Version 1.2 May 2015 RAID Incorporated Introduction The abundance of Big Data, structured, partially-structured and unstructured massive datasets, which are too large to be processed effectively
More informationWOS Cloud. ddn.com. Personal Storage for the Enterprise. DDN Solution Brief
DDN Solution Brief Personal Storage for the Enterprise WOS Cloud Secure, Shared Drop-in File Access for Enterprise Users, Anytime and Anywhere 2011 DataDirect Networks. All Rights Reserved DDN WOS Cloud
More informationDeep Sequencing Data Analysis
Deep Sequencing Data Analysis Ross Whetten Professor Forestry & Environmental Resources Background Who am I, and why am I teaching this topic? I am not an expert in bioinformatics I started as a biologist
More informationOvercoming Storage Barriers in Life Sciences Research with IBM s Next Generation Sequencing Solutions. Executive Summary
Overcoming Storage Barriers in Life Sciences Research with IBM s Next Generation Sequencing Solutions Sponsored by IBM Srini Chari, Ph.D., MBA October 2011 Cabot Partners Group, Inc. 100 Woodcrest Lane,
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationBig Data and the Data Lake. February 2015
Big Data and the Data Lake February 2015 My Vision: Our Mission Data Intelligence is a broad term that describes the real, meaningful insights that can be extracted from your data truths that you can act
More informationAligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap
Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap 3 key strategic advantages, and a realistic roadmap for what you really need, and when 2012, Cognizant Topics to be discussed
More information<Insert Picture Here> The Evolution Of Clinical Data Warehousing
The Evolution Of Clinical Data Warehousing Srinivas Karri Principal Consultant Agenda Value of Clinical Data Clinical Data warehousing & The Big Data Challenge
More informationGo where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe
Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe Go where the biology takes you. To published results faster With proven scalability To the forefront of discovery To limitless applications
More informationCapitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes
Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes Highly competitive enterprises are increasingly finding ways to maximize and accelerate
More informationThe deployment of OHMS TM. in private cloud
Healthcare activities from anywhere anytime The deployment of OHMS TM in private cloud 1.0 Overview:.OHMS TM is software as a service (SaaS) platform that enables the multiple users to login from anywhere
More informationRED HAT: UNLOCKING THE VALUE OF THE CLOUD
RED HAT: UNLOCKING THE VALUE OF THE CLOUD Chad Tindel September 2010 1 RED HAT'S APPROACH TO THE CLOUD IS BETTER Build better clouds with Red Hat 1. The most comprehensive solutions for clouds both private
More informationEnd to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ
End to End Solution to Accelerate Data Warehouse Optimization Franco Flore Alliance Sales Director - APJ Big Data Is Driving Key Business Initiatives Increase profitability, innovation, customer satisfaction,
More informationEnabling the Big Data Commons through indexing of data and their interactions
biomedical and healthcare Data Discovery Index Ecosystem Enabling the Big Data Commons through indexing of and their interactions 2 nd BD2K all-hands meeting Bethesda 11/12/15 Aims 1. Help users find accessible
More informationHigh Performance Compu2ng Facility
High Performance Compu2ng Facility Center for Health Informa2cs and Bioinforma2cs Accelera2ng Scien2fic Discovery and Innova2on in Biomedical Research at NYULMC through Advanced Compu2ng Efstra'os Efstathiadis,
More informationEMC ATMOS. Managing big data in the cloud A PROVEN WAY TO INCORPORATE CLOUD BENEFITS INTO YOUR BUSINESS ATMOS FEATURES
EMC ATMOS Managing big data in the cloud Essentials Purpose-built cloud storage platform designed for unlimited global scale Intelligently automates management of content through highly flexible policies
More informationCloudCenter Full Lifecycle Management. An application-defined approach to deploying and managing applications in any datacenter or cloud environment
CloudCenter Full Lifecycle Management An application-defined approach to deploying and managing applications in any datacenter or cloud environment CloudCenter Full Lifecycle Management Page 2 Table of
More informationBig Data Analytics - Accelerated. stream-horizon.com
Big Data Analytics - Accelerated stream-horizon.com Legacy ETL platforms & conventional Data Integration approach Unable to meet latency & data throughput demands of Big Data integration challenges Based
More informationCloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community
Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community Ntinos Krampis Asst. Professor J. Craig Venter Institute kkrampis@jcvi.org http://www.jcvi.org/cms/about/bios/kkrampis/
More informationChallenges associated with analysis and storage of NGS data
Challenges associated with analysis and storage of NGS data Gabriella Rustici Research and training coordinator Functional Genomics Group gabry@ebi.ac.uk Next-generation sequencing Next-generation sequencing
More informationThree data delivery cases for EMBL- EBI s Embassy. Guy Cochrane www.ebi.ac.uk
Three data delivery cases for EMBL- EBI s Embassy Guy Cochrane www.ebi.ac.uk EMBL European Bioinformatics Institute Genes, genomes & variation European Nucleotide Archive 1000 Genomes Ensembl Ensembl Genomes
More informationGet More Scalability and Flexibility for Big Data
Solution Overview LexisNexis High-Performance Computing Cluster Systems Platform Get More Scalability and Flexibility for What You Will Learn Modern enterprises are challenged with the need to store and
More informationG E N OM I C S S E RV I C ES
GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E
More informationBig Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?
More informationScaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
More information