HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis
|
|
- Josephine Bond
- 8 years ago
- Views:
Transcription
1 HPC pipeline and cloud-based solutions for Next Generation Sequencing data analysis HPC4NGS 2012, Valencia Ignacio Medina Scientific Computing Unit Bioinformatics and Genomics Department Centro de Investigación Príncipe Felipe (CIPF) Valencia, Spain
2 Index Introduction High Performance Computing (HPC) Distributed and cloud computing (cloud) NGS Data Analysis Pipeline Next Steps Conclusions
3 Introduction New scenario in Biology Next Generation Sequencing (NGS) technology is changing the way how researchers perform experiments. Many new experiments are being conducted by sequencing: exome re-sequencing, RNA-seq, Meth-seq, ChIP-seq,... NGS is allowing researches to: Find exome variants responsible of diseases Study the whole transcriptome of a phenotype Establish the methylation state of a condition Locate DNA binding proteins Fantastic! But, this new experiments have increased data size by 1000x when compared with microarrays, i.e. from MB to GB in expression analysis Data processing and analysis are becoming a bottleneck and a nightmare, from days or weeks with microarrays to months with NGS, and it will be worse as more data become available
4 Introduction The NGS data NGS throughput is amazing, 30x human genomes in less than a week, and it's becoming faster! A single run can generate hundreds of GB of data, a typical NGS experiment today can produce 1TB of data in 1 week The good and bad news: Prices falling down quick! The good news: Parallelization is possible! Many of the analysis is short read independent This new scenario demands more advanced software solutions that exploit the current computer hardware properly. We need new technologies like HPC or cloud-based applications In just 10 years!
5 Introduction Current computer hardware Multi-core CPUs Coming soon: Intel MIC (Many Integrated Core) architecture with more than 50 processing cores GPU computing solutions Currently: quad-core and hexa-core processors, with hyper-threading Desktop: Nvidia GTX580 Fermi architecture (512 cores performing 1.5TFlops) HPC: Nvidia Tesla M2090 Fermi architecture (6GB memory, 512 cores, 1.3TFlops SP, 665 GFlops DP, memory bandwidth 177GB/sec) AVX, a new version of SSE, SIMD extended to 256 bits. New instruction set and registers Designed to support 512 or 1024-bits SIMD registers
6 Introduction Poor software performance But, software in Bioinformatics is in general slow and not designed for running in clusters or clouds, even new NGS software is still being developed in Perl, R and Java (Disclaimer: I usually code in Java, R and Perl, but they are not recommendable for NGS data analysis) New NGS technologies push computer needs, new solutions and technologies become necessary: more analysis, performance, scalable, cloud Computer performance vs data analysis As a Computer scientist I want high performance software Fast, very fast! Speed matters, low memory footprint! Scalable solutions: clusters, cloud As a Biologist I need useful data analysis tools Efficient and sensitive Extract relevant information It's not a choice! We need both of them, i.e. Google Search
7 Introduction Motivation and goals As data analysis group we focus in analysis: It's analysis, stupid! ~1-2 weeks sample sequencing ~2-4 weeks preprocessing and read mapping, this is necessary to be improved but is not enough! Several months of data analysis, lack of software for biological analysis, i.e. which are the genes with a coverage >30x that are differentially expressed in a dataset of 50 BAMs accounting for 500GB of data with a p-value<0.05? To develop new generation of software in bioinformatics to be fast, actually, very fast software! With a low memory consumption! be able to handle and analyze TB of data store data efficiently to query distribute computation, no data focused and useful in analysis! Software must exploit current hardware and be fast and efficient in a single computer or in a HPC cluster
8 High Performance Computing (HPC) Technologies Overview OpenMP SIMD instruction set extension to x86 architecture AVX (Advanced Vector Extensions) is an advanced version of SSE, SIMD register extended from 128 bits to 256 bits with new instructions Message Passing Interface (MPI) Frameworks: Nvidia CUDA, Khronos OpenCL, OpenACC (in development) Streaming SIMD Extensions (SSE) flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer GPU computing multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures de facto standard for communication in a parallel program running on a distributedmemory system FPGAs Integrated circuit configurable by programmer
9 High Performance Computing (HPC) Software architecture Common mistakes in new HPC developers Not all algorithms or problems are suitable for GPUs CPUs multi-core must be also exploited, hyper-threading Heterogeneous HPC computation, in shared-memory SSE/AVX SIMD instructions can also improve algorithms, i.e. speed-up 6-8x of Smith-Waterman, Farrar 2007) CPU(OpenMP + SSE/AVX) + GPU(Nvidia CUDA) Hybrid approach CPU OpenMP + SSE/AVX MPI GPU Hundreds of samples TB of data Nvidia CUDA Single Node execution Hadoop MapReduce Cluster execution
10 Distributed and cloud computing Apache Hadoop framework It's a Java framework library that allows for the distributed processing of large data sets across clusters of computers using a simple programming model Develops open-source software for reliable, scalable, distributed computing Apache Hadoop framework overview:
11 Distributed and cloud computing Apache Hadoop framework MapReduce implementation is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner Who is using in Bioinformatics? GATK Crossbow 1000genomes not many more... yet HBase is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. No SQL schema, different number of columns is normal, very flexible
12 Distributed and cloud computing Amazon AWS In 2006, Amazon Web Services (AWS) began offering IT infrastructure services to businesses in the form of web services -- now commonly known as cloud computing. One of the key benefits of cloud computing is the opportunity to replace up-front capital infrastructure expenses with low variable costs that scale with your business
13 NGS data analysis pipeline Global overview NGS sequencer Fastq file, up to hundreds of GB per run HPG suite High Performance Genomics QC and preprocessing QC stats, filtering and preprocessing options Short read aligner Double mapping strategy: Burrows-Wheeler Transform (GPU Nvidia CUDA) + Smith-Waterman (CPU OpenMP+SSE/AVX) More info SAM/BAM file QC and preprocessing Other analysis (HPC for genomics consortium) QC stats, filtering and preprocessing options Variant Calling analysis RNA-seq (mrna sequenced) DNA assembly (not a real analysis) Meth-seq Copy Number Transcript isoform Variant VCF viewer... HTML5+SVG Web based viewer GATK and SAM mpileup HPC Implementation. Statistics genomic tests VCF file QC and preprocessing QC stats, filtering and preprocessing options Variant effect analysis Consequence type, regulatory variants and system biology information
14 NGS data analysis pipeline HPG-Fastq tool Features: Filtering: filter by length range, filer by quality range, num. nt out of range, N's,... Preprocessing: quality ends trim, remove cloning vectors,... Hybrid approach implementation: Nvida CUDA + OpenMP QC stats: read length and quality average, nucleotides %, GC% content, quality per nucleotide, k-mers,... example: hpg-fastq qc fq fastq_file.fastq -t -v Paired-end Fastq files of 15GB each QC'ed and preprocessed in only one execution in less than 4 minutes in a standard workstation: Intel quad-core with 4GB RAM and Nvidia GTX580
15 NGS data analysis pipeline HPG-aligner, a HPC DNA short read aligner Current read aligners software tend to fit in one of these groups: Very fast, but no too sensitive: no gaps, no indels, rna-seq... Slow, but very sensitive: up to 1 day by sample Fastq file Read Aligner algorithms Burrows-Wheeler Transform (BWT): very fast! No sensitive Smith-Waterman (SW): very sensitive but very slow Hybrid approach (paper sent): BWT +1 error implemented in Nvidia CUDA SW implemented using OpenMP and SSE/AVX Architecture BWT-GPU < 2 errors Find CALs with BWT-GPU (seeds) 10x compared to BFAST ~50x with SSE Testing BWT-GPU >= 2 errors SW OpenMP/SSE SAM file
16 NGS data analysis pipeline HPG-BAM tool Features BAM Sort implemented in Nvidia CUDA (2x compared to SAM tools) Quality control: % reads with N errors, % reads with multiple mappings, strand bias, paired-end insert,... Coverage implemented in OpenMP (2x faster): by gene, region,... Filtering: by number of errors, number of hits, Comparator: stats, intersection,... In a standard workstation works in minute scale Not too much performance improvement, but many new analysis features
17 NGS data analysis pipeline HPG-variant, suite of tools for variant analysis HPG-VCF tools: to analyze large VCFs files with a low memory footprint: stats, filter, split, merge, HPG-Variant: suite of tools for genome-scale variant analysis Variant calling: is a critical step, two HPC implementation are being developed: GATK, SAMtools mpileup Genomic analysis: association, TDT, Hardy-Weinberg, LOH, (~Plink) VCF WEB Viewer and VARIANT software (later) more coming
18 NGS data analysis pipeline CellBase, an integrative database and RESTful WS API Based on CellBase (accepted in NAR), a comprehensive integrative database and RESTful Web Services API, more than 120GB of data and 100 SQL tables exported in TXT and JSON: Core features: genes, transcripts, exons, cytobands, proteins (UniProt),... Variation: dbsnp and Ensembl SNPs, HapMap, 1000Genomes, Cosmic,... Functional: 40 OBO ontologies(gene Ontology), Interpro,... Regulatory: TFBS, mirna targets, conserved regions, System biology: Interactome (IntAct), Reactome database, co-expressed genes,...
19 NGS data analysis pipeline Genome Maps, a HTML5+SVG genome browser Visualization is an important part of the data analysis: heavy data issue! Do not move data! Features of Genome Maps ( paper sent) First 100% web based: HTML5+SVG (Google Maps inspired) Always updated, no browser plugins or installation Data from CellBase database and DAS servers: genes, transcripts, exons, SNPs, TFBS, mirna targets, Other features: Multi species, API oriented, integrable, plugin framework,...
20 NGS data analysis pipeline VARIANT, a genomic VARIant ANalysis Tool A genomic variant effect predictor tool, recursive acronym: VARIANT combines CellBase Web Services and Genome Maps to build a cloud variant effect predictor, no installation, always updated Variant effects: VARIANT = VARIant ANalysis Tool ( accepted in NAR) Genomic location: intronic, non-synonymous, upstream, Regulatory location: TFBS, mirna targets, conserved regions, MySQL clusters +120GB of data Three interfaces: RESTful, web application and CLI Java RESTful WS API Fast! - GZip - Multi Thread Three clients available WEB browser (AJAX) WEB application CLI program
21 NGS data analysis pipeline Moving data analysis pipeline forward HPC for Genomics Consortium, CIPF and Bull leading a Consortium with UV, UPV and UJI Universities. More than 15 HPC computational researches and developers joined Extending data analysis pipeline to DNA assembly: new assembly algorithm and HPC implementation is being developed RNA-seq: RNA mapping, quantification, diff. expr. analysis, isoform detection,... Meth-seq: new bisulfite-seq mapping algorithm, diff. methylated region analysis, Copy Number: detect amplification and deletions, LOH,... Next sub-projects ChIP-seq Data analysis integration: eqtl(variant+expr), expr+methylation Distributing: MPI-CUDA, rcuda, Hadoop MapReduce
22 Next steps Variant NoSQL Database Clients: Currently thousands of VCFs are being generated, VCF files are being stored in repositories... not enough Motivation: variant analysis! Biological questions remain to be answered easily, ie: Which variants with more than 20x are shared in a single gene in more than 80% of disease samples? Too many variants for a relational SQL database, we need new solutions, new technology available: NoSQL distributed databases A NoSQL database and Java and RESTful WS API to store and analyze TeraBytes of variant data is being implemented using Hadoop Framework, release in a few months WEB browser Perl, python, Java,... RESTful WS API Variant Analysis Java API Hadoop Java API
23 Next steps Browsing NGS data, HTML5 web based solution Postulate: As data size increase people tend to develop desktop software for visualization and analysis i.e. GBrowse, Cytoscape, snpeff, etc. We took the opposite direction (Google and cloud approach): Moving high volumes of data is not scalable (~TB of data) Cloud distributed and collaborative solutions Based on Genome Maps visualization technology: Browse mapped reads Examine variants from a familiy Analysis widgets integrated Thousands of files ~TBs of data SAM/BAM data SAM tools API VCF/NoSQL data VCF homemade API RESTful WS API Specification Java implementation provided JSON Client
24 Next steps HPC for Genomics Consortium HPC for Genomics Consortium, CIPF and Bull leading a Consortium with UV, UPV and UJI Universities. More than 15 HPC computational researches and developers joined Last weeks some computational researchers from Germany or Chinese BGI have raised interest so far in this project Motivation: Computation is a bottleneck and the current development model based in many small solutions does not solve big problems As sequencing technology throughput improve and prices go down this problem will become much worse Focus in analysis, many more analysis are needed So, bioinformatician with strong background in computation is invited to participate and contact us
25 Next Steps Future Clinic project Future Clinic project aims to introduce personalized medicine in the Valencian health system Consortium formed by Indra, Bull, Valencian Health System, Universitat Politecnica, and led by CIPF Specific goals: Develop algorithms and software to process, store and analyze genomic information Integrate this information into health system: Digital Clinic History Develop software and technology to aid to diagnosis This project is funded by the Generalitat Valenciana and FEDER
26 Conclusions NGS technology is pushing computer needs, new solutions are necessary The huge data demands faster and more efficient software solution, HPC can be applied New architecture model, more cloud solutions It's not only a hardware problem, we need better software Analysis software to handle this data is being developed
27 Acknowledgements Head of Department Joaquín Dopazo HPC developers Joaquín Tárraga Cristina Y. González Víctor Requena (BULL) Web and Java developers Franscisco Salavert Ruben Sánchez Others developers Alejandro de María Roberto Alonso Jose Carbonell And the rest of the department!! HPC Consortium: - Ignacio Blanquer (UPV) - José Salavert (UPV) - Jose Duato (UPV) - Juan M. Orduña (UV) - Vicente Arnau (UV) - Enrique Quintana (UJI) - Mario Gómez (BULL) -... Bull Extreme Computing (BUX)
New solutions for Big Data Analysis and Visualization
New solutions for Big Data Analysis and Visualization From HPC to cloud-based solutions Barcelona, February 2013 Nacho Medina imedina@cipf.es http://bioinfo.cipf.es/imedina Head of the Computational Biology
More informationNew advanced solutions for Genomic big data Analysis and Visualization
New advanced solutions for Genomic big data Analysis and Visualization Ignacio Medina (Nacho) im411@cam.ac.uk http://www.hpc.cam.ac.uk/compbio Head of Computational Biology Lab HPC Service, University
More informationNew advanced solutions for Genomic big data Analysis and Visualization
New advanced solutions for Genomic big data Analysis and Visualization Ignacio Medina (Nacho) im411@cam.ac.uk http://bioinfo.cipf.es/imedina Head of Computational Biology Lab HPC Service, University of
More informationOpenCB a next generation big data analytics and visualisation platform for the Omics revolution
OpenCB a next generation big data analytics and visualisation platform for the Omics revolution Development at the University of Cambridge - Closing the Omics / Moore s law gap with Dell & Intel Ignacio
More informationOpenCB development - A Big Data analytics and visualisation platform for the Omics revolution
OpenCB development - A Big Data analytics and visualisation platform for the Omics revolution Ignacio Medina, Paul Calleja, John Taylor (University of Cambridge, UIS, HPC Service (HPCS)) Abstract The advent
More informationPreparing the scenario for the use of patient s genome sequences in clinic. Joaquín Dopazo
Preparing the scenario for the use of patient s genome sequences in clinic Joaquín Dopazo Computational Medicine Institute, Centro de Investigación Príncipe Felipe (CIPF), Functional Genomics Node, (INB),
More informationProgramming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
More informationGPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
More informationData Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute
Data Analysis & Management of High-throughput Sequencing Data Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute Current Issues Current Issues The QSEQ file Number files per
More informationHadoop-BAM and SeqPig
Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3 1 Department of Computer
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationScalable Cloud Computing Solutions for Next Generation Sequencing Data
Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of
More informationFocusing on results not data comprehensive data analysis for targeted next generation sequencing
Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes
More informationHadoop. Bioinformatics Big Data
Hadoop Bioinformatics Big Data Paolo D Onorio De Meo Mattia D Antonio p.donoriodemeo@cineca.it m.dantonio@cineca.it Big Data Too much information! Big Data Explosive data growth proliferation of data capture
More informationIntroduction. Xiangke Liao 1, Shaoliang Peng, Yutong Lu, Chengkun Wu, Yingbo Cui, Heng Wang, Jiajun Wen
DOI: 10.14529/jsfi150104 Neo-hetergeneous Programming and Parallelized Optimization of a Human Genome Re-sequencing Analysis Software Pipeline on TH-2 Supercomputer Xiangke Liao 1, Shaoliang Peng, Yutong
More informationIntroduction to NGS data analysis
Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Sequencing Illumina platforms Characteristics: High
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More informationAccelerating variant calling
Accelerating variant calling Mauricio Carneiro GSA Broad Institute Intel Genomic Sequencing Pipeline Workshop Mount Sinai 12/10/2013 This is the work of many Genome sequencing and analysis team Mark DePristo
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationDelivering the power of the world s most successful genomics platform
Delivering the power of the world s most successful genomics platform NextCODE Health is bringing the full power of the world s largest and most successful genomics platform to everyday clinical care NextCODE
More informationImportance of Statistics in creating high dimensional data
Importance of Statistics in creating high dimensional data Hemant K. Tiwari, PhD Section on Statistical Genetics Department of Biostatistics University of Alabama at Birmingham History of Genomic Data
More informationShouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center
Computational Challenges in Storage, Analysis and Interpretation of Next-Generation Sequencing Data Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Next Generation Sequencing
More informationBig Data Challenges in Bioinformatics
Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?
More informationCSE-E5430 Scalable Cloud Computing. Lecture 4
Lecture 4 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 5.10-2015 1/23 Hadoop - Linux of Big Data Hadoop = Open Source Distributed Operating System
More informationHadoopizer : a cloud environment for bioinformatics data analysis
Hadoopizer : a cloud environment for bioinformatics data analysis Anthony Bretaudeau (1), Olivier Sallou (2), Olivier Collin (3) (1) anthony.bretaudeau@irisa.fr, INRIA/Irisa, Campus de Beaulieu, 35042,
More informationOptimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
More informationCloud-based Analytics and Map Reduce
1 Cloud-based Analytics and Map Reduce Datasets Many technologies converging around Big Data theme Cloud Computing, NoSQL, Graph Analytics Biology is becoming increasingly data intensive Sequencing, imaging,
More informationText file One header line meta information lines One line : variant/position
Software Calling: GATK SAMTOOLS mpileup Varscan SOAP VCF format Text file One header line meta information lines One line : variant/position ##fileformat=vcfv4.1! ##filedate=20090805! ##source=myimputationprogramv3.1!
More informationProcessing NGS Data with Hadoop-BAM and SeqPig
Processing NGS Data with Hadoop-BAM and SeqPig Keijo Heljanko 1, André Schumacher 1,2, Ridvan Döngelci 1, Luca Pireddu 3, Matti Niemenmaa 1, Aleksi Kallio 4, Eija Korpelainen 4, and Gianluigi Zanetti 3
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More information-> Integration of MAPHiTS in Galaxy
Enabling NGS Analysis with(out) the Infrastructure, 12:0512 Development of a workflow for SNPs detection in grapevine From Sets to Graphs: Towards a Realistic Enrichment Analy species: MAPHiTS -> Integration
More informationCase Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke wienke@rz.rwth-aachen.de ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationAccelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationAnalysis of NGS Data
Analysis of NGS Data Introduction and Basics Folie: 1 Overview of Analysis Workflow Images Basecalling Sequences denovo - Sequencing Assembly Annotation Resequencing Alignments Comparison to reference
More informationENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013
ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE October 2013 Introduction As sequencing technologies continue to evolve and genomic data makes its way into clinical use and
More informationSeqPig: simple and scalable scripting for large sequencing data sets in Hadoop
SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop André Schumacher, Luca Pireddu, Matti Niemenmaa, Aleksi Kallio, Eija Korpelainen, Gianluigi Zanetti and Keijo Heljanko Abstract
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationHBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367
HBase A Comprehensive Introduction James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367 Overview Overview: History Began as project by Powerset to process massive
More informationAn Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
More informationLocal Alignment Tool Based on Hadoop Framework and GPU Architecture
Local Alignment Tool Based on Hadoop Framework and GPU Architecture Che-Lun Hung * Department of Computer Science and Communication Engineering Providence University Taichung, Taiwan clhung@pu.edu.tw *
More informationTrends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationLarge-scale Research Data Management and Analysis Using Globus Services. Ravi Madduri Argonne National Lab University of Chicago @madduri
Large-scale Research Data Management and Analysis Using Globus Services Ravi Madduri Argonne National Lab University of Chicago @madduri Outline Who we are Challenges in Big Data Management and Analysis
More informationSAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011
SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications Jürgen Primsch, SAP AG July 2011 Why In-Memory? Information at the Speed of Thought Imagine access to business data,
More informationVersion 5.0 Release Notes
Version 5.0 Release Notes 2011 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249 (elsewhere) +1.734.769.7074 (fax) www.genecodes.com
More informationBenchmarking Hadoop & HBase on Violin
Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages
More informationOverview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
More informationComparing Methods for Identifying Transcription Factor Target Genes
Comparing Methods for Identifying Transcription Factor Target Genes Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics Folie 1 Transcriptional Regulation TF
More informationDiscovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG) roderic.guigo@crg.cat
Bioinformatique et Séquençage Haut Débit, Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG) roderic.guigo@crg.cat 1 RNA Transcription to RNA and subsequent
More informationData processing goes big
Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,
More informationPractical Solutions for Big Data Analytics
Practical Solutions for Big Data Analytics Ravi Madduri Computation Institute (madduri@anl.gov) Paul Dave (pdave@uchicago.edu) Dinanath Sulakhe (sulakhe@uchicago.edu) Alex Rodriguez (arodri7@uchicago.edu)
More informationGeneProf and the new GeneProf Web Services
GeneProf and the new GeneProf Web Services Florian Halbritter florian.halbritter@ed.ac.uk Stem Cell Bioinformatics Group (Simon R. Tomlinson) simon.tomlinson@ed.ac.uk December 10, 2012 Florian Halbritter
More informationHigh Performance Compu2ng Facility
High Performance Compu2ng Facility Center for Health Informa2cs and Bioinforma2cs Accelera2ng Scien2fic Discovery and Innova2on in Biomedical Research at NYULMC through Advanced Compu2ng Efstra'os Efstathiadis,
More informationAccelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationUnleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers
Unleashing the Performance Potential of GPUs for Atmospheric Dynamic Solvers Haohuan Fu haohuan@tsinghua.edu.cn High Performance Geo-Computing (HPGC) Group Center for Earth System Science Tsinghua University
More information10- High Performance Compu5ng
10- High Performance Compu5ng (Herramientas Computacionales Avanzadas para la Inves6gación Aplicada) Rafael Palacios, Fernando de Cuadra MRE Contents Implemen8ng computa8onal tools 1. High Performance
More informationHIGH PERFORMANCE CONSULTING COURSE OFFERINGS
Performance 1(6) HIGH PERFORMANCE CONSULTING COURSE OFFERINGS LEARN TO TAKE ADVANTAGE OF POWERFUL GPU BASED ACCELERATOR TECHNOLOGY TODAY 2006 2013 Nvidia GPUs Intel CPUs CONTENTS Acronyms and Terminology...
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
More informationNext Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
More informationSeeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
More informationIntroduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationScaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
More informationAcceleration for Personalized Medicine Big Data Applications
Acceleration for Personalized Medicine Big Data Applications Zaid Al-Ars Computer Engineering (CE) Lab Delft Data Science Delft University of Technology 1" Introduction Definition & relevance Personalized
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationBuilding a Top500-class Supercomputing Cluster at LNS-BUAP
Building a Top500-class Supercomputing Cluster at LNS-BUAP Dr. José Luis Ricardo Chávez Dr. Humberto Salazar Ibargüen Dr. Enrique Varela Carlos Laboratorio Nacional de Supercómputo Benemérita Universidad
More informationExperimentation on Cloud Databases to Handle Genomic Big Data
Experimentation on Cloud Databases to Handle Genomic Big Data Presented by: Abraham Gómez, M.Sc., B.Sc. Academic Advisor: Alain April. Ph.D,M.Sc.A, B.A. abraham-segundo.gomez.1@ens.etsmtl.ca Agenda 1 2
More informationAn Oracle White Paper June 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database
An Oracle White Paper June 2012 High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database Executive Overview... 1 Introduction... 1 Oracle Loader for Hadoop... 2 Oracle Direct
More informationTowards Integrating the Detection of Genetic Variants into an In-Memory Database
Towards Integrating the Detection of Genetic Variants into an 2nd International Workshop on Big Data in Bioinformatics and Healthcare Oct 27, 2014 Motivation Genome Data Analysis Process DNA Sample Base
More informationAn example of bioinformatics application on plant breeding projects in Rijk Zwaan
An example of bioinformatics application on plant breeding projects in Rijk Zwaan Xiangyu Rao 17-08-2012 Introduction of RZ Rijk Zwaan is active worldwide as a vegetable breeding company that focuses on
More informationG E N OM I C S S E RV I C ES
GENOMICS SERVICES THE NEW YORK GENOME CENTER NYGC is an independent non-profit implementing advanced genomic research to improve diagnosis and treatment of serious diseases. capabilities. N E X T- G E
More informationSGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, 2012. Abstract. Haruna Cofer*, PhD
White Paper SGI High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems Haruna Cofer*, PhD January, 2012 Abstract The SGI High Throughput Computing (HTC) Wrapper
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
More informationCloud-Based Big Data Analytics in Bioinformatics
Cloud-Based Big Data Analytics in Bioinformatics Presented By Cephas Mawere Harare Institute of Technology, Zimbabwe 1 Introduction 2 Big Data Analytics Big Data are a collection of data sets so large
More informationComputational infrastructure for NGS data analysis. José Carbonell Caballero Pablo Escobar
Computational infrastructure for NGS data analysis José Carbonell Caballero Pablo Escobar Computational infrastructure for NGS Cluster definition: A computer cluster is a group of linked computers, working
More informationHPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
More informationEuropean Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute
European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute Justin Paschall Team Leader Genetic Variation / EGA ! European Genome-phenome
More information22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationSimilarity Search in a Very Large Scale Using Hadoop and HBase
Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France
More informationEvaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
More informationComparing SQL and NOSQL databases
COSC 6397 Big Data Analytics Data Formats (II) HBase Edgar Gabriel Spring 2015 Comparing SQL and NOSQL databases Types Development History Data Storage Model SQL One type (SQL database) with minor variations
More informationHPC ABDS: The Case for an Integrating Apache Big Data Stack
HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org
More informationUCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production
Page 1 of 6 UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production February 05, 2010 Newsletter: BioInform BioInform - February 5, 2010 By Vivien Marx Scientists at the department
More informationRETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison
RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationNext generation sequencing (NGS)
Next generation sequencing (NGS) Vijayachitra Modhukur BIIT modhukur@ut.ee 1 Bioinformatics course 11/13/12 Sequencing 2 Bioinformatics course 11/13/12 Microarrays vs NGS Sequences do not need to be known
More informationOverview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o igiro7o@ictp.it Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
More informationAGILENT S BIOINFORMATICS ANALYSIS SOFTWARE
ACCELERATING PROGRESS IS IN OUR GENES AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE GENESPRING GENE EXPRESSION (GX) MASS PROFILER PROFESSIONAL (MPP) PATHWAY ARCHITECT (PA) See Deeper. Reach Further. BIOINFORMATICS
More informationDNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky
DNA Mapping/Alignment Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky Overview Summary Research Paper 1 Research Paper 2 Research Paper 3 Current Progress Software Designs to
More informationA survey on platforms for big data analytics
Singh and Reddy Journal of Big Data 2014, 1:8 SURVEY PAPER Open Access A survey on platforms for big data analytics Dilpreet Singh and Chandan K Reddy * * Correspondence: reddy@cs.wayne.edu Department
More information