1 GENOME ANALYTICS: Performance in-situ. DDN BPGW15, Hanif Khalak, September 22, 2015, Cambridge, UK
2 Weill-Cornell in Qatar
Medical Education: Pre-medical (2-yr) and WCMC-Q Medical (4-yr MD); Math & Science identical to the NY curriculum; cross-registrations; 100% USMLE success and residency placement in the US
Biomedical Research: Human Genetics & Genomics; Proteomics & Metabolomics; Biostatistics & Epidemiology; Molecular & Cell Biology; Stem Cells & Tumor Microenvironment; Biophysics & Physiology; Global & Public Health
3 Genomic Big Data at WCMC-Q
Whole Genome (30X+ coverage): 200GB+, block-gzipped; 1G+ sequence objects; derived variants = 20GB+ compressed, ~5M features
Whole Exome (50X+ coverage): 15GB+, block-gzipped; 50M+ sequence objects; derived variants = 500MB+ compressed, 0.5M features
Genome Repository: genomes and exomes across multiple studies; great value in meta- and re-analysis
HPC Infrastructure: 30 mixed nodes (1K cores, 3TB RAM); 1PB DDN GridScaler (GPFS); upgrade to 3x capacity by end of 2015
Software APIs: C/C++ (samtools, bamtools), Java (Picard), Perl (Bio::DB::SAM), Python (pysam)
Databases: postgres, gemini
4 Analytics on Genome Data in situ
Big Data: a 200GB compressed BAM file → up to 1B sequence reads per genome; a 500TB genome repository → 2PB+ in HDFS, RDB, MongoDB; significant storage and compute resources required with most solutions
Analytics API: next-gen analysis is still in its early stages → iterative development; skill gap between molecular genomics (science) and analysis informatics (programming); Hadoop/Spark/etc. are still difficult for scientists, even bioinformaticians; SQL skills are more common and easier to pick up
Performance: interactivity for data access and query response times
5 Gemini: HPC DB for Genome Variants
6 BAM Data API Options
Access Mode  | Files                                         | RDBMS with ETL   | NoSQL                      | SQL, no ETL
Direct (API) | SDKs (samtools, Hadoop-BAM)                   | CloverETL, ...   | MongoDB, Cassandra         | N/A
SQL          | Hive / Drill / Impala / HAWQ, HDFS connectors | ODBC, JDBC, DBIx | Simba (ODBC for Cassandra) | PostgreSQL FDW (multicorn, CitusDB, PG-Strom); Apache Drill
7 SQL Options in situ
Postgres FDW: FDW = foreign data wrapper, an API for pluggable storage back-ends exposed as foreign tables; Multicorn is a Python FDW framework; recent work on accelerated offloading with GPUs: PG-Strom (OpenCL), MapD (CUDA)
Apache Drill: SQL-on-streams framework open-sourced by MapR; limited stream formats supported; only recently added support for gzipped streams; next project, TBD
Any approach would benefit from accelerated I/O on the data files
8 Use Case: Qatar Genome Browser (QGB)
9 Genomic Big Data Query
Input: list of genome files/ids and regions of interest
Output: JSON-style object(s)
Query: single exome (coding regions), ~5GB file; a single request returns read info for 100 chromosome intervals
Performance using the SDK from the CLI: ~5.5s wall-clock time, <100MB RAM
Open question: how does this scale with query size, query complexity, and number of data files?
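For illustration, a minimal sketch of this kind of interval query with the pysam API (the file path and regions are placeholders; the BAM is assumed to be coordinate-sorted and indexed):

import json
import pysam

# Open an indexed, coordinate-sorted exome BAM (placeholder path).
bam = pysam.AlignmentFile("sample.exome.bam", "rb")

# The use case above sends ~100 such intervals per request.
regions = [("1", 100000, 100500), ("2", 200000, 200500)]

results = []
for contig, start, end in regions:
    for read in bam.fetch(contig, start, end):
        results.append({
            "seq_id": read.reference_name,
            "start": read.reference_start,
            "end": read.reference_end,
            "strand": "-" if read.is_reverse else "+",
            "cigar": read.cigarstring,
            "mapq": read.mapping_quality,
        })

print(json.dumps(results[:5], indent=2))   # JSON-style output, as in the slide
bam.close()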
10 Multithreading using CLI (chart: runtime (s), % CPU, speedup (x))
11 PostgreSQL Foreign Data Wrapper (FDW)
Foreign table API for PostgreSQL; many drivers available (SQL, NoSQL, CSV, gzip, HDFS)
Multicorn: third-party FDW framework for writing custom data-source drivers in Python, e.g. RSS, IMAP, Google, Hive, VCF
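As an illustration only (not the driver used in this work), a bare-bones Multicorn wrapper exposing BAM alignments as foreign-table rows could look like this; the class name, the bam_path option, and the column names are assumptions:

from multicorn import ForeignDataWrapper
import pysam

class BamFdw(ForeignDataWrapper):
    """Sketch: expose alignments from one BAM file as rows of a foreign table."""

    def __init__(self, options, columns):
        super(BamFdw, self).__init__(options, columns)
        self.path = options["bam_path"]   # hypothetical table option
        self.columns = columns

    def execute(self, quals, columns):
        # A real driver would push contig/start/end quals down into fetch();
        # this sketch simply scans the whole file.
        bam = pysam.AlignmentFile(self.path, "rb")
        for read in bam.fetch(until_eof=True):
            yield {
                "contig": read.reference_name,
                "reference_start": read.reference_start,
                "reference_end": read.reference_end,
                "cigar": read.cigarstring,
            }
        bam.close()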
12 Foreign data support across databases:
MySQL: FEDERATED Storage Engine (MySQL only)
MSSQL: Text File Driver (CSV only)
Firebird: External Table (CSV only)
DB2: complete SQL/MED implementation
PostgreSQL: FDW (9.x)
13 FDW / multicorn for BAM data

CREATE OR REPLACE FUNCTION bammeta() RETURNS SETOF bam_core AS
$BODY$
DECLARE
    crow called_exome_targets_%rowtype;
    brow bam_core%rowtype;
BEGIN
    -- Loop over the exome target regions and pull the matching
    -- alignments from the bam_core foreign table.
    FOR crow IN SELECT * FROM called_exome_targets_ LOOP
        PERFORM * FROM bam_core
            WHERE bam_core.contig = crow.contig
              AND bam_core.reference_start >= crow.start
              AND bam_core.reference_end <= crow."end";
    END LOOP;
END;
$BODY$ LANGUAGE 'plpgsql';

Optimization: move the loop into Python → parallel offload
14 FDW / multicorn performance (chart): whole exome, 10K regions
15 Future Work
Software Methods:
CitusDB: distributed query engine; modify their approach to offload instead of clustering the query
PostgreSQL + PG-Strom: OpenCL-based extension of FDW for query-on-accelerator (GPU)
Apache Drill: Java adapter for .bam
I/O System: DDN IME!
16 Storage Performance with DDN IME A touching storey, full of tiers
17 Storage Performance vs. Capacity (figure: storage latency, log scale; Seagate, 2015)
18 Touch Rate (Steve Hetzler, IBM Architect)
Touch rate: a scale-free metric for evaluating storage performance. Definition: the proportion of a storage device's or system's total content that can be accessed per unit time (e.g. per year).
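As a back-of-the-envelope illustration of the metric (the numbers below are illustrative, not from the talk): a single fast drive can touch its full contents many times per year, while a large archive behind a modest pipe has a touch rate well below 1.

# Illustrative touch-rate arithmetic (not from the talk).
SECONDS_PER_YEAR = 365 * 24 * 3600

def touch_rate(capacity_bytes, bandwidth_bytes_per_s):
    """Fraction of total content that can be read per year."""
    return bandwidth_bytes_per_s * SECONDS_PER_YEAR / capacity_bytes

# Single 4 TB drive at 150 MB/s sustained: ~1180 full passes per year.
print(touch_rate(4e12, 150e6))

# 200 PB archive behind a 1 GB/s pipe: ~0.16 of the content per year.
print(touch_rate(2e17, 1e9))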
20-24 Touch-rate figures (Hetzler & Coughlin, 2015)
25 Hybrid Storage: Tiering Up. FLASH! $$$!! System design??? (figure: QLogic, 2015)
26 ~1% tiered, software intermediation → DDN IME (Hetzler & Coughlin, 2015)
27 DDN IME (Infinite Memory Engine): before IME vs. after IME (diagram: POSIX, MPI-IO, GPFS, Lustre, ...)
28 DDN IME: I/O and Application Acceleration (diagram: PostgreSQL + FDW on DDN IME software-defined storage, BAM files)
A natural platform for high-performance data APIs; on-appliance BAM file processing TBD
29 Acknowledgements
DDN BPGW15 / Collaborations: George Vacek (DDN); Laurent Thiers (DDN); Sanger Centre; Will Schepp (EMC/Pivotal); Gaurav Kaul (Intel); Utku Azman (CitusDB)
WCMC-Q: Jillian Rowe, Greg Smith (HPC); Karsten Suhre (Bioinformatics); Khaled Fakhro (Genetic Medicine); Alice Aleem (Human Genetics); Shahzad Jafri (CIO)
30
31 Genomic Big Data - Options
Standard data technologies (SQL, NoSQL, HDFS, ...) require replication: high cost of additional (slow) storage
More analysts have SQL skills than MapReduce / Hadoop skills
Data API goals: an in-place data repository; ease of integrated queries (raw + metadata); high-performance queries (threads, cores, RAM, I/O, ...)
32 Genomic Clinical Decision System (CDS) (figure: Intel, 2014)
33 WCMC-Q Data Engine: Scale to Cloud (architecture diagram)
Clients: FW, CLI, Galaxy, web apps, R, MATLAB; clouds: AWS, Google, Rackspace
Query: ID#, SQL, JSON, ...; response: TSV, JSON, XML, ...
Virtual Data Engine: slurm, CLI, Hadoop, Spark, YARN; storage: SQL, NoSQL, HDFS, CEPH
Omics files (.BAM, .VCF, .BED) on 400TB+ GPFS; annotation files; other data files; bandwidth?; remote sites (FTP, ...): NCBI, UCSC, EMBL
34 Data Federation vs. Virtualization
35 No Shortage of NoSQL Big Data Analysis Platforms!
Query/Scripting Language: SCOPE, AQL, Meteor, PigLatin, Jaql, Sawzall, Dremel SQL
High-Level API / Compiler-Optimizer: SCOPE, DryadLINQ, Algebricks, Spark, Sopremo, Java/Scala, Pig, Cascading, Jaql, FlumeJava, Dremel SQL
Low-Level API / Execution Engine: Dryad, Hyracks, RDDs (Spark), Nephele/PACT, Tez, MapReduce (Hadoop MapReduce, Google MapReduce), Dremel Dataflow Processor
Data Store: Cosmos, TidyFS, Hyracks LSM Storage, HBase, HDFS, GFS, Bigtable, relational row/column storage
Resource Management: Quincy, Mesos, YARN, Omega
36 Example Query
Query: depth of coverage along the genome. What percent of sites have coverage 0-10, 11-20, 21-30, and so on up to 100? What percentage of SNP variants fall in each of these bins? Within specific regions of the genome: 100, 1,000, and 10,000 regions, and the whole exome (176,715 exon regions, human genome b37/hg19).
Target: sample whole-exome BAM file from the 1000 Genomes project, HG00096.mapped.illumina.mosaik.GBR.exome bam
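A minimal sketch of how such a coverage-binning query could be expressed with pysam (the BAM path and region list are placeholders; count_coverage returns per-base depth split over A/C/G/T, and the bin edges here only approximate the slide's 10x-wide bins):

import pysam

def bin_coverage(bam_path, regions):
    """Histogram per-base depth over the given regions into 10x-wide bins."""
    counts = [0] * 11                      # bins for depth 0-9, 10-19, ..., 100+
    bam = pysam.AlignmentFile(bam_path, "rb")
    for contig, start, end in regions:
        # count_coverage returns four arrays (A, C, G, T) of per-base depth.
        acgt = bam.count_coverage(contig, start, end)
        for depths in zip(*acgt):
            depth = sum(depths)
            counts[min(depth // 10, 10)] += 1
    bam.close()
    return counts

# e.g. bin_coverage("sample.exome.bam", [("20", 100000, 100200)])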
37 Genomic Big Data Query Issues
Scaling: many files (1000+); batched queries (e.g. visualization); parallel requests (e.g. many users)
Locality: central GPFS store vs. distributed FS
Network speed: bandwidth, latency
Ease of query integration: SQL vs. R vs. Hadoop vs. Spark vs. Pig vs. ...
38 Comparison Methods
Perl-MCE: multithreaded, shared-memory parallelism; queries are limited by the samtools API
PostgreSQL / multicorn: multiprocess; arbitrary SQL, in theory; uncertain performance
39 Alignment Metadata
Per BAM file, retrieve alignment metadata for region(s): seq_id, start, end, strand, cigar_str, query.start, query.end, dna, query.dna, qscore, qual, tagpaired
a. 100, 1,000, and 10,000 regions
b. whole exome, 176,715 regions
40 MCE: Many-core Processing with Perl
Threading with shared memory; workers use callback functions
Chunking can reduce IPC overhead and the likelihood that workers finish tasks at the same time
Serial I/O is better than random I/O from the workers, especially with caching
41 MCE: Many-Core Engine
Channels: separate threads; serialized communication
Input: scatter / gather; serialized output; queue; sync (event, term)
Benchmark: Net::Ping + MCE with 30 event loops → 25K IPs/sec
42 PostgreSQL (FDW) / multicorn

Load the wrapper:

CREATE SERVER alchemy_srv
  FOREIGN DATA WRAPPER multicorn
  OPTIONS (wrapper 'multicorn.sqlalchemyfdw.sqlalchemyfdw');

Attach a data source:

CREATE FOREIGN TABLE mysql_datatable (
  id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone
) SERVER alchemy_srv OPTIONS (
  tablename 'datatable',
  db_url 'mysql://root:password@ /testing'
);
43 FDW / multicorn (2)
Rewrite as a PL/pgSQL function that calls the multicorn code to pull BAM data
For the SQL user, this becomes: SELECT * FROM input(contig, start, end)
Parallelize the basic data pull in Python → much faster (see the sketch below)
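One way the Python-side parallel pull could look (a sketch only, not the project's code; each worker opens its own pysam handle, since handles should not be shared across processes; the path is a placeholder):

from multiprocessing import Pool
import pysam

BAM_PATH = "sample.exome.bam"   # placeholder

def pull_region(region):
    """Fetch alignment records for one (contig, start, end) interval."""
    contig, start, end = region
    bam = pysam.AlignmentFile(BAM_PATH, "rb")
    rows = [(r.reference_name, r.reference_start, r.reference_end, r.cigarstring)
            for r in bam.fetch(contig, start, end)]
    bam.close()
    return rows

def pull_parallel(regions, workers=8):
    """Fan the per-region pulls out over a process pool and flatten the results."""
    with Pool(workers) as pool:
        per_region = pool.map(pull_region, regions)
    return [row for rows in per_region for row in rows]

# e.g. pull_parallel([("1", 1000, 2000), ("2", 5000, 6000)])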
44 Lessons
In-situ queries on genome data files are possible and can be parallelized
Reliance on the samtools API limits query options
Using the FDW framework, files can be queried with SQL
Basic queries have performance similar to the CLI
Temporary tables can be populated and analytics continued in pure SQL
Joins still need to be accelerated
45