Cloud-based Analytics and Map Reduce




Datasets
- Many technologies converging around Big Data theme
  - Cloud Computing, NoSQL, Graph Analytics
- Biology is becoming increasingly data intensive
  - Sequencing, imaging, and other instruments
- Computing with big datasets is fundamentally different from big compute on small datasets
  - I/O centric, data movement is critical

Cloud-based Analytics
- Must be specific when talking cloud
  - IaaS, PaaS, SaaS, ...
- Basic cloud ingredients
  - Self-service, on-demand compute and storage
  - Commodity hardware
  - High capacity
  - Everything has an API
- Leading example: Amazon Web Services
- Honorable mention: Google Compute Engine, OpenStack, CloudStack, VMware

Mapping Informatics to the Cloud: Good News
- Good news: you don't have to change much
- Using the basic IaaS building blocks we can handle most traditional use cases
  - Clusters, client-server, N-tier deployments
- Running standard HPC clusters in the cloud is simple
- StarCluster: http://star.mit.edu/cluster/
  - Create EC2 clusters in minutes
  - OpenMPI, ATLAS, LAPACK, NumPy, SciPy
  - SGE, IPython, Condor, MPICH2 plugins

Mapping Informatics to the Cloud
- Cloud computing is democratizing access to IT infrastructure resources
  - Anyone can have access to massive compute and storage resources in minutes
  - This changes the way we solve scientific problems
- Cloud computing is not a silver bullet for scalability
  - Cloud providers have primarily focused on horizontal scaling and not on HPC

Map Reduce
We need a computing framework that is...
- able to handle huge datasets: 1 TB+
- massively parallel: runs on commodity hardware
- fault tolerant: hardware fails
- locality aware: moving computation is cheaper than moving data
- easy to use

Map Reduce
- Original paper by Google in 2004
- Introduced a simplified parallel processing model
- Used to build Google search indexes
- Users specify a Map function and a Reduce function
- Framework manages task distribution, orchestration, data transfers, redundancy, and fault tolerance

Map Reduce
- Map Reduce is a simple approach to parallel programming
- Existing algorithms must be translated into one or more map/reduce steps
- Batch oriented
- Requires a distributed filesystem
- Map Reduce can be implemented on top of MPI
  - http://mapreduce.sandia.gov

Apache Hadoop
- Free implementation derived from Google MapReduce, written in Java
- Composed of many complementary projects
  - Core: set of components and interfaces for distributed filesystems and general I/O serialization
  - MapReduce: distributed data processing model and execution environment
  - HDFS: distributed file system that runs on large clusters

Hadoop Ecosystem
- Hadoop-related projects at Apache
- Hadoop has a large ecosystem of tools
  - HBase: non-relational, column-oriented database that runs on HDFS
  - Pig: data flow language for exploring datasets
  - Hive: distributed data warehouse with SQL-like query language
  - Mahout: machine learning and data mining library

Hadoop Components
- Core consists of compute layer (MapReduce) and storage layer (HDFS)
- Alternatives to HDFS
  - Amazon S3
  - GlusterFS
  - Lustre
- Many distributions/flavors of Hadoop exist
  - Apache
  - Cloudera
  - Amazon Elastic MapReduce
  - Intel

Intel Hadoop
- Core improvements and enterprise features
  - Encrypted HDFS
  - Faster job launch
  - Optimized for SSDs and 10GbE networking
  - Accelerated Hive queries
  - Premium support
  - Intel Manager
  - Multi-datacenter HBase deployments

Hadoop Job Flow
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow

Canonical Example - Word Count
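
The figure for this slide is not preserved in the transcript. As a minimal sketch of the word count idea, the plain-Python code below imitates the three phases (map, shuffle, reduce) in a single process. It does not use Hadoop, and the function names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are illustrative assumptions rather than part of any Hadoop API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group values by key, as a framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

In a real job the map calls would run on different nodes and the shuffle would move data across the network; only the two user-supplied functions would stay the same.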

Translating Workloads to Map Reduce
- Parallel programming requires parallel thinking
  - Domain decomposition
  - Exploit natural parallelism
- A map reduce job assumes independent mappers and reducers running in parallel on individual slices of data
- Share-nothing architecture: avoid communication and global data structures
- Exploit parallelism at each stage in workflow
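
To make domain decomposition and the share-nothing idea concrete, here is a small sketch that computes the GC fraction of a set of reads. The task, the function names, and the slice count are assumptions chosen for illustration. Each worker sees only its own slice and returns a small partial result, so the workers never communicate or touch a global structure.

```python
from typing import List, Sequence, Tuple


def split_into_slices(reads: Sequence[str], n_slices: int) -> List[Sequence[str]]:
    """Domain decomposition: cut the input into roughly equal, independent slices."""
    size = max(1, -(-len(reads) // n_slices))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def gc_partial(slice_of_reads: Sequence[str]) -> Tuple[int, int]:
    """Mapper-like worker: returns (gc_bases, total_bases) for its slice only."""
    gc = sum(read.count("G") + read.count("C") for read in slice_of_reads)
    total = sum(len(read) for read in slice_of_reads)
    return gc, total


def gc_fraction(reads: Sequence[str], n_slices: int = 4) -> float:
    """Reducer-like step: combine the partial results into one answer."""
    partials = [gc_partial(s) for s in split_into_slices(reads, n_slices)]
    gc = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return gc / total if total else 0.0
```

Because `gc_partial` depends only on its argument, the list comprehension in `gc_fraction` could be handed to a process pool or to separate cluster nodes without changing the result.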

Translating Workloads to Map Reduce
- Genomic analysis is ideally suited for the Hadoop framework
  - large, semi-structured, file-based data, parallel I/O
  - parallel processing by reads, samples, genes, etc.
- Hadoop interfaces exist for C, Java, Perl, R, ...
- Hadoop Streaming allows any executables to be mappers and reducers

Enter Hadoop Streaming
- Unix pipes for Hadoop
- Utility that comes with the Hadoop distribution
- Use any mapper and reducer
  - Perl
  - Bash
  - cat/wc
  - Binaries
- Easiest way to get started with Hadoop

    cat in.txt | mapper.sh | sort | reduce.sh > out.txt
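
As a sketch of what a streaming mapper and reducer might look like, the script below implements word count over tab-separated text records on standard input and output. The file name `wc.py` and its `map`/`reduce` argument are hypothetical choices for this example. The reducer relies on its input being sorted by key, which is what the `sort` step in a local pipe (or the framework's shuffle) provides.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per key; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical command-line convention: "wc.py map" or "wc.py reduce".
    if len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
        step = mapper if sys.argv[1] == "map" else reducer
        for record in step(sys.stdin):
            print(record)
```

A local dry run could then mirror the pipe idea with `cat in.txt | python wc.py map | sort | python wc.py reduce > out.txt` before submitting the same script to a cluster.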

Revolution R with Hadoop
- Series of R connectors to Hadoop features
- Write MapReduce jobs in R using Hadoop Streaming
- Import tables from HDFS and HBase
- https://github.com/revolutionanalytics/rhadoop
- http://www.revolutionanalytics.com/why-revolutionr/whitepapers/r-and-hadoop-big-data-analytics.pdf

Elastic MapReduce
- Hadoop on AWS
- Amazon realized customers were spending a lot of time configuring and operating Hadoop clusters on EC2
- EMR is a service that runs on EC2 and handles set-up, teardown, and other Hadoop details
  - Charged by instance-hour
  - Instances can be terminated automatically when your job finishes
- Adds Job Flows feature
  - Unique to EMR

Elastic MapReduce Features
- Elastic
  - Add nodes to a running cluster
  - New: variable node count in each flow step
  - New: easier to dynamically resize the number of nodes in use
- Stores inputs and outputs in S3
  - New: can use multi-part HTTP upload if configured
- Easy to use
  - Job Flows, debugging
- Support for On-Demand and Reserved Instances
  - Recent support for Spot and VPC instances!
- Support for bootstrap actions
- Support for Pig, Hive, Hadoop 0.20.*

Elastic MapReduce: Creating Job Flows

Elastic MapReduce: Monitoring and Debugging

Elastic MapReduce: Debugging

Translating Workloads to Map Reduce
- The most efficient programs require use of Java and understanding Hadoop internals
- Hadoop has more scheduling and execution overhead compared to some traditional HPC environments
  - Hadoop can be integrated with cluster schedulers like SGE
- Moving large amounts of data into HDFS can be slow
  - Filesystem alternatives include S3, GlusterFS, Lustre, HTTP
  - HDFS can greatly benefit from SSDs

Data Movement: A Few Tips
- Primary hurdle in adopting the cloud
- Avoid using SCP or other TCP-based transfers unless you tune your settings
  - http://www.psc.edu/index.php/networking/641-tcp-tune
- Alternative transports: GridFTP, Aspera fasp, BitTorrent
- AWS offers physical import/export via FedEx
- Aggregate the data within your job if possible
  - Ingest data in preprocess phase or via P2P torrents
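
One way to see why untuned TCP transfers struggle on long links is the bandwidth-delay product: a single stream can have at most one window of data in flight per round trip. The sketch below works through that arithmetic. The link speed, round-trip time, and window size in the usage note are hypothetical figures, not measurements from the talk.

```python
def bandwidth_delay_product_bytes(bandwidth_mbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a link of this speed and latency full."""
    bits_in_flight = bandwidth_mbps * 1e6 * (rtt_ms / 1e3)
    return bits_in_flight / 8


def max_throughput_mbps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on single-stream TCP throughput for a given window size."""
    return window_bytes * 8 / (rtt_ms / 1e3) / 1e6


if __name__ == "__main__":
    # Hypothetical link: 1 Gbit/s with an 80 ms round trip.
    print(bandwidth_delay_product_bytes(1000, 80))
    # Hypothetical untuned window of 64 KiB on the same path.
    print(max_throughput_mbps(64 * 1024, 80))
```

With these assumed numbers the link needs roughly 10 MB in flight to stay full, while a 64 KiB window caps a single stream at under 7 Mbit/s, which is why window tuning or parallel transports matter for bulk uploads.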

Life Science Example
- Cloud-scale RNA-sequencing differential expression analysis with Myrna
  - Langmead et al. Genome Biology 2010, 11:R83
- RNA-Seq analysis pipeline
  - Focused on differential expression analysis between genes
  - Complementary to whole transcriptome assembly (e.g. Cufflinks)
- Workflow contains 7 stages
  - Bowtie for alignment and R/Bioconductor for EM and statistics

Stage 1 - Preprocess
- Process FASTQ input list
  - Optional dump from .sra format
- Assign sample names
- Copy into HDFS
- Parallel across input files

Stage 2 - Align
- Align reads to reference genome using Bowtie
- Each node independently obtains the Bowtie index from a local or shared filesystem (HDFS)
- Parallel across reads

Stage 3 - Overlap
- Calculate overlaps between alignments and predefined gene intervals
- Aggregate counts for each genomic feature
- Parallel across alignments
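
The overlap-and-count step can be sketched as follows. This is a naive illustration and not Myrna's implementation: the half-open `(chromosome, start, end)` interval convention, the type aliases, and the function names are assumptions, and the nested scan would be replaced by an interval index in a real pipeline.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]            # (chromosome, start, end), half-open
GeneIntervals = Dict[str, List[Interval]]  # gene name -> its intervals


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlaps(alignments: Iterable[Interval], genes: GeneIntervals) -> Counter:
    """Count, per gene, the alignments that overlap any of its intervals."""
    counts: Counter = Counter()
    for aln in alignments:
        for gene, intervals in genes.items():
            if any(overlaps(aln, iv) for iv in intervals):
                counts[gene] += 1
    return counts
```

Each alignment is tested independently of the others, which is what lets this stage run in parallel across alignments, with the per-gene sums aggregated afterwards.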

Stage 4 - Normalize
- Calculate normalization factor based on count distribution
- Parallel across genetic feature labels
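
The slide does not say which statistic of the count distribution is used. As one plausible sketch, the code below assumes an upper-quartile factor (the 75th percentile of a sample's non-zero gene counts) and divides each count by it. The function names and the choice of statistic are assumptions for illustration only.

```python
import statistics
from typing import Dict


def upper_quartile_factor(gene_counts: Dict[str, int]) -> float:
    """75th percentile of the non-zero counts for one sample (assumed statistic)."""
    nonzero = sorted(c for c in gene_counts.values() if c > 0)
    if not nonzero:
        raise ValueError("sample has no non-zero counts")
    if len(nonzero) == 1:
        return float(nonzero[0])
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def normalize(gene_counts: Dict[str, int]) -> Dict[str, float]:
    """Divide every count by the sample's normalization factor."""
    factor = upper_quartile_factor(gene_counts)
    return {gene: count / factor for gene, count in gene_counts.items()}
```

A quantile-based factor is less sensitive to a handful of very highly expressed genes than a plain total, which is the usual motivation for this kind of choice.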

Stage 5 - Statistical analysis
- Fits a linear model relating the counts to the outcome using R
- Uses values calculated from Align and Overlap stages
- Parallel across genes

Stage 6 - Summarize
- Significance summaries such as P-values and gene-specific counts are calculated
- Outputs a list of top N genes ranked by false discovery rate
- Hadoop takes care of sorting
- This stage is serial
  - Mitigated by small size of calculation at this stage
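
To illustrate ranking genes by false discovery rate, the sketch below uses the Benjamini-Hochberg adjustment as a stand-in. The slide does not state how Myrna derives its false discovery rates, so this is a swapped-in, widely used technique rather than the pipeline's own method, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: Dict[str, float]) -> Dict[str, float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) per gene."""
    m = len(p_values)
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    q_values: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is monotone in rank.
    for rank in range(m, 0, -1):
        gene, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        q_values[gene] = running_min
    return q_values


def top_genes(p_values: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Top n genes ordered by ascending q-value."""
    q_values = benjamini_hochberg(p_values)
    return sorted(q_values.items(), key=lambda kv: kv[1])[:n]
```

The sort over all genes is the serial part of this stage, but it operates on one small record per gene, which is why its cost stays modest.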

Stage 7 - Postprocess
- Discards overlaps not belonging to top genes
- Creates output files, summary tables, and plots
- Compressed and stored in user-specified output directory
- This stage has modest parallelism

Myrna Performance
- Uses standard bioinformatics tools Bowtie and R/Bioconductor in a Hadoop job flow
- Workflow broken into many stages to take advantage of parallelism
- Near linear speedup
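
"Near linear speedup" can be checked with simple arithmetic: compare run times at two cluster sizes and see how close the efficiency is to 1. The sketch below also includes Amdahl's law to show how a serial stage limits the achievable speedup. The timings and serial fraction in the usage note are hypothetical, not figures reported for Myrna.

```python
def speedup(t_baseline: float, t_parallel: float) -> float:
    """How many times faster the larger run is than the baseline run."""
    return t_baseline / t_parallel


def parallel_efficiency(t_baseline: float, n_baseline: int,
                        t_parallel: float, n_parallel: int) -> float:
    """Speedup divided by the increase in workers; 1.0 means perfectly linear."""
    return speedup(t_baseline, t_parallel) / (n_parallel / n_baseline)


def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's law: best possible speedup when part of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)


if __name__ == "__main__":
    # Hypothetical timings: 80 minutes on 10 workers, 20 minutes on 40 workers.
    print(parallel_efficiency(80.0, 10, 20.0, 40))
    # Hypothetical 5% serial work on 40 workers.
    print(amdahl_speedup(0.05, 40))
```

Keeping serial stages small is what keeps the measured efficiency close to the linear ideal as workers are added.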

Summary
- The most popular genomics algorithms will eventually get Hadoop implementations
  - The other 80% will not...
- Hadoop is well suited for processing large unstructured data offline
- Hadoop is not well suited for communication-heavy jobs or real-time processing
- Hadoop can be run locally or integrated into existing HPC infrastructure
- HBase and other products run on Hadoop, taking advantage of framework features

Summary
- Building a Hadoop cluster
  - Use dense storage nodes
- Boosting HDFS performance
  - Use SSD drives
  - Faster interconnects
  - Replace HDFS with S3, GlusterFS, Lustre
- Running a Hadoop cluster
  - Job monitoring and debugging requires additional tooling
  - AWS Elastic MapReduce product for EC2
  - Intel Manager for Hadoop

Observations
- The cloud is making good parallel programming techniques more important than ever
  - Message passing, threading, distributed systems
- Understand the difference between vertical and horizontal scaling
  - Use both!
- Cloud best practices are finding their way back into local infrastructure/HPC
  - Hadoop, configuration management, SOA

References
- http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
- http://markusklems.files.wordpress.com/2008/07/mapreduce.png
- http://bowtie-bio.sourceforge.net/myrna/index.shtml

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.