Cloud-based Analytics and Map Reduce
Datasets
- Many technologies are converging around the Big Data theme: cloud computing, NoSQL, graph analytics
- Biology is becoming increasingly data intensive: sequencing, imaging, and other instruments
- Computing with big datasets is fundamentally different from big compute on small datasets: it is I/O centric, and data movement is critical
Cloud-based Analytics
- Be specific when talking about the cloud: IaaS, PaaS, SaaS, ...
- Basic cloud ingredients: self-service, on-demand compute and storage; commodity hardware; high capacity; everything has an API
- Leading example: Amazon Web Services
- Honorable mentions: Google Compute Engine, OpenStack, CloudStack, VMware
Mapping Informatics to the Cloud - Good News
- Good news: you don't have to change much
- Using the basic IaaS building blocks we can handle most traditional use cases: clusters, client-server, N-tier deployments
- Running standard HPC clusters in the cloud is simple
- StarCluster: http://star.mit.edu/cluster/
  - Create EC2 clusters in minutes
  - OpenMPI, ATLAS, LAPACK, NumPy, SciPy
  - SGE, IPython, Condor, MPICH2 plugins
Mapping Informatics to the Cloud
- Cloud computing is democratizing access to IT infrastructure: anyone can have access to massive compute and storage resources in minutes
- This changes the way we solve scientific problems
- Cloud computing is not a silver bullet for scalability: cloud providers have primarily focused on horizontal scaling, not on HPC
Map Reduce
We need a computing framework that is...
- able to handle huge datasets - 1 TB+
- massively parallel - runs on commodity hardware
- fault tolerant - hardware fails
- locality aware - moving computation is cheaper than moving data
- easy to use
Map Reduce
- Original paper by Google in 2004 introduced a simplified parallel processing model
- Used to build Google search indexes
- Users specify a Map function and a Reduce function
- The framework manages task distribution, orchestration, data transfers, redundancy, and fault tolerance
Map Reduce
- Map Reduce is a simple approach to parallel programming
- Existing algorithms must be translated into one or more map/reduce steps
- Batch oriented
- Requires a distributed filesystem
- Map Reduce can be implemented on top of MPI: http://mapreduce.sandia.gov
Apache Hadoop
- Free implementation derived from Google MapReduce, written in Java
- Composed of many complementary projects:
  - Core - components and interfaces for distributed filesystems and general I/O serialization
  - MapReduce - distributed data processing model and execution environment
  - HDFS - distributed filesystem that runs on large clusters
Hadoop Ecosystem
- Hadoop has a large ecosystem of related projects at Apache:
  - HBase - non-relational, column-oriented database that runs on HDFS
  - Pig - data flow language for exploring datasets
  - Hive - distributed data warehouse with a SQL-like query language
  - Mahout - machine learning and data mining library
Hadoop Components
- Core consists of a compute layer (MapReduce) and a storage layer (HDFS)
- Alternatives to HDFS: Amazon S3, GlusterFS, Lustre
- Many distributions/flavors of Hadoop exist: Apache, Cloudera, Amazon Elastic MapReduce, Intel
Intel Hadoop
Core improvements and enterprise features:
- Encrypted HDFS
- Faster job launch
- Optimized for SSDs and 10GbE networking
- Accelerated Hive queries
- Premium support
- Intel Manager
- Multi-datacenter HBase deployments
Hadoop Job Flow
http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
Canonical Example - Word Count
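The canonical word-count example can be sketched in a single process. This is a minimal illustration of the programming model, not of Hadoop itself: map emits (word, 1) pairs, the framework's "shuffle" groups pairs by key, and reduce sums each group. Function names here are illustrative.

```python
# Minimal single-process sketch of MapReduce word count.
# map_fn emits (word, 1) pairs; the "shuffle" groups pairs by key;
# reduce_fn sums the counts for each word.
from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(lines):
    # Shuffle phase: group all mapped values by key
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Reduce phase: one call per distinct key
    return dict(reduce_fn(k, v) for k, v in groups.items())
```

In a real Hadoop job the mappers and reducers run as independent tasks on slices of the input, and the shuffle happens over the network.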
Translating Workloads to Map Reduce
- Parallel programming requires parallel thinking: domain decomposition, exploiting natural parallelism
- A map reduce job assumes independent mappers and reducers running in parallel on individual slices of data
- Share-nothing architecture - avoid communication and global data structures
- Exploit parallelism at each stage in the workflow
Translating Workloads to Map Reduce
- Genomic analysis is ideally suited for the Hadoop framework: large, semi-structured, file-based data; parallel I/O; parallel processing by reads, samples, genes, etc.
- Hadoop interfaces exist for C, Java, Perl, R, ...
- Hadoop Streaming allows any executables to act as mappers and reducers
Enter Hadoop Streaming
- Unix pipes for Hadoop
- Utility that comes with the Hadoop distribution
- Use any mapper and reducer: Perl, Bash, cat/wc, binaries
- Easiest way to get started with Hadoop

cat in.txt | mapper.sh | sort | reduce.sh > out.txt
Revolution R with Hadoop
- Series of R connectors to Hadoop features
- Write MapReduce jobs in R using Hadoop Streaming
- Import tables from HDFS and HBase
https://github.com/revolutionanalytics/rhadoop
http://www.revolutionanalytics.com/why-revolutionr/whitepapers/r-and-hadoop-big-data-analytics.pdf
Elastic MapReduce - Hadoop on AWS
- Amazon realized customers were spending a lot of time configuring and operating Hadoop clusters on EC2
- EMR is a service that runs on EC2 and handles setup, teardown, and other Hadoop details
- Charged by instance-hour; instances can be terminated automatically when your job finishes
- Adds the Job Flows feature, unique to EMR
Elastic MapReduce Features
- Elastic: add nodes to a running cluster
  - New: variable node count in each flow step
  - New: easier to dynamically resize the number of nodes in use
- Stores inputs and outputs in S3
  - New: can use multipart HTTP upload if configured
- Easy to use: Job Flows, debugging
- Support for On-Demand and Reserved Instances; recent support for Spot and VPC instances!
- Support for bootstrap actions
- Support for Pig, Hive, Hadoop 0.20.*
Elastic MapReduce - Creating Job Flows

Elastic MapReduce - Monitoring and Debugging

Elastic MapReduce - Debugging
Translating Workloads to Map Reduce
- The most efficient programs require using Java and understanding Hadoop internals
- Hadoop has more scheduling and execution overhead than some traditional HPC environments
- Hadoop can be integrated with cluster schedulers like SGE
- Moving large amounts of data into HDFS can be slow
- Filesystem alternatives include S3, GlusterFS, Lustre, HTTP
- HDFS can greatly benefit from SSDs
Data Movement - A Few Tips
- Data movement is the primary hurdle in adopting the cloud
- Avoid SCP or other TCP-based transfers unless you tune your settings: http://www.psc.edu/index.php/networking/641-tcp-tune
- Alternative transports: GridFTP, Aspera fasp, BitTorrent
- AWS offers physical import/export via FedEx
- Aggregate the data within your job if possible
- Ingest data in a preprocess phase or via P2P torrents
Life Science Example
- Cloud-scale RNA-sequencing differential expression analysis with Myrna (Langmead et al., Genome Biology 2010, 11:R83)
- RNA-Seq analysis pipeline focused on differential expression analysis between genes
- Complementary to whole-transcriptome assembly (e.g. Cufflinks)
- Workflow contains 7 stages
- Bowtie for alignment and R/Bioconductor for EM and statistics
Stage 1 - Preprocess
- Process FASTQ input list (optional dump from .sra format)
- Assign sample names
- Copy into HDFS
- Parallel across input files
Stage 2 - Align
- Align reads to the reference genome using Bowtie
- Each node independently obtains the Bowtie index from a local or shared filesystem (HDFS)
- Parallel across reads
Stage 3 - Overlap
- Calculate overlaps between alignments and predefined gene intervals
- Aggregate counts for each genomic feature
- Parallel across alignments
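The core of this stage can be sketched as an interval-overlap count. This is not Myrna's implementation, only an illustration of the idea; the gene names, coordinates, and half-open interval convention below are assumptions chosen for the example.

```python
# Hypothetical sketch of the Overlap stage: count, per gene, how many
# aligned reads overlap that gene's interval. Intervals are half-open
# [start, end) pairs; names and coordinates are illustrative only.
from collections import Counter

def overlaps(read, gene):
    # Two half-open intervals overlap iff each starts before the
    # other ends
    return read[0] < gene[1] and gene[0] < read[1]

def count_overlaps(reads, genes):
    counts = Counter()
    for name, interval in genes.items():
        for read in reads:
            if overlaps(read, interval):
                counts[name] += 1
    return counts
```

In the actual pipeline each mapper would process a slice of the alignments, and per-gene counts would be aggregated in the reduce step.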
Stage 4 - Normalize
- Calculate a normalization factor based on the count distribution
- Parallel across genomic feature labels
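One common way to derive a per-sample factor from the count distribution is a quantile of the nonzero gene counts (upper-quartile normalization). The sketch below uses that approach purely as an illustration; it is not claimed to match Myrna's exact method, and the nearest-rank quantile rule is an assumption.

```python
# Hypothetical normalization factor from a sample's count
# distribution: the value at a chosen quantile of the nonzero
# counts (upper quartile by default). Nearest-rank indexing is an
# illustrative choice, not Myrna's documented method.
def normalization_factor(counts, quantile=0.75):
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        return 1.0
    # Nearest-rank index of the requested quantile
    idx = min(len(nonzero) - 1, int(quantile * len(nonzero)))
    return float(nonzero[idx])
```

Downstream stages would divide each sample's counts by its factor so samples sequenced at different depths become comparable.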
Stage 5 - Statistical Analysis
- Fits a linear model relating the counts to the outcome using R
- Uses values calculated in the Align and Overlap stages
- Parallel across genes
Stage 6 - Summarize
- Significance summaries such as P-values and gene-specific counts are calculated
- Outputs a list of the top N genes ranked by false discovery rate (Hadoop takes care of sorting)
- This stage is serial, mitigated by the small size of the calculation at this stage
Stage 7 - Postprocess
- Discards overlaps not belonging to the top genes
- Creates output files, summary tables, and plots, compressed and stored in a user-specified output directory
- This stage has modest parallelism
Myrna Performance
- Uses standard bioinformatics tools (Bowtie and R/Bioconductor) in a Hadoop job flow
- Workflow is broken into many stages to take advantage of parallelism
- Near-linear speedup
Summary
- The most popular genomics algorithms will eventually get Hadoop implementations; the other 80% will not...
- Hadoop is well suited for processing large unstructured data offline
- Hadoop is not well suited for communication-heavy jobs or real-time processing
- Hadoop can be run locally or integrated into existing HPC infrastructure
- HBase and other products run on Hadoop, taking advantage of framework features
Summary
- Building a Hadoop cluster: use dense storage nodes
- Boosting HDFS performance: use SSD drives; faster interconnects; replace HDFS with S3, GlusterFS, or Lustre
- Running a Hadoop cluster: job monitoring and debugging require additional tooling
  - AWS Elastic MapReduce product for EC2
  - Intel Manager for Hadoop
Observations
- The cloud is making good parallel programming techniques more important than ever: message passing, threading, distributed systems
- Understand the difference between vertical and horizontal scaling - use both!
- Cloud best practices are finding their way back into local infrastructure/HPC: Hadoop, configuration management, SOA
References
http://developer.yahoo.com/hadoop/tutorial/module4.html#dataflow
http://markusklems.files.wordpress.com/2008/07/mapreduce.png
http://bowtie-bio.sourceforge.net/myrna/index.shtml
Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
Copyright 2012, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.