Data Structures and Performance for Scientific Computing with Hadoop and Dumbo

Similar documents
TSQR, A t A, Cholesky QR, and Iterative Refinement using Hadoop on Magellan

6. Cholesky factorization

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Hadoop Parallel Data Processing

Hadoop Cluster Applications

CS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns

Implement Hadoop jobs to extract business value from large and varied data sets


Hadoop SNS. renren.com. Saturday, December 3, 11

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

Extreme Computing Second Assignment

Big Fast Data Hadoop acceleration with Flash. June 2013

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Big Data and Scripting map/reduce in Hadoop

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Energy Efficient MapReduce

MapReduce: Algorithm Design Patterns

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

H2O on Hadoop. September 30,

Crunching Big Data with R And Hadoop!

Developing MapReduce Programs

ITG Software Engineering

Storage Architectures for Big Data in the Cloud

Cloud Computing. Chapter Hadoop

HiBench Introduction. Carson Wang Software & Services Group

Understanding Hadoop Performance on Lustre

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Parallel Computing. Benson Muite. benson.

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Chapter 12: Multiprocessor Architectures. Lesson 04: Interconnect Networks

Open Source Technologies on Microsoft Azure

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis

Data Intensive Computing Handout 6 Hadoop

Main Memory Map Reduce (M3R)

Distributed Computing and Big Data: Hadoop and MapReduce

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Click Stream Data Analysis Using Hadoop

Big Data Analytics. Lucas Rego Drumond

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Parallel Computing for Data Science

What s New in MATLAB and Simulink

Adapting scientific computing problems to cloud computing frameworks Ph.D. Thesis. Pelle Jakovits

YARN and how MapReduce works in Hadoop By Alex Holmes

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Distributed Apriori in Hadoop MapReduce Framework

Architectures for Big Data Analytics A database perspective

ITG Software Engineering

Pre-Stack Kirchhoff Time Migration on Hadoop and Spark

Unified Big Data Analytics Pipeline. 连 城

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

2015 The MathWorks, Inc. 1

Report: Declarative Machine Learning on MapReduce (SystemML)

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

Hadoop and Map-Reduce. Swati Gore

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Supercomputing and Big Data: Where are the Real Boundaries and Opportunities for Synergy?

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Similarity Search in a Very Large Scale Using Hadoop and HBase

PassTest. Bessere Qualität, bessere Dienstleistungen!

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

CSE-E5430 Scalable Cloud Computing Lecture 2

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

A permutation can also be represented by describing its cycles. What do you suppose is meant by this?

Estimating PageRank Values of Wikipedia Articles using MapReduce

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Fast Analytics on Big Data with H20

Big Data Challenges in Bioinformatics

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Unified Big Data Processing with Apache Spark. Matei

MapReduce and the New Software Stack

Infrastructures for big data

Hadoop/MapReduce Workshop. Dan Mazur, McGill HPC July 10, 2014

Performance and Energy Efficiency of. Hadoop deployment models

Big Data Too Big To Ignore

CS 378 Big Data Programming. Lecture 2 Map- Reduce

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises

Exar. Optimizing Hadoop Is Bigger Better?? March Exar Corporation Kato Road Fremont, CA

Transcription:

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo Austin R. Benson Computer Sciences Division, UC-Berkeley ICME, Stanford University May 15, 2012

1 1 Matrix storage 2 Data 3 Example: outputting many small matrices 4 Example: Cholesky QR

Dense matrix storage 11 12 13 14 A = 21 22 23 24 31 32 33 34 41 42 42 44 How do we store the matrix in HDFS?

Dense matrix storage 11 12 13 14 A = 21 22 23 24 31 32 33 34 41 42 42 44 In HDFS: 1, [11, 12, 13, 14] 2, [21, 22, 23, 24] 3, [31, 32, 33, 34] 4, [41, 42, 43, 44]

Two rows per record or we might use: 1, [[11, 12, 13, 14], [21, 22, 23, 24]] 3, [[31, 32, 33, 34], [41, 42, 43, 44]]

Flattened list or maybe 1, [11, 12, 13, 14, 21, 22, 23, 24] 3, [31, 32, 33, 34, 41, 42, 43, 44]... but we do lose information here (maybe it s not important)

Full matrix or maybe 1, [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34], [41, 42, 43, 44]]

What is the best way?

What is the best way? Depends on the application... we will look at an example later.

2 1 Matrix storage 2 Data 3 Example: outputting many small matrices 4 Example: Cholesky QR

Data Serialization Small optimizations 2.5x speedup! *all data from the NERSC Magellan cluster

Data Serialization Same experiment but different matrix size (200 columns): Again, 2.5x speedup!

Languages Switching from Python to C++... same general trend

More speedups Algorithm performance isn t the only place where we see speedups

Why can we expect these speedups? These are not high-performance implementations. We care about I/O performance.

3 1 Matrix storage 2 Data 3 Example: outputting many small matrices 4 Example: Cholesky QR

Suppose we need to write many small matrices to disk.

Code Code: git clone git://github.com/icme/mapreduce-workshop.git cd mapreduce-workshop/arbenson Files: speed test.py (tester) small matrix test.py (driver)

4 1 Matrix storage 2 Data 3 Example: outputting many small matrices 4 Example: Cholesky QR

Algorithm Cholesky QR: R = chol(a T A, upper )

Implementation for MapReduce

Mapper implementation Which of these implementations is better?

Mapper implementation Which of these implementations is better? Answer: the one on the left (usually)

Why? 1 Shuffle time 2 Reduce bottleneck However, the left implementation could run out of memory.

Mapper implementation Can we do better? Yes

Questions? Austin R. Benson arbenson@gmail.com https://github.com/arbenson/mrtsqr