Using MapReduce for Large-scale Medical Image Analysis. HISB 2012. Presented by: Roger Schaer, HES-SO Valais


Summary: Introduction, Methods, Results & Interpretation, Conclusions

Introduction

Introduction. Exponential growth of imaging data over the past 20 years. [Chart: amount of images produced per day at the HUG, per year]

Introduction (continued). Mainly caused by: modern imaging techniques (3D, 4D) producing large files, and large collections available on the Internet. Increasingly complex algorithms make processing this data more challenging, requiring a lot of computation power, storage and network bandwidth.

Introduction (continued). Flexible and scalable infrastructures are needed. Several approaches exist:
- a single, powerful machine
- a local cluster / grid
- alternative infrastructures (graphics cards)
- cloud computing solutions
The first two approaches have been tested and compared.

Introduction (continued). Three large-scale medical image processing use cases:
- parameter optimization for Support Vector Machines
- content-based image feature extraction & indexing
- 3D texture feature extraction using the Riesz transform
Note: I mostly handled the infrastructure aspects!

Methods

Methods: MapReduce, Hadoop Cluster, Support Vector Machines, Image Indexing, Solid 3D Texture Analysis Using the Riesz Transform

MapReduce. MapReduce is a programming model developed by Google.
- Map phase: key/value pair input, intermediate key/value output
- Reduce phase: for each intermediate key, process the list of associated values
Trivial example: the WordCount application.

MapReduce: WordCount
INPUT (one key/value pair per line):
#1 hello world
#2 goodbye world
#3 hello hadoop
#4 bye hadoop
MAP emits one (word, 1) pair per word, e.g. (hello, 1), (world, 1), (goodbye, 1), ...
REDUCE sums the values associated with each word:
hello 2, world 2, goodbye 1, hadoop 2, bye 1
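The WordCount flow above can be sketched in plain Python. This is a minimal, single-process illustration of the model only; the names map_phase and reduce_phase are illustrative, not Hadoop API:

```python
from collections import defaultdict

def map_phase(records):
    """Map: for each (key, line) input pair, emit (word, 1) pairs."""
    for _, line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    """Group intermediate pairs by key (the shuffle step), then sum
    the list of values associated with each key (the reduce step)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return {word: sum(counts) for word, counts in grouped.items()}

records = [(1, "hello world"), (2, "goodbye world"),
           (3, "hello hadoop"), (4, "bye hadoop")]
counts = reduce_phase(map_phase(records))
print(counts)  # {'hello': 2, 'world': 2, 'goodbye': 1, 'hadoop': 2, 'bye': 1}
```

In real Hadoop the grouping is done by the framework between the two phases; only the map and reduce functions are user code.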

Hadoop. Apache's implementation of MapReduce. Consists of:
- a distributed storage system: HDFS
- an execution framework: Hadoop MapReduce
A master node orchestrates the task distribution; worker nodes perform the tasks. A typical node runs a DataNode and a TaskTracker.

Support Vector Machines. Computes a decision boundary (hyperplane) that separates inputs of different classes, represented in a given feature space transformed by a given kernel. The values of two parameters need to be adapted to the data:
- the cost C of errors
- the σ of the Gaussian kernel
[Slide: 2D scatter plot illustrating the separating hyperplane between two classes]

SVM (continued). Goal: find the optimal value couple (C, σ) to train an SVM, allowing the best classification performance on 5 lung texture patterns. Execution on 1 PC (without Hadoop) can take weeks, due to the extensive leave-one-patient-out cross-validation with 86 patients. Parallelization: split the job by parameter value couples.
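The splitting strategy can be sketched as follows. This is a minimal illustration under assumptions: the log-scale grid values and the round-robin assignment are hypothetical, not the exact values used in the talk:

```python
from itertools import product

# Hypothetical parameter grids (log-scale grids are typical for SVM tuning)
C_values = [2 ** e for e in range(-5, 16, 2)]        # 11 candidate costs
sigma_values = [2 ** e for e in range(-15, 4, 2)]    # 10 candidate sigmas

# Each (C, sigma) couple is an independent unit of work: one full
# cross-validation run, so couples can be evaluated in parallel.
couples = list(product(C_values, sigma_values))

def split_into_tasks(couples, n_tasks):
    """Assign couples to tasks round-robin; each task becomes one map input."""
    return [couples[i::n_tasks] for i in range(n_tasks)]

tasks = split_into_tasks(couples, n_tasks=42)
print(len(couples))                # 110 couples in total
print(max(len(t) for t in tasks))  # at most 3 couples per task
```

Because every couple is evaluated independently, the speedup is bounded only by the number of tasks and the slowest single evaluation.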

Image Indexing. Two phases:
1. Extract features from images
2. Construct bags of visual words by quantization
Component-based / monolithic approaches.
Pipeline: Image Files -> Feature Extractor -> Feature Vector Files -> Bag of Visual Words Factory -> Index File + Vocabulary File
Parallelization: each task processes N images.
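The quantization step of phase 2 can be sketched like this: each feature vector of an image is assigned to its nearest visual word (centroid), and the image is represented as a histogram over the vocabulary. The toy 2-D vectors below are illustrative values only, not real image features:

```python
import math

def nearest_word(feature, vocabulary):
    """Return the index of the closest visual word (centroid) to a feature."""
    dists = [math.dist(feature, w) for w in vocabulary]
    return dists.index(min(dists))

def bag_of_visual_words(features, vocabulary):
    """Quantize one image's feature vectors into a histogram (its 'bag')."""
    histogram = [0] * len(vocabulary)
    for f in features:
        histogram[nearest_word(f, vocabulary)] += 1
    return histogram

# Toy 2-D features for one image and a 3-word vocabulary
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.2), (0.9, 1.1), (5.2, 4.8), (4.9, 5.1)]
print(bag_of_visual_words(features, vocabulary))  # [1, 1, 2]
```

Since each image's bag depends only on its own features and the shared vocabulary, a task can process its N images independently of all other tasks.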

3D Texture Analysis (Riesz). Features are extracted from 3D images (illustrated on the slide). Parallelization: each task processes N images.

Results & Interpretation

Hadoop Cluster. Minimally invasive setup (>= 2 free cores per node).

Support Vector Machines optimization. Longer tasks = bad performance, because the optimization of the hyperplane is more difficult to compute (more iterations needed). After 2 patients (out of 86), check whether t_i > F * t_ref, i.e. whether the task's elapsed time exceeds the average reference time by more than a margin factor F; if so, terminate the task.
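A sketch of this early-termination heuristic; the margin factor F = 1.2 and the timings are assumed values for illustration, not figures from the talk:

```python
def should_terminate(elapsed_after_2_patients, reference_time, margin_factor=1.2):
    """After 2 of the 86 patients, kill the task if its time so far
    exceeds the reference (average) time by more than the margin."""
    return elapsed_after_2_patients > margin_factor * reference_time

# Reference: average time other tasks needed for their first 2 patients (assumed)
t_ref = 60.0  # seconds
print(should_terminate(55.0, t_ref))  # False: within the margin, keep running
print(should_terminate(90.0, t_ref))  # True: too slow, terminate the task
```

The point of checking after only 2 patients is that a hard-to-optimize (C, σ) couple reveals itself early, so the remaining 84 cross-validation folds need not be wasted on it.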

Support Vector Machines. [Heatmap: accuracy (%) over C (cost) and σ (sigma); black marks the tasks to be interrupted by the new algorithm.] Optimized algorithm: ~50h reduced to ~9h15min. None of the best tasks (highest accuracy) are killed.

Image Indexing. [Charts for 1K, 10K and 100K images show the calculation time as a function of the number of tasks.] Both experiments were executed using Hadoop: once on a single computer, then on our cluster of PCs.

Riesz 3D. Particularity: the code was a series of Matlab scripts. Instead of rewriting the whole application:
- used Hadoop's streaming feature (uses stdin/stdout)
- to maximize scalability, GNU Octave was used (great compatibility between Matlab and Octave)
RESULTS:
1 task (no Hadoop): 131h32m42s
42 tasks (idle): 6h29m51s
42 tasks (normal): 5h51m31s

Conclusions

Conclusions
MapReduce is:
- flexible (worked with very varied use cases)
- easy to use (the 2-phase programming model is simple)
- efficient (>= 20x speedup for all use cases)
Hadoop is:
- easy to deploy & manage
- user-friendly (nice Web UIs)

Conclusions (continued). Speedups for the different use cases:

Use case                 Single task    42 tasks on Hadoop    Speedup
SVMs                     990h*          50h / 9h15**          20x / 107x**
Image Indexing           21h*           1h                    21x
3D Feature Extraction    131h30         5h50                  22.5x

* estimation  ** using the optimized algorithm

Lessons Learned
- It is important to use physically distributed resources: overloading a single machine hurts performance
- Data locality notably speeds up jobs
- Not every application is infinitely scalable: performance improvements level off at some point

Future work. Take it to the next level: the Cloud.
- Amazon Elastic Compute Cloud (IaaS)
- Amazon Elastic MapReduce (PaaS)
- Cloudbursting: use both local resources and the Cloud (for peak usage)

Thank you! Questions?