Using MapReduce for Large-scale Medical Image Analysis. HISB 2012 Presented by : Roger Schaer - HES-SO Valais
2 Summary
  - Introduction
  - Methods
  - Results & Interpretation
  - Conclusions
3 Introduction
4 Introduction
  - Exponential growth of imaging data (past 20 years)
  [Figure: number of images produced per day at the HUG, by year]
5 Introduction (continued)
  - Mainly caused by:
      - Modern imaging techniques (3D, 4D): large files!
      - Large collections (available on the Internet)
  - Increasingly complex algorithms make processing this data more challenging
  - Requires a lot of computation power, storage and network bandwidth
6 Introduction (continued)
  - Flexible and scalable infrastructures are needed
  - Several approaches exist:
      - Single, powerful machine
      - Local cluster / grid
      - Alternative infrastructures (graphics cards)
      - Cloud computing solutions
  - The first two approaches have been tested and compared
7 Introduction (continued)
  - 3 large-scale medical image processing use cases:
      - Parameter optimization for Support Vector Machines
      - Content-based image feature extraction & indexing
      - 3D texture feature extraction using the Riesz transform
  - NOTE: I mostly handled the infrastructure aspects!
8 Methods
9 Methods
  - MapReduce
  - Hadoop Cluster
  - Support Vector Machines
  - Image Indexing
  - Solid 3D Texture Analysis Using the Riesz Transform
10 MapReduce
  - MapReduce is a programming model
  - Developed by Google
  - Map phase: key/value pair input, intermediate output
  - Reduce phase: for each intermediate key, process the list of associated values
  - Trivial example: Word Count application
11 MapReduce : WordCount
  INPUT
      #1 hello world
      #2 goodbye world
      #3 hello hadoop
      #4 bye hadoop
      ...
  MAP
      hello 1, world 1
      goodbye 1, world 1
      hello 1, hadoop 1
      bye 1, hadoop 1
  REDUCE
      hello 2
      world 2
      goodbye 1
      hadoop 2
      bye 1
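The WordCount walkthrough above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the two phases, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are made up for this sketch.

```python
from collections import defaultdict

def map_phase(line_id, line):
    """Map: emit one (word, 1) pair per word of an input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce: sum the list of values associated with one intermediate key."""
    return word, sum(counts)

# The four input lines shown on the slide
lines = {1: "hello world", 2: "goodbye world", 3: "hello hadoop", 4: "bye hadoop"}

intermediate = [pair for line_id, line in lines.items()
                for pair in map_phase(line_id, line)]
word_counts = dict(reduce_phase(word, counts)
                   for word, counts in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster the map calls run on different worker nodes and the grouping step is handled by the framework; only the map and reduce functions are written by the user.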
38 Hadoop
  - Apache's implementation of MapReduce
  - Consists of:
      - Distributed storage system: HDFS
      - Execution framework: Hadoop MapReduce
  - Master node, which orchestrates the task distribution
  - Worker nodes, which perform the tasks
  - A typical node runs a DataNode and a TaskTracker
39 Support Vector Machines
  - Computes a decision boundary (hyperplane) that separates inputs of different classes, represented in a given feature space transformed by a given kernel
  - The values of two parameters need to be adapted to the data:
      - Cost C of errors
      - σ of the Gaussian kernel
45 SVM (continued)
  - Goal: find the optimal value couple (C, σ) to train an SVM
      - Allowing the best classification performance on 5 lung texture patterns
  - Execution on 1 PC (without Hadoop) can take weeks
      - Due to extensive leave-one-patient-out cross-validation with 86 patients
  - Parallelization: split the job by parameter value couples
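Splitting the job by parameter value couples could look roughly like the sketch below, where each (C, σ) couple becomes one independent task. The grid values and the `parameter_tasks` helper are assumptions for illustration; the slides do not state which values were searched.

```python
from itertools import product

def parameter_tasks(c_values, sigma_values):
    """Build one task description per (C, sigma) couple of the grid."""
    return [
        {"task_id": i, "C": c, "sigma": sigma}
        for i, (c, sigma) in enumerate(product(c_values, sigma_values))
    ]

# Hypothetical search grid (not taken from the slides)
c_values = [2.0 ** k for k in range(0, 4)]         # 1, 2, 4, 8
sigma_values = [10.0 ** k for k in range(-2, 1)]   # 0.01, 0.1, 1

tasks = parameter_tasks(c_values, sigma_values)
print(len(tasks), "independent tasks")
```

Because every couple is evaluated independently (each with its own cross-validation), the tasks need no communication and can be handed to map tasks one by one.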
46 Image Indexing
  - Two phases:
      - Extract features from images
      - Construct bags of visual words by quantization
  - Component-based / monolithic approaches
      [Diagram, component-based: Image Files, Feature Extractor, Feature Vectors Files, Bag of Visual Words Factory, Index File, Vocabulary File]
      [Diagram, monolithic: Image Files, Feature Extractor + Bag of Visual Words Factory, Index File, Vocabulary File]
  - Parallelization: each task processes N images
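The quantization step of the second phase can be illustrated as follows: each feature vector is assigned to its nearest visual word, and the image is described by the histogram of assignments. This is a minimal sketch that assumes Euclidean distance and a given vocabulary; the slides do not say how the vocabulary was built, and the data below is invented.

```python
from collections import Counter
from math import dist

def nearest_word(feature, vocabulary):
    """Index of the visual word closest to the feature (Euclidean distance)."""
    return min(range(len(vocabulary)), key=lambda i: dist(feature, vocabulary[i]))

def bag_of_visual_words(features, vocabulary):
    """Histogram counting how many features fall on each visual word."""
    counts = Counter(nearest_word(f, vocabulary) for f in features)
    return [counts.get(i, 0) for i in range(len(vocabulary))]

# Hypothetical 2D vocabulary and features extracted from one image
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.2), (0.9, 1.1), (4.8, 5.3), (1.2, 0.8)]

histogram = bag_of_visual_words(features, vocabulary)
print(histogram)
```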
48 3D Texture Analysis (Riesz)
  - Features are extracted from 3D images (see below)
  - Parallelization: each task processes N images
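"Each task processes N images" amounts to cutting the list of image files into batches of N. A minimal sketch, with hypothetical file names and a made-up `batches` helper:

```python
def batches(items, n):
    """Split a list into consecutive batches of at most n items."""
    return [items[i:i + n] for i in range(0, len(items), n)]

# Hypothetical image list; the real collections are far larger
image_files = [f"image_{i:04d}.dcm" for i in range(10)]

task_inputs = batches(image_files, 4)
for task_id, batch in enumerate(task_inputs):
    print(task_id, batch)
```

The choice of N trades scheduling overhead (many small tasks) against load balance (few large tasks).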
49 Results & Interpretation
50 Hadoop Cluster
  - Minimally invasive setup (>= 2 free cores per node)
51 Support Vector Machines
  - Optimization: longer tasks = bad performance
      - Because the optimization of the hyperplane is more difficult to compute (more iterations needed)
  - After 2 patients (out of 86), check if: ti F tref
      - If the time exceeds the average (+ margin), terminate the task
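The early-termination rule can be sketched as below. It assumes that the reference time is the average time the tasks needed for their first 2 patients and that the margin is a multiplicative factor; both the factor of 1.5 and the timings are invented for illustration and are not taken from the slides.

```python
from statistics import mean

def should_terminate(elapsed_s, reference_s, margin_factor):
    """True when a task's time at the checkpoint exceeds the reference plus margin."""
    return elapsed_s > margin_factor * reference_s

# Hypothetical times (seconds) measured after the first 2 patients
checkpoint_times = {"task_a": 100.0, "task_b": 120.0, "task_c": 400.0}

reference = mean(checkpoint_times.values())  # assumed definition of the reference time
killed = sorted(
    task for task, elapsed in checkpoint_times.items()
    if should_terminate(elapsed, reference, margin_factor=1.5)
)
print(killed)
```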
52 Support Vector Machines
  [Figure: accuracy (%) as a function of C (cost) and σ (sigma)]
  - Black: tasks to be interrupted by the new algorithm
  - Optimized algorithm: ~50h -> ~9h15min
  - None of the best tasks (highest accuracy) are killed
53 Image Indexing
  [Figures: 1K images, 10K images, 100K images]
  - Shows the calculation time as a function of the number of tasks
  - Both experiments were executed using Hadoop
      - Once on a single computer, then on our cluster of PCs
56 Riesz 3D
  - Particularity: the code was a series of Matlab scripts
  - Instead of rewriting the whole application:
      - Used the Hadoop streaming feature (uses stdin/stdout)
  - To maximize scalability, GNU Octave was used
      - Great compatibility between Matlab and Octave
  RESULTS
      1 task (no Hadoop)    131h32m42s
      42 tasks (idle)       6h29m51s
      42 tasks (normal)     5h51m31s
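Hadoop streaming lets any executable act as a mapper as long as it reads records from stdin and writes tab-separated key/value lines to stdout. The sketch below shows that contract in Python rather than Octave; `extract_features` is a hypothetical placeholder and does not reproduce the Riesz feature extraction from the talk.

```python
import io

def extract_features(image_path):
    """Hypothetical stand-in for the real feature extraction."""
    return [float(len(image_path))]

def run_mapper(stdin, stdout):
    """Read one image path per line and emit 'path<TAB>features' lines."""
    for line in stdin:
        path = line.strip()
        if not path:
            continue  # skip empty input lines
        values = ",".join(f"{v:.1f}" for v in extract_features(path))
        stdout.write(f"{path}\t{values}\n")

# Local demonstration with in-memory streams; under Hadoop streaming the
# script would call run_mapper(sys.stdin, sys.stdout) instead.
demo_in = io.StringIO("scan_001.img\n\nscan_02.img\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Because the interface is only stdin/stdout, the existing scripts can be wrapped without rewriting them in Java.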
58 Conclusions
59 Conclusions
  - MapReduce is:
      - Flexible (worked with very varied use cases)
      - Easy to use (the 2-phase programming model is simple)
      - Efficient (>= 20x speedup for all use cases)
  - Hadoop is:
      - Easy to deploy & manage
      - User-friendly (nice Web UIs)
60 Conclusions (continued)
  Speedups for the different use cases

                          SVMs              Image Indexing    3D Feature Extraction
  Single task             990h*             21h*              131h30
  42 tasks on Hadoop      50h / 9h15**      1h                5h50
  Speedup                 20x / 107x**      21x               22.5x

  * estimation
  ** using the optimized algorithm
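The speedup figures follow from dividing the single-task time by the 42-task time. The short check below redoes that arithmetic with the durations from the slide converted to hours; only the dictionary layout and names are assumptions.

```python
def speedup(single_task_h, parallel_h):
    """Speedup = single-task duration / parallel duration."""
    return single_task_h / parallel_h

# (single task, 42 tasks on Hadoop), both in hours, as reported on the slide
cases = {
    "SVMs": (990.0, 50.0),
    "SVMs (optimized)": (990.0, 9.25),                     # 9h15
    "Image indexing": (21.0, 1.0),
    "3D feature extraction": (131.5, 5.0 + 50.0 / 60.0),   # 131h30 vs 5h50
}

speedups = {name: round(speedup(*hours), 1) for name, hours in cases.items()}
for name, value in speedups.items():
    print(f"{name}: {value}x")
```

The values come out at 19.8x, 107.0x, 21.0x and 22.5x, in line with the rounded figures on the slide.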
61 Lessons Learned
  - It is important to use physically distributed resources
      - Overloading a single machine hurts performance
      - Data locality notably speeds up jobs
  - Not every application is infinitely scalable
      - Performance improvements level off at some point
62 Future work
  - Take it to the next level: the Cloud
      - Amazon Elastic Compute Cloud (IaaS)
      - Amazon Elastic MapReduce (PaaS)
  - Cloudbursting
      - Use both local resources + Cloud (for peak usage)
63 Thank you! Questions?
More informationHigh Performance Computing MapReduce & Hadoop. 17th Apr 2014
High Performance Computing MapReduce & Hadoop 17th Apr 2014 MapReduce Programming model for parallel processing vast amounts of data (TBs/PBs) distributed on commodity clusters Borrows from map() and reduce()
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationNAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE. A Thesis. Presented to. The Faculty of the Graduate School
NAIVE BAYES ALGORITHM FOR TWITTER SENTIMENT ANALYSIS AND ITS IMPLEMENTATION IN MAPREDUCE A Thesis Presented to The Faculty of the Graduate School At the University of Missouri In Partial Fulfillment Of
More informationTutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
More informationHow to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13
How to properly misuse Hadoop Marcel Huntemann NERSC tutorial session 2/12/13 History Created by Doug Cutting (also creator of Apache Lucene). 2002 Origin in Apache Nutch (open source web search engine).
More information!"#$%&' ( )%#*'+,'-#.//"0( !"#$"%&'()*$+()',!-+.'/', 4(5,67,!-+!"89,:*$;'0+$.<.,&0$'09,&)"/=+,!()<>'0, 3, Processing LARGE data sets
!"#$%&' ( Processing LARGE data sets )%#*'+,'-#.//"0( Framework for o! reliable o! scalable o! distributed computation of large data sets 4(5,67,!-+!"89,:*$;'0+$.
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationResearch Laboratory. Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze
Research Laboratory Java Web Crawler & Hadoop MapReduce Anri Morchiladze && Bachana Dolidze Supervisor Nodar Momtselidze 1. Java Web Crawler Description Java Code 2. MapReduce Overview Example of mapreduce
More informationCS 378 Big Data Programming
CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is
More informationHigh Performance Computing with Hadoop WV HPC Summer Institute 2014
High Performance Computing with Hadoop WV HPC Summer Institute 2014 E. James Harner Director of Data Science Department of Statistics West Virginia University June 18, 2014 Outline Introduction Hadoop
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationA. Aiken & K. Olukotun PA3
Programming Assignment #3 Hadoop N-Gram Due Tue, Feb 18, 11:59PM In this programming assignment you will use Hadoop s implementation of MapReduce to search Wikipedia. This is not a course in search, so
More informationA MapReduce based distributed SVM algorithm for binary classification
A MapReduce based distributed SVM algorithm for binary classification Ferhat Özgür Çatak 1, Mehmet Erdal Balaban 2 1 National Research Institute of Electronics and Cryptology, TUBITAK, Turkey, Tel: 0-262-6481070,
More informationImproving Job Scheduling in Hadoop
Improving Job Scheduling in Hadoop MapReduce Himangi G. Patel, Richard Sonaliya Computer Engineering, Silver Oak College of Engineering and Technology, Ahmedabad, Gujarat, India. Abstract Hadoop is a framework
More informationHow to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning
How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More information