# Map-Reduce for Machine Learning on Multicore

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Map-Reduce for Machine Learning on Multicore Chu, et al.

2 Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang, Scala, Occam, Go, among others - Paper discusses other languages, but 1) I dont know any of them 2) These languages are more up to date and useable.

3 Machine Learning frameworks on multicore How to parallelize algorithms? Highly specialized, non-obvious solutions Often applicable to very few algorithms Example 1: Cargea et al - Restricted to decision trees Example 2: Jin, Agrawal - Shared Memory Machines only

4 Goal Develop general techniques for parallelizing many machine learning algorithms Goal of throw more cores at it optimization Algorithms must fit the Statistical Query Model - Rather than traditional approach of algorithm specific tweaks

5 Statistical Query Model (Kearns, 1999) Permit learning algorithm to access learning problem only through a Statistical Oracle Oracle returns estimated expectation of f(x,y), given over all data instances - The expectation averaged over the training/test distribution

6 All algorithms that computes sufficient statistics can be expressed as a summation over all data points This trait lends itself to batch summations over all data points Why do we care? Sufficient statistics - A statistic is sufficient for a family of probability distributions if the sample from which it is calculated gives no additional information than does the statistic, as to which of those probability distributions is that of the population from which the sample was taken. Pr(X=x T(x) = t, Θ) = Pr(X=x T(X) = t) conditional probability not reliant on underlying parameter Θ. (Source wikipedia article)

7 Multicoring Algorithms Algorithms that sum over data points allows for partitioning Chunk data into subgroups Process and aggregate results

8 Example: Linear Least Squares Fits model of y = T x Solve = min Letting X m X i=1 R m n ( T x i y i ) 2 be the instances ~y =[y 1,...,y m ] be the target values Solve =(X T X) 1 X T ~y

9 To Summation form: Two-phase algorithm Compute sufficient statistics Aggregate statistics and solve Solve to get = A 1 b - Compute sufficient statistics by summing over the data

10 A = X T X b = X T ~y mx mx A = ( ~x i ~x i T ) b = ( ~x i y i ) i=1 i=1 Divide and conquer using the multiple cores What programming architecture can help us do this task?

11 Map-Reduce Google inspired software framework Specialized for big data on clusters Based on functional programming Map: Give sub-problems to worker nodes Reduce: Combine all sub-problem results to obtain answer - Map and Reduce functions not the same as in the functional programming - Nodes are cores in a single-machine implementation, and both machines and cores in a cluster setup

12 If mapper/reducer requires scalar information, they can access from the algorithm ( and ) You can have multiple reducers as well (Load balancing and fault tolerance) ideal in cluster setting but you have increased communication overhead

13 Example: k-means Given a set of data points Compute initial k centroids Cluster data points to these centroids Calculate the new centroid Repeat process till convergence We will use Euclidean (easy to think about)

14 Left image: The dataset we want to run k-means on Right Image: Clustering and calculating the new centroid Old Centroids - Square New Centroids - Triangle

15 Example: k-means Mappers Cluster: Compute distance between points and centroids Centroid: Split up data, compute partial sums of vectors Reduce: Cluster: Choose the closest centroid and apply it to that data point Centroid: Aggregate partial sums and calculate centroids Iterative Clustering process... Eventually settle and have our clusters

16 Can anyone think of a different way to run this mapping job? Example Data Point centroid distance D D D Each image here represents a different map job The map job is responsible for calculating distance between each data point and the centroid specified for the mapper Each point will have k different distances for each centroid

17 Reduce Step I II III Reduce step can take the values for each point, and select the min value

18 Map Reduce 1 II III IV Different map jobs for calculating the new centroids - Calculate the partial sum for each centroid Reduce jobs aggregate and average these partial sums to form new centroid points

19 Example: PCA Computing covariance matrix Σ P = 1 m (P m i=1 x ix T i ) µµt With µ = 1 m P m i=1 x i How would we map this? Reduce? Ans: Map each sum to one of the cores. How to map it further?? Reduce: perform final calculation (subtract

20 Results Theoretical Time Complexity speedup of P on P cores Experiment using two versions of algorithm For dual-core processors at least1.9 times speedup for parallelized algorithms Linear speedup with number of cores Viable method for parallelizing machine learning algorithms P appx. equals P Many different datasets used Original algorithms dont utilize all cpu cycles Sub 1 slope for speed increase - increasing communication overhead

21 Frameworks Influenced a variety of researchers and developers Drost, Ingersoll - Apache Mahout Ghoting et al. - IBM SystemML Zaharia et al. - Berkeley Spark

22 - Developed with Hadoop in mind though Recommendation - Given a set of features for a user, find items that a user may like Clustering - Given set of data, group them based on some set of features Classification - Assign unlabeled data based on learned patterns from trained data Frequent item set mining - Given items in a set, find items that are most frequently associated with elements in that item set Frameworks: Mahout Build scalable machine learning libraries Four use cases Recommendation Clustering Classification Frequent item set mining

23 Still new current release 0.5/0.6 Many algorithms proposed in paper yet to be implemented Ones implemented scale well Compatibility Works with/without Apache Hadoop Works single or multi node setup Many clustering techniques available, some classification techniques, SVD, Similarity, and Recommenders available, with others either open or awaiting integration.

24 (Relatively) Easy to use with Apache Lucene Text analysis Benefits Algorithms available Plug in data and play Apache Lucene - Full-text search framework Can pass term vectors created in Lucene to Mahout for processing (i.e. running LDA job with the term vectors for topic modeling, or calculating pairwise document similarity with cosine sim)

25 Compatibility Available to use right now and open source Detractions Setting up Mahout not always easy What you want might not be implemented Very young, and code/usability shows it

26 Discussion What are some of the drawbacks to Map- Reduce?

27 K-means images generated in Matlab

### Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

### Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook Part 2:Mining using MapReduce Mining algorithms using MapReduce

### Clustering UE 141 Spring 2013

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

### Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

### CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

### Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

### A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

### Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry N. Kamalraj Asst. Prof., Department of Computer Technology Dr. SNS Rajalakshmi College of Arts and Science Coimbatore-49

### A survey on platforms for big data analytics

Singh and Reddy Journal of Big Data 2014, 1:8 SURVEY PAPER Open Access A survey on platforms for big data analytics Dilpreet Singh and Chandan K Reddy * * Correspondence: reddy@cs.wayne.edu Department

### Analytics on Big Data

Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

### Big Data Analytics with Spark and Oscar BAO. Tamas Jambor, Lead Data Scientist at Massive Analytic

Big Data Analytics with Spark and Oscar BAO Tamas Jambor, Lead Data Scientist at Massive Analytic About me Building a scalable Machine Learning platform at MA Worked in Big Data and Data Science in the

### Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

### Overview. Introduction. Recommender Systems & Slope One Recommender. Distributed Slope One on Mahout and Hadoop. Experimental Setup and Analyses

Slope One Recommender on Hadoop YONG ZHENG Center for Web Intelligence DePaul University Nov 15, 2012 Overview Introduction Recommender Systems & Slope One Recommender Distributed Slope One on Mahout and

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA Jayalatchumy D 1, Thambidurai. P 2 Abstract Clustering is a process of grouping objects that are similar among themselves but dissimilar

### Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

### Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

### Apache Hama Design Document v0.6

Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault

### Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

### Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

### Big Data. Lyle Ungar, University of Pennsylvania

Big Data Big data will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus. McKinsey Data Scientist: The Sexiest Job of the 21st Century -

### Cloud Computing based on the Hadoop Platform

Cloud Computing based on the Hadoop Platform Harshita Pandey 1 UG, Department of Information Technology RKGITW, Ghaziabad ABSTRACT In the recent years,cloud computing has come forth as the new IT paradigm.

### Big Data Processing with Google s MapReduce. Alexandru Costan

1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

### Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

### Spark and the Big Data Library

Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and

### Hadoop Design and k-means Clustering

Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

### Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

### Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

### Big Data Analytics Hadoop and Spark

Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software

### Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

### Map/Reduce Affinity Propagation Clustering Algorithm

Map/Reduce Affinity Propagation Clustering Algorithm Wei-Chih Hung, Chun-Yen Chu, and Yi-Leh Wu Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology,

### Advances in Natural and Applied Sciences

AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and

### On the Performance of High Dimensional Data Clustering and Classification Algorithms

On the Performance of High Dimensional Data Clustering and Classification Algorithms Abstract Kathleen Ericson and Shrideep Pallickara Computer Science Department Colorado State University Fort Collins,

### Clustering and Data Mining in R

Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches

### Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

### COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Big Data by the numbers

COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for large-scale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: Pierre-Luc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)

### ANALYTICS CENTER LEARNING PROGRAM

Overview of Curriculum ANALYTICS CENTER LEARNING PROGRAM The following courses are offered by Analytics Center as part of its learning program: Course Duration Prerequisites 1- Math and Theory 101 - Fundamentals

### DATA CLUSTERING USING MAPREDUCE

DATA CLUSTERING USING MAPREDUCE by Makho Ngazimbi A project submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Boise State University March 2009

### Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

### Making Big Data Processing Simple with Spark. Matei Zaharia

Making Big Data Processing Simple with Spark Matei Zaharia December 17, 2015 What is Apache Spark? Fast and general cluster computing engine that generalizes the MapReduce model Makes it easy and fast

### The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

### 2015 The MathWorks, Inc. 1

25 The MathWorks, Inc. 빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2 Challenges of Data Any collection of data sets so large and complex

### Large scale processing using Hadoop. Ján Vaňo

Large scale processing using Hadoop Ján Vaňo What is Hadoop? Software platform that lets one easily write and run applications that process vast amounts of data Includes: MapReduce offline computing engine

### Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies

Text Mining Approach for Big Data Analysis Using Clustering and Classification Methodologies Somesh S Chavadi 1, Dr. Asha T 2 1 PG Student, 2 Professor, Department of Computer Science and Engineering,

### Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

### Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,

### Mining Interesting Medical Knowledge from Big Data

IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 06-10 www.iosrjournals.org Mining Interesting Medical Knowledge from

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

### An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework Jakrarin Therdphapiyanak Dept. of Computer Engineering Chulalongkorn University

### HPC ABDS: The Case for an Integrating Apache Big Data Stack

HPC ABDS: The Case for an Integrating Apache Big Data Stack with HPC 1st JTC 1 SGBD Meeting SDSC San Diego March 19 2014 Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox gcf@indiana.edu http://www.infomall.org

### Big Data Analytics. Lucas Rego Drumond

Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction

### MLlib: Scalable Machine Learning on Spark

MLlib: Scalable Machine Learning on Spark Xiangrui Meng Collaborators: Ameet Talwalkar, Evan Sparks, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia, Rean Griffith, John Duchi, Joseph

### Brave New World: Hadoop vs. Spark

Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,

### Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

### Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

### Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com

Apache Spark : Fast and Easy Data Processing Sujee Maniyam Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com Spark Fast & Expressive Cluster computing engine Compatible with Hadoop Came

### Platforms and Algorithms for Big Data Analytics Chandan K. Reddy Department of Computer Science Wayne State University

Platforms and Algorithms for Big Data Analytics Chandan K. Reddy Department of Computer Science Wayne State University http://www.cs.wayne.edu/~reddy/ http://dmkd.cs.wayne.edu/tutorial/bigdata/ What is

### BIG DATA What it is and how to use?

BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

### Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

### PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad

### marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

### CloudClustering: Toward an iterative data processing pattern on the cloud

CloudClustering: Toward an iterative data processing pattern on the cloud Ankur Dave University of California, Berkeley Berkeley, California, USA ankurd@eecs.berkeley.edu Wei Lu, Jared Jackson, Roger Barga

### Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

UC BERKELEY Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II) Anthony D. Joseph LASER Summer School September 2013 My Talks at LASER 2013 1. AMP Lab introduction 2. The Datacenter

### International Journal of Emerging Technology & Research

International Journal of Emerging Technology & Research High Performance Clustering on Large Scale Dataset in a Multi Node Environment Based on Map-Reduce and Hadoop Anusha Vasudevan 1, Swetha.M 2 1, 2

### RevoScaleR Speed and Scalability

EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

### The MapReduce Framework

The MapReduce Framework Luke Tierney Department of Statistics & Actuarial Science University of Iowa November 8, 2007 Luke Tierney (U. of Iowa) The MapReduce Framework November 8, 2007 1 / 16 Background

### Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

### Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

### High Productivity Data Processing Analytics Methods with Applications

High Productivity Data Processing Analytics Methods with Applications Dr. Ing. Morris Riedel et al. Adjunct Associate Professor School of Engineering and Natural Sciences, University of Iceland Research

### Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control EP/K006487/1 UK PI: Prof Gareth Taylor (BU) China PI: Prof Yong-Hua Song (THU) Consortium UK Members: Brunel University

### CIEL A universal execution engine for distributed data-flow computing

Reviewing: CIEL A universal execution engine for distributed data-flow computing Presented by Niko Stahl for R202 Outline 1. Motivation 2. Goals 3. Design 4. Fault Tolerance 5. Performance 6. Related Work

### MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

### An approach to Machine Learning with Big Data

An approach to Machine Learning with Big Data Ella Peltonen Master s Thesis University of Helsinki Department of Computer Science Helsinki, September 19, 2013 HELSINGIN YLIOPISTO HELSINGFORS UNIVERSITET

### Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

### HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

### Information Processing, Big Data, and the Cloud

Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive

### Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

### Benchmarking of Data Mining Techniques as Applied to Power System Analysis

IT 13 061 Examensarbete 30 hp September 2013 Benchmarking of Data Mining Techniques as Applied to Power System Analysis Can ANIL Institutionen för informationsteknologi Department of Information Technology

### A Novel Cloud Based Elastic Framework for Big Data Preprocessing

School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

Hadoop Ecosystem Overview CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook Agenda Introduce Hadoop projects to prepare you for your group work Intimate detail will be provided in future

### Integrating Apache Spark with an Enterprise Data Warehouse

Integrating Apache Spark with an Enterprise Warehouse Dr. Michael Wurst, IBM Corporation Architect Spark/R/Python base Integration, In-base Analytics Dr. Toni Bollinger, IBM Corporation Senior Software

### MapReduce Design of K-Means Clustering Algorithm

MapReduce Design of K-Means Clustering Algorithm Prajesh P Anchalia Department of CSE, Email: prajeshanchalia@yahoo.in Anjan K Koundinya Department of CSE, Email: annjank@gmail.com Srinath N K Department

### Large data computing using Clustering algorithms based on Hadoop

Large data computing using Clustering algorithms based on Hadoop Samrudhi Tabhane, Prof. R.A.Fadnavis Dept. Of Information Technology,YCCE, email id: sstabhane@gmail.com Abstract The Hadoop Distributed

### Apache Mahout's new DSL for Distributed Machine Learning. Sebastian Schelter GOTO Berlin 11/06/2014

Apache Mahout's new DSL for Distributed Machine Learning Sebastian Schelter GOO Berlin /6/24 Overview Apache Mahout: Past & Future A DSL for Machine Learning Example Under the covers Distributed computation

### Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

### HIGH PERFORMANCE BIG DATA ANALYTICS

HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning