# Data Mining for Big Data: Tools and Approaches. Pace SDSC SAN DIEGO SUPERCOMPUTER CENTER


## Transcription

1 Data Mining for Big Data: Tools and Approaches Pace SDSC

2 To do: R doMC exercise? Test train accounts. Paradigm stream example from book? And map/reduce join and vector multiplication?

3 Outline: Scaling; What is Big Data; Parallel options for R; Map/Reduce; RHadoop Map/Reduce

4 Scaling, practically. Scaling (with or without more data) means: more processing/searching (e.g. training more complicated neural networks), more complex analysis (larger ensembles), more sampling (more trees in a Random Forest). Sometimes this is easy to parallelize (as with sampling); sometimes there is too much communication between the parts (as with neural networks).

5 Scaling, in a nutshell. R takes advantage of math libraries for vector operations, and R packages provide multicore, multinode (snow), or map/reduce (RHadoop) options. However, model implementations are not necessarily built to use parallel backends, and some models are more amenable to parallel versions than others.

6 R vector operations and scale. The Intel Math Kernel Library provides fast operations for sums and multiplications, using threads across CPU cores.

7 Consider regression computations. Linear model: Y = XB, where Y = outcomes and X = the data matrix. Algebraically, we could take an inverse to get B = (X'X)^-1 X'Y (time consuming). Or, better: decompose X into triangular matrices (e.g. QR; uses more memory) and then solve the triangular system more easily.
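As a toy sketch of the direct normal-equations route (Python for illustration; the function name and the tiny dataset are made up for the example), here is the one-predictor case, where inverting X'X reduces to inverting a 2×2 matrix:

```python
def fit_line(x, y):
    """Least squares for y = b0 + b1*x by solving the 2x2 normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)                 # entries of X'X
    sxy = sum(a * b for a, b in zip(x, y))      # entries of X'Y
    # Solve [[n, sx], [sx, sxx]] @ [b0, b1] = [sy, sxy] by explicit inversion
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])   # data lies exactly on y = 1 + 2x
```

For larger P this explicit inversion is what makes the direct route slow and numerically fragile, which is why R's lm() solves the system via a QR decomposition instead.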

8 Consider regression models. Related models and functions: lm() # linear model; glm() # generalized linear model (logistic regression, etc.); aov() # analysis of variance (returns an ANOVA table of F-scores). All of these work on a system of equations.

9 Solving linear systems: performance with R. glm(y~x, family=gaussian) # Gaussian regression (like lm); glm(y~x, family=binomial) # logistic regression (Y = 0 or 1). [Chart: wall time in seconds (up to ~30 min) versus square data-matrix size (1K, 2K, 4K, 8K rows×cols) for GLM logistic, GLM Gaussian, lm(), matrix inverse, solve(a,b), and QR.]

10 R multicore. Run loop iterations on separate cores; the returned items are combined into a list by default.

```r
install.packages("doMC")
library(doMC)
registerDoMC(cores=15)   # allocate workers
getDoParWorkers()

# %dopar% spreads the loop across cores (iterations must be independent);
# %do% runs it serially
results = foreach(i=1:15, .combine=rbind) %dopar% {
  # your code here
  # return( a variable or object )
}
```

Specify `.combine=rbind` to combine the results into an array with row bind.

11 R multicore exercise. Can be run on a Gordon compute node or on a laptop. First: PuTTY (Windows) or ssh (Mac terminal) to Gordon.

12 Enter your userid and password (you get signed into a login node); training accounts are train91 to train110. `$ ls` lists files; `$ sh QSUBH.txt` executes a shell script that requests 2 compute nodes.

13 From the compute node (here it's called gcn-6-71):

```
$ cd BootCamp
$ cd Rtests
$ module load R
$ R
```

14 source("Ex1_RdoMC.R") runs the exercise script. The first time, you'll get install requests and info for the doMC package.

15 The Ex1 script tests %dopar% with and without .combine. How are the return values combined?

16 source("Ex2_MC.R"). The script builds and multiplies two matrices. Enter a number of cores (1 to 16) and a block size (e.g. 100, 1000, 2000). You should see processing times for the different doMC steps: (1) parallel with %dopar%, (2) serial with just %do%, (3) the native R matrix operation.

17 Multicore to multinodes: Gordon hardware.

- Intel Sandy Bridge compute node: sockets & cores 2 & 16; clock speed 2.6 GHz; DRAM capacity and speed 64 GB, 1,333 MHz
- Intel 710 eMLC flash I/O node: 16 NAND flash SSD drives; SSD capacity per drive & per node 16 × 300 GB = 4.8 TB
- SMP super-node (via vSMP): 32 compute nodes / 2 I/O nodes; 2 TB addressable DRAM; 11.6 TB addressable memory including flash
- Gordon (aggregate): 1,024 compute nodes; 16,384 compute cores; 341 TF peak performance; 64 TB DRAM; 300 TB SSD
- InfiniBand interconnect: dual-rail, 3D torus architecture; QDR link bandwidth; vendor Mellanox
- Lustre-based disk I/O subsystem (shared): total storage 4 PB current / 6 PB planned (raw); total bandwidth 100 GB/s

18 Scale and computations: multicore and multinode. There is a communication vs. distribution tradeoff, and some operations are always best when you can stay on one core.

19 R multinode: parallel backend. Run loop iterations on separate nodes.

```r
library(doSNOW)

# allocate a cluster as the parallel backend
cl <- makeCluster(mpi.universe.size(), type='MPI')
clusterExport(cl, c('data'))
registerDoSNOW(cl)

# %dopar% spreads the loop across cores and nodes
results = foreach(i=1:15, .combine=rbind) %dopar% {
  # your code here
  # return( a variable or object )
}

stopCluster(cl)
mpi.exit()
```

20 Multiple CPUs may not help so much. Gordon has a virtual (vSMP) option to spread threads out across CPUs. [Charts: wall time (s) versus square matrix size N = 10K to 50K for matrix multiplication (8 vs. 32 threads) and matrix inversion (16 vs. 32 threads).] Threads across CPUs: more is better for multiplication, fewer is better for inversion (or use a different operation).

21 So how Big is Big Data, or is it buzz?

22

23 The 4 V's of Big Data (IBM, 2012)

24 Uniquely big-data problems. Streaming data from sensors (energy grids): can't store it, so process/analyze it as it comes. Internet PageRank for searches: constantly new links and pages going into the graph database. Data/video uploads (YouTube, security cams): no annotations. Digital text (books, medical notes, blogs): unstructured. Twitter messaging: constantly changing topics. These are not traditional database apps!

25 What to do with big data? (Eric Sall)

- Big Data Exploration: get an overall understanding of what is there
- 360-degree view of the customer: combine internally available and external information to gain a deeper understanding of the customer
- Monitoring: cyber-security and fraud in real time
- Operational Analysis: leverage machine-generated data to improve business effectiveness
- Data Warehouse Augmentation: enhance the warehouse solution with new information models and architecture

26 Big Data, practically: too big to fit in one computer's memory, too big to make one pass through on one computer, too big for one hard drive. How do you process and analyze it?

27 Got big data? The Map/Reduce framework was started by Google; the main idea is to bring the computation to the data. Apache Hadoop is one implementation, and Hadoop is an entire ecosystem of supporting tools: HDFS, the Hadoop Distributed File System (for partitioning and merging data reliably, using a binary format); Hive, a database using map/reduce on HDFS; Pig, a query tool using map/reduce on HDFS.

28 Map/Reduce framework. The user defines the keys & values and the user-defined map and reduce functions; MR provides parallelization, concurrency, and the intermediate data handling (sorting by key & value).

29 Taking advantage of Map/Reduce. Map-to-Reduce: what is a key? Whatever you need for the sorting; it should be related to the Σ (aggregation) done by the reducer. Examples: word count: the key is the word; matrix multiplication: the key is the (row, col) index pair.

30 Map/Reduce algorithm. General conditions: operations/data are separable and independent; data that doesn't fit into memory; data that doesn't need to be all read into memory at once. General strategy: if you have Σ (some process), then Map the process over the parts and Reduce (Σ) over the results.
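The Σ(some process) strategy can be sketched in a few lines (Python for illustration; the "some process" here is squaring, an arbitrary choice for the example):

```python
from functools import reduce

data = list(range(1, 101))

# Split the data into 4 independent parts
parts = [data[i::4] for i in range(4)]

# Map: apply "some process" (squaring) and sum within each part
partials = [sum(x * x for x in part) for part in parts]

# Reduce: Σ over the partial results
total = reduce(lambda a, b: a + b, partials)
```

Because each part is processed independently, the map step can run on separate cores or nodes, and only the small partial sums need to be communicated.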

31 pause

32 Hadoop Map/Reduce interfaces with R (slides from G. Lockwood, SDSC): R streaming (simplest) or the Hadoop API. Streaming pipes input/output through the steps:

```
cat input | Rscript mapper.r | sort | Rscript reducer.r > output
```

You provide these two scripts; Hadoop does the rest. This generalizes to any language you want (Perl, Python, etc.).

33 Paradigmatic example: word counting. How would you count all the words in Moby Dick? "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation." How could you count all the words in all web pages (assuming the data is spread out over many nodes)? Use Map/Reduce: take the computation to the nodes.

34 Wordcount: Hadoop streaming mapper. Emit key-value pairs (cat is "concatenate and print"): split each line into words and use the words as keys.

```r
emit.keyval <- function(key, value) {
  cat(key, '\t', value, '\n', sep='')
}

stdin <- file('stdin', open='r')
while ( length(line <- readLines(stdin, n=1)) > 0 ) {
  line <- gsub('(^\\s+|\\s+$)', '', line)       # trim leading/trailing whitespace
  keys <- unlist(strsplit(line, split='\\s+'))  # split the line into words
  value <- 1
  lapply(keys, FUN=emit.keyval, value=value)
}
close(stdin)
```

Example from Glenn Lockwood, SDSC.

35 What one mapper does.

line = "Call me Ishmael. Some years ago--never mind how long"
keys = Call, me, Ishmael., Some, years, ago--never, mind, how, long

emit.keyval(key, value) emits:

```
Call 1
me 1
Ishmael. 1
Some 1
years 1
ago--never 1
mind 1
how 1
long 1
```

...to the reducers.

36 Reducer loop. If this key is the same as the previous key, add this key's value to our running total. Otherwise: print out the previous key's name and its running total, reset the running total to this key's value, and "this key" now becomes the "previous key".

37 Wordcount: streaming reducer (1/2). Get each key and value; add up the values.

```r
last_key <- ""
running_total <- 0

stdin <- file('stdin', open='r')
while ( length(line <- readLines(stdin, n=1)) > 0 ) {
  line <- gsub('(^\\s+)|(\\s+$)', '', line)
  keyvalue <- unlist(strsplit(line, split='\t', fixed=TRUE))
  this_key <- keyvalue[[1]]
  value <- as.numeric(keyvalue[[2]])
  if ( last_key == this_key ) {
    running_total <- running_total + value
  } else {
    # (to be continued...)
```

38 Wordcount: streaming reducer (2/2). For each new key, emit <key, sum>.

```r
  } else {
    if ( last_key != "" ) {
      cat( paste(last_key, '\t', running_total, '\n', sep='') )
    }
    running_total <- value
    last_key <- this_key
  }
}
if ( last_key == this_key ) {
  cat( paste(last_key, '\t', running_total, '\n', sep='') )
}
close(stdin)
```

39 Testing mappers/reducers. Debugging Hadoop is not fun, so test locally with pipes first:

```
$ head -n100 pg2701.txt | ./wordcount-streaming-mapper.r | sort | ./wordcount-streaming-reducer.r
...
with 5
word, 1
world 1
you 3
You 1
```

41

42 Taking advantage of Map/Reduce. Case study: election-related tweets and the daily change in approval ratings. What is the relationship between tweets and approval ratings?

43 Tweet data. Twitter provides access to data: unstructured message text plus metadata (partly preprocessed by CS181, Freund):

```
{"created_time": "13:27: ", "text": "Vote Obama Man...", "user_id": , "id": , "created_date": "Thu Oct "}
{"created_time": "01:12: ", "text": "I swear these dudes in this class dont understand english. Its like my teacher is speaking some foreign language to them", "user_id": , "id": , "created_date": "Wed Sep "}
```

etc.

44 Twitter and other data: Obama approval minus disapproval, from poll tracking leading up to the 2012 election.

45 Defining the Map/Reduce flow. Goal: turn tweet messages into data by day. Target: approval change from the previous day. Choices: track message elements (words, ) or track metadata (date, users, replies, ). Let's try word counts by date.

46 Defining the Map/Reduce flow. Approach: extend word count into a <date, word> count. Map: split each tweet into parts and emit Key = <date, word>, Value = 1. Reduce: add up the values. Example result: ,TAXES 6

47 Defining the Map/Reduce flow. What other aggregations do we need? At what point will the data fit into memory? (1) Do we need the list of unique words and their overall counts? (2) If you want to correlate the target to unexpected word counts, what sums does that need?

48 My example flow:

1. Process messages. Map: split each tweet message into <date,word>, 1. Reduce: sum the counts for each <date,word>.
2. Re-map. Map: split <date,word>, 1 into <date>, 1. Reduce: sum the counts for each <date>.
3. Re-map. Map: split <date,word>, 1 into <word>, 1. Reduce: sum the counts for each <word> (the unique set of words).
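Step 1 of the flow can be sketched in a few lines (Python for illustration; the field names follow the sample tweets shown earlier, and the tiny in-memory Counter stands in for Hadoop's shuffle-and-sum):

```python
from collections import Counter

tweets = [
    {"created_date": "Thu Oct 18", "text": "Vote Obama Man"},
    {"created_date": "Thu Oct 18", "text": "vote early"},
]

def mapper(tweet):
    # Map: emit one ((date, word), 1) pair per word in the message
    for word in tweet["text"].upper().split():
        yield (tweet["created_date"], word), 1

# Reduce: sum the values per composite <date, word> key
counts = Counter()
for tweet in tweets:
    for key, value in mapper(tweet):
        counts[key] += value
```

Steps 2 and 3 would re-map these <date,word> counts, dropping one component of the composite key and summing again.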

49 Example flow, downstream. For analysis, perhaps, the end product is a date × word data matrix: each row is the count of words for one date (using the top P words), joined with the approval rating changes (as +1 or -1 down column 1). The example matrix has rows DATE = Apr 01, Apr 02, Apr 03 and columns APPR, then word columns Vote, Billion, senator, june, ...

50 Gordon access: use PuTTY (Windows) or ssh to get a Unix shell on Gordon.

51 Get a directory listing (ls) and get to a compute node:

```
$ cd BootCamp
$ sh QSUBH.txt      # get to a compute node
$ cd BootCamp
$ cd Rhad_Tweets
$ ls
```

52 Some scripts for date-word counting of tweet messages. The process.r file produces a data matrix of dates × word-vector counts (with the approval target in column 1).

53 Sample of raw data

54 Test mapper & reducer

55 Test output

57 Note slave nodes, task trackers, data trackers

58 Unix script Do_raw2dtwdcnt

59 Exec script Do_raw2dtwdcnt

60 Exec Do_dtwd_splitcnts.cmd

61 Sample output from the reduce steps

62 Sample output from reduce steps Process cnts into data matrix

63 Data matrix for analysis

64

65

67 In userlogs are jobs and attempts

68 File parts, workers and sorting

69

70 pause

71 Map/Reduce algorithm. General conditions: operations/data are separable and independent; data that doesn't fit into memory; data that doesn't need to be all read into memory at once. General strategy: if you can divide the problem into parts, then Map some process over the parts and Reduce (re-organize) over the map results.

72 Join multiple datasets on a key. Problem: two files in HDFS should be combined on a key value. In pseudo-SQL: SELECT * FROM tableA, tableB WHERE A.key = B.key. Joins can be inner, left outer, or right outer. [Diagram: inner join vs. left outer join.]

73 Join Map/Reduce strategy. Problem: join two <key, value> sets.

```
A = <wd> <count>        B = <wd> <date>
    about 5                 able  Nov 16
    actor 15                actor Feb 01
    bacon 3                 actor May 03
    ..                      bacon Apr 11
                            ..
```

We want something like A join B = <wd> <list of values>:

```
actor 15, Feb 01
actor 15, May 03
bacon 3, Apr 11
..
```

74 Join Map/Reduce strategy. One solution: stream both the A and B tables to map. The intermediate step will shuffle the data so that equal keys land together:

```
Key    Value
about  A,5
able   B,Nov 16
actor  A,15
actor  B,Feb 01
actor  B,May 03
..
```

What should the reducer do?

75 Join Map/Reduce strategy. One solution: a reducer has access to all rows from A and B with the same key value, so it can split those rows back into A or B (how? by the table tag carried in the value) and take a cross-product:

```
Key    Value         A:  about 5       B:  able  Nov 16
about  A,5               actor 15          actor Feb 01
able   B,Nov 16                            actor May 03
actor  A,15
actor  B,Feb 01
actor  B,May 03
..
```
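That reducer logic can be sketched as follows (Python for illustration; the rows are the ones from the slide, tagged with their source table, and the list-based "shuffle" is an in-memory stand-in):

```python
from itertools import groupby

# After the shuffle: sorted by key, each value tagged with its source table
shuffled = [
    ("about", ("A", "5")),
    ("able",  ("B", "Nov 16")),
    ("actor", ("A", "15")),
    ("actor", ("B", "Feb 01")),
    ("actor", ("B", "May 03")),
    ("bacon", ("A", "3")),
    ("bacon", ("B", "Apr 11")),
]

joined = []
for key, rows in groupby(shuffled, key=lambda kv: kv[0]):
    a_vals, b_vals = [], []
    for _, (tag, value) in rows:
        # Split the rows back into A or B using the tag
        (a_vals if tag == "A" else b_vals).append(value)
    # Inner join: cross-product of the A rows and B rows for this key
    for a in a_vals:
        for b in b_vals:
            joined.append((key, a, b))
```

Keys that appear in only one table ("about", "able") produce no output here; emitting the unmatched side instead would give a left or right outer join.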

76 Join Map/Reduce strategy: size matters. If one dataset fits in memory, it can be replicated across the nodes and joined in memory with Map only (replicated join). If both datasets are large, use a full Map/Reduce (repartition join). If both datasets are large but one can be filtered down, do one map/reduce first (semi-join).

77 Summary of Map/Reduce design considerations: composite keys and/or values; grouping (bundle keys into groups); replication (repeating values across more than one key); cascading Map/Reduce jobs.

78 Machine learning. Most algorithms have some summation step, so Map/Reduce will speed up jobs, but parameter estimation requires communication between the parts. Some algorithms look at interdependencies across the N×P data matrix (e.g. linear regression inverts X'X, a P×P matrix; neural nets propagate errors). Some algorithms use interdependencies among observations, e.g. SVM kernels. Some algorithms mostly take distances and sums, e.g. k-means and Naive Bayes.
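As an illustration of the "distances and sums" case, here is a sketch of one k-means iteration in Map/Reduce style (Python; the 1-D data, the two starting centroids, and the helper names are made up for the example). Each mapper emits per-cluster (sum, count) statistics for its chunk; the reducer adds them up and recomputes the centroids:

```python
def kmeans_map(chunk, centroids):
    """Map: per-cluster (sum, count) for one chunk of 1-D points."""
    stats = {c: [0.0, 0] for c in range(len(centroids))}
    for x in chunk:
        c = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        stats[c][0] += x    # running sum for the nearest centroid
        stats[c][1] += 1    # running count
    return stats

def kmeans_reduce(all_stats, k):
    """Reduce: combine per-chunk stats and recompute centroids."""
    sums, counts = [0.0] * k, [0] * k
    for stats in all_stats:
        for c, (s, n) in stats.items():
            sums[c] += s
            counts[c] += n
    return [sums[c] / counts[c] for c in range(k) if counts[c]]

chunks = [[1.0, 2.0, 9.0], [1.5, 10.0, 11.0]]   # data split across mappers
centroids = [0.0, 8.0]
new_centroids = kmeans_reduce([kmeans_map(ch, centroids) for ch in chunks], 2)
```

Only the small (sum, count) pairs cross the network, which is why k-means parallelizes well under Map/Reduce while kernel or error-propagation methods do not.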

79 Machine learning and Map/Reduce. Mahout for Hadoop is a Java library of machine learning algorithms: it processes data in chunks that fit in memory, offers command-line and programming interfaces, and includes many advanced algorithms. Spark (UC Berkeley) is a newer take on Map/Reduce: the main idea is to keep data in memory and avoid writing out and shuffling unless necessary; its tools and libraries are just starting to get built.


Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

### Big Data Course Highlights

Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like

### Extreme Computing. Hadoop. Stratis Viglas. School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk. Stratis Viglas Extreme Computing 1

Extreme Computing Hadoop Stratis Viglas School of Informatics University of Edinburgh sviglas@inf.ed.ac.uk Stratis Viglas Extreme Computing 1 Hadoop Overview Examples Environment Stratis Viglas Extreme

### Other Map-Reduce (ish) Frameworks. William Cohen

Other Map-Reduce (ish) Frameworks William Cohen 1 Outline More concise languages for map- reduce pipelines Abstractions built on top of map- reduce General comments Speci

### IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA Jayalatchumy D 1, Thambidurai. P 2 Abstract Clustering is a process of grouping objects that are similar among themselves but dissimilar

### Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Accelerating Hadoop MapReduce Using an In-Memory Data Grid By David L. Brinker and William L. Bain, ScaleOut Software, Inc. 2013 ScaleOut Software, Inc. 12/27/2012 H adoop has been widely embraced for

### Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

### Hadoop/MapReduce Workshop. Dan Mazur, McGill HPC daniel.mazur@mcgill.ca guillimin@calculquebec.ca July 10, 2014

Hadoop/MapReduce Workshop Dan Mazur, McGill HPC daniel.mazur@mcgill.ca guillimin@calculquebec.ca July 10, 2014 1 Outline Hadoop introduction and motivation Python review HDFS - The Hadoop Filesystem MapReduce

### Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms

Performance Evaluation for BlobSeer and Hadoop using Machine Learning Algorithms Elena Burceanu, Irina Presa Automatic Control and Computers Faculty Politehnica University of Bucharest Emails: {elena.burceanu,

### GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

### International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

### Yahoo! Grid Services Where Grid Computing at Yahoo! is Today

Yahoo! Grid Services Where Grid Computing at Yahoo! is Today Marco Nicosia Grid Services Operations marco@yahoo-inc.com What is Apache Hadoop? Distributed File System and Map-Reduce programming platform

### Scaling Out With Apache Spark. DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf

Scaling Out With Apache Spark DTL Meeting 17-04-2015 Slides based on https://www.sics.se/~amir/files/download/dic/spark.pdf Your hosts Mathijs Kattenberg Technical consultant Jeroen Schot Technical consultant

### Data Management Using MapReduce

Data Management Using MapReduce M. Tamer Özsu University of Waterloo CS742-Distributed & Parallel DBMS M. Tamer Özsu 1 / 24 Basics For data analysis of very large data sets Highly dynamic, irregular, schemaless,

### H2O on Hadoop. September 30, 2014. www.0xdata.com

H2O on Hadoop September 30, 2014 www.0xdata.com H2O on Hadoop Introduction H2O is the open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms

### Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

### Large-Scale Test Mining

Large-Scale Test Mining SIAM Conference on Data Mining Text Mining 2010 Alan Ratner Northrop Grumman Information Systems NORTHROP GRUMMAN PRIVATE / PROPRIETARY LEVEL I Aim Identify topic and language/script/coding

### Workshop on Hadoop with Big Data

Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly

### Internals of Hadoop Application Framework and Distributed File System

International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

### Map- reduce, Hadoop and The communica3on bo5leneck. Yoav Freund UCSD / Computer Science and Engineering

Map- reduce, Hadoop and The communica3on bo5leneck Yoav Freund UCSD / Computer Science and Engineering Plan of the talk Why is Hadoop so popular? HDFS Map Reduce Word Count example using Hadoop streaming

### Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

### ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

### Parallel Computing for Data Science

Parallel Computing for Data Science With Examples in R, C++ and CUDA Norman Matloff University of California, Davis USA (g) CRC Press Taylor & Francis Group Boca Raton London New York CRC Press is an imprint

### Wrangler: A New Generation of Data-intensive Supercomputing. Christopher Jordan, Siva Kulasekaran, Niall Gaffney

Wrangler: A New Generation of Data-intensive Supercomputing Christopher Jordan, Siva Kulasekaran, Niall Gaffney Project Partners Academic partners: TACC Primary system design, deployment, and operations

### A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

### Introduction to MapReduce, Hadoop, & Spark. Jonathan Carroll-Nellenback Center for Integrated Research Computing

Introduction to MapReduce, Hadoop, & Spark Jonathan Carroll-Nellenback Center for Integrated Research Computing Big Data Outline Analytics Map Reduce Programming Model Hadoop Ecosystem HDFS, Pig, Hive,

### Chase Wu New Jersey Ins0tute of Technology

CS 698: Special Topics in Big Data Chapter 4. Big Data Analytics Platforms Chase Wu New Jersey Ins0tute of Technology Some of the slides have been provided through the courtesy of Dr. Ching-Yung Lin at