Crunching Big Data with R And Hadoop!

1 Crunching Big Data with R and Hadoop!
Strata/Hadoop World NYC 2012, 23 October 2012
Flash drives with tutorial materials are near the door. Please start downloading the tutorial materials onto your laptop. There is a PDF with instructions on the flash drives.

2 Today's Agenda
Part 1: Basics
- Introduction
- What is Hadoop
- What is RHadoop
- Basic R syntax for the uninitiated
- Word count example
Part 2: Scalability
- Tradeoffs and tabular data
- Matrix multiplication example
- Break
- Dealing with Zipfian distribution of data
- Advanced word count example
Part 3: Trending Topics
- Intro to LDA
- Description of problem
- LDA trending topics example
Wrap-Up and Questions

3 Section 1: Introduction to RHadoop
- Discussion of Hadoop, R, and Map/Reduce
- R for Hadoop users
- RHadoop's take on Map/Reduce
- Word count exercise

4 What is RHadoop?
- Integration between the R statistical package and Hadoop's Distributed File System and Map/Reduce computation engine
- Moves algorithm execution to data
- Provides access to lots of high-quality statistical libraries
- Speeds work by processing in parallel

5 Use RHadoop:
- For data exploration
- To run a bunch of tasks in parallel
- To sort data
- To sample data
- To join data
This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.

6 Don't Use RHadoop:
- To implement your next ultra-advanced, high-performance machine learning algorithm based on Map/Reduce

7 Don't Be Embarrassed to be Parallel (Or Serial)
- Overall philosophy
  - Use simple methods first
  - Use complicated methods only if simple ones fail
- All examples are reasonably straightforward

8 The (Big) Data Science Lifecycle
- Sample data: because there's not usually a reason to use the whole data set
- Model data: modeling will help you understand where interesting things are
- Interpret: find trends, iterate
Diagram: Sample/Summarize, Model, Interpret

9 Hadoop and its Components
- Runs over many machines
- Highly scalable architecture
- MapReduce: distributes processing to data
- HDFS: holds your data
Diagram: Hadoop comprises Map/Reduce and HDFS; parallel processing moves to data

10 What is MapReduce?
- Popularized in Google's 2004 paper
- Makes writing distributed algorithms easier
- Removes freedom (synchronization, locking) to achieve safety

11 MapReduce Introduction
- All data is represented as a key-value pair
- Two major phases
  - Map: reads the input data and outputs intermediate key-value pairs
  - Reduce: values with the same key are sent to the same reducer and (optionally) summarized

12 Map Phase
Raw data: How much wood would a woodchuck chuck if a woodchuck could chuck wood?
Mapper output: how 1, much 1, wood 1, would 1, a 1, woodchuck 1, chuck 1, if 1, a 1, ...

13 Reduce Phase
Raw data: How much wood would a woodchuck chuck if a woodchuck could chuck wood?
Mapper output: how 1, much 1, wood 1 | would 1, a 1, woodchuck 1 | chuck 1, if 1, a 1, ...
Reducer output: a 2, how 1, woodchuck 2 | wood 2, would 1 | chuck 2, if 1, much ...
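
The map, shuffle, and reduce steps on the woodchuck slides can be sketched in plain Python. This is a single-process simulation for illustration only, not the rmr API; the function names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are assumptions introduced here.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        word = word.strip("?.,!;:")
        if word:
            yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: summarize all values that share a key."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# The input is split into two lines to stand in for two mappers.
counts = word_count(["How much wood would a woodchuck chuck",
                     "if a woodchuck could chuck wood?"])
print(sorted(counts.items()))
```

On a real cluster the shuffle is done by Hadoop, and each reducer only sees the keys routed to it.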

14 A Brief Introduction to R
We'll quickly walk through ~/Examples/Solutions/r-intro.R from your virtual machine

15 MapReduce in RHadoop
Please refer to ~/Examples/Solutions/rmr-intro.r on your virtual machine

16 Intro to RHadoop and Word Count
- Simple word count solution can be found in ~/Examples/Solutions/
- Simple word count partial solution is available in ~/Examples/Fill-In/

17 Recap
RHadoop advantages over Java MapReduce
- Many fewer lines of code
- Make use of existing R functions from within Map/Reduce

18 Section 2: Breaking Things Up
- Computing over tabular data: scalability challenges
- Matrix multiplication exercise
- MapReduce and Zipfian distribution
- Zipfian distribution exercise

19 Tabular Data
- Ubiquitous
  - Financial information
  - Medical information
  - Surveys
  - Market reports
  - Anything you put in Excel
- Can get big quickly due to the denormalized nature of tables in analytics

20 Map/Reduce Techniques for Dealing with Tabular Big Data
- The situations where you need to do this are rare but painful when you arrive at them!
- Why might you do this? Exploratory analysis, kitchen sink method using covariance matrix

21 Example: Correlation Matrix for FRED Economic Time Series Data
Figure: correlation matrix, sorted by magnitude horizontally
Cool things:
- Random collection of economic variables
- Groupings are shown in the natural ordering
  - Relatedness to other things
  - Un-relatedness to other things
- We can quickly identify data that is unrelated to much of anything
- We can quickly discover outliers

22 FRED Data: General Outcomes
- Location is closely related to correlation
- Earnings and the civilian labor force variables are strongly negatively correlated with residence adjustments
- Government employment is correlated with the civilian labor force and with educational and health services employment
- Correlation between heating oil in New York Harbor and jet fuel on Gulf Coast

23 FRED Data: General Outcomes
- The unemployment rate is negatively correlated with hours worked by manufacturing personnel, only in some regions
- Per-capita personal income and foreign travel services are strongly negatively correlated

24 Tradeoffs for Large Matrix Operations Like Correlation Matrix
- Tradeoffs must be made for different types of operations
- As you break the problem down and increase scalability, you incur more overhead/latency from structure metadata
- We'll use matrix multiplication as an illustrative example of trade-offs
  - Map/reduce actually isn't great at representing matrix multiplication
  - Hence this is primarily illustrative

25 Representing Tabular Data
Two main ways: row (or column) oriented, and cell oriented
- Row/column oriented
  - Column (or row) has to fit in memory
  - Often need to transpose data (a map/reduce job) to do tasks like multiplication
- Cell oriented
  - Can be used to accomplish any type of operation without need to change orientation
  - Operations are 100% disk based, therefore highly scalable
  - Tends to be slow for denser matrices because of the excess amounts of data involved

26 An Illustrative Example: Matrix Multiplication
A (m x p) multiplied by B (p x n) gives AB (m x n). For a row of A with four entries:
$$AB_{1,1} = A_{1,1}B_{1,1} + A_{1,2}B_{2,1} + A_{1,3}B_{3,1} + A_{1,4}B_{4,1}$$
$$AB_{1,2} = A_{1,1}B_{1,2} + A_{1,2}B_{2,2} + A_{1,3}B_{3,2} + A_{1,4}B_{4,2}$$

27 Matrix Multiplication: First Approach
Map phase: for key AB_{1,1}, emit two values: the row A_{1,1} A_{1,2} A_{1,3} A_{1,4} of A and the column B_{1,1} B_{2,1} B_{3,1} B_{4,1} of B
Reduce phase: multiply the row by the column element-wise and sum (Σ) to produce AB_{1,1}
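
A row-oriented multiplication along these lines might look like the following plain-Python simulation. It is a sketch of one reading of the slide, not the tutorial's R solution; all function names are assumptions. Each whole row of A and whole column of B is emitted once per output cell that needs it, and the reducer computes a dot product.

```python
from collections import defaultdict


def map_rows_and_columns(a_rows, b_cols):
    """Emit every row of A and column of B under each output cell key (i, j)."""
    m, n = len(a_rows), len(b_cols)
    for i, row in enumerate(a_rows):
        for j in range(n):
            yield (i, j), ("A", row)
    for j, col in enumerate(b_cols):
        for i in range(m):
            yield (i, j), ("B", col)


def reduce_dot_product(values):
    """Each key receives exactly one row of A and one column of B."""
    vectors = dict(values)  # {"A": row, "B": column}
    return sum(a * b for a, b in zip(vectors["A"], vectors["B"]))


def multiply_row_oriented(a, b):
    b_cols = [list(col) for col in zip(*b)]  # transposing B yields its columns
    groups = defaultdict(list)
    for key, value in map_rows_and_columns(a, b_cols):
        groups[key].append(value)
    return [[reduce_dot_product(groups[(i, j)]) for j in range(len(b_cols))]
            for i in range(len(a))]


product = multiply_row_oriented([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The whole row and column travel together, which is why this orientation needs a row (or column) to fit in memory.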

28 Matrix Multiplication: Another Approach
Map phase: for key AB_{1,1}, emit each cell as its own value: A_{1,1}, A_{1,2}, ... from A and B_{1,1}, B_{2,1}, ... from B
Reduce phase: pair the cells (A_{1,1} B_{1,1}, A_{1,2} B_{2,1}, A_{1,3} B_{3,1}, A_{1,4} B_{4,1}), multiply each pair, and sum (Σ) to produce AB_{1,1}
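
A cell-oriented variant can be sketched the same way in plain Python. The record layout `(matrix_name, row, col, value)` and the function names are assumptions made for this illustration. Every cell is emitted on its own, so nothing larger than a single cell has to be held per record, at the cost of many more intermediate pairs.

```python
from collections import defaultdict


def to_cells(name, matrix):
    """Flatten a matrix into (matrix_name, row, col, value) records."""
    return [(name, r, c, v) for r, row in enumerate(matrix) for c, v in enumerate(row)]


def map_cells(cells, m, n):
    """Route each cell to every output cell (i, j) it contributes to."""
    for name, r, c, value in cells:
        if name == "A":   # A[r][c] contributes to AB[r][j] for every j
            for j in range(n):
                yield (r, j), ("A", c, value)
        else:             # B[r][c] contributes to AB[i][c] for every i
            for i in range(m):
                yield (i, c), ("B", r, value)


def reduce_cell(values):
    """Pair A and B cells on their shared inner index, multiply, and sum."""
    a, b = {}, {}
    for name, k, value in values:
        (a if name == "A" else b)[k] = value
    return sum(a[k] * b[k] for k in a.keys() & b.keys())


def multiply_cell_oriented(cells, m, n):
    groups = defaultdict(list)
    for key, value in map_cells(cells, m, n):
        groups[key].append(value)
    return {key: reduce_cell(values) for key, values in groups.items()}


cells = to_cells("A", [[1, 2], [3, 4]]) + to_cells("B", [[5, 6], [7, 8]])
product_cells = multiply_cell_oriented(cells, m=2, n=2)
```

Cells that are absent from the input (a sparse matrix) simply contribute nothing to the sum.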

29 Break
We'll take a quick break, then come back together for the matrix multiplication exercise

30 Matrix Multiplication Walkthrough
See virtual machine materials

31 Zipfian Distribution
- Similar to Pareto, power law, exponential
- In this case, one thing takes a lot longer than others
- Perfect example: word count

32 Zipfian Distributions are Everywhere
Types of data:
- Geographic data
- Economic data
- Space/telescope observations
Image: Shanghai at Night (Courtesy NASA)

33 Map/Reduce Task Completion Time
Chart: completion time per process for reduce tasks
Because of RMR's design, which utilizes a list of values instead of streaming them, this can actually make it impossible for a task to complete

34 Strategies for Dealing with Zipfian Distributions
- Throw out things that occur frequently: stop words
- Break things down further
  - Sample data
  - Model distribution beforehand
  - For high-frequency keys, partition them out to multiple tasks

35 How to Sample Data
- It depends: what is the distribution of your data? How good does your sample need to be?
- Approaches
  - Read the first few entries of the first file
  - Read the top few entries of each input file
  - If input is splittable then you can read top few entries of each input split (in Java)
  - If you can't do any of these you have to read the whole file, which can be expensive
- Remember: stuff in the head of a Zipfian distribution is REALLY prevalent, so just reading the first few records works 90% of the time
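
The "read the top few entries of each input" approach can be sketched in plain Python as follows. The inputs are modelled as in-memory iterables rather than HDFS files, and the names `sample_head`, `likely_head_keys`, and the `threshold` parameter are assumptions for this illustration.

```python
from collections import Counter
from itertools import islice


def sample_head(inputs, records_per_input=1000):
    """Yield only the first few records of each input."""
    for records in inputs:
        yield from islice(records, records_per_input)


def likely_head_keys(inputs, records_per_input=1000, threshold=0.05):
    """Keys whose share of the sample is at least `threshold` are treated as head keys."""
    counts = Counter(sample_head(inputs, records_per_input))
    total = sum(counts.values())
    return {key for key, count in counts.items() if total and count / total >= threshold}


# Two hypothetical inputs dominated by one very frequent word.
file_1 = ["the"] * 6 + ["cat", "dog", "emu", "fox"]
file_2 = ["the"] * 5 + ["a", "b", "c", "d", "e"]
head = likely_head_keys([file_1, file_2], records_per_input=10, threshold=0.25)
```

The resulting set can then drive a decision such as dropping stop words or splitting those keys across several reduce tasks.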

36 Breaking Up High-Frequency Tasks
- For head tasks, prefix a random number between 0 and the number of reduce slots in your cluster
- Use the number 0 in the composite key for non-head tasks

37 Breaking Up High-Frequency Tasks
Flowchart: original key -> Is the key a high-frequency key? Yes: (random, original key); No: (0, original key) -> First Reduce -> original key -> Second Reduce -> final output
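
A minimal plain-Python sketch of this two-stage reduce, applied to word counting, is shown below. The technique is sometimes called key salting; the function names and the default of 4 reduce slots are assumptions for illustration, and the random prefix here is drawn from 0 up to (but not including) the number of slots.

```python
import random
from collections import defaultdict


def salted_key(key, head_keys, reduce_slots, rng=random):
    """Composite key: a random prefix for head keys, 0 for everything else."""
    salt = rng.randrange(reduce_slots) if key in head_keys else 0
    return (salt, key)


def reduce_by_key(pairs):
    """Sum the values of all pairs that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals


def two_stage_count(words, head_keys, reduce_slots=4, rng=random):
    # First reduce: partial counts per composite (salt, word) key, so a
    # high-frequency word is spread over up to `reduce_slots` reducers.
    partial = reduce_by_key(
        (salted_key(w, head_keys, reduce_slots, rng), 1) for w in words)
    # Second reduce: drop the salt and combine the partial counts.
    return dict(reduce_by_key((word, n) for (_salt, word), n in partial.items()))


words = ["the"] * 100 + ["cat", "dog", "cat"]
totals = two_stage_count(words, head_keys={"the"}, rng=random.Random(0))
```

The second reduce sees at most one partial count per salt value for each head key, so it stays cheap even when the first stage was heavily skewed.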

38 Exercise: Zipfian Distribution
See virtual machine materials

39 Section 3: Topic Modeling
- Overview of topic modeling and LDA
- Sliding window analysis
- Trending Twitter topics exercise

40 Topic Modeling
- Methods for discovering topics in a collection of documents
- Used for machine learning, natural language processing
- Several algorithms for topic modeling available

41 Latent Dirichlet Allocation (LDA)
- Common topic model
- Allows documents to have a mixture of topics
- Each topic represented by a list of terms
- Available in R's topicmodels package

42 Trending Topics and Twitter
- Trending topics are a popular feature of Twitter
- Trending topics can be found using many methods; one simple method is term frequency
- LDA is an advanced method that allows topics to be determined by the content of a set of Tweets
- Computing LDA over millions of Tweets will not be possible on a single home computer

43 Implementing LDA using RHadoop
Solution 1: Parallelize LDA to split computations over multiple machines
- Computes one topic model for the whole data set
- Requires in-depth understanding of LDA algorithm

44 Implementing LDA using RHadoop
Solution 2: Analysis of trending topics over time
- Computing LDA over Tweets according to time, we can identify changes in topics over time
- RHadoop will allow us to run LDA over multiple time slices in parallel
- We can make use of R's topicmodels package to run LDA on our data

45 Sample Tweet
{
  "profile_image_url":" profile_image.jpg",
  "from_user_name":"donald Duck",
  "from_user_id_str":" ",
  "created_at":"fri, 10 Aug :30: ",
  "id_str":" ",
  "from_user":"big_quack",
  "to_user_id":0,
  "text":"what's the Big ",
  "metadata":{"result_type":"recent"},
  "profile_image_url_https":" profile_image.jpg ",
  "id": ,
  "to_user":null,
  "geo":null,
  "from_user_id": ,
  "to_user_name":null,
  "iso_language_code":"en",
  "to_user_id_str":"0",
  "source":" "
}

46 Sample Tweet
{
  "created_at":"fri, 10 Aug :30: ",
  "text":"what's the Big "
}
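
Reducing a full tweet record to the two fields above might be done as in this Python sketch. The function name `project_tweet` is an assumption, and the tweet below is a made-up record (its timestamp and text are hypothetical, not the values elided on the slide).

```python
import json


def project_tweet(raw_json):
    """Keep only the fields the trending-topics analysis needs."""
    tweet = json.loads(raw_json)
    return {"created_at": tweet["created_at"], "text": tweet["text"]}


# Hypothetical tweet; the values are invented for illustration.
raw = json.dumps({
    "from_user": "big_quack",
    "created_at": "Tue, 23 Oct 2012 14:05:00 +0000",
    "text": "an example tweet",
    "geo": None,
})
small = project_tweet(raw)
```

Dropping unused fields early keeps the intermediate key-value pairs small.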

47 Sliding Window Analysis
Tweets can be grouped according to what hour (or day, month, etc.) they were created
- 12:00 to 13:00: Tweet1 (12:23), Tweet2 (12:54)
- 13:00 to 14:00: Tweet3 (13:02), Tweet4 (13:47)
- From 14:00: Tweet5 (14:17)
More overlap in time windows will smooth the change in trending topics over time
- Window boundaries at 11:30, 12:00, 12:30, 13:00, 13:30, 14:00; each of Tweet1 (12:23), Tweet2 (12:54), Tweet3 (13:02), Tweet4 (13:47), Tweet5 (14:17) appears in two overlapping windows

48 Sliding Window Times
Simple, non-overlapping hourly windows can be created by dividing the UTC time in seconds by 3600, rounding down, and using that as the key
Ex: Fri, 10 Aug :30: =
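
The hourly key computation can be sketched in Python as below. It assumes `created_at` is an RFC 2822 style string with an explicit UTC offset; the function name and the timestamps used are made up for illustration and are not the example elided on the slide.

```python
from email.utils import parsedate_to_datetime


def hourly_window_key(created_at):
    """Divide the UTC time in seconds by 3600 and round down."""
    # Assumption: the string carries an explicit offset such as +0000, so the
    # parsed datetime is timezone-aware and timestamp() is unambiguous.
    seconds = int(parsedate_to_datetime(created_at).timestamp())
    return seconds // 3600


early = hourly_window_key("Tue, 23 Oct 2012 14:05:00 +0000")
late = hourly_window_key("Tue, 23 Oct 2012 14:59:59 +0000")
next_hour = hourly_window_key("Tue, 23 Oct 2012 15:00:00 +0000")
```

Tweets created within the same clock hour share a key, so they reach the same reducer; multiplying the key by 3600 recovers the start of the window in seconds.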

49 Using the LDA Libraries
Step 1: Create a document source: VectorSource( )
  Constructs a source for a vector
Step 2: Create a corpus: Corpus( )
  Takes a Source object and constructs a corpus
Step 3: Create a document-term matrix: DocumentTermMatrix( )
  Takes a Corpus and constructs a document-term matrix
Step 4: Create a topic model with n topics: LDA( , n)
  Takes a DocumentTermMatrix and a number of topics and computes an LDA model
Step 5: Retrieve topics and terms using the respective convenience functions: topics( ), terms( )
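
To make the data structure in step 3 concrete, here is a plain-Python analogue of a document-term matrix. It is not the API of the R functions named above, the function name is an assumption, and tokenization is simplified to whitespace splitting; fitting the LDA model itself (step 4) is not shown.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (terms, rows): the sorted vocabulary and one count row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    terms = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in terms])
    return terms, rows


terms, rows = document_term_matrix(["big data big", "data science"])
```

Each row is one document (here, one tweet) and each column one term; a matrix of this shape is what a topic model takes as input.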

50 Exercise: Trending Tweets
See virtual machine materials

51 Recap
RHadoop provides tools for scalable big data analytics that
- Can be quickly prototyped in fewer lines of code
  - Word Count
- Fit our (Big) Data Science Lifecycle
  - Word Count + Zipfian Distribution
- Make use of R's extensive statistical and data modeling libraries
  - Trending Twitter Topics Analysis with LDA

52 Acknowledgments
Special thanks to Revolution Analytics and Antonio Piccolboni for developing and sharing RHadoop and fielding questions!
Thanks to our Booz Allen Helpers: Josh Sullivan, Doug Gartner, Jay Owen, Brian Bende, Drew Farris, Charles Glover, Paul Tamburello, Michael Kim, and Jay Shipper

53 For more information
- Office hours at 2:30 pm on Wednesday
- Come visit our exhibitor booth: Sponsor Pavilion #59
- Visit boozallen.com
- Ed Kohlwey: E-mail: kohlwey_edmund@bah.com
- Stephanie Beben: E-mail: beben_stephanie@bah.com

54 Appendix

55 Installation (The Old Fashioned Way)
    git clone -b rmr RevolutionAnalytics/RHadoop.git RHadoop-src
    echo 'dir.create(.libPaths()[1],recursive=T);install.packages(c("digest","itertools","functional","RJSONIO","Rcpp"),repos="http://cran.us.r-project.org")' | R --no-save --no-restore
    R CMD INSTALL RHadoop-src/quickcheck/ RHadoop-src/rmr/pkg/

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo

Data Structures and Performance for Scientific Computing with Hadoop and Dumbo Data Structures and Performance for Scientific Computing with Hadoop and Dumbo Austin R. Benson Computer Sciences Division, UC-Berkeley ICME, Stanford University May 15, 2012 1 1 Matrix storage 2 Data

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

More information

Data processing goes big

Data processing goes big Test report: Integration Big Data Edition Data processing goes big Dr. Götz Güttich Integration is a powerful set of tools to access, transform, move and synchronize data. With more than 450 connectors,

More information

Mammoth Scale Machine Learning!

Mammoth Scale Machine Learning! Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel Big Data and Analytics: A Conceptual Overview Mike Park Erik Hoel In this technical workshop This presentation is for anyone that uses ArcGIS and is interested in analyzing large amounts of data We will

More information

Click Stream Data Analysis Using Hadoop

Click Stream Data Analysis Using Hadoop Governors State University OPUS Open Portal to University Scholarship Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors State University Sivakrishna

More information

Estimating PageRank Values of Wikipedia Articles using MapReduce

Estimating PageRank Values of Wikipedia Articles using MapReduce Estimating PageRank Values of Wikipedia Articles using MapReduce Due: Sept. 30 Wednesday 5:00PM Submission: via Canvas, individual submission Instructor: Sangmi Pallickara Web page: http://www.cs.colostate.edu/~cs535/assignments.html

More information

How To Use Big Data For Telco (For A Telco)

How To Use Big Data For Telco (For A Telco) ON-LINE VIDEO ANALYTICS EMBRACING BIG DATA David Vanderfeesten, Bell Labs Belgium ANNO 2012 YOUR DATA IS MONEY BIG MONEY! Your click stream, your activity stream, your electricity consumption, your call

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera

Designing Agile Data Pipelines. Ashish Singh Software Engineer, Cloudera Designing Agile Data Pipelines Ashish Singh Software Engineer, Cloudera About Me Software Engineer @ Cloudera Contributed to Kafka, Hive, Parquet and Sentry Used to work in HPC @singhasdev 204 Cloudera,

More information

How To Scale Out Of A Nosql Database

How To Scale Out Of A Nosql Database Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Hadoop Distributed File System (HDFS) Overview

Hadoop Distributed File System (HDFS) Overview 2012 coreservlets.com and Dima May Hadoop Distributed File System (HDFS) Overview Originals of slides and source code for examples: http://www.coreservlets.com/hadoop-tutorial/ Also see the customized

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

More information

Zihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical

More information

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

More information

Hadoop and Map-reduce computing

Hadoop and Map-reduce computing Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.

More information

Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

Massive Cloud Auditing using Data Mining on Hadoop

Massive Cloud Auditing using Data Mining on Hadoop Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

Hadoop Architecture. Part 1

Hadoop Architecture. Part 1 Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

More information

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel Big Data and Analytics: Getting Started with ArcGIS Mike Park Erik Hoel Agenda Overview of big data Distributed computation User experience Data management Big data What is it? Big Data is a loosely defined

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Hadoop WordCount Explained! IT332 Distributed Systems

Hadoop WordCount Explained! IT332 Distributed Systems Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information
