Crunching Big Data with R And Hadoop!
- Myles Quinn
Transcription
1 Crunching Big Data with R and Hadoop! Strata/Hadoop World NYC 2012. 23 October 2012. Flash drives with tutorial materials are near the door; please start downloading the tutorial materials onto your laptop. There is a PDF with instructions on the flash drives.
2 Today's Agenda. Part 1: Basics; Part 2: Scalability; Part 3: Trending Topics. Introduction; What is Hadoop; What is RHadoop; Basic R syntax for the uninitiated; Word count example; Scalability tradeoffs and tabular data; Matrix multiplication example; Break; Dealing with Zipfian distribution of data; Advanced word count example; Intro to LDA; Description of problem; LDA trending topics example; Wrap-up and questions.
3 Section 1: Introduction to RHadoop. Discussion of Hadoop, R, and Map/Reduce; R for Hadoop users; RHadoop's take on Map/Reduce; Word count exercise.
4 What is RHadoop? Integration between the R statistical package and Hadoop's Distributed File System and Map/Reduce computation engine. Moves algorithm execution to data. Provides access to lots of high-quality statistical libraries. Speeds work by processing in parallel.
5 Use RHadoop: for data exploration; to run a bunch of tasks in parallel; to sort data; to sample data; to join data. This work is licensed under a Creative Commons Attribution-NoDerivs 3.0 Unported License.
6 Don't Use RHadoop: to implement your next ultra-advanced, high-performance machine learning algorithm based on Map/Reduce.
7 Don't Be Embarrassed to be Parallel (Or Serial). Overall philosophy: use simple methods first; use complicated methods only if simple ones fail. All examples are reasonably straightforward.
8 The (Big) Data Science Lifecycle. Sample data: because there's not usually a reason to use the whole data set. Model data: modeling will help you understand where interesting things are. Interpret: find trends, iterate. [Diagram: Sample/Summarize, Model, Interpret]
9 Hadoop and its Components. Runs over many machines; highly scalable architecture. MapReduce: distributes processing to data. HDFS: holds your data. [Diagram labels: Hadoop; Map/Reduce; HDFS; Parallel Processing; Moves to Data]
10 What is MapReduce? Popularized in Google's 2004 paper. Makes writing distributed algorithms easier. Removes freedom (synchronization, locking) to achieve safety.
11 MapReduce Introduction. All data is represented as key-value pairs. Two major phases. Map: reads the input data and outputs intermediate key-value pairs. Reduce: values with the same key are sent to the same reducer and (optionally) summarized.
12 Map Phase. Raw data: "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" [Diagram: the raw data is split across three mappers, which emit one pair per word: how 1, much 1, wood 1, would 1, a 1, woodchuck 1, chuck 1, if 1, a 1, ...]
13 Reduce Phase. Raw data: "How much wood would a woodchuck chuck if a woodchuck could chuck wood?" [Diagram: mapper outputs (how 1, much 1, wood 1; would 1, a 1, woodchuck 1; chuck 1, if 1, a 1) are routed by key to three reducers, which emit: a 2, how 1, woodchuck 2; wood 2, would 1; chuck 2, if 1, much ...]
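The two phases on the slides above can be sketched in a few lines. This is a minimal, single-process illustration in plain Python rather than rmr code; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and the `shuffle` step stands in for the grouping that Hadoop performs between map and reduce.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        word = word.strip("?.,!")
        if word:
            yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: summarize all values that share a key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Two input lines play the role of two input splits.
counts = word_count([
    "How much wood would a woodchuck chuck",
    "if a woodchuck could chuck wood?",
])
```

On a real cluster each input split would go to a separate mapper and each key group to a reducer; here everything runs in one process so the data flow is easy to follow.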
14 A Brief Introduction to R. We'll quickly walk through ~/Examples/Solutions/r-intro.R from your virtual machine.
15 MapReduce in RHadoop. Please refer to ~/Examples/Solutions/rmr-intro.r on your virtual machine.
16 Intro to RHadoop and Word Count. A simple word count solution can be found in ~/Examples/Solutions/. A partial word count solution is available in ~/Examples/Fill-In/.
17 Recap. RHadoop advantages over Java MapReduce: many fewer lines of code; make use of existing R functions from within Map/Reduce.
18 Section 2: Breaking Things Up. Computing over tabular data: scalability challenges; Matrix multiplication exercise; MapReduce and Zipfian distribution; Zipfian distribution exercise.
19 Tabular Data. Ubiquitous: financial information; medical information; surveys; market reports; anything you put in Excel. Can get big quickly due to the denormalized nature of tables in analytics.
20 Map/Reduce Techniques for Dealing with Tabular Big Data. The situations where you need to do this are rare but painful when you arrive at them! Why might you do this? Exploratory analysis; the "kitchen sink" method using a covariance matrix.
21 Example: Correlation Matrix for FRED Economic Time Series Data. [Figure: correlation matrix, sorted by magnitude horizontally] Cool things: random collection of economic variables; groupings are shown in the natural ordering (relatedness to other things, un-relatedness to other things); we can quickly identify data that is unrelated to much of anything; we can quickly discover outliers.
22 FRED Data: General Outcomes. Location is closely related to correlation. Earnings and the civilian labor force variables are strongly negatively correlated with residence adjustments. Government employment is correlated with the civilian labor force and with educational and health services employment. Correlation between heating oil in New York Harbor and jet fuel on the Gulf Coast.
23 FRED Data: General Outcomes. The unemployment rate is negatively correlated with hours worked by manufacturing personnel, only in some regions. Per-capita personal income and foreign travel services are strongly negatively correlated.
24 Tradeoffs for Large Matrix Operations Like Correlation Matrix. Tradeoffs must be made for different types of operations. As you break the problem down and increase scalability, you incur more overhead/latency from structure metadata. We'll use matrix multiplication as an illustrative example of trade-offs. Map/reduce actually isn't great at representing matrix multiplication; hence this is primarily illustrative.
25 Representing Tabular Data. Two main ways: row (or column) oriented; cell oriented. Row/column oriented: a column (or row) has to fit in memory; often need to transpose data (a map/reduce job) to do tasks like multiplication. Cell oriented: can be used to accomplish any type of operation without need to change orientation; operations are 100% disk based, therefore highly scalable; tends to be slow for denser matrices because of the excess amounts of data involved.
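To make the two representations concrete, the following sketch turns a small in-memory matrix into row-oriented and cell-oriented key-value records. It is plain Python for illustration only; the record layouts and the helper names `to_row_records` and `to_cell_records` are assumptions, not the format used in the tutorial materials.

```python
def to_row_records(matrix):
    """Row oriented: one record per row, keyed by row index.

    Each value is a whole row, so a row must fit in memory.
    """
    return [(i, list(row)) for i, row in enumerate(matrix)]


def to_cell_records(matrix):
    """Cell oriented: one record per cell, keyed by (row, column).

    Every value is a single number, at the cost of repeating
    the index metadata for every cell.
    """
    return [((i, j), value)
            for i, row in enumerate(matrix)
            for j, value in enumerate(row)]


matrix = [[1, 2],
          [3, 4]]
row_records = to_row_records(matrix)
cell_records = to_cell_records(matrix)
```

The cell-oriented form carries one key per value, which is the "excess amounts of data" the slide mentions for dense matrices.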
26 An Illustrative Example: Matrix Multiplication. [Diagram: matrix A times matrix B equals matrix AB, with dimension labels m, n, and p] $AB_{1,1} = A_{1,1}B_{1,1} + A_{1,2}B_{2,1} + A_{1,3}B_{3,1} + A_{1,4}B_{4,1}$ and $AB_{1,2} = A_{1,1}B_{1,2} + A_{1,2}B_{2,2} + A_{1,3}B_{3,2} + A_{1,4}B_{4,2}$.
27 Matrix Multiplication: First Approach. Map phase: key $AB_{1,1}$; values: the row $A_{1,1}, A_{1,2}, A_{1,3}, A_{1,4}$ and the column $B_{1,1}, B_{2,1}, B_{3,1}, B_{4,1}$. Reduce phase: the row and column received for $AB_{1,1}$ are multiplied element by element and summed ($\Sigma$). [Diagram: matrices A, B, and AB with dimension labels m, n, and p]
28 Matrix Multiplication: Another Approach. Map phase: key $AB_{1,1}$; values are individual cells $A_{1,1}, A_{1,2}, \ldots$ and $B_{1,1}, B_{2,1}, \ldots$ Reduce phase: the products $A_{1,1}B_{1,1}$, $A_{1,2}B_{2,1}$, $A_{1,3}B_{3,1}$, $A_{1,4}B_{4,1}$ are summed ($\Sigma$) to give $AB_{1,1}$. [Diagram: matrices A, B, and AB with dimension labels m, n, and p]
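A cell-oriented multiplication along the lines of the second approach might look like the sketch below. It is a single-process Python illustration, not the exercise solution: the dictionary layout `{(row, col): value}`, the `"A"`/`"B"` tags, and all function names are assumptions. Each input cell is emitted once for every output cell it contributes to, and the reducer pairs values by their shared inner index before summing.

```python
from collections import defaultdict


def map_cells(a_cells, b_cells, m, p):
    """Map phase: key every input cell by each output cell (i, j) it feeds.

    a_cells has m rows; b_cells has p columns (names are illustrative).
    """
    for (i, k), value in a_cells.items():
        for j in range(p):
            yield (i, j), ("A", k, value)
    for (k, j), value in b_cells.items():
        for i in range(m):
            yield (i, j), ("B", k, value)


def reduce_cell(values):
    """Reduce phase: match A and B values on the inner index and sum."""
    a_side, b_side = {}, {}
    for tag, k, value in values:
        (a_side if tag == "A" else b_side)[k] = value
    return sum(a_side[k] * b_side[k] for k in a_side if k in b_side)


def multiply(a_cells, b_cells, m, p):
    """Group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in map_cells(a_cells, b_cells, m, p):
        groups[key].append(value)
    return {key: reduce_cell(values) for key, values in groups.items()}


a = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
b = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
product = multiply(a, b, m=2, p=2)
```

Note how every cell of A is replicated p times and every cell of B m times in the map output; that replication is the overhead the tradeoff slides warn about.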
29 Break. We'll take a quick break, then come back together for the matrix multiplication exercise.
30 Matrix Multiplication Walkthrough. See virtual machine materials.
31 Zipfian Distribution. Similar to Pareto, power law, exponential. In this case, one thing takes a lot longer than others. Perfect example: word count.
32 Zipfian Distributions are Everywhere. Types of data: geographic data; economic data; space/telescope observations. [Image: Shanghai at Night (Courtesy NASA)]
33 Map/Reduce Task Completion Time. [Chart: process completion time for reduce tasks] Because of RMR's design, which hands the reducer a list of values instead of streaming them, a heavily skewed key can actually make it impossible for a task to complete.
34 Strategies for Dealing with Zipfian Distributions. Throw out things that occur frequently (stop words). Break things down further. Sample data; model the distribution beforehand. For high-frequency keys, partition them out to multiple tasks.
35 How to Sample Data. It depends: what is the distribution of your data? How good does your sample need to be? Approaches: read the first few entries of the first file; read the top few entries of each input file; if the input is splittable, you can read the top few entries of each input split (in Java). If you can't do any of these you have to read the whole file, which can be expensive. Remember: stuff in the head of a Zipfian distribution is REALLY prevalent, so just reading the first few records works 90% of the time.
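The "read the top few entries of each input" approach can be sketched as follows. This is an illustrative Python helper, not part of the tutorial materials; `estimate_head_keys` and its parameters are hypothetical names, and each input is assumed to be an iterable of keys such as words.

```python
from collections import Counter
from itertools import islice


def estimate_head_keys(inputs, records_per_input, top_n):
    """Guess the most frequent keys from the first few records of each input."""
    counts = Counter()
    for records in inputs:
        # islice stops early, so the rest of the input is never read.
        counts.update(islice(records, records_per_input))
    return [key for key, _ in counts.most_common(top_n)]


# Two small "input splits"; only the first four records of each are read.
splits = [
    ["the", "cat", "the", "hat", "zebra"],
    ["the", "dog", "a", "the", "quark"],
]
head = estimate_head_keys(splits, records_per_input=4, top_n=1)
```

Because head keys dominate a Zipfian distribution, even a tiny prefix like this tends to surface them; rare keys such as those at the end of each split are simply never seen.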
36 Breaking Up High-Frequency Tasks. For head tasks, prefix the key with a random number between 0 and the number of reduce slots in your cluster. Use the number 0 in the composite key for non-head tasks.
37 Breaking Up High-Frequency Tasks. [Flowchart: original key -> Is the key a high-frequency key? Yes: (random, original key); No: (0, original key) -> First Reduce -> original key -> Second Reduce -> final output]
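The flow in this diagram can be illustrated with a two-stage count. The sketch below is plain Python running in one process, with hypothetical names (`salt_key`, `reduce_by_key`, `salted_count`); it assumes the set of high-frequency keys is already known, for example from sampling, and uses a random prefix in the range 0 to `num_reducers - 1`.

```python
import random
from collections import defaultdict


def salt_key(key, head_keys, num_reducers, rng):
    """Build the composite key: random prefix for head keys, 0 otherwise."""
    if key in head_keys:
        return (rng.randrange(num_reducers), key)
    return (0, key)


def reduce_by_key(pairs):
    """Sum the values that share a key (stands in for one reduce job)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals


def salted_count(words, head_keys, num_reducers, seed=0):
    rng = random.Random(seed)
    salted = ((salt_key(w, head_keys, num_reducers, rng), 1) for w in words)
    # First reduce: partial counts per composite (prefix, key).
    partial = reduce_by_key(salted)
    # Drop the prefix and reduce again to merge the partial counts.
    unsalted = ((key, count) for (_, key), count in partial.items())
    return dict(reduce_by_key(unsalted))


words = ["the"] * 10 + ["cat", "dog", "cat"]
totals = salted_count(words, head_keys={"the"}, num_reducers=4)
```

The first reduce spreads the work for a head key over several reducers; the second reduce only has to merge at most `num_reducers` partial counts per key, so it is cheap.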
38 Exercise: Zipfian Distribution. See virtual machine materials.
39 Section 3: Topic Modeling. Overview of topic modeling and LDA; Sliding window analysis; Trending Twitter topics exercise.
40 Topic Modeling. Methods for discovering topics in a collection of documents. Used for machine learning, natural language processing. Several algorithms for topic modeling available.
41 Latent Dirichlet Allocation (LDA). Common topic model. Allows documents to have a mixture of topics. Each topic represented by a list of terms. Available in R's topicmodels package.
42 Trending Topics and Twitter. Trending topics are a popular feature of Twitter. Trending topics can be found using many methods; one simple method is term frequency. LDA is an advanced method that allows topics to be determined by the content of a set of Tweets. Computing LDA over millions of Tweets will not be possible on a single home computer.
43 Implementing LDA using RHadoop. Solution 1: Parallelize LDA to split computations over multiple machines. Computes one topic model for the whole data set. Requires in-depth understanding of the LDA algorithm.
44 Implementing LDA using RHadoop. Solution 2: Analysis of trending topics over time. By computing LDA over Tweets grouped according to time, we can identify changes in topics over time. RHadoop will allow us to run LDA over multiple time slices in parallel. We can make use of R's topicmodels package to run LDA on our data.
45 Sample Tweet { "profile_image_url":" profile_image.jpg", "from_user_name":"Donald Duck", "from_user_id_str":" ", "created_at":"Fri, 10 Aug :30: ", "id_str":" ", "from_user":"big_quack", "to_user_id":0, "text":"what's the Big ", "metadata":{"result_type":"recent"}, "profile_image_url_https":" profile_image.jpg", "id": , "to_user":null, "geo":null, "from_user_id": , "to_user_name":null, "iso_language_code":"en", "to_user_id_str":"0", "source":" " }
46 Sample Tweet { "created_at":"Fri, 10 Aug :30: ", "text":"what's the Big " }
47 Sliding Window Analysis. Tweets can be grouped according to what hour (or day, month, etc.) they were created. [Diagram: non-overlapping windows starting at 12:00, 13:00, and 14:00 containing Tweet1 (12:23), Tweet2 (12:54), Tweet3 (13:02), Tweet4 (13:47), Tweet5 (14:17)] More overlap in time windows will smooth the change in trending topics over time. [Diagram: overlapping windows starting at 11:30, 12:00, 12:30, 13:00, 13:30, and 14:00; each of the five tweets appears in two windows]
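One way to assign a tweet to overlapping windows is sketched below. It is a plain Python illustration with hypothetical names: times are given in seconds, `width` is the window length, and `step` is the spacing between window starts, so one-hour windows that start every 30 minutes reproduce the second diagram.

```python
def window_starts(timestamp, width=3600, step=1800):
    """Return the start times of every window [start, start + width)
    that contains the timestamp, for windows starting every `step` seconds.
    """
    start = (timestamp // step) * step  # latest window start at or before t
    starts = []
    while start > timestamp - width:
        starts.append(start)
        start -= step
    return sorted(starts)


# A time of 12:23 expressed as seconds since midnight.
tweet_time = 12 * 3600 + 23 * 60
overlapping = window_starts(tweet_time)            # two windows
non_overlapping = window_starts(tweet_time, step=3600)  # one window
```

In a map/reduce job, the mapper would emit the tweet once per returned window start, using the start as the key, so each reducer receives all tweets for one window.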
48 Sliding Window Times. Simple, non-overlapping hourly windows can be created by dividing the UTC time in seconds by 3600, rounding down, and using that as the key. Ex: Fri, 10 Aug :30: =
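The key computation described on this slide amounts to an integer division. The sketch below shows it in Python with a hypothetical helper name, `hourly_window_key`; the example timestamp is made up for illustration and is not the one from the slide.

```python
from datetime import datetime, timezone


def hourly_window_key(created_at):
    """Divide the UTC time in seconds by 3600 and round down."""
    seconds = int(created_at.timestamp())
    return seconds // 3600


# Hypothetical timestamp: two and a half hours after the Unix epoch.
example = datetime(1970, 1, 1, 2, 30, 0, tzinfo=timezone.utc)
key = hourly_window_key(example)
```

All tweets created within the same clock hour share a key, so they arrive at the same reducer.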
49 Using the LDA Libraries. Step 1: Create a document source: VectorSource( ) constructs a source for a vector. Step 2: Create a corpus: Corpus( ) takes a Source object and constructs a corpus. Step 3: Create a document-term matrix: DocumentTermMatrix( ) takes a Corpus and constructs a document-term matrix. Step 4: Create a topic model with n topics: LDA( , n) takes a DocumentTermMatrix and a number of topics and computes an LDA model. Step 5: Retrieve topics and terms using the respective convenience functions topics( ) and terms( ).
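For readers unfamiliar with the structure built in Step 3, the sketch below constructs a small document-term matrix by hand. It is plain Python for illustration only and does not use or mirror the R functions named above; `document_term_matrix` is a hypothetical helper, and tokenizing on whitespace is a simplifying assumption.

```python
def document_term_matrix(documents):
    """Count how often each vocabulary term occurs in each document.

    Returns the sorted vocabulary and a matrix with one row per document
    and one column per term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    index = {term: col for col, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in tokenized]
    for row, tokens in enumerate(tokenized):
        for term in tokens:
            matrix[row][index[term]] += 1
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(["big data big ideas", "data science"])
```

A matrix of this shape, with documents as rows and term counts as columns, is the kind of input a topic model such as LDA works from.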
50 Exercise: Trending Tweets. See virtual machine materials.
51 Recap. RHadoop provides tools for scalable big data analytics that: can be quickly prototyped in fewer lines of code (Word Count); fit our (Big) Data Science Lifecycle (Word Count + Zipfian Distribution); make use of R's extensive statistical and data modeling libraries (Trending Twitter Topics Analysis with LDA).
52 Acknowledgments. Special thanks to Revolution Analytics and Antonio Piccolboni for developing and sharing RHadoop and fielding questions! Thanks to our Booz Allen helpers: Josh Sullivan, Doug Gartner, Jay Owen, Brian Bende, Drew Farris, Charles Glover, Paul Tamburello, Michael Kim, and Jay Shipper.
53 For more information. Office hours at 2:30 pm on Wednesday. Come visit our exhibitor booth: Sponsor Pavilion #59. Visit boozallen.com. Ed Kohlwey, e-mail: kohlwey_edmund@bah.com. Stephanie Beben, e-mail: beben_stephanie@bah.com.
54 Appendix
55 Installation (The Old Fashioned Way)
git clone -b rmr RevolutionAnalytics/RHadoop.git RHadoop-src
echo 'dir.create(.libPaths()[1],recursive=T);install.packages(c("digest","itertools","functional","RJSONIO","Rcpp"),repos="http://cran.us.r-project.org")' | R --no-save --no-resume
R CMD INSTALL RHadoop-src/quickcheck/ RHadoop-src/rmr/pkg/
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
More informationA Tutorial Introduc/on to Big Data. Hands On Data Analy/cs over EMR. Robert Grossman University of Chicago Open Data Group
A Tutorial Introduc/on to Big Data Hands On Data Analy/cs over EMR Robert Grossman University of Chicago Open Data Group Collin BenneE Open Data Group November 12, 2012 1 Amazon AWS Elas/c MapReduce allows
More informationFour Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014
Four Orders of Magnitude: Running Large Scale Accumulo Clusters Aaron Cordova Accumulo Summit, June 2014 Scale, Security, Schema Scale to scale 1 - (vt) to change the size of something let s scale the
More informationIntroduction To Hive
Introduction To Hive How to use Hive in Amazon EC2 CS 341: Project in Mining Massive Data Sets Hyung Jin(Evion) Kim Stanford University References: Cloudera Tutorials, CS345a session slides, Hadoop - The
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationHadoop Project for IDEAL in CS5604
Hadoop Project for IDEAL in CS5604 by Jose Cadena Mengsu Chen Chengyuan Wen {jcadena,mschen,wechyu88@vt.edu Completed as part of the course CS5604: Information storage and retrieval offered by Dr. Edward
More informationRHadoop and MapR. Accessing Enterprise- Grade Hadoop from R. Version 2.0 (14.March.2014)
RHadoop and MapR Accessing Enterprise- Grade Hadoop from R Version 2.0 (14.March.2014) Table of Contents Introduction... 3 Environment... 3 R... 3 Special Installation Notes... 4 Install R... 5 Install
More informationCS 378 Big Data Programming. Lecture 5 Summariza9on Pa:erns
CS 378 Big Data Programming Lecture 5 Summariza9on Pa:erns Review Assignment 2 Ques9ons? If you d like to use guava (Google collec9ons classes) pom.xml available for assignment 2 Includes dependency for
More informationPLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.
PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. VLDB 2009 CS 422 Decision Trees: Main Components Find Best Split Choose split
More informationReference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
More informationCloud Computing at Google. Architecture
Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale
More informationFP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data
FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:
More informationIntroduction to DISC and Hadoop
Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and
More informationBigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
More informationMap- reduce, Hadoop and The communica3on bo5leneck. Yoav Freund UCSD / Computer Science and Engineering
Map- reduce, Hadoop and The communica3on bo5leneck Yoav Freund UCSD / Computer Science and Engineering Plan of the talk Why is Hadoop so popular? HDFS Map Reduce Word Count example using Hadoop streaming
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationIntroduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationUniversity of Maryland. Tuesday, February 2, 2010
Data-Intensive Information Processing Applications Session #2 Hadoop: Nuts and Bolts Jimmy Lin University of Maryland Tuesday, February 2, 2010 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationCSCI6900 Assignment 2: Naïve Bayes on Hadoop
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF GEORGIA CSCI6900 Assignment 2: Naïve Bayes on Hadoop DUE: Friday, September 18 by 11:59:59pm Out September 4, 2015 1 IMPORTANT NOTES You are expected to use
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationWhat are Hadoop and MapReduce and how did we get here?
What are Hadoop and MapReduce and how did we get here? Term Big Data coined in 2005 by Roger Magoulas of O Reilly Media But as the idea of big data sets evolved on the Web, organizations began to wonder
More informationLecture 10: HBase! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl
Big Data Processing, 2014/15 Lecture 10: HBase!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind the
More informationTackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationWelcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components
Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop
More informationBig Data Storage Options for Hadoop Sam Fineberg, HP Storage
Sam Fineberg, HP Storage SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations
More information