Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
|
|
|
- Adela Cannon
- 10 years ago
- Views:
Transcription
1 Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1
2 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce (e.g. Hadoop, Spark) Speed Layer Stream Processing (e.g. Storm, Spark Streaming) 2
3 Lambda Architecture (I) The Lambda Architecture refers to an approach for performing big data processing developed by Nathan Marz from his experiences at BackType and Twitter Everyone has their own definition of big gigabytes and terabytes for small research teams and companies terabytes and petabytes for medium and large organizations Google and Microsoft have petabytes of map data exabytes for truly data-intensive organizations Facebook reports having to store 500 TB of new information PER DAY and zettabytes and yottabytes may be in our future some day! 3
4 Lambda Architecture (II) Everything changes when working at scale many of your assumptions fall over many of the techniques that you are comfortable suddenly reveal that they are not scalable I enjoy giving students their first 60 GB file to work on In my research on Project EPIC, we have data sets that are hundreds of gigabytes in size you can t bring the entire set into memory (at least not on a single machine) you can t easily store our data sets accumulated over the past five years on a single machine 4
5 Lambda Architecture (III) In a typical data-intensive system, the goal is to while true collect and/or generate raw data store that data in an effective and efficient way process it to answer questions of some form use the answers to determine what new questions you have use those questions to determine what new data you need In our coverage of the Lambda Architecture, we will be looking primarily at technologies that aid the process stage above Additional details can be found in Marz s book on Big Data 5
6 Lambda Architecture (IV) In processing data to answer questions, the lambda architecture advocates a two-prong approach For large data sets, where the processing can take a significant amount of time (hours) => the batch layer using techniques like MapReduce For more recent generated data, process it as it arrives => the speed layer using techniques like stream processing In both cases, these technologies will make use of clusters of machines working together to perform the analysis in this way data is distributed/replicated across machines AND computation is distributed across machines and occurs in parallel 6
7 Lambda Architecture (V) Job Job Job Job Job Job Raw Data Batch Layer Answers and Information Speed Layer 7
8 Lambda Architecture (VI) For the batch layer, we will make use of techniques that can process large sets of data using batch jobs MapReduce, as implemented by Hadoop, has been the 900lb gorilla in this space for a long time it is now being challenged by other implementations (such as Spark) For the speed layer, we will make use of techniques that can process streaming data quickly The exemplar in this space is Storm, a streaming technology developed at Twitter Other techniques useful in this space include message queueing systems like RabbitMQ and ActiveMQ 8
9 Map Reduce The functions map, reduce, and filter have cropped up a lot this semester They represent a fundamental approach to processing data that (via Hadoop) has been shown to apply to extremely large data sets map: given an array and a function, create a new array that for each element contains the results of the function applied to the corresponding element of the original array filter: given an array and a boolean function, create a new array that consists of all elements of the original for which the function returns true reduce: given an array, an initial value, and a binary function return a single value that is the accumulation of the initial value and the elements of the original array as computed by the function 9
10 Examples: map, filter, and reduce In the subsequent examples, I present examples of map, filter, and reduce that apply to arrays of integers With map, the supplied function has signature: Int -> Int With filter, the supplied function has signature: Int -> Bool With reduce, the supplied function has signature: (Int, Int) -> Int But the technique is generic with respect to type For an array of elements of type T the supplied function for map has signature: T -> U the supplied function for filter has signature: T -> Bool the supplied function for reduce has signature: (R, T) -> R 10
11 Example: map (I) function map(values, f) { var results = []; for (var i = 0; i < values.length; i++) { results.push(f(values[i])); } return results; } Here s a Javascript implementation of map Create a new array Apply the function f to each element of the input array (values) and append (push in Javascript) the result to the output array 11
12 Example: map (II) function triple(x) { return x * 3; } var data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] console.log(map(data, triple)) Here s an example of using our Javascript version of map Define a function that takes an integer and produces an integer Create an array of integers Call map passing the input array and function; print the result 12
13 Example: filter (I) function filter(values, f) { var results = []; for (var i = 0; i < values.length; i++) { if (f(values[i])) { results.push(values[i]); } } return results; } Here s a Javascript implementation of filter Create a new array Apply the function f to each element of the input array (values) If true, append the element to the output array 13
14 Example: filter (II) function iseven(x) { return (x % 2) == 0; } var data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] console.log(filter(data, iseven)) Here s an example of using our Javascript version of filter Define a function that takes an integer and produces a boolean Create an array of integers Call filter passing the input array and function; print the result 14
15 Example: reduce (I) function reduce(values, initial, f) { var result = initial; for (var i = 0; i < values.length; i++) { result = f(result, values[i]); } return result; } Here s a Javascript implementation of reduce Set result equal to the initial value Loop through the input array Update result to be the output of the function applied to result and a[i] Return the final result 15
16 Example: reduce (II) function sum(total, value) { return total + value; } var data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] console.log(reduce(data, 0, sum)) You can even combine them! console.log(reduce(filter(map(data, triple), iseven), 0, sum)) 16
17 Reduce is very powerful In some languages, you can implement map and filter using reduce func rmap<t, U>(xs: [T], f: T -> U) -> [U] { return reduce(xs, []) { result, x in result + [f(x)] } } func rfilter<t>(xs: [T], check: T -> Bool) -> [T] { return reduce(xs, []) { result, x in return check(x)? result + [x] : result } } This is an example written in Swift, a new language created by Apple 17
18 MapReduce and Hadoop (I) Hadoop applies these functions to data at large scale You can have thousands of machines in your cluster You can have petabytes of data The Hadoop file system will replicate your data across the nodes The Hadoop run-time will take your set of map and reduce jobs and distribute across the cluster It will manage the final reduce process and ensure that all results end up in the output file that you specify It does all of this and will complete jobs even if worker nodes go down taking out partially completed tasks 18
19 MapReduce and Hadoop (II) Since I showed that map and filter are special cases of reduce, the overarching framework could have been called ReduceReduce but that doesn t sound as good But that s just to say that you should be confident that the MapReduce framework provides you with a sufficient range of functionality to perform a wide range of analysis and data manipulation tasks 19
20 MapReduce and Hadoop (III) Hadoop s basic mode of operation is transformation A function (either map or reduce) will receive a document of key-value pairs The result is a set of new documents with key-value pairs In all cases, you are transforming from one set of documents to another Or more generically, one type to another type Depending on your algorithm, one input document can produce multiple output documents 20
21 MapReduce and Hadoop (IV) The difference between map and reduce in Hadoop map functions will receive a single document to transform As I indicated, it will then produce zero or more output documents each with a particular key; the key does not have to be unique reduce functions will receive a key and a set of documents that all had that key It will then typically produce a single document with that key in which all of the documents have had their values combined in some way So, if a map phase generates millions of documents across 26 different keys then its reduce phase may end producing 26 final documents 21
22 Hadoop Conceptual Structure Day 1: MapReduce 227 Image from our concurrency text book input Mapper output Reducer Mapper Reducer Mapper Reducer Mapper Hadoop guarantees that all mapper outputs associated with the same key will go to the same reducer Figure 19 Hadoop high-level data flow Shuffle Phase size would be 64 MB) and sends each split to a single mapper. The mapper outputs a number of key/value pairs, which Hadoop then sends to the reducers. This is called the shuffle phase, while it routes documents to reducers. The key/value pairs from a single mapper are sent to multiple reducers. Which 22
23 Powerful but at a cost MapReduce and Hadoop are powerful but they come at a cost latency These jobs are NOT quick It takes a lot of time to spin up a Hadoop job If your analysis requires multiple MapReduce jobs, you have to take the hit of that latency across all such jobs Spark tries to fix exactly this issue with its notion of an RDD But if you have to apply a single algorithm across terabytes or petabytes of data, it will be hard to beat Hadoop once the price of that overhead is paid 23
24 Word Count in Hadoop (I) The book presents an example of using Hadoop to count the words in a document; (as I mentioned earlier in the semester, this is the hello world example of big data) 24
25 Word Count in Hadoop (II) The high-level design is this: Input document is lines of text Hadoop will split the document into lines and send each line to a mapper The mapper receives an individual line and splits it into words For each word, it creates an output document: (word, 1) e.g. ( you, 1), ( shall, 1), ( not, 1), ( pass, 1) These output documents get shuffled and are passed to the reducer like this: (word, [Int]); e.g. ( shazam!, [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) The reducer takes this as input and produces (word, <sum>) e.g. ( shazam!, 10) 25
26 Word Count in Hadoop (III) There may be multiple phases of reduce depending on how many nodes are in your cluster Depending on the implementation of MapReduce OR Each node may do their reduce phase first and then the implementation would need to combine (i.e. reduce) the output documents on each node with each other to produce the final output During the shuffle phase, we make sure that all documents with the same key go to the same reducer, even if that means sending the output document of the map phase on one machine as input to the reduce phase on a second machine 26
27 Figure 20 Counting words with Hadoop The Mapper Word Count in Hadoop (IV) Our mapper, Map, extends Hadoop s Mapper class, which takes four type parameters the input key type, the input value type, the output key type, and the output value type: The code for our map phase looks like this Line LambdaArchitecture/WordCount/src/main/java/com/paulbutcher/WordCount.java public static class Map extends Mapper<Object, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); } public void map(object key, Text value, Context context) throws IOException, InterruptedException { } String line = value.tostring(); Iterable<String> words = new Words(line); for (String word: words) context.write(new Text(word), one); exclusively for Ken Anderson re 27
28 key/value pair for each of them, where the key is the word and the value the constant integer 1 (line 10). The Reducer Word Count in Hadoop (V) Our reducer, Reduce, extends Hadoop s Reducer class. Like Mapper, this also takes type parameters indicating the input and output key and value types (in our case, Text for both key types and IntWritable for both value types): The code for our reduce phase looks like this LambdaArchitecture/WordCount/src/main/java/com/paulbutcher/WordCount.java public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val: values) sum += val.get(); context.write(key, new IntWritable(sum)); } } The reduce() method will be called once for each key, with values containing a collection of all the values associated with that key. Our mapper simply sums the values and generates a single key/value pair associating the word with its total occurrences. Now that we ve got both our mapper and our reducer, our final task is to create a driver, which tells Hadoop how to run them. 28
29 Now that we ve got both our mapper and our reducer, our final task is t create a driver, which tells Hadoop how to run them. Word The Count Driver in Hadoop (VI) Our driver is a Hadoop Tool, which implements a run() method: The main program looks like this Line 1 - LambdaArchitecture/WordCount/src/main/java/com/paulbutcher/WordCount.java public class WordCount extends Configured implements Tool { - public int run(string[] args) throws Exception { - Configuration conf = getconf(); 5 Job job = Job.getInstance(conf, "wordcount"); - job.setjarbyclass(wordcount.class); - Chapter 8. The Lambda Architecture job.setmapperclass(map.class); job.setreducerclass(reduce.class); - job.setoutputkeyclass(text.class); 10 job.setoutputvalueclass(intwritable.class); - FileInputFormat.addInputPath(job, new Path(args[0])); - exclusively for Ken Anderson FileOutputFormat.setOutputPath(job, new Path(args[1])); - boolean success = job.waitforcompletion(true); - return success? 0 : 1; 15 } } public static void main(string[] args) throws Exception { int res = ToolRunner.run(new Configuration(), new WordCount(), args); System.exit(res); } 29
30 MapReduce is a Generic Concept Many different technologies can provide implementations of map and reduce We ve already seen this with respect to functional programming languages But, MapReduce functionality (as in the Hadoop context) is starting to appear in many different tools similar to the way that support for manipulating and searching text via regular expressions appears in many different editors In particular, CouchDB and MongoDB provide MapReduce functionality I used MongoDB s MapReduce functionality to find the unique users of a Twitter data set the result of the calculation is that each output document represents a unique user and contains useful information about that user 30
31 Twitter Example (I) In the unique users example, the high level design is the following The input data set is a set of tweets stored in MongoDB Each tweet is stored as a JSON document of attribute-value pairs The mapper produces an output document with the following fields names => an array of screen names for this user tweets => an array of tweet ids for this user name_count => number of names tweet_count => number of tweets That document has as its key, the unique user id for that user 31
32 Twitter Example (II) The reducer takes a set of mapper-produced documents that all have the same key (i.e. user id) and Combines all screen names and tweets and updates the counts as appropriate DEMO 32
33 Summary Introduced the first part of the Lambda Architecture Covered MapReduce And showed examples in Hadoop and MongoDB 33
34 Coming Up Next Lecture 30: Lambda Architecture, Part Two 34
Word Count Code using MR2 Classes and API
EDUREKA Word Count Code using MR2 Classes and API A Guide to Understand the Execution of Word Count edureka! A guide to understand the execution and flow of word count WRITE YOU FIRST MRV2 PROGRAM AND
Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart
Hadoop/MapReduce Object-oriented framework presentation CSCI 5448 Casey McTaggart What is Apache Hadoop? Large scale, open source software framework Yahoo! has been the largest contributor to date Dedicated
BIG DATA APPLICATIONS
BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP BIG DATA APPLICATIONS Big data has become one of the most important aspects in scientific computing and business analytics
Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
Hadoop Configuration and First Examples
Hadoop Configuration and First Examples Big Data 2015 Hadoop Configuration In the bash_profile export all needed environment variables Hadoop Configuration Allow remote login Hadoop Configuration Download
Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab
IBM CDL Lab Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网 Information Management 2012 IBM Corporation Agenda Hadoop 技 术 Hadoop 概 述 Hadoop 1.x Hadoop 2.x Hadoop 生 态
Getting to know Apache Hadoop
Getting to know Apache Hadoop Oana Denisa Balalau Télécom ParisTech October 13, 2015 1 / 32 Table of Contents 1 Apache Hadoop 2 The Hadoop Distributed File System(HDFS) 3 Application management in the
Istanbul Şehir University Big Data Camp 14. Hadoop Map Reduce. Aslan Bakirov Kevser Nur Çoğalmış
Istanbul Şehir University Big Data Camp 14 Hadoop Map Reduce Aslan Bakirov Kevser Nur Çoğalmış Agenda Map Reduce Concepts System Overview Hadoop MR Hadoop MR Internal Job Execution Workflow Map Side Details
Hadoop WordCount Explained! IT332 Distributed Systems
Hadoop WordCount Explained! IT332 Distributed Systems Typical problem solved by MapReduce Read a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize,
The Hadoop Eco System Shanghai Data Science Meetup
The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related
CS54100: Database Systems
CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics
HPCHadoop: MapReduce on Cray X-series
HPCHadoop: MapReduce on Cray X-series Scott Michael Research Analytics Indiana University Cray User Group Meeting May 7, 2014 1 Outline Motivation & Design of HPCHadoop HPCHadoop demo Benchmarking Methodology
Introduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology [email protected] Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
Xiaoming Gao Hui Li Thilina Gunarathne
Xiaoming Gao Hui Li Thilina Gunarathne Outline HBase and Bigtable Storage HBase Use Cases HBase vs RDBMS Hands-on: Load CSV file to Hbase table with MapReduce Motivation Lots of Semi structured data Horizontal
Tutorial- Counting Words in File(s) using MapReduce
Tutorial- Counting Words in File(s) using MapReduce 1 Overview This document serves as a tutorial to setup and run a simple application in Hadoop MapReduce framework. A job in Hadoop MapReduce usually
Mrs: MapReduce for Scientific Computing in Python
Mrs: for Scientific Computing in Python Andrew McNabb, Jeff Lund, and Kevin Seppi Brigham Young University November 16, 2012 Large scale problems require parallel processing Communication in parallel processing
Hadoop Lab Notes. Nicola Tonellotto November 15, 2010
Hadoop Lab Notes Nicola Tonellotto November 15, 2010 2 Contents 1 Hadoop Setup 4 1.1 Prerequisites........................................... 4 1.2 Installation............................................
How to program a MapReduce cluster
How to program a MapReduce cluster TF-IDF step by step Ján Súkeník [email protected] [email protected] TF-IDF potrebujeme pre každý dokument počet slov frekvenciu každého slova pre každé
Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology
Working With Hadoop Now that we covered the basics of MapReduce, let s look at some Hadoop specifics. Mostly based on Tom White s book Hadoop: The Definitive Guide, 3 rd edition Note: We will use the new
Architectures for massive data management
Architectures for massive data management Apache Kafka, Samza, Storm Albert Bifet [email protected] October 20, 2015 Stream Engine Motivation Digital Universe EMC Digital Universe with
BIG DATA, MAPREDUCE & HADOOP
BIG, MAPREDUCE & HADOOP LARGE SCALE DISTRIBUTED SYSTEMS By Jean-Pierre Lozi A tutorial for the LSDS class LARGE SCALE DISTRIBUTED SYSTEMS BIG, MAPREDUCE & HADOOP 1 OBJECTIVES OF THIS LAB SESSION The LSDS
Hadoop Framework. technology basics for data scientists. Spring - 2014. Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN
Hadoop Framework technology basics for data scientists Spring - 2014 Jordi Torres, UPC - BSC www.jorditorres.eu @JordiTorresBCN Warning! Slides are only for presenta8on guide We will discuss+debate addi8onal
Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, 2009. Seth Ladd http://sethladd.com
Hadoop and Eclipse Eclipse Hawaii User s Group May 26th, 2009 Seth Ladd http://sethladd.com Goal YOU can use the same technologies as The Big Boys Google Yahoo (2000 nodes) Last.FM AOL Facebook (2.5 petabytes
Data Science in the Wild
Data Science in the Wild Lecture 3 Some slides are taken from J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 Data Science and Big Data Big Data: the data cannot
Elastic Map Reduce. Shadi Khalifa Database Systems Laboratory (DSL) [email protected]
Elastic Map Reduce Shadi Khalifa Database Systems Laboratory (DSL) [email protected] The Amazon Web Services Universe Cross Service Features Management Interface Platform Services Infrastructure Services
MR-(Mapreduce Programming Language)
MR-(Mapreduce Programming Language) Siyang Dai Zhi Zhang Shuai Yuan Zeyang Yu Jinxiong Tan sd2694 zz2219 sy2420 zy2156 jt2649 Objective of MR MapReduce is a software framework introduced by Google, aiming
Hadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
Enterprise Data Storage and Analysis on Tim Barr
Enterprise Data Storage and Analysis on Tim Barr January 15, 2015 Agenda Challenges in Big Data Analytics Why many Hadoop deployments under deliver What is Apache Spark Spark Core, SQL, Streaming, MLlib,
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13
Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex
Extreme Computing. Hadoop MapReduce in more detail. www.inf.ed.ac.uk
Extreme Computing Hadoop MapReduce in more detail How will I actually learn Hadoop? This class session Hadoop: The Definitive Guide RTFM There is a lot of material out there There is also a lot of useless
Big Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
Hadoop Basics with InfoSphere BigInsights
An IBM Proof of Technology Hadoop Basics with InfoSphere BigInsights Unit 2: Using MapReduce An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted Rights
Internals of Hadoop Application Framework and Distributed File System
International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop
Introduc)on to Map- Reduce. Vincent Leroy
Introduc)on to Map- Reduce Vincent Leroy Sources Apache Hadoop Yahoo! Developer Network Hortonworks Cloudera Prac)cal Problem Solving with Hadoop and Pig Slides will be available at hgp://lig- membres.imag.fr/leroyv/
Chapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
Big Data 2012 Hadoop Tutorial
Big Data 2012 Hadoop Tutorial Oct 19th, 2012 Martin Kaufmann Systems Group, ETH Zürich 1 Contact Exercise Session Friday 14.15 to 15.00 CHN D 46 Your Assistant Martin Kaufmann Office: CAB E 77.2 E-Mail:
map/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
Connecting Hadoop with Oracle Database
Connecting Hadoop with Oracle Database Sharon Stephen Senior Curriculum Developer Server Technologies Curriculum The following is intended to outline our general product direction.
Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems
Processing of massive data: MapReduce 2. Hadoop 1 MapReduce Implementations Google were the first that applied MapReduce for big data analysis Their idea was introduced in their seminal paper MapReduce:
Zebra and MapReduce. Table of contents. 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples...
Table of contents 1 Overview...2 2 Hadoop MapReduce APIs...2 3 Zebra MapReduce APIs...2 4 Zebra MapReduce Examples... 2 1. Overview MapReduce allows you to take full advantage of Zebra's capabilities.
Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan [email protected]
Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan [email protected] Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected]
Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team [email protected] Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb
16.1 MAPREDUCE. For personal use only, not for distribution. 333
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
Big Data for the JVM developer. Costin Leau, Elasticsearch @costinl
Big Data for the JVM developer Costin Leau, Elasticsearch @costinl Agenda Data Trends Data Pipelines JVM and Big Data Tool Eco-system Data Landscape Data Trends http://www.emc.com/leadership/programs/digital-universe.htm
Big Data Analytics* Outline. Issues. Big Data
Outline Big Data Analytics* Big Data Data Analytics: Challenges and Issues Misconceptions Big Data Infrastructure Scalable Distributed Computing: Hadoop Programming in Hadoop: MapReduce Paradigm Example
Lecture 10 - Functional programming: Hadoop and MapReduce
Lecture 10 - Functional programming: Hadoop and MapReduce Sohan Dharmaraja Sohan Dharmaraja Lecture 10 - Functional programming: Hadoop and MapReduce 1 / 41 For today Big Data and Text analytics Functional
Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: [email protected] Website: www.qburst.com
Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
An Implementation of Sawzall on Hadoop
1 An Implementation of Sawzall on Hadoop Hidemoto Nakada, Tatsuhiko Inoue and Tomohiro Kudoh, 1-1-1 National Institute of Advanced Industrial Science and Technology, Umezono, Tsukuba, Ibaraki 35-8568,
Word count example Abdalrahman Alsaedi
Word count example Abdalrahman Alsaedi To run word count in AWS you have two different ways; either use the already exist WordCount program, or to write your own file. First: Using AWS word count program
Introduction to Big Data Science. Wuhui Chen
Introduction to Big Data Science Wuhui Chen What is Big data? Volume Variety Velocity Outline What are people doing with Big data? Classic examples Two basic technologies for Big data management: Data
# Not a part of 1Z0-061 or 1Z0-144 Certification test, but very important technology in BIG DATA Analysis
Section 9 : Case Study # Objectives of this Session The Motivation For Hadoop What problems exist with traditional large-scale computing systems What requirements an alternative approach should have How
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
CSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University [email protected] 14.9-2015 1/36 Google MapReduce A scalable batch processing
HDInsight Essentials. Rajesh Nadipalli. Chapter No. 1 "Hadoop and HDInsight in a Heartbeat"
HDInsight Essentials Rajesh Nadipalli Chapter No. 1 "Hadoop and HDInsight in a Heartbeat" In this package, you will find: A Biography of the author of the book A preview chapter from the book, Chapter
Hadoop Integration Guide
HP Vertica Analytic Database Software Version: 7.0.x Document Release Date: 2/20/2015 Legal Notices Warranty The only warranties for HP products and services are set forth in the express warranty statements
Cloud Computing Era. Trend Micro
Cloud Computing Era Trend Micro Three Major Trends to Chang the World Cloud Computing Big Data Mobile 什 麼 是 雲 端 運 算? 美 國 國 家 標 準 技 術 研 究 所 (NIST) 的 定 義 : Essential Characteristics Service Models Deployment
Distributed Systems + Middleware Hadoop
Distributed Systems + Middleware Hadoop Alessandro Sivieri Dipartimento di Elettronica, Informazione e Bioingegneria Politecnico, Italy [email protected] http://corsi.dei.polimi.it/distsys Contents
Hadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
MAPREDUCE Programming Model
CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce
Open source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: [email protected] Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
LANGUAGES FOR HADOOP: PIG & HIVE
Friday, September 27, 13 1 LANGUAGES FOR HADOOP: PIG & HIVE Michail Michailidis & Patrick Maiden Friday, September 27, 13 2 Motivation Native MapReduce Gives fine-grained control over how program interacts
CS 378 Big Data Programming. Lecture 2 Map- Reduce
CS 378 Big Data Programming Lecture 2 Map- Reduce MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is processed But viewed in small increments
AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
Introduction to Hadoop
Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics
Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59 Introduction Contents 1 Introduction
The Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 [email protected] ABSTRACT Many cluster owners and operators have
Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
CS 378 Big Data Programming
CS 378 Big Data Programming Lecture 2 Map- Reduce CS 378 - Fall 2015 Big Data Programming 1 MapReduce Large data sets are not new What characterizes a problem suitable for MR? Most or all of the data is
and HDFS for Big Data Applications Serge Blazhievsky Nice Systems
Introduction PRESENTATION to Hadoop, TITLE GOES MapReduce HERE and HDFS for Big Data Applications Serge Blazhievsky Nice Systems SNIA Legal Notice The material contained in this tutorial is copyrighted
Big Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
The Internet of Things and Big Data: Intro
The Internet of Things and Big Data: Intro John Berns, Solutions Architect, APAC - MapR Technologies April 22 nd, 2014 1 What This Is; What This Is Not It s not specific to IoT It s not about any specific
Developing MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
The Cloud Computing Era and Ecosystem. Phoenix Liau, Technical Manager
The Cloud Computing Era and Ecosystem Phoenix Liau, Technical Manager Three Major Trends to Chang the World Cloud Computing Big Data Mobile Mobility and Personal Cloud My World! My Way! What is Personal
Step 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:
Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step
Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview
Programming Hadoop 5-day, instructor-led BD-106 MapReduce Overview The Client Server Processing Pattern Distributed Computing Challenges MapReduce Defined Google's MapReduce The Map Phase of MapReduce
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA
Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data
Introduction to Cloud Computing
Introduction to Cloud Computing MapReduce and Hadoop 15 319, spring 2010 17 th Lecture, Mar 16 th Majd F. Sakr Lecture Goals Transition to MapReduce from Functional Programming Understand the origins of
BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic
BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop
USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2
USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email [email protected] if you have questions or need more clarifications. Nilay
Big Data With Hadoop
With Saurabh Singh [email protected] The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
Copy the.jar file into the plugins/ subfolder of your Eclipse installation. (e.g., C:\Program Files\Eclipse\plugins)
Beijing Codelab 1 Introduction to the Hadoop Environment Spinnaker Labs, Inc. Contains materials Copyright 2007 University of Washington, licensed under the Creative Commons Attribution 3.0 License --
Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro
Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro CELEBRATING 10 YEARS OF JAVA.NET Apache Hadoop.NET-based MapReducers Creating.NET-based Mappers and Reducers for Hadoop with JNBridgePro
Distributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
