Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics


1 Convex Optimization for Big Data: Lecture 2: Frameworks for Big Data Analytics Sabeur Aridhi Aalto University, Finland Sabeur Aridhi Frameworks for Big Data Analytics 1 / 59

2 Introduction Contents 1 Introduction Big Data Parallel/Distributed computing Distributed storage Grid Computing Cloud Computing 2 MapReduce Programming model Framework Algorithms 3 Other frameworks SPARK STORM 4 Graph Processing Motivations Pregel model Sample Algorithms: SSSP Available implementations Blogel 5 Q and A Sabeur Aridhi Frameworks for Big Data Analytics 2 / 59

3 Introduction Big Data Growth of data size Data sets We need to analyze bigger and bigger datasets: Web datasets: web topology (Google: est. 46 billion pages) Network datasets: routing efficiency analysis, TCP/IP logs Biological datasets: genome sequencing data, protein patterns Big Data Three main characteristics: Volume: huge size of datasets (hence distributed storage). Variety: several types of data, such as text, image and video. Velocity: data is generated at high speed (e.g., CERN experiments: one petabyte/sec). Sabeur Aridhi Frameworks for Big Data Analytics 3 / 59

4 Introduction Big Data Key sources of Big Data Internet of Things (IoT) Networking of hardware equipment (e.g., household appliances, sensors, mobile phones) A significant source of Big Data Sabeur Aridhi Frameworks for Big Data Analytics 4 / 59

5 Introduction Big Data Key sources of Big Data Smart cities Use of information technology to enhance quality, performance and interactivity of urban services in a city. Data is collected from sensors installed on: utility poles water lines buses trains traffic lights,... Sabeur Aridhi Frameworks for Big Data Analytics 5 / 59

6 Introduction Big Data Key sources of Big Data - Smart cities Sabeur Aridhi Frameworks for Big Data Analytics 6 / 59

8 Introduction Big Data Big Data in everyday life Does one minute fit into RAM? Five Minutes? Sabeur Aridhi Frameworks for Big Data Analytics 7 / 59

9 Introduction Parallel/Distributed computing Parallel/Distributed Computing Approaches: parallel algorithms vs. distributed algorithms. Pros: cheaper (scaling out vs. scaling up); scalable. Cons: needs a theory of distributed algorithms (how to distribute computation?); cluster maintenance (fault tolerance, scheduling, ...). Sabeur Aridhi Frameworks for Big Data Analytics 8 / 59

10 Introduction Parallel/Distributed computing Parallel Computing Parallel vs. Distributed A parallel algorithm can have: a global clock; shared memory. How to develop parallel computing programs? Hands-on approach: libraries are available, but still a lot of manual set-up is required (setting up threads, distributing data, ...). Sabeur Aridhi Frameworks for Big Data Analytics 9 / 59

11 Introduction Parallel/Distributed computing Distributed Computing (DC) Requires a Programming Model (PM) that is inherently parallelizable. Requires a framework that runs the program and provides fault-tolerance and scalability. Sabeur Aridhi Frameworks for Big Data Analytics 10 / 59

12 Introduction Parallel/Distributed computing Distributed Computing (DC) A programming model for DC should: abstract all low level details such as networking, scheduling,... allow for wide range of functionality be easy to use A framework for DC should provide: fault tolerance scalability data integrity Sabeur Aridhi Frameworks for Big Data Analytics 11 / 59

15 Introduction Distributed storage Distributed File System Big Data requires storage over many machines, and each computing node needs to access the data. A distributed file system (DFS) abstracts the distributed storage. A widely used example is Google's DFS, where data is replicated across the network (failure resistant, faster read operations). The namenode stores file metadata, e.g. name and location of chunks; a datanode stores the chunks of that file. Sabeur Aridhi Frameworks for Big Data Analytics 12 / 59

16 Introduction Grid Computing Grid Computing - Definition A set of information resources (computers, databases, networks, ...). It provides users with tools and applications that treat those resources as a single virtual system. Sabeur Aridhi Frameworks for Big Data Analytics 13 / 59

17 Introduction Grid Computing Grid Computing - What is Grid Computing? Grid computing is the collection of computer resources from multiple locations to reach a common goal Characteristics of a Grid: no centralized control center scalability heterogeneity (of resources) dynamic and adaptable Sabeur Aridhi Frameworks for Big Data Analytics 14 / 59

18 Introduction Cloud Computing Cloud Computing Definition Large number of computers that are connected via Internet. Applications delivered as services. Computing resources delivered as services. Pay as you go. Cloud services can be rapidly and elastically provisioned. The framework can be offered as a service by a provider e.g. Amazon EC2 Sabeur Aridhi Frameworks for Big Data Analytics 15 / 59

19 Introduction Cloud Computing Layers of Cloud Computing Service models: Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS). Sabeur Aridhi Frameworks for Big Data Analytics 16 / 59


21 MapReduce Definition Google, 2004: MapReduce refers to a programming model and the corresponding implementation for processing and generating large data sets. Created to help Google developers analyze huge datasets. Sabeur Aridhi Frameworks for Big Data Analytics 18 / 59

22 MapReduce Programming model Map and Reduce in MapReduce Data format: key-value pairs. The user (you) must define two functions: Mapper Map function: (k1, v1) → list((k2, v2)). Reducer Reduce function: (k2, list(v2)) → list(v3). The Map functions are executed in parallel. (k2, v2) are intermediate pairs; mapper results are grouped using k2. The Reduce functions are executed in parallel. Sabeur Aridhi Frameworks for Big Data Analytics 19 / 59

23 MapReduce Programming model MapReduce data flow (simplified) Sabeur Aridhi Frameworks for Big Data Analytics 20 / 59

24 MapReduce Programming model Basic steps of a MapReduce program: Input is read from the file system (fs). The Map function is executed in parallel. The intermediate results are written to fs. These intermediate results are read from fs. The Reduce function is executed in parallel. The final results are written to fs. Sabeur Aridhi Frameworks for Big Data Analytics 21 / 59

26 MapReduce Programming model Example: Word Count (MapReduce's Hello World) Input A set of documents stored in a DFS Output The number of occurrences of each word in the set of documents. Idea Work on each document in parallel and then reduce the results for each word. Sabeur Aridhi Frameworks for Big Data Analytics 22 / 59

27 MapReduce Programming model Example: Word Count (MapReduce's Hello World) Algorithm: Map function Input: String filename, String content foreach word w in content do EmitIntermediate(w, 1); Algorithm: Reduce function Input: String key, Iterator values result ← 0; foreach v in values do result ← result + v; Emit(key, result); Sabeur Aridhi Frameworks for Big Data Analytics 23 / 59
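The Map/Reduce pseudocode above can be simulated end to end in a few lines. The sketch below (plain Python, not Hadoop's API; all names are illustrative) runs the mappers, groups the intermediate pairs by key as the framework's shuffle would, then runs the reducers:

```python
from collections import defaultdict

def map_fn(filename, content):
    # Map: (k1, v1) -> list((k2, v2)); emit (word, 1) per occurrence
    for word in content.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: (k2, list(v2)) -> (k2, v3); sum the counts
    return (key, sum(values))

def run_mapreduce(documents, map_fn, reduce_fn):
    # Simulate the framework: run mappers, group intermediate
    # pairs by key (the shuffle), then run reducers.
    groups = defaultdict(list)
    for name, content in documents.items():
        for k, v in map_fn(name, content):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {"d1": "big data big analytics", "d2": "big frameworks"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'big': 3, 'data': 1, 'analytics': 1, 'frameworks': 1}
```

In the real framework the mappers and reducers would run on different machines; here the grouping step plays the role of the global sort.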

28 MapReduce Programming model Possible optimization Creating a pair for each occurrence of a particular word is inefficient. Algorithm: Optimized Map function Input: String filename, String content H ← new HashTable; % H is an associative array that maps keys to values foreach word w in content do H[w] ← H[w] + 1; foreach word w in H do EmitIntermediate(w, H[w]); Sabeur Aridhi Frameworks for Big Data Analytics 24 / 59
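A minimal sketch of this in-mapper combining, again in plain Python with illustrative names: the local hash table H collapses repeated words, so one pair per distinct word is emitted instead of one pair per occurrence. The Reduce function stays unchanged, since it still sums counts.

```python
from collections import Counter

def optimized_map(filename, content):
    # Aggregate counts locally in an associative array (H in the
    # slide) so only one pair per distinct word is emitted.
    counts = Counter(content.split())
    for word, n in counts.items():
        yield (word, n)

pairs = list(optimized_map("d1", "big data big analytics"))
# 3 pairs emitted instead of 4: [('big', 2), ('data', 1), ('analytics', 1)]
```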

29 MapReduce Programming model Combiner and Partitioner Combiner Operates between Mapper and Reducer. Combiner: (k2, list(v2)) → (k2, list(v3)). The combiner is applied before the global sort, just like in the WordCount example. Partitioner Before the global sort, decides which reducer receives which key. Sabeur Aridhi Frameworks for Big Data Analytics 25 / 59
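A partitioner is typically just a deterministic hash of the key modulo the number of reducers, so every pair with the same key reaches the same reducer. A sketch (illustrative, not Hadoop's `Partitioner` API; `crc32` stands in for a stable hash function):

```python
import zlib

def partition(key, num_reducers):
    # Deterministic hash partitioner: every occurrence of the same
    # key maps to the same reducer index. zlib.crc32 is used rather
    # than Python's hash(), which is salted per process.
    return zlib.crc32(key.encode()) % num_reducers
```

All intermediate pairs for a given word therefore land on one reducer, which can then see the complete list of values for that key.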

30 MapReduce Programming model Complete MapReduce flow Sabeur Aridhi Frameworks for Big Data Analytics 26 / 59

31 MapReduce Programming model Fault tolerance management In case of a worker failure: completed map tasks (but not completed reduce tasks) are reset to the idle state; map and reduce tasks in progress are reset to idle; workers executing reduce tasks are notified and re-execute. In case of Master failure: Version of 2004: abort the computation and retry. In MRv2 (YARN): periodic checkpoints of the data structures; a new copy is launched from the checkpoint. Sabeur Aridhi Frameworks for Big Data Analytics 27 / 59

32 MapReduce Framework Framework Google's MapReduce framework is not freely available. Open source alternative: Apache's Hadoop. Offered as a service in Amazon's cloud (Elastic MapReduce, via EC2 virtual machines). Sabeur Aridhi Frameworks for Big Data Analytics 28 / 59

33 MapReduce Framework Hadoop Architecture Sabeur Aridhi Frameworks for Big Data Analytics 29 / 59

34 MapReduce Framework Hadoop Details Library in Java. Started by Yahoo in 2006. By default, Map and Reduce functions must be written in Java. Possibility to use external programs as mappers and reducers. Sabeur Aridhi Frameworks for Big Data Analytics 30 / 59

35 MapReduce Algorithms Inverted Index (Job interview question ;) Problem We are given a set of webpages P = {P 1,..., P n }. We want to know for each word in the dataset, in which documents it appears (and how many times) Sabeur Aridhi Frameworks for Big Data Analytics 31 / 59

36 MapReduce Algorithms Reminder of MapReduce flow Sabeur Aridhi Frameworks for Big Data Analytics 32 / 59

37 MapReduce Algorithms Baseline solution Algorithm: Map function(String filename, String content) H ← new HashTable; foreach word w in content do H[w] ← H[w] + 1; foreach word w in H do EmitIntermediate(w, (filename, H[w])); Algorithm: Reduce function(String key, Iterator values) result ← []; foreach v in values do result.add(v); result.sortByFreq(); Emit(key, result); Sabeur Aridhi Frameworks for Big Data Analytics 33 / 59
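The baseline solution can be sketched in plain Python (illustrative names, not Hadoop's API): the mapper emits per-document counts and the reducer sorts each word's postings by descending frequency, mirroring sortByFreq above.

```python
from collections import Counter, defaultdict

def map_fn(filename, content):
    # Emit one (word, (filename, count)) pair per distinct word
    for word, n in Counter(content.split()).items():
        yield (word, (filename, n))

def reduce_fn(word, postings):
    # Collect the postings and sort by descending frequency
    return (word, sorted(postings, key=lambda p: -p[1]))

def inverted_index(pages):
    # Simulated shuffle: group intermediate pairs by word
    groups = defaultdict(list)
    for name, content in pages.items():
        for k, v in map_fn(name, content):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

pages = {"P1": "big data", "P2": "big big data frameworks"}
print(inverted_index(pages)["big"])  # [('P2', 2), ('P1', 1)]
```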

38 MapReduce Algorithms Simple run Sabeur Aridhi Frameworks for Big Data Analytics 34 / 59


40 MapReduce Algorithms K-Means clustering Problem definition Given a set of points, partition them to minimize the within-cluster sum of squares. Standard algorithm (Lloyd 1982) Choose k centroids (m1, m2, ..., mk) at random. While changes: assign each point to the closest centroid; move each centroid to the average of the points assigned to it. Sabeur Aridhi Frameworks for Big Data Analytics 35 / 59

41 MapReduce Algorithms K-Means run (figures over several slides: points are assigned to centroids m1, m2, m3, which move each iteration until convergence) Sabeur Aridhi Frameworks for Big Data Analytics 36 / 59

46 MapReduce Algorithms MapReduce Algorithm K-Means is an iterative algorithm. Each MapReduce job corresponds to one iteration of K-Means. MapReduce implementation of K-Means: Let C be the set of starting centroids. Do: set C_old ← C; send C to all nodes; run the MapReduce algorithm; extract the new centroids C from the output. While C ≠ C_old. Sabeur Aridhi Frameworks for Big Data Analytics 37 / 59

47 MapReduce Algorithms MapReduce Algorithm Algorithm: Map function Input: Int id, Point p k ← argmin_i dist(p, C_i); EmitIntermediate(k, p); Algorithm: Reduce function Input: centroid id, Iterator points result ← avg(points); Emit(centroid id, result); Sabeur Aridhi Frameworks for Big Data Analytics 38 / 59
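The loop above can be sketched as follows (plain Python with illustrative names; one-dimensional points for brevity). Each call to the iteration function plays the role of one MapReduce job: the map step assigns every point to its nearest centroid id, the reduce step averages the points per centroid, and the driver repeats until the centroids stop moving (exact equality is used as the termination test only because this tiny example converges exactly).

```python
def closest(point, centroids):
    # Map step: index of the nearest centroid (1-D distance)
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))

def kmeans_iteration(points, centroids):
    # One MapReduce job: map assigns each point a centroid id,
    # reduce averages the points assigned to each centroid.
    groups = {}
    for p in points:
        groups.setdefault(closest(p, centroids), []).append(p)
    return [sum(g) / len(g) if (g := groups.get(i)) else c
            for i, c in enumerate(centroids)]

def kmeans(points, centroids):
    # Driver: re-run the job until the centroids stop changing
    while True:
        new = kmeans_iteration(points, centroids)
        if new == centroids:
            return new
        centroids = new

print(kmeans([1.0, 2.0, 10.0, 11.0], [1.0, 10.0]))  # [1.5, 10.5]
```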

48 MapReduce Algorithms MapReduce: Use and abuse Hadoop horror stories (from Yahoo): Map tasks that take a full machine for 3 days. Jobs that have 10 MB of data per task, but 1 GB of data in the distributed cache. What are the problems? Iterative data-flows are not optimized (they have to write to disk in every iteration!). No persistent data structures across jobs. Sabeur Aridhi Frameworks for Big Data Analytics 39 / 59


50 Other frameworks SPARK SPARK SPARK framework A general engine for large-scale data processing. It supports SQL, streaming, and a wide range of functions. It offers several high-level operators that make it easy to build parallel applications. Sabeur Aridhi Frameworks for Big Data Analytics 41 / 59

51 Other frameworks SPARK SPARK Spark abstracts the distributed storage of objects via resilient distributed datasets (RDDs): Immutable collections of objects spread across a cluster. Built through parallel transformations (map, filter, etc.). Automatically rebuilt on failure. Controllable persistence (e.g. caching in RAM) for reuse. Shared variables that can be used in parallel operations. Sabeur Aridhi Frameworks for Big Data Analytics 42 / 59
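A toy sketch of the RDD idea (deliberately not Spark's actual API or implementation): an immutable collection plus a recorded lineage of transformations. Transformations are lazy, and a lost partition could in principle be rebuilt by replaying the lineage against the parent data, which is exactly how Spark rebuilds RDDs on failure.

```python
class ToyRDD:
    # Toy illustration only: immutable data + a lineage of
    # recorded transformations, replayed on demand.
    def __init__(self, data, lineage=()):
        self.data, self.lineage = tuple(data), lineage

    def map(self, f):
        # Lazy: record the transformation, do not compute yet
        return ToyRDD(self.data, self.lineage + (("map", f),))

    def filter(self, f):
        return ToyRDD(self.data, self.lineage + (("filter", f),))

    def collect(self):
        # Action: replay the lineage to materialize the result
        out = list(self.data)
        for op, f in self.lineage:
            out = [f(x) for x in out] if op == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```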

52 Other frameworks STORM STORM Background Created by Nathan Marz at BackType/Twitter to analyse graphs (users and links between them). Open-sourced in September 2011. Used by many companies in the Big Data industry: Twitter, Alibaba, Groupon, ... Sabeur Aridhi Frameworks for Big Data Analytics 43 / 59

53 Other frameworks STORM STORM Scalable and robust. Allows online computation and efficient processing of streaming data. Based on distributed Remote Procedure Call (RPC). Fault tolerant. Sabeur Aridhi Frameworks for Big Data Analytics 44 / 59

54 Other frameworks STORM System architecture System architecture Nimbus: like the JobTracker in Hadoop. Supervisor: manages workers. Zookeeper: stores metadata (e.g., file locations, ...). UI: Web-UI. Sabeur Aridhi Frameworks for Big Data Analytics 45 / 59


56 Graph Processing Motivations Motivations Many big data sets are graphs: Internet crawl data Social network data Bioinformatics (protein folding graphs,... ) Sabeur Aridhi Frameworks for Big Data Analytics 47 / 59

57 Graph Processing Motivations Applications for graph data Application domains Computer networks, Social networks, Bioinformatics, Chemoinformatics. Graph representation Data modeling. Identifying relationship patterns and rules. Protein structure Chemical compound Social network Sabeur Aridhi Frameworks for Big Data Analytics 48 / 59

58 Graph Processing Pregel model Pregel - Basic Facts Developed at Google. A Pregel program is based on a sequence of supersteps. Each vertex in the graph is seen as a different process. At each superstep, each vertex: receives the list of messages sent to it during the last superstep; computes its new state; sends messages to other vertices; decides if it votes to halt. When all vertices vote to halt, the algorithm stops. Sabeur Aridhi Frameworks for Big Data Analytics 49 / 59

59 Graph Processing Pregel model Framework of Pregel Each process takes control of a part of the graph: the vertices are divided between the nodes. The order of computation inside each superstep is not important. Can be built on top of MapReduce or other frameworks. Sabeur Aridhi Frameworks for Big Data Analytics 50 / 59

60 Graph Processing Pregel model Single Source Shortest Path (SSSP) Single source shortest path Given a weighted graph and a source vertex s, compute the distance from s to all other vertices. Centralized solution Run a breadth-first search. Sabeur Aridhi Frameworks for Big Data Analytics 51 / 59

61 Graph Processing Pregel model Pregel solution Algorithm: SSSPVertex integer dist ← (this == source) ? 0 : +∞; compute(messages): integer newdist ← min(messages); if newdist < dist then dist ← newdist; for Edge e ∈ this.getNeighbors() do sendMessage(e, dist + e.value()); else voteToHalt(); Sabeur Aridhi Frameworks for Big Data Analytics 52 / 59
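The SSSPVertex logic can be simulated with synchronous supersteps in plain Python (an illustrative sketch, not the Pregel or Giraph API): each vertex keeps the best distance seen so far and, on improvement, sends dist + edge weight to its neighbors; the computation halts when no messages are pending for the next superstep.

```python
import math

def pregel_sssp(edges, source):
    # edges: vertex -> list of (neighbor, weight); non-negative weights
    dist = {v: math.inf for v in edges}
    inbox = {source: [0]}          # superstep 0: only the source is active
    while inbox:
        outbox = {}
        for v, messages in inbox.items():
            newdist = min(messages)
            if newdist < dist[v]:  # improvement: update and notify neighbors
                dist[v] = newdist
                for u, w in edges[v]:
                    outbox.setdefault(u, []).append(newdist + w)
            # else: the vertex effectively votes to halt this superstep
        inbox = outbox             # barrier between supersteps
    return dist

g = {"s": [("a", 1), ("b", 4)], "a": [("b", 1)], "b": []}
print(pregel_sssp(g, "s"))  # {'s': 0, 'a': 1, 'b': 2}
```

Note how b first receives distance 4 directly from s, then is reactivated in a later superstep when the shorter path through a (distance 2) arrives.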

62 Graph Processing Pregel model Example: SSSP, shortest path as parallel BFS in Pregel (animation over several supersteps) Sabeur Aridhi Frameworks for Big Data Analytics 53 / 59

71 Graph Processing Pregel model Available implementations Two implementations: GoldenOrb and Apache Giraph, both based on Apache Hadoop. GoldenOrb: developed by an independent company, sponsored by Ravel (Austin, TX); launched June 2011. Apache Giraph: developed in the Apache Incubator.

public class MyVertex extends Vertex<IntWritable, IntWritable, IntMessage> {
    ...
    public void compute(Collection<IntMessage> messages) {
        ...
        sendMessage(new IntMessage(dest, new IntWritable(42)));
    }
}

Sabeur Aridhi Frameworks for Big Data Analytics 54 / 59

72 Graph Processing Blogel Blogel - Programming Model Blogel is a block-centric framework for distributed computation on large graphs. From Think Like a Vertex (Pregel) to Think Like a Block (Blogel) A block refers to a connected subgraph of the graph. Message exchanges occur among blocks. Sabeur Aridhi Frameworks for Big Data Analytics 55 / 59

73 Graph Processing Blogel Blogel Three types of jobs: 1 Vertex-centric graph computing, where a worker is called a V-worker. 2 Graph partitioning which groups vertices into blocks, where a worker is called a partitioner. 3 Block-centric graph computing, where a worker is called a B-worker. Sabeur Aridhi Frameworks for Big Data Analytics 56 / 59

74 Graph Processing Blogel Blogel - Programming Model Blogel operates in three computing modes, depending on the application. B-mode: only block-level message exchanges are allowed; a job terminates when all blocks have voted to halt and there is no pending message for the next superstep. V-mode: only vertex-level message exchanges are allowed; a job terminates when all vertices have voted to halt and there is no pending message for the next superstep. VB-mode: both vertex-level and block-level message exchanges are allowed. Sabeur Aridhi Frameworks for Big Data Analytics 57 / 59


76 Q and A That's it. Why was this important? Questions? Sabeur Aridhi Frameworks for Big Data Analytics 59 / 59


More information

Big Data and Scripting Systems beyond Hadoop

Big Data and Scripting Systems beyond Hadoop Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Firebird meets NoSQL (Apache HBase) Case Study

Firebird meets NoSQL (Apache HBase) Case Study Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

BIG DATA Giraph. Felipe Caicedo December-2012. Cloud Computing & Big Data. FIB-UPC Master MEI

BIG DATA Giraph. Felipe Caicedo December-2012. Cloud Computing & Big Data. FIB-UPC Master MEI BIG DATA Giraph Cloud Computing & Big Data Felipe Caicedo December-2012 FIB-UPC Master MEI Content What is Apache Giraph? Motivation Existing solutions Features How it works Components and responsibilities

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013 Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

More information

CS54100: Database Systems

CS54100: Database Systems CS54100: Database Systems Cloud Databases: The Next Post- Relational World 18 April 2012 Prof. Chris Clifton Beyond RDBMS The Relational Model is too limiting! Simple data model doesn t capture semantics

More information

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

Large-Scale Data Processing

Large-Scale Data Processing Large-Scale Data Processing Eiko Yoneki eiko.yoneki@cl.cam.ac.uk http://www.cl.cam.ac.uk/~ey204 Systems Research Group University of Cambridge Computer Laboratory 2010s: Big Data Why Big Data now? Increase

More information

A programming model in Cloud: MapReduce

A programming model in Cloud: MapReduce A programming model in Cloud: MapReduce Programming model and implementation developed by Google for processing large data sets Users specify a map function to generate a set of intermediate key/value

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia

Unified Big Data Processing with Apache Spark. Matei Zaharia @matei_zaharia Unified Big Data Processing with Apache Spark Matei Zaharia @matei_zaharia What is Apache Spark? Fast & general engine for big data processing Generalizes MapReduce model to support more types of processing

More information

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud) Open Cloud System (Integration of Eucalyptus, Hadoop and into deployment of University Private Cloud) Thinn Thu Naing University of Computer Studies, Yangon 25 th October 2011 Open Cloud System University

More information

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

More information

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE AGENDA Introduction to Big Data Introduction to Hadoop HDFS file system Map/Reduce framework Hadoop utilities Summary BIG DATA FACTS In what timeframe

More information

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015 7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

More information

Systems and Algorithms for Big Data Analytics

Systems and Algorithms for Big Data Analytics Systems and Algorithms for Big Data Analytics YAN, Da Email: yanda@cse.cuhk.edu.hk My Research Graph Data Distributed Graph Processing Spatial Data Spatial Query Processing Uncertain Data Querying & Mining

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Spark and the Big Data Library

Spark and the Big Data Library Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Hadoop Parallel Data Processing

Hadoop Parallel Data Processing MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

More information

Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Brave New World: Hadoop vs. Spark

Brave New World: Hadoop vs. Spark Brave New World: Hadoop vs. Spark Dr. Kurt Stockinger Associate Professor of Computer Science Director of Studies in Data Science Zurich University of Applied Sciences Datalab Seminar, Zurich, Oct. 7,

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Big Data Analytics Hadoop and Spark

Big Data Analytics Hadoop and Spark Big Data Analytics Hadoop and Spark Shelly Garion, Ph.D. IBM Research Haifa 1 What is Big Data? 2 What is Big Data? Big data usually includes data sets with sizes beyond the ability of commonly used software

More information

CLOUD COMPUTING INTRODUCTION TO DATA SCIENCE TIM KRASKA

CLOUD COMPUTING INTRODUCTION TO DATA SCIENCE TIM KRASKA CLOUD COMPUTING INTRODUCTION TO DATA SCIENCE TIM KRASKA CLICKER How was the celebration of knowledge? A) Very Easy B) Just ok C) Tough D) Spring break made me forget about it E) I do not want to talk about

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

NoSQL and Hadoop Technologies On Oracle Cloud

NoSQL and Hadoop Technologies On Oracle Cloud NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since Feb

More information

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY

Fast and Expressive Big Data Analytics with Python. Matei Zaharia UC BERKELEY Fast and Expressive Big Data Analytics with Python Matei Zaharia UC Berkeley / MIT UC BERKELEY spark-project.org What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Apache Hama Design Document v0.6

Apache Hama Design Document v0.6 Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault

More information

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics Dharmendra Agawane 1, Rohit Pawar 2, Pavankumar Purohit 3, Gangadhar Agre 4 Guide: Prof. P B Jawade 2

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. http://cs246.stanford.edu CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2 CPU Memory Machine Learning, Statistics Classical Data Mining Disk 3 20+ billion web pages x 20KB = 400+ TB

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Cloud Computing I (intro) 15 319, spring 2010 2 nd Lecture, Jan 14 th Majd F. Sakr Lecture Motivation General overview on cloud computing What is cloud computing Services

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015

Cloud Computing. Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 Cloud Computing Lectures 10 and 11 Map Reduce: System Perspective 2014-2015 1 MapReduce in More Detail 2 Master (i) Execution is controlled by the master process: Input data are split into 64MB blocks.

More information

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu

Introduc)on to the MapReduce Paradigm and Apache Hadoop. Sriram Krishnan sriram@sdsc.edu Introduc)on to the MapReduce Paradigm and Apache Hadoop Sriram Krishnan sriram@sdsc.edu Programming Model The computa)on takes a set of input key/ value pairs, and Produces a set of output key/value pairs.

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information