Introduction to Big Data Science. Wuhui Chen

1 Introduction to Big Data Science Wuhui Chen

2 Outline. What is Big data? Volume, Variety, Velocity. What are people doing with Big data? Classic examples. Two basic technologies for Big data management: data storage with HDFS (Hadoop Distributed File System) and data processing with MapReduce.

3 What's Big Data? There is no single definition; here is one from Wikipedia: "Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." Finding correlations in such data helps to spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

4 Big Data: 3 V's

5 Volume (Scale). Data volume is increasing exponentially.

6 Volume (Scale). A 44x increase, from 0.8 zettabytes to 35 ZB.

7 12+ TBs of tweet data every day. ? TBs of data every day. 25+ TBs of log data every day. 30 billion RFID tags today (1.3B in 2005). 4.6 billion camera phones worldwide. 100s of millions of GPS-enabled devices sold annually. 76 million smart meters in …, … M by …. … billion people on the Web by end 2011.

8 CERN's Large Hadron Collider (LHC) generates 15 PB a year. Photo: Maximilien Brice, CERN.

9 Variety (Complexity). Relational data (tables, transactions, legacy data). Text data (Web). Semi-structured data (XML). Graph data: social networks, Semantic Web (RDF). Streaming data: you can only scan the data once. Big public data (online, weather, finance, etc.). A single application can be generating or collecting many types of data. To extract knowledge, all these types of data need to be linked together.

10 A Single View to the Customer. [Figure: the customer at the center, surrounded by social media, banking, finance, gaming, entertainment, purchase, and our known history.] All types of data linked together for value extraction!

11 Velocity (Speed). Data is being generated fast and needs to be processed fast. Online data analytics: late decisions mean missed opportunities. Examples: E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you. Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction.

12 Real-time/Fast Data. Sources: mobile devices (tracking all objects all the time), social media and networks (all of us are generating data), scientific instruments (collecting all sorts of data), and sensor technology and networks (measuring all kinds of data). Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

13 Real-Time Analytics/Decision Requirement. [Figure: the customer at the center of the requirements below.] Product recommendations that are relevant and compelling. Influencing behavior. Learning why customers switch to competitors and their offers, in time to counter. Improving the marketing effectiveness of a promotion while it is still in play. Preventing fraud as it is occurring, and preventing more proactively. Friend invitations to join a game or activity that expands business.

14 Some make it 4 V's

16 Example 1: Target. News from the New York Times, 2012: Target Knew a High School Girl Was Pregnant Before Her Parents Did. How? Transaction history analysis.

17 Example 2: Google Flu Trends. Using Google search data to estimate current flu activity around the world in near real-time. A related use: estimating which cities are most at risk for spread of the Ebola virus.

18 Example 3: Elections 2012

19 Example 4: NASA of biomedicine. Oxford University's big data and Internet of Things project to "create the NASA of biomedicine". Care for cancer patients. Datasets: sequence the full genome of 100,000 patient volunteers in the NHS and combine it with the hospital clinical data.

21 Hadoop Cluster. [Figure: Racks 1 to N connected through switches to each other and to the outside world. The cluster contains a Name Node, a Job Tracker, a Secondary Name Node (Secondary NN), and a Client; every other machine runs a Data Node and a Task Tracker (DN + TT).]

22 Typical Workflow. Load data into the cluster (HDFS writes). Analyze the data (MapReduce). Store results in the cluster (HDFS writes). Read the results from the cluster (HDFS reads). Sample scenario: how many times did our customers type the word "Fraud" into emails sent to customer service? The input is File.txt, a huge file containing all emails sent to customer service.

23 Writing files to HDFS. File.txt is split into Blk A, Blk B, and Blk C. Client: "I want to write Blocks A, B, C of File.txt." Name Node: "OK. Write to Data Nodes 1, 5, 6." The client consults the Name Node. The client writes a block directly to one Data Node. The Data Nodes replicate the block. The cycle repeats for the next block.

24 Hadoop Rack Awareness: Why? Never lose all data if an entire rack fails. Keep bulky flows in-rack when possible, on the assumption that in-rack traffic has higher bandwidth and lower latency. [Figure: Data Nodes 1 to 12 in Racks 1, 5, and 9, each rack behind its own switch, holding replicas of blocks A, B, and C.] Rack-awareness table: Rack 1: Data Node 1, Data Node 2, Data Node 3. Rack 5: Data Node 5, Data Node 6, Data Node 7. Name Node metadata for File.txt: Blk A: DN1, DN5, DN6. Blk B: DN7, DN1, DN2. Blk C: DN5, DN8, DN9.

25 Preparing HDFS writes. Client: "I want to write File.txt Block A." Name Node: "OK. Write to Data Nodes 1, 5, 6." The client asks Data Node 1 whether it is ready and passes along the rest of the list (Data Nodes 5, 6); Data Node 1 asks Data Node 5, which asks Data Node 6. Each "Ready!" travels back along the same chain to the client. The Name Node picks two nodes in the same rack and one node in a different rack: data protection, plus locality for M/R. Rack-awareness table: Rack 1: Data Node 1. Rack 5: Data Node 5, Data Node 6.
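The placement rule on this slide (two replicas in one rack, the third in a different rack) can be sketched in plain Java. This is only an illustration, not HDFS code: the class `RackAwarePlacement`, its methods, and the hard-coded rack table are hypothetical, and a real Name Node considers more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the slide's rule; not part of Hadoop.
final class RackAwarePlacement {

    // Picks three Data Nodes for one block: the given first node, plus two
    // nodes that share one rack other than the first node's rack.
    static List<String> pickReplicas(Map<String, List<String>> nodesByRack,
                                     String firstRack, String firstNode) {
        List<String> chosen = new ArrayList<String>();
        chosen.add(firstNode);
        for (Map.Entry<String, List<String>> rack : nodesByRack.entrySet()) {
            boolean otherRack = !rack.getKey().equals(firstRack);
            if (otherRack && rack.getValue().size() >= 2) {
                chosen.add(rack.getValue().get(0));
                chosen.add(rack.getValue().get(1));
                return chosen;
            }
        }
        throw new IllegalStateException("no other rack has two Data Nodes");
    }

    // Rack table taken from the rack-awareness slide, assumed for this demo.
    static Map<String, List<String>> slideTopology() {
        Map<String, List<String>> racks = new LinkedHashMap<String, List<String>>();
        racks.put("Rack 1", Arrays.asList("DN1", "DN2", "DN3"));
        racks.put("Rack 5", Arrays.asList("DN5", "DN6", "DN7"));
        return racks;
    }

    public static void main(String[] args) {
        // Prints [DN1, DN5, DN6], matching "Write to Data Nodes 1, 5, 6".
        System.out.println(pickReplicas(slideTopology(), "Rack 1", "DN1"));
    }
}
```

Losing either rack still leaves at least one copy of the block, which is the data-protection argument the slide makes.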

26 Pipelined Write. [Figure: Block A travels over TCP from the client through Data Node 1, Data Node 5, and Data Node 6, each of which stores a copy.] Data Nodes 1 & 2 pass data along as it is received. Rack-awareness table: Rack 1: Data Node 1. Rack 5: Data Node 5, Data Node 6.

27 Pipelined Write. [Figure: Block A is now stored on Data Nodes 1, 2, and 3; "Block received" is reported to the Name Node and "Success" reaches the client.] Name Node metadata: File.txt Blk A: DN1, DN2, DN3. Rack-awareness table: Rack 1: Data Node 1. Rack 5: Data Node 2, Data Node 3.

28 Multi-block Replication Pipeline. [Figure: the client writes Blk A, Blk B, and Blk C of File.txt; three copies of each block end up on Data Nodes spread across Racks 1, 4, and 5.] 1 TB file = 3 TB storage and 3 TB network traffic.
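The arithmetic behind "1 TB file = 3 TB storage" is shown in the sketch below, with three copies of every block as on the slide. The 64 MB block size is an assumption for illustration (the slides do not state one), and `ReplicationCost` is a hypothetical helper, not a Hadoop class.

```java
// Hypothetical helper for the slide's storage arithmetic; not part of Hadoop.
final class ReplicationCost {

    // Number of blocks needed for a file; the last block may be partly empty.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Raw bytes stored in the cluster once every block has all its replicas.
    static long rawStorageBytes(long fileBytes, int replicationFactor) {
        return fileBytes * replicationFactor;
    }

    public static void main(String[] args) {
        long oneTB = 1L << 40;
        long blockSize = 64L << 20; // assumed 64 MB block size
        System.out.println(blockCount(oneTB, blockSize));      // prints 16384
        System.out.println(rawStorageBytes(oneTB, 3) / oneTB); // prints 3
    }
}
```

Every replica also has to cross the network once on its way to a Data Node, which is why the slide quotes the same 3 TB figure for traffic.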

29 Name Node. Data Nodes send heartbeats over TCP every 3 seconds ("I'm alive!"). Every 10th heartbeat is a block report ("I have blocks: A, C"). The Name Node builds its metadata from block reports ("Awesome! Thanks."). If the Name Node is down, HDFS is down. Name Node metadata: DN1: A, C. DN2: A, C. DN3: A, C. File system: File.txt = A, C.
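A minimal sketch of the bookkeeping described on this slide, using the slide's own numbers (a heartbeat every 3 seconds, every 10th one a block report). `NameNodeSketch` and its methods are hypothetical names, and the code ignores everything a real Name Node does beyond keeping a node-to-blocks table.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration; not the real Name Node.
final class NameNodeSketch {
    static final int HEARTBEAT_SECONDS = 3;            // from the slide
    static final int HEARTBEATS_PER_BLOCK_REPORT = 10; // from the slide

    private final Map<String, Set<String>> blocksByNode = new TreeMap<String, Set<String>>();

    // True when the n-th heartbeat (counting from 1) carries a block report.
    static boolean isBlockReport(int heartbeatNumber) {
        return heartbeatNumber > 0 && heartbeatNumber % HEARTBEATS_PER_BLOCK_REPORT == 0;
    }

    // Replaces what is known about one Data Node with its latest report.
    void onBlockReport(String node, String... blocks) {
        blocksByNode.put(node, new TreeSet<String>(Arrays.asList(blocks)));
    }

    // Data Nodes that reported holding the given block, in sorted order.
    Set<String> nodesHolding(String block) {
        Set<String> nodes = new TreeSet<String>();
        for (Map.Entry<String, Set<String>> entry : blocksByNode.entrySet()) {
            if (entry.getValue().contains(block)) {
                nodes.add(entry.getKey());
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        NameNodeSketch nameNode = new NameNodeSketch();
        nameNode.onBlockReport("DN1", "A", "C");
        nameNode.onBlockReport("DN2", "A", "C");
        nameNode.onBlockReport("DN3", "A", "C");
        System.out.println(nameNode.nodesHolding("A")); // prints [DN1, DN2, DN3]
    }
}
```

With these numbers a block report arrives every 3 × 10 = 30 seconds per Data Node.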

30 Re-replicating missing replicas. Missing heartbeats signify lost nodes ("Uh oh! Missing replicas"). The Name Node consults its metadata and finds the affected data. The Name Node consults the rack-awareness script. The Name Node tells a Data Node to re-replicate ("Copy blocks A, C to Node 8"). Name Node metadata: DN1: A, C. DN2: A, C. DN3: A, C. Rack awareness: Rack 1: DN1, DN2. Rack 5: DN3. Rack 9: DN8.
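The "finds the affected data" step can be sketched as a scan over the Name Node's metadata that counts live replicas per block. The names below are hypothetical, a target of three replicas is assumed from the earlier slides, and which Data Node has gone silent is also an assumption made only for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration; not the real Name Node logic.
final class ReReplication {

    // Blocks whose number of replicas on live nodes is below the target.
    static List<String> underReplicated(Map<String, Set<String>> blocksByNode,
                                        Set<String> liveNodes, int target) {
        Map<String, Integer> liveCopies = new TreeMap<String, Integer>();
        for (Map.Entry<String, Set<String>> entry : blocksByNode.entrySet()) {
            int alive = liveNodes.contains(entry.getKey()) ? 1 : 0;
            for (String block : entry.getValue()) {
                liveCopies.merge(block, alive, Integer::sum);
            }
        }
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : liveCopies.entrySet()) {
            if (entry.getValue() < target) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Metadata table from the slide: DN1, DN2, and DN3 each hold A and C.
    static Map<String, Set<String>> slideMetadata() {
        Map<String, Set<String>> metadata = new TreeMap<String, Set<String>>();
        for (String node : Arrays.asList("DN1", "DN2", "DN3")) {
            metadata.put(node, new TreeSet<String>(Arrays.asList("A", "C")));
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Assume DN3 stopped sending heartbeats.
        Set<String> live = new HashSet<String>(Arrays.asList("DN1", "DN2"));
        System.out.println(underReplicated(slideMetadata(), live, 3)); // prints [A, C]
    }
}
```

The blocks returned here are the ones the Name Node would ask a surviving Data Node to copy to a new node such as Data Node 8.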

31 Secondary Name Node. Not a hot standby for the Name Node. Connects to the Name Node every hour* ("It's been an hour, give me your metadata"). Housekeeping and backup of Name Node metadata. Saved metadata can rebuild a failed Name Node. File system metadata on the Name Node: File.txt = A, C.

32 Client reading files from HDFS. Client: "Tell me the block locations of Results.txt." Name Node: "Blk A = 1, 5, 6; Blk B = 8, 1, 2; Blk C = 5, 8, 9." The client receives a Data Node list for each block, picks the first Data Node for each block, and reads the blocks sequentially. Name Node metadata for Results.txt: Blk A: DN1, DN5, DN6. Blk B: DN7, DN1, DN2. Blk C: DN5, DN8, DN9. [Figure: Data Nodes in Racks 1, 5, and 9 holding the replicas of blocks A, B, and C.]

33 Data Node reading files from HDFS. Data Node: "Tell me the locations of Block A of File.txt." Name Node: "Block A = 1, 5, 6." The Name Node provides rack-local nodes first, to leverage in-rack bandwidth and a single hop. Rack-awareness table: Rack 1: Data Node 1, Data Node 2, Data Node 3. Rack 5: Data Node 5. Name Node metadata for File.txt: Blk A: DN1, DN5, DN6. Blk B: DN7, DN1, DN2. Blk C: DN5, DN8, DN9.
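"Rack-local nodes first" amounts to a stable reordering of the replica list for a given reader. The sketch below assumes the reader's rack is known; `ReadOrdering` and its rack table are hypothetical, and the rack assigned to DN6 is an assumption taken from the earlier rack-awareness slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the real Name Node logic.
final class ReadOrdering {

    // Returns the replica list with nodes in the reader's rack moved to the
    // front; the relative order inside each group is preserved.
    static List<String> rackLocalFirst(List<String> replicas,
                                       Map<String, String> rackOfNode,
                                       String readerRack) {
        List<String> ordered = new ArrayList<String>();
        List<String> remote = new ArrayList<String>();
        for (String node : replicas) {
            if (readerRack.equals(rackOfNode.get(node))) {
                ordered.add(node);
            } else {
                remote.add(node);
            }
        }
        ordered.addAll(remote);
        return ordered;
    }

    // Racks of the three holders of Block A (DN6's rack is assumed).
    static Map<String, String> slideRacks() {
        Map<String, String> racks = new HashMap<String, String>();
        racks.put("DN1", "Rack 1");
        racks.put("DN5", "Rack 5");
        racks.put("DN6", "Rack 5");
        return racks;
    }

    public static void main(String[] args) {
        List<String> blockA = Arrays.asList("DN1", "DN5", "DN6");
        // A reader in Rack 5 is offered its in-rack copies first.
        System.out.println(rackLocalFirst(blockA, slideRacks(), "Rack 5")); // prints [DN5, DN6, DN1]
    }
}
```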

35 What is MapReduce? It originated at Google [OSDI 04]. A simple programming model: a functional model for large-scale data processing. It exploits a large set of commodity computers, executes the process in a distributed manner, and offers high availability.

36 Motivation. There is a lot of demand for very large-scale data processing, and these demands share certain common themes: lots of machines are needed (scaling), and there are two basic operations on the input, Map and Reduce.

37 Peta-scale Data. [Figure: a Main thread runs 1..* Parsers and 1..* Counters over a data collection; the classes are DataCollection, WordList, and ResultTable. The word list web, weed, green, sun, moon, land, part, web, green yields the KEY/VALUE table web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.] CCSCNE 2009, Plattsburgh, April

38 Divide and Conquer. [Figure, repeated on each build of the slide: on one node, a Main thread runs 1..* Parsers and 1..* Counters over a data collection, using DataCollection, WordList, and ResultTable.] For our example: #1, schedule parallel parse tasks; #2, schedule parallel count tasks. This is a particular solution; let's generalize it. Our parse is a mapping operation. MAP: input → <key, value> pairs. Our count is a reduce operation. REDUCE: <key, value> pairs reduced. Map/Reduce originated from Lisp, but the terms have a different meaning here. The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
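The parse-then-count generalization can be tried on a single machine before any cluster is involved. The sketch below is plain Java with hypothetical names (`MiniMapReduce`, `run`): `map` turns a line into <word, 1> pairs, `reduce` sums the values of one key, and `run` stands in for the runtime by grouping pairs by key. It does not use the Hadoop API and adds none of the distribution or fault tolerance the slide mentions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-machine illustration of the map and reduce steps.
final class MiniMapReduce {

    // MAP: one line of text in, a list of <word, 1> pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // REDUCE: all values collected for one key in, one total out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Stand-in for the runtime: map every line, group by key, reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<Integer>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The word list from the Peta-scale Data slide, split over two lines.
        List<String> lines = Arrays.asList("web weed green sun moon", "land part web green");
        System.out.println(run(lines));
    }
}
```

On that word list the result has web and green at 2 and every other word at 1, which matches the KEY/VALUE table on the earlier slide.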

39 Architecture overview. [Figure: the user talks to the Job Tracker on the master node. Slave nodes 1 to N each run a Task Tracker with its workers.]

40 Data Processing: Map. Client: "How many times does Fraud appear in File.txt?" The Job Tracker hands out Map tasks, for example "Count Fraud in Block C". Map: "Run this computation on your local data." The Job Tracker delivers Java code to nodes with local data. [Figure: Map tasks on Data Nodes 1, 5, and 9 process blocks A, B, and C of File.txt and report Fraud = 3, Fraud = 0, and Fraud = 11.]

41 What if data isn't local? Client: "How many times does Fraud appear in File.txt?" [Figure: Data Node 1 in Rack 1 holds block A but has no Map tasks left, so Data Node 2 in the same rack runs the Map task and requests the data ("I need block A"). Map tasks on Data Nodes 5 and 9 run on local blocks B and C and report Fraud = 0 and Fraud = 11.] The Job Tracker tries to select a node in the same rack as the data, using the Name Node's rack awareness.
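The fallback order on this slide (a node holding the block, otherwise a node in the same rack, otherwise any node with a free slot) can be written as a small selection function. This is a hedged sketch with hypothetical names (`LocalityScheduler`, `pickNode`), not the Job Tracker's actual scheduler, and the rack table is read off the diagram.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the real Job Tracker scheduler.
final class LocalityScheduler {

    // Chooses where to run a Map task for one block, or null if no node is free.
    static String pickNode(List<String> blockHolders, List<String> freeNodes,
                           Map<String, String> rackOfNode) {
        // 1. Prefer a free node that already stores the block.
        for (String node : freeNodes) {
            if (blockHolders.contains(node)) {
                return node;
            }
        }
        // 2. Otherwise prefer a free node in the same rack as some holder.
        for (String node : freeNodes) {
            String rack = rackOfNode.get(node);
            for (String holder : blockHolders) {
                if (rack != null && rack.equals(rackOfNode.get(holder))) {
                    return node;
                }
            }
        }
        // 3. Fall back to any free node.
        return freeNodes.isEmpty() ? null : freeNodes.get(0);
    }

    // Rack table as read from the slide's diagram.
    static Map<String, String> slideRacks() {
        Map<String, String> racks = new HashMap<String, String>();
        racks.put("DN1", "Rack 1");
        racks.put("DN2", "Rack 1");
        racks.put("DN5", "Rack 5");
        racks.put("DN9", "Rack 9");
        return racks;
    }

    public static void main(String[] args) {
        // Block A lives on DN1, which has no Map slots left.
        List<String> holders = Arrays.asList("DN1");
        List<String> free = Arrays.asList("DN5", "DN2", "DN9");
        System.out.println(pickNode(holders, free, slideRacks())); // prints DN2
    }
}
```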

42 Data Processing: Reduce. Reduce: "Run this computation across Map results." Map tasks deliver their output data over the network. [Figure: the Map tasks on Data Nodes 1, 5, and 9 send their counts to a Reduce task on Data Node 3, which runs "Sum Fraud" and writes Results.txt with Fraud = 14 to HDFS.] Reduce task output data is written to and read from HDFS.
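The Fraud example reduces to two small functions: a per-block count (the Map side) and a sum of the partial counts (the Reduce side). The sketch below uses hypothetical names and plain strings instead of HDFS blocks; the partial counts 3, 0, and 11 are the ones shown on the slides.

```java
// Hypothetical illustration of the Fraud example; not Hadoop code.
final class FraudCount {

    // Map side: counts non-overlapping, case-sensitive occurrences of a word
    // in the text of one block.
    static int mapCount(String blockText, String word) {
        if (word.isEmpty()) {
            throw new IllegalArgumentException("word must not be empty");
        }
        int count = 0;
        int index = blockText.indexOf(word);
        while (index >= 0) {
            count++;
            index = blockText.indexOf(word, index + word.length());
        }
        return count;
    }

    // Reduce side: adds up the partial counts delivered by the Map tasks.
    static int reduceSum(int... partialCounts) {
        int total = 0;
        for (int partial : partialCounts) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(mapCount("Fraud here, Fraud there", "Fraud")); // prints 2
        System.out.println(reduceSum(3, 0, 11));                          // prints 14
    }
}
```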

43 Principle of MapReduce Operation

57 [Figure: large-scale data splits feed Map tasks (parse-hash), which emit <key, 1> pairs to the Reducers (say, Count); each Count writes one output: P-0000 with count1, P-0001 with count2, P-0002 with count3.]
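The "parse-hash" step sends every pair with the same key to the same reducer. One common way to do this is to hash the key and take the remainder by the number of reducers, which is what the sketch below does. `HashPartition` and both method names are hypothetical, and the `P-%04d` naming simply imitates the labels in the figure.

```java
// Hypothetical illustration of hash partitioning; not Hadoop code.
final class HashPartition {

    // Reducer index for a key: the same key always maps to the same index.
    // Masking with Integer.MAX_VALUE keeps the hash non-negative.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Output label in the style of the figure (P-0000, P-0001, ...).
    static String outputName(int partition) {
        return String.format("P-%04d", partition);
    }

    public static void main(String[] args) {
        String[] words = {"web", "weed", "green", "web"};
        for (String word : words) {
            System.out.println(word + " -> " + outputName(partitionFor(word, 3)));
        }
    }
}
```

Because both occurrences of "web" land in the same partition, a single Count task sees all of that word's pairs and can produce its total without talking to the other reducers.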

60-61 HADOOP MapReduce Program: WordCount.java

    package net.kzk9;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      // Mapper implementation: emits <word, 1> for every token in a line
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer implementation: sums the counts collected for one word
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable value = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values)
            sum += v.get();
          value.set(sum);
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        // Load the configuration
        Configuration conf = new Configuration();

        // Parse the arguments
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        // Create the job
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        // Specify the classes to use as Mapper and Reducer
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Set the key and value types used in the job
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the JobTracker
        boolean success = job.waitForCompletion(true);
        System.out.println(success);
      }
    }

WordCountTest.java

    package net.kzk9;

    import net.kzk9.WordCount;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class WordCountTest extends TestCase {
      // The Mapper under test
      private WordCount.Map mapper;
      // Driver class for running the Mapper
      private MapDriver<LongWritable, Text, Text, IntWritable> driver;

      @Before
      public void setUp() {
        // Create the Mapper under test
        mapper = new WordCount.Map();
        // Create the MapDriver that runs the test
        driver = new MapDriver<LongWritable, Text, Text, IntWritable>(mapper);
      }

      @Test
      public void testWordCountMapper() throws Exception {
        // Set the input
        driver.withInput(new LongWritable(0), new Text("this is a pen"))
              // Set the expected output
              .withOutput(new Text("this"), new IntWritable(1))
              .withOutput(new Text("is"), new IntWritable(1))
              .withOutput(new Text("a"), new IntWritable(1))
              .withOutput(new Text("pen"), new IntWritable(1))
              // Run the test
              .runTest();
      }
    }

Build script

    #!/bin/bash
    export HADOOP_HOME=/usr/lib/hadoop-0.20/
    export DIR=wordcount_classes

    # Set CLASSPATH
    CLASSPATH=$HADOOP_HOME/hadoop-core.jar
    for f in $HADOOP_HOME/lib/*.jar; do
      CLASSPATH=${CLASSPATH}:$f;
    done

    # Initialize the output directory
    rm -fr $DIR
    mkdir $DIR

    # Compile
    javac -classpath $CLASSPATH -d $DIR WordCount.java

    # Create the jar file
    jar -cvf wordcount.jar -C $DIR .

62 Conclusion and Discussion. The concept of Big Data. Big Data's current state. Basic technologies: data storage and analysis. How can we join the Big Data trend?

63 References. Apache Hadoop. Hadoop: The Definitive Guide, by Tom White, 2nd edition, O'Reilly, 2010. Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008). B. Hedlund's blog: ng-hadoop-clusters-and-the-network/ 11/24/2014


19 Putting into Practice: Large-Scale Data Management with HADOOP 19 Putting into Practice: Large-Scale Data Management with HADOOP The chapter proposes an introduction to HADOOP and suggests some exercises to initiate a practical experience of the system. The following

More information

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2

USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 USING HDFS ON DISCOVERY CLUSTER TWO EXAMPLES - test1 and test2 (Using HDFS on Discovery Cluster for Discovery Cluster Users email [email protected] if you have questions or need more clarifications. Nilay

More information

The Hadoop Eco System Shanghai Data Science Meetup

The Hadoop Eco System Shanghai Data Science Meetup The Hadoop Eco System Shanghai Data Science Meetup Karthik Rajasethupathy, Christian Kuka 03.11.2015 @Agora Space Overview What is this talk about? Giving an overview of the Hadoop Ecosystem and related

More information

Step 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so:

Step 4: Configure a new Hadoop server This perspective will add a new snap-in to your bottom pane (along with Problems and Tasks), like so: Codelab 1 Introduction to the Hadoop Environment (version 0.17.0) Goals: 1. Set up and familiarize yourself with the Eclipse plugin 2. Run and understand a word counting program Setting up Eclipse: Step

More information
