Introduction to Big Data Science. Wuhui Chen




Outline: What is Big data? (Volume, Variety, Velocity). What are people doing with Big data? Classic examples. Two basic technologies for Big data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.

What's Big Data? There is no single definition; here is one from Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Finding correlations in such data can spot business trends, determine the quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

Big Data: the 3V's.

Data Volume (Scale): data volume is increasing exponentially.

Data Volume (Scale): a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB (35 / 0.8 is roughly 44).

Examples of data volume: 12+ TB of tweet data every day; 25+ TB of log data every day; 30 billion RFID tags today (1.3 billion in 2005); 4.6 billion camera phones worldwide (? TB of data); hundreds of millions of GPS-enabled devices sold annually; 76 million smart meters in 2009, 200 million by 2014; 2+ billion people on the Web by end of 2011.

CERN's Large Hadron Collider (LHC) generates 15 PB a year. (Photo: Maximilien Brice, CERN.)

Variety (Complexity): relational data (tables/transactions/legacy data), text data (web), semi-structured data (XML), graph data (social networks, Semantic Web/RDF), streaming data (you can only scan the data once), and big public data (online, weather, finance, etc.). A single application can be generating or collecting many types of data; to extract knowledge, all these types of data need to be linked together.

A Single View of the Customer (diagram): social media, banking and finance, gaming, our known history, entertainment, and purchases, all types of data linked together for value extraction!

Velocity (Speed): data is being generated fast and needs to be processed fast (online data analytics); late decisions mean missed opportunities. Examples: E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you. Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction.

Real-time/Fast Data: mobile devices (tracking all objects all the time), social media and networks (all of us are generating data), scientific instruments (collecting all sorts of data), sensor technology and networks (measuring all kinds of data). Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Real-Time Analytics/Decision Requirements: product recommendations that are relevant and compelling enough to influence behavior; learning why customers switch to competitors and their offers, in time to counter; improving the marketing effectiveness of a promotion while it is still in play; preventing fraud as it is occurring, and preventing more proactively; friend invitations to join a game or activity that expand the business.

Some make it 4V's.

Outline: What is Big data? (Volume, Variety, Velocity). What are people doing with Big data? Classic examples. Two basic technologies for Big data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.

Example 1: Target. News from the New York Times, 2012: Target knew a high school girl was pregnant before her parents did. How? Transaction history analysis.

Example 2: Google Flu Trends. Using Google search data to estimate current flu activity around the world in near real time (source: http://www.google.org/flutrends/), and estimating which cities are most at risk for the spread of the Ebola virus.

Example 3: Elections 2012.

Example 4: the NASA of biomedicine. Oxford University's big data and Internet of Things project to 'create the NASA of biomedicine' and care for cancer patients. Datasets: sequence the full genome of 100,000 patient volunteers in the NHS and combine it with hospital clinical data.

Outline: What is Big data? (Volume, Variety, Velocity). What are people doing with Big data? Classic examples. Two basic technologies for Big data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.

Hadoop Cluster World (diagram): racks of servers (Rack 1 through Rack N) connected by switches. Most machines run a Data Node plus Task Tracker (DN + TT); separate machines host the Name Node, the Job Tracker, the Secondary Name Node (NN), and the Client.

Typical Workflow: load data into the cluster (HDFS writes), analyze the data (MapReduce), store results in the cluster (HDFS writes), read the results from the cluster (HDFS reads). Sample scenario: how many times did our customers type the word "Fraud" into emails sent to customer service? File.txt is a huge file containing all emails sent to customer service.

Writing files to HDFS (diagram): the Client tells the Name Node "I want to write Blocks A, B, C of File.txt"; the Name Node replies "OK. Write to Data Nodes 1, 5, 6". The Client consults the Name Node, writes each block directly to one Data Node, the Data Nodes replicate the block, and the cycle repeats for the next block. A concrete client-side sketch follows below.
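To make the write path concrete, here is a minimal client-side sketch using the standard Hadoop FileSystem API (the path and file contents are invented for illustration); the Name Node consultation and the replication pipeline described above happen beneath the create() call:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml to find the Name Node address
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // create() asks the Name Node for target Data Nodes, then streams
            // each block into the replication pipeline
            FSDataOutputStream out = fs.create(new Path("/user/demo/File.txt"));
            out.writeBytes("Please look into this Fraud report...\n");
            out.close();
        }
    }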

Hadoop Rack Awareness (diagram): the Name Node keeps a rack-aware table (Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Nodes 5, 6, 7) next to its block metadata (File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9). Why? Never lose all data if an entire rack fails, and keep bulky flows in-rack when possible, on the assumption that in-rack links have higher bandwidth and lower latency.

Preparing HDFS writes (diagram): the Client tells the Name Node "I want to write File.txt Block A"; the Name Node replies "OK. Write to Data Nodes 1, 5, 6"; the Client and Data Nodes then confirm readiness down the chain ("Ready Data Nodes 5, 6?" ... "Ready!") before any data is sent. The Name Node picks two nodes in the same rack and one node in a different rack: data protection, plus locality for MapReduce.

Pipelined Write (diagram): the Client sends Block A to the first Data Node, and the first two Data Nodes in the pipeline pass the data along as it is received (TCP port 50010), so the block lands on Data Nodes 1, 5, and 6 across Racks 1 and 2.

Pipelined Write, completion (diagram): each Data Node reports "Block received", the Client reports "Success" to the Name Node, and the Name Node records File.txt Blk A: DN1, DN2, DN3 in its metadata.

Multi-block Replication Pipeline (diagram): Blocks A, B, and C each flow through their own three-node pipeline spanning multiple racks. With the default replication factor of 3, a 1 TB file consumes 3 TB of storage; and because each block is forwarded hop by hop to three nodes, writing it also generates roughly 3 TB of network traffic.

Name Node (diagram): each Data Node reports "I'm alive! I have blocks A, C", and the Name Node builds its metadata from these reports (DN1: A, C; DN2: A, C; DN3: A, C; file system: File.txt = A, C). Data Nodes send heartbeats over TCP every 3 seconds; every 10th heartbeat is a block report, and the Name Node builds its metadata from block reports. If the Name Node is down, HDFS is down.

Re-replicating missing replicas (diagram): missing heartbeats signify lost nodes ("Uh oh! Missing replicas"). The Name Node consults its metadata to find the affected data, consults the Rack Awareness script, and tells a Data Node to re-replicate ("Copy blocks A, C to Node 8").

Secondary Name Node: not a hot standby for the Name Node. It connects to the Name Node every hour ("It's been an hour, give me your metadata") for housekeeping and backup of the Name Node metadata; the saved metadata can rebuild a failed Name Node.

Client reading files from HDFS (diagram): the Client asks "Tell me the block locations of Results.txt", and the Name Node answers from its metadata: Blk A = DN1, 5, 6; Blk B = DN8, 1, 2; Blk C = DN5, 8, 9. The Client receives the Data Node list for each block, picks the first Data Node for each block, and reads the blocks sequentially.
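A matching read-side sketch (again with an invented path): getFileBlockLocations() exposes the same block-to-Data-Node mapping the Name Node returns in the diagram, and open() then reads the blocks sequentially:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/Results.txt");

            // Ask the Name Node which Data Nodes hold each block
            FileStatus status = fs.getFileStatus(path);
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen()))
                System.out.println(Arrays.toString(block.getHosts()));

            // Read the blocks sequentially, copying them to stdout
            FSDataInputStream in = fs.open(path);
            IOUtils.copyBytes(in, System.out, 4096, true); // true = close the stream afterwards
        }
    }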

Data Node reading files from HDFS (diagram): a Data Node asks "Tell me the locations of Block A of File.txt", and the Name Node replies "Block A = 1, 5, 6", providing rack-local nodes first so the reader can leverage in-rack bandwidth and a single switch hop.

Outline: What is Big data? (Volume, Variety, Velocity). What are people doing with Big data? Classic examples. Two basic technologies for Big data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.

What is MapReduce? It originated at Google [OSDI '04]: a simple, functional programming model for large-scale data processing. It exploits large sets of commodity computers, executes processing in a distributed manner, and offers high availability.
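For reference, the Dean and Ghemawat paper cited at the end of the deck summarizes the model as two user-supplied functions over key/value pairs (notation from the paper):

    map    (k1, v1)       -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v2)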

Motivation: there are lots of demands for very large scale data processing, with common themes across them: lots of machines are needed (scaling), and two basic operations on the input suffice: Map and Reduce.

Peta-scale Data (word-count example, diagram): a Main program drives Parser threads that turn a DataCollection into a WordList, and Counter threads that turn the WordList into a ResultTable of key/value pairs, e.g. web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.

Divide and Conquer: for our example, #1 schedule parallel parse tasks and #2 schedule parallel count tasks, with the same Main/Parser/Counter pipeline replicated on every node. This is a particular solution; let's generalize it. Our parse is a mapping operation, MAP: input -> <key, value> pairs; our count is a reduce operation, REDUCE: <key, value> pairs reduced. Map and reduce originate from Lisp but have a different meaning here. The runtime adds distribution, fault tolerance, replication, monitoring, and load balancing to your base application! A toy single-machine sketch of this shape follows below.
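To see that shape without any framework, here is a toy single-machine sketch of the parse (map), group (shuffle), and count (reduce) steps, using the example words from the previous slide; everything the runtime adds (distribution, fault tolerance, replication, monitoring, load balancing) wraps around exactly this structure:

    import java.util.*;

    public class ToyMapReduce {
        public static void main(String[] args) {
            List<String> documents = Arrays.asList("web weed green", "sun moon land part web green");

            // MAP: parse each input record into <key, value> pairs, here <word, 1>
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String doc : documents)
                for (String word : doc.split(" "))
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));

            // SHUFFLE: group the pairs by key (the framework does this between phases)
            Map<String, List<Integer>> groups = new TreeMap<>();
            for (Map.Entry<String, Integer> p : pairs)
                groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

            // REDUCE: collapse each key's list of values into a single count
            for (Map.Entry<String, List<Integer>> g : groups.entrySet())
                System.out.println(g.getKey() + "\t" + g.getValue().size());
        }
    }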

Architecture overview (diagram): the user submits work to a Job Tracker on the master node; each slave node (1..N) runs a Task Tracker that manages its local workers.

Data Processing: Map (diagram). How many times does "Fraud" appear in File.txt? The Job Tracker delivers Java code to the nodes holding the data locally ("Count Fraud in Block C"): the Map task on Data Node 1 (block A) finds Fraud = 3, on Data Node 5 (block B) Fraud = 0, and on Data Node 9 (block C) Fraud = 11. Map means: run this computation on your local data. A sketch of such a Map task follows below.
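As an illustrative sketch (the class name is invented; the deck does not show this code), such a Map task could look like this in the same mapreduce API used by the WordCount program later in the deck:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Runs on each Data Node against its local block of File.txt
    public class FraudMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text KEY = new Text("Fraud");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (String word : line.toString().split("\\s+"))
                if (word.equals("Fraud"))
                    count++;
            if (count > 0)
                context.write(KEY, new IntWritable(count)); // partial count for this line
        }
    }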

What if data isn't local? (diagram) If the Data Node holding a needed block has no Map task slots left, the Job Tracker tries to select a node in the same rack as the data, using the Name Node's rack awareness; that node then pulls the block over the in-rack switch ("I need block A").

Data Processing: Reduce (diagram). The Job Tracker starts a Reduce task ("Sum Fraud"); the Map tasks deliver their output data over the network, and the Reduce task on Data Node 3 sums the partial counts into Results.txt: Fraud = 14, which is written to and read from HDFS. Reduce means: run this computation across the Map results. A matching Reduce sketch follows below.
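And the matching Reduce side (again an invented sketch, not the deck's code): it receives every Map task's partial counts for the key "Fraud" and writes the grand total:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FraudReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> partialCounts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : partialCounts)
                total += c.get(); // e.g. 3 + 0 + 11 = 14
            context.write(key, new IntWritable(total)); // Results.txt: Fraud = 14
        }
    }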

Principle of MapReduce Operation


(Diagram) Large-scale data is divided into splits; Map tasks emit <key, 1> pairs; a parse-hash routes each key to a reducer; the reducers (say, Count) produce output partitions P-0000 (count1), P-0001 (count2), P-0002 (count3).
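The "parse-hash" routing in the figure is key partitioning. Hadoop's default HashPartitioner uses logic equivalent to the following sketch (an illustrative reimplementation, not the library source), so every occurrence of a key lands on the same reducer:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Equivalent logic to the default HashPartitioner
    public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the partition index is non-negative
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }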


HADOOP MapReduce Program: WordCount.java

    package net.kzk9;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        // Mapper implementation: emits <word, 1> for every token in the line
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer implementation: sums the 1s for each word
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable value = new IntWritable(0);

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values)
                    sum += v.get();
                value.set(sum);
                context.write(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            // Load the configuration
            Configuration conf = new Configuration();

            // Parse the command-line arguments
            GenericOptionsParser parser = new GenericOptionsParser(conf, args);
            args = parser.getRemainingArgs();

            // Create the job
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCount.class);

            // Specify the Mapper/Reducer classes
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            // Set the key/value types used throughout the job
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

HADOOP MapReduce Program: WordCount.java (continued)

            // Set the input/output paths
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit the job to the JobTracker and wait for it to finish
            boolean success = job.waitForCompletion(true);
            System.out.println(success);
        }
    }

WordCountTest.java (MRUnit test for the Mapper)

    package net.kzk9;

    import net.kzk9.WordCount;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    public class WordCountTest extends TestCase {
        // The Mapper under test
        private WordCount.Map mapper;
        // Driver class that runs the Mapper
        private MapDriver driver;

        @Before
        public void setup() {
            // Create the Mapper under test
            mapper = new WordCount.Map();
            // Create the MapDriver that executes the test
            driver = new MapDriver(mapper);
        }

        @Test
        public void testWordCountMapper() {
            // Set the input
            driver.withInput(new LongWritable(0), new Text("this is a pen"))
                  // Set the expected outputs
                  .withOutput(new Text("this"), new IntWritable(1))
                  .withOutput(new Text("is"), new IntWritable(1))
                  .withOutput(new Text("a"), new IntWritable(1))
                  .withOutput(new Text("pen"), new IntWritable(1))
                  // Run the test
                  .runTest();
        }
    }

Build script (bash)

    #!/bin/bash
    export HADOOP_HOME=/usr/lib/hadoop-0.20/
    export DIR=wordcount_classes

    # Set up the CLASSPATH
    CLASSPATH=$HADOOP_HOME/hadoop-core.jar
    for f in $HADOOP_HOME/lib/*.jar; do
        CLASSPATH=${CLASSPATH}:$f;
    done

    # Reset the build directory
    rm -fr $DIR
    mkdir $DIR

    # Compile
    javac -classpath $CLASSPATH -d $DIR WordCount.java

    # Build the jar file
    jar -cvf wordcount.jar -C $DIR .
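Assuming the input text has already been copied into HDFS, the resulting jar can then be submitted with a command of the form hadoop jar wordcount.jar net.kzk9.WordCount <input_dir> <output_dir>; the word counts appear as part-r-* files under the output directory. (The exact invocation depends on the cluster setup; this is the usual pattern.)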

Conclusion and Discussion: the concept of Big Data; Big Data's current state; basic technologies for data storage and analysis; how can we join the Big Data trend?

References: Apache Hadoop, http://hadoop.apache.org/ and http://wiki.apache.org/hadoop/. Hadoop: The Definitive Guide, by Tom White, 2nd edition, O'Reilly, 2010. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113. B. Hedlund's blog: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/