Introduction to Big Data Science. Wuhui Chen

1 Introduction to Big Data Science Wuhui Chen

2 Outline. What is Big data? Volume, Variety, Velocity. What are people doing with Big data? Classic examples. Two basic technologies for Big data management: data storage with HDFS (Hadoop Distributed File System) and data processing with MapReduce.

3 What's Big Data? No single definition; here is one from Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Finding correlations to: spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

4 Big Data: 3 V's

5 Data Volume. Volume (Scale): data volume is increasing exponentially.

6 Data Volume. Volume (Scale): a 44x increase, from 0.8 zettabytes (ZB) to 35 ZB.

7 12+ TBs of tweet data every day; ? TBs of data every day; 25+ TBs of log data every day; 30 billion RFID tags today (1.3B in 2005); 4.6 billion camera phones worldwide; 100s of millions of GPS-enabled devices sold annually; 76 million smart meters in ... M by ...; ... billion people on the Web by end 2011.

8 CERN's Large Hadron Collider (LHC) generates 15 PB a year. (Credit: Maximilien Brice, CERN)

9 Variety (Complexity). Relational data (tables, transactions, legacy data). Text data (Web). Semi-structured data (XML). Graph data: social networks, Semantic Web (RDF). Streaming data: you can only scan the data once. A single application can be generating or collecting many types of data. Big public data (online, weather, finance, etc.). To extract knowledge, all these types of data need to be linked together.

10 A Single View to the Customer. [Diagram: the Customer at the center, surrounded by Social Media, Banking, Finance, Gaming, Our Known History, Entertain, and Purchase.] All types of data linked together for value extraction!

11 Velocity (Speed). Data is being generated fast and needs to be processed fast. Online data analytics. Late decisions mean missed opportunities. Examples: E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you. Healthcare monitoring: sensors monitor your activities and body; any abnormal measurements require immediate reaction.

12 Real-time/Fast Data. Mobile devices (tracking all objects all the time). Social media and networks (all of us are generating data). Scientific instruments (collecting all sorts of data). Sensor technology and networks (measuring all kinds of data). Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

13 Real-Time Analytics/Decision Requirement. [Diagram: the Customer ("Influence Behavior") at the center, surrounded by the following goals.] Product recommendations that are relevant and compelling. Learning why customers switch to competitors and their offers, in time to counter. Improving the marketing effectiveness of a promotion while it is still in play. Preventing fraud as it is occurring, and preventing more proactively. Friend invitations to join a game or activity that expands business.

14 Some Make It 4 V's

16 Example 1: Target. News from the New York Times, 2012: Target knew a high school girl was pregnant before her parents did. How? Transaction history analysis.

17 Example 2: Google Flu Trends. Using Google search data to estimate current flu activity around the world in near real-time. Source: Estimating which cities are most at risk for spread of the Ebola virus.

18 Example 3: Elections 2012

19 Example 4: NASA of biomedicine. Oxford University's big data and Internet of Things project to 'create the NASA of biomedicine'. Care for cancer patients. Datasets: sequence the full genome of 100,000 patient volunteers in the NHS and combine it with the hospital clinical data.

21 Hadoop Cluster. [Diagram: Racks 1 to N, each behind its own switch, connected through further switches to the outside world. The Name Node, Job Tracker, Secondary Name Node, and a Client each occupy one machine; every other machine runs a Data Node and a Task Tracker (DN + TT).]

22 Typical Workflow. Load data into the cluster (HDFS writes). Analyze the data (MapReduce). Store results in the cluster (HDFS writes). Read the results from the cluster (HDFS reads). Sample scenario: How many times did our customers type the word Fraud into e-mails sent to customer service? Input: File.txt, a huge file containing all e-mails sent to customer service.

23 Writing files to HDFS. Client consults Name Node. Client writes block directly to one Data Node. Data Nodes replicate the block. Cycle repeats for next block. [Diagram: File.txt is split into Blk A, Blk B, Blk C. Client to Name Node: "I want to write Blocks A, B, C of File.txt." Name Node to Client: "OK. Write to Data Nodes 1, 5, 6." Blocks end up on Data Nodes 1, 5, 6 out of Data Nodes 1 to N.]

24 Hadoop Rack Awareness: Why? Never lose all data if an entire rack fails. Keep bulky flows in-rack when possible. Assumption that in-rack is higher bandwidth, lower latency. [Diagram: Data Nodes grouped into Rack 1, Rack 5, and Rack 9, each behind a switch. Name Node rack-aware table: Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Nodes 5, 6, 7. Name Node metadata: File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9.]

25 Preparing HDFS writes. Name Node picks two nodes in the same rack, one node in a different rack: data protection, and locality for M/R. [Diagram: Client to Name Node: "I want to write File.txt Block A." Name Node: "OK. Write to Data Nodes 1, 5, 6." Client to Data Node 1: "Ready Data Nodes 5, 6?" Data Node 1 to Data Node 5: "Ready Data Node 6?" Data Node 5 to Data Node 6: "Ready?" The "Ready!" replies travel back along the same chain to the Client. Rack-aware table: Rack 1: Data Node 1; Rack 5: Data Nodes 5, 6.]
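
The placement rule on this slide (two replicas in one rack, the third in a different rack) can be sketched in plain Java. This is only an illustration under stated assumptions: `ReplicaPlacementSketch` and `choosePipeline` are hypothetical names, not Hadoop APIs, and the real Name Node also weighs factors such as free space and load that are ignored here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReplicaPlacementSketch {
    // Hypothetical helper, not a Hadoop API: pick one Data Node on the first
    // rack and two Data Nodes that share a single other rack.
    public static List<String> choosePipeline(Map<String, List<String>> racks, String firstRack) {
        List<String> pipeline = new ArrayList<>();
        pipeline.add(racks.get(firstRack).get(0));
        for (Map.Entry<String, List<String>> rack : racks.entrySet()) {
            boolean otherRack = !rack.getKey().equals(firstRack);
            if (otherRack && rack.getValue().size() >= 2) {
                pipeline.add(rack.getValue().get(0));
                pipeline.add(rack.getValue().get(1));
                return pipeline;
            }
        }
        throw new IllegalStateException("no other rack has two Data Nodes");
    }

    public static void main(String[] args) {
        // Rack layout taken from the slide's rack-aware table.
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("Rack 1", Arrays.asList("DN1"));
        racks.put("Rack 5", Arrays.asList("DN5", "DN6"));
        System.out.println(choosePipeline(racks, "Rack 1")); // prints [DN1, DN5, DN6]
    }
}
```

With the slide's layout the sketch returns Data Nodes 1, 5, 6, the same pipeline the Name Node hands to the Client.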

26 Pipelined Write. Data Nodes 1 & 2 pass data along as it's received. [Diagram: the Client sends Blk A of File.txt over TCP to Data Node 1 (Rack 1), which forwards it to Data Node 5, which forwards it to Data Node 6. Rack-aware table: Rack 1: Data Node 1; Rack 5: Data Nodes 5, 6.]

27 Pipelined Write. [Diagram: each Data Node holding Blk A reports "Block received" to the Name Node, and "Success" travels back to the Client. Name Node metadata: File.txt = Blk A: DN1, DN2, DN3. Rack-aware table: Rack 1: Data Node 1; Rack 5: Data Nodes 2, 3.]

28 Multi-block Replication Pipeline. 1 TB file = 3 TB storage and 3 TB network traffic. [Diagram: the Client writes Blk A, Blk B, and Blk C of File.txt; each block is pipelined to three Data Nodes spread across Racks 1, 4, and 5.]
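
The "1 TB file = 3 TB storage" figure follows from a replication factor of 3. A minimal arithmetic sketch, assuming binary units and, purely for illustration, a 64 MB block size (the slides do not state a block size); the class and method names are hypothetical:

```java
public class ReplicationCostSketch {
    // Bytes stored cluster-wide when every block is kept `replication` times.
    public static long replicatedBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    // Number of blocks needed for a file (ceiling division).
    public static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tb = 1L << 40;
        long mb = 1L << 20;
        System.out.println(replicatedBytes(tb, 3) / tb); // prints 3
        System.out.println(blockCount(tb, 64 * mb));     // prints 16384 (assumed 64 MB blocks)
    }
}
```

Because each replica is shipped to a different Data Node, the network carries roughly the same three copies that end up on disk, which is where the slide's 3 TB of traffic comes from.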

29 Name Node. Data Node sends heartbeats over TCP every 3 seconds. Every 10th heartbeat is a block report. Name Node builds metadata from block reports. If Name Node is down, HDFS is down. [Diagram: Data Nodes 1, 2, 3 (of N), each holding blocks A and C: "I'm alive! I have blocks: A, C." Name Node: "Awesome! Thanks." Name Node metadata: DN1: A, C; DN2: A, C; DN3: A, C. File system: File.txt = A, C.]
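
How the Name Node could build its block map from block reports can be sketched as follows. This is a simplified illustration, not Hadoop code: `BlockReportSketch` and its methods are hypothetical, and unlike a real block report this version only adds locations and never removes blocks a Data Node no longer holds.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class BlockReportSketch {
    // Block id -> Data Nodes that reported holding a replica of it.
    private final Map<String, Set<String>> blockLocations = new TreeMap<>();

    // Record one Data Node's report ("I have blocks: A, C").
    public void onBlockReport(String dataNode, List<String> blocks) {
        for (String block : blocks) {
            blockLocations.computeIfAbsent(block, b -> new TreeSet<>()).add(dataNode);
        }
    }

    public Set<String> locationsOf(String block) {
        return blockLocations.getOrDefault(block, Collections.<String>emptySet());
    }

    // Per the slide, every 10th heartbeat carries a block report.
    public static boolean isBlockReport(int heartbeatNumber) {
        return heartbeatNumber % 10 == 0;
    }

    public static void main(String[] args) {
        BlockReportSketch nameNode = new BlockReportSketch();
        for (String dn : Arrays.asList("DN1", "DN2", "DN3")) {
            nameNode.onBlockReport(dn, Arrays.asList("A", "C"));
        }
        System.out.println(nameNode.locationsOf("A")); // prints [DN1, DN2, DN3]
    }
}
```

With a heartbeat every 3 seconds and a report on every 10th one, each Data Node refreshes the Name Node's view about every 30 seconds.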

30 Re-replicating missing replicas. Missing heartbeats signify lost nodes. Name Node consults metadata, finds affected data. Name Node consults Rack Awareness script. Name Node tells a Data Node to re-replicate. [Diagram: Name Node: "Uh oh! Missing replicas." Metadata: DN1: A, C; DN2: A, C; DN3: A, C. Rack awareness: Rack 1: DN1, DN2; Rack 5: DN3; Rack 9: DN8. Instruction: "Copy blocks A, C to Node 8."]
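
The first two steps on this slide (spot the lost node, then find the affected blocks) can be sketched like this. The names are hypothetical and the 30-second timeout is an assumed value chosen only for the example; it is not a figure from the slides or a Hadoop default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReReplicationSketch {
    // Data Nodes whose last heartbeat is older than the timeout.
    public static List<String> deadNodes(Map<String, Long> lastHeartbeatMs, long nowMs, long timeoutMs) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }

    // Blocks with fewer live replicas than the target replication factor.
    public static List<String> underReplicated(Map<String, List<String>> locations,
                                               List<String> dead, int target) {
        List<String> blocks = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : locations.entrySet()) {
            int live = 0;
            for (String dn : e.getValue()) {
                if (!dead.contains(dn)) {
                    live++;
                }
            }
            if (live < target) {
                blocks.add(e.getKey());
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        Map<String, Long> heartbeats = new TreeMap<>();
        heartbeats.put("DN1", 100_000L);
        heartbeats.put("DN2", 100_000L);
        heartbeats.put("DN3", 10_000L); // silent for a long time
        List<String> dead = deadNodes(heartbeats, 101_000L, 30_000L); // assumed timeout

        Map<String, List<String>> locations = new TreeMap<>();
        locations.put("A", Arrays.asList("DN1", "DN2", "DN3"));
        locations.put("C", Arrays.asList("DN1", "DN2", "DN3"));
        System.out.println(dead);                                // prints [DN3]
        System.out.println(underReplicated(locations, dead, 3)); // prints [A, C]
    }
}
```

The resulting list is what the Name Node would then ask a surviving Data Node to copy elsewhere, as in the slide's "Copy blocks A, C to Node 8."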

31 Secondary Name Node. Not a hot standby for the Name Node. Connects to Name Node every hour*. Housekeeping, backup of Name Node metadata. Saved metadata can rebuild a failed Name Node. [Diagram: Secondary Name Node to Name Node: "It's been an hour, give me your metadata." Name Node file system metadata: File.txt = A, C.]

32 Client reading files from HDFS. Client receives Data Node list for each block. Client picks first Data Node for each block. Client reads blocks sequentially. [Diagram: Client to Name Node: "Tell me the block locations of Results.txt." Name Node: "Blk A = 1, 5, 6; Blk B = 8, 1, 2; Blk C = 5, 8, 9." Name Node metadata: Results.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. Data Nodes are spread across Rack 1, Rack 5, and Rack 9.]

33 Data Node reading files from HDFS. Name Node provides rack-local nodes first. Leverage in-rack bandwidth, single hop. [Diagram: a Data Node to the Name Node: "Tell me the locations of Block A of File.txt." Name Node: "Block A = 1, 5, 6." Rack-aware table: Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Node 5. Name Node metadata: File.txt = Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9. Data Nodes are spread across Rack 1, Rack 5, and Rack 9.]
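
"Rack-local nodes first" amounts to reordering a block's replica list relative to the reader's rack. A minimal sketch, assuming a simple node-to-rack lookup table; `RackLocalReadSketch` and `orderForReader` are hypothetical names, not part of Hadoop:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RackLocalReadSketch {
    // Replicas in the reader's rack come first; the rest keep their order.
    public static List<String> orderForReader(List<String> replicas,
                                              Map<String, String> rackOf,
                                              String readerRack) {
        List<String> local = new ArrayList<>();
        List<String> remote = new ArrayList<>();
        for (String dn : replicas) {
            if (readerRack.equals(rackOf.get(dn))) {
                local.add(dn);
            } else {
                remote.add(dn);
            }
        }
        local.addAll(remote);
        return local;
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("DN1", "Rack 1");
        rackOf.put("DN5", "Rack 5");
        rackOf.put("DN6", "Rack 5");
        // Block A = 1, 5, 6, requested by a reader sitting in Rack 5.
        System.out.println(orderForReader(Arrays.asList("DN1", "DN5", "DN6"), rackOf, "Rack 5"));
        // prints [DN5, DN6, DN1]
    }
}
```

A reader that simply takes the first entry then stays inside its own rack whenever a local replica exists.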

35 What is MapReduce? Originated at Google [OSDI '04]. A simple programming model. Functional model. For large-scale data processing: exploits a large set of commodity computers, executes processes in a distributed manner, and offers high availability.

36 Motivation. Lots of demand for very large scale data processing. Certain common themes run through these demands: lots of machines needed (scaling), and two basic operations on the input, Map and Reduce.

37 Peta-scale Data. [Diagram: Main starts 1..* Parser threads and 1..* Counter threads; data flows from the Data collection through DataCollection and WordList to ResultTable. WordList: web, weed, green, sun, moon, land, part, web, green. ResultTable (KEY VALUE): web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1.] CCSCNE 2009, Plattsburgh, April

38 Divide and Conquer. [Diagram, repeated at each step: on one node, Main starts 1..* Parser threads and 1..* Counter threads; data flows from the Data collection through DataCollection and WordList to ResultTable.] For our example: #1: schedule parallel parse tasks; #2: schedule parallel count tasks. This is a particular solution; let's generalize it. Our parse is a mapping operation: MAP: input to <key, value> pairs. Our count is a reduce operation: REDUCE: <key, value> pairs reduced. Map/Reduce originated from Lisp, but they have a different meaning here. The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!
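
The parse-then-count generalization can be tried out in plain Java without a cluster. The sketch below is a single-process illustration only (class and method names are hypothetical, and no Hadoop classes are involved): `map` turns a line into <word, 1> pairs and `reduce` sums the values per key, using the word list from slide 37 as input.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // MAP: one input line -> a list of <word, 1> pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // REDUCE: sum the values of all pairs that share a key.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String words = "web weed green sun moon land part web green";
        System.out.println(reduce(map(words)));
        // prints {green=2, land=1, moon=1, part=1, sun=1, web=2, weed=1}
    }
}
```

In a real MapReduce job the runtime would run many `map` calls in parallel on different splits and group the pairs by key before handing them to the reducers; here both steps run in one thread.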

39 Architecture overview. [Diagram: the user submits work to the Job tracker on the Master node; Slave nodes 1 to N each run a Task tracker that manages Workers.]

40 Data Processing: Map. Map: "Run this computation on your local data." Job Tracker delivers Java code to nodes with local data. [Diagram: Client: "How many times does Fraud appear in File.txt?" The Job Tracker assigns Map Tasks, e.g. "Count Fraud in Block C." Data Node 1 (Blk A): Fraud = 3; Data Node 5 (Blk B): Fraud = 0; Data Node 9 (Blk C): Fraud = 11.]

41 What if data isn't local? Job Tracker tries to select a node in the same rack as the data, using Name Node rack awareness. [Diagram: Data Node 1 holds Blk A but has no Map tasks left, so the Map Task runs on Data Node 2 in the same rack (Rack 1), which says "I need block A." Data Node 5 (Blk B, Rack 5): Fraud = 0; Data Node 9 (Blk C, Rack 9): Fraud = 11.]
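
The preference order on this slide (a node holding the block, else a node in the same rack, else any free node) can be sketched as below. `TaskLocalitySketch` and `pickNode` are hypothetical names, and the sketch assumes every node appears in the rack table; the actual Job Tracker logic is more involved.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskLocalitySketch {
    // Returns the free node that should run the Map task, or null if none is free.
    public static String pickNode(List<String> blockHolders, List<String> freeNodes,
                                  Map<String, String> rackOf) {
        for (String node : freeNodes) {          // 1. data-local
            if (blockHolders.contains(node)) {
                return node;
            }
        }
        for (String node : freeNodes) {          // 2. rack-local
            for (String holder : blockHolders) {
                if (rackOf.get(node).equals(rackOf.get(holder))) {
                    return node;
                }
            }
        }
        return freeNodes.isEmpty() ? null : freeNodes.get(0); // 3. anywhere
    }

    public static void main(String[] args) {
        Map<String, String> rackOf = new HashMap<>();
        rackOf.put("DN1", "Rack 1");
        rackOf.put("DN2", "Rack 1");
        rackOf.put("DN9", "Rack 9");
        // Block A lives on DN1, which has no Map task slots left.
        System.out.println(pickNode(Arrays.asList("DN1"), Arrays.asList("DN9", "DN2"), rackOf));
        // prints DN2
    }
}
```

DN2 wins over DN9 even though DN9 is listed first, because fetching block A from DN1 stays inside Rack 1.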

42 Data Processing: Reduce. Reduce: "Run this computation across Map results." Map Tasks deliver output data over the network. Reduce Task output data is written to and read from HDFS. [Diagram: Map Tasks on Data Nodes 1, 5, 9 (Blks A, B, C) send their results to the Reduce Task on Data Node 3 ("Sum Fraud"), which writes Results.txt with Fraud = 14 to HDFS as blocks X, Y, Z.]

43 Principle of MapReduce Operation

57 Map Reduce. [Diagram: large-scale data is divided into splits; each Map task ("Parse-hash") emits <key, 1> pairs and hashes them to the Reducers (say, Count); each Count reducer writes one output partition: P-0000 with count1, P-0001 with count2, P-0002 with count3.]
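
The "Parse-hash" step decides which reducer receives each key. A minimal sketch of hash partitioning, assuming three reducers and the P-000n output names from the diagram; the class and method names are hypothetical, and the masking formula is the usual hash-partitioning idiom rather than something taken from the slides:

```java
public class HashPartitionSketch {
    // Same key -> same reducer, regardless of which Map task emitted it.
    public static int partitionFor(String key, int numReducers) {
        // Mask the sign bit so a negative hashCode cannot yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Output name for a reducer's partition, following the diagram's P-000n style.
    public static String outputName(int partition) {
        return String.format("P-%04d", partition);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"web", "weed", "green"}) {
            System.out.println(word + " -> " + outputName(partitionFor(word, 3)));
        }
    }
}
```

Because the partition depends only on the key, every <web, 1> pair lands at the same Count reducer, which is what makes the per-key sum correct.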

60 HADOOP MapReduce Program: WordCount.java

    package net.kzk9;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      // Mapper implementation: emits <word, 1> for every token in a line
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer implementation: sums the counts emitted for each word
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable value = new IntWritable(0);

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values)
            sum += v.get();
          value.set(sum);
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        // Load the configuration
        Configuration conf = new Configuration();

        // Parse the arguments
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();

        // Create the job
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        // Specify the classes to use as Mapper and Reducer
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Set the key and value types used in the job
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

61 HADOOP MapReduce Program: WordCount.java

        // Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the JobTracker and wait for it to finish
        boolean success = job.waitForCompletion(true);
        System.out.println(success);
      }
    }

TestWordCount.java

    package net.kzk9;

    import net.kzk9.WordCount;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import junit.framework.TestCase;
    import org.junit.Before;
    import org.junit.Test;

    // Note: javac requires a public class to live in a file named after it
    // (WordCountTest.java), whatever the slide title says.
    public class WordCountTest extends TestCase {
      // The Mapper under test
      private WordCount.Map mapper;
      // Driver class that runs the Mapper
      private MapDriver<LongWritable, Text, Text, IntWritable> driver;

      @Before
      public void setUp() {
        // Create the Mapper under test
        mapper = new WordCount.Map();
        // Create the MapDriver that runs the test
        driver = new MapDriver<LongWritable, Text, Text, IntWritable>(mapper);
      }

      @Test
      public void testWordCountMapper() throws Exception {
        // Set the input
        driver.withInput(new LongWritable(0), new Text("this is a pen"))
              // Set the expected output
              .withOutput(new Text("this"), new IntWritable(1))
              .withOutput(new Text("is"), new IntWritable(1))
              .withOutput(new Text("a"), new IntWritable(1))
              .withOutput(new Text("pen"), new IntWritable(1))
              // Run the test
              .runTest();
      }
    }

Build script

    #!/bin/bash
    export HADOOP_HOME=/usr/lib/hadoop-0.20/
    export DIR=wordcount_classes

    # Set up the CLASSPATH
    CLASSPATH=$HADOOP_HOME/hadoop-core.jar
    for f in $HADOOP_HOME/lib/*.jar; do
      CLASSPATH=${CLASSPATH}:$f;
    done

    # Initialize the output directory
    rm -fr $DIR
    mkdir $DIR

    # Compile
    javac -classpath $CLASSPATH -d $DIR WordCount.java

    # Create the jar file
    jar -cvf wordcount.jar -C $DIR .

62 Conclusion and Discussion. The concept of Big Data. Big Data's current state. Basic technologies: data storage and analysis. How can we join the Big Data trend?

63 Reference. Apache Hadoop. Hadoop: The Definitive Guide, by Tom White, 2nd edition, O'Reilly, 2010. Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008). B. Hedlund's blog: ng-hadoop-clusters-and-the-network/ 11/24/2014