Introduction to Big Data Science Wuhui Chen
Outline: What is Big Data? (Volume, Variety, Velocity). What are people doing with Big Data? Classic examples. Two basic technologies for Big Data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.
What's Big Data? No single definition; here is one from Wikipedia: big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Finding correlations lets us spot business trends, determine the quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Big Data: the 3 V's
Data Volume (Scale): data volume is increasing exponentially, a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB.
Every day: 12+ TB of tweet data, 25+ TB of log data, and ? TB more. 30 billion RFID tags today (1.3B in 2005). 4.6 billion camera phones worldwide. 100s of millions of GPS-enabled devices sold annually. 76 million smart meters in 2009, 200M by 2014. 2+ billion people on the Web by end of 2011.
CERN's Large Hadron Collider (LHC) generates 15 PB a year. (Photo: Maximilien Brice, CERN)
Variety (Complexity): relational data (tables, transactions, legacy data); text data (Web); semi-structured data (XML); graph data (social networks, Semantic Web/RDF); streaming data, where you can only scan the data once. A single application can generate and collect many types of data, alongside big public data (online, weather, finance, etc.). To extract knowledge, all these types of data need to be linked together.
A Single View to the Customer: social media, banking and finance, gaming, entertainment, purchase history, and our known customer history, all types of data linked together for value extraction!
Velocity (Speed): data is being generated fast and needs to be processed fast (online data analytics); late decisions mean missed opportunities. Examples: E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you. Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires an immediate reaction.
Real-time/Fast Data: mobile devices (tracking all objects all the time), social media and networks (all of us are generating data), scientific instruments (collecting all sorts of data), sensor technology and networks (measuring all kinds of data). Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
Real-Time Analytics/Decision Requirements: product recommendations that are relevant and compelling, influencing behavior; learning why customers switch to competitors and their offers, in time to counter; improving the marketing effectiveness of a promotion while it is still in play; preventing fraud as it is occurring, and preventing more of it proactively; friend invitations to join a game or activity that expands the business.
Some make it 4 V's (often adding Veracity).
Outline: What is Big Data? (Volume, Variety, Velocity). What are people doing with Big Data? Classic examples. Two basic technologies for Big Data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.
Example 1: Target. News from the New York Times, 2012: Target knew a high-school girl was pregnant before her parents did. How? Transaction-history analysis.
Example 2: Google Flu Trends: using Google search data to estimate current flu activity around the world in near real time (source: http://www.google.org/flutrends/), and to estimate which cities are most at risk for the spread of the Ebola virus.
Example 3: Elections 2012
Example 4: the 'NASA of biomedicine': Oxford University's big data and Internet of Things project to 'create the NASA of biomedicine' and improve care for cancer patients. Datasets: sequence the full genomes of 100,000 patient volunteers in the NHS and combine them with hospital clinical data.
Outline: What is Big Data? (Volume, Variety, Velocity). What are people doing with Big Data? Classic examples. Two basic technologies for Big Data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.
Hadoop Cluster World. [Figure: racks 1 through N of slave nodes, each running a Data Node plus Task Tracker (DN + TT), connected through per-rack switches to core switches; separate machines host the Name Node, the Job Tracker, the Secondary Name Node, and the Client.]
Typical Workflow: load data into the cluster (HDFS writes), analyze the data (MapReduce), store the results in the cluster (HDFS writes), read the results from the cluster (HDFS reads). Sample scenario: how many times did our customers type the word 'Fraud' into emails sent to customer service? File.txt is a huge file containing all emails sent to customer service.
Writing files to HDFS. File.txt consists of blocks A, B, and C. Client: 'I want to write blocks A, B, C of File.txt.' Name Node: 'OK. Write to Data Nodes 1, 5, 6.' The client consults the Name Node, then writes each block directly to one Data Node; the Data Nodes replicate the block among themselves, and the cycle repeats for the next block.
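From application code, all of this is hidden behind Hadoop's FileSystem API. Below is a minimal sketch of such a write (not from the slides), assuming a reachable HDFS cluster; the path and payload are hypothetical. The Name Node consultation, block placement, and replication described above all happen beneath fs.create().

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();  // picks up core-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);      // client-side handle to HDFS
            // The client just streams bytes; HDFS splits them into blocks
            // and pipelines each block to its replica Data Nodes.
            FSDataOutputStream out = fs.create(new Path("/user/demo/File.txt"));  // hypothetical path
            out.writeBytes("emails sent to customer service...\n");
            out.close();
            fs.close();
        }
    }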
Hadoop Rack Awareness: why? The Name Node keeps a rack-aware mapping (Rack 1: Data Nodes 1, 2, 3; Rack 5: Data Nodes 5, 6, 7; ...) alongside its metadata (File.txt: Blk A: DN1, DN5, DN6; Blk B: DN7, DN1, DN2; Blk C: DN5, DN8, DN9). Never lose all copies of the data if an entire rack fails; keep bulky flows in-rack when possible, on the assumption that in-rack links have higher bandwidth and lower latency.
Preparing HDFS writes. Client: 'I want to write File.txt Block A.' Name Node: 'OK. Write to Data Nodes 1, 5, 6.' The client then checks the pipeline ('Ready, Data Nodes 5, 6?'), and each node in turn answers 'Ready!'. Using its rack awareness (Rack 1: Data Node 1; Rack 5: Data Nodes 5 and 6), the Name Node picks two nodes in the same rack and one node in a different rack: data protection, plus locality for M/R.
Pipelined Write. The client sends Block A of File.txt to Data Node 1 over TCP port 50010, and Data Nodes 1 and 5 pass the data along as it is received, so the block lands on Data Nodes 1, 5, and 6 (Rack 1: Data Node 1; Rack 5: Data Nodes 5 and 6).
Pipelined Write (completion). Each Data Node reports 'Block received' as its replica arrives; the client reports 'Success' to the Name Node, which records in its metadata: File.txt, Blk A: DN1, DN2, DN3.
Multi-block Replication Pipeline. Blocks A, B, and C of File.txt are each pipelined to three Data Nodes spread across racks (Racks 1, 4, and 5 here), so a 1 TB file means 3 TB of storage and 3 TB of network traffic.
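To see where those numbers come from (assuming the era's default 64 MB block size and a replication factor of 3): a 1 TB file splits into 1 TB / 64 MB, roughly 16,384 blocks; each block is written three times, once per replica in the pipeline, so the cluster stores about 3 TB for the file and moves about 3 TB over the network during the initial load.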
Name Node. [Figure: Data Nodes 1 through N each report 'I have blocks A, C. I'm alive!'; the Name Node answers 'Awesome! Thanks.' and builds its metadata: DN1: A,C; DN2: A,C; DN3: A,C; file system: File.txt = A, C.] Data Nodes send heartbeats over TCP every 3 seconds; every 10th heartbeat is a block report; the Name Node builds its metadata from the block reports. If the Name Node is down, HDFS is down.
Re-replicating missing replicas. 'Uh oh! Missing replicas.' Missing heartbeats signify lost nodes; the Name Node consults its metadata (DN1: A,C; DN2: A,C; DN3: A,C) to find the affected data, consults the rack-awareness script (Rack 1: DN1, DN2; Rack 5: DN3; Rack 9: DN8), and tells a surviving Data Node to re-replicate: 'Copy blocks A, C to Node 8.'
Secondary Name Node: not a hot standby for the Name Node. It connects to the Name Node every hour* for housekeeping and backup of the Name Node metadata ('It's been an hour, give me your metadata'); the saved metadata can rebuild a failed Name Node.
Client reading files from HDFS. Client: 'Tell me the block locations of Results.txt.' Name Node (from its metadata): 'Blk A = 1, 5, 6; Blk B = 8, 1, 2; Blk C = 5, 8, 9.' The client receives a Data Node list for each block, picks the first Data Node for each block, and reads the blocks sequentially.
Data Node reading files from HDFS. A Data Node asks: 'Tell me the locations of Block A of File.txt.' Name Node (from its metadata and rack awareness): 'Block A = 1, 5, 6.' The Name Node provides rack-local nodes first, leveraging in-rack bandwidth and a single hop.
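The read path looks just as simple from client code. A minimal sketch (again assuming a reachable HDFS and a hypothetical path): fs.open() asks the Name Node for the block locations, then streams each block from the nearest Data Node.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() fetches the block locations from the Name Node, then reads
            // each block from the closest (ideally rack-local) Data Node.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/user/demo/Results.txt"))));  // hypothetical path
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
            fs.close();
        }
    }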
Outline: What is Big Data? (Volume, Variety, Velocity). What are people doing with Big Data? Classic examples. Two basic technologies for Big Data management: data storage with HDFS (the Hadoop Distributed File System) and data processing with MapReduce.
What is MapReduce? It originated at Google [OSDI '04]: a simple, functional programming model for large-scale data processing that exploits large sets of commodity computers, executes processing in a distributed manner, and offers high availability.
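In the paper's terms, the programmer supplies only two functions: map turns an input pair into a list of intermediate <key, value> pairs, and reduce merges all values that share a key; the runtime does everything else. A sketch of the two signatures as Java generics (illustrative types only, not the Hadoop API):

    class Pair<A, B> { final A key; final B value; Pair(A k, B v) { key = k; value = v; } }

    // map:    (k1, v1)       -> list of (k2, v2)   intermediate pairs
    interface MapFn<K1, V1, K2, V2> { java.util.List<Pair<K2, V2>> map(K1 key, V1 value); }

    // reduce: (k2, list(v2)) -> list of (v2)       merged values per key
    interface ReduceFn<K2, V2> { java.util.List<V2> reduce(K2 key, java.util.List<V2> values); }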
Motivation: lots of demand for very-large-scale data processing, with common themes across these demands: lots of machines are needed (scaling), and two basic operations suffice on the input: Map and Reduce.
Peta-scale Data. [Figure: a Main program drives a pool of Parser threads and Counter threads over a DataCollection, filling a WordList and a ResultTable of KEY/VALUE pairs such as web = 2, weed = 1, green = 2, sun = 1, moon = 1, land = 1, part = 1.]
Divide and Conquer. On one node, the Main program drives Parser threads (DataCollection to WordList) and Counter threads (WordList to ResultTable). For our example: #1, schedule parallel parse tasks; #2, schedule parallel count tasks. This is a particular solution; let's generalize it (see the sketch after this slide). Our parse is a mapping operation: MAP maps the input to <key, value> pairs. Our count is a reduce operation: REDUCE merges the <key, value> pairs that share a key. Map and reduce originated in Lisp but have a different meaning here. The runtime adds distribution, fault tolerance, replication, monitoring, and load balancing to your base application!
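To make the decomposition concrete, here is a minimal sequential word count in plain Java (a sketch, not the Hadoop API), using the words from the figure above: the tokenizing loop plays the role of the parse/map step, and the per-key summation plays the role of the count/reduce step.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    public class SequentialWordCount {
        public static void main(String[] args) {
            String doc = "web weed green sun moon land part web green";
            Map<String, Integer> counts = new HashMap<String, Integer>();
            StringTokenizer tok = new StringTokenizer(doc);
            while (tok.hasMoreTokens()) {
                // "Map" step: parse the next word, conceptually a (word, 1) pair...
                String word = tok.nextToken();
                // ..."Reduce" step folded in: sum the 1s per key.
                Integer prev = counts.get(word);
                counts.put(word, prev == null ? 1 : prev + 1);
            }
            // Prints web=2, weed=1, green=2, sun=1, moon=1, land=1, part=1 (order may vary)
            System.out.println(counts);
        }
    }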
Architecture overview: the user submits a job to the Job Tracker on the master node; slave nodes 1 through N each run a Task Tracker that manages the workers on that node.
Data Processing: Map. 'How many times does Fraud appear in File.txt?' The client submits the job, and the Job Tracker (with the Name Node's help) sends a Map Task to each Data Node holding a block of File.txt, e.g. 'Count Fraud in Block C.' Data Node 1 (Block A) reports Fraud = 3, Data Node 5 (Block B) reports Fraud = 0, Data Node 9 (Block C) reports Fraud = 11. Map means: run this computation on your local data. The Job Tracker delivers Java code to the nodes that hold the data locally.
What if data isn't local? If Data Node 1 has no Map task slots left, the Job Tracker tries to select another node in the same rack as the data, using the Name Node's rack awareness: Data Node 2 runs the Map task and fetches Block A over the in-rack switch ('I need Block A'), while Data Nodes 5 and 9 still process Blocks B and C locally (Fraud = 0, Fraud = 11).
Data Processing: Reduce. The Job Tracker starts a Reduce Task ('Sum the Fraud results') on, say, Data Node 3; the Map Tasks on Data Nodes 1, 5, and 9 deliver their outputs (X, Y, Z) to it over the network, and the Reduce Task writes Results.txt, Fraud = 14, to HDFS. Reduce means: run this computation across the Map results. Map outputs travel over the network; the Reduce Task's output is written to, and later read from, HDFS.
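A sketch of what those tasks might look like against Hadoop's org.apache.hadoop.mapreduce API, assuming line-oriented text input; the class names and whitespace tokenization are illustrative, not from the slides.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class FraudCount {

        // Each Map task scans its local block of File.txt line by line
        // and emits ("Fraud", n) for the occurrences it finds.
        public static class FraudMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final Text fraud = new Text("Fraud");

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                int n = 0;
                for (String token : line.toString().split("\\s+")) {
                    if (token.equals("Fraud")) n++;
                }
                if (n > 0) context.write(fraud, new IntWritable(n));
            }
        }

        // The Reduce task sums the per-block counts, e.g. 3 + 0 + 11 = 14.
        public static class FraudReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }
    }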
Principle of MapReduce Operation
[Figure slides: the MapReduce operation, illustrated step by step.]
[Figure: large-scale data is split across Map tasks, each emitting <key, 1> pairs; a parse-hash step partitions the keys among Reducers (say, Count), producing outputs P-0000: count1, P-0001: count2, P-0002: count3.]
HADOOP MapReduce Program: WordCount.java

package net.kzk9;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper implementation: emit (word, 1) for every token in the line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer implementation: sum the 1s for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable value = new IntWritable(0);

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();
            value.set(sum);
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        // Load the configuration
        Configuration conf = new Configuration();
        // Parse the command-line arguments
        GenericOptionsParser parser = new GenericOptionsParser(conf, args);
        args = parser.getRemainingArgs();
        // Create the job
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        // Specify the Mapper/Reducer classes to use
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        // Set the key/value types used by the job
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
HADOOP MapReduce Program: WordCount.java (continued)

        // Set the input/output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job to the JobTracker and wait for completion
        boolean success = job.waitForCompletion(true);
        System.out.println(success);
    }
}

TestWordCount.java

package net.kzk9;

import net.kzk9.WordCount;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import junit.framework.TestCase;
import org.junit.Before;
import org.junit.Test;

public class WordCountTest extends TestCase {
    // The Mapper under test
    private WordCount.Map mapper;
    // Driver class for running the Mapper
    private MapDriver driver;

    @Before
    public void setup() {
        // Create the Mapper under test
        mapper = new WordCount.Map();
        // Create the MapDriver that runs the test
        driver = new MapDriver(mapper);
    }

    @Test
    public void testWordCountMapper() {
        // Set the input
        driver.withInput(new LongWritable(0), new Text("this is a pen"))
              // Set the expected outputs
              .withOutput(new Text("this"), new IntWritable(1))
              .withOutput(new Text("is"), new IntWritable(1))
              .withOutput(new Text("a"), new IntWritable(1))
              .withOutput(new Text("pen"), new IntWritable(1))
              // Run the test
              .runTest();
    }
}

#!/bin/bash
export HADOOP_HOME=/usr/lib/hadoop-0.20/
export DIR=wordcount_classes

# Set the CLASSPATH
CLASSPATH=$HADOOP_HOME/hadoop-core.jar
for f in $HADOOP_HOME/lib/*.jar; do CLASSPATH=${CLASSPATH}:$f; done

# Initialize the output directory
rm -fr $DIR
mkdir $DIR

# Compile
javac -classpath $CLASSPATH -d $DIR WordCount.java

# Create the jar file
jar -cvf wordcount.jar -C $DIR .
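Once the jar is built, the job is typically launched with something like: hadoop jar wordcount.jar net.kzk9.WordCount <input dir> <output dir>. The input and output paths here are placeholders; the exact invocation depends on your installation and classpath.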
Conclusion and Discussion: the concept of Big Data; Big Data's current state; basic technologies for data storage and analysis; and how we can join the Big Data trend.
References: Apache Hadoop: http://hadoop.apache.org/ and http://wiki.apache.org/hadoop/. Tom White, Hadoop: The Definitive Guide, 2nd edition, O'Reilly, 2010. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM 51(1), Jan. 2008, 107-113. Brad Hedlund's blog: http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/