Big Data Analytics*
CS 5331 by Rattikorn Hewett, Texas Tech University

Outline

- Big Data
- Data Analytics: Challenges and Issues
- Misconceptions
- Big Data Infrastructure
- Scalable Distributed Computing: Hadoop
- Programming in Hadoop: MapReduce Paradigm
- Example
- MongoDB
- Some Comparisons & Summary

Big Data Issues

Laney [2001] first defined 3 V's in Big Data management:
- Volume: size continues increasing
- Variety: different types of data (e.g., text, sensor data, audio, video, graph)
- Velocity: streams of data

Now two more V's:
- Variability: in user interpretation and changes in the structure of the data
- Value: business value to organizations in decision-making

Big Data Common Issues

- Scalability
- Speed
- Accuracy
- Trust, Provenance, Privacy
- Interactiveness
Some Tools

- Apache Hadoop: a platform for distributed parallel computing with the MapReduce programming paradigm (covered later)
- Other Apache family projects: Pig, Hive, HBase, ZooKeeper, Cassandra, Cascading, etc.
- Apache S4: a platform for real-time processing of continuous data streams
- Storm: streaming data for distributed applications; similar to S4, open-sourced by Twitter

Data Analytics: Challenges

- Analytic architecture: how to deal with historic and real-time data at the same time?
  Batch layer (e.g., Hadoop) + serving layer + speed layer (e.g., Storm)
- Statistical significance: huge data and many questions asked at once can yield random answers
- Distributed analytics: not all existing algorithms can be parallelized or distributed
- Time-evolving data: analytics must be able to detect and adapt to evolving data

Data Analytics: Challenges (cont.)

- Data store and analytics: compression (more time, less space) vs. sampling (loses information)
- Visualization: hard to visualize huge data and answer many questions at once
- Tagged data analytics: only about 3% of data are tagged, and even fewer are analyzed

Big Data: Some Misconceptions

- Bigger data are not always better
- Big data analytics does not necessarily use MapReduce and Hadoop
- In real-time data analytics, data size is not as important as data recency
- Accuracy can be misleading: the number of spurious correlations grows as the number of variables grows
Outline

- Big Data
- Data Analytics: Challenges and Issues
- Misconceptions
- Big Data Infrastructure
- Scalable Distributed Computing: Hadoop
- Programming in Hadoop: MapReduce Paradigm
- Example
- MongoDB
- Some Comparisons & Summary

What is Hadoop?

- A general-purpose storage and data analysis platform
- Open source Apache software, implemented in Java
- Enables easier parallel processing
- Designed to run on a set of computers (cheap commodity hardware)
- Consists mainly of:
  - A distributed file system (HDFS)
  - The MapReduce programming framework

Sequential vs. Parallel Processing

- Sequential processing: the program has a single code path from beginning to end; typical data analytics tasks require more
- Parallel processing: the program executes multiple code paths simultaneously
  - Almost all apps are multi-threaded/multi-tasked, e.g., an email app displays mail while downloading new mail
  - Data analytics: different analytics tasks run on different machines in a parallel/distributed system

Why We Need Parallel Processing

- End of Moore's law (the observation that the number of transistors on a chip doubles every two years): CPUs gain more cores, but clock speed stays the same
  - So: process on multiple cores and multiple machines simultaneously. BUT writing parallelized code is hard.
- What about hard drives? Cheap and plentiful, BUT slow to read and write
  - So: split data across multiple hard drives and read/write simultaneously. BUT connected machines are failure prone, so code must deal with these failures.
- We need parallel processing, but it is hard
But Parallel Processing Is Not New

High Performance Computing (HPC) and Grid Computing:
- Distribute work across a cluster of machines
- Access a shared file system (hosted by a SAN, Storage Area Network)
- Pro: good for compute-intensive jobs
- Con: poor for data-intensive jobs that access large data volumes
  - Network bandwidth becomes the bottleneck
  - Compute nodes sit under-utilized while waiting for data

Simple Hadoop Example: Word Count

- Task: count the number of occurrences of each word
- Input: large text file(s) with space-separated words
- Output: text file with a count of each distinct word
- Why use Hadoop for this task?
  - Lots of I/O; Hadoop saves time (hundreds of seconds for GB-sized files)
  - Easy to parallelize: count words on different pieces of the file independently
  - Matches the MapReduce paradigm

Word Count: Sequential Solution

Repeat
  Read a line
  Recognize words in the line
  Update frequencies in a hash table
Until no more text lines to be read

- Works, but performance is limited by the disk read speed
- May run out of RAM for the hash table
(A runnable sketch of this loop appears below.)

Parallelized Word Count: First Try

- Single machine, multiple threads: each thread processes a portion of the text; combine the results of each thread
- Problem: performance is still limited by the read speed of the hard drive, since all threads share one file reader; it may even be slower than sequential processing
- Adding machines won't help, and the hash table can still run out of RAM

(Diagram: a single file reader feeding threads the text "A quick brown fox jumps over the lazy dog")
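To make the sequential loop concrete, here is a minimal, runnable Java sketch; the file name input.txt is a hypothetical placeholder. It mirrors the pseudocode exactly: read a line, recognize words, update a hash table.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class SequentialWordCount {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>(); // word -> frequency
        try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {     // Read a line
                for (String word : line.split("\\s+")) {     // Recognize words in the line
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum); // Update frequency in the hash table
                    }
                }
            }
        }
        counts.forEach((w, c) -> System.out.println(w + " " + c));
    }
}

The multi-threaded "first try" would split the loop body across threads, but as the slide notes, every thread still waits on the same disk, and the counts map must still fit in one machine's RAM.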
Parallelized Word Count: 2nd Try

Use multiple servers and a central file server:
- The text file is on the file server
- Each server fetches a portion of the text from the file server and counts words; the word counts are then combined
- Result: faster than sequential, but speed relies on the file server
- Problem: bandwidth/network speed becomes the bottleneck

(Diagram: file server holding "A quick brown fox jumps over the lazy dog", with several servers each counting words on a portion of the text)

Parallelized Word Count: Distributed Data

0. Spread the text across the servers
1. Each server gets the count for one portion of the text
2. Combine the counts from all of the servers

- No centralized file server: a portion of the data is stored on each server (compute node) for processing
- Reduces network traffic and bandwidth

Parallelized Word Count: Distributed Data (Steps)

1. Distribute a portion of text to individual machines
   - Must keep track of which portions are unprocessed, pending, or processed
2. If a machine goes down, detect it and restart its portion of the work
3. Combine all the results
   - What if servers do not finish at the same time?

None of these steps is Hadoop-specific: any hand-rolled distributed solution must take care of all of these potential problems itself.

Hadoop Distributed File System (HDFS)

Issues and motivations:
- "Big Data" requires multiple disks
- Mainframes are expensive, while clusters are cheap
- Data management in clusters can fail
- "Big Data" requires efficient data access and processing that are platform independent
HDFS: Goals and Features

- Cluster environment
- Fault resistant: detects faults and provides quick, automatic recovery
- Move computation, not data: applications move themselves to where the data is located
- Streaming data access: batch processing with high throughput of data access
- Portability: portable across heterogeneous hardware and software platforms

HDFS: Basic Architecture

- NameNode: the master server; manages the file system namespace and regulates access to files by clients
- DataNodes: manage storage attached to the nodes they run on
- Files are split into one or more blocks (typically 64 MB) that are stored in a set of DataNodes; blocks are replicated for fault tolerance
- A client reads data from HDFS by asking the NameNode for block locations, then reading the blocks directly from the DataNodes*

* From p. 63 in http://ce.sysu.edu.cn/hope/uploadfiles/education/2011/10/201110221516245419.pdf

HDFS Architecture

(Figure: NameNode managing the namespace; DataNodes holding replicated blocks; clients performing reads and writes. A sketch of the client read path follows below.)

MapReduce Paradigm

A programming model for processing large data sets:
- A map function processes a key/value pair to generate a set of intermediate key/value pairs
- A reduce function merges all intermediate values associated with the same intermediate key

MapReduce provides:
- Automatic parallelization and distribution
- Fault tolerance
- I/O scheduling
- Status and monitoring

A job is divided into tasks, which are scheduled close to the data's location.

http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0002.html
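A minimal sketch of that client read path using the HDFS Java API; the path /user/demo/input.txt is a hypothetical example. The FileSystem client contacts the NameNode for metadata, while the file's bytes stream directly from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // client; asks the NameNode for metadata
        Path path = new Path("/user/demo/input.txt"); // hypothetical file
        try (FSDataInputStream in = fs.open(path);    // block locations come from the NameNode
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);            // bytes are streamed from DataNodes
            }
        }
    }
}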
MapReduce Process

Steps:
1. Split the data across servers into multiple inputs, and keep track of them
2. Run the Map code on the individual inputs to produce key/value pairs
3. Shuffle the outputs: send each output to the processor assigned its key
4. Run the Reduce code to aggregate the outputs for each key
5. Produce the final output: collect the reduce results and generate the final output

Programming Model

The user specifies two basic functions:

map (in_key, in_value) -> list(out_key, intermediate_value)
  Takes an input key/value pair and generates a set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values that share the same key to generate a set of merged output values (usually one)

Inspired by the map and reduce operators in Lisp and other functional programming languages.

Pseudocode: Word Count

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Execution

(Figure: the MapReduce execution overview from the Google OSDI'04 slides — input splits feeding map tasks, intermediate files, and reduce tasks producing output files.)
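Before looking at real Hadoop code, here is a toy single-process simulation of the map -> shuffle -> reduce flow (plain Java, not Hadoop). It shows how the intermediate (word, 1) pairs from the map step are grouped by key in the shuffle step before the reduce step sums them; the input lines are made up for illustration.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> lines = List.of("a quick brown fox", "a lazy dog");

        // Map phase: each line -> (word, 1) pairs.
        // Shuffle phase: group the pairs by key, i.e., by word.
        Map<String, List<Integer>> shuffled = lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .collect(Collectors.groupingBy(w -> w,
                     Collectors.mapping(w -> 1, Collectors.toList())));

        // Reduce phase: sum the list of counts for each key.
        shuffled.forEach((word, ones) ->
            System.out.println(word + " " +
                ones.stream().mapToInt(Integer::intValue).sum()));
    }
}

In real MapReduce the three phases run on different machines and the shuffle moves data over the network; the grouping logic, however, is exactly this.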
Parallel Execution

(Figure: parallel MapReduce execution — map tasks over input partitions, shuffle, and reduce tasks over intermediate partitions.)

MapReduce in Hadoop

- JobTracker: schedules tasks, monitors them, and re-executes failed tasks
- TaskTrackers: one per cluster node; execute tasks as directed by the JobTracker
- MapReduce and HDFS run on the same set of nodes, so tasks can be scheduled where the data are already present
- USER: specifies input/output locations and the Map/Reduce code
- The Hadoop framework is implemented in Java, BUT MapReduce code can be written in Python, Ruby, C++, etc.

http://ce.sysu.edu.cn/hope/uploadfiles/education/2011/10/201110221516245419.pdf

WordCount.java: Map

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line); // separates by spaces
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one); // emits (w, 1)
    }
  }
}

http://wiki.apache.org/hadoop/wordcount

WordCount.java: Reduce

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;                      // e.g., receives (w, (1, 1, 1))
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum)); // emits (w, 3)
  }
}

http://wiki.apache.org/hadoop/wordcount
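The slides stop at the Map and Reduce classes. For completeness, a minimal driver sketch that wires them into a job, following the standard WordCount example; it assumes an enclosing class named WordCount and the Hadoop 2.x Job API.

// Inside the enclosing WordCount class, alongside Map and Reduce.
// Imports assumed: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.IntWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapreduce.Job,
// org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "wordcount");
  job.setJarByClass(WordCount.class);           // locate the jar to ship to the cluster
  job.setMapperClass(Map.class);                // the Map class above
  job.setReducerClass(Reduce.class);            // the Reduce class above
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
  FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}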
Running Hadoop

An example command line:

$ hadoop fs -mkdir brown
$ hadoop fs -put /corpora/icame/texts/brown1 brown/input        (data input file)
$ hadoop jar /opt/hadoop/hadoop-examples-*.jar wordcount brown/input brown/output
              (code)                            (arguments)

http://depts.washington.edu/uwcl/twiki/bin/view.cgi/main/hadoopwordcountexample

Outline

- Big Data
- Data Analytics: Challenges and Issues
- Misconceptions
- Big Data Infrastructure
- Scalable Distributed Computing: Hadoop
- Programming in Hadoop: MapReduce Paradigm
- Example
- MongoDB
- Some Comparisons & Summary

MongoDB (by 10gen)

For Big Data, an RDBMS can't handle:
- High-volume data
- Variety: unstructured data

MongoDB's solution: store documents instead of normalized relational rows.

(Figures: the same data normalized across multiple tables in a relational DB vs. embedded in a single document in a document DB; a sketch of the document form follows below.)
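To illustrate the document model, a minimal sketch using the MongoDB Java driver's org.bson.Document; the blog-post schema here is hypothetical. What a relational design would normalize into separate author, tags, and comments tables is embedded in one document.

import org.bson.Document;
import java.util.List;

public class DocumentModelDemo {
    public static void main(String[] args) {
        // One denormalized document embedding what would be separate relational rows
        Document post = new Document("title", "Intro to Big Data")
            .append("author", new Document("name", "Jane").append("email", "jane@example.com"))
            .append("tags", List.of("hadoop", "mongodb"))
            .append("comments", List.of(
                new Document("user", "bob").append("text", "Nice post"),
                new Document("user", "ann").append("text", "Thanks!")));
        System.out.println(post.toJson()); // printed as JSON; stored as BSON on disk
    }
}

Note that nothing forces every post to have the same fields: another document in the same collection could omit tags entirely or add new fields, which is the schema flexibility the slides contrast with a static relational schema.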
MongoDB

- Implemented in C++; data serialized to BSON
- Runs nearly everywhere; memory caching
- Combines the best features of key/value stores, document DBs, and relational DBs in one
- Designed as an operational DB for Big Data; it is not a data processing engine

MongoDB MapReduce

- MongoDB's built-in MapReduce is quite capable but limited by its JavaScript engine (a sketch follows below)
- For heavy processing needs: MongoDB MapReduce -> Hadoop MapReduce

Outline

- Big Data
- Data Analytics: Challenges and Issues
- Misconceptions
- Big Data Infrastructure
- Scalable Distributed Computing: Hadoop
- Programming in Hadoop: MapReduce Paradigm
- Example
- MongoDB
- Some Comparisons & Summary

RDBMS & MapReduce

            Traditional RDBMS        MapReduce
Data size   Giga/terabytes           Peta/exabytes
Access      Interactive & batch      Batch
Updates     Read/write many times    Write once, read many times
Structure   Static schema            Dynamic schema
Integrity   High (ACID)              Low
Scaling     Nonlinear                Linear

http://ce.sysu.edu.cn/hope/uploadfiles/education/2011/10/201110221516245419.pdf
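A sketch of MongoDB's built-in MapReduce for the same word count, driven from Java; the database and collection names are hypothetical, the documents are assumed to have a "text" field, and the synchronous Java driver's mapReduce method (3.x-style API) is assumed available. The map and reduce functions are JavaScript source strings executed inside the server's JS engine, which is what limits heavy workloads.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoWordCount {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> docs =
                client.getDatabase("demo").getCollection("docs");

            // JavaScript, run server-side: emit (word, 1) per word, then sum per word
            String map = "function() { this.text.split(' ')"
                       + ".forEach(function(w) { emit(w, 1); }); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            for (Document d : docs.mapReduce(map, reduce)) {
                System.out.println(d.toJson()); // { _id: word, value: count }
            }
        }
    }
}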
Hadoop & MongoDB*

MongoDB:
- A NoSQL datastore: document-oriented and schemaless, providing looser data consistency models (e.g., one record can have 5 fields while another has 8)
- Designed for real-time processing and storing Big Data

Hadoop:
- An open source implementation of the MapReduce technology
- Designed for analytical purposes

In short: Hadoop is to OLAP as MongoDB is to OLTP.

Summary

- HPC parallel processing is good for computation-intensive work, but limited by disk access and network speed bottlenecks
- Hadoop parallel processing is cheap and easy, with high I/O throughput
  - It takes care of the problems of processing distributed data, so users can focus on the core algorithm
  - It is not the be-all and end-all solution for parallel processing
  - It is not just for toy problems like Word Count: MapReduce has been used to index the web at Google, and Hadoop at Yahoo!
- MongoDB is a powerful data store for Big Data that can be used along with Hadoop for complex data analytics

* See http://www.slideshare.net/spf13/mongodb-and-hadoop