BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP
BIG DATA APPLICATIONS
Big data has become one of the most important aspects of scientific computing and business analytics
Provides key insights to scientists and decision makers
The amount of data is exploding
CHARACTERISTICS OF BIG DATA
Commonly accepted 3Vs of big data: Volume, Velocity, Variety
Additional Vs have been proposed: Veracity, Value
M. Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/nist-stonebraker.pdf
M. Walker: Data Veracity, http://www.datasciencecentral.com/profiles/blogs/data-veracity
VOLUME
The amount of data to process
Figure 1: The digital universe: 50-fold growth from the beginning of 2010 to the end of 2020
Source: IDC's Digital Universe Study, sponsored by EMC, December 2012
"Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on."
J. Gantz and D. Reinsel: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
VOLUME
Fortress archive at Purdue
VELOCITY
How much data is generated on the internet every minute?
The global internet population: ~3.2 billion
J. James: Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0
VARIETY
Data in many forms: database, photo, video, audio, web data
VARIOUS DOMAINS
Biology, Physics, Finance, Computer Science
BIG DATA ANALYTICS
Finding the value
Need a set of powerful tools!
OUTLINE
Introduction to big data analytics
Tools available: MapReduce, Hadoop, Spark
ITaP resource: the Hathi platform
Examples
Summary
TRADITIONAL METHODS
R, SAS, Matlab, C/C++, MPI, Java, Python
Challenges:
  Big data needs parallel processing
  Programmers need to handle parallelization explicitly: work distribution, data distribution, communication, fault tolerance
  Time to solution is too long
We need a set of tools that frees programmers from all of the above and lets them focus on the problem logic
MAPREDUCE PARADIGM
The programmer writes map and reduce functions that run in parallel
Map
  Input: <key_in, val_in> pairs
  Output: list of <key_i, val_i> pairs
  <key_in, val_in> -> map -> list<key_i, val_i>
Reduce
  Input: <key_j, list(val)> pairs
  Output: <key_out, val_out> pairs
  <key_j, list(val)> -> reduce -> <key_out, val_out>
MAPREDUCE RUNTIME
Executes map and reduce functions in parallel
Communicates all the values associated with the same key from map to reduce in the shuffle stage
  <key_in, val_in> -> map -> list<key_i, val_i> -> shuffle -> <key_j, list(val)> -> reduce -> <key_out, val_out>
Popular runtime libraries support this model: Hadoop, Spark, MapReduce-MPI (MR-MPI)
Hadoop: https://hadoop.apache.org
Spark: http://spark.apache.org
MapReduce-MPI: http://mapreduce.sandia.gov
EXAMPLE - WORDCOUNT
Count the occurrence of each word in a collection of text documents
Process 0: "This is a test."
  map -> <This, 1> <is, 1> <a, 1> <test., 1>
  shuffle -> <This, {1, 1}> <is, {1, 1}>
  reduce -> <This, 2> <is, 2>
Process 1: "This is also a test."
  map -> <This, 1> <is, 1> <also, 1> <a, 1> <test., 1>
  shuffle -> <a, {1, 1}> <test., {1, 1}> <also, {1}>
  reduce -> <a, 2> <test., 2> <also, 1>
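The flow above can be sketched in plain Python. The function names (map_fn, reduce_fn, map_reduce) are illustrative, not part of any MapReduce library; a real runtime would run the mappers and reducers in parallel across nodes, but the key/value contracts are the same:

```python
from collections import defaultdict

def map_fn(doc):
    # Map: emit a <word, 1> pair for every token in the document.
    return [(word, 1) for word in doc.split()]

def reduce_fn(key, values):
    # Reduce: sum all counts associated with one word.
    return (key, sum(values))

def map_reduce(docs):
    # Shuffle: group every value emitted by the mappers by its key.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Apply reduce once per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = map_reduce(["This is a test.", "This is also a test."])
print(counts["This"], counts["a"], counts["also"])  # 2 2 1
```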
MAPREDUCE ADVANTAGES
Simplicity
  Developer's choice of language: C/C++, Java, Python, R, etc.
  Automatic parallelization and communication in a restricted manner
  Built-in fault tolerance
Performance
  Brings computation to the data location
Scalability
  Petabytes of data, tens of thousands of compute nodes
HADOOP
A library that supports the execution of MapReduce applications
Uses the Hadoop Distributed File System (HDFS) for data storage
Uses Hadoop NextGen MapReduce (YARN) for MapReduce application scheduling and execution
[Stack diagram: MR, Pig, and Hive run on YARN, which runs on HDFS]
HADOOP - HDFS
Master-slave architecture
  A single NameNode; one DataNode per compute node in the cluster
Storage and access model
  Each file is stored as a sequence of blocks of the same size
  Each block is replicated multiple times
  Write-once-read-many access model for files
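To get a feel for the storage model, here is a small sketch that computes how many blocks and how much raw storage a file consumes; the 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults, assumed here for illustration (a cluster may configure different values):

```python
import math

BLOCK_SIZE_MB = 128   # dfs.blocksize default in Hadoop 2.x (assumed for this sketch)
REPLICATION = 3       # dfs.replication default (assumed for this sketch)

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB) for one file."""
    # Files are split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every byte is stored REPLICATION times across DataNodes.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1000)  # a 1000 MB file
print(blocks, raw)  # 8 3000
```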
HADOOP - YARN
Master-slave architecture
  A single ResourceManager, one NodeManager per compute node in the cluster, and one ApplicationMaster per application
ResourceManager
  Manages the use of resources across the cluster
NodeManager
  Launches and monitors containers; a container executes an application-specific process
ApplicationMaster
  Negotiates resource containers from the ResourceManager
  Tracks their status and progress
ANATOMY OF A YARN APPLICATION RUN
1. A client contacts the ResourceManager and asks it to run an ApplicationMaster process.
2. The ResourceManager finds a NodeManager that can launch the ApplicationMaster in a container.
3. The ApplicationMaster may run a computation in its container or request more containers from the ResourceManager.
T. White: Hadoop: The Definitive Guide, 4th Edition
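The steps above can be modeled as a toy simulation. The classes, methods, and node names here are illustrative placeholders, not the real YARN API; the point is only the division of roles (ResourceManager allocates, NodeManagers launch containers):

```python
class NodeManager:
    """Toy model: one NodeManager per compute node, launching containers."""
    def __init__(self, name):
        self.name = name
        self.containers = []

    def launch_container(self, process):
        # A container executes one application-specific process.
        self.containers.append(process)
        return f"{process} on {self.name}"

class ResourceManager:
    """Toy model: allocates containers across the cluster's NodeManagers."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, process):
        # Pick the least-loaded node to host the new container.
        node = min(self.node_managers, key=lambda nm: len(nm.containers))
        return node.launch_container(process)

# 1+2. The client asks the ResourceManager to run an ApplicationMaster,
#      and the ResourceManager launches it in a container on some node.
rm = ResourceManager([NodeManager("hathia000"), NodeManager("hathia001")])
am = rm.allocate("ApplicationMaster")
# 3. The ApplicationMaster requests more containers for the actual tasks.
tasks = [rm.allocate(f"map task {i}") for i in range(3)]
print(am)  # ApplicationMaster on hathia000
```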
EXAMPLE - WORDCOUNT
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

47 Lines of Code!
HADOOP ECOSYSTEM
Hadoop Ecosystem Overview: http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
SPARK
Extends the MapReduce model to support more types of computations, such as interactive queries and stream processing
Components: Spark SQL, Spark Streaming, MLlib, GraphX on the Spark core
Scheduler: Standalone, YARN, Mesos, EC2
Storage: HDFS, local FS, Amazon S3
SPARK PROGRAMMING ABSTRACTION
Resilient Distributed Dataset (RDD)
  A distributed collection of items across cluster nodes
Operations
  Transformations construct a new RDD from a previous one: filter(), map(), sample()
  Actions compute a result based on an RDD: reduce(), collect(), count()
Transformations are only computed when an action requires a result to be returned to the driver program (lazy evaluation)
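Lazy evaluation can be illustrated with a toy stand-in for an RDD. This is not the Spark API, just a sketch of the idea: transformations only record work and return a new dataset object, while an action runs the recorded pipeline and returns a result to the driver:

```python
class ToyRDD:
    """Illustrative stand-in for an RDD (not the real Spark API)."""
    def __init__(self, items, pending=None):
        self.items = items
        self.pending = pending or []  # recorded, not-yet-run transformations

    # Transformations: record the operation, return a new ToyRDD, do no work.
    def map(self, fn):
        return ToyRDD(self.items, self.pending + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self.items, self.pending + [("filter", fn)])

    # Actions: run the recorded pipeline and return a result to the driver.
    def collect(self):
        items = self.items
        for kind, fn in self.pending:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; the work happens only at the action:
print(rdd.count())    # 5
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

A real RDD adds partitioning across nodes and lineage-based fault tolerance on top of this same record-then-run pattern.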
ANATOMY OF A SPARK APPLICATION RUN
The driver program launches parallel operations on a cluster and manages a number of executors
A SparkContext object represents a connection to a cluster
  Builds RDDs
  Runs operations on RDDs
EXAMPLE - WORDCOUNT
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")
text_file = sc.textFile("hdfs://user/myname/text.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://user/myname/count.txt")
sc.stop()

6 Lines of Code!
SPARK ADVANTAGES
Flexibility
  Supports a wider range of workflows in addition to MapReduce
  Provides APIs in Python, Scala, Java, SQL, and R
Performance
  Runs computations in memory
  2014 Daytona Gray Sort 100 TB benchmark
  [Chart: sort rate per node (GB/min), Hadoop vs. Spark]
HATHI OVERVIEW
Compute nodes (hathia000 through hathia005) run user jobs
Front-end nodes (hathi-fe00, hathi-fe01) handle user log-in, simple file manipulations, and other miscellaneous operations
The adm node (hathi-adm) runs the Hadoop master daemons
HATHI SUPPORTED LIBRARIES
Hadoop MapReduce: Java API; Hadoop Streaming to run any executable
Spark: Scala, Python, and R APIs
Hive: SQL-like language
Pig: Pig Latin statements
SUBMITTING TO HATHI
Prepare input
  hdfs dfs -copyFromLocal test.txt /user/my/input
Submit the job
  hadoop jar wordcount.jar org.myorg.WordCount /user/my/input /user/my/output
Monitor the job
  Command-line output and web interfaces:
  HDFS: http://hathi-adm.rcac.purdue.edu:50070
  All applications: http://hathi-adm.rcac.purdue.edu:8088
  Hadoop JobHistory server: http://hathi-adm.rcac.purdue.edu:19888
  Spark History Server: http://hathi-adm.rcac.purdue.edu:18080
Retrieve output
  hdfs dfs -copyToLocal /user/my/output output
SUMMARY
MapReduce and related tools are key to big data analytics
Hadoop and Spark are popular runtime libraries that support MapReduce-style operations
Hathi is a free resource available to students working on projects with professors (https://www.rcac.purdue.edu/compute/hathi/guide/)