BIG DATA APPLICATIONS

1 BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang, Research Computing, ITaP

2 BIG DATA APPLICATIONS Big data has become one of the most important aspects of scientific computing and business analytics. It provides key insights to scientists and decision makers. The amount of data is exploding.

3 CHARACTERISTICS OF BIG DATA Commonly accepted 3Vs of big data: Volume, Velocity, Variety. Additional Vs: Veracity, Value. Sources: M. Stonebraker: Big Data Means at Least Three Different Things; M. Walker: Data Veracity.

4 VOLUME The amount of data to process.
Figure 1: The digital universe: 50-fold growth from the beginning of 2010 to the end of 2020. Source: IDC's Digital Universe Study, sponsored by EMC, December.
"Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and ..."
Source: J. Gantz and D. Reinsel: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East.

5 VOLUME Fortress archive at Purdue

6 VELOCITY How much data is generated on the internet every minute? The global internet population: ~3.2 billion. Source: J. James: Data Never Sleeps 3.0.

7 VARIETY Data in many forms: database, photo, video, audio, web data.

8 VARIOUS DOMAINS Biology, physics, finance, computer science.

9 BIG DATA ANALYTICS Finding the value. Need a set of powerful tools!

10 OUTLINE Introduction to big data analytics. Tools available: MapReduce, Hadoop, Spark. ITaP resource Hathi: platform, examples. Summary.

11 TRADITIONAL METHODS R, SAS, Matlab, C/C++, MPI, Java, Python. Challenges: parallel processing is needed; programmers need to handle parallelization explicitly (work distribution, data distribution, communication, fault tolerance); time to solution is way too long. Need a set of tools that free programmers from all of the above and let them focus on the problem logic.

12 MAPREDUCE PARADIGM The programmer writes map and reduce functions that run in parallel.
Map: input is <key_in, val_in> pairs; output is a list of <key_i, val_i> pairs. <key_in, val_in> -> map -> list<key_i, val_i>
Reduce: input is <key_j, list(val)> pairs; output is <key_out, val_out> pairs. <key_j, list(val)> -> reduce -> <key_out, val_out>

13 MAPREDUCE RUNTIME Executes map and reduce functions in parallel. Communicates all the values associated with the same key from map to reduce in the shuffle stage:
<key_in, val_in> -> map -> list<key_i, val_i> -> shuffle -> <key_j, list(val)> -> reduce -> <key_out, val_out>
Popular runtime libraries support this model: Hadoop, Spark, MapReduce-MPI (MR-MPI).
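
The shuffle stage can be pictured as routing map output in two steps: each key is assigned to one reduce task, and all values emitted for that key are collected into a list. The sketch below is a single-process illustration in plain Python, not how Hadoop or Spark implement it. The function name `shuffle`, the CRC32-based partitioner, and the choice of two reduce tasks are assumptions made for the example.

```python
import zlib
from collections import defaultdict


def shuffle(map_outputs, num_reducers):
    """Route (key, value) pairs from all map tasks to reduce tasks.

    Returns one dict per reduce task, mapping each key to the list of
    values emitted for it by any map task.
    """
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:  # one list of pairs per map task
        for key, value in pairs:
            # Assumed partitioner: CRC32 of the key modulo the reducer count,
            # so the same key always lands on the same reduce task.
            reducer_id = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[reducer_id][key].append(value)
    return [dict(p) for p in partitions]


map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],  # output of map task 0
    [("b", 1), ("c", 1)],            # output of map task 1
]
partitions = shuffle(map_outputs, num_reducers=2)
for reducer_id, groups in enumerate(partitions):
    print(reducer_id, groups)
```

Whatever partitioner is used, the property that matters is that every value for a given key ends up at exactly one reduce task.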

14 EXAMPLE - WORDCOUNT Count the occurrences of each word in a collection of text documents.
Process 0: input "This is a test." map -> <This, 1> <is, 1> <a, 1> <test., 1>; after shuffle: <This, {1,1}> <is, {1,1}>; reduce -> <This, 2> <is, 2>
Process 1: input "This is also a test." map -> <This, 1> <is, 1> <also, 1> <a, 1> <test., 1>; after shuffle: <a, {1,1}> <test., {1,1}> <also, {1}>; reduce -> <a, 2> <test., 2> <also, 1>
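
The word count above can be traced end to end in a few lines of plain Python. This is a minimal sketch that runs in one process and only mirrors the map, shuffle, and reduce steps of the slide; the function names `map_words`, `group_by_key`, and `reduce_counts` are made up for the illustration.

```python
from collections import defaultdict


def map_words(document):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in document.split()]


def group_by_key(pairs):
    """Shuffle: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)


documents = ["This is a test.", "This is also a test."]

mapped = [pair for doc in documents for pair in map_words(doc)]
grouped = group_by_key(mapped)
counts = dict(reduce_counts(k, v) for k, v in grouped.items())
print(counts)
# {'This': 2, 'is': 2, 'a': 2, 'test.': 2, 'also': 1}
```

As on the slide, "test." keeps its period because the tokens are split on whitespace only.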

15 MAPREDUCE ADVANTAGES Simplicity: developer's choice of language (C/C++, Java, Python, R, etc.); automatic parallelization and communication in a restricted manner; built-in fault tolerance. Performance: brings computation to the data location. Scalability: petabytes of data, tens of thousands of compute nodes.

16 HADOOP A library that supports the execution of MapReduce applications. Uses the Hadoop Distributed File System (HDFS) for data storage. Uses Hadoop NextGen MapReduce (YARN) for scheduling and executing MapReduce applications. Diagram layers: MR, Pig, Hive; YARN; HDFS.

17 HADOOP - HDFS Master-slave architecture: a single NameNode and one DataNode per compute node in the cluster. Storage and access model: each file is stored as a sequence of blocks of the same size; each block is replicated multiple times; files follow a write-once-read-many access model.
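
The storage model lends itself to a quick back-of-the-envelope calculation. The sketch below assumes a 128 MiB block size and a replication factor of 3, which are common Hadoop defaults; the actual settings on a given cluster may differ. It also assumes that the final block of a file only holds the remaining bytes. The function name `hdfs_block_layout` is hypothetical.

```python
MIB = 1024 ** 2


def hdfs_block_layout(file_size, block_size=128 * MIB, replication=3):
    """Return (number of blocks, bytes in the last block, raw bytes stored)."""
    if file_size == 0:
        return 0, 0, 0
    num_blocks = -(-file_size // block_size)  # integer ceiling division
    last_block = file_size - (num_blocks - 1) * block_size
    raw_bytes = file_size * replication       # every block is stored `replication` times
    return num_blocks, last_block, raw_bytes


blocks, last, raw = hdfs_block_layout(300 * MIB)
print(blocks, last // MIB, raw // MIB)
# 3 44 900
```

Under these assumptions a 300 MiB file occupies three blocks (128 + 128 + 44 MiB) and 900 MiB of raw disk space across the cluster.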

18 HADOOP - YARN Master-slave architecture: a single ResourceManager, one NodeManager per compute node in the cluster, and one ApplicationMaster per application. ResourceManager: manages the use of resources across the cluster. NodeManager: launches and monitors containers; a container executes an application-specific process. ApplicationMaster: negotiates resource containers from the ResourceManager; tracks status and progress.

19 ANATOMY OF A YARN APPLICATION RUN A client contacts the ResourceManager and asks it to run an ApplicationMaster process. The ResourceManager finds a NodeManager that can launch the ApplicationMaster in a container. The ApplicationMaster may run a computation in the container or request more containers from the ResourceManager. Source: T. White: Hadoop: The Definitive Guide, 4th Edition.

20 EXAMPLE - WORDCOUNT

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: emit <word, 1> for every token in the input line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce: sum the counts received for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: configure the job; args[0] is the input path, args[1] the output path.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

47 Lines of Code!

21 HADOOP ECOSYSTEM Hadoop Ecosystem Overview

22 SPARK Extends the MapReduce model to support more types of computations, such as interactive queries and stream processing. Components on top of the Spark core: Spark SQL, Spark Streaming, MLlib, GraphX. Scheduler: Standalone, YARN, Mesos, EC2. Storage: HDFS, local FS, Amazon S3.

23 SPARK PROGRAMMING ABSTRACTION Resilient Distributed Dataset (RDD): a distributed collection of items among cluster nodes. Operations: transformations construct a new RDD from a previous one; actions compute a result based on an RDD. Transformations: filter(), map(), sample(). Actions: reduce(), collect(), count(). Transformations are only computed when an action requires a result to be returned to the driver program (lazy evaluation).
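
Lazy evaluation can be illustrated without a cluster. The toy class below is a hypothetical stand-in for an RDD, written in plain Python; it is not part of Spark and ignores distribution entirely. Transformations only record what to do, and the recorded steps run when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not a Spark class)."""

    def __init__(self, items, steps=()):
        self._items = items
        self._steps = tuple(steps)  # recorded transformations, not yet run

    # Transformations: return a new dataset and do no work.
    def map(self, fn):
        return LazyDataset(self._items, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._items, self._steps + (("filter", pred),))

    def _evaluate(self):
        for item in self._items:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger evaluation of all recorded steps.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


data = LazyDataset(range(6))
evens_squared = data.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # still 0: nothing has been computed yet
result = evens_squared.collect()   # the action runs map and filter
print(calls_before_action, result)
# 0 [0, 4, 16]
```

Calling another action on `evens_squared` re-runs the recorded steps from the start, since this toy keeps no cached results.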

24 ANATOMY OF A SPARK APPLICATION RUN The driver program launches parallel operations on a cluster and manages a number of executors. The SparkContext object represents a connection to a cluster; it builds RDDs and runs operations on RDDs.

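25 EXAMPLE - WORDCOUNT

    from pyspark import SparkContext

    sc = SparkContext(appName="PythonWordCount")
    # hdfs:/// (three slashes) uses the default NameNode; with two slashes,
    # "user" would be parsed as the host name.
    text_file = sc.textFile("hdfs:///user/myname/text.txt")
    # Split lines into words, emit (word, 1), then sum the counts per word.
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs:///user/myname/count.txt")
    sc.stop()

6 Lines of Code!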

26 SPARK ADVANTAGES Flexibility: supports a wider range of workflows in addition to MapReduce; provides APIs in Python, Scala, Java, SQL, and R. Performance: runs computation in memory. Figure: 2014 Daytona Gray Sort 100 TB Benchmark, sort rate per node (GB/min), Hadoop vs. Spark.

27 HATHI OVERVIEW Compute nodes run user jobs. Front-end nodes handle user login, simple file manipulations, and other miscellaneous operations. The adm node runs the Hadoop master daemons. Nodes in the diagram: hathi-fe00, hathi-fe01, hathi-adm, hathia000, hathia001, hathia002, hathia003, hathia004, hathia005.

28 HATHI SUPPORTED LIBRARIES Hadoop MapReduce: Java API; Hadoop Streaming to run any executable. Spark: Scala, Python, and R APIs. Hive: SQL-like language. Pig: Pig Latin statements.
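
Hadoop Streaming passes data to the executable as text lines on standard input and reads tab-separated key/value lines from standard output, with the reducer receiving its input sorted by key. The sketch below shows word-count logic in that style, under the assumption that each function would be wrapped in its own script that loops over `sys.stdin` and prints each result. Here the two functions are exercised locally on an in-memory list, with `sorted()` standing in for the sort that the framework performs between the two phases. The function names are illustrative.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input lines must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() plays the role of the sort between map and reduce.
map_output = sorted(mapper(["This is a test.", "This is also a test."]))
reduce_output = list(reducer(map_output))
print(reduce_output)
# ['This\t2', 'a\t2', 'also\t1', 'is\t2', 'test.\t2']
```

Because the reducer relies on `groupby`, it only works when equal keys arrive on consecutive lines, which is what the sorted input provides.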

29 SUBMITTING TO HATHI
Prepare input:
    hdfs dfs -copyFromLocal test.txt /user/my/input
Submit the job:
    hadoop jar wordcount.jar org.myorg.WordCount /user/my/input /user/my/output
Monitor the job: command line output and web interfaces (HDFS, All applications, Hadoop JobHistory server, Spark History Server).
Retrieve output:
    hdfs dfs -copyToLocal /user/my/output output

30 SUMMARY MapReduce and related tools are the key to big data analytics. Hadoop and Spark are popular runtime libraries that support MapReduce-type operations. Hathi is a free resource that is available for students who work on projects with professors.
