BIG DATA ANALYTICS USING HADOOP AND SPARK ON HATHI Boyu Zhang Research Computing ITaP
BIG DATA APPLICATIONS
Big data has become one of the most important aspects of scientific computing and business analytics
Provides key insights to scientists and decision makers
The amount of data is exploding
CHARACTERISTICS OF BIG DATA
Commonly accepted 3Vs of big data: Volume, Velocity, Variety
Additional Vs have been proposed: Veracity, Value
M. Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/nist-stonebraker.pdf
M. Walker: Data Veracity, http://www.datasciencecentral.com/profiles/blogs/data-veracity
VOLUME
The amount of data to process
Figure 1: The digital universe: 50-fold growth from the beginning of 2010 to the end of 2020
Source: IDC's Digital Universe Study, sponsored by EMC, December 2012
"Within these broad outlines of the digital universe are some singularities worth noting. First, while the portion of the digital universe holding potential analytic value is growing, only a tiny fraction of territory has been explored. IDC estimates that by 2020, as much as 33% of the digital universe will contain information that might be valuable if analyzed, compared with 25% today. This untapped value could be found in patterns in social media usage, correlations in scientific data from discrete studies, medical information intersected with sociological data, faces in security footage, and so on."
J. Gantz and D. Reinsel: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
VOLUME
Fortress archive at Purdue
VELOCITY
How much data is generated on the internet every minute?
The global internet population: ~3.2 billion
J. James: Data Never Sleeps 3.0, https://www.domo.com/blog/2015/08/data-never-sleeps-3-0
VARIETY
Data in many forms: database, photo, video, audio, web data
VARIOUS DOMAINS
Biology, Physics, Finance, Computer Science
BIG DATA ANALYTICS
Finding the value
Need a set of powerful tools!
OUTLINE
Introduction to big data analytics
Tools available: MapReduce, Hadoop, Spark
ITaP resource: the Hathi platform
Examples
Summary
TRADITIONAL METHODS
R, SAS, Matlab, C/C++, MPI, Java, Python
Challenges:
  Big data needs parallel processing
  Programmers need to handle parallelization explicitly: work distribution, data distribution, communication, fault tolerance
  Time to solution is too long
We need a set of tools that frees programmers from all of the above and lets them focus on the problem logic
MAPREDUCE PARADIGM
The programmer writes map and reduce functions that run in parallel
Map
  Input: <key_in, val_in> pairs
  Output: list of <key_i, val_i> pairs
  <key_in, val_in> -> map -> list<key_i, val_i>
Reduce
  Input: <key_j, list(val)> pairs
  Output: <key_out, val_out> pairs
  <key_j, list(val)> -> reduce -> <key_out, val_out>
MAPREDUCE RUNTIME
Executes map and reduce functions in parallel
Communicates all the values associated with the same key from map to reduce in the shuffle stage
  <key_in, val_in> -> map -> list<key_i, val_i> -> shuffle -> <key_j, list(val)> -> reduce -> <key_out, val_out>
Popular runtime libraries support this model: Hadoop, Spark, MapReduce-MPI (MR-MPI)
Hadoop: https://hadoop.apache.org
Spark: http://spark.apache.org
MapReduce-MPI: http://mapreduce.sandia.gov
EXAMPLE - WORDCOUNT
Count the occurrence of each word in a collection of text documents
Process 0: "This is a test."
  map -> <This, 1> <is, 1> <a, 1> <test., 1>
  shuffle -> <This, {1, 1}> <is, {1, 1}>
  reduce -> <This, 2> <is, 2>
Process 1: "This is also a test."
  map -> <This, 1> <is, 1> <also, 1> <a, 1> <test., 1>
  shuffle -> <a, {1, 1}> <test., {1, 1}> <also, {1}>
  reduce -> <a, 2> <test., 2> <also, 1>
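The flow above can be sketched in plain Python. The function names (map_fn, reduce_fn, map_reduce) are illustrative, not part of any MapReduce library; a real runtime would run the mappers and reducers in parallel across nodes, but the key/value contracts are the same:

```python
from collections import defaultdict

def map_fn(doc):
    # Map: emit a <word, 1> pair for every token in the document.
    return [(word, 1) for word in doc.split()]

def reduce_fn(key, values):
    # Reduce: sum all counts associated with one word.
    return (key, sum(values))

def map_reduce(docs):
    # Shuffle: group every value emitted by the mappers by its key.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Apply reduce once per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

counts = map_reduce(["This is a test.", "This is also a test."])
print(counts["This"], counts["a"], counts["also"])  # 2 2 1
```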
MAPREDUCE ADVANTAGES
Simplicity
  Developer's choice of language: C/C++, Java, Python, R, etc.
  Automatic parallelization and communication in a restricted manner
  Built-in fault tolerance
Performance
  Brings computation to the data location
Scalability
  Petabytes of data, tens of thousands of compute nodes
HADOOP
A library that supports the execution of MapReduce applications
Uses the Hadoop Distributed File System (HDFS) for data storage
Uses Hadoop NextGen MapReduce (YARN) for MapReduce application scheduling and execution
[Stack diagram: MR, Pig, and Hive run on YARN, which runs on HDFS]
HADOOP - HDFS
Master-slave architecture
  A single NameNode; one DataNode per compute node in the cluster
Storage and access model
  Each file is stored as a sequence of blocks of the same size
  Each block is replicated multiple times
  Write-once-read-many access model for files
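To get a feel for the storage model, here is a small sketch that computes how many blocks and how much raw storage a file consumes; the 128 MB block size and replication factor of 3 are the Hadoop 2.x defaults, assumed here for illustration (a cluster may configure different values):

```python
import math

BLOCK_SIZE_MB = 128   # dfs.blocksize default in Hadoop 2.x (assumed for this sketch)
REPLICATION = 3       # dfs.replication default (assumed for this sketch)

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB) for one file."""
    # Files are split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every byte is stored REPLICATION times across DataNodes.
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

blocks, raw = hdfs_footprint(1000)  # a 1000 MB file
print(blocks, raw)  # 8 3000
```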
HADOOP - YARN
Master-slave architecture
  A single ResourceManager, one NodeManager per compute node in the cluster, and one ApplicationMaster per application
ResourceManager
  Manages the use of resources across the cluster
NodeManager
  Launches and monitors containers; a container executes an application-specific process
ApplicationMaster
  Negotiates resource containers from the ResourceManager
  Tracks their status and progress
ANATOMY OF A YARN APPLICATION RUN
1. A client contacts the ResourceManager and asks it to run an ApplicationMaster process.
2. The ResourceManager finds a NodeManager that can launch the ApplicationMaster in a container.
3. The ApplicationMaster may run a computation in its container or request more containers from the ResourceManager.
T. White: Hadoop: The Definitive Guide, 4th Edition
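The steps above can be modeled as a toy simulation. The classes, methods, and node names here are illustrative placeholders, not the real YARN API; the point is only the division of roles (ResourceManager allocates, NodeManagers launch containers):

```python
class NodeManager:
    """Toy model: one NodeManager per compute node, launching containers."""
    def __init__(self, name):
        self.name = name
        self.containers = []

    def launch_container(self, process):
        # A container executes one application-specific process.
        self.containers.append(process)
        return f"{process} on {self.name}"

class ResourceManager:
    """Toy model: allocates containers across the cluster's NodeManagers."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, process):
        # Pick the least-loaded node to host the new container.
        node = min(self.node_managers, key=lambda nm: len(nm.containers))
        return node.launch_container(process)

# 1+2. The client asks the ResourceManager to run an ApplicationMaster,
#      and the ResourceManager launches it in a container on some node.
rm = ResourceManager([NodeManager("hathia000"), NodeManager("hathia001")])
am = rm.allocate("ApplicationMaster")
# 3. The ApplicationMaster requests more containers for the actual tasks.
tasks = [rm.allocate(f"map task {i}") for i in range(3)]
print(am)  # ApplicationMaster on hathia000
```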
EXAMPLE - WORDCOUNT
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

47 Lines of Code!
HADOOP ECOSYSTEM
Hadoop Ecosystem Overview: http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview
SPARK
Extends the MapReduce model to support more types of computations, such as interactive queries and stream processing
Components: Spark SQL, Spark Streaming, MLlib, GraphX on the Spark core
Scheduler: Standalone, YARN, Mesos, EC2
Storage: HDFS, local FS, Amazon S3
SPARK PROGRAMMING ABSTRACTION
Resilient Distributed Dataset (RDD)
  A distributed collection of items across cluster nodes
Operations
  Transformations construct a new RDD from a previous one: filter(), map(), sample()
  Actions compute a result based on an RDD: reduce(), collect(), count()
Transformations are only computed when an action requires a result to be returned to the driver program (lazy evaluation)
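Lazy evaluation can be illustrated with a toy stand-in for an RDD. This is not the Spark API, just a sketch of the idea: transformations only record work and return a new dataset object, while an action runs the recorded pipeline and returns a result to the driver:

```python
class ToyRDD:
    """Illustrative stand-in for an RDD (not the real Spark API)."""
    def __init__(self, items, pending=None):
        self.items = items
        self.pending = pending or []  # recorded, not-yet-run transformations

    # Transformations: record the operation, return a new ToyRDD, do no work.
    def map(self, fn):
        return ToyRDD(self.items, self.pending + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self.items, self.pending + [("filter", fn)])

    # Actions: run the recorded pipeline and return a result to the driver.
    def collect(self):
        items = self.items
        for kind, fn in self.pending:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; the work happens only at the action:
print(rdd.count())    # 5
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

A real RDD adds partitioning across nodes and lineage-based fault tolerance on top of this same record-then-run pattern.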
ANATOMY OF A SPARK APPLICATION RUN
The driver program launches parallel operations on a cluster and manages a number of executors
A SparkContext object represents a connection to a cluster
  Builds RDDs
  Runs operations on RDDs
EXAMPLE - WORDCOUNT
from pyspark import SparkContext

sc = SparkContext(appName="PythonWordCount")
text_file = sc.textFile("hdfs://user/myname/text.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://user/myname/count.txt")
sc.stop()

6 Lines of Code!
SPARK ADVANTAGES
Flexibility
  Supports a wider range of workflows in addition to MapReduce
  Provides APIs in Python, Scala, Java, SQL, and R
Performance
  Runs computations in memory
  2014 Daytona Gray Sort 100 TB benchmark
  [Chart: sort rate per node (GB/min), Hadoop vs. Spark]
HATHI OVERVIEW
Compute nodes (hathia000 through hathia005) run user jobs
Front-end nodes (hathi-fe00, hathi-fe01) handle user log-in, simple file manipulations, and other miscellaneous operations
The adm node (hathi-adm) runs the Hadoop master daemons
HATHI SUPPORTED LIBRARIES
Hadoop MapReduce: Java API; Hadoop Streaming to run any executable
Spark: Scala, Python, and R APIs
Hive: SQL-like language
Pig: Pig Latin statements
SUBMITTING TO HATHI
Prepare input
  hdfs dfs -copyFromLocal test.txt /user/my/input
Submit the job
  hadoop jar wordcount.jar org.myorg.WordCount /user/my/input /user/my/output
Monitor the job
  Command-line output and web interfaces:
  HDFS: http://hathi-adm.rcac.purdue.edu:50070
  All applications: http://hathi-adm.rcac.purdue.edu:8088
  Hadoop JobHistory server: http://hathi-adm.rcac.purdue.edu:19888
  Spark History Server: http://hathi-adm.rcac.purdue.edu:18080
Retrieve output
  hdfs dfs -copyToLocal /user/my/output output
SUMMARY
MapReduce and related tools are key to big data analytics
Hadoop and Spark are popular runtime libraries that support MapReduce-style operations
Hathi is a free resource available to students working on projects with professors (https://www.rcac.purdue.edu/compute/hathi/guide/)