Outline of Tutorial. Hadoop and Pig Overview Hands-on

Similar documents

Hadoop Hands-On Exercises

BIG DATA APPLICATIONS

Mrs: MapReduce for Scientific Computing in Python

Big Data 2012 Hadoop Tutorial

Hadoop and ecosystem * 本文中的言论仅代表作者个人观点 * 本文中的一些图例来自于互联网. Information Management. Information Management IBM CDL Lab

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

Xiaoming Gao Hui Li Thilina Gunarathne

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Cloudera Certified Developer for Apache Hadoop

Evaluating MapReduce and Hadoop for Science

Enterprise Data Storage and Analysis on Tim Barr

Internals of Hadoop Application Framework and Distributed File System

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Tutorial- Counting Words in File(s) using MapReduce

Big Data for the JVM developer. Costin Leau,

Lambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014

The Hadoop Eco System Shanghai Data Science Meetup

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

How To Scale Out Of A Nosql Database

Introduction to MapReduce and Hadoop

Hadoop 只支援用 Java 開發嘛? Is Hadoop only support Java? 總不能全部都重新設計吧? 如何與舊系統相容? Can Hadoop work with existing software?

Extreme Computing. Hadoop MapReduce in more detail.

Hadoop Ecosystem B Y R A H I M A.

Hadoop and Eclipse. Eclipse Hawaii User s Group May 26th, Seth Ladd

Hadoop at Yahoo! Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

CSE-E5430 Scalable Cloud Computing Lecture 2

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

MapReduce with Apache Hadoop Analysing Big Data

Getting to know Apache Hadoop

Hadoop IST 734 SS CHUNG

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Hadoop/MapReduce. Object-oriented framework presentation CSCI 5448 Casey McTaggart

Big Data and Apache Hadoop s MapReduce

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Introduction to Cloud Computing

Hadoop implementation of MapReduce computational model. Ján Vaňo

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Introduc)on to Map- Reduce. Vincent Leroy

Hadoop Lab Notes. Nicola Tonellotto November 15, 2010

Working With Hadoop. Important Terminology. Important Terminology. Anatomy of MapReduce Job Run. Important Terminology

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Data-Intensive Computing with Map-Reduce and Hadoop

BIG DATA TRENDS AND TECHNOLOGIES

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

A Brief Outline on Bigdata Hadoop

Introduction to Hadoop

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Hadoop. Dawid Weiss. Institute of Computing Science Poznań University of Technology

Programming Hadoop Map-Reduce Programming, Tuning & Debugging. Arun C Murthy Yahoo! CCDI acm@yahoo-inc.com ApacheCon US 2008

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

University of Maryland. Tuesday, February 2, 2010

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Cloud Computing at Google. Architecture

Big Data Management and NoSQL Databases

Map Reduce & Hadoop Recommended Text:

Word Count Code using MR2 Classes and API

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Hadoop. Sunday, November 25, 12

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Complete Java Classes Hadoop Syllabus Contact No:

COSC 6397 Big Data Analytics. Distributed File Systems (II) Edgar Gabriel Spring HDFS Basics

Click Stream Data Analysis Using Hadoop

BIG DATA, MAPREDUCE & HADOOP

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

A very short Intro to Hadoop

Accelerating and Simplifying Apache

IBM Big Data Platform

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Systems Engineering II. Pramod Bhatotia TU Dresden dresden.de

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Workshop on Hadoop with Big Data

Big Data With Hadoop

Big Data and Data Science Grows Up. Ron Bodkin Founder & CEO Think Big Analy8cs ron.bodkin@xthinkbiganaly8cs..com

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2012/13

Chapter 7. Using Hadoop Cluster and MapReduce

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

BIG DATA What it is and how to use?

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

Cloud Computing Era. Trend Micro

Large scale processing using Hadoop. Ján Vaňo

CS54100: Database Systems

Peers Techno log ies Pv t. L td. HADOOP

HDFS. Hadoop Distributed File System

Big Data Management. Big Data Management. (BDM) Autumn Povl Koch November 11,

Open source large scale distributed data management with Google s MapReduce and Bigtable

Implement Hadoop jobs to extract business value from large and varied data sets

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Application Development. A Paradigm Shift

Big Data Too Big To Ignore

BBM467 Data Intensive ApplicaAons

Transcription:

Outline of Tutorial Hadoop and Pig Overview Hands-on 1

Hadoop and Pig Overview Lavanya Ramakrishnan Shane Canon Lawrence Berkeley National Lab October 2011

Overview Concepts & Background MapReduce and Hadoop Hadoop Ecosystem Tools on top of Hadoop Hadoop for Science Examples, Challenges Programming in Hadoop Building blocks, Streaming, C-HDFS API 3

Processing Big Data Internet scale generates BigData Terabytes of data/day just reading 100 TB can be overwhelming using clusters of standard commodity computers for linear scalability Timeline Nutch open source search project (2002-2004) MapReduce & DFS implementation and Hadoop splits out of Nutch (2004-2006) 4

MapReduce Computation performed on large volumes of data in parallel divide workload across large number of machines need a good data management scheme to handle scalability and consistency Functional programming concepts map reduce 5 OSDI 2004

Mapping Map input to an output using some function Example string manipulation 6

Reduces Aggregate values together to provide summary data Example addition of the list of numbers 7

Google File System Distributed File System accounts for component failure multi-gb files and billions of objects Design single master with multiple chunkservers per master file represented as fixed-sized chunks 3-way mirrored across chunkservers 8

Hadoop Open source reliable, scalable distributed computing platform implementation of MapReduce Hadoop Distributed File System (HDFS) runs on commodity hardware Fault Tolerance restarting tasks data replication Speculative execution handles stragglers 9

HDFS Architecture 10

HDFS and other Parallel Filesystems HDFS GPFS and Lustre Typical Replication 3 1 Storage Location Compute Node Servers Access Model Custom (except with Fuse) POSIX Stripe Size 64 MB 1 MB Concurrent Writes No Yes Scales with # of Compute Nodes # of Servers Scale of Largest Systems O(10k) Nodes O(100) Servers User/Kernel Space User Kernel

Who is using Hadoop? A9.com Amazon Adobe AOL Baidu Cooliris Facebook NSF-Google university initiative 12 IBM LinkedIn Ning PARC Rackspace StumbleUpon Twitter Yahoo!

Hadoop Stack Pig Chukwa MapReduce Hive HDFS Core HBase Zoo Keeper Avro Source: Hadoop: The Definitive Guide Constantly evolving! 13

Google Vs Hadoop Google Hadoop MapReduce Hadoop MapReduce GFS HDFS Sawzall Pig, Hive BigTable Hbase Chubby Zookeeper Pregel Hama, Giraph 14

Pig Platform for analyzing large data sets Data-flow oriented language Pig Latin data transformation functions datatypes include sets, associative arrays, tuples high-level language for marshalling data Developed at Yahoo! 15

Hive SQL-based data warehousing application features similar to Pig more strictly SQL-type Supports SELECT, JOIN, GROUP BY, etc Analyzing very large data sets log processing, text mining, document indexing Developed at Facebook 16

HBase Persistent, distributed, sorted, multidimensional, sparse map based on Google BigTable provides interactive access to information Holds extremely large datasets (multitb) High-speed lookup of individual (row, column) 17

ZooKeeper Distributed consensus engine runs on a set of servers and maintains state consistency Concurrent access semantics leader election service discovery distributed locking/mutual exclusion message board/mailboxes producer/consumer queues, priority queues and multi-phase commit operations 18

Other Related Projects [1/2] Chukwa Hadoop log aggregation Scribe more general log aggregation Mahout machine learning library Cassandra column store database on a P2P backend Dumbo Python library for streaming Spark in memory cluster for interactive and iterative Hadoop on Amazon Elastic MapReduce 19

Other Related Projects [2/2] Sqoop import SQL-based data to Hadoop Jaql JSON (JavaScript Object Notation) based semi-structured query processing Oozie Hadoop workflows Giraph Large scale graph processing on Hadoop Hcatlog relational view of HDFS Fuse-DS POSIX interface to HDFS 20

Hadoop for Science 21

Magellan and Hadoop DOE funded project to determine appropriate role of cloud computing for DOE/SC midrange workloads Co-located at Argonne Leadership Computing Facility (ALCF) and National Energy Research Scientific Center (NERSC) Hadoop/Magellan research questions Are the new cloud programming models useful for scientific computing? 22

Data Intensive Science Evaluating hardware and software choices for supporting next generation data problems Evaluation of Hadoop using mix of synthetic benchmarks and scientific applications understanding application characteristics that can leverage the model data operations: filter, merge, reorganization compute-data ratio (collaboration w/ Shane Canon, Nick Wright, Zacharia Fadika) 23

MapReduce and HPC Applications that can benefit from MapReduce/Hadoop Large amounts of data processing Science that is scaling up from the desktop Query-type workloads Data from Exascale needs new technologies Hadoop On Demand lets one run Hadoop through a batch queue 24

Hadoop for Science Advantages of Hadoop transparent data replication, data locality aware scheduling fault tolerance capabilities Hadoop Streaming allows users to plug any binary as maps and reduces input comes on standard input 25

BioPig Analytics toolkit for Next-Generation Sequence Data User defined functions (UDF) for common bioinformatics programs BLAST, Velvet readers and writers for FASTA and FASTQ pack/unpack for space conservation with DNA sequenceså 26

Application Examples Bioinformatics applications (BLAST) parallel search of input sequences Managing input data format Tropical storm detection binary file formats can t be handled in streaming Atmospheric River Detection maps are differentiated on file and parameter 27

Bring your application Hadoop workshop When: TBD Send us email if you are interested LRamakrishnan@lbl.gov Scanon@lbl.gov Include a brief description of your application. 28

HDFS vs GPFS (Time) Teragen (1TB) HDFS 12 GPFS Linear (HDFS) Expon. (HDFS) Linear (GPFS) Expon. (GPFS) 10 Time (minutes) 8 6 4 2 0 0 500 1000 1500 Number of maps 29 2000 2500 3000

Application Characteristic Affect Choices Wikipedia data set On ~ 75 nodes, GPFS performs better with large nodes Iden%cal data loads and processing load Amount of wri%ng in applica%on aﬀects performance

Hadoop: Challenges Deployment all jobs run as user hadoop affecting file permissions less control on how many nodes are used affects allocation policies Programming: No turn-key solution using existing code bases, managing input formats and data Additional benchmarking, tuning needed, Plug-ins for Science 31

10 0 Comparison of MapReduce Implementations 0 10 20 node1: (Under stress) node2 node3 Hadoop 80 Output data size (MB) 90 LEMO MR 64 core Twister Cluster 64 core LEMO MR Cluster 64 core Hadoop Cluster! 30 40 70 Twister! 20 60 Speedup 0 50 LEMO MR Load balancing! 40 Twister 50 Number of words processed (Billion) 100 64 core Twister Cluster 64 core Hadoop Cluster 64 core LEMO MR Cluster 50 Processing time (s) 150 Hadoop 10 Producing random floating point numbers!!! 0! Collaboration w/ Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju, SUNY Binghamton 0 10 20 30 40 Cluster size (cores) Processing 5 million 33 x 33 matrices 32 50 60

Programming Hadoop 33

Programming with Hadoop Map and reduce as Java programs using Hadoop API Pipes and Streaming can help with existing applications in other languages C- HDFS API Higher-level languages such as Pig might help with some applications 34

Keys and Values Maps and reduces produce key-value pairs arbitrary number of values can be output may map one input to 0,1,.100 outputs reducer may emit one or more outputs Example: Temperature recordings 94089 8:00 am, 59 27704 6:30 am, 70 94089 12:45 pm, 80 47401 1 pm, 90 35

Keys divide the reduce space 36

Data Flow 37

Mechanics[1/2] Input files large 10s of GB or more, typically in HDFS line-based, binary, multi-line, etc. InputFormat function defines how input files are split up and read TextInputFormat (default), KeyValueInputFormat, SequenceFileInputFormat InputSplits unit of work that comprises a single map task FileInputFormat divides it into 64MB chunks 38

Mechanics [2/2] RecordReader loads data and converts to key value pair Sort & Partiton & Shuffle intermediate data from map to reducer Combiner reduce data on a single machine Mapper & Reducer OutputFormat, RecordWriter 39

Word Count Mapper public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasmoretokens()) { word.set(itr.nexttoken()); context.write(word, one); } } } 40

Word Count Reducer public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } } 41

Word Count Example public static void main(string[] args) throws Exception { Configuration conf = new Configuration(); String[] otherargs = new GenericOptionsParser(conf, args).getremainingargs();. Job job = new Job(conf, "word count"); job.setjarbyclass(wordcount.class); job.setmapperclass(tokenizermapper.class); job.setcombinerclass(intsumreducer.class); job.setreducerclass(intsumreducer.class); job.setoutputkeyclass(text.class); job.setoutputvalueclass(intwritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true)? 0 : 1); } 42

Pipes Allows C++ code to be used for Mapper and Reducer Both key and value inputs to pipes programs are provided as std::string $ hadoop pipes 43

C-HDFS API Limited C API to read and write from HDFS #include "hdfs.h" int main(int argc, char **argv) { hdfsfs fs = hdfsconnect("default", 0); hdfsfile writefile = hdfsopenfile(fs, writepath, O_WRONLY O_CREAT, 0, 0, 0); tsize num_written_bytes = hdfswrite(fs, writefile, (void*)buffer, strlen(buffer)+1); hdfsclosefile(fs, writefile); } 44

Hadoop Streaming Generic API that allows programs in any language to be used as Hadoop Mapper and Reducer implementations Inputs written to stdin as strings with tab character separating Output to stdout as key \t value \n $ hadoop jar contrib/streaming/ hadoop-[version]-streaming.jar 45

Debugging Test core functionality separate Use Job Tracker Run local in Hadoop Run job on a small data set on a single node Hadoop can save files from failed tasks 46

Pig Basic Operations LOAD loads data into a relational form FOREACH..GENERATE Adds or removes fields (columns) GROUP Group data on a field JOIN Join two relations DUMP/STORE Dump query to terminal or file There are others, but these will be used for the exercises today

Pig Example Find the number of gene hits for each model in an hmmsearch (>100GB of output, 3 Billion Lines) bash# cat * cut f 2 sort uniq -c > hits = LOAD /data/bio/*' USING PigStorage() AS (id:chararray,model:chararray, value:float);! > amodels = FOREACH hits GENERATE model;! > models = GROUP amodels BY model;! > counts = FOREACH models GENERATE group,count (amodels) as count;! > STORE counts INTO 'tcounts' USING PigStorage();!

Pig - LOAD Example: hits = LOAD 'load4/*' USING PigStorage() AS (id:chararray, model:chararray,value:float);! Pig has several built-in data types (chararray, float, integer) PigStorage can parse standard line oriented text files. Pig can be extended with custom load types written in Java. Pig doesn t read any data until triggered by a DUMP or STORE

Pig FOREACH..GENERATE, GROUP Example: amodel = FOREACH model GENERATE hits;! models = GROUP amodels BY model;! counts = FOREACH models GENERATE group,count (amodels) as count;! Use FOREACH..GENERATE to pick of specific fields or generate new fields. Also referred to as a projection GROUP will create a new record with the group name and a bag of the tuples in each group You can reference a specific field in a bag with <bag>.field (i.e. amodels.model) You can use aggregate functions like COUNT, MAX, etc on a bag

Pig Important Points Nothing really happens until a DUMP or STORE is performed. Use FILTER and FOREACH early to remove unneeded columns or rows to reduce temporary output Use PARALLEL keyword on GROUP operations to run more reduce tasks

Questions? Shane Canon Scanon@lbl.gov Lavanya Ramakrishnan LRamakrishnan@lbl.gov 52