Introduction to MapReduce, Hadoop, & Spark Jonathan Carroll-Nellenback Center for Integrated Research Computing
Big Data Outline
- Analytics
- Map Reduce Programming Model
- Hadoop Ecosystem: HDFS, Pig, Hive, Oozie, Sqoop, HBase, YARN
- Spark
Big Data
- Volume: too big for typical database software tools to capture, store, manage, and analyze (TB, PB, EB)
- Velocity: real-time streaming data
- Variety: structured and unstructured (documents, graphs, videos, audio, etc.)
- Veracity: how trustworthy is the data?
- Value: how useful is the data?
Big Data Analytics
- Descriptive analytics: what happened
- Predictive analytics: what will happen
- Prescriptive analytics: what to do going forward
- Social media analytics: mine social media data to discover sentiment, behavior patterns, and individual preferences
- Entity analytics: analyze common types of events, people, things, transactions, and relationships
- Cognitive computing: simulation of human thought processes via natural language processing, pattern recognition, and data mining (e.g. IBM Watson)
Map Reduce
Map Reduce Programming Model (Google):
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Word Count
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Count occurrences of words in a document:
- K1 = line number, V1 = text
- Map returns a list of words, each paired with the number 1: K2 = word, V2 = 1
- Reduce adds up the values for each key: K3 = word, V3 = number of occurrences

Input (K1, V1):
1 It was the best of times,
2 It was the worst of times,
3 it was the age of wisdom,
4 it was the age of foolishness,
5 it was the epoch of belief,
6 it was the epoch of incredulity,
7 it was the season of Light,
8 it was the season of Darkness,

After map (K2, V2):
It 1, was 1, the 1, best 1, of 1, times 1, It 1, was 1, ...

After reduce (K3, V3):
It 10, was 12, the 100, best 10, of 50, times 15, age 4, wisdom 3, ...
[Figure omitted. Image credit: Broomfield Technology Consultants]
WordCount.ipynb
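A minimal PySpark sketch of this word-count pattern (not necessarily the notebook's exact code), assuming a SparkContext sc is already available, as in the course notebooks, along with a local copy of taleoftwocities.txt:

lines = sc.textFile("taleoftwocities.txt")        # K1 = line, V1 = text (implicit)
counts = (lines
          .flatMap(lambda line: line.split())     # map: emit each word
          .map(lambda word: (word, 1))            # (K2 = word, V2 = 1)
          .reduceByKey(lambda a, b: a + b))       # reduce: sum the 1s per word
print(counts.take(10))                            # e.g. [('It', 10), ('was', 12), ...]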
Generate Index
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
List all of the lines that a word appears on:
- K1 = line number, V1 = text
- Map returns a list of words, each paired with its line number: K2 = word, V2 = line #
- Reduce concatenates the values for each key: K3 = word, V3 = list of line #s

Input (K1, V1):
1 It was the best of times,
2 It was the worst of times,
3 it was the age of wisdom,
4 it was the age of foolishness,
5 it was the epoch of belief,
6 it was the epoch of incredulity,
7 it was the season of Light,
8 it was the season of Darkness,

After map (K2, V2):
It 1, was 1, the 1, best 1, of 1, times 1, It 2, was 2, ...

After reduce (K3, V3):
It 1,2,3,...; was 1,2,3,...; the 1,2,3,...; best 1; of 1,2,3,...; times 1,2; age 3,4; wisdom 3
Index.ipynb
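A sketch of the index pattern (again assuming sc); zipWithIndex supplies the line number (K1), which textFile does not expose directly:

lines = sc.textFile("taleoftwocities.txt").zipWithIndex()           # (text, 0-based line #)
index = (lines
         .flatMap(lambda p: [(w, p[1] + 1) for w in p[0].split()])  # (K2 = word, V2 = line #)
         .groupByKey()                                              # gather all line #s per word
         .mapValues(lambda nums: sorted(set(nums))))                # (K3 = word, V3 = list of line #s)
print(index.take(5))

groupByKey is used rather than reduceByKey because the complete list of line numbers for each word has to be collected in one place.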
Join Tables
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Join two tables using a key:
- K1 = table id, V1 = row of data
- Map returns the value to join on as the key, with the row and table id as the value: K2 = join value, V2 = (table id, row)
- Reduce does an outer/inner product on each set of rows from different table ids: K3 = none, V3 = product of each set of rows

Input tables:
first: 1 George, 2 John, 3 Thomas
last: 1 Washington, 2 Adams, 3 Jefferson

Input (K1, V1):
first  1 George
first  2 John
first  3 Thomas
last   1 Washington
last   2 Adams
last   3 Jefferson

After map (K2, V2):
1  first George
2  first John
3  first Thomas
1  last Washington
2  last Adams
3  last Jefferson

After reduce (K3, V3):
1  George Washington
2  John Adams
3  Thomas Jefferson
Join.ipynb
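A sketch of the same join in PySpark, assuming sc. Rows are tagged with their table id, re-keyed on the join value, and split back apart so that Spark's built-in join can perform the per-key inner product the reduce step describes:

rows = sc.parallelize([
    ("first", (1, "George")), ("first", (2, "John")), ("first", (3, "Thomas")),
    ("last",  (1, "Washington")), ("last", (2, "Adams")), ("last", (3, "Jefferson")),
])                                                       # (K1 = table id, V1 = row)
keyed  = rows.map(lambda r: (r[1][0], (r[0], r[1][1])))  # (K2 = join key, V2 = (table id, value))
firsts = keyed.filter(lambda p: p[1][0] == "first").mapValues(lambda v: v[1])
lasts  = keyed.filter(lambda p: p[1][0] == "last").mapValues(lambda v: v[1])
print(firsts.join(lasts).collect())
# [(1, ('George', 'Washington')), (2, ('John', 'Adams')), (3, ('Thomas', 'Jefferson'))], in some order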
Join Tables
Collecting the data for each join key within a single partition allows for any type of join (left, right, inner, outer). The tables don't have to have the same number of columns.

Input (K1, V1):
first  1 George 1789
first  2 John 1797
first  3 Thomas 1801
last   1 Washington
last   2 Adams
last   3 Jefferson

After map (K2, V2):
1  first George 1789
2  first John 1797
3  first Thomas 1801
1  last Washington
2  last Adams
3  last Jefferson

After reduce (K3, V3):
1  George Washington 1789
2  John Adams 1797
3  Thomas Jefferson 1801
Map Reduce
Map Reduce Programming Model:
map(K1, V1) => List[K2, V2]
reduce(K2, List[V2]) => List[K3, V3]
Group by key / shuffle:
- The initial key-value pairs are partitioned across multiple nodes, so mapping can occur locally.
- Grouping by key involves shuffling data between nodes prior to the reduce.
- If the reduction operator is associative and commutative, there is no need to group all of the data for each key K2 on the same partition: reduction can happen within each partition first and then across partitions (see the sketch below).
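A sketch of that distinction in PySpark, assuming sc: reduceByKey can combine values within each partition before the shuffle (addition is associative and commutative), while groupByKey must move every key-value pair across the network first:

pairs   = sc.parallelize([("a", 1), ("b", 1), ("a", 1)] * 1000)
summed  = pairs.reduceByKey(lambda a, b: a + b)   # partial sums per partition, merged after the shuffle
grouped = pairs.groupByKey().mapValues(sum)       # full shuffle first, then summed
assert sorted(summed.collect()) == sorted(grouped.collect())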
Hadoop Ecosystem
Hadoop Framework:
- Hadoop MapReduce
- HDFS (Hadoop Distributed File System): fault tolerant; designed for batch processing of large datasets; simple coherency model (file content cannot be updated, just appended)
- Hadoop YARN: resource management and job scheduling
Additional packages:
- Pig: procedural language (Pig Latin) to express map and reduce processes
- Hive: provides a SQL interface on top of MapReduce
- HBase: column-oriented key-value store on top of HDFS
- Sqoop: efficiently moves data from relational databases into HDFS
Apache Spark
- Runs in memory
- Main components:
  - Spark Core: MapReduce
  - Spark SQL: database
  - Spark Streaming: real-time analysis of streaming data
  - Spark MLlib: distributed machine learning library
  - Spark GraphX: distributed graph processing framework
- Procedural-style interface similar to Pig Latin; libraries available for Python, R, Scala, and Java
- Supports lazy evaluation: map operations (transformations) only occur when needed by a reduction; reductions (actions) are scheduled to run within a MapReduce cluster; intermediate results can be cached (see the sketch below)
- Comes with a standalone native Spark cluster, or can interface with YARN
- Can connect to HDFS, GPFS, Cassandra, Amazon S3, etc.
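A sketch of lazy evaluation and caching, assuming sc; the file name here is hypothetical:

data   = sc.textFile("events.log")            # no work happens yet
errors = data.filter(lambda l: "ERROR" in l)  # still no work: only the lineage is recorded
errors.cache()                                # mark for reuse once computed
n      = errors.count()                       # action: the whole pipeline runs now
first  = errors.first()                       # served from the cached result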
Jupyter
A web application that allows for the creation and sharing of documents that contain live code, equations, visualizations, and explanatory text. Supports Julia, Python, and R.
Jupyter + Python + Spark = easy interactive big-data analytics
Spark+Jupyter on BlueHive
Install X2Go following the instructions for your OS under the Getting Started Guide at https://info.circ.rochester.edu:
Getting Started → Connecting using <OS> → Connecting using a Graphical User Interface
Spark+Jupyter on BlueHive
Open a terminal on the head node (under Applications → System Tools).
Copy over the directory containing the sample files:
cp -r /public/jcarrol5/csc461 .
cd csc461
Run the command init_jupyter.sh, which will:
1. create a self-signed key for the Jupyter web app
2. prompt you for a password to secure your Jupyter notebook server
3. update your Jupyter notebook configuration file
Spark+Jupyter on BlueHive
Run the command interactive -p debug -t 60 --exclusive. This requests an interactive session on a compute node in the debug partition for 60 minutes.
Run the script start-spark-jupyter. This starts a standalone Spark cluster within your job's allocation and launches your Jupyter notebook initialized with the Spark context (sc) and Spark SQL context (sqlContext).
Note the URL of the Spark master (http://bhc0001:8080) and the Jupyter notebook (https://bhc0001:8888).
Open the Firefox web browser (under Applications → Internet) and connect to the Jupyter notebook (and the Spark master).
Spark Data Structures
Spark Core:
- RDD (Resilient Distributed Dataset): each element is just a value (e.g. a string)
- MapRDD (pair RDD): each element is a key-value pair, though the value can be anything
- RowRDD: each element is a Row object with some schema
Spark SQL:
- DataFrame: column-oriented distributed dataset (Hive); supports SQL queries as well as procedural operations (join, select, ...)
A sketch of each structure appears below.
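A minimal sketch, assuming the sc and sqlContext objects that the start-spark-jupyter script provides:

from pyspark.sql import Row
rdd     = sc.parallelize(["George", "John", "Thomas"])  # RDD: each element is just a value
pairRdd = rdd.map(lambda name: (name[0], name))         # MapRDD: key-value pairs
rowRdd  = rdd.map(lambda name: Row(first=name))         # RowRDD: Rows with a schema
df      = sqlContext.createDataFrame(rowRdd)            # DataFrame: column-oriented
df.show()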
Wegmans Data
The Wegmans.ipynb notebook contains example code that:
- loads data from text files into an RDD of strings
- maps that onto an RDD of Rows
- transforms the RowRDDs into DataFrames
- performs SQL queries as well as procedural queries
- performs a market basket analysis to find sets of frequent items
A sketch of the same flow appears below.
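A sketch of that flow with a hypothetical tab-separated file and schema (the real Wegmans file layout is not reproduced here), assuming sc and sqlContext:

from pyspark.sql import Row
lines = ssc_lines = sc.textFile("transactions.txt")              # RDD of strings
rows  = lines.map(lambda l: l.split("\t")) \
             .map(lambda f: Row(item=f[0], price=float(f[1])))   # RDD of Rows
df = sqlContext.createDataFrame(rows)                            # DataFrame
df.registerTempTable("sales")
sqlContext.sql("SELECT item, SUM(price) AS gross FROM sales "
               "GROUP BY item ORDER BY gross DESC LIMIT 20").show()  # SQL query
df.groupBy("item").sum("price").show()                           # procedural equivalent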
Spark Modules
- Core: RDD
- SQL: DataFrames
- Streaming: supports a continuous inflow of data
- MLlib: machine learning library
  - statistics
  - classification/regression
  - collaborative filtering (ALS)
  - clustering (e.g. k-means)
  - dimensionality reduction (SVD/PCA)
  - frequent pattern mining (FP-growth; see the sketch below)
- GraphX: supports operations on graphs (mapTriplets, groupEdges, pageRank, connectedComponents, ...)
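A sketch of frequent-pattern mining with MLlib's FP-growth, one way to do the market basket analysis described earlier; the baskets here are made up for illustration, and sc is assumed:

from pyspark.mllib.fpm import FPGrowth
baskets = sc.parallelize([
    ["bread", "milk"],
    ["bread", "milk", "soda"],
    ["milk", "soda"],
])
model = FPGrowth.train(baskets, minSupport=0.5, numPartitions=2)
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)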
Homework
1. Modify the word count example to count letters instead of words in the taleoftwocities.txt file.
2. Modify the Wegmans.ipynb Jupyter notebook to calculate the top 20 grossing products sold by Wegmans.
3. Calculate the average number of products in each transaction.
4. Repeat the basket analysis using item category names instead of item names. Determine what items are frequently bought with CARBONATED SODA POP.