
Workshop: From Zero to Hadoop. Budapest DW Forum 2014

Agenda today
1. Some setup before we start
2. (Back to the) introduction
3. Our workshop today
4. Part 1: a simple Pig Latin job on EMR
5. Part 2: a simple Scalding job on EMR
6. Part 3: a more complex Scalding job on EMR

Some setup before we start

There is a lot to copy and paste, so let's all join a Google Hangout chat: http://bit.ly/1xgsqid
If I forget to paste some content into the chat room, just shout out and remind me.

First, let's all download and set up VirtualBox and Vagrant:
http://docs.vagrantup.com/v2/installation/index.html
https://www.virtualbox.org/wiki/downloads

Now let's set up our development environment:
$ vagrant plugin install vagrant-vbguest

If you have git already installed:
$ git clone --recursive https://github.com/snowplow/dev-environment.git

If not:
$ wget https://github.com/snowplow/dev-environment/archive/temp.zip
$ unzip temp.zip
$ wget https://github.com/snowplow/ansible-playbooks/archive/temp.zip
$ unzip temp.zip

Now let's set up our development environment:
$ cd dev-environment
$ vagrant up
$ vagrant ssh

Final step for now, let's install some software:
$ ansible-playbook /vagrant/ansible-playbooks/aws-tools.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
$ ansible-playbook /vagrant/ansible-playbooks/scala-sbt.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local

(Back to the) introduction

Snowplow is an open-source web and event analytics platform, built on Hadoop:
- Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
- We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow
- We built Snowplow on top of Hadoop from the very start

We wanted to take a fresh approach to web analytics:
- Your own web event data -> in your own data warehouse
- Your own event data model
- Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
- Plug in the broadest possible set of analysis tools to drive value from your data
[Diagram: data pipeline -> data warehouse -> analyse your data in any analysis tool]

And we saw the potential of new big data technologies and services to solve these problems in a scalable, low-cost manner: CloudFront, Amazon S3, Amazon EMR, Amazon Redshift. These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis.

Our Snowplow event processing flow runs on Hadoop, specifically Amazon's Elastic MapReduce hosted Hadoop service.
[Diagram of the Snowplow Hadoop data pipeline: website / webapp with JavaScript event tracker -> CloudFront-based or Clojure-based event collector -> Amazon S3 -> Scalding-based enrichment on Hadoop -> Amazon Redshift / PostgreSQL]

Why did we pick Hadoop?
- Scalability: we have customers processing 350m Snowplow events a day; the Hadoop job runs in under 2 hours
- Easy to reprocess data: if business rules change, we can fire up a large cluster and re-process all historical raw Snowplow events
- Highly testable: we write unit and integration tests for our jobs and run them locally, giving us confidence that our jobs will run correctly at scale on Hadoop

And why Amazon's Elastic MapReduce (EMR)?
- No need to run our own cluster: running your own Hadoop cluster is a huge pain, not for the fainthearted. By contrast, EMR just works (most of the time!)
- Elastic: Snowplow runs as a nightly (sometimes more frequent) batch job. We spin up the EMR cluster to run the job, and shut it down straight after
- Interop with other AWS services: EMR works really well with Amazon S3 as a file store. We are big fans of Amazon Redshift (hosted columnar database) too

Our workshop today

Hadoop is complicated

for our workshop today, we will stick to using Elastic MapReduce and try to avoid any unnecessary complexity

and we will learn by doing!
- There are lots of books and articles about Hadoop and the theory of MapReduce
- We will learn by doing: no theory unless it's required to directly explain the jobs we are creating
- Our priority is to get you up-and-running on Elastic MapReduce, and confident enough to write your own Hadoop jobs

Part 1: a simple Pig Latin job on EMR

What is Pig (Latin)?
- Pig is a high-level platform for creating MapReduce jobs which can run on Hadoop
- The language you write Pig jobs in is called Pig Latin
- For quick-and-dirty scripts, Pig just works
[Diagram of the Hadoop stack: Cascading, Crunch, Hive, Pig and Java on top of Hadoop MapReduce, on top of Hadoop DFS]

Let's all come up with a unique name for ourselves:
- Lowercase letters, no spaces or hyphens or anything. E.g. I will be alexsnowplow; please come up with a unique name for yourself!
- It will be visible to other participants, so choose something you don't mind being public
- In the rest of this workshop, wherever you see YOURNAME, replace it with your unique name

Let's restart our Vagrant and do some setup:
$ mkdir zero2hadoop
$ aws configure
// And type in:
AWS Access Key ID [None]: AKIAILD6DCBTFI642JPQ
AWS Secret Access Key [None]: KMVdr/bsq4FDTI5H143K3gjt4ErG2oTjd+1+a+ou
Default region name [None]: eu-west-1
Default output format [None]:

Let's create some buckets in Amazon S3; this is where our data and our apps will live:
$ aws s3 mb s3://zero2hadoop-in-YOURNAME
$ aws s3 mb s3://zero2hadoop-out-YOURNAME
$ aws s3 mb s3://zero2hadoop-jobs-YOURNAME
// Check those worked
$ aws s3 ls

Let's get some source data uploaded:
$ mkdir -p ~/zero2hadoop/part1/in
$ cd ~/zero2hadoop/part1/in
$ wget https://raw.githubusercontent.com/snowplow/scalding-example-project/master/data/hello.txt
$ cat hello.txt
Hello world
Goodbye world
$ aws s3 cp hello.txt s3://zero2hadoop-in-YOURNAME/part1/hello.txt

Let's get our EMR command-line tools installed (1/2):
$ /vagrant/emr-cli/elastic-mapreduce
$ rvm install ruby-1.8.7-head
$ rvm use 1.8.7
$ alias emr=/vagrant/emr-cli/elastic-mapreduce

Let's get our EMR command-line tools installed (2/2). Add this file:
{
  "access_id": "AKIAI55OSYYRLYWLXH7A",
  "private_key": "SHRXNIBRdfWuLPbCt57ZVjf+NMKUjm9WTknDHPTP",
  "region": "eu-west-1"
}
to: /vagrant/emr-cli/credentials.json
(sudo sntp -s 24.56.178.140)

Let's get our EMR command-line tools installed (2/2):
// This should work fine now:
$ emr --list
<no output>

Let's do some local file work:
$ mkdir -p ~/zero2hadoop/part1/pig
$ cd ~/zero2hadoop/part1/pig
$ wget https://gist.githubusercontent.com/alexanderdean/d8371cebdf00064591ae/raw/cb3030a6c48b85d101e296ccf27331384df3288d/wordcount.pig
// The original: https://gist.github.com/alexanderdean/d8371cebdf00064591ae

Now upload to S3:
$ aws s3 cp wordcount.pig s3://zero2hadoop-jobs-YOURNAME/part1/
$ aws s3 ls --recursive s3://zero2hadoop-jobs-YOURNAME/part1/
2014-06-06 09:10:31        674 part1/wordcount.pig

And now we run our Pig script:
$ emr --create --name "part1 YOURNAME" \
  --set-visible-to-all-users true \
  --pig-script s3n://zero2hadoop-jobs-YOURNAME/part1/wordcount.pig \
  --ami-version 2.0 \
  --args "-p,input=s3n://zero2hadoop-in-YOURNAME/part1, \
  -p,output=s3n://zero2hadoop-out-YOURNAME/part1"

Let's check out the jobs running in Elastic MapReduce, first at the console:
$ emr --list
j-1hr90swpp40m4     STARTING     part1 YOURNAME
   PENDING     Setup Pig
   PENDING     Run Pig Script

and also in the UI

Okay, let's check the output of our job! (1/2)
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part1
2014-06-06 09:57:53          0 part1/_SUCCESS
2014-06-06 09:57:50         26 part1/part-r-00000

Okay, let's check the output of our job! (2/2)
$ mkdir -p ~/zero2hadoop/part1/out
$ cd ~/zero2hadoop/part1/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part1 .
$ ls
part-r-00000  _SUCCESS
$ cat part-r-00000
2 world
1 Hello
1 Goodbye

Part 2: a simple Scalding job on EMR

What is Scalding? Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:
[Diagram of the stack: Scalding, Cascalog, PyCascading and cascading.jruby on top of Cascading; Cascading, Pig and Java on top of Hadoop MapReduce; Hadoop MapReduce on top of Hadoop DFS]

Cascading has a plumbing abstraction over vanilla MapReduce which should be quite comfortable to DW practitioners

Scalding improves further on Cascading by reducing boilerplate and making more complex pipelines easier to express:
- Scalding is written in Scala and removes a lot of boilerplate versus vanilla Cascading, so it is easier to look at a job in its entirety and see what it does
- Scalding was created and is supported by Twitter, who use it throughout their organization
- We believe that data pipelines should be as strongly typed as possible; all the other DSLs/APIs on top of Cascading encourage dynamic typing

Strongly typed data pipelines, why?
- Catch errors as soon as possible, and report them in a strongly typed way too
- Define the inputs and outputs of each of your data processing steps in an unambiguous way
- Forces you to formally address the data types flowing through your system
- Lets you write code like the sketch below:
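
The code shown on the original slide is not in the transcript, so here is a minimal, hypothetical sketch of a strongly typed Scalding pipeline using the Typed API (the PageView case class, field names and job name are illustrative assumptions, not Snowplow code):

import com.twitter.scalding._

// Hypothetical event type: the compiler knows exactly what flows through the pipeline
case class PageView(userId: String, url: String, timestamp: Long)

class PageViewCountJob(args: Args) extends Job(args) {
  // Read (userId, url, timestamp) rows from a tab-separated input
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (userId, url, timestamp) => PageView(userId, url, timestamp) }
    .groupBy(_.url) // typed grouping: keys are Strings, values are PageViews
    .size           // count page views per URL, giving (String, Long) pairs
    .write(TypedTsv[(String, Long)](args("output")))
}

A type error here (for example, writing to a TypedTsv[(String, String)] sink) is caught at compile time rather than halfway through a Hadoop job.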

Okay, let's get started! Head to https://github.com/snowplow/scalding-example-project

Let's get this code down locally and build it:
$ mkdir -p ~/zero2hadoop/part2
$ cd ~/zero2hadoop/part2
$ git clone git://github.com/snowplow/scalding-example-project.git
$ cd scalding-example-project
$ sbt assembly

Here is our MapReduce code
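
The code screenshot itself is not in the transcript. The job being shown is the WordCountJob from the scalding-example-project; the following is a reconstruction of roughly what it looks like (a sketch, not a verbatim copy of the repository):

package com.snowplowanalytics.hadoop.scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Read each line of input, split it into words, count each word, write word<TAB>count
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Lowercase the text, strip punctuation and split on whitespace
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}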

Good, tests are passing; now let's upload this to S3 so it's available to our EMR job:
$ aws s3 cp target/scala-2.10/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
// If that doesn't work:
$ aws s3 cp s3://snowplow-hosted-assets/third-party/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
$ aws s3 ls s3://zero2hadoop-jobs-YOURNAME/part2/
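
As an aside, the tests referred to above run locally as part of the build. A minimal, hypothetical sketch of a Scalding JobTest-style spec for a word count job (class names and expectations are illustrative, not the project's actual test file) might look like:

import com.twitter.scalding._
import org.specs2.mutable.Specification

class WordCountJobSpec extends Specification {
  "A WordCount job" should {
    // Run the job locally against fake input and inspect the captured output
    JobTest(new WordCountJob(_))
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      .source(TextLine("inputFile"), List((0, "Hello world"), (1, "Goodbye world")))
      .sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        "count words correctly" in {
          outputBuffer.toMap must havePairs("hello" -> 1, "goodbye" -> 1, "world" -> 2)
        }
      }
      .run
      .finish
  }
}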

And now we run it!
$ emr --create --name "part2 YOURNAME" \
  --set-visible-to-all-users true \
  --jar s3n://zero2hadoop-jobs-YOURNAME/part2/scalding-example-project-0.0.5.jar \
  --arg com.snowplowanalytics.hadoop.scalding.WordCountJob \
  --arg --hdfs \
  --arg --input --arg s3n://zero2hadoop-in-YOURNAME/part1/hello.txt \
  --arg --output --arg s3n://zero2hadoop-out-YOURNAME/part2

Let's check out the jobs running in Elastic MapReduce, first at the console:
$ emr --list
j-1m62igrepl7i     STARTING     scalding-example-project
   PENDING     Example Jar Step

and also in the UI

Okay, let's check the output of our job!
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part2
$ mkdir -p ~/zero2hadoop/part2/out
$ cd ~/zero2hadoop/part2/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part2 .
$ ls
$ cat part-00000
goodbye 1
hello 1
world 2

Part 3: a more complex Scalding job on EMR

Let's explore another tutorial together: https://github.com/sharethrough/scalding-emr-tutorial

Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To talk offline: @alexcrdean on Twitter or alex@snowplowanalytics.com