Workshop: From Zero. Budapest DW Forum 2014
1 Workshop: From Zero to _ Budapest DW Forum 2014
2 Agenda today 1. Some setup before we start 2. (Back to the) introduction 3. Our workshop today 4. Part 1: a simple Pig Latin job on EMR 5. Part 2: a simple Scalding job on EMR 6. Part 3: a more complex Scalding job on EMR
3 Some setup before we start
4 There is a lot to copy and paste, so let's all join a Google Hangout chat. If I forget to paste some content into the chat room, just shout out and remind me
5 First, let's all download and set up VirtualBox and Vagrant
6 Now let's set up our development environment $ vagrant plugin install vagrant-vbguest If you have git already installed: $ git clone --recursive If not: $ wget $ unzip temp.zip $ wget $ unzip temp.zip
7 Now let's set up our development environment $ cd dev-environment $ vagrant up $ vagrant ssh
8 Final step for now, let's install some software $ ansible-playbook /vagrant/ansible-playbooks/aws-tools.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local $ ansible-playbook /vagrant/ansible-playbooks/scala-sbt.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
9 (Back to the) introduction
10 Snowplow is an open-source web and event analytics platform, built on Hadoop. Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008. We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow. We built Snowplow on top of Hadoop from the very start
11 We wanted to take a fresh approach to web analytics: your own web event data -> in your own data warehouse; your own event data model; slice/dice and mine the data in highly bespoke ways to answer your specific business questions; plug in the broadest possible set of analysis tools to drive value from your data. [Diagram: data pipeline -> data warehouse -> analyse your data in any analysis tool]
12 And we saw the potential of new big data technologies and services to solve these problems in a scalable, low-cost manner: CloudFront, Amazon S3, Amazon EMR, Amazon Redshift. These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
13 Our Snowplow event processing flow runs on Hadoop, specifically Amazon's Elastic MapReduce hosted Hadoop service. [Diagram, the Snowplow Hadoop data pipeline: website/webapp with JavaScript event tracker -> CloudFront-based or Clojure-based event collector -> Amazon S3 -> Scalding-based enrichment on Hadoop -> Amazon Redshift / PostgreSQL]
14 Why did we pick Hadoop? Scalability: we have customers processing 350m Snowplow events a day; their Hadoop job runs in under 2 hours. Easy to reprocess data: if business rules change, we can fire up a large cluster and re-process all historical raw Snowplow events. Highly testable: we write unit and integration tests for our jobs and run them locally, giving us confidence that our jobs will run correctly at scale on Hadoop
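To make the "highly testable" claim concrete: Scalding (the tool we use in Part 2) ships a JobTest harness that runs a job in local mode against in-memory data, so you can assert on its output without a cluster. The sketch below is based on the test suite of the scalding-example-project we build later; treat the exact names and values as illustrative assumptions.

import org.specs._
import com.twitter.scalding._

class WordCountJobSpec extends Specification with TupleConversions {

  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob")
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      // Feed the job an in-memory input "file"...
      .source(TextLine("inputFile"), List((0, "hack hack hack and hack")))
      // ...and capture what it writes to its output sink
      .sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }
      .run
      .finish
  }
}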
15 And why Amazon's Elastic MapReduce (EMR)? No need to run our own cluster: running your own Hadoop cluster is a huge pain, and not for the fainthearted. By contrast, EMR just works (most of the time!). Elastic: Snowplow runs as a nightly (sometimes more frequent) batch job. We spin up the EMR cluster to run the job, and shut it down straight after. Interop with other AWS services: EMR works really well with Amazon S3 as a file store. We are big fans of Amazon Redshift (hosted columnar database) too
16 Our workshop today
17 Hadoop is complicated
18 for our workshop today, we will stick to using Elastic MapReduce and try to avoid any unnecessary complexity
19 and we will learn by doing! There are lots of books and articles about Hadoop and the theory of MapReduce. We will learn by doing: no theory unless it's required to directly explain the jobs we are creating. Our priority is to get you up-and-running on Elastic MapReduce, and confident enough to write your own Hadoop jobs
20 Part 1: a simple Pig Latin job on EMR
21 What is Pig (Latin)? Pig is a high-level platform for creating MapReduce jobs which can run on Hadoop. The language you write Pig jobs in is called Pig Latin. For quick-and-dirty scripts, Pig just works. [Diagram: Cascading, Crunch, Hive and Pig sit on top of Java Hadoop MapReduce, which sits on top of Hadoop DFS]
22 Let's all come up with a unique name for ourselves. Lowercase letters, no spaces or hyphens or anything. E.g. I will be alexsnowplow. Please come up with a unique name for yourself! It will be visible to other participants, so choose something you don't mind being public. In the rest of this workshop, wherever you see YOURNAME, replace it with your unique name
23 Let's restart our Vagrant and do some setup $ mkdir zero2hadoop $ aws configure // And type in: AWS Access Key ID [None]: AKIAILD6DCBTFI642JPQ AWS Secret Access Key [None]: KMVdr/bsq4FDTI5H143K3gjt4ErG2oTjd+1+a+ou Default region name [None]: eu-west-1 Default output format [None]:
24 Let's create some buckets in Amazon S3; this is where our data and our apps will live $ aws s3 mb s3://zero2hadoop-in-YOURNAME $ aws s3 mb s3://zero2hadoop-out-YOURNAME $ aws s3 mb s3://zero2hadoop-jobs-YOURNAME // Check those worked $ aws s3 ls
25 Let's get some source data uploaded $ mkdir -p ~/zero2hadoop/part1/in $ cd ~/zero2hadoop/part1/in $ wget https://raw.github.com/snowplow/scalding-example-project/master/data/hello.txt $ cat hello.txt Hello world Goodbye world $ aws s3 cp hello.txt s3://zero2hadoop-in-YOURNAME/part1/hello.txt
26 Let's get our EMR command-line tools installed (1/2) $ /vagrant/emr-cli/elastic-mapreduce $ rvm install ruby head $ rvm use $ alias emr=/vagrant/emr-cli/elastic-mapreduce
27 Let's get our EMR command-line tools installed (2/2) Add this file: { "access_id": "AKIAI55OSYYRLYWLXH7A", "private_key": "SHRXNIBRdfWuLPbCt57ZVjf+NMKUjm9WTknDHPTP", "region": "eu-west-1" } to: /vagrant/emr-cli/credentials.json (sudo sntp -s )
28 Let's get our EMR command-line tools installed (2/2) // This should work fine now: $ emr --list <no output>
29 Let's do some local file work $ mkdir -p ~/zero2hadoop/part1/pig $ cd ~/zero2hadoop/part1/pig $ wget https://gist.github.com/alexanderdean/d8371cebdf...ae/raw/cb3030a6c48b85d1...01e296ccf...df3288d/wordcount.pig
30 Now upload to S3 $ aws s3 cp wordcount.pig s3://zero2hadoop-jobs-YOURNAME/part1/ $ aws s3 ls --recursive s3://zero2hadoop-jobs-YOURNAME/part1/ // You should see part1/wordcount.pig listed
31 And now we run our Pig script $ emr --create --name "part1 YOURNAME" \ --set-visible-to-all-users true \ --pig-script s3n://zero2hadoop-jobs-YOURNAME/part1/wordcount.pig \ --ami-version 2.0 \ --args "-p,input=s3n://zero2hadoop-in-YOURNAME/part1, \ -p,output=s3n://zero2hadoop-out-YOURNAME/part1"
32 Let's check out the jobs running in Elastic MapReduce, first at the console $ emr --list j-1hr90swpp40m4 STARTING part1 YOURNAME PENDING Setup Pig PENDING Run Pig Script
33 and also in the UI
34 Okay let's check the output of our job! (1/2) $ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part1 // You should see a 0-byte part1/_SUCCESS marker and a 26-byte part1/part-r-00000 results file
35 Okay let's check the output of our job! (2/2) $ mkdir -p ~/zero2hadoop/part1/out $ cd ~/zero2hadoop/part1/out $ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part1 . $ ls part-r-00000 _SUCCESS $ cat part-r-00000 world 2 Hello 1 Goodbye 1
36 Part 2: a simple Scalding job on EMR
37 What is Scalding? Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop. [Diagram: Scalding, Cascalog, PyCascading and cascading.jruby sit on top of Cascading; Cascading and Pig sit on top of Java Hadoop MapReduce and Hadoop DFS]
38 Cascading has a plumbing abstraction over vanilla MapReduce which should feel quite comfortable to DW practitioners
39 Scalding improves further on Cascading by reducing boilerplate and making more complex pipelines easier to express. Scalding, written in Scala, removes a lot of boilerplate versus vanilla Cascading, so it is easier to look at a job in its entirety and see what it does. Scalding was created and is supported by Twitter, who use it throughout their organization. We believe that data pipelines should be as strongly typed as possible; all the other DSLs/APIs on top of Cascading encourage dynamic typing
40 Strongly typed data pipelines: why? Catch errors as soon as possible, and report them in a strongly typed way too. Define the inputs and outputs of each of your data processing steps in an unambiguous way. Forces you to formally address the data types flowing through your system. Lets you write code like this:
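As an illustration of the kind of code this enables, here is a sketch using Scalding's typed API; the PageView case class and PageViewCountJob name are invented for this example, not the slide's exact code.

import com.twitter.scalding._

// The compiler now knows exactly what flows through each step
case class PageView(userId: String, url: String)

class PageViewCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(String, String)](args("input")))
    .map { case (userId, url) => PageView(userId, url) } // TypedPipe[PageView]
    .groupBy(_.userId)                                   // keyed by String
    .size                                                // (userId, view count)
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}

A type error, such as writing the counts to a TypedTsv[(String, String)], is caught at compile time rather than surfacing as a runtime failure halfway through an EMR job.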
41 Okay let's get started! Head to
42 Let's get this code down locally and build it $ mkdir -p ~/zero2hadoop/part2 $ cd ~/zero2hadoop/part2 $ git clone git://github.com/snowplow/scalding-example-project.git $ cd scalding-example-project $ sbt assembly
43 Here is our MapReduce code
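The slide shows the WordCountJob from the scalding-example-project we just cloned, which looks essentially like the following; reproduced from that project, so treat minor details as approximate.

package com.snowplowanalytics.hadoop.scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {

  // Read each line, split it into words, then group and count them
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a piece of text into individual words
  def tokenize(text: String): Array[String] = {
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}

Note that the tokenize step lowercases everything, which is why the output we check shortly reads goodbye/hello/world rather than Goodbye/Hello/world.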
44 Good, tests are passing, now let's upload this to S3 so it's available to our EMR job $ aws s3 cp target/scala-2.10/scalding-example-project-VERSION.jar s3://zero2hadoop-jobs-YOURNAME/part2/ // If that doesn't work: $ aws s3 cp s3://snowplow-hosted-assets/third-party/scalding-example-project-VERSION.jar s3://zero2hadoop-jobs-YOURNAME/part2/ $ aws s3 ls s3://zero2hadoop-jobs-YOURNAME/part2/ // where VERSION is the version number printed by sbt assembly
45 And now we run it! $ emr --create --name "part2 YOURNAME" \ --set-visible-to-all-users true \ --jar s3n://zero2hadoop-jobs-YOURNAME/part2/scalding-example-project-VERSION.jar \ --arg com.snowplowanalytics.hadoop.scalding.WordCountJob \ --arg --hdfs \ --arg --input --arg s3n://zero2hadoop-in-YOURNAME/part1/hello.txt \ --arg --output --arg s3n://zero2hadoop-out-YOURNAME/part2
46 Let's check out the jobs running in Elastic MapReduce, first at the console $ emr --list j-1m62igrepl7i STARTING scalding-example-project PENDING Example Jar Step
47 and also in the UI
48 Okay let's check the output of our job! $ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part2 $ mkdir -p ~/zero2hadoop/part2/out $ cd ~/zero2hadoop/part2/out $ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part2 . $ ls $ cat part-00000 goodbye 1 hello 1 world 2
49 Part 3: a more complex Scalding job on EMR
50 Let's explore another tutorial together
51 Questions? To talk on Twitter or