Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster. Dan Şerban



Similar documents
Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Tutorial for Assignment 2.0

Chapter 7. Using Hadoop Cluster and MapReduce

How To Use Hadoop

Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster"

Tutorial for Assignment 2.0

Hadoop Parallel Data Processing

BIG DATA TRENDS AND TECHNOLOGIES

L1: Introduction to Hadoop

Simple Parallel Computing in R Using Hadoop

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Getting Started with Hadoop with Amazon s Elastic MapReduce

NoSQL and Hadoop Technologies On Oracle Cloud

CSE-E5430 Scalable Cloud Computing Lecture 2

BIG DATA USING HADOOP

Hadoop. Bioinformatics Big Data

HADOOP MOCK TEST HADOOP MOCK TEST II

Introduction To Hive

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

Big Data : Experiments with Apache Hadoop and JBoss Community projects

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Cloud Computing. Chapter Hadoop

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Hadoop Architecture. Part 1

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

CS455 - Lab 10. Thilina Buddhika. April 6, 2015

Hadoop/MapReduce Workshop. Dan Mazur, McGill HPC July 10, 2014

Yuji Shirasaki (JVO NAOJ)

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Package hive. January 10, 2011

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

NoSQL for SQL Professionals William McKnight

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Open source Google-style large scale data analysis with Hadoop

Extreme computing lab exercises Session one

Large scale processing using Hadoop. Ján Vaňo

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Hadoop implementation of MapReduce computational model. Ján Vaňo

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Hadoop Streaming. Table of contents

Sector vs. Hadoop. A Brief Comparison Between the Two Systems

Integrating Big Data into the Computing Curricula

How To Analyze Network Traffic With Mapreduce On A Microsoft Server On A Linux Computer (Ahem) On A Network (Netflow) On An Ubuntu Server On An Ipad Or Ipad (Netflower) On Your Computer

Are You Ready for Big Data?

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Hadoop MultiNode Cluster Setup

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

HADOOP CLUSTER SETUP GUIDE:

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Workshop on Hadoop with Big Data

Introduction to Hadoop

Leveraging BlobSeer to boost up the deployment and execution of Hadoop applications in Nimbus cloud environments on Grid 5000

THE HADOOP DISTRIBUTED FILE SYSTEM

Hadoop/MapReduce Workshop

How to install Apache Hadoop in Ubuntu (Multi node setup)

Sunnie Chung. Cleveland State University

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

MapReduce and Hadoop Distributed File System V I J A Y R A O

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

Extreme computing lab exercises Session one

Distributed Computing and Big Data: Hadoop and MapReduce

Real-time Big Data Analytics with Storm

Hadoop and Map-Reduce. Swati Gore

Processing of Hadoop using Highly Available NameNode

HDFS Cluster Installation Automation for TupleWare

Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Hadoop (pseudo-distributed) installation and configuration

Data-Intensive Computing with Map-Reduce and Hadoop

CDH AND BUSINESS CONTINUITY:

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Networking in the Hadoop Cluster

Using In-Memory Computing to Simplify Big Data Analytics

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

CS242 PROJECT. Presented by Moloud Shahbazi Spring 2015

Hadoop Setup Walkthrough

Big Data and Apache Hadoop s MapReduce

Log Mining Based on Hadoop s Map and Reduce Technique

Infrastructures for big data

A bit about Hadoop. Luca Pireddu. March 9, CRS4Distributed Computing Group. (CRS4) Luca Pireddu March 9, / 18

Big Data: Tools and Technologies in Big Data

Map Reduce & Hadoop Recommended Text:

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

How to properly misuse Hadoop. Marcel Huntemann NERSC tutorial session 2/12/13

Hadoop IST 734 SS CHUNG

map/reduce connected components

Transcription:

Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster Dan Şerban

Agenda :: Introduction - Real-world uses of MapReduce - The origins of Hadoop - Hadoop facts and architecture :: Part 1 - Deploying Hadoop :: Part 2 - MapReduce is machine learning :: Q&A

Why shopping cart analysis is useful to amazon.com

Linkedin and Google Reader

The origins of Hadoop :: Hadoop got its start in Nutch :: A few enthusiastic developers were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers :: Once Google published their GoogleFS and MapReduce whitepapers, the way forward became clear :: Google had devised systems to solve precisely the problems the Nutch project was facing :: Thus, Hadoop was born

Hadoop facts :: Hadoop is a distributed computing platform for processing extremely large amounts of data :: Hadoop is divided into two main components: - the MapReduce runtime - the Hadoop Distributed File System (HDFS) :: The MapReduce runtime allows the user to submit MapReduce jobs :: The HDFS is a distributed file system that provides a logical interface for persistent and redundant storage of large data :: Hadoop also provides the HadoopStreaming library that leverages STDIN and STDOUT so you can write mappers and reducers in your programming language of choice

Hadoop facts :: Hadoop is based on the principle of moving computation to where the data is :: Data stored on the Hadoop Distributed File System is broken up into chunks and replicated across the cluster providing fault tolerant parallel processing and redundancy for both the data and the jobs :: Computation takes the form of a job which consists of a map phase and a reduce phase :: Data is initially processed by map functions which run in parallel across the cluster :: Map output is in the form of key-value pairs :: The reduce phase then aggregates the map results :: The reduce phase typically happens in multiple consecutive waves until the job is complete

Hadoop architecture

Part 1: Configuring and deploying the Hadoop cluster

Hands-on with Hadoop

core-site.xml - before

core-site.xml - after

hdfs-site.xml - before

hdfs-site.xml - after

mapred-site.xml - before

mapred-site.xml - after

Setting up SSH :: hadoop@hadoop-master needs to be able to ssh* into: - hadoop@hadoop-master - hadoop@chunkserver-a - hadoop@chunkserver-b - hadoop@chunkserver-c :: hadoop@job-tracker needs to be able to ssh* into: - hadoop@job-tracker - hadoop@chunkserver-a - hadoop@chunkserver-b - hadoop@chunkserver-c *Passwordless-ly and passphraseless-ly

Hands-on with Hadoop

Hands-on with Hadoop

Hands-on with Hadoop

Hands-on with Hadoop

Hands-on with Hadoop

Hands-on with Hadoop

Hands-on with Hadoop

Part 2: MapReduce is machine learning

Rolling your own self-hosted alternative to...

Hands-on with MapReduce

Hands-on with MapReduce

Hands-on with MapReduce

mapper.py #!/usr/bin/python import sys for line in sys.stdin: line = line.strip() IDs = line.split() for firstid in IDs: for secondid in IDs: if secondid > firstid: print '%s_%s\t%s' % (firstid, secondid, 1)

reducer.py #!/usr/bin/python import sys subtotals = {} for line in sys.stdin: line = line.strip() word = line.split('\t')[0] count = int(line.split('\t')[1]) subtotals[word] = subtotals.get(word, 0) + count for k, v in subtotals.items(): print "%s\t%s" % (k, v)

Hands-on with MapReduce

Hands-on with MapReduce

Hands-on with MapReduce

Hands-on with MapReduce

Hands-on with MapReduce

Other MapReduce use cases :: :: :: :: :: :: :: :: :: :: Google Suggest Video recommendations (YouTube) ClickStream Analysis (large web properties) Spam filtering and contextual advertising (Yahoo) Fraud detection (ebay, CC companies) Firewall log analysis to discover exfiltration and other undesirable (possibly malware-related) activity Finding patterns in social data, analyzing likes and building a search engine on top of them (FaceBook) Discovering microblogging trends and opinion leaders, analyzing who follows who (Twitter) Plain old supermarket shopping basket analysis The semantic web

Questions / Feedback

Bonus slide: Making of SQLite DB