MapReduce with Apache Hadoop Analysing Big Data




MapReduce with Apache Hadoop: Analysing Big Data
April 2010
Gavin Heavyside
gavin.heavyside@journeydynamics.com

About Journey Dynamics
- Founded in 2006 to develop software technology to address the issues of congestion, fuel efficiency, driving safety and eco-driving
- Based in the Surrey Technology Centre, Guildford, UK
- Analyse large amounts (TB) of GPS data from cars, vans & trucks
- TrafficSpeedsEQ - Accurate traffic speed forecasts by hour of day and day of week for every link in the road network
- MyDrive - Unique & sophisticated system that learns how drivers behave
  - Drivers can improve fuel economy
  - Insurance companies can understand driver risk
  - Navigation devices can improve route choice & ETA
  - Fleet managers can monitor their fleet to improve safety & eco-driving

Big Data
- Data volumes increasing
- NYSE: 1TB new trade data/day
- Google: Processes 20PB/day (Sep 2007) http://tcrn.ch/agyjel
- LHC: 15PB data/year
- Facebook: several TB photos uploaded/day

Medium Data
- Most of us aren't at Google or Facebook scale
- But: data at the GB/TB scale is becoming more common
- Outgrow conventional databases
- Disks are cheap, but slow
  - 1TB drive - 50
  - About 2.8 hours to read 1TB at 100MB/s
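The read-time figure on this slide is simple arithmetic, and a short sketch makes it easy to rerun with other numbers. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential throughput. The 100-drive comparison is a hypothetical added for illustration, not a figure from the slides. With these assumptions a single drive takes a little under three hours.

```python
# Time to stream a full drive at a fixed throughput.
# Assumption: decimal units, 1 TB = 10**12 bytes and 1 MB = 10**6 bytes.
TB = 10**12
MB = 10**6


def read_time_hours(size_bytes, throughput_bytes_per_s):
    """Hours needed to read size_bytes at a constant throughput."""
    return size_bytes / throughput_bytes_per_s / 3600


# One 1TB drive read at 100MB/s, as on the slide.
single_drive = read_time_hours(1 * TB, 100 * MB)

# Hypothetical comparison: the same 1TB spread evenly over 100 drives
# that are all read in parallel at 100MB/s each.
hundred_drives = read_time_hours(1 * TB / 100, 100 * MB)

print(f"1 drive:    {single_drive:.2f} hours")
print(f"100 drives: {hundred_drives * 60:.1f} minutes")
```

Spreading the data over many disks and reading them in parallel is the motivation for the distributed storage described in the following slides.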

Two Challenges Managing lots of data Doing something useful with it 5

Managing Lots of Data
- Access and analyse any or all of your data
- SAN technologies (FC, iSCSI, NFS)
- Querying (MySQL, PostgreSQL, Oracle)
- Cost, network bandwidth, concurrent access, resilience
- When you have 1000s of nodes, MTBF < 1 day
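The MTBF point can be checked with a rough calculation. The sketch below assumes nodes fail independently at a constant rate, which is a simplification, and uses a hypothetical per-node MTBF of 3 years that does not come from the slides.

```python
# Expected time between failures somewhere in a cluster.
# Assumption: nodes fail independently at a constant rate, so failure
# rates add up and the cluster MTBF is the node MTBF divided by node count.
HOURS_PER_YEAR = 24 * 365


def cluster_mtbf_hours(node_mtbf_hours, node_count):
    """Mean time between failures across the whole cluster, in hours."""
    return node_mtbf_hours / node_count


# Hypothetical per-node MTBF of 3 years; not a figure from the slides.
node_mtbf = 3 * HOURS_PER_YEAR

for nodes in (1, 100, 1000, 5000):
    hours = cluster_mtbf_hours(node_mtbf, nodes)
    print(f"{nodes:>5} nodes: a failure about every {hours:.1f} hours")
```

Under these assumptions a cluster of a few thousand nodes sees a failure several times a day, so the storage and processing layers have to treat failure as routine.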

Analysing Lots of Data
- Parallel processing
- HPC
- Grid Computing
- MPI
- Sharding
- Too big for memory, specialised HW, complex, scalability
- Hardware reliability in large clusters

Apache Hadoop
- Reliable, scalable distributed computing platform
- HDFS - high throughput fault-tolerant distributed file system
- MapReduce - fault-tolerant distributed processing
- Runs on commodity hardware
- Cost-effective
- Open source (Apache License)

Hadoop History
- 2003-2004 Google publishes MapReduce & GFS papers
- 2004 Doug Cutting adds DFS & MapReduce to Nutch
- 2006 Cutting joins Yahoo!, Hadoop moves out of Nutch
- Jan 2008 - top level Apache project
- April 2010: 95 companies on PoweredBy Hadoop wiki
  - Yahoo!, Twitter, Facebook, Microsoft, New York Times, LinkedIn, Last.fm, IBM, Baidu, Adobe

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term" - Doug Cutting

Hadoop Ecosystem
- HDFS
- MapReduce
- HBase
- ZooKeeper
- Pig
- Hive
- Chukwa
- Avro

Anatomy of a Hadoop Cluster
[Diagram: a Namenode and a JobTracker, with Rack 1 through Rack n; each rack holds four nodes, each running a Tasktracker and a Datanode]

HDFS
- Reliable shared storage
- Modelled after GFS
- Very large files
- Streaming data access
- Commodity Hardware
- Replication
- Tolerate regular hardware failure

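HDFS
- Block size 64MB
- Default replication factor = 3
[Diagram: a file split into blocks 1 2 3 4 5 is stored in HDFS across five datanodes holding blocks (1 2 3), (2 3 4), (3 4 5), (4 5 1) and (5 1 2)]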
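The layout in the diagram can be reproduced with a small sketch. It uses the block size and replication factor from the slide, but the round-robin placement rule is an assumption chosen only to match the picture. Real HDFS placement is rack-aware and is not modelled here. The function names are hypothetical.

```python
import math
from itertools import combinations

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB block size, from the slide
REPLICATION = 3                # default replication factor, from the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold file_size bytes."""
    return math.ceil(file_size / block_size)


def place_blocks(num_blocks, num_nodes, replication=REPLICATION):
    """Illustrative round-robin placement (an assumption, not HDFS's policy).

    Block b is stored on datanode b and on the datanodes just before it,
    wrapping around, so each block ends up on `replication` distinct nodes
    as long as replication <= num_nodes.
    """
    layout = {node: [] for node in range(1, num_nodes + 1)}
    for block in range(1, num_blocks + 1):
        for k in range(replication):
            node = (block - 1 - k) % num_nodes + 1
            layout[node].append(block)
    return layout


def surviving_blocks(layout, failed_nodes):
    """Blocks still readable when the given datanodes are down."""
    return {b for node, blocks in layout.items()
            if node not in failed_nodes for b in blocks}


# A 320MB file needs five 64MB blocks, matching the five blocks in the diagram.
num_blocks = split_into_blocks(320 * 1024 * 1024)
layout = place_blocks(num_blocks, num_nodes=5)
for node, blocks in layout.items():
    print(f"datanode {node}: blocks {blocks}")

# With three replicas, losing any two datanodes leaves every block readable.
all_blocks = set(range(1, num_blocks + 1))
tolerates_two_failures = all(
    surviving_blocks(layout, set(pair)) == all_blocks
    for pair in combinations(layout, 2)
)
print("survives any two datanode failures:", tolerates_two_failures)
```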


MapReduce
- Based on 2004 Google paper
- Concepts from Functional Programming
- Used for lots of things within Google (and now everywhere)
- Parallel Map => Shuffle & Sort => Parallel Reduce
- Easy to understand and write MapReduce programs
- Move the computation to the data
- Rack-aware
- Linear Scalability
- Works with HDFS, S3, KFS, file:// and more

MapReduce
- Single Threaded MapReduce: cat input/* | map | sort | reduce > output
- Map program parses the input and emits [key,value] pairs
- Sort by key
- Reduce computes output from values with same key
- Map => Sort => Reduce
- Extrapolate to PB of data on thousands of nodes
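The single-threaded pipeline can be sketched in one process. The example below is a word count, chosen here as an assumption for illustration since the slide does not name a job. The function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map: parse each input line and emit (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def reduce_counts(sorted_pairs):
    """Reduce: combine the values of all pairs that share a key.

    Relies on the input being sorted by key, so equal keys are adjacent.
    """
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (key, sum(value for _, value in group))


def run(lines):
    """Map, then sort by key, then reduce, all in a single process."""
    return list(reduce_counts(sorted(map_words(lines))))


sample = ["the quick brown fox", "the lazy dog", "The fox"]
result = run(sample)
for word, count in result:
    print(word, count)
```

The sort step is what brings equal keys together, which is why the reduce function only ever needs to look at adjacent pairs.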

MapReduce Distributed Example
[Diagram: input splits 0 to n are read from HDFS, one Map task per split; map output is sorted, copied to the reducers and merged; each Reduce task writes its output (part 0, part 1) back to HDFS]
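The flow in the diagram can be simulated in a single process. This is a sketch under stated assumptions: the "link_id speed" record format and the max-speed job are hypothetical, and the partition function uses crc32 only so that results are stable across runs. Hadoop's default partitioner likewise hashes the key modulo the number of reducers.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # the diagram shows two outputs, part 0 and part 1


def map_record(line):
    """Parse a hypothetical 'link_id speed' record into a (key, value) pair."""
    link_id, speed = line.split()
    return link_id, float(speed)


def partition(key, num_reducers=NUM_REDUCERS):
    """Choose the reducer for a key; the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split):
    """Map one split, bucket the output by reducer and sort each bucket."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for line in split:
        key, value = map_record(line)
        buckets[partition(key)].append((key, value))
    return [sorted(bucket) for bucket in buckets]


def run_reduce_task(reducer_id, map_outputs):
    """Copy this reducer's bucket from every map output, merge, then reduce."""
    merged = heapq.merge(*(output[reducer_id] for output in map_outputs))
    return {key: max(value for _, value in group)
            for key, group in groupby(merged, key=itemgetter(0))}


# Three input splits; in a real job each would be a block read from HDFS.
splits = [
    ["A1 30", "B2 50"],
    ["A1 45", "C3 20"],
    ["B2 40"],
]
map_outputs = [run_map_task(split) for split in splits]
parts = {r: run_reduce_task(r, map_outputs) for r in range(NUM_REDUCERS)}
for reducer_id, part in parts.items():
    print(f"part {reducer_id}: {part}")
```

Because every map task is independent and every reducer owns a disjoint set of keys, both phases can run in parallel on separate machines.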

MapReduce can be good for:
- Embarrassingly Parallel problems
- Semi-structured or unstructured data
- Index generation
- Log analysis
- Statistical analysis of patterns in data
- Image processing
- Generating map tiles
- Data Mining
- Much, much more

MapReduce is not good for:
- Real-time or low-latency queries
- Some graph algorithms
- Algorithms that can't be split into independent chunks
- Some types of joins*
- Not a replacement for RDBMS
* Can be tricky to write unless you use an abstraction, e.g. Pig, Hive

Writing MapReduce Programs
- Java
- Pipes (C++, sockets)
- Streaming
- Frameworks, e.g. wukong (Ruby), dumbo (Python)
- JVM languages, e.g. JRuby, Clojure, Scala
- Cascading.org
- Cascalog
- Pig
- Hive

Streaming Example (Ruby)
- mapper.rb
- reducer.rb
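The general shape of a streaming job can be sketched as follows. This is not the mapper.rb and reducer.rb from the talk: it is an illustrative word count written in Python, with hypothetical function names. It relies on the streaming convention that mappers and reducers exchange text lines with a tab between key and value, and that reducer input arrives sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key.

    Input lines are sorted by key, so all lines for one key are adjacent.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


# Local stand-in for the framework: map, sort the lines, then reduce.
local_input = ["a b a", "b c"]
local_output = list(reducer(sorted(mapper(local_input))))
for line in local_output:
    print(line)
```

In a real streaming job each function would live in its own script that reads lines from standard input and prints its results to standard output, with the framework doing the sort in between.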

Pig
- High level language for writing data analysis programs
- Runs MapReduce jobs
- Joins, grouping, filtering, sorting, statistical functions
- User-defined functions
- Optional schemas
- Sampling
- Pig Latin is similar to an imperative language: you define the steps to run

Pig Example

Hive
- Data warehousing and querying
- HiveQL - SQL-like language for querying data
- Runs MapReduce jobs
- Joins, grouping, filtering, sorting, statistical functions
- Partitioning of data
- User-defined functions
- Sampling
- Declarative syntax

Hive Example

Getting Started
- http://hadoop.apache.org
- Cloudera Distribution (VM, source, rpm, deb)
- Elastic MapReduce
- Cloudera VM
- Pseudo-distributed cluster

Learn More
- http://hadoop.apache.org
- Books
- Mailing Lists
- Commercial Support & Training, e.g. Cloudera

Related
- Cassandra 0.6 has Hadoop integration - run MapReduce jobs against data in Cassandra
- NoSQL DBs with MapReduce functionality include CouchDB, MongoDB, Riak and more
- RDBMS with MapReduce include Aster, Greenplum, HadoopDB and more

Gavin Heavyside
gavin.heavyside@journeydynamics.com
www.journeydynamics.com