Open source Google-style large scale data analysis with Hadoop




Open source Google-style large scale data analysis with Hadoop
Ioannis Konstantinou
Email: ikons@cslab.ece.ntua.gr
Web: http://www.cslab.ntua.gr/~ikons
Computing Systems Laboratory
School of Electrical and Computer Engineering
National Technical University of Athens

Big Data
- Facebook: 20 TB/day, compressed
- CERN/LHC: 40 TB/day (15 PB/year)
- NYSE: 1 TB/day
- 2009 digital universe: 800,000 petabytes, or 0.8 zettabytes
- Data volume doubles roughly every 18 months (a Moore's-Law-style growth rate)
- 2020 prediction: 35 zettabytes (44 times bigger than 2009)

What is Hadoop?
- A distributed framework for large-scale data processing
- Inspired by Google's architecture: MapReduce and the Google File System
- Can scale to thousands of nodes and petabytes of data
- A top-level Apache project (since 2008)
- Hadoop is open source: written in Java, plus a few shell scripts

Why Hadoop?
- Hadoop is designed to run on cheap commodity hardware; fault-tolerant hardware is expensive
- It automatically handles data replication and node failure
- It does the hard work, so you can focus on processing data

When to use Hadoop?
- There is access to lots of commodity hardware
- The processing can be easily parallelized
- You need to process lots of unstructured data
- Data-intensive applications
- It is OK to run batch jobs (no need for interactive results)

Architecture
- HDFS: distributed file system
  - It is hard to store a petabyte
  - Based on the Google File System
  - Fault-tolerant: handles replication, node failure, etc.
- MapReduce: data-aware parallel computation framework
  - It is even harder to process a petabyte
  - Based on a research paper by Google

Hadoop Distributed File System (1/2)
Master/slave architecture: files are split into one or more blocks, and these blocks are stored in a set of DataNodes.
- One master NameNode:
  - A master server that manages the file system namespace and regulates access to files by clients
  - Determines the mapping of blocks to DataNodes
- Many DataNodes:
  - Serve client read/write requests
  - Create, delete, and replicate blocks
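The block-splitting and block-to-DataNode mapping above can be sketched in plain Python. This is a toy illustration, not Hadoop code: the 64 MB block size and replication factor of 3 are HDFS's classic defaults, and the round-robin placement policy is a simplification of the NameNode's real rack-aware placement.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size: 64 MB
REPLICATION = 3                # classic default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_id, offset, length) tuples covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), offset, length))
        offset += length
    return blocks

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Map each block id to `replication` DataNodes (simple round-robin)."""
    placement = {}
    for block_id, _offset, _length in blocks:
        placement[block_id] = [datanodes[(block_id + r) % len(datanodes)]
                               for r in range(replication)]
    return placement
```

For a 150 MB file this yields three blocks (64 MB, 64 MB, 22 MB), each stored on three DataNodes, so any single node can fail without data loss.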

Hadoop Distributed File System (2/2)
[Architecture diagram: NameNode, DataNodes, and block replication]

MapReduce (1/3)
- A programming model
- A software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes

MapReduce (2/3)
The problem is separated into two phases, Map and Reduce:
- Map: non-overlapping chunks of input data are assigned to separate processes (mappers) that emit a set of intermediate results
- Reduce: the map results are fed to a (usually smaller) number of processes called reducers, which summarize their input into a smaller number of results
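The two phases above can be illustrated with the canonical word-count example. This is a minimal, single-process Python sketch of the programming model, not Hadoop's Java API: the `shuffle` step stands in for the grouping-by-key that the framework performs between the two phases.

```python
from collections import defaultdict

def mapper(chunk):
    """Map phase: emit an intermediate (word, 1) pair for every word in a chunk."""
    for word in chunk.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: summarize all values for one key into a single result."""
    return (key, sum(values))

def map_reduce(chunks):
    """Run the full pipeline: map every chunk, shuffle, then reduce every key."""
    pairs = [pair for chunk in chunks for pair in mapper(chunk)]
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

Because each mapper sees only its own chunk and each reducer sees only one key's values, both phases parallelize trivially across machines; that independence is what lets Hadoop scale the same logic to petabytes.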

MapReduce (3/3)
[Data-flow diagram: input splits, mappers, shuffle, reducers, output]

When should I use it?
Good choice for:
- Indexing log files
- Sorting vast amounts of data
- Image analysis
Bad choice for:
- Figuring π to 1,000,000 digits
- Calculating Fibonacci sequences
- A MySQL replacement

Hadoop MapReduce
Master/slave architecture:
- One JobTracker master:
  - Runs together with the NameNode
  - Receives client job requests
  - Schedules and monitors MapReduce jobs
  - Moves computation near the data
  - Speculative execution
- Many TaskTrackers:
  - Run together with DataNodes
  - Perform I/O operations with DataNodes
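The "move computation near the data" idea above can be sketched as a toy scheduler. This is an illustration of the principle, not Hadoop's actual JobTracker logic: given where each input block's replicas live, prefer to assign the map task to a free TaskTracker that already holds a replica, falling back to any free tracker otherwise.

```python
def schedule(task_blocks, block_locations, free_trackers):
    """Assign each map task (identified by its input block id) to a free
    TaskTracker, preferring data-local assignments.

    task_blocks     -- list of block ids, one map task per block
    block_locations -- dict: block id -> list of trackers holding a replica
    free_trackers   -- trackers currently able to accept a task
    """
    assignment = {}
    free = set(free_trackers)
    for block_id in task_blocks:
        replicas = block_locations.get(block_id, [])
        # Prefer a tracker that holds the data locally (no network transfer).
        local = [t for t in replicas if t in free]
        chosen = local[0] if local else sorted(free)[0]
        assignment[block_id] = chosen
        free.remove(chosen)
    return assignment
```

Running a task where its block already sits avoids shipping tens of megabytes over the network per task, which is why co-locating TaskTrackers with DataNodes matters.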

Typical problems
- Log and/or clickstream analysis of various kinds
- Marketing analytics
- Machine learning and/or sophisticated data mining
- Image processing
- Processing of XML messages
- Web crawling and/or text processing
- General archiving, including of relational/tabular data, e.g. for compliance

Use cases (1/3)
- Large-scale image conversion: 100 Amazon EC2 instances, 4 TB of raw TIFF data converted into 11 million PDFs in 24 hours, for $240
- Internal log processing; reporting, analytics, and machine learning on a cluster of 1,110 machines with 8,800 cores and 12 PB of raw storage; open source contributors (Hive)
- Storing and processing tweets, logs, etc.; open source contributors (hadoop-lzo)

Use cases (2/3)
- 100,000 CPUs in 25,000 computers; content/ads optimization and search indexing; machine learning (e.g. spam filtering); open source contributors (Pig)
- Natural language search (through Powerset): 400 nodes on EC2, storage in S3; open source contributors to HBase
- ElasticMapReduce service: on-demand elastic Hadoop clusters for the cloud

Use cases (3/3)
- ETL processing and statistics generation; advanced algorithms for behavioral analysis and targeting
- Used for discovering "People You May Know" and for other apps; 3 x 30-node clusters with 16 GB RAM and 8 TB storage
- Leading Chinese-language search engine: search log analysis and data mining over 300 TB per week, on clusters of 10 to 500 nodes

Questions