DATA MINING WITH HADOOP AND HIVE Introduction to Architecture




DATA MINING WITH HADOOP AND HIVE Introduction to Architecture. Dr. Wlodek Zadrozny (most slides come from Prof. Akella's class in 2014). 2015-2025. Reproduction or usage prohibited without permission of authors (Dr. Hansen or Dr. Zadrozny). DSBA6100 Big Data Analytics for Competitive Advantage

Data Science Hadoop & Hive. Source: http://www.dataists.com/2010/09/thedata-science-venn-diagram/

Hadoop, Map-Reduce, Hive. A few slides today (with some updates by WZ). The full PDF of Prof. Akella's slides (104 slides) is on Moodle. You'll use it in your projects; we'll review and expand in future lectures (time permitting).

MapReduce and Hadoop

MapReduce: a programming paradigm for clusters of commodity PCs. Maps a computation across many inputs. Fault-tolerant and scalable. A machine-independent programming model that permits programming at an abstract level; the runtime system handles scheduling and load balancing.

First Google server ~1999 Computer History Museum

Data Centers Yahoo data center Google data center layout Harpers, 3/2008

Motivation: Large-Scale Data Processing. Many tasks process lots of data to produce other data. We want to use hundreds or thousands of CPUs, but this needs to be easy. MapReduce provides: automatic parallelization and distribution, fault tolerance, I/O scheduling, and status monitoring.

Example Tasks Finding all occurrences of a string on the web Finding all pages that point to a given page Data analysis of website access log files Clustering web pages

Functional Programming MapReduce is based on the functional programming paradigm, which treats computation as the evaluation of mathematical functions. Map: map result-type function sequence &rest more-sequences. The function must take as many arguments as there are sequences provided; at least one sequence must be provided. The result of map is a sequence such that element j is the result of applying the function to element j of each of the argument sequences. Example: (map 'list #'- '(1 2 3 4)) => (-1 -2 -3 -4). Reduce: reduce function sequence &key :from-end :start :end :initial-value. The reduce function combines all the elements of a sequence using a binary operation; for example, using + one can add up all the elements. Example: (reduce #'+ '(1 2 3 4)) => 10
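The same two primitives exist in most languages; a minimal Python sketch of the Lisp examples above:

```python
from functools import reduce

# map applies a function elementwise, like the Lisp (map 'list #'- ...) example
negated = list(map(lambda x: -x, [1, 2, 3, 4]))   # [-1, -2, -3, -4]

# reduce combines all elements with a binary operation, like (reduce #'+ ...)
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])  # 10

print(negated, total)
```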

MapReduce Programming Model Input and Output: each a set of key/value pairs Programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) Processes input key/value pair Produces set of intermediate pairs reduce (out_key, list(intermediate_value)) -> list(out_value) Combines all intermediate values for a particular key Produces a set of merged output values (usually just one) Inspired by similar primitives in LISP and other languages

Example: Count word occurrences

MapReduce Operations Conceptual: Map: (K1, V1) -> list(K2, V2) Reduce: (K2, list(V2)) -> list(K3, V3) WordCount example: Map: (doc, contents) -> list(word_i, 1) Reduce: (word_i, list(1, 1, ...)) -> list(word_i, count_i)
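The WordCount signatures above can be sketched end to end in plain Python. This is a single-machine toy, not Hadoop: the shuffle phase is simulated by grouping intermediate pairs in a dictionary, and the document IDs and contents are made up.

```python
from collections import defaultdict

def map_fn(doc_id, contents):
    # Map: (doc, contents) -> list of (word, 1) pairs
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, values):
    # Reduce: (word, [1, 1, ...]) -> (word, total count)
    return (word, sum(values))

def run_mapreduce(docs):
    # Shuffle: group all intermediate values by key before reducing
    groups = defaultdict(list)
    for doc_id, contents in docs.items():
        for key, value in map_fn(doc_id, contents):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce({"d1": "the cat sat", "d2": "the cat ran"})
print(counts)  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```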

Execution Overview Dean and Ghemawat, 2008

Parallel Execution 200,000 map / 5,000 reduce tasks with 2,000 machines (Dean and Ghemawat, 2004). Over 1 million MapReduce jobs per day at Facebook last year.

Model has Broad Applicability MapReduce Programs In Google Source Tree Example uses: distributed grep distributed sort web link-graph reversal term-vector per host web access log stats inverted index construction document clustering machine learning statistical machine translation

Usage at Google

Hadoop Open-source Apache project. Written in Java; runs on Linux, Windows, OS X, and Solaris. Hadoop includes: MapReduce, which distributes applications, and HDFS, which distributes data.

Hadoop Design Goals Storage of large data sets Running jobs in parallel Maximizing disk I/O Batch processing

Job Distribution Users submit MapReduce jobs to the jobtracker. The jobtracker puts jobs in a queue and executes them on a first-come, first-served basis. The jobtracker manages the assignment of map and reduce tasks to tasktrackers. Tasktrackers execute tasks upon instruction from the jobtracker and handle data transfer between the map and reduce phases.

Hadoop MapReduce

Data Distribution Data transfer is handled implicitly by HDFS. Move computation to where the data is: data locality. Map tasks are scheduled on the same node where their input data resides. If too much of the input data resides on one node, nearby nodes run the map tasks instead.

Hadoop DFS (HDFS)

Map Reduce and HDFS http://www.michael-noll.com/wiki/running_hadoop_on_ubuntu_linux_(multi-node_cluster)

Data Access CPU and transfer speed, RAM, and disk size double every 18-24 months. Disk seek time is nearly constant (improving only ~5% per year). The time to read an entire disk is growing. Scalable computing should not be limited by disk seek time: throughput is more important than latency.

Original Google Storage Source: Computer History Museum

HDFS Inspired by the Google File System (GFS). Follows a master/slave architecture: an HDFS installation has one Namenode and one or more Datanodes (one per node in the cluster). Namenode: manages the filesystem namespace and regulates file access by clients; makes filesystem namespace operations (open/close/rename of files and directories) available via RPC. Datanode: responsible for serving read/write requests from filesystem clients; also performs block creation/deletion/replication upon instruction from the Namenode.

HDFS Design Goals Very large files: Files may be GB or TB in size Streaming data access: Write once, read many times Throughput more important than latency Commodity hardware Node failure may occur

HDFS Files are broken into blocks of 64 MB by default (the block size is user-configurable). The default replication factor is 3x. The block placement algorithm is rack-aware. The replication factor can be controlled dynamically.
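A sketch of how these defaults might be overridden in hdfs-site.xml (the values shown are illustrative; dfs.blocksize takes bytes, and newer Hadoop releases default to 128 MB rather than 64 MB):

```xml
<configuration>
  <!-- Block size in bytes: 64 MB, the classic HDFS default -->
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>
  </property>
  <!-- Number of replicas kept for each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```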

HDFS Source: http://lucene.apache.org/hadoop/hdfs_design.html

Example HDFS Installation Facebook, 2010 (largest HDFS installation at the time): 2,000 machines, 22,400 cores; 24 TB/machine (21 PB total); writing 12 TB/day; reading 800 TB/day; 25K MapReduce jobs/day; 65 million HDFS files; 30K simultaneous clients. 2014: Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000 tables. More at: https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-data-problems/ 1B users per day (http://fortune.com/2015/08/28/1-billion-facebook/)

Companies first to use MapReduce/Hadoop: Google, Yahoo, Microsoft, Amazon, Facebook, Twitter, Fox Interactive Media, AOL, Hulu, LinkedIn, Disney, NY Times, eBay, Quantcast, Veoh, Joost, Last.fm... any company with enough data.

Hadoop Vendors: IBM BigInsights, Cloudera, Hortonworks, MapR

Hive Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Developed at Facebook to enable analysts to query Hadoop data MapReduce for computation, HDFS for storage, RDBMS for metadata Can use Hive to perform SQL style queries on Hadoop data Most Hive queries generate MapReduce jobs Can perform parallel queries over massive data sets
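A sketch of the kind of SQL-style query Hive accepts (the table name and columns here are hypothetical). Hive compiles the SELECT below into a MapReduce job over data stored in HDFS:

```sql
-- Hypothetical table of web page views
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP);

-- Aggregation query; GROUP BY becomes the shuffle/reduce phase
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```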

Apache Pig Pig: High-level platform for creating MapReduce programs Developed at Yahoo Pig Latin: Procedural language for Pig Ability to use user code at any point in pipeline makes it good for pipeline development
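A Pig Latin sketch of a word-count pipeline (the input file name is hypothetical), illustrating how each step of the pipeline is a named relation that user logic can hook into:

```pig
-- Load each line of a hypothetical input file as a single character array
lines   = LOAD 'input.txt' AS (line:chararray);
-- Split lines into words; TOKENIZE yields a bag, FLATTEN unnests it
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together, then count each group
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```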

Traditional Data Enterprise Architecture Hortonworks, 2013

Emerging Big Data Architecture 1. Collect data 2. Clean and process using Hadoop 3. Push data to Data Warehouse, or use directly in Enterprise applications Hortonworks, 2013

Data flow of meter done manually Awadallah and Graham

Automatic update of meter readings every 15 mins Awadallah and Graham

DW and Hadoop Use Cases Awadallah and Graham