Big Data Too Big To Ignore

Similar documents
Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Hadoop and Map-Reduce. Swati Gore

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Qsoft Inc

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

BIG DATA HANDS-ON WORKSHOP Data Manipulation with Hive and Pig

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Constructing a Data Lake: Hadoop and Oracle Database United!

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee June 3 rd, 2008

Hadoop Ecosystem B Y R A H I M A.

Session: Big Data get familiar with Hadoop to use your unstructured data Udo Brede Dell Software. 22 nd October :00 Sesión B - DB2 LUW

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Hadoop Job Oriented Training Agenda

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

BIG DATA HADOOP TRAINING

Peers Techno log ies Pv t. L td. HADOOP

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Deploying Hadoop with Manager

Bringing Big Data to People

Hadoop implementation of MapReduce computational model. Ján Vaňo

<Insert Picture Here> Big Data

Certified Big Data and Apache Hadoop Developer VS-1221

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Big Data Explained. An introduction to Big Data Science.

CSE-E5430 Scalable Cloud Computing. Lecture 4

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Big Data Course Highlights

How To Scale Out Of A Nosql Database

Large scale processing using Hadoop. Ján Vaňo

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

ITG Software Engineering

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Scaling Up 2 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. HBase, Hive

Apache Hadoop: Past, Present, and Future

Workshop on Hadoop with Big Data

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Implement Hadoop jobs to extract business value from large and varied data sets

Data processing goes big

Big Data and Apache Hadoop Adoption:

HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP

Big Data With Hadoop

HDFS. Hadoop Distributed File System

Internals of Hadoop Application Framework and Distributed File System

Modernizing Your Data Warehouse for Hadoop

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Hadoop 只 支 援 用 Java 開 發 嘛? Is Hadoop only support Java? 總 不 能 全 部 都 重 新 設 計 吧? 如 何 與 舊 系 統 相 容? Can Hadoop work with existing software?

IBM BigInsights Has Potential If It Lives Up To Its Promise. InfoSphere BigInsights A Closer Look

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Open source Google-style large scale data analysis with Hadoop

Using Hadoop for Webscale Computing. Ajay Anand Yahoo! Usenix 2008

Hadoop. Sunday, November 25, 12

Linux Clusters Ins.tute: Turning HPC cluster into a Big Data Cluster. A Partnership for an Advanced Compu@ng Environment (PACE) OIT/ART, Georgia Tech

Hadoop Submitted in partial fulfillment of the requirement for the award of degree of Bachelor of Technology in Computer Science

Hadoop IST 734 SS CHUNG

E6893 Big Data Analytics: Demo Session for HW I. Ruichi Yu, Shuguan Yang, Jen-Chieh Huang Meng-Yi Hsu, Weizhen Wang, Lin Haung.

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

From Relational to Hadoop Part 1: Introduction to Hadoop. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

Hadoop & its Usage at Facebook

A very short Intro to Hadoop

Big Data and Data Science: Behind the Buzz Words

Cost-Effective Business Intelligence with Red Hat and Open Source

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Big Data Introduction

I/O Considerations in Big Data Analytics

Big Data Workshop. dattamsha.com

Cloudera Distributed Hadoop (CDH) Installation and Configuration on Virtual Box

Cloudera Certified Developer for Apache Hadoop

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

MapReduce with Apache Hadoop Analysing Big Data

Big Data Weather Analytics Using Hadoop

The Inside Scoop on Hadoop

HADOOP. Revised 10/19/2015

Microsoft SQL Server 2012 with Hadoop

Dominik Wagenknecht Accenture

Scaling Up HBase, Hive, Pegasus

ITG Software Engineering

Big Data on Microsoft Platform

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Big Data and Apache Hadoop s MapReduce

Big Data Open Source Stack vs. Traditional Stack for BI and Analytics

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Apache Hadoop. Alexandru Costan

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

CSE-E5430 Scalable Cloud Computing Lecture 2

Data Warehouse design

Chase Wu New Jersey Ins0tute of Technology

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION

Big Data Technologies Compared June 2014

Transcription:

Big Data Too Big To Ignore

Geert! Big Data Consultant and Manager! Currently finishing a 3 rd Big Data project! IBM & Cloudera Certified! IBM & Microsoft Big Data Partner 2

Agenda! Defining Big Data! Introduction to Hadoop 3

Our Vision Volume Big Data

Big Data Velocity

Our Vision Volume Variety

Big Data Technical Drivers 7

Big Data Business Drivers Do More ANALYTICS with Less COSTS 8

Forrester Research McKinsey Gartner 9

Transformation of Online Marketing BLOGS.FORBES.COM/DAVEFEINLEIB 10

Transformation of Customer Service BLOGS.FORBES.COM/DAVEFEINLEIB 11

Big Data Definition Big Data Technologies allow you to implement Use Cases which Legacy Technologies can t. 12

Implementing Big Data Our Vision on Data

Current Situation 14

Our Vision #1 Focus on Data not on Derived Data 15

Our Vision #2 Data is immutable 16

Our Vision #3 Query = function (all data) 17

Concept 18

Introducing The Hadoop Ecosystem

Context: Performance Gap Trend 20

Context: Exponential for Decades! Abundance of - computing & storage - generated data (estimated 8ZB in 15) - things! More data provides greater value! Traditional data doesn t scale well! It s time for a new approach! 21

New Hardware Approach Traditional Big Data! Exotic HW! Commodity HW - big central servers - SAN - RAID! Hardware reliability! Limited scalability! Expensive - racks of pizza boxes - Ethernet - JBOD! Unreliable HW! Scales further! Cost effective 22

New Software Approach Traditional! Monolotic - Centralized - RDBMS! Schema first! Proprietary Big Data! Distributed - storage & compute nodes! Raw data! Open source 23

Hadoop! De facto big data industry standard (batch)! Vendor adoption - IBM, Microsoft, Oracle, EMC,...! A collection of projects at Apache - HDFS, MapReduce, Hive, Pig, Hbase, Flume, Oozie,...! Main components - HDFS - MapReduce! Cluster - Set of machines running HDFS and MapReduce 24

Distributions! Cloudera - www.cloudera.com - Cloudera Enterprise subscription - Currently CDH3 Linux package Virtual machine Cloud - Stack hadoop, hbase, hive, pig, mahout, flume,... - Cloudera SCM - Connectors for Teradata, Netezza, Microstrategy and Quest 25

Distributions! Hortonworks - www.hortonworks.com - Hortonworks Data Platform Powered by Apache Hadoop, 100% opensource solution! IBM - Offers a derivative version of Apache Hadoop that IBM supports on IBM JVMs on a number of platforms! Microsoft - Hadoop on Azure & Hadoop on Premise! Apache Hadoop - Distribution for individual components only 26

HDFS 27

Access to HDFS! The HDFS java client api s can be used! Typically files are moved from local filesystem into HDFS! Using hadoop fs commands! Through Hue (Cloudera SCM)! Fuse! HDFS DAV 28

hadoop fs examples! Get directory listing of user home directory in hdfs - hadoop fs -ls! Get directory listing of root directory - hadoop fs -ls /! Copy a local file to hdfs (user home directory) - hadoop fs -copyfromlocal foo.txt foo.txt! Move a file from hdfs to local disk - hadoop fs -copytolocal /user/geert/foo.txt foo.txt! Display the contents of a file - hadoop fs -cat /data/as400/customers.txt! Make a new directory (user home directory) - hadoop fs -mkdir output 29

hadoop fs examples! Remove a directory - hadoop fs -rm /business/data! Remove a directory and all its content (recursive) - hadoop fs -rmr /business/data! More information about hadoop fs command options - http://hadoop.apache.org/common/docs/r0.20.2/hdfs_shell.html 30

hadoop fs examples 31

Hadoop Namenode webpage 32 32

Hadoop Namenode webpage 33

Hadoop Namenode webpage 34

MapReduce! MapReduce is the system used to process data in the Hadoop cluster! Consists of two phases - Map & Reduce - Between the two is a stage known as the shuffle and sort! Data Locality - Each Map task operates on a discrete portion of the overall dataset - Typically one HDFS block of data! After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase 35

MapReduce 36

MapReduce 37

MapReduce 38

MapReduce 39

Hadoop Architecture 40

MapReduce In Action! Calculate average word length per first letter of word - AverageWordLength.java: launches job - LetterMapper.java: mapper per first letter - AverageReducer.java: calculates average length 41

AverageWordLength 42

LetterMapper 43

AverageReducer 44

MapReduce In Action 45

JobTracker page 46

JobTracker page 47 The Hadoop Ecosystem

MapReduce! Abstract Processing Model - Distributed sort merge engine! Implementation - Programming Java Python - High-level tool using MapReduce Jobs Hive Pig... 48

Hive! Framework for data warehousing on top of Hadoop! Developed at Facebook for managing and learning from the huge volumes of data Facebook was generating! Makes it possible for analysts with strong SQL skills to run queries! Used by many organizations! SQL is lingua franca in business intelligence tools! SQL is limited so Hive is not fit for building complex machine learning algorithms! Generates MR jobs when executing queries 49

Hive CREATE EXTERNAL TABLE movie (id INT, name STRING, year INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/ root/movie' CREATE EXTERNAL TABLE movierating (userid INT, movieid INT, rating INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/cloudera/movierating' SELECT * FROM movie --Select oldest movie SELECT * FROM movie WHERE year!= 0 SORT BY year LIMIT 1 --Select movies without rating SELECT name, year FROM movie LEFT OUTER JOIN movierating ON movie.id = movierating.movieid WHERE movieid IS NULL --Update movies with numratings, avgrating DROP TABLE newmovie 50

Hive root@master ~ # hive Hive history file=/tmp/root/hive_job_log_root_201108031010_1952745660.txt hive> select * from movie limit 10; OK 1 Toy Story 1995 2 Jumanji 1995 3 Grumpier Old Men 1995 4 Waiting to Exhale 1995 5 Father of the Bride Part II 1995 6 Heat 1995 7 Sabrina 1995 8 Tom and Huck 1995 9 Sudden Death 1995 10 GoldenEye 1995 Time taken: 0.067 seconds hive> 51

Pig! Abstraction layer for processing large data sets! 2 Components - Pig Latin: the language used to express data flows - Grunt: the execution environment! Pig Program - Composed of series of operations, or transformations - The operations describe a dataflow that is translated into one or more MapReduce jobs 52

Pig -- max_temp.pig: Finds the maximum temperature by year records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int); filtered_records = FILTER records BY temperature!= 9999 AND ( quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9); grouped_records = GROUP filtered_records BY year; max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature); DUMP max_temp; 53