E6893 Big Data Analytics Lecture 2: Big Data Analytics Platforms




Ching-Yung Lin, Ph.D.
Adjunct Professor, Dept. of Electrical Engineering and Computer Science
Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center
September 11, 2014

Course Structure
Class 1 (09/04/14): Introduction to Big Data Analytics
Class 2 (09/11/14): Big Data Analytics Platforms
Class 3 (09/18/14): Big Data Storage and Processing
Class 4 (09/25/14): Big Data Analytics Algorithms -- I
Class 5 (10/02/14): Big Data Analytics Algorithms -- II
Class 6 (10/09/14): Linked Big Data Analysis -- Graph Computing and Network Science
Class 7 (10/16/14): Big Data Visualization
Class 8 (10/23/14): Big Data Mobile Applications
Class 9 (10/30/14): Large-Scale Machine Learning
Class 10 (11/06/14): Big Data Analytics on Specific Processors
Class 11 (11/13/14): Hardware and Cluster Platforms for Big Data Analytics
Class 12 (11/20/14): Big Data Next Challenges -- IoT, Cognition, and Beyond
11/27/14: Thanksgiving Holiday
Class 13 (12/04/14): Final Projects Discussion (Optional)
Classes 14-15 (12/11/14 & 12/12/14): Two-Day Big Data Analytics Workshop -- Final Project Presentations

Course Information -- TAs
11 Teaching Assistants:
- Ruichi Yu <ry2254>, Computer Science
- Aonan Zhang <az2385>, Electrical Engineering
- Promiti Dutta <pd2049>, Electrical Engineering and Environmental Engineering
- Bhaveep Sethi <bas2226>, Computer Science
- Weizhen Wang <ww2339>, Computer Science
- Jen-Chieh Huang <jh3478>, Computer Engineering
- Yunzhi Ye <yy2509>, Computer Science
- Meng-Yi (Marcus) Hsu <mh3346>, Electrical Engineering
- Shuguan Yang <sy2518>, Electrical Engineering
- Lin Huang <lh2647>, Electrical Engineering
- Huan Gao <hg2357>, Electrical Engineering

Students shall be divided into groups based on interest
Goal: Group students by domain of interest so that each group can focus on the use scenarios, datasets, and requirements needed to create open source Big Data Analytics toolkits. Fields that are not on the list, such as Education or Social Science, may also be chosen.
Selection: A website will be opened to let all students (on-campus & CVN) submit up to 3 preferences and a description of their education/work background relevant to the domain.
TAs will be assigned to lead 11 groups. Some groups may cover multiple fields. Some fields may have multiple groups.

Related Information
Guest: Prof. Ernesto Reuben, Business School; Associate Director, CELSS
Columbia Experimental Laboratory for the Social Sciences (CELSS): a joint venture by the Business School, Sociology, Economics, SIPA, and Political Science

Reading Reference for Lecture 2 & 3

Reminder -- Apache Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
http://hadoop.apache.org

Reminder -- Hadoop-related Apache Projects
- Ambari: A web-based tool for provisioning, managing, and monitoring Hadoop clusters. It also provides a dashboard for viewing cluster health and the ability to view MapReduce, Pig and Hive applications visually.
- Avro: A data serialization system.
- Cassandra: A scalable multi-master database with no single points of failure.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: A scalable machine learning and data mining library.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases.
- ZooKeeper: A high-performance coordination service for distributed applications.

Common Use Cases for Big Data in Hadoop
- Log Data Analysis: the most common use case; a perfect fit for the HDFS scenario of write once, read often
- Data Warehouse Modernization
- Fraud Detection
- Risk Modeling
- Social Sentiment Analysis
- Image Classification
- Graph Analysis
- Beyond
D. deRoos et al., Hadoop for Dummies, John Wiley & Sons, 2014

Example: Business Value of Log Analysis -- Struggle Detection
D. deRoos et al., Hadoop for Dummies, John Wiley & Sons, 2014

Reminder -- Hadoop Distributed File System (HDFS)
http://hortonworks.com/hadoop/hdfs/

Reminder -- MapReduce example
http://www.alex-hanna.com

MapReduce Process on User Behavior via Log Analysis
D. deRoos et al., Hadoop for Dummies, John Wiley & Sons, 2014

Setting Up the Hadoop Environment
- Local (standalone) mode
- Pseudo-distributed mode
- Fully-distributed mode

Setting Up the Hadoop Environment: Pseudo-distributed mode
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/singlenodesetup.html

Setting Up the Hadoop Environment: Pseudo-distributed mode

Data Storage Operations on HDFS
Hadoop is designed to work best with a modest number of extremely large files, with average file sizes larger than 500 MB.
It follows a write-once, read-often model: the content of an individual file cannot be modified, other than by appending new data at the end of the file.
What we can do:
- Create a new file
- Append content to the end of a file
- Delete a file
- Rename a file
- Modify file attributes such as the owner
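
The write-once, read-often rule can be pictured with a toy in-memory model. This is only a sketch and not the HDFS API: the class `ToyHdfsFile` and all of its method names are hypothetical, chosen to mirror the list of permitted operations on the slide.

```python
class ToyHdfsFile:
    """Hypothetical in-memory stand-in for one HDFS file (not the HDFS API)."""

    def __init__(self, path, owner):
        self.path = path
        self.owner = owner
        self._data = bytearray()

    def append(self, chunk):
        # Allowed: new data may only be added at the end of the file.
        self._data.extend(chunk)

    def overwrite(self, offset, chunk):
        # Not allowed: existing content cannot be modified in place.
        raise PermissionError("in-place modification is not supported")

    def rename(self, new_path):
        # Allowed: renaming changes the path, not the content.
        self.path = new_path

    def chown(self, new_owner):
        # Allowed: attributes such as the owner can be modified.
        self.owner = new_owner

    def read(self):
        return bytes(self._data)


log = ToyHdfsFile("/logs/app.log", owner="root")
log.append(b"first record\n")
log.append(b"second record\n")
log.rename("/logs/app-2014.log")
print(log.path, len(log.read()))  # /logs/app-2014.log 27
```

Deleting a file would simply drop the object; the point of the model is that every mutation except `overwrite` succeeds.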

HDFS blocks
A file is divided into blocks (default: 64 MB in Hadoop 1.x; the Hadoop 2.x default is 128 MB), and each block is replicated in multiple places (default: 3 copies).
Dividing files into blocks is normal for a file system; e.g., the default block size in Linux is 4 KB. What sets HDFS apart is the scale: Hadoop was designed to operate at the petabyte scale.
Every data block stored in HDFS has its own metadata and needs to be tracked by a central server.
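
A quick calculation shows why the large block size matters for the central server's metadata. The sketch below uses the 64 MB block size and replication factor of 3 quoted on the slide; the 500 MB file is a hypothetical example, and the function names are illustrative only.

```python
KB = 1024
MB = 1024 * KB

HDFS_BLOCK_SIZE = 64 * MB   # default block size quoted on the slide
LINUX_BLOCK_SIZE = 4 * KB   # Linux block size quoted on the slide
REPLICATION = 3             # default replication factor quoted on the slide


def block_count(file_size, block_size):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


file_size = 500 * MB  # hypothetical file

hdfs_blocks = block_count(file_size, HDFS_BLOCK_SIZE)
linux_blocks = block_count(file_size, LINUX_BLOCK_SIZE)

print(hdfs_blocks)                       # 8 blocks to track
print(hdfs_blocks * REPLICATION)         # 24 block replicas stored in the cluster
print(file_size * REPLICATION // MB)     # 1500 MB of raw disk space
print(linux_blocks)                      # 128000 blocks at 4 KB each
```

With 4 KB blocks the same file would need 128,000 metadata entries instead of 8, which is the scale problem a single tracking server has to avoid.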

HDFS blocks
Replication patterns of data blocks in HDFS.
When HDFS stores the replicas of the original blocks across the Hadoop cluster, it tries to ensure that the block replicas are stored at different failure points (for example, on different nodes and racks).
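
One way to picture "different failure points" is a rack-aware placement rule. The sketch below assumes a commonly described policy for three replicas (first on the writer's node, second on a node in a different rack, third on another node in that same remote rack) and simplifies it heavily; the function name, rack names, and node names are hypothetical, and the real placement logic considers load and free space as well.

```python
def place_replicas(writer, topology):
    """Pick three nodes for one block under a simplified rack-aware rule.

    topology maps a rack name to the list of node names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Candidate remote racks need two nodes to host replicas 2 and 3.
    remote_racks = [rack for rack in sorted(topology)
                    if rack != local_rack and len(topology[rack]) >= 2]
    if not remote_racks:
        raise ValueError("need a second rack with at least two nodes")

    # Replica 1 stays on the writer; replicas 2 and 3 go to two different
    # nodes of one remote rack, so losing either rack leaves a copy alive.
    second, third = sorted(topology[remote_racks[0]])[:2]
    return [writer, second, third]


TOPOLOGY = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
replicas = place_replicas("node1", TOPOLOGY)
print(replicas)  # ['node1', 'node3', 'node4']
```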

HDFS is a User-Space-Level file system

Interaction between HDFS components

HDFS Federation
Before Hadoop 2.0, the NameNode was a single point of failure and a limit on how far a cluster could scale: few clusters were able to scale beyond 3,000 or 4,000 nodes.
Multiple NameNodes can be used in Hadoop 2.x. (With the HDFS High Availability feature, one NameNode is in an Active state and the other is in a Standby state.)
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/hdfshighavailabilitywithnfs.html

High Availability of the NameNodes
- Active NameNode
- Standby NameNode: keeps the state of the block locations and block metadata in memory and takes over the HDFS checkpointing responsibilities.
- JournalNode: if a failure occurs, the Standby NameNode reads all completed journal entries to ensure that the new Active NameNode is fully consistent with the state of the cluster.
- ZooKeeper: provides coordination and configuration services for distributed systems.
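
The role of the journal during failover can be sketched with a toy model. Everything here is hypothetical and heavily simplified: `ToyNameNode`, the tuple-shaped journal entries, and the plain Python list standing in for the JournalNodes are illustrations, not Hadoop classes.

```python
class ToyNameNode:
    """Hypothetical NameNode whose namespace maps a path to a block count."""

    def __init__(self):
        self.namespace = {}
        self.applied = 0  # number of journal entries applied so far

    def apply(self, entry):
        op, path, blocks = entry
        if op == "create":
            self.namespace[path] = blocks
        elif op == "delete":
            self.namespace.pop(path, None)
        self.applied += 1

    def catch_up(self, journal):
        # Replay every completed journal entry that has not been applied yet.
        for entry in journal[self.applied:]:
            self.apply(entry)


journal = []  # stands in for the shared edit log kept by the JournalNodes
active, standby = ToyNameNode(), ToyNameNode()


def write(entry):
    journal.append(entry)  # record the edit in the journal first
    active.apply(entry)    # then update the Active NameNode's in-memory state


write(("create", "/logs/a", 3))
write(("create", "/logs/b", 1))
standby.catch_up(journal)          # the standby follows the journal in normal operation
write(("delete", "/logs/a", 0))    # an edit the standby has not seen yet

# Failover: the standby replays the remaining entries before taking over.
standby.catch_up(journal)
print(standby.namespace)  # {'/logs/b': 1}
```

After the final replay the standby's namespace matches the failed active's, which is the consistency guarantee the slide describes.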

Data Compression in HDFS

Several useful commands for HDFS
All hadoop commands are invoked by the bin/hadoop script.
% hadoop fsck / -files -blocks
lists the blocks that make up each file in HDFS.
For HDFS, the scheme name is hdfs, and for the local file system, the scheme name is file. A file or directory in HDFS can be specified in a fully qualified way, such as hdfs://namenodehost/parent/child or hdfs://namenodehost.
The HDFS file system shell commands are similar to Linux file commands, with the following general syntax:
hdfs dfs -file_cmd
For instance, mkdir runs as:
$ hdfs dfs -mkdir /user/directory_name
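
The parts of a fully qualified name can be inspected with Python's standard `urllib.parse`, and a shell command can be assembled as an argument list. This is a small sketch: the helper names `describe` and `hdfs_shell_args` are hypothetical, and the second helper only builds the argument list (for example, for `subprocess.run` on a machine with Hadoop installed) without running anything.

```python
from urllib.parse import urlparse


def describe(uri):
    """Split a fully qualified name into (scheme, authority, path)."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


def hdfs_shell_args(file_cmd, *args):
    """Build the argument list for `hdfs dfs -file_cmd args...` (not executed)."""
    return ["hdfs", "dfs", "-" + file_cmd, *args]


print(describe("hdfs://namenodehost/parent/child"))
# ('hdfs', 'namenodehost', '/parent/child')
print(describe("file:///tmp/data.txt"))
# ('file', '', '/tmp/data.txt')
print(hdfs_shell_args("mkdir", "/user/directory_name"))
# ['hdfs', 'dfs', '-mkdir', '/user/directory_name']
```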

Several useful commands for HDFS -- II

Ingesting Log Data -- Flume
Ingesting stream data

Execute Hadoop Works
http://www.alex-hanna.com

Reminder -- MapReduce Data Flow
http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/

MapReduce Use Case Example -- flight data
Data source: Airline On-time Performance data set (flight data set), containing the logs of all domestic flights from October 1987 to April 2008.
Each record represents an individual flight and captures various details:
- Time and date of arrival and departure
- Originating and destination airports
- Amount of time taken to taxi from the runway to the gate
Download it from Statistical Computing: http://stat-computing.org/dataexpo/2009/

Other datasets available from Statistical Computing
http://stat-computing.org/dataexpo/

Flight Data Schema

MapReduce Use Case Example -- flight data
Count the number of flights for each carrier
Serial way (not MapReduce):
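
The serial approach amounts to one loop over all records with a running tally per carrier. The sketch below is not the slide's original illustration: it uses made-up toy records with only three columns (year, carrier, distance), so `RECORDS` and `CARRIER_COL` are assumptions; for the real data set the column index would have to match the flight data schema.

```python
from collections import Counter

# Made-up toy records: year, carrier code, distance in miles.
RECORDS = [
    "1987,AA,100",
    "1987,UA,250",
    "1987,AA,300",
    "1988,DL,120",
    "1988,UA,80",
    "1988,AA,60",
]
CARRIER_COL = 1  # assumption: position of the carrier code in the toy records


def count_flights_serial(lines, carrier_col=CARRIER_COL):
    """Read every record in turn and keep one running count per carrier."""
    counts = Counter()
    for line in lines:
        carrier = line.split(",")[carrier_col]
        counts[carrier] += 1
    return dict(counts)


print(count_flights_serial(RECORDS))  # {'AA': 3, 'UA': 2, 'DL': 1}
```

A single process must touch every record, so the running time grows with the size of the whole data set.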

MapReduce Use Case Example -- flight data
Count the number of flights for each carrier
Parallel way:
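
The parallel approach splits the records into chunks, counts each chunk independently, and then merges the partial counts. The sketch below is an illustration under assumptions: the toy three-column records are made up, and a thread pool merely stands in for separate cluster nodes to show the structure of the computation, not real speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up toy records: year, carrier code, distance in miles.
RECORDS = [
    "1987,AA,100",
    "1987,UA,250",
    "1987,AA,300",
    "1988,DL,120",
    "1988,UA,80",
    "1988,AA,60",
]


def count_chunk(lines, carrier_col=1):
    """Count flights per carrier within one chunk, independent of other chunks."""
    return Counter(line.split(",")[carrier_col] for line in lines)


def count_flights_parallel(lines, workers=3):
    # Split the input into one chunk per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Each worker produces a partial count for its own chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    # Merge the partial counts into the final totals.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


print(sorted(count_flights_parallel(RECORDS).items()))
# [('AA', 3), ('DL', 1), ('UA', 2)]
```

Because each chunk is counted without reference to the others, adding workers shortens the counting phase; only the final merge is sequential.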

MapReduce application flow

MapReduce steps for flight data computation

FlightsByCarrier application
Create FlightsByCarrier.java:

FlightsByCarrier application

FlightsByCarrier Mapper

FlightsByCarrier Reducer
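
The Java source shown on these slides did not survive in this transcript. As a stand-in, the division of labor between a mapper and a reducer can be sketched in plain Python. This is not the FlightsByCarrier code and does not use the Hadoop API: the functions `mapper`, `reducer`, and `run_job`, and the toy three-column records, are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Made-up toy records: year, carrier code, distance in miles.
RECORDS = [
    "1987,AA,100",
    "1987,UA,250",
    "1987,AA,300",
    "1988,DL,120",
    "1988,UA,80",
    "1988,AA,60",
]


def mapper(line, carrier_col=1):
    """Map step: emit one (carrier, 1) pair per flight record."""
    yield line.split(",")[carrier_col], 1


def reducer(carrier, values):
    """Reduce step: sum the 1s emitted for a single carrier."""
    return carrier, sum(values)


def run_job(lines):
    # Map: turn every record into key/value pairs.
    mapped = [pair for line in lines for pair in mapper(line)]
    # Shuffle and sort: bring all pairs with the same key together.
    mapped.sort(key=itemgetter(0))
    # Reduce: call the reducer once per distinct key.
    return [reducer(key, (value for _, value in group))
            for key, group in groupby(mapped, key=itemgetter(0))]


print(run_job(RECORDS))  # [('AA', 3), ('DL', 1), ('UA', 2)]
```

In a real job the framework performs the shuffle and sort between the two user-written functions; here `run_job` plays that role in a single process.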

Run the code

See Result

Using Pig Script
E.g., totalmiles.pig calculates the total miles flown for all flights flown in one year.
Execute it:
pig totalmiles.pig
See the result:
hdfs dfs -cat /user/root/totalmiles/part-r-00000
775009272
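
The contents of totalmiles.pig are not reproduced in this transcript. The computation it is described as performing, summing the distance column over one year of flights, can be sketched in Python as follows. The toy records, the column positions, and the function name are assumptions, so the printed total refers only to the toy data and not to the result above.

```python
# Made-up toy records: year, carrier code, distance in miles.
RECORDS = [
    "1987,AA,100",
    "1987,UA,250",
    "1987,AA,300",
    "1988,DL,120",
    "1988,UA,80",
    "1988,AA,60",
]


def total_miles(lines, year, year_col=0, distance_col=2):
    """Sum the distance column over all flights flown in the given year."""
    total = 0
    for line in lines:
        fields = line.split(",")
        # Skip other years and records whose distance is missing or non-numeric.
        if fields[year_col] == str(year) and fields[distance_col].isdigit():
            total += int(fields[distance_col])
    return total


print(total_miles(RECORDS, 1987))  # 650
print(total_miles(RECORDS, 1988))  # 260
```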

Questions?