HOW TO LIVE WITH THE ELEPHANT IN THE SERVER ROOM APACHE HADOOP WORKSHOP



AGENDA
- Introduction
- What is Hadoop and the rationale behind it
- Hadoop Distributed File System (HDFS) and MapReduce
- Common Hadoop use cases
- How Hadoop integrates with other systems such as relational databases and data warehouses
- The other components in a typical Hadoop stack: Hive, Pig, HBase, Sqoop, Flume and Oozie
- Conclusion

ABOUT TRIFORCE Triforce provides critical, reliable IT infrastructure solutions and services to Australian and New Zealand listed corporations and government agencies. Triforce has qualified and experienced technical and sales consultants and demonstrated experience in designing and delivering enterprise Apache Hadoop solutions.

TRIFORCE BIG DATA PARTNERSHIP NetApp The NetApp Open Solution for Hadoop provides customers with flexible choices for delivering enterprise-class Hadoop. Cloudera Cloudera is the market leader in Hadoop enterprise solutions. Cloudera's 100% open-source distribution including Apache Hadoop (CDH), combined with Cloudera Enterprise, comprises the most reliable and complete Hadoop solution available.

WHAT IS HADOOP? "A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models." (http://hadoop.apache.org/) "Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data." (http://en.wikipedia.org/wiki/hadoop/)

THE RATIONALE FOR HADOOP "Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big." (http://www.cloudera.com) Hadoop processes petabytes of unstructured data in parallel across potentially thousands of commodity boxes using an open-source filesystem and related tools. Hadoop has been all about innovative ways to process, store, and eventually analyse huge volumes of multi-structured data.

EXAMPLES 2.7 zettabytes of data exist in the digital universe today (gigabyte, terabyte, petabyte, exabyte, zettabyte). Facebook stores, accesses and analyses 30+ petabytes of user-generated data. Decoding the human genome originally took 10 years; now it can be achieved in one week. YouTube users upload 48 hours of new video every minute of the day. 100 terabytes of data are uploaded to Facebook daily.

HADOOP Handles all types of data: structured, unstructured, log files, pictures, audio files, communications records, email. No prior need for a schema: you don't need to know how you intend to query your data before you store it. Makes all of your data useable: by making all of your data useable, not just what's in your databases, Hadoop lets you see relationships that were hidden before and reveal answers that have always been just out of reach. You can start making more decisions based on hard data instead of hunches, and look at complete data sets, not just samples. Two parts to Hadoop: MapReduce and the Hadoop Distributed File System (HDFS).
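For a feel of HDFS in practice (the paths and file names below are invented for illustration, not taken from the workshop), the command-line interface mirrors ordinary file-system operations:

  $ hadoop fs -mkdir weblogs                              # directory under the user's HDFS home
  $ hadoop fs -put access_log.2012-10-26 weblogs/          # copy a local file into HDFS
  $ hadoop fs -ls weblogs                                  # list the directory
  $ hadoop fs -cat weblogs/access_log.2012-10-26 | head    # read it back, pipe to a local tool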

What is this Big Elephant? HADOOP Geever Paul Pulikkottil, Big Data Solutions Architect (CCAH, CCDH)

CASE FOR BIGDATA Databases have been here for more than 20 years and continue to store structured transactional data: large server(s) with multiple CPUs, a huge memory buffer and SAN disks; relatively low-latency queries over indexed data.

CASE FOR BIGDATA TYPICAL DATABASE WORKLOADS
OLTP (online transaction processing)
- Typical use: e-commerce, banking
- Nature: user-facing, real-time, low latency, highly concurrent
- Job: relatively small set of standard transactional queries
- Data access pattern: random reads, updates, writes (relatively small data)
OLAP (online analytical processing)
- Typical use: BI, data mining
- Nature: back-end processing, batch workloads
- Job: complex analytical queries, often ad hoc
- Data access: table scans, large queries

CASE FOR BIGDATA Data warehouse: a consolidated database loaded from CRM, ERP and OLTP systems. Process: staging, cleansing, loading. Purpose: BI reporting, forecasts, quarterly reporting. Size: larger server, multiple CPUs, SAN disks, many TBs. Challenge: as the data grows over time, things get slower; batch loads must fit within the daily or weekly loading cycle; relatively expensive to license, store and manage.

CASE FOR BIGDATA New objective: businesses want to connect with the customer. We are generating lots of data and discarding most of it: likes and dislikes from Facebook, Twitter and LinkedIn. Predictable outcomes: you can predict them when you know the customer. React quickly: time missed = opportunity lost! Question: can the DW provide that? Where can you store TBs or PBs of unstructured data more economically? How can you scale out easily, rather than through forklift upgrades? How can I finish batch jobs when the data grows beyond TBs? We need a scalable, distributed system that can store and process large amounts of data.

CASE FOR BIGDATA Distributed systems are not NEW: common frameworks include MPI and PVM. They focus on distributing the processing workload: powerful compute nodes, separate systems for data storage, and fast network connections (InfiniBand). Typical processing pattern: Step 1: copy the input data from storage to a compute node. Step 2: perform the necessary processing. Step 3: copy the output data back to storage. Often hundreds to thousands of nodes, with GPUs.

CASE FOR BIGDATA Distributed HPC works for relatively small amounts of data but doesn't scale with large amounts of data: more time is spent copying data than actually processing it. Getting data to the processors is the bottleneck, and it gets worse as more compute nodes are added, since each node competes for the same bandwidth and the compute nodes become starved for data. Distributed systems also pay for compute scalability by adding complexity (CUDA Fortran, PGI programming?).

BIGDATA SOLUTION: HADOOP What is Hadoop? An open-source distributed computing platform based on Google's GFS file system. Commodity hardware: no SAN, no InfiniBand. Scales from single servers up to thousands of machines, each offering local computation and storage. Designed to detect and handle failures at the application layer: adding more nodes increases performance and capacity with no penalty. Commodity hardware is prone to failures, and Hadoop knows that!

HADOOP CLUSTER STACK
Master Nodes (1st rack):
- Name Node
- Standby Name Node
- Job Tracker
Slave Nodes (all racks):
- Data Nodes with direct-attached, large-capacity disks (SATA)
Plus:
- Management or Admin Node
- Hadoop Client Node(s)
This is a typical setup.

MAPREDUCE PROGRAMMING Hadoop is great for large-data processing! - MapReduce code requires you to write a Java class and driver code - It is complicated to write MapReduce jobs, so we need a simpler method (a sketch of the bare mapper boilerplate follows below) - Develop a higher-level language to facilitate large-data processing - Hive: SQL-like language for Hadoop, called HQL - Pig: Pig Latin is a scripting language, a bit like Perl - Both translate into and run a series of Map-only or MapReduce jobs
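To make the point concrete, here is a minimal, hypothetical sketch (not taken from the workshop) of just the mapper half of a word-count job written against the org.apache.hadoop.mapreduce API; a Reducer class and driver code that configures and submits the job are still needed on top of this:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          // Emit (word, 1) for every token on the input line.
          for (String token : value.toString().split("\\s+")) {
              if (token.isEmpty()) continue;
              word.set(token);
              context.write(word, ONE);
          }
      }
  }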

ECOSYSTEM TOOLS: HIVE AND PIG
Hive:
- Data warehousing application in Hadoop
- Query language is HQL, a variant of SQL
- Tables stored on HDFS as flat files
- Developed by Facebook, now open source
Pig:
- Large-scale data processing system
- Scripts are written in Pig Latin, a dataflow language
- Developed by Yahoo!, now open source
Objective:
- A higher-level language to facilitate large-data processing
- The higher-level language compiles down to Hadoop jobs

HIVE AND PIG EXAMPLE CODE
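A minimal, hypothetical pair of stand-in examples (table, relation and field names are invented, not taken from the workshop), both producing the same top-10 URL report.

Hive example (HQL):

  CREATE TABLE page_views (user_id STRING, url STRING, ts BIGINT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  LOAD DATA INPATH '/data/page_views' INTO TABLE page_views;

  SELECT url, COUNT(*) AS hits
  FROM page_views
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;

Pig example (Pig Latin):

  views  = LOAD '/data/page_views' AS (user_id:chararray, url:chararray, ts:long);
  by_url = GROUP views BY url;
  counts = FOREACH by_url GENERATE group AS url, COUNT(views) AS n;
  sorted = ORDER counts BY n DESC;
  top10  = LIMIT sorted 10;
  DUMP top10;

Both run as one or more MapReduce jobs under the hood, which is exactly the translation described on the previous slide.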

ECOSYSTEM TOOLS: SQOOP
- Imports data from an RDBMS into Hadoop: individual tables, portions (WHERE clause) or entire databases
- Data is stored in HDFS as delimited text files or SequenceFiles
- Provides the ability to import from SQL databases straight into your Hive data warehouse
- Uses JDBC to connect to the RDBMS; additional connectors are available for BI/DW systems
- Sqoop automatically generates a Java class to import data into Hadoop
- Sqoop provides an incremental import mode
- Exports tables from Hadoop back to the RDBMS

SQOOP IMPORT EXAMPLES
> Importing data into HDFS as a Hive table using Sqoop:
user@dbserver$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS --local \
    --hive-import
> Importing data into HDFS as compressed sequence files (no Hive) using Sqoop:
user@dbserver$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --as-sequencefile
> Importing data into HBase using Sqoop:
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --hbase-create-table --hbase-table ORDERS --column-family mysql
> Exporting data to an RDBMS using Sqoop:
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
    --table ORDERS --username test --password **** \
    --export-dir /user/arvind/orders
The first command connects to the MySQL database on that server and imports the USERS table into HDFS. The --local option instructs Sqoop to take advantage of a local MySQL connection. With the --hive-import option, after reading the data into HDFS, Sqoop connects to the Hive metastore, creates a table named USERS with the same columns and types (translated into their closest analogues in Hive), and loads the data into the Hive warehouse directory on HDFS (instead of a subdirectory of your HDFS home directory).

SQOOP CUSTOM CONNECTORS Sqoop works with standard JDBC connections to common databases; custom, faster, tuned connectors are also available: Cloudera Connector for Teradata, Cloudera Connector for Netezza, Cloudera Connector for MicroStrategy, Cloudera Connector for Tableau, Quest Data Connector for Oracle and Hadoop.

ECOSYSTEM TOOLS: FLUME Flume gathers data/logs from multiple systems and inserts them into HDFS as they are generated. It is typically used to ingest log files from real-time systems such as web servers, firewalls and mail servers into HDFS. Each Flume agent has a source and a sink. Source: tells the node where to receive data from. Sink: tells the node where to send data to. Channel: a queue between the source and the sink; it can be in-memory only or durable, and durable channels will not lose data if power is lost.
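A minimal sketch of such an agent, assuming a Flume NG-style properties file; the agent, host and path names are invented for illustration:

  agent1.sources  = weblog
  agent1.channels = mem1
  agent1.sinks    = tohdfs

  # Source: tail the web server access log as events are generated
  agent1.sources.weblog.type = exec
  agent1.sources.weblog.command = tail -F /var/log/httpd/access_log
  agent1.sources.weblog.channels = mem1

  # Channel: in-memory queue between source and sink (fast, but not durable)
  agent1.channels.mem1.type = memory

  # Sink: write the events into HDFS
  agent1.sinks.tohdfs.type = hdfs
  agent1.sinks.tohdfs.channel = mem1
  agent1.sinks.tohdfs.hdfs.path = hdfs://namenode.example.com/flume/weblogs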

ECOSYSTEM TOOLS: FUSE FUSE: Filesystem in Userspace. Allows HDFS to be mounted as a UNIX file system. Users can run 'ls', 'cd', 'cp', 'mkdir', 'find', 'grep', or use standard POSIX library calls like open, write, read and close. You can also export a FUSE mount using NFS.
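As a hedged example (the host name, port and mount point are placeholders, and the exact helper command and package depend on your distribution), mounting HDFS and then browsing it with ordinary tools might look like:

  $ sudo mkdir -p /mnt/hdfs
  $ sudo hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs
  $ ls /mnt/hdfs/user/
  $ cp /mnt/hdfs/user/fred/report.csv /tmp/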

ECOSYSTEM TOOLS: OOZIE Oozie: Oozie is a workflow engine that runs workflows of Hadoop jobs (Pig, Hive, Sqoop jobs). Jobs can be run at specific times, one-off or recurring, and can also be run when data is present in a directory.
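As a hedged sketch (the workflow name, the ${jobTracker}/${nameNode} properties and the Sqoop command are illustrative, not from the workshop), a skeletal workflow.xml that wraps a single Sqoop import as one action might look roughly like this:

  <workflow-app name="user-import-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="import-users"/>
    <action name="import-users">
      <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --connect jdbc:mysql://db.example.com/website --table USERS --hive-import</command>
      </sqoop>
      <ok to="end"/>
      <error to="fail"/>
    </action>
    <kill name="fail">
      <message>USERS import failed</message>
    </kill>
    <end name="end"/>
  </workflow-app>

A coordinator job can then schedule this workflow at specific times or trigger it when data lands in a directory.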

ECOSYSTEM TOOLS: MAHOUT Mahout: - Mahout is a machine learning library - Contains many pre-written ML algorithms - R is another open-source library widely used by data scientists

ECOSYSTEM TOOLS: IMPALA <CDH4.1> IMPALA: Brings real-time, ad hoc query to data stored in HDFS or HBase: SELECT, JOIN and aggregate functions in real time. Uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Hive, plus an Impala shell. Released 26th October 2012 with CDH4.1.
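As a small, hypothetical illustration (the host name is a placeholder and the invented page_views table from the Hive sketch earlier is reused), the same query can be issued interactively through the Impala shell and is executed by the Impala daemons rather than as a MapReduce job:

  $ impala-shell -i impalad-host.example.com
  > SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;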

HBASE REAL-TIME DATA WITH UPDATES HBase is a distributed, sparse, column-oriented data store with real-time read/write access to data on HDFS. Modeled after Google's Bigtable data store. Designed to use multiple machines to store and serve data; leverages HDFS to store the data. Each row may or may not have values for all columns. Data is stored grouped by column rather than by row; columns are grouped into column families, which define which columns are physically stored together. Scales to provide very high write throughput: hundreds of thousands of inserts per second. Has a constrained access model (no SQL): insert a row, retrieve a row, do a full or partial table scan; only one column (the "row key") is indexed. Based on a key/value store: [rowkey, column family, column qualifier, timestamp] -> cell value, e.g. [TheRealMT, info, password, 1329088818321] -> abc123 and [TheRealMT, info, password, 13290888321289] -> newpass123.
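A minimal sketch of how cells like the ones above could be written and read back from the HBase shell (the table name 'users' and the column family 'info' are assumed for illustration):

  hbase> create 'users', 'info'
  hbase> put 'users', 'TheRealMT', 'info:password', 'abc123'
  hbase> put 'users', 'TheRealMT', 'info:password', 'newpass123'
  hbase> get 'users', 'TheRealMT'
  hbase> scan 'users', {COLUMNS => 'info:password'}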

HBASE HBase cells are indexed by [rowkey + column qualifier + timestamp]. HBase is not a relational database: no SQL query language (GET/PUT/SCAN), no joins, no secondary indexing, no transactions. A table is split into regions; regions are served by Region Servers, which are Java processes running on the DataNodes. Two special catalog tables, -ROOT- and .META., track where regions live. Writes go to a MemStore and then to HFiles: every MemStore flush creates one HFile per column family, and major/minor compactions consolidate the HFiles.

DATA HAS CHANGED

HADOOP USE CASES: What do we know today? We love to be connected and to collaborate. We love to share emotions, likes and dislikes. Digital marketing is focused on social media. We want more insight across our collections of data, and we need to store and analyse all sorts of data: real-time recommendation engines, predictive modelling with data science.

COMMON HADOOP USE CASES Financial Services Consumer & market risk modelling, Personalization & recommendations, Fraud detection & anti-money laundering, Portfolio valuations

COMMON HADOOP USE CASES Government Cyber security & fraud detection, Geospatial image & video processing

COMMON HADOOP USE CASES Media & Entertainment Search & recommendation optimization, User engagement & digital content analysis, Ad/offer targeting, Sentiment & social media analysis

HADOOP USE CASES: DATA STORES OLTP database for user-facing transactions; retains records. Extract-Transform-Load (ETL): periodic ETL (e.g., nightly); extract records from the source; transform (clean data, check integrity, aggregate, etc.); load into the OLAP database. OLAP database for Data Warehousing (DW) and Business Intelligence: reporting, ad hoc queries, data mining.

HADOOP USE CASES: REPLACE THE DW? Reporting is often a nightly task, and ETL is often slow, running after the day ends. What happens if processing 24 hours of data takes longer than 24 hours? Hadoop is a perfect fit here (most likely you already have some DW): ingest is limited only by the speed of HDFS, it scales out with more nodes, it is massively parallel, it gives you the ability to use any processing tool, and it is much cheaper than parallel databases. ETL is a batch process anyway!

CLOUDERA DISTRIBUTION HADOOP 4.1 Cloudera Enterprise Subscription Options: Cloudera Enterprise Core Cloudera Enterprise RTD (Real-Time Delivery) Cloudera Enterprise RTQ (Real-Time Query)

WHERE TO FROM HERE? Understand use cases. Build a business case. Design a solution. Deploy Hadoop infrastructure. Confirm data sources. Use Hadoop to answer questions.

CONTACT TRIFORCE Call 1300 664 667 Email: info@triforce.com.au View our Big Data Resources page at www.triforce.com.au Follow us on LinkedIn: http://www.linkedin.com/company/triforceaustralia