Stinger Initiative: Introduction
Interactive Query on Hadoop
Chris Harris
E-mail: charris@hortonworks.com
Twitter: cj_harris5
The World of Data is Changing
- Data explosion: 1 zettabyte (ZB) = 1 billion TB
- 15x growth rate of machine-generated data by 2020 (Source: IDC)
- "By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent." -- Gartner, Mark Beyer, Information Management in the 21st Century
What is Hadoop?
- Hadoop is a new data platform that can store more data, store more kinds of data, and perform more flexible analyses
- Hadoop is open source and runs on industry-standard hardware, so it is 1-2 orders of magnitude more economical than conventional data solutions
- Hadoop provides more cost-effective storage, processing, and analysis; some existing workloads run faster, cheaper, and better
Hadoop Enterprise Use Cases

Vertical           | Use Case                           | Data Type
Financial Services | New Account Risk Screens           | Text, Server Logs
                   | Fraud Prevention                   | Server Logs
                   | Trading Risk                       | Server Logs
                   | Maximize Deposit Spread            | Text, Server Logs
                   | Insurance Underwriting             | Geographic, Sensor, Text
                   | Accelerate Loan Processing         | Text
Telecom            | Call Detail Records (CDRs)         | Machine, Geographic
                   | Infrastructure Investment          | Machine, Server Logs
                   | Next Product to Buy (NPTB)         | Clickstream
                   | Real-time Bandwidth Allocation     | Server Logs, Text, Sentiment
                   | New Product Development            | Machine, Geographic
Retail             | 360 View of the Customer           | Clickstream, Text
                   | Analyze Brand Sentiment            | Sentiment
                   | Localized, Personalized Promotions | Geographic
                   | Website Optimization               | Clickstream
                   | Optimal Store Layout               | Sensor
Manufacturing      | Supply Chain and Logistics         | Sensor
                   | Assembly Line Quality Assurance    | Sensor
                   | Proactive Maintenance              | Machine
                   | Crowdsourced Quality Assurance     | Sentiment
A Brief History of Apache Hadoop
- 2005: Yahoo! creates a team under E14 to work on Hadoop -- focus on INNOVATION
- 2006: Apache project established
- 2008: Yahoo! begins to operate at scale; the team extends its focus to operations to support multiple projects and growing clusters -- focus on OPERATIONS
- 2011: Hortonworks created to focus on Enterprise Hadoop, starting with 24 key Hadoop engineers from Yahoo! -- focus on STABILITY
- 2013: Hortonworks Data Platform -- Enterprise Hadoop
Leadership that Starts at the Core
- Driving next-generation Hadoop: YARN, MapReduce2, HDFS2, high availability, disaster recovery
- 420k+ lines authored since 2006 -- more than twice the nearest contributor
- Deeply integrating with the ecosystem: enabling new deployment platforms (e.g. Windows & Azure, Linux & VMware HA) and creating deeply engineered solutions (e.g. Teradata big data appliance)
- All Apache, NO holdbacks: 100% of code contributed to Apache
Operational Data Refinery (Refine / Explore / Enrich)
Collect data and apply a known algorithm to it in a trusted operational process:
1. Capture -- capture all data, from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media)
2. Process -- parse, cleanse, apply structure & transform
3. Exchange -- push to the existing data warehouse (RDBMS, EDW, MPP, traditional repos) for use with existing analytic tools and applications (business analytics, custom applications, enterprise applications)
Key Capability in Hadoop: Late Binding
- With traditional ETL, structure must be agreed upon far in advance and is difficult to change: machine-generated data, web logs, click streams, and OLTP data flow through an ETL server that stores transformed data in the data mart / EDW for client apps.
- With Hadoop, capture all data and structure it as business needs evolve: the same sources land in the Hortonworks Data Platform (operational services, data services, Hadoop core), where transformations are applied dynamically before feeding the data mart / EDW and client apps.
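Late binding can be sketched in HiveQL: land the raw records untouched, then project structure at read time. This is an illustrative example only -- the table, view, and column names, the HDFS path, and the log format are assumptions, not part of the deck.

```sql
-- Land raw web logs as-is; no structure is imposed at load time.
CREATE EXTERNAL TABLE raw_weblogs (line STRING)
LOCATION '/data/raw/weblogs';

-- Later, as business needs evolve, bind structure with a view
-- instead of re-running ETL against the source systems.
CREATE VIEW weblog_hits AS
SELECT regexp_extract(line, '^(\\S+)', 1)             AS client_ip,
       regexp_extract(line, '"(GET|POST) (\\S+)', 2)  AS url
FROM raw_weblogs;
```

If the parsing rules change, only the view is redefined; the raw data never has to be reloaded.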
Big Data Exploration & Visualization (Refine / Explore / Enrich)
Collect data and perform iterative investigation for value:
1. Capture -- capture all data, from traditional sources (RDBMS, OLTP, OLAP) and new sources (web logs, email, sensor data, social media)
2. Process -- parse, cleanse, apply structure & transform
3. Exchange -- explore and visualize with analytics tools supporting Hadoop
Visualization Tooling
- Robust visualization and business tooling
- Ensures scalability when working with large datasets
- Native Excel support
- Web browser support
- Mobile support
Enhancing the Core of Apache Hadoop
Deliver high-scale storage & processing (HDFS, MapReduce, YARN in 2.0) with enterprise-ready platform services.
Unique focus areas:
- Bigger, faster, more flexible: continued focus on speed & scale, enabling near-real-time apps
- Tested & certified at scale: ~1,300 system tests run on large Yahoo! clusters for every release
- Enterprise-ready services: high availability, disaster recovery, snapshots, security
Data Services for Full Data Lifecycle
Provide data services to store, process & access data in many ways (Flume, Sqoop, Pig, Hive, HCatalog, HBase, WebHDFS, layered over Hadoop core and platform services).
Unique focus areas:
- Apache HCatalog: metadata services for consistent table access to Hadoop data
- Apache Hive: explore & process Hadoop data via SQL & ODBC-compliant BI tools
- Apache HBase: NoSQL database for Hadoop
- WebHDFS: access Hadoop files via a scalable REST API
- Talend Open Studio for Big Data: graphical data integration tools
Organize Tiers and Process with Metadata
- Raw Tier: extract & load (WebHDFS, Flume, Sqoop)
- Work Tier: standardize, cleanse, transform (Pig, MapReduce)
- Gold Tier: transform, integrate, store (Pig, MapReduce)
- Access Tier: conform, summarize, access (Pig, HiveQL)
- HCatalog provides unified metadata access to Pig, Hive & MapReduce
- Organizing data on source/derived relationships allows for fault and rebuild processes
Hive Current Focus Area
The query-latency spectrum (data size grows along the same axis):
- Online (0-5s): online systems, real-time analytics, CEP
- Interactive (5s-1m): parameterized reports, drilldown, visualization, exploration
- Non-real-time (1m-1h): data preparation, incremental batch processing, dashboards / scorecards
- Batch (1h+): operational batch processing, enterprise reports, data mining
Hive's current sweet spot is the batch end of this spectrum.
Stinger: Extending Hive's Sweet Spot
Stinger expands Hive from its current batch sweet spot (1h+) into the interactive and non-real-time ranges (5s-1h) of the latency spectrum, while continuing to serve core Hive use cases.
- Improve latency & throughput: query engine improvements, new optimized RCFile column store, next-generation runtime that eliminates MapReduce latency
- Extend deep analytical ability: analytics functions, improved SQL coverage
Hadoop 2.0 -- The Enterprise Generation
- 1.0: architected for the large web properties
- 2.0: architected for the broad enterprise -- a single platform for multiple uses (batch, interactive, online) across big data transactions, interactions, and observations
Enterprise requirements mapped to Hadoop 2.0 features:
- Mixed workloads, interactive query: YARN, Hive on Tez
- Reliability: full-stack HA
- Point-in-time recovery: snapshots
- Multi data center: disaster recovery
- ZERO downtime: rolling upgrades
- Security: Knox Gateway
The Stinger Initiative
Interactive Query on Hadoop
Stinger Initiative At A Glance
Base Optimizations: Intelligent Optimizer
Introduction of the in-memory hash join:
- For joins where one side fits in memory, a new in-memory hash-join algorithm applies: Hive reads the small table into a hash table, then scans through the big file to produce the output.
Introduction of the sort-merge-bucket (SMB) join:
- Applies when tables are bucketed on the same key; dramatic speed improvements seen in benchmarks.
Other improvements:
- Lower the footprint of the fact tables in memory.
- Enable the optimizer to automatically pick map joins.
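As a sketch, these optimizations are driven by session settings in Hive; the threshold value below is illustrative, not a recommendation from the deck.

```sql
-- Let the optimizer convert joins to in-memory hash (map) joins
-- automatically when the small side fits in memory.
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- ~25 MB small-table limit (illustrative)

-- Enable sort-merge-bucket joins for tables that are bucketed
-- and sorted on the same join key.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
```

With these set, ordinary JOIN queries are rewritten by the optimizer without any change to the SQL itself.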
Dimensionally Structured Data
- An extremely common pattern in the EDW, sometimes called a star schema
- Results in large fact tables and small dimension tables
- Dimension tables are often small enough to fit in RAM
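A minimal star-schema sketch in HiveQL; the table and column names are hypothetical, chosen only to show the large-fact / small-dimension shape.

```sql
-- Large fact table: one row per event, keyed into the dimensions.
CREATE TABLE fact_sales (
  store_id   INT,
  product_id INT,
  sold_at    STRING,
  amount     DOUBLE
);

-- Small dimension table: often small enough to be read into a
-- hash table and joined map-side against the fact table.
CREATE TABLE dim_store (
  store_id INT,
  state    STRING
);
```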
A Query on Dimensional Data
Derived from TPC-DS Query 27:

SELECT col5, avg(col6)
FROM fact_table
  JOIN dim1 ON (fact_table.col1 = dim1.col1)
  JOIN dim2 ON (fact_table.col2 = dim2.col1)
  JOIN dim3 ON (fact_table.col3 = dim3.col1)
  JOIN dim4 ON (fact_table.col4 = dim4.col1)
GROUP BY col5
ORDER BY col5
LIMIT 100;

Queries of this shape see a dramatic speedup on Hive 0.11.
Star Schema Join Improvements in 0.11
ORCFile -- Optimized Column Storage
Make a better columnar storage file:
- Tightly aligned to the Hive data model; decomposes complex row types into primitive fields
- Better compression and projection: only the bytes for the required columns are read from HDFS
- Stores column-level aggregates (min, max, sum, average, count) in the files, both for the whole file and for each section of the file, so common queries need only read the file meta information
- Allows fast access by sorted columns
- Ability to add Bloom filters for columns, enabling quick checks for whether a value is present
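Using ORC storage is a one-clause change at table-creation time (Hive 0.11+). The table and column names below are illustrative.

```sql
-- Store the table in ORC format instead of text or plain RCFile.
CREATE TABLE fact_sales_orc (
  store_id INT,
  sold_at  STRING,
  amount   DOUBLE
)
STORED AS ORC;

-- Populate it from an existing table; subsequent queries that touch
-- only some columns read only those columns' bytes, and the stored
-- min/max/sum/count aggregates let Hive skip whole sections of data.
INSERT OVERWRITE TABLE fact_sales_orc
SELECT store_id, sold_at, amount FROM fact_sales;
```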
YARN
Moving Hive and Hadoop beyond MapReduce
Hadoop 2.0 Innovations -- YARN
- Focus on scale and innovation: supports clusters of 10,000+ computers; extensible to encourage innovation
- Next-generation execution: improves MapReduce performance; supports new frameworks beyond MapReduce (low latency, streaming, services)
- Do more with a single Hadoop cluster: MapReduce, Tez, graph processing, and other frameworks run side by side on YARN (cluster resource management) over HDFS (redundant, reliable storage)
Tez
Moving Hive and Hadoop beyond MapReduce
Tez
- Low-level data-processing execution engine
- Used as the base for MapReduce, Hive, Pig, Cascading, etc.
- Enables pipelining of jobs and removes task and job launch times; Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
- Does not write intermediate output to HDFS, so disk and network usage are much lighter
- Built on YARN
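From the Hive side, switching a session onto Tez is a single setting once Hive on Tez ships (it landed after this deck, in Hive 0.13); shown here as a forward-looking sketch rather than something available in Hive 0.11.

```sql
-- Run this session's queries on Tez instead of classic MapReduce.
SET hive.execution.engine=tez;   -- default is 'mr'

-- Multi-stage queries now execute as a single Tez DAG: no intermediate
-- HDFS writes and no re-queuing between pipeline steps.
```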
Tez -- Core Idea
A task with pluggable Input, Processor & Output:
  Input -> Processor -> Output
A Tez task is the triple <Input, Processor, Output>, and a YARN ApplicationMaster runs a DAG of Tez tasks.
Tez -- Blocks for Building Tasks
- MapReduce map task: HDFS Input -> Map Processor -> Sorted Output
- MapReduce reduce task: Shuffle Input -> Reduce Processor -> HDFS Output
- Intermediate reduce for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output
Tez -- More Tasks
- Special Pig/Hive map: HDFS Input -> Map Processor -> Pipeline Sorter Output
- In-memory map: HDFS Input -> Map Processor -> In-memory Sorted Output
- Special Pig/Hive reduce: Shuffle Skip-merge Input -> Reduce Processor -> Sorted Output
Pig/Hive-MR versus Pig/Hive-Tez

SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
  JOIN b ON (a.id = b.id)
  JOIN c ON (a.itemid = c.itemid)
GROUP BY a.state

With Pig/Hive on MapReduce, this runs as three jobs separated by I/O synchronization barriers; with Pig/Hive on Tez, it runs as a single job.
FastQuery: Beyond Batch with YARN
- Tez generalizes MapReduce: simplified execution plans process data more efficiently
- Always-on Tez service: low-latency processing for all Hadoop data processing
Recap and Questions: Hive Performance
Improving Hive's SQL Support
Stinger: Deep Analytical Capabilities
SQL:2003 window functions:
- OVER clauses
- Multiple PARTITION BY and ORDER BY supported
- Windowing supported (ROWS PRECEDING/FOLLOWING)
- Large variety of aggregates: RANK, FIRST_VALUE, LAST_VALUE, LEAD / LAG, distributions
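An illustrative windowing query combining these pieces; the table and column names are assumptions, not from the deck.

```sql
-- Per-store ranking, previous-row lookup, and a windowed moving average.
SELECT store_id,
       sold_at,
       amount,
       RANK()      OVER (PARTITION BY store_id
                         ORDER BY amount DESC)     AS amount_rank,
       LAG(amount) OVER (PARTITION BY store_id
                         ORDER BY sold_at)         AS prev_amount,
       AVG(amount) OVER (PARTITION BY store_id
                         ORDER BY sold_at
                         ROWS BETWEEN 3 PRECEDING
                              AND CURRENT ROW)     AS moving_avg
FROM fact_sales;
```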
Hive Data Type Conformance
Data types:
- Add fixed-point NUMERIC and DECIMAL types (in progress)
- Add VARCHAR and CHAR types with limited field size
- Add DATETIME
- Add size ranges from 1 to 53 for FLOAT
- Add synonyms for compatibility: BLOB for BINARY, TEXT for STRING, REAL for FLOAT
SQL semantics:
- Sub-queries in IN, NOT IN, HAVING, EXISTS and NOT EXISTS
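A sketch of the sub-query forms the slide lists; since the slide presents these as planned work, treat them as target syntax rather than a guarantee of current behavior, and note the table and column names are hypothetical.

```sql
-- Uncorrelated sub-query in IN:
SELECT *
FROM fact_sales
WHERE store_id IN (SELECT store_id FROM dim_store WHERE state = 'CA');

-- Correlated sub-query with NOT EXISTS:
SELECT f.*
FROM fact_sales f
WHERE NOT EXISTS (SELECT 1 FROM returns r WHERE r.sale_id = f.sale_id);
```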
Questions?