Performance Management in Big Data Applica6ons. Michael Kopp, Technology Strategist @mikopp

Similar documents
Using RDBMS, NoSQL or Hadoop?

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Delivering Intelligence to Publishers Through Big Data

Implement Hadoop jobs to extract business value from large and varied data sets

SQream Technologies Ltd - Confiden7al

DNS Big Data

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Oracle Big Data SQL Technical Update

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

NextGen Infrastructure for Big DATA Analytics.

Real-time Big Data Analytics with Storm

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

How To Scale Out Of A Nosql Database

.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken

Big Data Analytics Nokia

Big Data Analytics - Accelerated. stream-horizon.com

Reference Architecture, Requirements, Gaps, Roles

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

xpaaerns on Spark, Shark, Tachyon and Mesos

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Cost-Effective Business Intelligence with Red Hat and Open Source

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

Big Fast Data Hadoop acceleration with Flash. June 2013

Can the Elephants Handle the NoSQL Onslaught?

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Data processing goes big

In Memory Accelerator for MongoDB

H2O on Hadoop. September 30,

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

I/O Considerations in Big Data Analytics

Unified Big Data Processing with Apache Spark. Matei

Scalable Architecture on Amazon AWS Cloud

CSE-E5430 Scalable Cloud Computing Lecture 2

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Using distributed technologies to analyze Big Data

Constructing a Data Lake: Hadoop and Oracle Database United!

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Hadoop IST 734 SS CHUNG

Big Data With Hadoop

BIG DATA What it is and how to use?

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Play with Big Data on the Shoulders of Open Source

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Hadoop & Spark Using Amazon EMR

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Distributed Computing and Big Data: Hadoop and MapReduce

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Benchmarking Cassandra on Violin

Hadoop and Map-Reduce. Swati Gore

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

How To Handle Big Data With A Data Scientist

Testing Big data is one of the biggest

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

A Brief Introduction to Apache Tez

Big Data Are You Ready? Thomas Kyte

Transforming the Telecoms Business using Big Data and Analytics

Optimizing Performance. Training Division New Delhi

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

Big Data and Market Surveillance. April 28, 2014

APPLICATION PERFORMANCE MANAGEMENT BEST PRACTICES. Optimizing Hadoop. A collection of articles from the Compuware APM Center of Best Practices.

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Introduc8on to Apache Spark

The Internet of Things and Big Data: Intro

Integrate Master Data with Big Data using Oracle Table Access for Hadoop

<Insert Picture Here> Big Data

Hadoop: Embracing future hardware

Big Data Introduction

The 3 questions to ask yourself about BIG DATA

HBase Schema Design. NoSQL Ma4ers, Cologne, April Lars George Director EMEA Services

Hadoop Ecosystem B Y R A H I M A.

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

Unified Big Data Analytics Pipeline. 连 城

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

Hadoop. Sunday, November 25, 12

Introducing Storm 1 Core Storm concepts Topology design

Big Data Course Highlights

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

NoSQL for SQL Professionals William McKnight

INTRODUCING DRUID: FAST AD-HOC QUERIES ON BIG DATA MICHAEL DRISCOLL - CEO ERIC TSCHETTER - LEAD METAMARKETS

Open source Google-style large scale data analysis with Hadoop

Hadoop implementation of MapReduce computational model. Ján Vaňo

Apache Hadoop: Past, Present, and Future

HADOOP PERFORMANCE TUNING

Use case: Merging heterogeneous network measurement data

Open source large scale distributed data management with Google s MapReduce and Bigtable

Transcription:

Performance Management in Big Data Applica6ons Michael Kopp, Technology Strategist

NoSQL: High Volume/Low Latency DBs Web Java Key Challenges 1) Even Distribu6on 2) Correct Schema and Access paperns 3) Understanding Applica6on Impact 4) Proprietary API Key Benefits 1) Fast Read/Write 2) Horizontal Scalability 3) Redundancy and High Availability

Hadoop: Large Scale Parallel Processing Master Node Hadoop Cluster 1 Data Node per Host 1 Task Tracker per Host Many Task JVMs per Host Job Tracker Name Node 1) HDFS: Distributed File System 2) MapReduce Key Challenges 1) Op6mal Distribu6on 2) Unwieldy Configura6on 3) Can easily waste your resources 4) Failure or Error Analysis is hard 5) Performance Op6miza6on is hard Key Benefits 1) Massive Horizontal Batch Job 2) Split big Problems into smaller ones

Raw Data ingest Hadoop Cluster Ingest Data Query Data Raw Data! Slow and Cumbersome 4

Transform to Hive/Pig Structured Data Hadoop Cluster Ingest Data Query Data MapReduce needs to keep up with data ingest Structured Data! Query Performance! 5

Leveraging the Results Hadoop Cluster Ingest Data Query Data 6

Pushing Results further for more Business AnalyLcs Hadoop Cluster Data warehouse Keep up with Data Ingest External SLAs! Data is more valuable if its current! 7

Impressions look like 9

A High Level look at RTB 1. Browsers visit Publishers and create impressions. 2. Publishers sell impressions via Exchanges. 3. Exchanges serve as auclon houses for the impressions 4. On behalf of the marketer, m6d bids the impressions via the auclon house. If m6d wins, we display our ad to the browser.

Typical MapReduce Job

Hadoop at media6degrees Cri6cal piece of infrastructure Long Term Data Storage Raw logs Aggrega6ons Reports Generated data (feed back loops) Numerous ETL (Extract Transform Load) Scheduled and adhoc processes Used directly by Tech- Team, Ad Ops, Data Science Hadoop Cluster

Hadoop at media6degrees Two deployments 'produc6on' and 'research' ~ 500 TB - 40+ Nodes ~ 350 TB 20+ Nodes Thousands of jobs <5 minute jobs and 12 hour Job Flows Mostly Hive Jobs Some custom code and streaming jobs Hadoop Cluster

Most important Design Tenants Data Locality Par66on data for good distribu6on By 6me interval Par66on pruning with WHERE Clustering (aka bucke6ng) Op6mized sampling and joins Columnar Column oriented Hadoop Cluster One Hive Table Mar Apr May Jun Jul

The Price for Scale out 1

The DistribuLon Challenge Data Skew Non Uniform Computa6on (long running tasks) Raw Data Growth Data features change

Intermediate I/O Shuffle Bo\leneck Compression codec Block size Split- table formats Tuning Hadoop variables (spills, buffers, Etc, etc)

OpLmizing Your Map/Reduce Jobs directly Solves Problems at its roots BePer performance Less hardware Less data? Hadoop Custom Job Counters Extensive Logging Test Data is not like Real Data

What about Hive or Pig? Solves Problems at its roots BePer performance Less hardware Less data? Custom Job Counters Extensive Logging

Op6mize MapReduce

Understanding Map/Reduce Performance APen6on Maximum Data Parallelism Volume! Actual Mapping Parallelism Also Millions your own of Execu6ons!!! Code APen6on Poten6al Choke Point! Maximum Reduce Parallelism Actual Reduce Parallelism Also your own Code

Understanding Map/Reduce Performance

Map/Reduce Performance Breakdown

DistribuLon and Hotspots Avoid Brute Force Focus Then Op6mize on Big Hotspots Hadoop Save a lot of Hardware

Combine and Spill Performance

Map/Reduce behind the scenes Serialize De- Serialize and Serialize again Poten6ally Inefficient Too Many Files, Same Key spread all over De- Serialize and Serialize again Expensive Synchronous Combine 1) Pre Combine in Mapping Step 2) Avoid many intermediate files and combines

IdenLfying Root Cause for failed Jobs Failed Job Actual Errors

IdenLfying Root Cause for failed Jobs Error is in Reducer when wri6ng to S3

What comes aber Hadoop? Hadoop Cluster

Cassand and ApplicaLon Performance

Cassandra at m6d for Real Time Bidding RTB limited data is provided from exchange System to store informa6on on users Frequency Capping Visit History Segments (product service affinity) Low latency Requirements Less then 100ms Requires fast read/write on discrete data

Key Cassandra Design Tenants Swap/paging not possible Mostly schema- less Writes do not read Read/Write is an an6- papern Op6mize around put and get Not for scan and query De- Normalize data APempt to get all data in single read*

Cassandra Design Challenges De- normalize Store data to op6mize reads Composite (mul6- column) keys Mul6- column family and Mul6- tenant scenarios Compress seqngs Disk and cache savings CPU and JVM costs Data/Compac6on seqngs Size 6ered vs. LevelDB Caching, Memtable and other tuning

How to handle performance issues? System Monitoring CPU, Disk and Process Health Track req/sec Track size of Column Families, rows and columns (JMX) Applica6on Monitoring similar to JDBC Memory Issues? Too much GC Suspensions?

APM for Cassandra Applica6ons

NoSQL APM is not so different aber all Web Java Key APM Problems IdenLfied 1) Response Time Contribu6on 2) data access paperns 3) transac6on to query rela6onship (transac6on flow) 1) Data Access Distribu6on 2) End- to- End Monitoring 3) Storage (I/O, GC) BoPlenecks 4) Consistency Level

Real End- to- End ApplicaLon Performance Third Party Our ApplicaLon External End User Services End User Response Time Contribu6on

Understanding Cassandra s ContribuLon Which statements did the Transac6on Execute? Which node where Contribu6on they executed of each Too against? many Statment calls? Which Consistency Level was used? Data Access paperns

Understand Response Time ContribuLon 5 Calls 4 Calls ~50-80 ms Contribu6on? ~15 ms Contribu6on Access and Data Distribu6on

Why and how was a statement executed? 45ms latency? 60ms wai6ng on the server?

Any Hotspots on the Cassandra Nodes? Much more load on Node3? Which Transac6ons are responsible

Conclusion

Extend Performance Focus on ApplicaLon Web Java A Fast Database doesn t make a fast ApplicaLon

Intelligent MapReduce APM batch trigger Hive high- level map/reduce query Hive Server JOB JOB 1 1 2 2 3 3 4... 754 data/task node data/task node master node data/task node Simple OpLmizaLons with big impact

Q&A

Michael Kopp, Technology Strategist michael.kopp@compuware.com blog.dynatrace.com book.dynatrace.com 2012 Compuware Corporation All Rights Reserved 47

Monitor MapReduce and Hadoop