lprof : A Non- intrusive Request Flow Pro5iler for Distributed Systems

Similar documents

Benchmarking Hadoop & HBase on Violin

ReproLite : A Lightweight Tool to Quickly Reproduce Hard System Bugs

Comparing Scalable NOSQL Databases

Deploying Splunk on Amazon Web Services

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

How To Scale Out Of A Nosql Database

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Can the Elephants Handle the NoSQL Onslaught?

Hadoop & Spark Using Amazon EMR

Moving From Hadoop to Spark

Open source large scale distributed data management with Google s MapReduce and Bigtable

Network Measurement. Why Measure the Network? Types of Measurement. Traffic Measurement. Packet Monitoring. Monitoring a LAN Link. ScienLfic discovery

Open source Google-style large scale data analysis with Hadoop

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Cloudera Manager Introduction

Snapshots in Hadoop Distributed File System

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Native Connectivity to Big Data Sources in MSTR 10

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

CS555: Distributed Systems [Fall 2015] Dept. Of Computer Science, Colorado State University

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Big Data and Hadoop for the Executive A Reference Guide

Bigtable is a proven design Underpins 100+ Google services:

Hadoop Technology for Flow Analysis of the Internet Traffic

So What s the Big Deal?

Data Semantics Aware Cloud for High Performance Analytics

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Scalable Forensics with TSK and Hadoop. Jon Stewart

Real World Big Data Architecture - Splunk, Hadoop, RDBMS

Sujee Maniyam, ElephantScale

Managing large clusters resources

BIG DATA - HADOOP PROFESSIONAL amron

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Configuration Manual Yahoo Cloud System Benchmark (YCSB) 24-Mar-14 SEECS-NUST Faria Mehak

Hadoop Setup. 1 Cluster

Map Reduce & Hadoop Recommended Text:

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

MinCopysets: Derandomizing Replication In Cloud Storage

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

Cloudera Manager Training: Hands-On Exercises

Hadoop for MySQL DBAs. Copyright 2011 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Informatica Master Data Management Multi Domain Hub API: Performance and Scalability Diagnostics Checklist

Getting to know Apache Hadoop

CDH 5 Quick Start Guide

Analysis of MapReduce Algorithms

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Can High-Performance Interconnects Benefit Memcached and Hadoop?

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions

A Cost-Evaluation of MapReduce Applications in the Cloud

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Mobile Cloud Computing for Data-Intensive Applications

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Hadoop 2.6 Configuration and More Examples

Cloud Storage Solution for WSN Based on Internet Innovation Union

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Copyright Shraddha Pradip Sen 2011 ALL RIGHTS RESERVED

Apache Hama Design Document v0.6

Integrating Big Data into the Computing Curricula

Like what you hear? Tweet it using: #Sec360

High Availability for Database Systems in Cloud Computing Environments. Ashraf Aboulnaga University of Waterloo

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Big Data Course Highlights

Performance Testing Process A Whitepaper

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Advanced Computer Networks. Scheduling

Applied Storage Performance For Big Analytics. PRESENTATION TITLE GOES HERE Hubbert Smith LSI

Energy Efficient MapReduce

Automated Machine Learning For Autonomic Computing

YARN Apache Hadoop Next Generation Compute Platform

Lecture Data Warehouse Systems

COURSE CONTENT Big Data and Hadoop Training

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Hadoop Performance Diagnosis By Post-execution Log Analysis

HiTune: Dataflow-Based Performance Analysis for Big Data Cloud

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Enabling High performance Big Data platform with RDMA

Apache HBase. Crazy dances on the elephant back

Audit Trails in the Aeolus Distributed Security Platform. Victoria Popic

Media Upload and Sharing Website using HBASE

System requirements for ICS Skills ATS

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

SALSA: Analyzing Logs as StAte Machines 1

Introduction to Hadoop

Transcription:

lprof : A Non- intrusive Request Flow Pro5iler for Distributed Systems Xu Zhao, Yongle Zhang, David Lion, Muhammad FaizanUllah, Yu Luo, Ding Yuan, Michael Stumm Oct. 8th 2014

Performance analysis tools are needed Poor performance of distributed systems leads to o Increase of user latency o Increase of data center cost Distributed system behavior is hard to understand o Concurrent requests being processed by mullple nodes To diagnose poor performance, tools are needed to o Reconstruct the request control flow o Understand system behavior 2

Existing tools are intrusive Instrument systems to infer request control flow o E.g. MagPie, Project 5, X- Trace, Dapper, etc. o Incur performance overhead o InstrumentaLons are oqen system specific 3

System logs contain rich information Rich informalon in logs is not coincidence o Developers rely on logs to perform manual debugging Distributed systems generate lots of logs o During normal execulon 7TB/month [Cloudera 13] 25TB/day [Rothschild 09] 4

Existing log analyzers are limited Cannot infer request control flow o Machine learning based log analyzers E.g. [Xu 09], DISTALYZER, SynopGc, etc. Only detect system anomalies o Commercial tools E.g. splunk, VMWare LogInsight Require users to perform key- word based searches 5

lprof: a non- intrusive pro5iler Infers request control flow from system logs o Along with Lming informalon o Group logs printed by the same request on mullple nodes o Use informagon generated by stagc analysis System logs Request control 5low Node 2 Node 1 request Node1 Node2 Thread1 Thread3 Thread2 Thread4 6

Outline IntroducLon Case Study Design EvaluaLon 7

A real- world example Performance regression HDFS- 4049 writeblock Latency for each type of request readblock (ms) 1500 1000 Latency 500 04:00 08:00 12:00 16:00 Time 0 writeblock is suspecious 8

(ms) 1000 Zoom into per- node latency Per- node Latency DN1 Latency 750 500 DN3 DN2 DN1 DN3 DN2 250 0 pre- update writeblock Node post- update writeblock o Intra- node latency doesn t increases while inter- node does o Conclusion: unnecessary network communicalon 9

Outline IntroducLon Case Study Design EvaluaLon 10

Overview Byte code StaLc analysis lprof Model VisualizaLon Logs Log analysis Request database 11

Challenges Goal: to sltch log messages with respeclve requests DataNode2 Receiving block blk_01 DataNode1 Receiving block blk_01 Receiving block blk_02 Received block blk_01 HDFS_READ block blk_01 blk_01 terminating log snippet from HDFS data nodes writeblock1 writeblock2 readblock Logs are interleaved From different request types From different request instances of the same type Perfect idenlfiers doesn t always exist Distributed across mullple nodes 12

Code snippet in HDFS 1 dataxceiver() { 2 switch(opcode){ 3 case WRITE_BLOCK: 4 blk_id = getblock(); 5 writeblock(blk_id, ); 6 break; 7 case READ_BLOCK: 8 blk_id = getblock(); 9 readblock(blk_id, ); 10 break; 11 } 12} 13 writeblock(blk_id, ) { 14 log( Receiving block + blk_id); 15 new PacketResponder().start(); 16 } 17 readblock(blk_id, ) { 18 log( HDFS_READ block + blk_id); 19 } 20 PacketResponder.run() { 21 log( Received block + blk_id); 22 23 log(blk_id + terminating ); 24 } Top level method - starlng method to process a request Request idenlfier - logged variable not modified in one request Log temporal order - possible order between log statements CommunicaLon behavior - communicalon between threads 13

Find top level method Find request idenlfiers IntuiLon Request analysis o Request idenlfiers already exist for manual debugging o Not modified within one request o Once modified, outside of the request 14

Bokom- up analysis on call graph o Logged variables idenlfier candidates (IC) o Number of Lmes they got printed count Call Graph 1 dataxceiver () { 2 switch(opcode){ 3 case WRITE_BLOCK: Request analysis example 4 blk_id = getblock(); 5 writeblock(blk_id, ); 6 break; 7 case READ_BLOCK: 8 blk_id = getblock(); 9 readblock(blk_id, ); 10 break; 11 } 12} writeblock() IC: {blk_id} count: 8 dataxceiver() IC: {} count:0 readblock() IC: {blk_id} count: 7 top level method idengfiers writeblock readblock blk_id blk_id o Once count decreases, pick top level method and idenlfier 15

Temporal order analysis Control flow analysis in each top level method 20 PacketResponder.run() { 21 log( Received block + blk_id); 22 23 log(blk_id + terminating ); 24 } temporal order 16

Communication pair analysis CommunicaLon between request top level methods o Intra- node: thread crealon, shared objects o Network: socket, RPC Pair serializing and de- serializing methods 17

Summary of static analysis output Top level method & request idenlfier top level method idengfiers writeblock() blk_id readblock() blk_id Log temporal order PacketResponder.run() CommunicaLon pair writeblock() type: Network id: blk_id writeblock() type: Thread CreaLon id: blk_id writeblock() PacketResponder.run() 18

Distributed log stitching Implemented as a MapReduce job Node 3 Node 2 Node 1 1 Receiving blk_01 2 Received blk_01 3 blk_01 terminating Node 1 s log writeblock [log1],blk_01 PktRsp.run [log2,3],blk_01 [log1,2,3],blk_01 writeblock [9 logs from 3 nodes], blk_01 Map Combine Reduce 19

Outline IntroducLon Case Study Design EvaluaLon 20

Evaluation methodology Evaluated on logs from 4 distributed systems o HDFS, Yarn, HBase, Cassandra o Logs generated on 200 Amazon EC2 nodes o HiBench, YCSB workload Authors manually verified each unique log sequence logs Receiving block Received block terminating writeblock log sequence 21

Request attribution accuracy System Correct Incomplete Failed Incorrect HDFS Yarn Cassandra HBase accuracy for all the log messages 97.0% 79.6% 95.3% 90.6% 0.1% 19.2% 0.1% 2.5% 2.6% 1.2% 4.6% 3.4% 0.3% 0.0% 0.0% 3.5% Average 90.4% 5.7% 3.0% 1.0% 22

Real- world performance anomalies Randomly selected 23 anomalies o Reproduced each one to collect logs lprof is helpful for idenlfying the root cause for 65% Reasons for the cases lprof cannot help o Abnormal requests don t print any logs o The abnormal request only print 1 log But latency is needed for debugging 23

Related work Intrusive tools o E.g. MagPie, Project 5, X- Trace, Dapper, etc. ExisLng log analyzers o E.g. [Xu 09], DISTALYZER, SynopGc, etc. The Mystery Machine [Chow 14] o Infers request flow across soqware layers o Analyzes crilcal path and slack o But requires instrumenlng IDs into logs 24

Conclusions lprof: a profiler for distributed system o Infers request control flow along with Lming informalon o Non- intrusive because enlrely from system logs o Analyzes logs with informalon generated by stalc analysis lprof leverages the natural way developers do logging Demo 25

Limitations lprof benefits from good logging praclce o lprof cannot help when there s no log o Timestamp is required for latency analysis o Good idenlfier can improve the accuracy lprof cannot infer request across soqware layers lprof currently works on Java byte code 26