Simplifying Lustre Operations with Chukwa and Hadoop



Similar documents
HadoopTM Analytics DDN

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

Trends and Research Opportunities in Spatial Big Data Analytics and Cloud Computing NCSU GeoSpatial Forum

Chukwa: A large-scale monitoring system

IBM Tivoli Composite Application Manager for WebSphere

Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee

Jason Hill HPC Operations Group ORNL Cray User s Group 2011, Fairbanks, AK

PEPPERDATA OVERVIEW AND DIFFERENTIATORS

PATROL From a Database Administrator s Perspective

SIMPLE MACHINE HEURISTIC INTELLIGENT AGENT FRAMEWORK

Lustre * Filesystem for Cloud and Hadoop *

Network Management and Monitoring Software

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Hadoop. Sunday, November 25, 12

MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration

Proactive Performance Management for Enterprise Databases

Agile Business Intelligence Data Lake Architecture

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

IBM QRadar Security Intelligence April 2013

XpoLog Competitive Comparison Sheet

Modern IT Operations Management. Why a New Approach is Required, and How Boundary Delivers

Bright Cluster Manager

OMNITURE MONITORING. Ensuring the Security and Availability of Customer Data. June 16, 2008 Version 2.0

Harnessing the Power of Big Data for Real-Time IT: Sumo Logic Log Management and Analytics Service

ROCANA WHITEPAPER How to Investigate an Infrastructure Performance Problem

XpoLog Center Suite Data Sheet

BIG DATA SOLUTION DATA SHEET

Presentation Title: When Anti-virus Doesn t Cut it: Catching Malware with SIEM

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Log Mining Based on Hadoop s Map and Reduce Technique

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

locuz.com Big Data Services

Using an In-Memory Data Grid for Near Real-Time Data Analysis

BIG DATA ANALYTICS For REAL TIME SYSTEM

A Brief Introduction to Apache Tez

Security Event Management. February 7, 2007 (Revision 5)

MinCopysets: Derandomizing Replication In Cloud Storage

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

BIG DATA TRENDS AND TECHNOLOGIES

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Extending the Enterprise Data Warehouse with Hadoop Robert Lancaster. Nov 7, 2012

Integrating a Big Data Platform into Government:

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

BIG DATA USING HADOOP

April 8th - 10th, 2014 LUG14 LUG14. Lustre Log Analyzer. Kalpak Shah. DataDirect Networks. ddn.com DataDirect Networks. All Rights Reserved.

Minder. simplifying IT. All-in-one solution to monitor Network, Server, Application & Log Data

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

End Your Data Center Logging Chaos with VMware vcenter Log Insight

Application Development. A Paradigm Shift

Networking in the Hadoop Cluster

Real World Big Data Architecture - Splunk, Hadoop, RDBMS

Big Fast Data Hadoop acceleration with Flash. June 2013

What is Security Intelligence?

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Hadoop and Map-Reduce. Swati Gore

OBSERVEIT DEPLOYMENT SIZING GUIDE

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

BIG DATA THE NEW OPPORTUNITY

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Zenoss for Cisco ACI: Application-Centric Operations

The Rise of Industrial Big Data. Brian Courtney General Manager Industrial Data Intelligence

XpoLog Center Suite Log Management & Analysis platform

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Platfora Big Data Analytics

Big Data on Microsoft Platform

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

In-Memory Analytics for Big Data

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Cloud Computing at Google. Architecture

Informatica PowerCenter The Foundation of Enterprise Data Integration

WHITEPAPER. PHD Virtual Monitor: Unmatched Value. of your finances. Unmatched Value for Your Virtual World

Oracle Big Data SQL Technical Update

Parallel Analysis and Visualization on Cray Compute Node Linux

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Presenting Mongoose A New Approach to Traffic Capture (patent pending) presented by Ron McLeod and Ashraf Abu Sharekh January 2013

HDP Hadoop From concept to deployment.

Centralized Logging in a Decentralized World

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Turnkey Hardware, Software and Cash Flow / Operational Analytics Framework

Splunk for VMware Virtualization. Marco Bizzantino Vmug - 05/10/2011

ScienceLogic vs. Open Source IT Monitoring

IBM Software InfoSphere Guardium. Planning a data security and auditing deployment for Hadoop

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

CSE-E5430 Scalable Cloud Computing Lecture 2

END TO END DATA CENTRE SOLUTIONS COMPANY PROFILE

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Transcription:

Simplifying Lustre Operations with Chukwa and Hadoop LAD2012 Paris, France Travis Dawson

Abstract Lustre is complicated. It is commonly deployed across dozens of servers, supporting thousands of clients. Each of these components is constantly generating logs providing insight into operations, performance, errors and recoveries within the environment. A Lustre environment that generates millions of events a day is not uncommon. A system administrator can quickly become overwhelmed in trying to associated events from separate servers and create a complete picture of how health the environment is, proactively respond to problems and enable visibility for developers and users to the environment. Enter Chukwa. Chukwa is a distributed tool for log gathering, storage and analysis that runs on Apache Hadoop, levering the Hadoop Distributed File System. Chukwa enables administrators to intelligently locate patterns in their environments, respond and correlate events. This presentation will cover how Chukwa can be implemented on top of Apache Hadoop to facilitate the gathering, correlation, analysis and reporting of large numbers of events across a Lustre environment.

Lustre Operational Struggles Lots of Log Messages Storage System Hardware Lustre/Application Messages are hard to decipher Change version to version Hidden and linked errors Errors/Warnings that hide other problems Reactive vs Proactive So many logs only able to concentrate on failures and total errors

Current Lustre Logging/Monitoring SyslogNG Rationalized PrintK John Hammond TACC Used to decipher messages SEC Simple/Statistical Event Correlation Perl Nagios Sometimes Multiple Instances to handle the load RRDTool Pretty Pictures One really big machine to monitor them all

Introducing Chukwa Open Source Data Collection System http://incubator.apache.org/chukwa Built on top of HDFS and Map/Reduce (Hadoop) Hadoop Scalability & Robustness Built in Visualization toolkit (HICC) Widgets Alerts Pretty Graphs

Chukwa Architecture Host Agent (low overhead) Low/No impact on applications Fast-Path and Reliable Delivery Alerting as well as Archival Scalable to over 200MB/sec HICC Visualization Built in Graphing and Widgets

Visualization/Reporting/Alerting

Lustre Logs - Nov 30 17:47:41 uiy601 attrd_updater: [7930]: info: Invoked: attrd_updater -n pingd-tcp -v 4 -d 60 Nov 30 17:48:06 uiy601 ntpd[5861]: synchronized to 10.0.6.173, stratum 4 Nov 30 17:49:45 uiy601 attrd_updater: [8059]: info: Invoked: attrd_updater -n pingd-tcp -v 4 -d 60 Nov 30 17:51:06 uiy601 kernel: Lustre: 7063:0:(mds_lov.c:1191:mds_notify()) MDS scratch-mdt0000: in recovery, not resetting orphans on scratch-ost0003_uuid Nov 30 17:51:06 uiy601 kernel: Lustre: scratch-ost0003-osc: Connection restored to service scratch-ost0003 using nid 10.0.6.176@tcp. Nov 30 17:51:06 uiy601 kernel: Lustre: 7063:0:(mds_lov.c:1191:mds_notify()) MDS scratch-mdt0000: in recovery, not resetting orphans on scratch-ost0004_uuid Nov 30 17:51:06 uiy601 kernel: Lustre: scratch-ost0004-osc: Connection restored to service scratch-ost0004 using nid 10.0.6.176@tcp. Nov 30 17:51:49 uiy601 attrd_updater: [8190]: info: Invoked: attrd_updater -n pingd-tcp -v 4 -d 60 Nov 30 17:52:02 uiy601 cib: [6014]: info: cib_stats: Processed 228 operations (175.00us average, 0% utilization) in the last 10min Nov 30 17:52:19 uiy601 kernel: LustreError: 7667:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.3@tcp (8de758f3-5e08-4d53-bef3-c952a2f7cc53): 1 clients in recovery for 300s Nov 30 17:52:19 uiy601 kernel: LustreError: 7667:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf81061f5cc000 x1386922024894473/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672039 ref 1 fl Interpret:/0/0 rc -16/0 Nov 30 17:52:25 uiy601 kernel: LustreError: 7668:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.3@tcp (8de758f3-5e08-4d53-bef3-c952a2f7cc53): 1 clients in recovery for 293s Nov 30 17:52:25 uiy601 kernel: LustreError: 7668:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf81030f78fc00 x1386922024894476/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672045 ref 1 fl Interpret:/0/0 rc -16/0 Nov 30 17:52:32 uiy601 kernel: LustreError: 7669:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.3@tcp (8de758f3-5e08-4d53-bef3-c952a2f7cc53): 1 clients in recovery for 286s Nov 30 17:52:32 uiy601 kernel: LustreError: 7669:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf810310f36800 x1386922024894477/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672052 ref 1 fl Interpret:/0/0 rc -16/0 Nov 30 17:52:39 uiy601 kernel: LustreError: 7670:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.3@tcp (8de758f3-5e08-4d53-bef3-c952a2f7cc53): 1 clients in recovery for 279s Nov 30 17:52:39 uiy601 kernel: LustreError: 7670:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf8103086b1c00 x1386922024894478/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672059 ref 1 fl Interpret:/0/0 rc -16/0 Nov 30 17:52:46 uiy601 kernel: LustreError: 7671:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.3@tcp (8de758f3-5e08-4d53-bef3-c952a2f7cc53): 1 clients in recovery for 272s Nov 30 17:52:46 uiy601 kernel: LustreError: 7671:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf810312dcc000 x1386922024894479/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672066 ref 1 fl Interpret:/0/0 rc -16/0 Nov 30 17:53:06 uiy601 kernel: LustreError: 7672:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.6@tcp (a64b13b1-9fc4-abf5-37f5-5860ef26fb03): 1 clients in recovery for 252s Nov 30 17:53:06 uiy601 kernel: LustreError: 7672:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf810619797800 x1386922073128969/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672086 ref 1 fl Interpret:/0/0 rc -16/0 Nov 30 17:53:15 uiy601 kernel: LustreError: 7673:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.3@tcp (8de758f3-5e08-4d53-bef3-c952a2f7cc53): 1 clients in recovery for 243s Nov 30 17:53:15 uiy601 kernel: LustreError: 7673:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf8103177a7c00 x1386922024894482/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672095 ref 1 fl Interpret:/0/0 rc -16/0 Nov 30 17:53:34 uiy601 kernel: LustreError: 7676:0:(ldlm_lib.c:944:target_handle_connect()) scratch-mdt0000: denying connection for new client 10.0.6.6@tcp (a64b13b1-9fc4-abf5-37f5-5860ef26fb03): 1 clients in recovery for 225s Nov 30 17:53:34 uiy601 kernel: LustreError: 7676:0:(ldlm_lib.c:944:target_handle_connect()) Skipped 2 previous similar messages Nov 30 17:53:34 uiy601 kernel: LustreError: 7676:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@afcbf81061fed2400 x1386922073128973/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1322672114 2012 DataDirect ref 1 fl Interpret:/0/0 Networks. All rc -16/0 Rights Reserved.

Analyzing Lustre Logs Groups of xx recoverable clients remain and temporarily refusing client connection from xx.xx.xx.xx@tcp messages XX gets lower and lower What other messages appear around that time (scheduled cron job taking up CPU) Eviction Root Cause Analysis Network issues (correlate with packet drops) BUG (look for related errors, Kernel Messages) Random Gremlins (stop feeding them) Any other ideas?

Big Data Analytics on Lustre Logs Machine Logs are Ideal for Analytics Little variability, easy to parse (mostly) REAL event correlation Storage Subsystem Hardware Application Find the Root Cause of Issues How does a slight issue here result in a failure there Detect subtle errors using data mining and machine learning techniques Xu et al Detecting Large Scale System Problems by Mining Console Logs 22 nd ACM SOSP, 2009 Proactive vs Reactive

Future work Deeper integration Blktrace/Seekwatcher for IO Performance Deeper Lustre Logging/PrintK Better Analytics Taking advantage of more Data Mining Research Detecting Large-Scale System Problems by Mining Console Logs Vampir Log Analytics Take in Vampir Log Traces Plot them either internally or externally Display using HICC

Putting it all together Scalable Capable Reactive -> Proactive

Questions

References Chukwa: A system for reliable large-scale log collection http://www.cs.berkeley.edu/~asrabkin/drafts/chukwa-lisa.pdf Apache Org Project Pages http://incubator.apache.org/chukwa/ http://wiki.apache.org/hadoop/chukwa Detecting Large-Scale System Problems by Mining Console Logs http://www.cs.berkeley.edu/~jordan/papers/xu-etal-sosp09.pdf