Health monitoring & predictive analytics To lower the TCO in a datacenter

Similar documents

Copyright 11/1/2010 BMC Software, Inc 1

Predictive Asset Maintenance

Customer Evaluation Report On Incident.MOOG

IT Performance & Capacity Management

The top 10 misconceptions about performance and availability monitoring

CA Database Performance

Modern IT Operations Management. Why a New Approach is Required, and How Boundary Delivers

CA Virtual Assurance/ Systems Performance for IM r12 DACHSUG 2011

HadoopTM Analytics DDN

Optimize Application Performance and Enhance the Customer Experience

Proactive and Reactive Monitoring

Management of VMware ESXi. on HP ProLiant Servers

Work Smarter, Not Harder: Leveraging IT Analytics to Simplify Operations and Improve the Customer Experience

VMware Virtualization and Cloud Management Overview VMware Inc. All rights reserved

The Trellis Dynamic Infrastructure Optimization Platform for Data Center Infrastructure Management (DCIM)

Wait, How Many Metrics? Monitoring at Quantcast

Performance Management for Enterprise Applications

DATA CENTER INFRASTRUCTURE MANAGEMENT

Using Zenoss to Manage Cloud Services

RaMP Data Center Manager. Data Center Infrastructure Management (DCIM) software

ANY SURVEILLANCE, ANYWHERE, ANYTIME

Return On Investment XpoLog Center

Operations Management for Virtual and Cloud Infrastructures: A Best Practices Guide

HP Service Health Analyzer: Decoding the DNA of IT performance problems

SIZING-UP IT OPERATIONS MANAGEMENT: THREE KEY POINTS OF COMPARISON

How To Make Data Streaming A Real Time Intelligence

Align IT Operations with Business Priorities SOLUTION WHITE PAPER

Capacity planning with Microsoft System Center

How To Use Mindarray For Business

7/15/2011. Monitoring and Managing VDI. Monitoring a VDI Deployment. Veeam Monitor. Veeam Monitor

HP Business Service Management 9.2 and

Predictive Analytics for APM. Neil MacGowan Technical Director Netuitive Europe 18 April 2013

Enterprise IT is complex. Today, IT infrastructure spans the physical, the virtual and applications, and crosses public, private and hybrid clouds.

Understanding the Promise of (DCIM)

Meeting the Challenge of Big Data Log Management: Sumo Logic s Real-Time Forensics and Push Analytics

Considerations for a Highly Available Intelligent Rack Power Distribution Unit. A White Paper on Availability

Product Review: James F. Koopmann Pine Horse, Inc. Quest Software s Foglight Performance Analysis for Oracle

The Purview Solution Integration With Splunk

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

Vistara Lifecycle Management

InfraStruxure TM Management Software

2015 Analyst and Advisor Summit. Advanced Data Analytics Dr. Rod Fontecilla Vice President, Application Services, Chief Data Scientist

Predictive Intelligence: Identify Future Problems and Prevent Them from Happening BEST PRACTICES WHITE PAPER

How To Manage A Network With Ccomtechnique

What s New in Help Desk Authority 8.2?

Splunk for VMware Virtualization. Marco Bizzantino Vmug - 05/10/2011

Monitoring Remedy with BMC Solutions

Oracle Enterprise Manager 13c Cloud Control

The Advantages of Enterprise Historians vs. Relational Databases

The Trellis Dynamic Infrastructure Optimization Platform for Data Center Infrastructure Management (DCIM)

Topology Aware Analytics for Elastic Cloud Services

Applying Data Center Infrastructure Management in Collocation Data Centers

A New Era Of Analytic

High Availability of VistA EHR in Cloud. ViSolve Inc. White Paper February

Fight fire with fire when protecting sensitive data

IBM SmartCloud Monitoring

Frequently Asked Questions Plus What s New for CA Application Performance Management 9.7

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

An Oracle White Paper June, Strategies for Scalable, Smarter Monitoring using Oracle Enterprise Manager Cloud Control 12c

Benchmarking Cassandra on Violin

Centralized Orchestration and Performance Monitoring

MySQL Enterprise Monitor

HP Virtualization Performance Viewer

PLA 7 WAYS TO USE LOG DATA FOR PROACTIVE PERFORMANCE MONITORING. [ WhitePaper ]

How To Use Ibm Tivoli Monitoring Software

Proactive Performance Management for Enterprise Databases

The Advantages of Plant-wide Historians vs. Relational Databases

XenServer Virtual Machine metrics

Using DCIM to Identify Data Center Efficiencies and Cost Savings. Chuck Kramer, RCDD, NTS

Best Practices for Managing Virtualized Environments

WHITE PAPER Empowering Application Workloads Migration to Cloud Services

Capacity Planning Use Case: Mobile SMS How one mobile operator uses BMC Capacity Management to avoid problems with a major revenue stream

Machine Data Analytics with Sumo Logic

Optimizing your IT infrastructure IBM Corporation

XpoLog Center Suite Log Management & Analysis platform

Monitoring and Log Management in Hybrid Cloud Environments

The Trellis Dynamic Infrastructure Optimization Platform

Zerto Virtual Manager Administration Guide

The Top 20 VMware Performance Metrics You Should Care About

Building Analytics Improve the efficiency, occupant comfort, and financial well-being of your building.

locuz.com Big Data Services

Learn How to Leverage System z in Your Cloud

AccelOps for Managed Service Providers

StruxureWare TM Data Center Expert

Proactive and Predictive Virtualization Management Optimizes Datacenter Availability and Utilization

Networking in the Hadoop Cluster

Redefining Infrastructure Management for Today s Application Economy

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Intel Service Assurance Administrator. Product Overview

Site24x7: Key Mistakes in Data Center Operations

Transcription:

Health monitoring & predictive analytics To lower the TCO in a datacenter PRESENTATION TITLE GOES HERE Christian B Madsen & Andrei Khurshudov Seagate Technology christian.b.madsen@seagate.com

Outline 1. The opportunity 2. Our vision & implementation 3.Use cases 4. Summary 2

The opportunity PRESENTATION TITLE GOES HERE

What if Seagate offered you a technology that could help you Improve datacenter efficiency Optimize system management Reduce potential cost of operation 4

The problem Failures in storage lead to costly outages 1 billion hard drives will be used in cloud datacenters by 2020, highlighting the need to manage drive health at scale One total outage per datacenter is statistically expected every year 80% of those outages are not completely explained (or linked to root causes) $700,000 is the average cost per incident $8,000 is the average cost per minute of an unplanned outage Up to 10% of datacenter accidents are related to storage 56% of 13ZB 7.3ZB, >1billion drives in cloud 2020 56% Source: Seagate Strategic Marketing and Research 2013 5

Better drive management will lower the TCO Top 4 challenges in drive management 1. Drive health monitoring Need reliable key performance indicators to track drive health status 2. Drive failure prediction Ultimately, we want to know when our drives will fail so we can take actions before that happens 3. Drive failure diagnostics and management automation Need to correctly identify and quickly resolve issues Need to prevent false alerts to reduce cost of failure handling 4. Drive lifespan extension Need to know how to optimize operating environment for better reliability Need to reuse partially good drives (should be possible with in-drive diagnostic, IDD) 6

Our vision & implementation PRESENTATION TITLE GOES HERE

Read (GB) Write (GB) CPU (%) Temp (C) Read Write CPU Bytes In Bytes Out Temp 1. 4% > WL (W+R) spec 2. 1% > 2σ(T) Rack 1 Proxy-1 Swift-f1 Details Rack 2 Proxy-2 Swift-f1 Our vision Monitoring, analytics, prediction and control The internet of things 2 Data Aggregation MONITORING ANALYTICS Quick Issue Resolution S.M.A.R.T. Cloud Ganglia Gazer TM Metrics: Alerts (2): Sources: Global Access CONTROL Actionable Decisions Report storage health Run drive self-test Shut-down systems Repair drives Run auto-fa Point at an issue Highlight inefficiency Predict reliability Detect anomalies Data Center PREDICTIONS Early Warning System Drive-centric health monitoring Analytics and predictive models Closed-loop automation 8

Functional diagram Monitoring, intelligent decisions and automation Closed-loop automation Example Monitor Monitor Yes Passes Compliance (threshold) No Exception (alert) Drive health Not passing Drive predicted to fail Recommended action Resolution Choose action Automation Choosing action from recommended action can be automated by tying it to the specific application or saving choices Reset or turn off drive Turn off drive Choose action 9

Implementation Architecture overview Agents Cloud Gazer TM Elements Storage Server Drives Storage Server Drives Real-time metric aggregation Analytics Engine Data pool (10,000s of drives) nosql Cluster Server Drive Server ReSTful API Query data Check thresholds Manage drives REST API Calls Cloud Gazer TM Dashboard Storage Software Storage eco system Storage Server Storage Software Storage Server ReSTful API Cloud Gazer TM Drive Monitoring and Analytics Cloud Gazer TM Dashboard Storage Server Storage Server Drives Drives Drives Drives 10

Use cases PRESENTATION TITLE GOES HERE

Compliance (thresholds) Degradation and performance warnings Danger zone Datacenter failure rate 12

Compliance (thresholds) Degradation and performance warnings Overload detection Detecting and reporting when drive load exceeds design limits Danger zone Datacenter failure rate 13

Compliance (thresholds) Degradation and performance warnings Recommended action How to increase drive reliability Overload detection Detecting and reporting when drive load exceeds design limits Danger zone Datacenter failure rate 14

Compliance (thresholds) Degradation and performance warnings Recommended action How to increase drive reliability Overload detection Detecting and reporting when drive load exceeds design limits HDD population failure rate Measuring stress and estimating failure acceleration of the disk drive population in real time. Relies on the proprietary failure prediction algorithms Danger zone Datacenter failure rate 15

Compliance (thresholds) Degradation and performance warnings Failure detection Warning about expected drive failures. Relies on the proprietary failure prediction algorithms that use unsupervised machine learning techniques. Expected average failure prediction time window is from 9 days to 12 days. Recommended action How to increase drive reliability Overload detection Detecting and reporting when drive load exceeds design limits HDD population failure rate Measuring stress and estimating failure acceleration of the disk drive population in real time. Relies on the proprietary failure prediction algorithms Danger zone Datacenter failure rate 16

Workload optimization Drive visibility tools to improve workload balancing Before Load balancing issues After Workload distributed over servers and time Day Week Month Workload predominantly hitting one server Workload peaked on Sunday 17

Unsupervised machine learning and failure prediction No interaction between drive set, no prior knowledge Drives in field Multivariate time-series monitoring Apply failure prediction algorithm in parallel in real-time Real-time status prediction of drive Fine or going to fail For now, an average failure prediction window is on the order of 9 to 12 days Failure prediction accuracy ranges from 55% to 90% 18

Prediction and follow up actions Heat map indicates drives at risk and you can issue drive tests (DST, IDD, ) to resolve or corroborate Systematic failure predicted: 3 out of 5 drives predicted to fail sit in end location of servers 19

Find failure triggers Root cause tools including a drive temperature heat map can help you triage the cause of your drive issues Common factors for drives in the end position is a cooler temperature. Therefore increasing the server temperature may reduce the (dominant) failure mechanism and increase drive reliability Systematic failure predicted: 3 out of 5 drives predicted to fail sit in end location of servers 20

Failure prediction lead time We can predict drives will fail on average 9-10 days before the failure Case study 1, we predicted most drives (118 drives) to fail 12 days prior to failure Case study 2, we predicted 5 drives to failed 23 days prior to failure, 2 drives 22 days prior to failure, 2 drives just one day in advance Currently catch 55-90% of failures ahead of time 21

Summary PRESENTATION TITLE GOES HERE

Why Cloud Gazer TM? Truly drive-centric management tool for the cloud Most efficient tool for extracting drive health information using Seagate IP Nobody knows drives better than us Freeware utilities are frequently wrong Runs on any Linux system with little overhead (<1%) Windows is next Data can be collected, monitored and analyzed locally or in the Cloud ReSTful API to interact with other software New analytics, prediction, AI, and control capabilities are added continually Drive repair will be possible with in-drive diagnostic Simply SMARTer Competition Enclosure control will be possible by summer 2015 * Seagate drives 23

Questions? PRESENTATION TITLE GOES HERE