Extreme-Value Analysis of Corrosion Data



Similar documents
Extreme Value Analysis of Corrosion Data. Master Thesis

12.5: CHI-SQUARE GOODNESS OF FIT TESTS

Maximum Likelihood Estimation

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Practice problems for Homework 11 - Point Estimation

Extreme Value Modeling for Detection and Attribution of Climate Extremes

Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page

Exact Confidence Intervals

The R programming language. Statistical Software for Weather and Climate. Already familiar with R? R preliminaries

Estimation and attribution of changes in extreme weather and climate events

Distances, Clustering, and Classification. Heatmaps

Data Preparation and Statistical Displays

Capability Analysis Using Statgraphics Centurion

Bayesian Statistics in One Hour. Patrick Lam

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering UE 141 Spring 2013

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

An introduction to the analysis of extreme values using R and extremes

Neural Networks Lesson 5 - Cluster Analysis

Planning And Scheduling Maintenance Resources In A Complex System

Econometrics Simple Linear Regression

Efficiency and the Cramér-Rao Inequality

Multivariate Analysis of Ecological Data

Java Modules for Time Series Analysis

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Nonparametric Statistics


Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Statistical Machine Learning from Data

Aggregate Loss Models

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

An application of extreme value theory to the management of a hydroelectric dam

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Software for Weather and Climate

2WB05 Simulation Lecture 8: Generating random variables

Simple Linear Regression Inference

Applications of R Software in Bayesian Data Analysis

MATH2740: Environmental Statistics

Least Squares Estimation

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Questions and Answers

Computation of the Aggregate Claim Amount Distribution Using R and actuar. Vincent Goulet, Ph.D.

Cluster Analysis: Advanced Concepts

Expression Quantification (I)

Operational Risk Management: Added Value of Advanced Methodologies

Supplement to Call Centers with Delay Information: Models and Insights

Basics of Statistical Machine Learning

Estimating the random coefficients logit model of demand using aggregate data

Clustering of Documents for Forensic Analysis

A Cluster Analysis Approach for Banks Risk Profile: The Romanian Evidence

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Calculating VaR. Capital Market Risk Advisors CMRA

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Machine Learning and Pattern Recognition Logistic Regression

Data Mining and Visualization

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Joint Exam 1/P Sample Exam 1

Optical Design for Automatic Identification

The Normal distribution

SOCIETY OF ACTUARIES/CASUALTY ACTUARIAL SOCIETY EXAM C CONSTRUCTION AND EVALUATION OF ACTUARIAL MODELS EXAM C SAMPLE QUESTIONS

Forschungskolleg Data Analytics Methods and Techniques

Alessandro Birolini. ineerin. Theory and Practice. Fifth edition. With 140 Figures, 60 Tables, 120 Examples, and 50 Problems.

Elements of statistics (MATH0487-1)

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Math 425 (Fall 08) Solutions Midterm 2 November 6, 2008

Recent Developments of Statistical Application in. Finance. Ruey S. Tsay. Graduate School of Business. The University of Chicago

Elliptical Galaxies. Houjun Mo. April 19, Basic properties of elliptical galaxies. Formation of elliptical galaxies

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Dongfeng Li. Autumn 2010

C. PROCEDURE APPLICATION (FITNET)

The Network Structure of Hard Combinatorial Landscapes

Chapter ML:XI (continued)

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

171:290 Model Selection Lecture II: The Akaike Information Criterion

NSW Licensed Pipeline Performance Reporting Guidelines

Lecture 3: Linear methods for classification


Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March Due:-March 25, 2015.

Analyzing The Role Of Dimension Arrangement For Data Visualization in Radviz

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

An Introduction to Machine Learning

The impact of window size on AMV

Model Risk Within Operational Risk

Exercise 1.12 (Pg )

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

MATH4427 Notebook 2 Spring MATH4427 Notebook Definitions and Examples Performance Measures for Estimators...

Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps

Automatic Labeling of Lane Markings for Autonomous Vehicles

Factor Analysis. Factor Analysis

RECOMMENDATION ITU-R P Method for point-to-area predictions for terrestrial services in the frequency range 30 MHz to MHz

Transcription:

July 16, 2007 Supervisors: Prof. Dr. Ir. Jan M. van Noortwijk MSc Ir. Sebastian Kuniewski Dr. Marco Giannitrapani

Outline Motivation

Motivation in the oil industry, hundreds of kilometres of pipes and other equipment can be affected by corrosion it is extreme defect depth/wall loss that influences the system reliability extreme-value methods are sensible for application only part of the system can be subjected to inspection results extrapolation is needed

Motivation defects/wall losses caused by corrosion are likely to be locally dependent independence assumption questionable

present statistical methods to model extremes of corrosion data, taking into account local defect dependence spatial extrapolation of the results examples of application framework/guideline for modelling extreme-values of corrosion

The Generalised Extreme-Value (GEV) distribution (block maxima data) G(z) = ( ) h z µ i 1 ξ exp 1 + ξ, ξ 0 σ + n h io exp exp, ξ = 0, z µ σ

The Generalised-Pareto (GP) distribution (excess over threshold data) Y i = X i u, for X > u, i = 1,...,n u 4 Threshold exceedances x i 3.5 u 2.5 2 1.5 1 H(y) =» 1 1 + ξȳ σ 1 exp ȳ σ 1/ξ, ξ 0 +, ξ = 0 0.5 0 0 2 4 6 8 10 12 14 16 18 20 i

(extrapolation-gev) return-level method G(z p ) = 1 p Pr{M > z p } = p = 1 N implied distribution of the maximum corresponding to the not inspected area Pr{X N z} = G N (z) = G(z) N

(extrapolation-gp) based on Poisson frequency of threshold exceedances (Poisson-GP model) return-level method H(y p ) = 1 p Pr{Y > y p } = p = 1 N E N E = λ GEV ( S k A ) - expected number of exceedances on the not inspected area

(extrapolation-gp) implied distribution of the maximum corresponding to the not inspected area ( Pr{X N x} = exp { λ GEV 1 + ξ x u ) } 1/ξ N σ +

-summary two methods for statistical inference about extreme-values of corrosion - the GEV distribution (block maxima) - the GP distribution (excess over threshold data) two methods for spatial results extrapolation - return-level - distribution of the maximum corresponding to the not inspected area the GEV and GP distributions are closely related and theoretically, should give the same results

Modelling extremes of dependent data for the stationary data characterised by the limited extend of long-range dependence at extreme levels, the extreme-value methods can be still applied in corrosion application it is reasonable to assume that pit depths are locally dependent extreme-value methods are applicable

Modelling extremes of dependent data with the GEV distribution assuming local data dependence, block maxima of stationary data (for sufficiently large block sizes) can be considered as approximately independent the GEV distribution is used in its standard form

Modelling extremes of dependent data with the GP distribution neighbouring exceedances may be dependent, therefore the change of practise is needed one of the most widely adopted method is data declustering - filtering out dependent observations such that remaining exceedances can be considered as approximately independent

Modelling extremes of dependent data with the GP distribution-approach define clusters of exceedances identify maximum excess within each cluster assuming that cluster maxima are independent fit the GP distribution

Estimation of the number of data clusters

Extremal index method: the extent of short-range dependence of extreme events is captured by the parameter θ, called extremal index 1 θ = limiting mean cluster size extremal index measures the degree of clustering of the process at extreme levels

Extremal index method 1 θ = N c = θ N e limiting mean cluster size where N c - number of clusters, N e - number of exceedances above threshold u θ is estimated by the intervals estimator

Finding number of clusters using clustering algorithm define a validity criteria for the number N c of found clusters run the clustering algorithm for a range of N c as proper number of clusters choose the one for which the validity criteria are optimised

Agglomerative hierarchical clustering algorithm starts with single points as clusters at each step the two closest clusters are merged stops when only one cluster remains

Agglomerative hierarchical clustering algorithm 7 7 6 6 9 5 4 2 distance 5 5 4 8 6 2 distance 5 3 3 3 3 2 1 4 2 1 7 4 1 1 0 0 1 2 3 4 5 6 7 0 0 1 2 3 4 5 6 7 Dendrogram 2 link Distance between objects 1.5 1 distance= height of the link 0.5 0 2 3 1 4 5 Objects

Validation criteria, that we used, aim at identifying clusters that are compact and well isolated: silhouette plot - maximum value indicates optimum Davies-Bouldin index - minimum indicates optimum

-summary two methods to estimate the number of data clusters - the extremal index method - clustering algorithm + validation criteria (Davies-Bouldin index, silhouette plot) prior to data declustering perform clustering tendency test

Simulated corroded surface application of the gamma-process model dependence in terms of the product moment correlation ( 2 ) q/p ρ(x k, X l ) = exp d dist i p i=1 7 6 X k = (x k,y k ), X l = (x l,y l ), dist 1 = x k x l, dist 2 = y k y l, d = 0.3, p = 2, q = 1 5 4 3 2 1 4 2 3 distance 5 1 0 0 1 2 3 4 5 6 7

Simulated corroded surface 1 0.9 Matrix 1 Matrix 2 Matrix 3 Matrix 4 Matrix 5 0.8 Correlation coefficient 0.7 0.6 0.5 0.4 0.3 Matrix 6 Matrix 11 Matrix 7 Matrix 12 Matrix 8 Matrix 13 Matrix 9 Matrix 14 Matrix 10 Matrix 15 0.2 0.1 Matrix 16 Matrix 17 Matrix 18 Matrix 19 Matrix 20 0 0 20 40 60 80 100 x

Simulated corroded surface

Simulated corroded surface - clustering algorithm 1.2 Davies Bouldin index vs number of clusters 1 Silhouette plot 1.1 0.95 1 Davies Bouldin index 0.9 0.8 0.7 0.6 0.5 0.4 0.3 Average silhouette coefficient 0.9 0.85 0.8 0.75 0.7 0.2 20 40 60 80 100 120 140 Number of clusters 0.65 20 40 60 80 100 120 140 Number of clusters

Simulated corroded surface - results ˆθ n c n calg 0.544 107 121 Table: The estimate of extremal index and determined number of clusters Number of clusters ADup 2 p v. KS p v. 107 0.376 0.204 121 0.549 0.493 Table: Goodness-of-fit test results for different number of clusters

ˆξ ˆ σ AD 2 up p v. KS p v. 0.065 1.341 0.647 0.488 Table: GP fit to excess of dependent data ˆξ ˆ σ ADup 2 p v. KS p v. -0.0137 1.6839 0.555 0.493 Table: GP fit to excess of declustered data ˆξ ˆσ ˆµ ADup 2 p v. KS p v. -0.007 1.627 5.637 0.621 0.899 Table: GEV fit to block maxima data

Real data example

Real data example

Real data example 0.8 Davies Bouldin index vs number of clusters 1 Silhouette plot 0.7 0.95 Davies Bouldin index 0.6 0.5 0.4 0.3 Average silhouette coefficient 0.9 0.85 0.8 0.2 0.75 0.1 20 40 60 80 100 120 140 160 Number of clusters 0.7 20 40 60 80 100 120 140 160 Number of clusters

ˆξ ˆσ ˆµ ADup 2 p v. KS p v. -0.082 0.182 0.757 0.713 0.986 Table: GEV fit to block maxima data ˆξ ˆ σ ADup 2 p v. KS p v. -0.008 0.133 0.616 0.074 Table: GP fit to excess of dependent data ˆξ ˆ σ ADup 2 p v. KS p v. -0.069 0.172 0.717 0.654 Table: GP fit to excess of declustered data

4 3.5 Extrapolated probability density function GP based GEV 4 3.5 Extrapolated probability density function GP based GEV 3 3 2.5 2.5 Density 2 Density 2 1.5 1.5 1 1 0.5 0.5 0 0.5 1 1.5 2 2.5 3 Extreme pit depth [mm] 0 0.5 1 1.5 2 2.5 3 Extreme pit depth [mm]

with the GEV distribution

with the GP distribution

the two applied distributions are closely related and lead to the consistent inference about extreme-values of corrosion data declustering improves the results given by the GP distribution the performance of other clustering algorithms could be checked in order to take into account corrosion nonstationarity due to space-varying environmental conditions, covariate-dependent extreme-value models with trends could be considered

THANK YOU FOR ATTENTION Questions???