Data Preparation and Statistical Displays



Similar documents
Geostatistics Exploratory Analysis

Annealing Techniques for Data Integration

Diagrams and Graphs of Statistical Data

INTRODUCTION TO GEOSTATISTICS And VARIOGRAM ANALYSIS

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

MTH 140 Statistics Videos

The Comparisons. Grade Levels Comparisons. Focal PSSM K-8. Points PSSM CCSS 9-12 PSSM CCSS. Color Coding Legend. Not Identified in the Grade Band

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Fairfield Public Schools

Imputing Values to Missing Data

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Tail-Dependence an Essential Factor for Correctly Measuring the Benefits of Diversification

A MULTIVARIATE OUTLIER DETECTION METHOD

Algebra 1 Course Information

Organizing Your Approach to a Data Analysis

Java Modules for Time Series Analysis

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Incorporating Geostatistical Methods with Monte Carlo Procedures for Modeling Coalbed Methane Reservoirs


Non Linear Dependence Structures: a Copula Opinion Approach in Portfolio Optimization

PROPERTIES OF THE SAMPLE CORRELATION OF THE BIVARIATE LOGNORMAL DISTRIBUTION

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Week 1. Exploratory Data Analysis

Manhattan Center for Science and Math High School Mathematics Department Curriculum

ADD-INS: ENHANCING EXCEL

Introduction to Modeling Spatial Processes Using Geostatistical Analyst

Quantitative Methods for Finance

AP Statistics: Syllabus 1

How To Write A Data Analysis

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

430 Statistics and Financial Mathematics for Business

Vertical Alignment Colorado Academic Standards 6 th - 7 th - 8 th

Lecture 2. Summarizing the Sample

Common Core Unit Summary Grades 6 to 8

COMMON CORE STATE STANDARDS FOR

Summarizing and Displaying Categorical Data

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

UNIT 1: COLLECTING DATA

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

List of Examples. Examples 319

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Petrel TIPS&TRICKS from SCM

Correlation and Regression

Problem Solving and Data Analysis

2. Filling Data Gaps, Data validation & Descriptive Statistics

Simple Linear Regression

Data Visualization in R

Analysis of Financial Time Series

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

International College of Economics and Finance Syllabus Probability Theory and Introductory Statistics

Introduction to Exploratory Data Analysis

PCHS ALGEBRA PLACEMENT TEST

DecisionSpace Earth Modeling Software

Exploratory data analysis (Chapter 2) Fall 2011

Part II Chapter 9 Chapter 10 Chapter 11 Chapter 12 Chapter 13 Chapter 14 Chapter 15 Part II

Data exploration with Microsoft Excel: analysing more than one variable

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

EXPLORATORY DATA ANALYSIS: GETTING TO KNOW YOUR DATA

Exercise 1.12 (Pg )

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

Descriptive Statistics

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

2013 MBA Jump Start Program. Statistics Module Part 3

Linear Regression. Chapter 5. Prediction via Regression Line Number of new birds and Percent returning. Least Squares

Exploratory Data Analysis with MATLAB

Exploratory Data Analysis

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

MIC - Detecting Novel Associations in Large Data Sets. by Nico Güttler, Andreas Ströhlein and Matt Huska

Multivariate Analysis of Ecological Data

INTRODUCING DATA ANALYSIS IN A STATISTICS COURSE IN ENVIRONMENTAL SCIENCE STUDIES

CHARTS AND GRAPHS INTRODUCTION USING SPSS TO DRAW GRAPHS SPSS GRAPH OPTIONS CAG08

3. Data Analysis, Statistics, and Probability

Graphs. Exploratory data analysis. Graphs. Standard forms. A graph is a suitable way of representing data if:

APPLIED MISSING DATA ANALYSIS

Getting Correct Results from PROC REG

SPSS Tests for Versions 9 to 13

Dimensionality Reduction: Principal Components Analysis

A Regime-Switching Model for Electricity Spot Prices. Gero Schindlmayr EnBW Trading GmbH

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Introduction to Regression and Data Analysis

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Is log ratio a good value for measuring return in stock investments

What is the purpose of this document? What is in the document? How do I send Feedback?

Descriptive statistics parameters: Measures of centrality

STAT 360 Probability and Statistics. Fall 2012

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Statistics Graduate Courses

Transcription:

Reservoir Modeling with GSLIB Data Preparation and Statistical Displays Data Cleaning / Quality Control Statistics as Parameters for Random Function Models Univariate Statistics Histograms and Probability Plots Q-Q and P-P Plots Bivariate Statistics and Distributions Exploratory Data Analysis

Statistical Analysis of Data Well Depth X Y Poro Perm Layer Seq LogK 10100 7158.9 672.7 2886.0 27.00 554.000 1 51 2.744 10100 7160.4 672.8 2886.0 28.60 1560.000 1 51 3.193 10100 7160.7 672.8 2886.0 30.70 991.000 1 51 2.996 10100 7161.0 672.8 2886.0 28.40 1560.000 1 51 3.193 10100 7161.9 672.9 2886.0 31.00 3900.000 1 51 3.591 10100 7162.3 672.9 2886.0 32.40 4100.000 1 51 3.613 Goals of Exploratory Data Analysis understand the data: statistical versus geological populations ensure data quality condense information Only limited functionality in GSLIB (some special geostat tools) a supplement with commercial software

Statistics as Parameters of Random Function Models Not interested in sample statistics need to access underlying population parameters A model is required to go beyond the known data Because earth science phenomena involve complex processes they appear as random. Important to keep in mind that actual data are not the result of a random process.

Univariate Description Frequency Histogram: A counting of samples in classes Sometimes two scales are needed to show the details (use trimming limits) logarithmic scale can be useful Summary statistics mean is sensitive to outliners median is sensitive to gaps in the middle of a distribution locate distribution by selected quantiles (e.g., quartiles) spread measured by standard deviation (very sensitive to extreme values) GSLIB program histplt

Cumulative Probability Plot Cumulative Probability 0.01 0.10 1.00 10.0 Variable Useful to see all of the data values on one plot Useful for isolating statistical populations May be used to check distribution models: straight line on arithmetic scale a normal distribution straight line on logarithmic scale a lognormal distribution small departures can be important possible to transform data to perfectly reproduce any univariate distribution GSLIB program probplt

Cumulative Histograms Frequency Cumulative Frequency 1 0 The cumulative frequency is the total or the cumulative fraction of samples less than a given threshold

Cumulative Histograms Cumulative Frequency 1.0 0.25 0.0 First Quartile Value Cumulative frequency charts do not depend on the binwidth; they can be created at the resolution of the data A valuable descriptive tool and used for inference A quantile is the variable-value that corresponds to a fixed cumulative frequency first quartile = 0.25 quantile second quartile = median = 0.5 quantile third quartile = 0.75 quantile can read any quantile from the cumulative frequency plot Can also read probability intervals from the cumulative frequency plot (say, the 90% probability interval) Direct link to the frequency

Q-Q / P-P Plots 20.0 Q-Q Plot: Equal Weighted 1.0 P-P Plot: Equal Weighted True Value 10.0 True Value 0.5 0.0 10.0 20.0 0.0 0.5 Clustered Data Clustered Data 1.0 Compares two univariate distributions Q-Q plot is a plot of matching quantiles a a straight line implies that the two distributions have the same shape. P-P plot is a plot of matching cumulative probabilities a a straight line implies that the two distributions have the same shape. Q-Q plot has units of the data, P-P plots are always scaled between 0 and 1 GSLIB program qpplt

Q-Q Plots Data Set One Data Set Two value cdf value cdf 0.010 0.0002 0.060 0.0036 0.020 0.0014 0.090 0.0250 0.020 0.0018 0.090 0.0321 0.020 0.0022 0.030 0.0034 0.030 0.0038 0.960 0.4998 2.170 0.4964 0.960 0.5002 2.220 0.5036 38.610 0.9962 40.570 0.9966 42.960 0.9978 43.500 0.9982 19.440 0.9679 46.530 0.9986 20.350 0.9750 102.700 0.9998 58.320 0.9964 Sort the values in each data set Calculate cumulative distribution function (CDF) for each Match according to (CDF) values

Let's Build a Q-Q Plot Cumulative Frequency Cumulative Frequency Frequency Frequency Core Porosity Log Porosity Histograms of core porosity and log-derived porosity Preferential sampling explains difference; these are not paired samples so we can not detect bias in samples

Let's Build a Q-Q Plot Log porosity Core porosity Read corresponding quantiles from the cumulative frequency plots on the previous page Plot those quantiles on the plot

Univariate Transformation Frequency Cumulative Frequency Frequency Cumulative Frequency Transforming values so that they honor a different histogram may be done by matching quantiles Many geostatistical techniques require the data to be transformed to a Gaussian or normal distribution

Data Transformation Core Porosity Core Porosity Cumulative Frequency Frequency Cumulative Frequency Log-derived Porosity Log-derived Porosity Frequency Histograms of core porosity and log-derived porosity These are paired samples so we may want to transform the log-derived porosity values to core porosity

Data Transformation Use the cumulative frequency plots on the previous page to fill in the following table Log Porosity 10.0 15.0 20.0 25.0 30.0 Core Porosity Could we transform 20000 log-derived porosity values according to the 853 paired samples we have? Under what circumstances would we consider doing this? What problems might be encountered? 7 10 18 26 29 Cumulative Frequency Core Porosity Cumulative Frequency Log-derived Porosity

Monte Carlo Simulation Frequency Cumulative Frequency 0.7807 Core Porosity 28.83 Core Porosity Monte Carlo Simulation / Stochastic Simulation / Random Drawing proceed by reading quantiles from a cumulative distribution The procedure: generate a random number between 0 and 1 (calculator, table, program,... read the quantile associated to that random number For Example: Random Number 0.7807 0.1562 0.6587 0.8934 Simulated Number 28.83...

Scatterplots True versus Estimate 16.0 True Value 0.0 Estimate ρ rank > ρ Bivariate display, estimate-true, two covariates, or the same variable separated by some distance vector Linear correlation coefficient ranges between -1 and +1 and is sensitive to extreme values (points away from the main cloud) ρ rank < ρ Rank correlation coefficient is a useful supplement: if ρ rank > ρ then a few outliers are spoiling an otherwise good correlation if ρ rank < ρ then a few outliers are enhancing an otherwise poor correlation if ρ rank = 1 then a non-linear transform of one covariate can make ρ = 1 GSLIB program scatplt 16.0

Look at bivariate summaries Marginal histograms Scatterplots

Bivariate Distributions a) Bivariate histogram b) Bivariate cumulative distribution function c) Conditional cumulative distribution function

Conditional Distributions Permeability (md) Known primary value Distribution of possible permeability values at a known porosity value Calibration Scattergram z - values Prediction of conditional distributions is at the heart of geostatistical algorithms

Class Definition 10,000 Permeability 0.01 0.0 0.20 Porosity Choose equal probability intervals rather than equal porosity or permeability intervals Check for bias in histogram - often there are many more low porosity log data than low porosity core data Due to a biased core data set, we may need to set porosity cutoffs based on an even φ spacing rather than equal probability

Exploratory Data Analysis Plot the data in different ways; our eyes are good at pattern detection Choose geological/statistical populations for detailed analyses: populations must be identifiable in wells without core must be able to map these populations (categories) can not deal with too many, otherwise there are too few data for reliable statistics often a decision must be made to pool certain types of data stationarity is a property of statistical models and not reality important and very field/data/goals-specific Perform statistical analyses within each population: ensure data quality look for trends understand physics as much as possible Decluster data for geostatistical modeling Statistical tools are used throughout a reservoir characterization study