Software for Distributions in R




David Scott (1), Diethelm Würtz (2), Christine Dong (1)
(1) Department of Statistics, The University of Auckland
(2) Institut für Theoretische Physik, ETH Zürich
July 10, 2009

Outline
1 Introduction
2 Distributions in Base R
3 Design for Distribution Implementation

1 Introduction

Distributions
- Distributions are how we model uncertainty
- There is well-established theory concerning distributions
- There are standard approaches for fitting distributions
- There are many distributions which have been found to be of interest
- Software implementation of distributions is a well-defined subject in comparison to, say, modelling of time series

Introduction
- In base R there are 20 distributions implemented, at least in part
- All univariate: consider univariate distributions only
- Numerous other distributions have been implemented in R:
  - CRAN packages solely devoted to one or more distributions
  - CRAN packages which implement distributions incidentally (e.g. VGAM)
  - implementations of distributions not on CRAN
- See the task view http://cran.r-project.org/web/views/distributions.html

Introduction
- There are overlaps in coverage of distributions in R
- Implementations of distributions in R are inconsistent in:
  - naming of objects
  - parameterizations
  - function arguments
  - functionality
  - return structures
- It is useful to discuss some standardization of implementation of software for distributions

2 Distributions in Base R

Distributions in Base R
- Implementation in R is essentially the provision of dpqr functions: the density (or probability) function, distribution function, quantile or inverse distribution function, and random number generation
- The distributions are the binomial (binom), geometric (geom), hypergeometric (hyper), negative binomial (nbinom), Poisson (pois), Wilcoxon signed rank statistic (signrank), Wilcoxon rank sum statistic (wilcox), beta (beta), Cauchy (cauchy), non-central chi-squared (chisq), exponential (exp), F (f), gamma (gamma), log-normal (lnorm), logistic (logis), normal (norm), t (t), uniform (unif), Weibull (weibull), and Tukey studentized range (tukey), for which only the p and q functions are implemented
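As a minimal illustration of the dpqr convention, the sketch below calls the four functions for the normal distribution and the two functions available for the Tukey studentized range. The variable names and the particular argument values are chosen only for illustration.

```r
# The four function types for the normal distribution (root name "norm").
x <- 1.96
dens  <- dnorm(x)       # density at x
prob  <- pnorm(x)       # distribution function, P(X <= x)
quant <- qnorm(prob)    # quantile function: inverts pnorm and recovers x
set.seed(1)
samp  <- rnorm(5)       # five random variates

# Tukey studentized range: only the p and q functions exist in base R.
p_tuk <- ptukey(3.5, nmeans = 4, df = 20)
q_tuk <- qtukey(p_tuk, nmeans = 4, df = 20)  # approximately 3.5 again

print(c(density = dens, probability = prob, quantile = quant))
print(c(p_tukey = p_tuk, q_tukey = q_tuk))
```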

dpqr Functions
- Any experienced R user will be aware of the naming conventions for the density, cumulative distribution, quantile and random number generation functions for the base R distributions
- The argument lists for the dpqr functions are standard
- The first argument is x, q, p and n for the d, p, q and r functions respectively: a vector of quantiles, a vector of quantiles, a vector of probabilities, and the sample size
- rwilcox is an exception, using nn because n is a parameter
- Subsequent arguments give the parameters
- The gamma distribution is unusual, with argument list shape, rate = 1, scale = 1/rate
- This mechanism allows the user to specify the second parameter as either the scale or the rate
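The argument conventions can be checked directly from the function definitions. The following sketch inspects the first formal argument of each gamma function and shows that rate and scale are two routes to the same density value; the numeric values are arbitrary.

```r
# First formal argument of the d, p, q and r functions for the gamma distribution.
first_args <- sapply(list(dgamma, pgamma, qgamma, rgamma),
                     function(f) names(formals(f))[1])
print(first_args)

# Rate and scale are two ways of giving the same second parameter.
d_rate  <- dgamma(2, shape = 3, rate = 2)
d_scale <- dgamma(2, shape = 3, scale = 1 / 2)
print(all.equal(d_rate, d_scale))

# rwilcox uses nn for the number of variates because n is a parameter.
set.seed(1)
w <- rwilcox(nn = 5, m = 4, n = 6)
print(w)
```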

dpqr Functions
- Other arguments differ among the dpqr functions
- The d functions take the argument log, the p and q functions the argument log.p
- These extend the range over which these quantities can be computed accurately
- The p and q functions have the argument lower.tail
- The dpqr functions are coded in C and may be found in the R source tree at src/nmath/
- They are in large part due to Ross Ihaka and Catherine Loader
- Martin Mächler is now responsible for on-going maintenance
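A short sketch of why the log, log.p and lower.tail arguments matter: far in the tails the plain values underflow or cancel in double precision, while the log-scale and upper-tail versions remain usable. The evaluation points 40 and 10 are arbitrary choices for illustration.

```r
# Far in the tail the density underflows to zero, but its logarithm does not.
d_plain <- dnorm(40)               # underflows in double precision
d_log   <- dnorm(40, log = TRUE)   # computed directly on the log scale

# The same holds for tail probabilities with log.p.
p_plain <- pnorm(-40)
p_log   <- pnorm(-40, log.p = TRUE)

# lower.tail = FALSE avoids the cancellation in 1 - pnorm(x).
upper_naive  <- 1 - pnorm(10)
upper_direct <- pnorm(10, lower.tail = FALSE)

print(c(d_plain = d_plain, d_log = d_log))
print(c(p_plain = p_plain, p_log = p_log))
print(c(upper_naive = upper_naive, upper_direct = upper_direct))
```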

Testing and Validation
- The algorithms used in the dpqr functions are well-established algorithms taken from a substantial scientific literature
- There are also tests, found in the directory tests in two files, d-p-q-r-tests.R and p-r-random-tests.R
- Tests in d-p-q-r-tests.R are inversion tests which check that qdist(pdist(x)) = x for values x generated by rdist
- There are tests relying on special distribution relationships, and tests using extreme values of parameters or arguments
- For discrete distributions, the equality cumsum(ddist(.)) = pdist(.) is checked
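The two kinds of check described above can be sketched in a few lines. This is not the code from the R test files, only an illustration of the idea with the normal and Poisson distributions; the seed, parameters and tolerance are assumptions.

```r
set.seed(2009)

# Inversion test: qdist(pdist(x)) should recover x for values drawn by rdist.
x <- rnorm(1000, mean = 2, sd = 3)
x_back <- qnorm(pnorm(x, mean = 2, sd = 3), mean = 2, sd = 3)
inv_ok <- isTRUE(all.equal(x_back, x, tolerance = 1e-8))

# Discrete check: cumulative sums of the probability function
# should agree with the distribution function.
k <- 0:30
sum_ok <- isTRUE(all.equal(cumsum(dpois(k, lambda = 4)), ppois(k, lambda = 4)))

print(c(inversion = inv_ok, cumsum = sum_ok))
```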

Testing and Validation
- Tests in p-r-random-tests.R are based on an inequality of Massart:

  $$\Pr\left( \sup_x \left| \hat{F}_n(x) - F(x) \right| > \lambda \right) \le 2 \exp(-2 n \lambda^2)$$

  where $\hat{F}_n$ is the empirical distribution function for a sample of size $n$
- This is a version of the Dvoretzky-Kiefer-Wolfowitz inequality with the best possible constant, namely the leading 2 on the right-hand side of the inequality
- The inequality above is true for all distribution functions, for all $n$ and $\lambda$
- Distributions are tested by generating a sample of size 10,000 and examining the difference between the empirical distribution function and the distribution function
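A rough sketch of how the Massart bound can be turned into a test of a random number generator follows. It is not the code in the R test suite: the failure probability alpha, the seed and the variable names are assumptions. Setting the bound 2 exp(-2 n lambda^2) equal to alpha gives the threshold lambda = sqrt(log(2 / alpha) / (2 n)).

```r
set.seed(1)
n <- 10000
x <- sort(rnorm(n))
Fx <- pnorm(x)                      # theoretical distribution function at the data

# Largest distance between the empirical and theoretical distribution functions.
# The empirical function jumps from (i - 1)/n to i/n at the i-th order statistic,
# so both sides of each jump are examined.
D <- max(pmax(abs(Fx - (1:n) / n), abs(Fx - (0:(n - 1)) / n)))

# Threshold that the bound says is exceeded with probability at most alpha.
alpha  <- 0.001                     # assumed failure probability
lambda <- sqrt(log(2 / alpha) / (2 * n))

print(c(D = D, lambda = lambda))
print(D <= lambda)                  # the generator passes if TRUE
```

Because the bound holds for every n and every distribution function, the same threshold can be reused for any continuous distribution by swapping rnorm and pnorm.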

3 Design for Distribution Implementation

What Should be Provided?
Besides the obvious dpqr functions, what else is needed?
- moments, at least low order ones
- the mode for unimodal distributions
- moment generating function and characteristic function
- functions for changing parameterisations
- functions for fitting of distributions and fit diagnostics
- goodness-of-fit tests
- methods associated with fit results: print, plot, summary, print.summary and coef
- for maximum likelihood fits, methods such as logLik and profile

Fitting Diagnostics
- To assess the fit of a distribution, diagnostic plots should be provided
- Some useful plots are:
  - a histogram or empirical density with fitted density
  - a log-histogram with fitted log-density
  - a QQ-plot with optional fitted line
  - a PP-plot with optional fitted line
- For maximum likelihood estimation, contour plots and perspective plots for pairs of parameters, and likelihood profile plots
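The four basic plots listed above can be sketched with base graphics alone. The example below uses simulated normal data and the closed-form normal maximum likelihood estimates, purely as an assumption to keep the sketch self-contained; the names obs, mu and sigma are illustrative.

```r
set.seed(7)
obs <- rnorm(200, mean = 10, sd = 2)
n <- length(obs)
mu <- mean(obs)
sigma <- sqrt(mean((obs - mu)^2))       # normal maximum likelihood estimates

op <- par(mfrow = c(2, 2))

# Histogram with fitted density.
hist(obs, freq = FALSE, main = "Histogram", xlab = "obs")
grid <- seq(min(obs), max(obs), length.out = 200)
lines(grid, dnorm(grid, mu, sigma))

# Log-histogram with fitted log-density (empty bins are dropped).
h <- hist(obs, plot = FALSE)
keep <- h$density > 0
plot(h$mids[keep], log(h$density[keep]), main = "Log-histogram",
     xlab = "obs", ylab = "log density")
lines(grid, dnorm(grid, mu, sigma, log = TRUE))

# QQ-plot: fitted quantiles against ordered data, with the line y = x.
theo_q <- qnorm(ppoints(n), mu, sigma)
plot(theo_q, sort(obs), main = "QQ-plot",
     xlab = "fitted quantiles", ylab = "sample quantiles")
abline(0, 1)

# PP-plot: fitted probabilities against empirical probabilities.
theo_p <- pnorm(sort(obs), mu, sigma)
emp_p <- ppoints(n)
plot(theo_p, emp_p, main = "PP-plot",
     xlab = "fitted probabilities", ylab = "empirical probabilities")
abline(0, 1)

par(op)
```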

Fitting
- Some generic fitting routines are currently available
- mle from stats4 can be used to fit distributions, but the negative log-likelihood and starting values must be supplied
- fitdistr from MASS will automatically fit most of the distributions from base R
- Other distributions can be fitted by maximum likelihood by supplying the density and starting values
- In designing fitting functions, the structure of the object returned and the methods available are vital aspects
- mle returns an S4 object of class mle
- fitdistr produces an S3 object of class fitdistr
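The two routines can be compared on the same simulated gamma sample. This is a sketch under assumptions: the data are simulated, the parameters are optimised on the log scale in the mle call (a choice made here to keep the optimiser away from invalid negative values), and the MASS package is assumed to be installed.

```r
library(stats4)

set.seed(123)
x <- rgamma(500, shape = 3, rate = 2)

# mle: the user supplies the negative log-likelihood and starting values.
nll <- function(lshape, lrate) {
  -sum(dgamma(x, shape = exp(lshape), rate = exp(lrate), log = TRUE))
}
fit_mle <- mle(nll, start = list(lshape = 0, lrate = 0))

# fitdistr: the distribution is named and starting values are found automatically.
fit_fd <- MASS::fitdistr(x, "gamma")

print(exp(coef(fit_mle)))   # back-transformed shape and rate
print(coef(fit_fd))

# The two fits return different kinds of object.
print(is(fit_mle, "mle"))   # S4 class
print(class(fit_fd))        # S3 class
```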

Fitting
The methods available for an object of class mle are:

Method     Action
confint    Confidence intervals from likelihood profiles
logLik     Extract maximized log-likelihood
profile    Likelihood profile generation
show       Display object briefly
summary    Generate object summary
update     Update fit
vcov       Extract variance-covariance matrix

- For fitdistr the methods are print, coef, and logLik
- Neither function returns the data, so a plot method which produces suitable diagnostic plots is not possible
- Ideally a fit should return an object of class distfit, say, and the mle class should extend that
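Several of the methods in the table can be exercised on a small one-parameter fit. The sketch below fits an exponential rate on the log scale (an assumption made here for numerical convenience) and then inspects the slots of the result to show that no data slot is present.

```r
library(stats4)

set.seed(11)
x <- rexp(300, rate = 2)
nll <- function(lrate) -sum(dexp(x, rate = exp(lrate), log = TRUE))
fit <- mle(nll, start = list(lrate = 0))

ll <- logLik(fit)      # maximized log-likelihood
v  <- vcov(fit)        # variance-covariance matrix of the estimate
ci <- confint(fit)     # interval obtained from the likelihood profile
print(summary(fit))
print(ci)

# The object stores the call and the likelihood function, but has no data slot.
print(slotNames(fit))
```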

Some Principles
Some principles are:
- the major guide to the design should be what exists in base R
- the design should be logical, with as few special cases as possible
- the design should minimize the possibility of programming mistakes by users and developers
- the design should simplify as much as possible the provision of the range of facilities needed to implement a distribution
A naming scheme for functions is an important part of a standard

A Possible Naming Scheme

Function Name     Description
ddist             Density function or probability function
pdist             Cumulative distribution function
qdist             Quantile function
rdist             Random number generator
distmean          Theoretical mean
distvar           Theoretical variance
distskew          Theoretical skewness
distkurt          Theoretical kurtosis
distmode          Mode of the density
distmoments       Theoretical moments (and mode)
distmgf           Moment generating function
distchfn          Characteristic function
distchangepars    Change parameterization of the distribution
distfit           Result of fitting the distribution to data
disttest          Test the distribution
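To make the scheme concrete, the sketch below writes a few of these functions for the gamma distribution, substituting gamma for dist. None of these functions exist in base R; the names are hypothetical instances of the proposed scheme, and the kurtosis returned is the excess kurtosis (an assumption, since the scheme does not say which convention is intended).

```r
# Hypothetical helpers following the dist<quantity> naming scheme for the gamma
# distribution, using the same shape/rate/scale argument list as dgamma.
gammamean <- function(shape, rate = 1, scale = 1 / rate) shape * scale
gammavar  <- function(shape, rate = 1, scale = 1 / rate) shape * scale^2
gammaskew <- function(shape, rate = 1, scale = 1 / rate) 2 / sqrt(shape)
gammakurt <- function(shape, rate = 1, scale = 1 / rate) 6 / shape  # excess kurtosis
# For shape < 1 the density is unbounded at 0, so 0 is returned there.
gammamode <- function(shape, rate = 1, scale = 1 / rate) pmax(shape - 1, 0) * scale
# The moment generating function is finite only for t < 1/scale.
gammamgf <- function(t, shape, rate = 1, scale = 1 / rate) {
  ifelse(t < 1 / scale, (1 - scale * t)^(-shape), Inf)
}

# Check the mean against numerical integration of x * density.
num_mean <- integrate(function(x) x * dgamma(x, shape = 3, rate = 2), 0, Inf)$value
print(c(formula = gammamean(3, rate = 2), numerical = num_mean))
print(gammamode(3, rate = 2))
```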

Some Alternatives
- Use dots: hyperb.mean, usually deprecated because of confusion with S3 methods
- Use initial letters: mhyperb, but what about the mgf, or multivariate distributions?

Function Arguments
- Specification of parameters could allow for the parameters to be specified as a vector
- This is helpful for maximum likelihood estimation
- The approach used for the gamma distribution can be used to allow for both single parameter specification and vector parameter specification
- Separate parameters should be in the order of location, scale, skewness, kurtosis
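One way to read this proposal is to give the density a vector argument whose default is built from the individual parameters, in the same way that scale defaults to 1/rate for the gamma distribution. The wrapper dnormvec and its param argument below are hypothetical, written around dnorm only to show the mechanism.

```r
# Hypothetical density wrapper: parameters may be given separately
# (location first, then scale) or together as the vector param.
dnormvec <- function(x, mean = 0, sd = 1, param = c(mean, sd), log = FALSE) {
  dnorm(x, mean = param[1], sd = param[2], log = log)
}

# Both calling styles give the same value.
print(dnormvec(1, mean = 2, sd = 3))
print(dnormvec(1, param = c(2, 3)))

# The vector form plugs straight into a general-purpose optimiser.
set.seed(3)
obs <- rnorm(500, mean = 5, sd = 2)
negloglik <- function(param) -sum(dnormvec(obs, param = param, log = TRUE))
opt <- optim(c(median(obs), mad(obs)), negloglik)
print(opt$par)
```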

Classes
- I don't have a view on whether S3 or S4 classes should be used, but probably S4 classes should be aimed for
- For a fit from a distribution, the class could be called distfit
- A fit for a particular distribution would add the distribution name: with S3 methods the class would be hyperbfit, distfit
- For S4 methods the class distfit would be extended
- Similar ideas would be used for tests: with S3 methods a Kolmogorov-Smirnov test would have class kstest, htest
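The S3 variant of this idea can be sketched as follows. The constructor fit_exp, the class expfit and the list components are all hypothetical; the exponential distribution is used only because its maximum likelihood estimate has a closed form. The fit keeps the data, so a plot method becomes possible.

```r
# Hypothetical constructor: fits an exponential distribution and returns an
# object whose class vector is distribution-specific first, generic second.
fit_exp <- function(x) {
  rate <- 1 / mean(x)                      # closed-form maximum likelihood estimate
  structure(list(distribution = "exp",
                 estimate = c(rate = rate),
                 loglik = sum(dexp(x, rate = rate, log = TRUE)),
                 data = x),
            class = c("expfit", "distfit"))
}

# Methods written once for the generic class distfit.
coef.distfit <- function(object, ...) object$estimate
print.distfit <- function(x, ...) {
  cat("Fit of distribution:", x$distribution, "\n")
  print(coef(x))
  invisible(x)
}

# A distribution-specific plot method, possible because the data are stored.
plot.expfit <- function(x, ...) {
  dat <- x$data
  hist(dat, freq = FALSE, main = "Fitted exponential", xlab = "data")
  grid <- seq(0, max(dat), length.out = 200)
  lines(grid, dexp(grid, rate = coef(x)[["rate"]]))
  invisible(x)
}

set.seed(5)
fit <- fit_exp(rexp(100, rate = 3))
print(fit)
plot(fit)
```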

Testing
- Testing is vital to the quality of software, and developers should provide test data and code
- Firstly, test parameter sets should be provided, which cover the range of the parameter set
- Unit tests should be provided for all functions: the RUnit package supports this
- The distributions in Rmetrics have tests of this type, although further development seems warranted
- The Massart inequality seems ideal to use in testing
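A sketch of what a test over parameter sets might look like, using the gamma distribution and base R only. The parameter grid and the particular consistency check (the density integrated between the quartiles given by qgamma should equal 0.5) are assumptions chosen for illustration; in practice such a check could be wrapped in an RUnit test function.

```r
# Hypothetical test parameter sets covering small, moderate and large values.
params <- expand.grid(shape = c(0.5, 2, 50), rate = c(0.01, 1, 100))

# Consistency of the d and q functions: the density integrated between
# the lower and upper quartiles should be one half.
check_one <- function(shape, rate) {
  q <- qgamma(c(0.25, 0.75), shape = shape, rate = rate)
  area <- integrate(dgamma, q[1], q[2], shape = shape, rate = rate,
                    rel.tol = 1e-8)$value
  isTRUE(all.equal(area, 0.5, tolerance = 1e-6))
}

results <- mapply(check_one, params$shape, params$rate)
print(cbind(params, passed = results))
```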

Final Thoughts
- The distr package is an object-oriented implementation of distributions
- It facilitates operations on distributions such as convolutions
- It uses sampling for calculation of moments and distribution functions
- The package VarianceGamma has been designed and implemented using these ideas
- Implementing the logarithm options of the dpq functions is quite difficult for the distributions in which we are interested