[R] kdensity – Univariate kernel density estimation                                        stata.com

Syntax      Menu      Description      Options      Remarks and examples
Stored results      Methods and formulas      Acknowledgments      References      Also see

Syntax

    kdensity varname [if] [in] [weight] [, options]

    options                        Description
    ---------------------------------------------------------------------------
    Main
      kernel(kernel)               specify kernel function; default is
                                     kernel(epanechnikov)
      bwidth(#)                    half-width of kernel
      generate(newvar_x newvar_d)  store the estimation points in newvar_x and
                                     the density estimate in newvar_d
      n(#)                         estimate density using # points; default is
                                     min(N, 50)
      at(var_x)                    estimate density using the values specified
                                     by var_x
      nograph                      suppress graph

    Kernel plot
      cline_options                affect rendition of the plotted kernel
                                     density estimate

    Density plots
      normal                       add normal density to the graph
      normopts(cline_options)      affect rendition of normal density
      student(#)                   add Student's t density with # degrees of
                                     freedom to the graph
      stopts(cline_options)        affect rendition of the Student's t density

    Add plots
      addplot(plot)                add other plots to the generated graph

    Y axis, X axis, Titles, Legend, Overall
      twoway_options               any options other than by() documented in
                                     [G-3] twoway_options
    ---------------------------------------------------------------------------

    kernel                         Description
    ---------------------------------------------------------------------------
    epanechnikov                   Epanechnikov kernel function; the default
    epan2                          alternative Epanechnikov kernel function
    biweight                       biweight kernel function
    cosine                         cosine trace kernel function
    gaussian                       Gaussian kernel function
    parzen                         Parzen kernel function
    rectangle                      rectangle kernel function
    triangle                       triangle kernel function
    ---------------------------------------------------------------------------

    fweights, aweights, and iweights are allowed; see [U] 11.1.6 weight.

Menu

    Statistics > Nonparametric analysis > Kernel density estimation

Description

    kdensity produces kernel density estimates and graphs the result.

Options

Main

    kernel(kernel) specifies the kernel function for use in calculating the kernel density estimate. The default kernel is the Epanechnikov kernel (epanechnikov).

    bwidth(#) specifies the half-width of the kernel, the width of the density window around each point. If bwidth() is not specified, the optimal width is calculated and used. The optimal width is the width that would minimize the mean integrated squared error if the data were Gaussian and a Gaussian kernel were used, so it is not optimal in any global sense. In fact, for multimodal and highly skewed densities, this width is usually too wide and oversmooths the density (Silverman 1992).

    generate(newvar_x newvar_d) stores the results of the estimation. newvar_x will contain the points at which the density is estimated. newvar_d will contain the density estimate.

    n(#) specifies the number of points at which the density estimate is to be evaluated. The default is min(N, 50), where N is the number of observations in memory.

    at(var_x) specifies a variable that contains the values at which the density should be estimated. This option allows you to more easily obtain density estimates for different variables or different subsamples of a variable and then overlay the estimated densities for comparison.

    nograph suppresses the graph. This option is often used with the generate() option.

Kernel plot

    cline_options affect the rendition of the plotted kernel density estimate. See [G-3] cline_options.

Density plots

    normal requests that a normal density be overlaid on the density estimate for comparison.

    normopts(cline_options) specifies details about the rendition of the normal curve, such as the color and style of line used. See [G-3] cline_options.

    student(#) specifies that a Student's t density with # degrees of freedom be overlaid on the density estimate for comparison.

    stopts(cline_options) affects the rendition of the Student's t density. See [G-3] cline_options.

Add plots

    addplot(plot) provides a way to add other plots to the generated graph. See [G-3] addplot_option.

Y axis, X axis, Titles, Legend, Overall

    twoway_options are any of the options documented in [G-3] twoway_options, excluding by(). These include options for titling the graph (see [G-3] title_options) and for saving the graph to disk (see [G-3] saving_option).

Remarks and examples

Kernel density estimators approximate the density f(x) from observations on x. Histograms do this, too, and the histogram itself is a kind of kernel density estimate. The data are divided into nonoverlapping intervals, and counts are made of the number of data points within each interval. Histograms are bar graphs that depict these frequency counts; the bar is centered at the midpoint of each interval, and its height reflects the average number of data points in the interval.

In more general kernel density estimates, the range is still divided into intervals, and estimates of the density at the center of intervals are produced. One difference is that the intervals are allowed to overlap. We can think of sliding the interval, called a window, along the range of the data and collecting the center-point density estimates. The second difference is that, rather than merely counting the number of observations in a window, a kernel density estimator assigns a weight between 0 and 1, based on the distance from the center of the window, and sums the weighted values. The function that determines these weights is called the kernel.

Kernel density estimates have the advantages of being smooth and of being independent of the choice of origin (corresponding to the location of the bins in a histogram). See Salgado-Ugarte, Shimizu, and Taniuchi (1993) and Fox (1990) for discussions of kernel density estimators that stress their use as exploratory data-analysis tools.

Cox (2007) gives a lucid introductory tutorial on kernel density estimation with several Stata-produced examples. He provides tips and tricks for working with skewed or bounded distributions and applying the same techniques to estimate the intensity function of a point process.

Example 1: Histogram and kernel density estimate

Goeden (1978) reports data consisting of 316 length observations of coral trout. We wish to investigate the underlying density of the lengths. To begin on familiar ground, we might draw a histogram. In [R] histogram, we suggest setting the bins to min(sqrt(n), 10 log10(n)), which for n = 316 is roughly 18:

. use http://www.stata-press.com/data/r13/trocolen
. histogram length, bin(18)
(bin=18, start=226, width=19.777778)

    (figure omitted: histogram of length, 18 bins)

The kernel density estimate, on the other hand, is smooth.

. kdensity length

    (figure omitted: kernel density estimate of length; kernel = epanechnikov, bandwidth = 20.1510)

Kernel density estimators are, however, sensitive to an assumption, just as are histograms. In histograms, we specify a number of bins. For kernel density estimators, we specify a width. In the graph above, we used the default width. kdensity is smarter than twoway histogram in that its default width is not a fixed constant. Even so, the default width is not necessarily best. kdensity stores the width in the returned scalar bwidth, so typing display r(bwidth) reveals it. Doing this, we discover that the width is approximately 20.

Widths are similar to the inverse of the number of bins in a histogram in that smaller widths provide more detail. The units of the width are the units of x, the variable being analyzed. The width is specified as a half-width, meaning that the kernel density estimator with half-width 20 corresponds to sliding a window of size 40 across the data.
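For instance (a small sketch, not part of the original output; the first command simply reruns the estimate without a graph so that the stored scalar is available):

. kdensity length, nograph
. display r(bwidth)

The value displayed matches the bandwidth reported in the figure note above, about 20.15, and could be passed back through bwidth() if, say, half the default width were wanted.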

We can specify half-widths for ourselves by using the bwidth() option. Smaller widths do not smooth the density as much:

. kdensity length, bwidth(10)

    (figure omitted: kernel density estimate of length; kernel = epanechnikov, bandwidth = 10.0000)

. kdensity length, bwidth(15)

    (figure omitted: kernel density estimate of length; kernel = epanechnikov, bandwidth = 15.0000)

Example 2: Different kernels can produce different results

When widths are held constant, different kernels can produce surprisingly different results. This is really an attribute of the kernel and width combination; for a given width, some kernels are more sensitive than others at identifying peaks in the density estimate.

We can see this when using a dataset with lots of peaks. In the automobile dataset, we characterize the density of weight, the weight of the vehicles. Below we compare the Epanechnikov and Parzen kernels.

. use http://www.stata-press.com/data/r13/auto
(1978 Automobile Data)
. kdensity weight, kernel(epanechnikov) nograph generate(x epan)
. kdensity weight, kernel(parzen) nograph generate(x2 parzen)
. label var epan "Epanechnikov density estimate"
. label var parzen "Parzen density estimate"
. line epan parzen x, sort ytitle() legend(cols(1))

    (figure omitted: Epanechnikov and Parzen density estimates plotted against Weight (lbs.))

We did not specify a width, so we obtained the default width. That width is not a function of the selected kernel, but of the data. See Methods and formulas for the calculation of the optimal width.

Example 3: With overlaid normal density

In examining the density estimates, we may wish to overlay a normal density or a Student's t density for comparison. Using automobile weights, we can get an idea of the distance from normality by using the normal option.

. kdensity weight, kernel(epanechnikov) normal

    (figure omitted: kernel density estimate of weight with overlaid normal density; kernel = epanechnikov, bandwidth = 295.7504)
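The student(#) option works in the same way. As an illustration (a sketch, not one of the original examples; the 5 degrees of freedom and the dashed line pattern are arbitrary choices), a Student's t density could be overlaid instead, with its rendition controlled through stopts():

. kdensity weight, kernel(epanechnikov) student(5) stopts(lpattern(dash))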

Example 4: Compare two densities

We also may want to compare two or more densities. In this example, we will compare the density estimates of the weights for the foreign and domestic cars.

. use http://www.stata-press.com/data/r13/auto, clear
(1978 Automobile Data)
. kdensity weight, nograph generate(x fx)
. kdensity weight if foreign==0, nograph generate(fx0) at(x)
. kdensity weight if foreign==1, nograph generate(fx1) at(x)
. label var fx0 "Domestic cars"
. label var fx1 "Foreign cars"
. line fx0 fx1 x, sort ytitle()

    (figure omitted: density estimates of weight for domestic and foreign cars, plotted against Weight (lbs.))

Technical note

Although all the examples we included had densities of less than 1, the density may exceed 1. The probability density f(x) of a continuous variable, x, has the units and dimensions of the reciprocal of x. If x is measured in meters, f(x) has units 1/meter. Thus the density is not measured on a probability scale, so it is possible for f(x) to exceed 1.

To see this, think of a uniform density on the interval 0 to 1. The area under the density curve is 1: this is the product of the density, which is constant at 1, and the range, which is 1. If the variable is then transformed by doubling, the area under the curve remains 1 and is the product of the density, constant at 0.5, and the range, which is 2. Conversely, if the variable is transformed by halving, the area under the curve also remains at 1 and is the product of the density, constant at 2, and the range, which is 0.5. (Strictly, the range is measured in certain units, and the density is measured in the reciprocal of those units, so the units cancel on multiplication.)
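A quick way to see an estimated density above 1 (a simulated sketch, not part of the original entry; the sample size and seed are arbitrary, and the clear discards the auto data) is to draw values uniformly on (0, 0.5), where the true density is 2:

. clear
. set obs 1000
. set seed 12345
. generate u = 0.5*runiform()
. kdensity u

Away from the boundaries, the estimated density should be close to 2, well above 1.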

Stored results

kdensity stores the following in r():

Scalars
    r(bwidth)    kernel bandwidth
    r(n)         number of points at which the estimate was evaluated
    r(scale)     density bin width

Macros
    r(kernel)    name of kernel

Methods and formulas

A kernel density estimate is formed by summing the weighted values calculated with the kernel function K, as in

    f_K(x) = \frac{1}{qh} \sum_{i=1}^{n} w_i K\!\left( \frac{x - X_i}{h} \right)

where q = \sum_i w_i if weights are frequency weights (fweights) or analytic weights (aweights), and q = 1 if weights are importance weights (iweights). Analytic weights are rescaled so that \sum_i w_i = n (see [U] 11 Language syntax). If weights are not used, then w_i = 1, for i = 1, ..., n.

kdensity includes eight different kernel functions. The Epanechnikov is the default function if no other kernel is specified and is the most efficient in minimizing the mean integrated squared error.

    Kernel         Formula
    ---------------------------------------------------------------------------
    Biweight       K[z] = (15/16)(1 - z^2)^2          if |z| < 1; 0 otherwise
    Cosine         K[z] = 1 + cos(2 pi z)             if |z| < 1/2; 0 otherwise
    Epanechnikov   K[z] = (3/4)(1 - z^2/5)/sqrt(5)    if |z| < sqrt(5); 0 otherwise
    Epan2          K[z] = (3/4)(1 - z^2)              if |z| < 1; 0 otherwise
    Gaussian       K[z] = (1/sqrt(2 pi)) exp(-z^2/2)
    Parzen         K[z] = 4/3 - 8z^2 + 8|z|^3         if |z| <= 1/2
                   K[z] = 8(1 - |z|)^3/3              if 1/2 < |z| <= 1; 0 otherwise
    Rectangular    K[z] = 1/2                         if |z| < 1; 0 otherwise
    Triangular     K[z] = 1 - |z|                     if |z| < 1; 0 otherwise
    ---------------------------------------------------------------------------

From the definitions given in the table, we can see that the choice of h will drive how many values are included in estimating the density at each point. This value is called the window width or bandwidth. If the window width is not specified, it is determined as

    m = \min\left( \sqrt{\text{variance}_x},\ \frac{\text{interquartile range}_x}{1.349} \right)

    h = \frac{0.9\, m}{n^{1/5}}

where x is the variable for which we wish to estimate the kernel density and n is the number of observations.

Most researchers agree that the choice of kernel is not as important as the choice of bandwidth. There is a great deal of literature on choosing bandwidths under various conditions; see, for example, Parzen (1962) or Tapia and Thompson (1978). Also see Newton (1988) for a comparison with sample spectral density estimation in time-series applications.
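As a quick check of this formula (a sketch, not one of the original examples; it reuses the auto data from example 3 and stores the by-hand result in a scalar named h), the default width can be computed manually and compared with what kdensity stores:

. use http://www.stata-press.com/data/r13/auto, clear
. quietly summarize weight, detail
. scalar h = 0.9*min(r(sd), (r(p75)-r(p25))/1.349)/r(N)^(1/5)
. kdensity weight, nograph
. display h
. display r(bwidth)

Both displayed values should be approximately 295.75, the bandwidth shown in example 3.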

Acknowledgments

We gratefully acknowledge the previous work by Isaías H. Salgado-Ugarte of Universidad Nacional Autónoma de México, and Makoto Shimizu and Toru Taniuchi of the University of Tokyo; see Salgado-Ugarte, Shimizu, and Taniuchi (1993). Their article provides a good overview of the subject of univariate kernel density estimation and presents arguments for its use in exploratory data analysis.

References

Cox, N. J. 2005. Speaking Stata: Density probability plots. Stata Journal 5: 259–273.

Cox, N. J. 2007. Kernel estimation as a basic tool for geomorphological data analysis. Earth Surface Processes and Landforms 32: 1902–1912.

Fiorio, C. V. 2004. Confidence intervals for kernel density estimation. Stata Journal 4: 168–179.

Fox, J. 1990. Describing univariate distributions. In Modern Methods of Data Analysis, ed. J. Fox and J. S. Long, 58–125. Newbury Park, CA: Sage.

Goeden, G. B. 1978. A monograph of the coral trout, Plectropomus leopardus (Lacépède). Queensland Fisheries Services Research Bulletin 1: 1–42.

Kohler, U., and F. Kreuter. 2012. Data Analysis Using Stata. 3rd ed. College Station, TX: Stata Press.

Newton, H. J. 1988. TIMESLAB: A Time Series Analysis Laboratory. Belmont, CA: Wadsworth.

Parzen, E. 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics 33: 1065–1076.

Royston, P., and N. J. Cox. 2005. A multivariable scatterplot smoother. Stata Journal 5: 405–412.

Salgado-Ugarte, I. H., and M. A. Pérez-Hernández. 2003. Exploring the use of variable bandwidth kernel density estimators. Stata Journal 3: 133–147.

Salgado-Ugarte, I. H., M. Shimizu, and T. Taniuchi. 1993. snp6: Exploring the shape of univariate data using kernel density estimators. Stata Technical Bulletin 16: 8–19. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 155–173. College Station, TX: Stata Press.

Salgado-Ugarte, I. H., M. Shimizu, and T. Taniuchi. 1995a. snp6.1: ASH, WARPing, and kernel density estimation for univariate data. Stata Technical Bulletin 26: 23–31. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 161–172. College Station, TX: Stata Press.

Salgado-Ugarte, I. H., M. Shimizu, and T. Taniuchi. 1995b. snp6.2: Practical rules for bandwidth selection in univariate density estimation. Stata Technical Bulletin 27: 5–19. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 172–190. College Station, TX: Stata Press.

Salgado-Ugarte, I. H., M. Shimizu, and T. Taniuchi. 1997. snp13: Nonparametric assessment of multimodality for univariate data. Stata Technical Bulletin 38: 27–35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 232–243. College Station, TX: Stata Press.

Scott, D. W. 1992. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley.

Silverman, B. W. 1992. Density Estimation for Statistics and Data Analysis. London: Chapman & Hall.

Simonoff, J. S. 1996. Smoothing Methods in Statistics. New York: Springer.

Steichen, T. J. 1998. gr33: Violin plots. Stata Technical Bulletin 46: 13–18. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 57–65. College Station, TX: Stata Press.

Tapia, R. A., and J. R. Thompson. 1978. Nonparametric Probability Density Estimation. Baltimore: Johns Hopkins University Press.

Van Kerm, P. 2003. Adaptive kernel density estimation. Stata Journal 3: 148–156.

Van Kerm, P. 2012. Kernel-smoothed cumulative distribution function estimation with akdensity. Stata Journal 12: 543–548.

Wand, M. P., and M. C. Jones. 1995. Kernel Smoothing. London: Chapman & Hall.

Also see

[R] histogram – Histograms for continuous and categorical variables