Week 1 Exploratory Data Analysis
Practicalities
  This course, ST903, has students from both the MSc in Financial Mathematics and the MSc in Statistics.
  Two lectures and one seminar/tutorial per week.
  Exam (for the MSc in Financial Mathematics) in January, plus assessed coursework.
Aims and Objectives
  What's the course about?
  1. Describing financial data
  2. Modelling financial data
  3. Making inferences about financial data
Samples and Populations
  (Experimental) Unit: the object on which measurements are made
  Population: the set of all units about which information is wanted
  Sample: the set of units about which information is available
  (Simple) random sample: a sample such that units in the population have equal chance of inclusion, independent of the inclusion of any other unit
  Variable: a measurable characteristic of a unit
  Statistic: a measurable characteristic of a sample
  Parameter: a measurable characteristic of a population
Variation
  Natural variation: variation due to different units in the population having different values of the same variable
  Sampling variation: variation due to different samples containing different units and hence producing different values of the same statistic
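Sampling variation can be seen in a small simulation. The following is a minimal Python sketch, assuming a hypothetical population of 1000 numeric units (not course data); the function name `sample_means`, the sample size and the seed are illustrative choices. The parameter (population mean) is fixed, while the statistic (sample mean) changes from one simple random sample to the next.

```python
import random
import statistics


def sample_means(population, n, repeats, seed=0):
    """Draw `repeats` simple random samples of size n and return their means."""
    rng = random.Random(seed)
    # rng.sample draws without replacement, every unit equally likely
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]


# Hypothetical population: the values 1, 2, ..., 1000
population = [float(v) for v in range(1, 1001)]
parameter = statistics.mean(population)        # population mean (a parameter)
means = sample_means(population, n=25, repeats=200)

print("population mean:", parameter)
print("smallest and largest sample mean:", min(means), max(means))
```

The spread between the smallest and largest sample mean is sampling variation; the spread of the values within `population` is natural variation.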
Nature and Structure of Data: Primary and Secondary Data
  Primary data: collected specifically for the current study
    Observational, e.g. survey data
    Intervention, e.g. experimental data
  Secondary data: collected and/or compiled for another purpose
    there can be limitations or problems with quality
Nature and Structure of Data: Primary and Secondary Data
  Example: National Unemployment Data
  Suppose we want to know the UK unemployment figures in 5-year age bands to compare with similar figures from China, collected in 2005.
  Published data may be insufficient because, for example,
    data were only collected from major cities
    unemployment numbers are presented in 10-year age bands
    figures were compiled from a survey 10 years ago
Nature and Structure of Data: Form of Data - Samples
  Relationship between samples
    independent samples, e.g. unemployment figures from two countries
    dependent samples, e.g. social class of father and son
  Structure across samples
    unstructured, e.g. unemployment figures from two countries
    structured, e.g. 2 x 2 factorial experiment
Nature and Structure of Data: Form of Data - Variables
  Number of variables: univariate, bivariate or multivariate
  Scales of measurement
    continuous, e.g. age
    discrete, e.g. sex (binary), ethnic origin (unordered categorical), social class (ordered categorical)
Stages of Data Analysis
  1. Exploratory data analysis using descriptive statistics
       numerical summaries
       tabular summaries
       graphical summaries
  2. Formal analysis using statistical techniques, often based on an assumed probability model
  3. Presentation and evaluation of results
Numerical Summaries
  Numerical summaries
    help to describe and compare samples
    give information about corresponding parameters
  Qualitative data can be summarised by counts or percentages.
  Quantitative data can be summarised by measures of location, scale and shape.
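Measures of Location: Averages
  For observations $x_1, \ldots, x_n$, let $x_{(j)}$ denote the $j$th smallest observation (the $j$th order statistic).
  Sample mean:
  \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
  Sample median:
  \[ x_M = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right] & \text{if } n \text{ is even} \end{cases} \;=\; x_{\left(\frac{n}{2}+\frac{1}{2}\right)} \]
  Sample mode: the value which occurs most frequently in the sample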
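The three averages can be computed directly from their definitions. This is a minimal Python sketch using only the standard library; the function names are illustrative, and `sample_modes` returns every tied value because the mode need not be unique. The data values are hypothetical.

```python
from collections import Counter


def sample_mean(xs):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Median from the order statistics; s[j - 1] is the j-th smallest value."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]               # x_((n+1)/2)
    return 0.5 * (s[n // 2 - 1] + s[n // 2])     # average of x_(n/2) and x_(n/2+1)


def sample_modes(xs):
    """All values attaining the highest frequency, in increasing order."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


data = [2, 4, 4, 5, 10]                 # hypothetical observations
print(sample_mean(data), sample_median(data), sample_modes(data))

# Replacing the largest value by an outlier moves the mean but not the median
with_outlier = [2, 4, 4, 5, 100]
print(sample_mean(with_outlier), sample_median(with_outlier))
```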
Measures of Location: Averages - Advantages and Disadvantages
  Sample mean
    adv: conventional average; uses every value; convenient mathematically
    disadv: rarely corresponds to a sample unit; influenced by outliers
  Sample median
    the reverse of the adv/disadv of the sample mean
  Sample mode
    often not well defined; the sample mode is often a poor estimate of the population mode
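Measures of Location: Quantiles
  Sample lower quartile:
  \[ x_L = x_{\left(\frac{n}{4}+\frac{1}{2}\right)} \]
  Sample upper quartile:
  \[ x_U = x_{\left(\frac{3n}{4}+\frac{1}{2}\right)} \]
  $p$th sample percentile ($p$ in per cent, so $p = 25$ gives $x_L$):
  \[ x_{p\%} = x_{\left(\frac{pn}{100}+\frac{1}{2}\right)} \]
  Five number summary:
  \[ \left( x_{(1)},\; x_L,\; x_M,\; x_U,\; x_{(n)} \right) \]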
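The quantile formulas use order statistics with a possibly fractional index. For the median, an index ending in one half means the average of the two neighbouring order statistics. The sketch below assumes, more generally, linear interpolation between neighbouring order statistics for any fractional index; that interpolation rule and the names `order_stat` and `five_number_summary` are assumptions for illustration, and statistical software often uses slightly different conventions.

```python
import math


def order_stat(xs, idx):
    """x_(idx) for a 1-based, possibly fractional index.

    Assumption: fractional indices are resolved by linear interpolation
    between the two neighbouring order statistics.
    """
    s = sorted(xs)
    n = len(s)
    idx = min(max(idx, 1.0), float(n))    # keep the index inside 1..n
    lo = math.floor(idx)
    frac = idx - lo
    if frac == 0:
        return s[lo - 1]
    return (1 - frac) * s[lo - 1] + frac * s[lo]


def five_number_summary(xs):
    """(minimum, lower quartile, median, upper quartile, maximum)."""
    n = len(xs)
    return (
        min(xs),
        order_stat(xs, n / 4 + 0.5),
        order_stat(xs, n / 2 + 0.5),
        order_stat(xs, 3 * n / 4 + 0.5),
        max(xs),
    )


print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8]))   # hypothetical data
```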
Measures of Scale
  Sample variance:
  \[ \mathrm{Var}(x) = \frac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1} \]
  Sample standard deviation (SD):
  \[ \sqrt{\mathrm{Var}(x)} \]
  Inter-quartile range (IQR):
  \[ x_U - x_L \]
  Sample range:
  \[ x_{(n)} - x_{(1)} \]
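A minimal Python sketch of the variance, SD and range, written out from the formulas with the n - 1 divisor. The function names and data are illustrative; the standard library's `statistics.variance` uses the same divisor and is shown as a cross-check. The IQR is the difference of the two quartiles and is not repeated here.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance, in the units of the data."""
    return math.sqrt(sample_variance(xs))


def sample_range(xs):
    """Largest minus smallest observation."""
    return max(xs) - min(xs)


data = [2, 4, 4, 5, 10]                       # hypothetical observations
print(sample_variance(data), sample_sd(data), sample_range(data))
print(statistics.variance(data))              # cross-check with the standard library
```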
Measures of Scale: Advantages and Disadvantages
  Variance: similar adv/disadv to the mean
  SD: in the same units as the data, so useful for interpretation
  IQR: robust measure
  Sample range: sensitive to outliers, sampling variability and data errors
Measures of Shape
  Modality: number of peaks in the sample distribution
  Skewness: a statistic measuring symmetry such that
    0    symmetric sample distribution
    +ve  skewed to the right (long right-hand tail)
    -ve  skewed to the left (long left-hand tail)
  Kurtosis: a statistic measuring peakedness such that
    3    same peakedness as the Normal distribution (mesokurtic)
    > 3  more peaked: slim or long-tailed (leptokurtic)
    < 3  less peaked: flat, fat or short-tailed (platykurtic)
  Kurtosis is sometimes adjusted to give 0 for mesokurtic distributions.
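The slides do not give formulas for skewness and kurtosis. The sketch below assumes the common moment-based definitions, with central moments m_k = (1/n) * sum of (x_i - mean)^k, skewness m_3 / m_2^(3/2) and kurtosis m_4 / m_2^2; other definitions with small-sample corrections exist. Function names and data are illustrative, and the data must not all be equal (otherwise m_2 is zero).

```python
def central_moment(xs, k):
    """k-th central moment with divisor n (an assumed convention)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** k for x in xs) / n


def skewness(xs):
    """Moment-based skewness: 0 for a symmetric sample."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs, excess=False):
    """Moment-based kurtosis; with excess=True, 3 is subtracted so that
    a Normal-like (mesokurtic) shape scores about 0."""
    k = central_moment(xs, 4) / central_moment(xs, 2) ** 2
    return k - 3 if excess else k


symmetric = [1, 2, 3, 4, 5]        # hypothetical data
right_skewed = [1, 1, 1, 2, 10]    # hypothetical data with a long right tail
print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric), kurtosis(symmetric, excess=True))
```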
Measures of Shape: Skewness and Kurtosis
  [Figure: two panels, each plotting f(x) against x]
Measure of Linear Relation Between Two Variables: Correlation Coefficient
  For observations $x_1, \ldots, x_n$; $y_1, \ldots, y_n$ of two variables $X$ and $Y$, the correlation coefficient is
  \[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]}} \]
  a measure of linear relationship
  correlation does not imply cause
  the variables may be linked via a third variable
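A minimal Python sketch of the correlation coefficient, computed from the sums of squares and cross-products. The function name and the data are illustrative. The last example shows why r measures only linear relationship: y is an exact function of x, yet r is 0.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r of paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))     # exact increasing line
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))     # exact decreasing line
xs = [-2, -1, 0, 1, 2]
print(correlation(xs, [x * x for x in xs]))        # exact but non-linear relation
```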
Tabular Summaries
  Provide a succinct display of the data set
  Emphasise the structure of the data
  Sometimes more powerful than a graph, or may provide a record of graphed data
  Things to consider include
    data layout (dimensions, ordering, totals)
    representation of numbers (units, significant figures, percentages)
Tabular Summaries
  Example: Society of Business Economists Salary Survey

                   Per cent of responses     Median salaries (£k)*
  Age (years)      2004    2003    1999      2004     2003     1999
  30 & under         21      12      17      35.0     30.5     30.0
  31-35              12      17      15      45.7     46.8     45.0
  36-40              17      14       8      65.0     73.0     45.0
  41-45               8       9      17      63.0    100.0     65.0
  46-50              18      15      18      82.3     90.0     60.0
  51-55              10      16      17      59.0     55.5     55.0
  Over 55            14      17       8      75.0     55.0     50.0
  Men                84      80      91      60.0     60.0     52.0
  Women              16      20       9      52.5     45.5     30.0

  *Including any London/regional allowance and self-employment income
  Source: http://www.sbe.co.uk/survey/salary_survey_2004.pdf
Graphical Summaries
  Graphical summaries are useful for
    providing an overall picture of the data
    exploring relationships, e.g. comparing groups, exploring trends over time
    checking assumptions underlying methods of formal analysis
    checking for problems with the data, e.g. outliers
Graphical Summaries for Qualitative Data
  Pie charts
    area of slices proportional to frequency; misleading to compare pie charts of different area or based on different sample sizes
    limited accuracy; rounding can be misleading
    hard to read with a large number of segments
  Bar charts
    height of bars proportional to frequency; more intuitive
    bars can be segmented to show component parts
Graphical Summaries for Qualitative Data
  Example: Shares of National Income
  [Figure: shares of national income (axis 0 to 100) for 1959 and 2004, by component: Other; Taxes on production & imports; Net interest & misc. payments; Corporate profits; Rental income of persons; Proprietors' income; Supplements to wages & salaries; Wages and salary accruals]
  Source: Survey of Current Business (2006) 86(1), http://bea.gov/bea/pub/0106cont.htm
Graphical Summaries for Quantitative Data: Stem-and-Leaf Plots
  Tallies data in bins, using the values themselves for display
  E.g. Times (in hours) to first failure of air-conditioning unit on Boeing 720, different transformations

  hours
    1 | 04
    2 | 03455679
    3 | 2357
    4 | 4469
    5 | 369
    6 | 01
    7 | 569
    8 | 4
    9 | 0

  10 x sqrt(hours)
    3 | 27
    4 | 589
    5 | 00124779
    6 | 1668
    7 | 035778
    8 | 779
    9 | 25

  log10 hours
    1.0 | 0
    1.1 | 5
    1.2 |
    1.3 | 068
    1.4 | 00136
    1.5 | 1247
    1.6 | 4469
    1.7 | 25789
    1.8 | 88
    1.9 | 025
Graphical Summaries for Quantitative Data: Stem-and-Leaf Plots
  E.g. Carbon-dating fragments of a pre-historic artefact, different scales

  1000 years
    4 | 99999999
    5 | 0000000001111122

  100 years
    48 | 89
    49 | 0111337778
    50 | 11235588
    51 | 2456

  100 years, split (* = 0-4, . = 5-9)
    48. | 89
    49* | 011133
    49. | 7778
    50* | 1123
    50. | 5588
    51* | 24
    51. | 56
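A stem-and-leaf display can be built by splitting each value into a stem (all but the last retained digit) and a leaf (the last retained digit). The following is a minimal Python sketch under stated assumptions: values are non-negative, each value is rounded to the nearest multiple of a hypothetical `leaf_unit` parameter, and empty stems are kept so that gaps stay visible. The example values are the ten smallest failure times as read from the first two stems of the hours display.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the lines of a stem-and-leaf display.

    Assumes non-negative values. Each value is expressed in multiples of
    leaf_unit; the last digit is the leaf and the remaining digits the stem.
    """
    stems = defaultdict(list)
    for v in values:
        scaled = int(round(v / leaf_unit))
        stem, leaf = divmod(scaled, 10)
        stems[stem].append(leaf)
    lines = []
    for stem in range(min(stems), max(stems) + 1):   # include empty stems
        leaves = "".join(str(d) for d in sorted(stems.get(stem, [])))
        lines.append(f"{stem} | {leaves}")
    return lines


hours = [10, 14, 20, 23, 24, 25, 25, 26, 27, 29]
for line in stem_and_leaf(hours):
    print(line)
```

Changing `leaf_unit` changes the bins, which is the same choice of scale illustrated by the carbon-dating displays.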
Graphical Summaries for Quantitative Data: Box Plots
  Represent five number summaries diagrammatically.
  Most software produces truncated box plots, which exclude outliers; these are usually plotted as isolated points.
  E.g. Inflation rates over a 20-year period for five countries
  [Figure: box plots of inflation rates (axis 5 to 25) for USA, UK, Japan, Germany and France]
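The slides do not say how software decides which points are outliers. A widely used convention, assumed here, flags points more than 1.5 x IQR beyond the quartiles and draws the whiskers to the most extreme points that are not flagged. The sketch below uses that convention together with the quartile rule from the quantiles slide and linear interpolation for fractional indices (also an assumption); the function names and data are illustrative.

```python
import math


def _order_stat(s, idx):
    """x_(idx) of the sorted list s, interpolating for fractional idx (assumption)."""
    idx = min(max(idx, 1.0), float(len(s)))
    lo = math.floor(idx)
    frac = idx - lo
    return s[lo - 1] if frac == 0 else (1 - frac) * s[lo - 1] + frac * s[lo]


def truncated_boxplot_stats(xs, k=1.5):
    """Quartiles, whisker ends and outliers under the assumed k * IQR rule."""
    s = sorted(xs)
    n = len(s)
    lower_q = _order_stat(s, n / 4 + 0.5)
    median = _order_stat(s, n / 2 + 0.5)
    upper_q = _order_stat(s, 3 * n / 4 + 0.5)
    iqr = upper_q - lower_q
    low_fence, high_fence = lower_q - k * iqr, upper_q + k * iqr
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "whisker_low": inside[0],      # most extreme points inside the fences
        "whisker_high": inside[-1],
        "outliers": [x for x in s if x < low_fence or x > high_fence],
    }


print(truncated_boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 30]))   # hypothetical data
```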
Graphical Summaries for Quantitative Data: Histogram
  Equivalent of a bar chart for binned continuous data
  Area of bars proportional to frequency in each bin; equal bin widths are usually chosen so that height is proportional to frequency
  E.g. GDP per capita for 26 countries
  [Figure: histogram of frequency (axis 0 to 10) against GDP per capita ($), 0 to 30000]
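When bin widths differ, bar heights must be frequency divided by bin width so that area stays proportional to frequency. The following is a minimal Python sketch of that calculation; the function name, the bin edges and the data are hypothetical, and the convention that bins are closed on the left (with the last bin also closed on the right) is an assumption.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Counts and bar heights (count / width) for bins defined by sorted edges.

    Bins are [edges[i], edges[i+1]), except the last, which includes its
    right edge. Values outside the overall range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    heights = [c / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, heights


values = [1, 2, 2, 3, 7, 8]      # hypothetical data
edges = [0, 4, 10]               # unequal bin widths: 4 and 6
counts, heights = histogram(values, edges)
print(counts, heights)
```

With equal bin widths the division only rescales every bar by the same factor, which is why height alone can then be read as frequency.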
Graphical Summaries for Quantitative Data: Graphical Summaries of Distribution
  Stem-and-leaf
    adv: good for small data sets; shows all of the data
    disadv: choice of bins affects display
  Box plot
    adv: simple, can split by group, almost any sample size will do
    disadv: can be too simple, e.g. no good for multi-modal data
  Histogram
    adv: good for large data sets, shows all characteristics of distribution
    disadv: choice of bins affects display
Graphical Summaries for Quantitative Data: Scatterplot
  Plot of data points in 2-D or 3-D space with variables as axes
  Useful for exploring relationships between variables
  E.g. Standard & Poor (S&P) company's index of 500 common stock prices against the Consumer Price Index (CPI) for 1978-1989
  [Figure: scatterplot of SP500 Index (axis 100 to 300) against CPI (axis 70 to 120)]
Graphical Summaries for Quantitative Data: Time Series
  Plot of data against time
  Look for seasonality, unusual events, etc.
  E.g. Quarterly personal consumption expenditure (PCE) for 1977-1980 (AUS$)
  [Figure: time series plot of PCE (axis 20 to 30) against Time, 1977 to 1980]