Exploratory data analysis
Dr. David Lucy
d.lucy@lancaster.ac.uk
Lancaster University

Graphs

Graphics are a very important part of making sense of data:
- They allow the researcher to compare quantities easily and simply by comparing lengths and/or areas.
- Many humans are adapted to view quantities rather than numbers.
- They have immediate impact.
- They suggest ideas for further work.
- For many scientists this is their main form of analysis; some of the world's best science has been done purely with graphs.

A graph is a suitable way of representing data if a line or area can represent the quantities in the data in some way.
- Several standard forms can be used.
- Standard forms are not the only forms. You can make up your own if you please; R is very good for this.

Standard forms

There are three standard types of graph:
1. histogram - used to examine the distribution of a set of observations; can be used to compare distributions between sets of observations. Observations may be discrete (with an underlying continuous variable) or continuous.
2. scatterplot - used to look for relationships between different continuous variables.
3. boxplot - sometimes called a box-and-whiskers plot; used to compare distributions of a continuous variable between groups, which is equivalent to looking for relationships between factors and continuous variables.
Histograms

[Figure: two histogram panels, "Partial" and "Full"; axis ticks 0 to 60.]

Do not confuse histograms with barcharts:
- Histograms have the area proportional to the quantity of interest; column widths are not necessarily equal, although most are.
- Barcharts have the column height proportional to the quantity of interest.

[Figure: two histogram panels, "Partial" and "Full"; axis ticks 0 to 60.]

Exercise 3.2.2

Rescale the full histogram so that the bars sum to one.

Number of runs   71    28    5     2     2     1     (109 observations)
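The difference between area and height only becomes visible when bin widths differ. The following is a minimal sketch in Python, not taken from the lecture: the observations, the bin edges and the helper name `histogram_heights` are all made up for illustration. It computes bar heights so that each bar's area equals the proportion of observations falling in its bin.

```python
def histogram_heights(values, edges):
    """Return bar heights such that height * width = proportion in the bin.

    Bins are half-open [left, right), except the last bin, which also
    includes its right edge.
    """
    n = len(values)
    heights = []
    last = len(edges) - 2
    for j in range(len(edges) - 1):
        left, right = edges[j], edges[j + 1]
        if j == last:
            count = sum(1 for v in values if left <= v <= right)
        else:
            count = sum(1 for v in values if left <= v < right)
        heights.append(count / (n * (right - left)))
    return heights


# Hypothetical observations and deliberately unequal bin widths.
values = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]
edges = [0, 5, 10, 20]

heights = histogram_heights(values, edges)
areas = [h * (edges[j + 1] - edges[j]) for j, h in enumerate(heights)]
print(heights)  # bar heights (densities)
print(areas)    # bar areas (proportions), which sum to one
```

In this made-up example the last bin holds more observations than the middle one but is twice as wide, so its bar is lower; a barchart of the raw counts would show the opposite ordering.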
Exercise 3.2.2

The frequencies sum to 109, so divide each frequency by 109:

Number of runs   71    28    5     2     2     1     (109 observations)
Normalised       0.65  0.26  0.05  0.02  0.02  0.01

Scientists sometimes call this process normalisation.

Scaled histograms

[Figure: two scaled histogram panels, "Partial" and "Full"; axis ticks 0.0 to 0.6.]

Histogram comparison

[Figure: two histograms of daily max ozone, labelled "Ladybower Reservoir".]

[Figure: "Differences" between the histograms, plotted against daily max ozone.]
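The normalisation in Exercise 3.2.2 can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen here for illustration.

```python
counts = [71, 28, 5, 2, 2, 1]   # frequencies from Exercise 3.2.2
total = sum(counts)             # 109 observations

# Divide each frequency by the total so that the bars sum to one.
proportions = [c / total for c in counts]
rounded = [round(p, 2) for p in proportions]

print(total)    # 109
print(rounded)  # [0.65, 0.26, 0.05, 0.02, 0.02, 0.01]
```

The unrounded proportions sum to one; the two-decimal values sum to 1.01 because of rounding.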
Histogram problems

[Figure: two histograms of summer daily maxima; vertical axis ticks 0.00 to 0.10.]

Kernel density estimates

[Figure: kernel density estimate for "Ladybower"; horizontal axis ozone (ppb), 20 to 120.]

[Figure: probability density plotted against x, for x from 0 to 8.]

Cumulative distributions

Recall the cumulative distribution function (c.d.f.) of a random variable $X$:

$$F(x) = P(X \le x)$$

How can we estimate this from a finite number of observations?
Cumulative distributions

Let us assume that:
- Our variables $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.).
- They are replicates of a random variable $X$ which has cumulative distribution function $F$.

We denote by $x_1, \ldots, x_n$ the observed values of $X_1, \ldots, X_n$.

The empirical cumulative distribution function (c.d.f.) is defined as:

$$F(x) = \frac{1}{n}\,(\text{number of } x_i \le x) = \frac{1}{n}\sum_{i=1}^{n} \Pi(x_i \le x)$$

where:

$$\Pi(x_i \le x) = \begin{cases} 1 & \text{if } x_i \le x \\ 0 & \text{if } x_i > x \end{cases}$$

The empirical c.d.f. is a proper distribution function and has the following properties:
- $F(x)$ is a step function with jumps at the data points;
- $F(x) = 1$ if $x \ge \max(x_1, \ldots, x_n)$;
- $F(x) = 0$ if $x < \min(x_1, \ldots, x_n)$.

To construct it:
- Take the observed values and order them so that the smallest one comes first.
- Label these ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ so that $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
- Then the $k$th ordered point $x_{(k)}$ is the $k/n$th quantile.
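The definition and construction above translate almost directly into code. The following is a minimal sketch in Python under stated assumptions: the helper name `make_ecdf` and the sample values are hypothetical and not part of the lecture. It sorts the observations once and then counts how many are less than or equal to x.

```python
from bisect import bisect_right


def make_ecdf(observations):
    """Return a function F with F(x) = (number of x_i <= x) / n."""
    ordered = sorted(observations)   # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(ordered)

    def F(x):
        # bisect_right gives the number of ordered values that are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical sample, chosen only to illustrate the definition.
sample = [2.5, 0.5, 1.5, 1.5]
F = make_ecdf(sample)

print(F(0.0))   # below the minimum: 0.0
print(F(1.5))   # three of four values are <= 1.5: 0.75
print(F(10.0))  # at or above the maximum: 1.0
```

Sorting once and using a binary search keeps each evaluation cheap; summing the indicator over all observations, as in the formula, would give the same values.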
Exercise 3.2.5

For the observations {1, 2, 2, 3, 4}, find $F(x)$ and sketch the plot.

x                    0    0.5  1    1.5  2    2.5  3    3.5  4    4.5  5
number of x_i <= x   0    0    1    1    3    3    4    4    5    5    5
F(x)                 0    0    1/5  1/5  3/5  3/5  4/5  4/5  5/5  5/5  5/5
F(x)                 0    0    0.2  0.2  0.6  0.6  0.8  0.8  1    1    1

[Figure: step plot against x; vertical axis labelled "density", ticks 0.0 to 1.0.]

Exercise 3.2.6

Construct the c.d.f. for the first 20 points from the summer ozone measurements and sketch it. These are:

32 29 32 32 33 27 34 22 30 35 27 23 28 34 35 45 36 26 23 16

At each sorted data point the empirical c.d.f. jumps by $1/n$, which is $1/20$ as $n = 20$; a value that occurs more than once contributes one such jump per occurrence.
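The step heights for Exercise 3.2.6 can be tabulated before sketching. This is a minimal Python sketch using the 20 ozone values listed above; the variable names are chosen here for illustration.

```python
from collections import Counter

# First 20 summer ozone measurements from Exercise 3.2.6.
ozone = [32, 29, 32, 32, 33, 27, 34, 22, 30, 35,
         27, 23, 28, 34, 35, 45, 36, 26, 23, 16]
n = len(ozone)
counts = Counter(ozone)   # how often each distinct value occurs

cumulative = 0
steps = []   # (value, height of the empirical c.d.f. from that value on)
for value in sorted(counts):
    cumulative += counts[value]   # tied values give a larger jump
    steps.append((value, cumulative / n))

for value, height in steps:
    print(value, height)
```

The first row is 16 with height 0.05 and the last is 45 with height 1.0; the value 32 occurs three times, so the c.d.f. jumps by 3/20 there, from 0.5 to 0.65.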
Exercise 3.2.6

[Figure: empirical c.d.f. Fn(x), from 0.0 to 1.0, against summer daily maxima, 15 to 45.]

Summer ozone

[Figure: empirical c.d.f. Fn(x), from 0.0 to 1.0, against summer daily maxima.]

Scatterplots

- Scatterplots look at the relationship between continuous variables.
- Usually they project two dimensions onto two dimensions.
- There are several ways of representing three dimensions.
- Scatterplots are the mainstay of the physical sciences.

[Figure: scatterplots in two panels, "Summer" and "Winter"; horizontal axis NO2.]
Scatterplots

[Figure: scatterplot of peripheral COHb saturation level against heart COHb saturation level.]

Independence

- Scatterplots can be used to look for dependence between continuous variables.
- They can also be useful to identify situations in which variables appear to be independent.
- If two variables are independent, then the distribution of one variable will look the same regardless of the value of the other variable. This is what the ozone versus NO2 plots above looked like.

Conditional probabilities were introduced in Math104. If $A$ and $B$ are two events then, as long as $P(B) > 0$, the conditional probability of $A$ given $B$ is written as $P(A \mid B)$ and calculated from:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

We can look for structure in our data, including the dependence of one variable on another, by examining conditional distributions of subsets of our data. Do this by separating the data by some defined criterion and plotting the subsets.
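Separating data by a criterion and applying the conditional probability formula can both be sketched in a few lines of Python. The (NO2, ozone) pairs below are invented purely for illustration and are not the lecture data; the thresholds and the events A and B are likewise assumptions made for the example.

```python
from statistics import median

# Hypothetical (NO2, ozone) pairs, invented purely for illustration.
pairs = [(12, 35), (18, 41), (25, 30), (33, 38), (38, 27),
         (44, 22), (47, 31), (52, 18), (55, 25), (58, 15)]

# Separate the ozone values by a defined criterion on NO2.
low = [ozone for no2, ozone in pairs if no2 <= 40]
high = [ozone for no2, ozone in pairs if 40 < no2 <= 60]

# Compare a summary of each conditional distribution.
print(median(low), median(high))

# Empirical conditional probability P(A | B) = P(A and B) / P(B),
# with A = "ozone > 30" and B = "NO2 <= 40" (both chosen for illustration).
n = len(pairs)
p_b = sum(1 for no2, _ in pairs if no2 <= 40) / n
p_a_and_b = sum(1 for no2, ozone in pairs if no2 <= 40 and ozone > 30) / n
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)
```

In this invented data the two subsets have clearly different medians, which would hint at dependence. For independent variables the subsets would look alike whichever criterion was used to split them.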
Exercise 3.2.8

[Figure: four panels: "Summer Ozone, NO2 <= 40", "Winter Ozone, NO2 <= 40", "Summer Ozone, 40 < NO2 <= 60" and "Winter Ozone, 40 < NO2 <= 60".]

Boxplots

The third of the standard forms for graphs:
- Similar to multiple histograms.
- They examine the distribution of a continuous variable for different levels of a discrete variable.
- The discrete variable can be ordered or nominal.

[Figure: boxplots for "Summer" and "Winter", labelled "Ladybower"; axis ticks 0 to 100.]

Next session

Next time we shall:
1. take a look at boxplots,
2. learn about some of the classic plots from history,
3. find out what makes a good graph,
4. look at some non-standard forms of graphs.