Descriptive Statistics
  - Qualitative data
  - Quantitative data
  - Graphical methods
  - Numerical methods
Qualitative data
  - Data are classified in categories
  - Non-numerical (although they may be numerically coded)
  Elements
  - Class: each one of the categories used to classify the data
  - Frequency: number of cases in each class
  - Relative frequency: frequency divided by the total number of cases
  - Class percentage: relative frequency multiplied by 100
Example: Aphasia
  DATA SET
  Subject  Type of aphasia   Subject  Type of aphasia
   1       Broca's           12       Broca's
   2       Anomic            13       Anomic
   3       Anomic            14       Broca's
   4       Conduction        15       Anomic
   5       Broca's           16       Anomic
   6       Conduction        17       Anomic
   7       Conduction        18       Conduction
   8       Anomic            19       Broca's
   9       Conduction        20       Anomic
  10       Anomic            21       Conduction
  11       Conduction        22       Anomic
Example: Aphasia
  - Number of cases: 22
  - Classes: Anomic, Broca's, Conduction
  - Frequencies: Anomic: 10; Broca's: 5; Conduction: 7
  - Relative frequencies: Anomic: 0.45; Broca's: 0.23; Conduction: 0.32
  - Class percentages: Anomic: 45%; Broca's: 23%; Conduction: 32%
  IN TABLE FORM:
  Classes               Anomic  Broca's  Conduction  Total
  Frequencies           10      5        7           22
  Relative frequencies  0.45    0.23     0.32        1.00
  Class percentage      45%     23%      32%         100%
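The frequency table for the aphasia example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the list `aphasia` transcribes subjects 1 to 22 from the data set, and the function name `frequency_table` is an illustrative choice, not part of the original material.

```python
from collections import Counter

# Type of aphasia for subjects 1..22, in subject order.
aphasia = [
    "Broca's", "Anomic", "Anomic", "Conduction", "Broca's", "Conduction",
    "Conduction", "Anomic", "Conduction", "Anomic", "Conduction",
    "Broca's", "Anomic", "Broca's", "Anomic", "Anomic", "Anomic",
    "Conduction", "Broca's", "Anomic", "Conduction", "Anomic",
]

def frequency_table(data):
    """Return {class: (frequency, relative frequency, percentage)}."""
    n = len(data)
    counts = Counter(data)
    return {c: (f, f / n, 100 * f / n) for c, f in sorted(counts.items())}

table = frequency_table(aphasia)
for cls, (freq, rel, pct) in table.items():
    # One row per class: frequency, relative frequency, class percentage.
    print(f"{cls:<12}{freq:>4}{rel:>8.2f}{pct:>7.0f}%")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick sanity check on any such table.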
Summarizing the data
  We will use:
  - Numerical methods
  - Graphical methods
Graphical methods
  - Bar graphs: the height of the bar may represent
    - the frequency
    - the relative frequency
    - the percentage
  - Pie charts: relative frequencies are represented by fractions of the total area
  - Pareto diagrams: bar graphs with the classes ordered by size
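To illustrate the difference between a plain bar graph and a Pareto diagram, here is a small text-based sketch using the aphasia frequencies from the earlier example. Drawing bars with `#` characters and the name `pareto_order` are illustrative assumptions; any plotting tool could be used instead.

```python
# Frequencies from the aphasia example.
frequencies = {"Anomic": 10, "Broca's": 5, "Conduction": 7}

def pareto_order(freqs):
    """Return (class, frequency) pairs ordered from largest to smallest."""
    return sorted(freqs.items(), key=lambda item: item[1], reverse=True)

ordered = pareto_order(frequencies)
for cls, freq in ordered:
    # Bar length is the class frequency.
    print(f"{cls:<12}{'#' * freq} {freq}")
```

Replacing `freq` by the relative frequency or the percentage changes only the scale of the bars, not their relative heights.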
Bar graph
Pie chart
Quantitative variables
  A variable is quantitative if it represents a measure, given in a meaningful numerical scale: age, height, time, length, concentration, pressure, ...
  Again, we can summarize the data using:
  - Graphical methods
  - Numerical methods
A few graphical methods for quantitative variables
  We want to put some order in the set of numbers to get an idea about the size of the numbers and their spread.
  Methods
  - Stem and leaf displays: put together all the numbers that have the same first digit, then order them by the second digit
  - Dot plots
  - Histograms
Stem and leaf
  - Group the numbers that share all digits but the last
  - List the shared digits (the stem) vertically, once per group
  - Write the last digits (the leaves), in order, within each group
Example
  From the collection of numbers
  225 228 252 228 237 237 240 198 240 210 210 210
  228 198 228 240 192 240 240 192 210 225 228 231
  210 225 264 204 240 240 210 255 237 207
  First we will get:     And then:
  19 | 8822              19 | 2288
  20 | 47                20 | 47
  21 | 000000            21 | 000000
  22 | 58888585          22 | 55588888
  23 | 7717              23 | 1777
  24 | 0000000           24 | 0000000
  25 | 25                25 | 25
  26 | 4                 26 | 4
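The two steps of the example (group by stem, then order the leaves) can be sketched in Python as follows. This is a minimal version that assumes non-negative integers with a single-digit leaf; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

values = [225, 228, 252, 228, 237, 237, 240, 198, 240, 210, 210, 210,
          228, 198, 228, 240, 192, 240, 240, 192, 210, 225, 228, 231,
          210, 225, 264, 204, 240, 240, 210, 255, 237, 207]

def stem_and_leaf(data):
    """Map each stem (all digits but the last) to its sorted leaves."""
    groups = defaultdict(list)
    for x in data:
        stem, leaf = divmod(x, 10)   # 228 -> stem 22, leaf 8
        groups[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in sorted(groups.items())}

display = stem_and_leaf(values)
for stem, leaves in display.items():
    print(stem, "|", "".join(str(d) for d in leaves))
```

Stems with no observations are simply absent here; a fuller display would usually list them with an empty row so that gaps in the data stay visible.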
Dot plots
  - Dot plots display a dot for each observation
  - Dots for repeated values are aligned next to each other
  - Lines of dots representing consecutive values are placed next to each other
  - We may have too many distinct values to plot; in that case the values are placed in classes before plotting
Example
  Use data from exercise 2.182 on hearing loss to build the dot plot.
  Coding goes from 1 (hearing within normal limits) to 7 (severe-to-profound loss).
  6 7 1 1 2 6 4 6 4 2 5 2 5 1 5
  4 6 6 5 5 5 2 5 3 6 4 6 6 4 2
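A text version of the dot plot for the hearing-loss codes can be built by counting how often each code occurs. This is only a sketch: the horizontal layout (one row per code) and the name `dot_plot_rows` are assumptions made for illustration.

```python
from collections import Counter

hearing = [6, 7, 1, 1, 2, 6, 4, 6, 4, 2, 5, 2, 5, 1, 5,
           4, 6, 6, 5, 5, 5, 2, 5, 3, 6, 4, 6, 6, 4, 2]

def dot_plot_rows(data, low, high):
    """One row per value from low to high, with one dot per observation."""
    counts = Counter(data)
    return [f"{v} | " + "." * counts[v] for v in range(low, high + 1)]

rows = dot_plot_rows(hearing, 1, 7)
print("\n".join(rows))
```

Listing every code from 1 to 7, even one that never occurs, keeps consecutive values next to each other as the description of dot plots requires.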
Histograms
  Histograms are graphs of the frequency or relative frequency of a variable on an interval (class interval). Usually:
  - Class intervals are placed on the horizontal axis
  - Class intervals are drawn with a length proportional to their width (in the units in which the variable is measured)
  - In most cases we will use intervals of equal length
  - The frequencies or relative frequencies are marked on the vertical axis
Example
  A 1903 paper reported the lengths of Cuckoo's (Cuculus canorus) eggs, classified by the species of the nest where they were found. The following measurements, in mm, correspond to the lengths of the eggs found in Meadow Pipit's (Anthus pratensis) nests.
  19.65 20.05 20.65 20.85 21.65 21.65 21.65 21.85 21.85
  21.85 22.05 22.05 22.05 22.05 22.05 22.05 22.05 22.05
  22.05 22.05 22.25 22.25 22.25 22.25 22.25 22.25 22.25
  22.25 22.45 22.45 22.45 22.65 22.65 22.85 22.85 22.85
  22.85 23.05 23.25 23.25 23.45 23.65 23.85 24.25 24.45
Example
  The values are already ordered from lowest to highest.
  We have 45 values, so we will use 10 intervals of equal length. As a rule of thumb: divide the number of values by 5 (here 45/5 = 9, rounded to 10 so that the interval length is a convenient number); never take more than 20 intervals.
  The range is $24.45 - 19.65 = 4.80 \approx 5.00$, so the intervals will have a length of 0.5 mm.
  Count the number of cases in each interval:
  Interval        Cases
  19.50 - 20.00    1
  20.00 - 20.50    1
  20.50 - 21.00    2
  21.00 - 21.50    0
  21.50 - 22.00    6
  22.00 - 22.50   21
  22.50 - 23.00    6
  23.00 - 23.50    4
  23.50 - 24.00    2
  24.00 - 24.50    2
  We construct bars of these heights on the corresponding intervals.
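The counting step for the egg-length histogram can be sketched in Python. The data list is built from the repeated values in the example; the function name `histogram_counts`, the half-open intervals, and the conversion to hundredths of a millimetre (to avoid floating-point edge effects) are choices made for this illustration.

```python
# Egg lengths in mm (45 values), written compactly using repetition counts.
lengths = ([19.65, 20.05, 20.65, 20.85] + [21.65] * 3 + [21.85] * 3
           + [22.05] * 10 + [22.25] * 8 + [22.45] * 3 + [22.65] * 2
           + [22.85] * 4 + [23.05] + [23.25] * 2
           + [23.45, 23.65, 23.85, 24.25, 24.45])

def histogram_counts(data, start, width, n_bins):
    """Count the values falling in [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        # Integer arithmetic in hundredths of a unit decides the bin index.
        k = (round(x * 100) - round(start * 100)) // round(width * 100)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

counts = histogram_counts(lengths, start=19.50, width=0.50, n_bins=10)
for k, c in enumerate(counts):
    lower = 19.50 + 0.50 * k
    print(f"{lower:5.2f} - {lower + 0.50:5.2f}  {'#' * c} {c}")
```

In this data set no value falls exactly on an interval boundary, so the choice between open and closed endpoints does not affect the counts.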
Descriptive numerical measures
  For central tendency
  - Mean
  - Median
  - Mode
  For variability
  - Range
  - Sample variance
  - Sample standard deviation
  - Interquartile range
Sample mean
  The average of the numbers in the sample.
  If the list of numbers is $x_1, x_2, \ldots, x_n$, the mean is
  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  Example
  $x_1 = 3.2$, $x_2 = 4.3$, $x_3 = 5.4$, $x_4 = 3.1$, $x_5 = 2.7$
  $\sum x = 18.7$, $\bar{x} = 18.7 / 5 = 3.74$
Example
  Is this mean representative of the given list of numbers?
  $x_1 = x_2 = \cdots = x_9 = 1$, $x_{10} = 100$
  $\sum x = 109$, $\bar{x} = 10.9 \approx 11$
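Both mean examples are easy to check numerically. The sketch below uses plain Python; the name `sample_mean` is illustrative.

```python
def sample_mean(xs):
    """Sum of the values divided by how many there are."""
    return sum(xs) / len(xs)

first = [3.2, 4.3, 5.4, 3.1, 2.7]
second = [1] * 9 + [100]          # nine ones and a single large value

mean_first = sample_mean(first)    # close to 3.74
mean_second = sample_mean(second)  # 10.9
print(round(mean_first, 2), mean_second)

# How many values of the second list lie below its own mean?
below = sum(1 for x in second if x < mean_second)
print(below, "of", len(second), "values are below the mean")
```

In the second list nine of the ten values lie below the mean, which is why the mean alone gives a poor picture of a list with one extreme value.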
Median
  Given a list of numbers, a median is any number that divides the ordered list into two equal parts.
  Formally, $M$ is a median of the list $x_1, x_2, \ldots, x_n$ if
  1. the number of elements in the list that are greater than or equal to $M$ is at least $n/2$
  AND
  2. the number of elements in the list that are less than or equal to $M$ is at least $n/2$
  If there is more than one median, usually the average of the largest and the smallest is chosen as the median.
Example
  The only median of
  $x_1 = 3.2$, $x_2 = 4.3$, $x_3 = 5.4$, $x_4 = 3.1$, $x_5 = 2.7$
  is 3.2.
  However, all the numbers in the interval $[3.2, 3.8]$ are medians of the list
  $x_1 = 3.2$, $x_2 = 4.3$, $x_3 = 5.4$, $x_4 = 3.1$, $x_5 = 2.7$, $x_6 = 3.8$
  although usually $\frac{3.2 + 3.8}{2} = 3.5$ is taken as the median.
  Observe the difference in finding the median when the list has an even or an odd number of elements.
Example
  For the set of numbers
  $x_1 = x_2 = \cdots = x_9 = 1$, $x_{10} = 100$
  i.e. 1 1 1 1 1 1 1 1 1 100
  the only median is 1: there are 9 ($\geq 10/2$) elements in the list that are less than or equal to 1, and there are 10 ($\geq 10/2$) elements in the list that are greater than or equal to 1.
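The formal definition of the median translates directly into code: after sorting, the medians form the interval between the two middle positions, which collapses to a single value when the list has an odd number of elements. The following sketch uses illustrative function names (`median_interval`, `median`) and the usual midpoint convention.

```python
def median_interval(xs):
    """Smallest and largest numbers that satisfy the median definition."""
    s = sorted(xs)
    n = len(s)
    # For odd n both indices coincide; for even n they are the two middle ones.
    return s[(n - 1) // 2], s[n // 2]

def median(xs):
    """Midpoint of the interval of medians (the usual convention)."""
    lo, hi = median_interval(xs)
    return (lo + hi) / 2

odd_list = [3.2, 4.3, 5.4, 3.1, 2.7]
even_list = odd_list + [3.8]
skewed = [1] * 9 + [100]

print(median_interval(odd_list), median(odd_list))
print(median_interval(even_list), median(even_list))
print(median_interval(skewed), median(skewed))
```

For the list with nine ones and one 100, the median stays at 1 while the mean is 10.9, so the median is not affected by the single extreme value.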
Mode
  Definition
  The mode of a list of numbers is the most frequent number in the list.
  In the examples:
  - In the last list of numbers, the mode is 1, which appears 9 times in the list
  - In the egg length example, the mode is 22.05, which appears 10 times in the list
Modal class
  Sometimes, when the measure of the variable is "continuous", there may be no repetitions, but when the data are grouped in intervals to draw a histogram, one of these intervals is more frequent than the others.
  - This interval is called the modal class
  - The center point of the modal class is sometimes taken as the mode
  In the egg length example, the modal class is the interval $[22.0, 22.5]$.
  Using only the information from the histogram we would take the mode as 22.25.
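The mode and the modal class of the egg-length data can be computed as sketched below. The function names are illustrative, the intervals are the ten classes of width 0.5 mm used for the histogram, and ties are resolved by taking the first candidate found, which is an assumption since the slides do not discuss ties.

```python
from collections import Counter

lengths = ([19.65, 20.05, 20.65, 20.85] + [21.65] * 3 + [21.85] * 3
           + [22.05] * 10 + [22.25] * 8 + [22.45] * 3 + [22.65] * 2
           + [22.85] * 4 + [23.05] + [23.25] * 2
           + [23.45, 23.65, 23.85, 24.25, 24.45])

def mode(xs):
    """Most frequent value and its count (first one found if there is a tie)."""
    return Counter(xs).most_common(1)[0]

def modal_class(xs, start, width, n_bins):
    """Return (lower limit, upper limit, center) of the most frequent class."""
    counts = [0] * n_bins
    for x in xs:
        k = (round(x * 100) - round(start * 100)) // round(width * 100)
        if 0 <= k < n_bins:
            counts[k] += 1
    k = counts.index(max(counts))      # first class with the highest count
    lower = start + k * width
    return lower, lower + width, lower + width / 2

print(mode(lengths))
print(mode([1] * 9 + [100]))
print(modal_class(lengths, start=19.50, width=0.50, n_bins=10))
```

The mode of the raw data (22.05) and the center of the modal class (22.25) differ, because grouping into classes discards the exact values.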
Relationships
  - If the data are symmetric: Mean = Median
  - If there are a few values a lot bigger than the rest: Mean > Median. We say that the data are skewed to the right
  - If there are a few values a lot smaller than the rest: Mean < Median. We say that the data are skewed to the left
Numerical measures of variability
  - Range
  - Sample variance (sample standard deviation)
  - Quartiles
  - Interquartile range
  - Percentiles
Range
  Definition
  The range of a list of numbers is simply the highest value minus the lowest value in the list.
Sample variance
  To measure the spread of the data around the center we look at the distances from the values to the center and average them in a special form:
  Definition
  The sample variance of the list of numbers $x_1, x_2, \ldots, x_n$ is
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
  Observe
  - the square in the notation: $s^2$
  - the square in each term of the sum: $(x_i - \bar{x})^2$
  - the average of the $n$ terms is taken dividing by $n - 1$
Standard Deviation
  The unit of measure in the sample variance is the square of the unit in the data. To have an idea of the spread in terms of the unit of measure in the variable, take the square root.
  Sample Standard Deviation
  $s = \sqrt{s^2}$
  A simplifying formula
  $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \left( \sum_{i=1}^{n} x_i^2 \right) - n \bar{x}^2$
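As a numerical check of the definition and of the simplifying formula, the sketch below computes the sample variance of the five numbers from the mean example in both ways. The function names are illustrative; only the standard library is used.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Same quantity via the simplifying formula: sum of squares minus n * mean^2."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

xs = [3.2, 4.3, 5.4, 3.1, 2.7]
s2 = sample_variance(xs)            # close to 4.852 / 4 = 1.213
s2_short = sample_variance_shortcut(xs)
s = math.sqrt(s2)                   # sample standard deviation
print(round(s2, 3), round(s2_short, 3), round(s, 3))
```

The shortcut needs a single pass over the data, but in floating-point arithmetic it can lose precision when the mean is large compared with the spread, so the definition is the safer choice in code.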
Extracting info from the SD
  The SD tells us how far from the mean the data points are. We can use two distinct criteria to interpret its value:
  - The Chebyshev rule, very conservative and valid for any data set: the fraction of points within $k$ standard deviations from the mean is at least $1 - \frac{1}{k^2} = \frac{k^2 - 1}{k^2}$ (informative only for $k > 1$)
  - The normal rule, valid for special types of symmetric data (approximately bell-shaped):
    - about 68% of the data are within one SD from the mean
    - about 95% of the data are within two SDs from the mean
    - more than 99% of the data are within three SDs from the mean
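The two rules can be compared on the list with nine ones and one 100 used in the mean and median examples. This is a sketch with illustrative function names; the sample standard deviation uses the $n - 1$ divisor, as in the definition of the sample variance.

```python
import math

def fraction_within(xs, k):
    """Fraction of the data within k sample standard deviations of the mean."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return sum(1 for x in xs if abs(x - mean) <= k * s) / n

def chebyshev_bound(k):
    """Minimum fraction guaranteed by the Chebyshev rule (meaningful for k > 1)."""
    return 1 - 1 / k ** 2

skewed = [1] * 9 + [100]
for k in (1, 2, 3):
    print(k, fraction_within(skewed, k), round(chebyshev_bound(k), 3))
```

Here 90% of the values lie within two SDs of the mean: the Chebyshev bound of 75% holds, while the 95% figure of the normal rule does not, as expected for data that are far from symmetric.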
Percentiles
  Definition
  For any number $p$, $0 < p < 100$, we say that $P$ is the $p$th percentile of a data set if
  1. the number of elements in the set that are less than or equal to $P$ is at least $p\%$ of the data set
  AND
  2. the number of elements in the set that are greater than or equal to $P$ is at least $(100 - p)\%$ of the data set
Quartiles
  - The 25th percentile is called the first or lower quartile (Q1)
  - The 75th percentile is called the third or upper quartile (Q3)
  - The median coincides with the 50th percentile, which is also the second or middle quartile
  - The percentiles that are a multiple of 10 are called deciles
  Interquartile range
  $IQR = Q3 - Q1$
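The percentile definition can be turned into code in the same way as the median: the numbers satisfying both conditions form an interval between two positions of the sorted list. The sketch below assumes an integer $p$ and, when a whole interval qualifies, takes its midpoint by analogy with the median convention (the slides do not state a convention for other percentiles). Function names are illustrative, and statistical packages use several other conventions, so their results may differ slightly.

```python
def ceil_div(a, b):
    """Ceiling of a / b for non-negative integers."""
    return -(-a // b)

def percentile_interval(xs, p):
    """Smallest and largest numbers satisfying the p-th percentile definition."""
    s = sorted(xs)
    n = len(s)
    lo = s[ceil_div(p * n, 100) - 1]          # at least p% of the data <= lo
    hi = s[n - ceil_div((100 - p) * n, 100)]  # at least (100 - p)% >= hi
    return lo, hi

def percentile(xs, p):
    lo, hi = percentile_interval(xs, p)
    return (lo + hi) / 2

# The 34 numbers of the stem-and-leaf example.
values = [225, 228, 252, 228, 237, 237, 240, 198, 240, 210, 210, 210,
          228, 198, 228, 240, 192, 240, 240, 192, 210, 225, 228, 231,
          210, 225, 264, 204, 240, 240, 210, 255, 237, 207]

q1, q2, q3 = (percentile(values, p) for p in (25, 50, 75))
iqr = q3 - q1
print(q1, q2, q3, iqr)
```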
Example
Outliers
  Outliers are unusually large or small measurements in a data set. Outliers may be due to one of the following causes:
  1. The measurement is incorrect.
  2. The measurement does not correspond to the same population.
  3. The measurement corresponds to an odd event in the population.
  Normally we will consider as potential outliers those values that lie more than $1.5 \cdot IQR$ below Q1 or more than $1.5 \cdot IQR$ above Q3.
Boxplots
  A boxplot, or box and whiskers diagram, gives an idea of the spread and shape of the data set.
  To make a boxplot we need:
  - The five number summary: Min, Q1, Median, Q3, Max
  - The interquartile range: the difference $IQR = Q3 - Q1$
  - The "inner" and "outer" fences
  - The outliers
  Then:
  - Draw the Median
  - Draw Q1 and Q3 (the "hinges") and close the box
  - Mark the outliers
  - Find the min and max after discarding the outliers
  - Draw the whiskers
The TILLRATIO data set
  Ratio Al / Be
   #  Location  Ratio
   1  UMRB-1    3.75
   2  UMRB-1    4.05
   3  UMRB-1    3.81
   4  UMRB-1    3.23
   5  UMRB-1    3.13
   6  UMRB-1    3.30
   7  UMRB-1    3.21
   8  UMRB-2    3.32
   9  UMRB-2    4.09
  10  UMRB-2    3.90
  11  UMRB-2    5.06
  12  UMRB-2    3.85
  13  UMRB-2    3.88
  14  UMRB-3    4.06
  15  UMRB-3    4.56
  16  UMRB-3    3.60
  17  UMRB-3    3.27
  18  UMRB-3    4.09
  19  UMRB-3    3.38
  20  UMRB-3    3.37
  21  SWRA      2.73
  22  SWRA      2.95
  23  SWRA      2.25
  24  SD        2.73
  25  SD        2.55
  26  SD        3.06
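The quantities needed for a boxplot of the 26 ratios can be computed as sketched below. This is not the SPSS procedure: the quartiles follow the percentile definition given earlier (with the midpoint taken when a whole interval qualifies), only the $1.5 \cdot IQR$ fences are used, and all function names are illustrative. Other quartile conventions may give slightly different numbers.

```python
ratios = [3.75, 4.05, 3.81, 3.23, 3.13, 3.30, 3.21,   # UMRB-1
          3.32, 4.09, 3.90, 5.06, 3.85, 3.88,         # UMRB-2
          4.06, 4.56, 3.60, 3.27, 4.09, 3.38, 3.37,   # UMRB-3
          2.73, 2.95, 2.25,                           # SWRA
          2.73, 2.55, 3.06]                           # SD

def ceil_div(a, b):
    """Ceiling of a / b for non-negative integers."""
    return -(-a // b)

def percentile(xs, p):
    """p-th percentile for integer p, by the 'at least p%' definition."""
    s = sorted(xs)
    n = len(s)
    lo = s[ceil_div(p * n, 100) - 1]
    hi = s[n - ceil_div((100 - p) * n, 100)]
    return (lo + hi) / 2

q1, med, q3 = (percentile(ratios, p) for p in (25, 50, 75))
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Potential outliers lie beyond the fences; whiskers end at the most
# extreme values that are not outliers.
outliers = [x for x in sorted(ratios) if x < lower_fence or x > upper_fence]
inside = [x for x in ratios if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print("Q1, median, Q3:", q1, round(med, 3), q3)
print("fences:", round(lower_fence, 3), round(upper_fence, 3))
print("outliers:", outliers, "whiskers:", whiskers)
```

With these conventions the value 5.06 lies just above the upper fence (about 5.055), so it is flagged as a potential outlier and the upper whisker ends at 4.56.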
SPSS output: Stem-and-leaf
SPSS: Descriptive Statistics
SPSS: Box and whiskers 1
SPSS: Box and whiskers 2
Bivariate data
  In the ALWINS data we expect that there should be a "positive" relationship between the two variables.
  One way to find out whether this relationship exists is to plot the data in what is called a scatterplot.
  We draw two axes and plot each case as a point on the plane, choosing one of the variables as x and the other as y.
  We then observe whether the points roughly line up.
Scatterplot
  The data in ALWINS give the following plot