Data presentation and descriptive statistics

Similar documents
DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Exercise 1.12 (Pg )

Diagrams and Graphs of Statistical Data

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

How To Write A Data Analysis

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Northumberland Knowledge

MTH 140 Statistics Videos

Data Exploration Data Visualization

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Descriptive Statistics and Measurement Scales

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Exploratory data analysis (Chapter 2) Fall 2011

Descriptive Statistics

Module 3: Correlation and Covariance

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

DATA INTERPRETATION AND STATISTICS

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Statistics. Measurement. Scales of Measurement 7/18/2012

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

II. DISTRIBUTIONS distribution normal distribution. standard scores

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Means, standard deviations and. and standard errors

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Fairfield Public Schools

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

CALCULATIONS & STATISTICS

Descriptive Statistics

Summarizing and Displaying Categorical Data

430 Statistics and Financial Mathematics for Business

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Statistics Review PSY379

Simple linear regression

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

AP STATISTICS REVIEW (YMS Chapters 1-8)

Using SPSS, Chapter 2: Descriptive Statistics

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Variables. Exploratory Data Analysis

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Week 1. Exploratory Data Analysis

List of Examples. Examples 319

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Scatter Plots with Error Bars

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Session 7 Bivariate Data and Analysis

2. Simple Linear Regression

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

2013 MBA Jump Start Program. Statistics Module Part 3

Lecture 2. Summarizing the Sample

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

3: Summary Statistics

Expression. Variable Equation Polynomial Monomial Add. Area. Volume Surface Space Length Width. Probability. Chance Random Likely Possibility Odds

Projects Involving Statistics (& SPSS)

Part 2: Analysis of Relationship Between Two Variables

Relationships Between Two Variables: Scatterplots and Correlation

Descriptive statistics parameters: Measures of centrality

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13

Describing, Exploring, and Comparing Data

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Simple Regression Theory II 2010 Samuel L. Baker

Common Tools for Displaying and Communicating Data for Process Improvement

Valor Christian High School Mrs. Bogar Biology Graphing Fun with a Paper Towel Lab

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Chapter 6: Constructing and Interpreting Graphic Displays of Behavioral Data

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Descriptive statistics

with functions, expressions and equations which follow in units 3 and 4.

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

Glencoe. correlated to SOUTH CAROLINA MATH CURRICULUM STANDARDS GRADE 6 3-3, , , 4-9

Mean = (sum of the values / the number of the value) if probabilities are equal

Introduction to Quantitative Methods

Big Ideas in Mathematics

Course Syllabus MATH 110 Introduction to Statistics 3 credits

Correlation and Regression

Geostatistics Exploratory Analysis

COMMON CORE STATE STANDARDS FOR

Correlation key concepts:

Prentice Hall Algebra Correlated to: Colorado P-12 Academic Standards for High School Mathematics, Adopted 12/2009

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Exploratory Data Analysis

Name: Date: Use the following to answer questions 2-3:


BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

CRLS Mathematics Department Algebra I Curriculum Map/Pacing Guide

How To Check For Differences In The One Way Anova

Data Analysis Tools. Tools for Summarizing Data

Data exploration with Microsoft Excel: analysing more than one variable

Foundation of Quantitative Data Analysis

6.4 Normal Distribution

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Transcription:

Data presentation and descriptive statistics Paola Grosso SNE research group Today with Jeroen van der Ham as special guest

Instructions for use I do talk fast: Ask me to repeat if something is not clear; I made an effort to keep it interesting, but you are the guinea pigs feedback is welcome! You will not get a grade: But you will have to do some work ; 3 for the price of 2 We will start slow and accelerate; We will (ambitiously?) cover lots of material; We will also use more than the standard two hours. Sep.06 2010 - Slide 2

Introduction

Why should you pay attention? We are going to talk about Data presentation, analysis and basic statistics. Your idea is? Sep.06 2010 - Slide 4

Our motivation We want to avoid to hear this from you. 1. An essential component of scientific research; 2. A must-have skill (!) of any master student and researcher ( but useful also in commercial/industry/business settings); 3. It will help to communicate more effectively your results (incidentally, it also means higher grades during RPs). Sep.06 2010 - Slide 5

How to conduct a scientific project Research your topic Make a hypothesis. Write down your procedure. Control sample Variables Assemble your Materials. Conduct the experiment. Repeat the experiment. Analyze your results. Draw a Conclusion. This is our main focus! Sep.06 2010 - Slide 6

Roadmap for today and next week Collecting data Presenting data Descriptive statistics A real-life example (Jeroen) Basic probability theory Probability distributions Parameter estimation Confidence intervals, limits, significance Hypothesis testing Sep.06 2010 - Slide 7

Collecting data Terminology Sampling Data types

Basic terminology Population = the collection of items under investigation Sample = a representative subset of the population, used in the experiments Estimate the height? Variable = the attribute that varies in each experiment Observation = the value of a variable during taken during one of the experiments. Sep.06 2010 - Slide 9

Quick test Estimate the proportion of a population given a sample. The FNWI has N students: you interview n students on whether they use public transport to come to the Science Park; a students answer yes. Can you estimate the number of students who travel by public transport? Sep.06 2010 - Slide 10

The problem of bias Sep.06 2010 - Slide 11

Sampling Non-probability sampling: some elements of the population have no chance of selection, or where the probability of selection can't be accurately determined. Accidental (or convenience) Sampling; Quota Sampling; Purposive Sampling. Probability sampling: every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. Simple random sample Systematic random sample Stratified random sample Cluster sample Sep.06 2010 - Slide 12

Variables The attribute that varies in each experiment. Qualitative variables, cannot be assigned a numerical value. Quantitative variables, can be assigned a numerical value. Discrete data values are distinct and separate, i.e. they can be counted Categorical data values can be sorted according to category. Nominal data values can be assigned a code in the form of a number, where the numbers are simply labels Ordinal data values can be ranked or have a rating scale attached Continuous data Values may take on any value within a finite or infinite interval Sep.06 2010 - Slide 13

Quick test Discrete or continuous? The number of suitcases lost by an airline. The height of apple trees. The number of apples produced. The number of green M&M's in a bag. The time it takes for a hard disk to fail. The production of cauliflower by weight. Sep.06 2010 - Slide 14

Presenting the data Tables Charts Graphs

Frequency tables How many friends do you have on Facebook?. 23,44,156,246,37,79,156,123,267,12, 145,88,95,156,32,287,167,55,256,47, A way to summarize data. It records how often each value of the variable occurs. How you build it? Identify lower and upper limits Number of classes and width Segment data in classes Each value should fit in one (and no more) than one class: classes are mutually exclusive Friends Frequency Relative Frequency Percentage (%) Cumulative (less than) 0-50 6 6/20 30% 6 20 51-100 4 4/20 20% 10 14 101-150 2 2/20 10% 12 10 151-200 4 4/20 20% 16 8 201-250 1 1/20 5% 17 4 251-300 3 3/20 15% 20 3 Cumulative (greater than) Sep.06 2010 - Slide 16

Of course not everybody is a believer: As the Chinese say, 1001 words is worth more than a picture John McCartey Sep.06 2010 - Slide 17

Histograms The graphical representation of a frequency table; Summarizes categorical, nominal and ordinal data; Display bar vertically or horizontally, where the area is proportional to the frequency of the observations falling into that class. Useful when dealing with large data sets; Show outliers and gaps in the data set; Sep.06 2010 - Slide 18

Building an histogram Add values Add title (or caption in document) Add axis legends Sep.06 2010 - Slide 19

Pie charts Suitable to represent categorical data; Used to show percentages; Areas are proportional to value of category. Caution: You should never use a pie chart to show historical data over time; Also do not use for the data in the frequency distribution. Sep.06 2010 - Slide 20

Line charts Are commonly used to show changes in data over time; Can show trends or changes well. Year RP2 thesis Students 2004/2005 9 17 2005/2006 7 14 2006/2007 8 15 2007/2008 11 13 2008/2009 10 17 Sep.06 2010 - Slide 21

Dependent vs. independent variables N.b= the terms are used differently in statistics than in mathematics! In statistics, the dependent variable is the event studied and expected to change whenever the independent variable is altered. The ultimate goal of every research or scientific analysis is to find relations between variables. Sep.06 2010 - Slide 22

Scatter plots Displays values for two variables for a set of data; The independent variable is plotted on the horizontal axis, the dependent variable on the vertical axis; It allows to determine correlation Positive (bottom left -> top right) Negative (top left -> bottom right) Null with a trend line drawn on the data. Sep.06 2010 - Slide 23

and more Shmoo plot Forest plot Arrhenius plot Stemplot Bode plot Ternary plot Bland-Altman plot Galbraith plot Recurrence plot Nyquist plot Lineweaver Burk plot Nichols plot Star plot Q-Q plot Funnel plot Violin plot Sep.06 2010 - Slide 24

Statistics packages followed by some hands on work

Graphics and statistics tools Plenty of tools to use to plot and do statistical analysis. Just some you could use: gnuplot ROOT Excel We will use the open-source statistical computer program R. Make installation yourself; $> apt-get install r-base-core Run R as: $> R You find the documentation at: http://www.r-project.org/ Sep.06 2010 - Slide 26

Quick exercise Create a CSV file with frequency data. Now in R: > salaries <- read.csv(file= Path-to-file/Salary.csv") > salaries > salaries$salary > barplot(salaries$salary) > dev.copy(png, MyBarPlot.png ) > dev.off() Student,Salary 1,1250 2,2200 3,2345 4,6700 5,15000 6,3300 7,2230 8,1750 9,1900 10,1750 11,2100 12,2050 Can you improve this barplot? help(barplot)??plot Sep.06 2010 - Slide 27

Descriptive statistics Median, mean and mode Variance and standard deviation Basic concepts of distribution Correlation Linear regression

Median, mean and mode To estimate the centre of a set of observations, to convey a one-liner information about your measurements, you often talk of average. Let s be precise. Given a set of measurements: { x 1, x 2,, x N } The median is the middle number in the ordered data set; below and above the median there is an equal number of observations. The (arithmetic) mean is the sum of the observations divided by the number of observations. : The mode is the most frequently occurring value in the data set. Sep.06 2010 - Slide 29

Quick test Look at the (fictitious!) monthly salary distribution of fresh OS3 graduates: What is median, mean and mode of this data set? Can you figure out how to do this in R? OS3 graduates Grad 1 1250 Grad 2 2200 Grad 3 2345 Grad 4 6700 Grad 5 15000 Grad 6 3300 Grad 7 2230 Grad 8 1750 Grad 9 1900 Monthly salary (gross in ) What did you learn? Grad 10 1750 Grad 11 2100 Grad 12 2050 Sep.06 2010 - Slide 30

Outliers An outlying observation is an observation that is numerically distant from the rest of the data (for example unusually large or small compared to others) Causes: measurement error the population has a heavy-tailed distribution Sep.06 2010 - Slide 31

Symmetry and skewness A symmetrical distribution has the same number of values above and below the mean which is represented by the peak of the curve. The mean and median in a symmetrical distribution are equal. Outliers create skewed distributions: Positively skewed if the outliers are above the mean: the mean is greater than the median and the mode; Negatively skewed if the outliers are below the mean: the mean is smaller than the median and the mode.

Dispersion and variability The mean represents the central tendency of the data set. But alone it does not really gives us an idea of how the data is distributed. We want to have indications of the data variability. The range is the difference between the highest and lowest values in a set of data. It is the crudest measure of dispersion. The variance V(x) of x expresses how much x is liable to vary from its mean value: V (x) = 1 N (x i x) 2 i = x 2 x 2 The standard deviation is the square root of the variance: s x = V (x) = 1 N (x i x) 2 = x 2 x 2 i Sep.06 2010 - Slide 33

Different definitions of the Standard Deviation s x = 1 N i (x x ) 2 is the S.D. of the data sample Presumably our data was taken from a parent distributions which has mean µ and S.F. σ Data Sample Parent Distribution (from which data sample was drawn) s x σ µ x mean of our sample s S.D. of our sample µ mean of our parent dist σ S.D. of our parent dist Beware Notational Confusion! Sep.06 2010 - Slide 34

Different definitions of the Standard Deviation Which definition of σ you use, s data or σ parent, is matter of preference, but be clear which one you mean! Data Sample Parent Distribution (from which data sample was drawn) s data σ parent x µ In addition, you can get an unbiased estimate of σ parent from a given data sample using ˆ σ parent = 1 N (x x ) 2 = s data N 1 N 1 i s data = 1 N i (x x ) 2

Quartiles and percentiles Quartiles: Q 1, Q 2 and Q 3 divide the sample of observations into four groups: 25% of data points Q 1 ; 50% of data points Q 2 ; (Q 2 is the median); 75% of data points Q 3. The semi-inter-quartile range (SIQR), or quartile deviation, is: SIQR = Q Q 3 1 2 The 5-number summary: (min_value, Q 1, Q 2, Q 3 and max_value) Percentiles: The values that divide the data sample in 100 equal parts. Sep.06 2010 - Slide 36

Box and whisker plot It uses the 5-number summary. Sep.06 2010 - Slide 37

Correlation and regression

Correlation Correlation offers a predictive relationship that can be exploited in practice; it determines the extent to which values of the two variables are "proportional" to each other.. Proportional means linearly related; that is, the correlation is high if it can be "summarized" by a straight line (sloped upwards or downwards); This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Sep.06 2010 - Slide 39

Covariance and Pearson s correlation factor Given 2 variables x,y and a a dataset consisting of pairs of numbers: { (x 1,y 1 ), (x 2,y 2 ), (x N,y N ) } Dependencies between x and y are described by the sample covariance: (has dimension D(x)D(y)) The sample correlation coefficient is defined as: cov(x, y) r = [ 1,+1] s x s y (is dimensionless) Sep.06 2010 - Slide 40

Visualization of correlation r = 0 r = 0.1 r = 0.5 r = -0.7 r = -0.9 r = 0.99

Correlation & covariance in >2 variables Concept of covariance, correlation is easily extended to arbitrary number of variables so that takes the form of a n x n symmetric matrix This is called the covariance matrix, or error matrix Similarly the correlation matrix becomes Sep.06 2010 - Slide 42

Quick test Create a CSV file with frequency data. Read the file into the R memory in variable obesity. Run the following commands: Weight, Food_consumption 84,32 93,33 81,33 61,24 95,39 86,32 90,34 78,28 85,33 72,27 65,26 75,29 attach(obesity) plot(weight,food_consumption) cor(weight,food_consumption) cor(obesity) cor.test(weight,food_consumption) What have you learned? Sep.06 2010 - Slide 43

Careful with correlation coefficients! Correlation does not imply cause Correlation is a measure of linear relation only Misleading influence of a third variable Spurious correlation of a part with the whole Combination of unlike population Inference to an unlike population Sep.06 2010 - Slide 44

Least-square regression The goal is to fit a line to (x i,y i ): y i = a + bx i + ε i such that the vertical distances ε i (the error on y i ) are minimized. The resulting equation and coefficients are: ε i = y i ˆ y i ˆ y = a + bx b = (x i x )(y i y ) (x i x ) 2 = cov(x, y) s x 2 = r s y s x a = y bx Note, the correlation coefficient here Sep.06 2010 - Slide 45

Quick test From the example before in R: > pairs(obesity) > fit <- lm(food_consumption~weight) > fit > summary(fit) > plot(weight,food_consumption,pch=16) > abline(lm(food_consumption~weight),col='red') Sep.06 2010 - Slide 46

Jeroen van der Ham: An end-to-end statistical analysis

See you next week