Summarising Data. Mark Lunt 11/10/2016. Arthritis Research UK Centre for Excellence in Epidemiology University of Manchester

Similar documents
STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Exploratory data analysis (Chapter 2) Fall 2011

Week 1. Exploratory Data Analysis

Lecture 1: Review and Exploratory Data Analysis (EDA)

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

Variables. Exploratory Data Analysis

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Describing and presenting data

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Northumberland Knowledge

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Data Exploration Data Visualization

Descriptive Statistics

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

Descriptive statistics parameters: Measures of centrality

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Descriptive Statistics and Measurement Scales

Using SPSS, Chapter 2: Descriptive Statistics

Exercise 1.12 (Pg )

MBA 611 STATISTICS AND QUANTITATIVE METHODS

Describing, Exploring, and Comparing Data

Means, standard deviations and. and standard errors

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Diagrams and Graphs of Statistical Data

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

EXAM #1 (Example) Instructor: Ela Jackiewicz. Relax and good luck!

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

Mind on Statistics. Chapter 2

Summarizing and Displaying Categorical Data

Exploratory Data Analysis. Psychology 3256

Statistics. Measurement. Scales of Measurement 7/18/2012

THE BINOMIAL DISTRIBUTION & PROBABILITY

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Foundation of Quantitative Data Analysis

Descriptive Statistics

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Exploratory Data Analysis

Demographics of Atlanta, Georgia:

Chapter 1: Exploring Data

2 Describing, Exploring, and

CHAPTER THREE. Key Concepts

AP * Statistics Review. Descriptive Statistics

Descriptive Statistics

AP Statistics Solutions to Packet 2

Characteristics of Binomial Distributions

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

Lecture 2. Summarizing the Sample

Introduction to Quantitative Methods

Introduction to Environmental Statistics. The Big Picture. Populations and Samples. Sample Data. Examples of sample data

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

Lesson 4 Measures of Central Tendency

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

How To Write A Data Analysis

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Chapter 2 Data Exploration

Data! Data! Data! I can t make bricks without clay. Sherlock Holmes, The Copper Beeches, 1892.

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Sta 309 (Statistics And Probability for Engineers)

AMS 7L LAB #2 Spring, Exploratory Data Analysis

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

determining relationships among the explanatory variables, and

2. Filling Data Gaps, Data validation & Descriptive Statistics

Introduction; Descriptive & Univariate Statistics

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Dongfeng Li. Autumn 2010

Statistics Chapter 2

Bernd Klaus, some input from Wolfgang Huber, EMBL

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Mathematical goals. Starting points. Materials required. Time needed

Introduction to Statistics and Probability. Michael P. Wiper, Universidad Carlos III de Madrid

DESCRIPTIVE STATISTICS - CHAPTERS 1 & 2 1

Measures of Central Tendency and Variability: Summarizing your Data for Others


Directions for Frequency Tables, Histograms, and Frequency Bar Charts

Descriptive statistics

Lesson outline 1: Mean, median, mode

Descriptive Statistics

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Week 11 Lecture 2: Analyze your data: Descriptive Statistics, Correct by Taking Log

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

What Does the Normal Distribution Sound Like?

3: Summary Statistics

Data Visualization Techniques

Basics of Statistics

Sampling and Descriptive Statistics

Data exploration with Microsoft Excel: univariate analysis

MTH 140 Statistics Videos

AP STATISTICS REVIEW (YMS Chapters 1-8)

4 Other useful features on the course web page. 5 Accessing SAS

DATA INTERPRETATION AND STATISTICS

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

Exploratory Data Analysis

Transcription:

Summarising Data Mark Lunt Arthritis Research UK Centre for Excellence in Epidemiology University of Manchester 11/10/2016

Summarising Data Types of data Today we will consider Different types of data Appropriate ways to summarise these data

Types of Data Types of data Qualitative Nominal Outcome is one of several categories Ordinal Outcome is one of several ordered categories Quantitative Discrete Can take one of a fixed set of numerical values Continuous Can take any numerical value

Examples of Types of Data Nominal Blood group; Hair colour. Ordinal Strongly agree, agree, disagree, strongly disagree. Discrete Number of children. Continuous Birthweight.

Caveats with Data Types Distinction between nominal and ordinal variables can be subjective: e.g. vertebral fracture types: Wedge, Concavity, Biconcavity, Crush. Could argue that a crush is worse than a biconcavity which is worse than a concavity..., but this is not self-evident. Distinction between ordinal and discrete variables can be subjective: e.g. cancer staging I, II, III, IV: sounds discrete, but better treated as ordinal. Continuous variables generally measured to a fixed level of precision, which makes them discrete. Not a problem, provide there are enough levels.

Types of Variables Types of data What type of variable are each of the following: Number of visits to a G.P. this year Marital Status Size of tumour in cm Pain, rated as minimal/moderate/severe/unbearable Blood pressure (mm Hg)

Summarizing Count the number of subjects in each group. The count is commonly refered to as the frequency The proportion in each group is referred to as the relative frequency Stata command to produce a tabulation is tabulate varname

of region Freq. Percent Cum. ------------+----------------------------------- Canada 422 22.84 22.84 USA 541 29.27 52.11 Mexico 223 12.07 64.18 Europe 493 26.68 90.85 Asia 169 9.15 100.00 ------------+----------------------------------- Total 1,848 100.00

of Bar Chart: Data represented as a series of bars, height of bar proportional to frequency. Pie Chart: Data represented as a circle divided into segments, area of segment proportional to frequency. Pictograms: Similar to bar chart, but uses a number of pictures to represent each bar. Bar chart is the easiest to understand.

Bar Chart Types of data Frequency 0 200 400 600 Canada USA Mexico Europe Asia region

Summarizing Simplest method: treat as qualitative data. Divide observations into groups May be unnecessary for discrete data. Look at the frequency distribution of these groups Can use table or diagram.

The Histogram Types of data Similar to a bar chart Continuous, not categorical variable Area of bars proportional to probability of observation being in that bar Axis can be Frequency (heights add up to n) Percentage (heights add up to 100%) Density (Areas add up to 1)

How Many Groups? Types of data Impossible to say. Depends on the number of observations: if individual groups are too small, results are meaningless. With discrete variables, exact positions of boundaries may be important. Tables need few groups, graphs can have more if sufficient numbers. May be decided for you in software.

Histograms Types of data Density 0.02.04.06.08 female male 140 160 180 200 140 160 180 200 Graphs by sex measured height (cm)

Histogram: Effect of Wrong number of bins Density 0.02.04.06 0 10 20 30 x Density 0.01.02.03.04 0 10 20 30 x 24 bins (default) 30 bins (correct)

Bar charts and histograms in Stata histogram varname produces a histogram Number of bars can by set by option bin() Width of a bar can be set by option width() histogram varname, discrete produces a bar chart What stata calls a bar chart is the mean of second variable subdivided by category, rather than a frequency.

of Need to know: 1 What is a typical value ( location ) 2 How much do the values vary ( scale ) Simplest distribution to summarize is the normal distribution Other summary statistics (skewness, kurtosis etc) thought of relative to normal distribution.

The Normal Distribution Symmetrical Bell-shaped distribution Easiest to use mathematically Many variables are normally distributed Can be described by two numbers Mean (measure of location) Standard Deviation (measure of variation)

Histogram & Normal Distribution Density 0.02.04.06.08 female male 140 160 180 200 140 160 180 200 Graphs by sex measured height (cm) Density normal nurseht

Non-Normal Distributions Normal distribution is symmetric. Asymmetric distributions are called skewed : Positively skewed = some extremely high values (mean > median). Negatively skewed = some extremely low values (mean < median). Distribution may have more than one peak : bi-modal. Usually formed by mixing two different groups.

Non-Normal Distributions Density 0.05.1.15.2 Density 0.05.1.15.2 4 2 0 2 4 6 y1 0 5 10 15 20 y2 Bimodal Distribution Positively Skewed Dist n

Measures of Location Types of data What is the value of a typical observation? May be: (Arithmetic) Mean Median Other forms of mean Rarely used Only if data has been transformed

Arithmetic Mean Types of data Add them up and divide by how many there are. x = x 1 + x 2 +... + x n n = (Σ n i=1 x i)/n

Median Types of data Arrange in increasing order, pick the middle. If an even number of observations, take mean of middle two. Ignores the precise magnitude of most observations Contains less information than mean May be useful if there are outliers Less easy to use mathematically.

Mean vs. Median Types of data Consider this series of durations of absence from work due to sickness (in days). 1,1,2,2,3,3,4,4,4,4,5,6,6,6,6,7,8,10,10,38,80 Mean = 10 Median = 5 Very few observations are as large as the mean: median is more typical.

Percentiles Types of data The x th percentile is the value than which x% of observations are smaller and (100 x)% are larger. The median is the 50th percentile. Other centiles can easily be calculated, eg 5th, 25th etc.

Measures of Variation How close to the typical value are other values. Range Inter-quartile range Variance

Simple Measures of Variation Range Inter-quartile Range (Largest measurement) - (smallest measurement) Depends on only two measurements Can only increase as you add more to the sample (75th centile) - (25th centile). Less sensitive to extreme values Need fairly large numbers of observations

Standard Deviation Types of data Standard Deviation = Σ(x i x) 2 /n Nearly the average difference from the mean Uses information from every observation Not robust to outliers Variance is easy to use mathematically Standard deviation is the same units as the observations

Summary Statistics in Stata summarize varlist will give mean, SD, min and max summarize varlist, detail also gives percentiles tabstat or table can produce tables of summary statistics

: Table 1 Quantitative variables Need a measure of location & variation Normal variables: mean and SD Skewed variables: median and IQR Need to give units Qualitative variables Number and % in each category

Example Age in years: Mean (SD) 63 (7.9) Spine BMD in g/cm 2 : Median (IQR) 1.05 (0.78, 1.30) Gender: n (%) Male 1537 (44) Female 1924 (56)

The Box and Whisker Plot Very efficient summary of distribution: Shows median, upper and lower quartiles (25th and 75th percentiles). Also shows range of normal values and individual unusual values. Definitions of normal and unusual differ. Will demonstrate skewness, not bimodality. Stata command: graph box varname, [by(groupname)]

Box and Whisker Plots 4 2 0 2 4 0 5 10 15 20 Normal Distribution Positively Skewed Dist n

Transforming Data Types of data Skewed distributions may be made symmetric by a transformation. Taking logs is the most common. Other transformations (e.g. square root, reciprocal) can be used, but can be very difficult to interpret. May be better to transform back to original units to present results. Geometric mean is back-transformation of mean of log-transformed data.

Further Reading Types of data Edward R. Tufte, The Visual Display of Quantitative Information is the classic text on statistical graphs.