Descriptive Data Summarization

Similar documents
Data Exploration Data Visualization

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Geostatistics Exploratory Analysis

Lecture 1: Review and Exploratory Data Analysis (EDA)

Exploratory data analysis (Chapter 2) Fall 2011

Probability and Statistics Vocabulary List (Definitions for Middle School Teachers)

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Descriptive Statistics

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Means, standard deviations and. and standard errors

II. DISTRIBUTIONS distribution normal distribution. standard scores

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

MTH 140 Statistics Videos

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Exploratory Data Analysis

Projects Involving Statistics (& SPSS)

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

3: Summary Statistics

ETL PROCESS IN DATA WAREHOUSE

Exercise 1.12 (Pg )

STAT355 - Probability & Statistics

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Variables. Exploratory Data Analysis

6.4 Normal Distribution

Module 4: Data Exploration

2. Filling Data Gaps, Data validation & Descriptive Statistics

Descriptive Statistics and Measurement Scales

Mean = (sum of the values / the number of the value) if probabilities are equal

Lesson 4 Measures of Central Tendency

Descriptive Statistics

CALCULATIONS & STATISTICS

Data Analysis Tools. Tools for Summarizing Data

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Chapter 3 RANDOM VARIATE GENERATION

Introduction to Quantitative Methods

E3: PROBABILITY AND STATISTICS lecture notes

Diagrams and Graphs of Statistical Data

Exploratory Data Analysis

Fairfield Public Schools

Data Mining: Data Preprocessing. I211: Information infrastructure II

Study Guide for the Final Exam

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

4. Continuous Random Variables, the Pareto and Normal Distributions

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Chapter 7 Section 7.1: Inference for the Mean of a Population

Final Exam Practice Problem Answers

CA200 Quantitative Analysis for Business Decisions. File name: CA200_Section_04A_StatisticsIntroduction

Lecture 2. Summarizing the Sample

Measures of Central Tendency and Variability: Summarizing your Data for Others

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Introduction to Regression and Data Analysis

AP * Statistics Review. Descriptive Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Week 1. Exploratory Data Analysis

Statistics Review PSY379

Chapter 3. Cartesian Products and Relations. 3.1 Cartesian Products

Descriptive statistics parameters: Measures of centrality

Northumberland Knowledge

Dongfeng Li. Autumn 2010

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING


How To Write A Data Analysis

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Statistical tests for SPSS

Sampling and Descriptive Statistics

Simple linear regression

Exploratory Data Analysis

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Statistics. Measurement. Scales of Measurement 7/18/2012

MBA 611 STATISTICS AND QUANTITATIVE METHODS

DATA INTERPRETATION AND STATISTICS

Exploratory Data Analysis. Psychology 3256

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

1 Descriptive statistics: mode, mean and median

Chapter 2 Data Exploration

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Quantitative Methods for Finance

How To Check For Differences In The One Way Anova

First Midterm Exam (MATH1070 Spring 2012)

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2

A and B This represents the probability that both events A and B occur. This can be calculated using the multiplication rules of probability.

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Module 3: Correlation and Covariance

Biostatistics: Types of Data Analysis

THE BINOMIAL DISTRIBUTION & PROBABILITY

Simple Regression Theory II 2010 Samuel L. Baker

How Does My TI-84 Do That

Transcription:

Descriptive Data Summarization (Understanding Data) First: Some data preprocessing problems... 1

Missing Values The approach of the problem of missing values adopted in SQL is based on nulls and three-valued logic (3VL) null corresponds to UNK for unknown 3VL a mistake? Boolean Operators AND t u f OR t u f NOT t t u f t t t t t f u u u f u t u u u u f f f f f t u f f t In scalar comparison in which either of the compared is UNK evaluates the unknown truth value 2

MAYBE Another important Boolean operator is MAYBE MAYBE t u f f t f Example Consider the query Get employees who may be- but are not definitely known to beprogrammers born before January 18, 1971, with salary less then 40.000 EMP WHERE MAYBE ( JOB = PROGRAMMER AND DOB < DATE ( 1971-1-18 ) AND SALLARY < 40000 ) 3

Without maybe we assume the existence of another operator called IS_UKN which takes a single scalar operand and returns true if operand evaluates UNK otherwise false EMP WHERE ( JOB = PROGRAMMER OR IS_UKN (JOB) ) AND ( DOB < DATE ( 1971-1-18 ) OR IS_UKN (DOB) ) AND ( SALLARY < 40000 OR IS_UKN (SALLARY) ) AND NOT ( JOB = PROGRAMMER AND DOB < DATE ( 1971-1-18 ) AND SALLARY < 40000 ) Numeric expression WEIGHT * 454 If WEIGHT is UKN, then the result is also UKN Any numeric expression is considered to evaluate UNK if any operands of that expression is itself UNK Anomalies WEIGHT-WEIGHT=UNK (0) WEIGHT/0=UNK ( zero divide ) 4

UNK is not u (unk) UNK (the value-unknown null) u (unk) (unknown truth value)...are not the same thing u is a value, UNK not a value at all! Suppose X is BOOLEAN Has tree values: t (true),f (false), u ukn X is ukn, X is known to be unk X is UKN, X is not known! Some 3VL Consequences The comparison x=x does not give true In 3VL x is not equal to itself it is happens to be UNK The Boolean expression p OR NOT(p) does not give necessarily true unk OR NOT (unk) = unk 5

Example Get all suppliers in Porto and take the union with get all suppliers not in Porto We do not get all suppliers! We need to add maybe in Porto In 2 VL p OR NOT(p) corresponds to p OR NOT(p) OR MAYBE(p) in 3VL While two cases my exhaust full range of possibilities in the real world, the database does not contain the real world - instead it contains only knowledge about real world Some 3VL Consequences The expression r JOIN r does not necessarily give r A=B and B=C together does not imply A=C... Many equivalences that are valid in 2VL break down in 3VL We will get wrong answers 6

Special Values Drop the idea of null and UNK,unk 3VL Use special values instead to represent missing information Special values are used in the real world In the real world we might use the special value? to denote hours worked by a certain employee if actual value is unknown Special Values General Idea: Use an appropriate special value, distinct from all regular values of the attribute in question, when no regular value can be used The special value must be of the applicable attribute is not just integers, but integers integers plus whatever the special value is Approach is not very elegant, but without 3VL problems, because it is in 2VL 7

3VL NULL corresponds to UNK for unknown u or unk also represended by NULL The NULL value can be surprising until you get used to it. Conceptually, NULL means a missing unknown value and it is treated somewhat differently from other values. To test for NULL, you cannot use the arithmetic comparison operators such as =, <, or <>. 3VL a mistake? Visualizing one Variable Statistics for one Variable Joint Distributions Hypothesis testing Confidence Intervals 8

Visualizing one Variable A good place to start Distribution of individual variables A common visualization is the frequency histogram It plots the relative frequencies of values in the distribution To construct a histogram Divide the range between the highest and lowest values in a distribution into several bins of equal size Toss each value in the appropriate bin of equal size The height of a rectangle in a frequency histogram represents the number of values in the corresponding bin 9

The choice of bin size affects the details we see in the frequency histogram Changing the bin size to a lower number illuminates things that were previously not seen Bin size affects not only the detail one sees in the histogram but also one s perception of the shape of distribution 10

Statistics for one Variable Sample Size The sample size denoted by N, is the number of data items in a sample Mean The arithmetic mean is the average value, the sum of all values in the sample divided by the number of values N x = i=1 x i N Statistics for one Variable Median If the values in the sample are sorted into a non decreasing order, the median is the value that splits the distribution in half (1 1 1 2 3 4 5) the median is 2 If N is even, the sample has middle values, and the median can be found by interpolating between them or by selecting one of the arbitrary 11

Mode The mode is the most common value in the distribution (1 2 2 3 4 4 4) the mode is 4 If the data are real numbers mode nearly no information Low probability that two or more data will have exactly the same value Solution: map into discrete numbers, by rounding or sorting into bins for frequency histograms We often speak of a distribution having two or more modes Distributions has two or more values that are common The mean, median and mode are measures of location or central tendency in distribution They tell us where the distribution is more dense In a perfectly symmetric, unimodal distribution (one peak), the mean, median and mode are identical Most real distributions - collecting data - are neither symmetric nor unimodal, but rather, are skewed and bumpy 12

Skew In a skewed distribution the bulk of the data are at one end of the distribution If the bulk of the distribution is on the right, so the tail is on the left, then the distribution is called left skewed or negatively skewed If the bulk of the distribution is on the left, so the tail is on the right, then the distribution is called right skewed or positively skewed The median is to be robust, because its value is not distorted by outliers Outliers: values that are very large or small and very uncommon Symmetric vs. Skewed Data Mean Median Mode Median, mean and mode of symmetric, positively and negatively skewed data 13

Trimmed mean Another robust alternative to the mean is the trimmed mean Lop off a fraction of the upper and lower ends of the distribution, and take the mean of the rest 0,0,1,2,5,8,12,17,18,18,19,19,20,26,86,116 Lop off two smallest and two larges values and take the mean of the rest Trimmed mean is 13.75 The arithmetic mean 22.75 Maximum, Minimum, Range Range is the difference between maximum and minimum 14

Interquartile Range Interquartile range is found by dividing a sorted distribution into four containing parts, each containing the same number Each part is called quartile The difference between the highest value in the third quartile and the lowest value in the second quartile is the interquartile range Quartile example 1,1,2,3,3,5,5,5,5,6,6,100 The quartiles are (1 1 2),(3 3 5),(5 5 5), (6,6,100) Interquartile range 5-3=2 Range 100-1=99 Interquartile range is robust against outliers 15

Standard Deviation and Variance Square root of the variance, which is the sum of squared distances between each value and the mean divided by population size (finite population) Example 1,2,15 Mean=6 ( ) 2 + (2 6) 2 + (15 6) 2 σ = 1 N ( x i x) 2 1 6 = 40.66 σ=6.37 3 N i=1 Sample Standard Deviation and Sample Variance Square root of the variance, which is the sum of squared distances between each value and the mean divided by sample size Example 1,2,15 Mean=6 s = 1 N 1 N i=1 ( x x i ) 2 ( ) 2 + (2 6) 2 + (15 6) 2 1 6 = 61 s=7.81 3 1 16

Because they are averages, both the mean and the variance are sensitive to outliers Big effects that can wreck our interpretation of data For example: Presence of a single outlier in a distribution over 200 values can render some statistical comparisons insignificant The Problem of Outliers One cannot do much about outliers expect find them, and sometimes, remove them Removing requires judgment and depend on one s purpose 17

Joint Distributions Good idea, to see if some variable influence others Scatter plot Provides a first look at bivariate data to see clusters of points, outliers, etc Each pair of values is treated as a pair of coordinates and plotted as points in the plane 18

Correlation Analysis Correlation coefficient (also called Pearson s product moment coefficient) r XY = ( x i x ) y i y ( ) (n 1)σ X σ Y where n is the number of tuples, X and Y are the respective means of X and Y, σ and σ are the respective standard deviation of A and B, and Σ(XY) is the sum of the XY cross-product. If r X,Y > 0, X and Y are positively correlated (X s values increase as Y s). The higher, the stronger correlation. r X,Y = 0: independent; r X,Y < 0: negatively correlated A is positive correlated B is negative correlated C is independent (or nonlinear) 19

Correlation Analysis Χ 2 (chi-square) test ( Observed Expected χ = Expected The larger the Χ 2 value, the more likely the variables are related The cells that contribute the most to the Χ 2 value are those whose actual count is very different from the expected count Correlation does not imply causality 2 2 ) # of hospitals and # of car-theft in a city are correlated Both are causally linked to the third variable: population Chi-Square Calculation: An Example Play chess Not play chess Like science fiction 250(90) 200(360) 450 Sum (row) Not like science fiction 50(210) 1000(840) 1050 Sum(col.) 300 1200 1500 Χ 2 (chi-square) calculation (numbers in parenthesis are expected counts calculated based on the data distribution in the two categories) 2 2 (250 90) χ = 90 (50 210) + 210 2 (200 360) + 360 2 (1000 840) + 840 2 = 507.93 It shows that like_science_fiction and play_chess are correlated in the group 20

Properties of Normal Distribution Curve The normal (distribution) curve From μ σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation) From μ 2σ to μ+2σ: contains about 95% of it From μ 3σ to μ+3σ: contains about 99.7% of it 68% 95%!3!2!1 0 +1 +2 +3 99.7%!3!2!1 0 +1 +2 +3!3!2!1 0 +1 +2 +3 Kinds of data analysis Exploratory (EDA) looking for patterns in data Statistical inferences from sample data Testing hypotheses Estimating parameters Building mathematical models of datasets Machine learning, data mining We will introduce hypothesis testing 21

The logic of hypothesis testing Example: toss a coin ten times, observe eight heads. Is the coin fair (i.e., what is it s long run behavior?) and what is your residual uncertainty? You say, If the coin were fair, then eight or more heads is pretty unlikely, so I think the coin isn t fair. Like proof by contradiction: Assert the opposite (the coin is fair) show that the sample result ( 8 heads) has low probability p, reject the assertion, with residual uncertainty related to p. Estimate p with a sampling distribution. Probability of a sample result under a null hypothesis If the coin were fair (p=.5, the null hypothesis) what is the probability distribution of r, the number of heads, obtained in N tosses of a fair coin? Get it analytically or estimate it by simulation (on a computer): Loop K times r := 0 // r is num.heads in N tosses Loop N times // simulate the tosses Generate a random 0 x 1.0 If x >= p increment r // p is the probability of a head Push r onto sampling_distribution Print sampling_distribution 22

Sampling distributions Frequency (K = 1000) 70 60 50 40 30 20 10 Probability of r = 8 or more heads in N = 10 tosses of a fair coin is 54 / 1000 =.054 0 1 2 3 4 5 6 7 8 9 10 Number of heads in 10 tosses This is the estimated sampling distribution of r under the null hypothesis that p =.5. The estimation is constructed by Monte Carlo sampling. The logic of hypothesis testing Establish a null hypothesis: H0: p =.5, the coin is fair Establish a statistic: r, the number of heads in N tosses Figure out the sampling distribution of r given H0 0 1 2 3 4 5 6 7 8 9 10 The sampling distribution will tell you the probability p of a result at least as extreme as your sample result, r = 8 If this probability is very low, reject H0 the null hypothesis 23

A common statistical test: The Z test for different means A sample N = 25 computer science students has mean IQ m=135. Are they smarter than average? Population mean is 100 with standard deviation 15 The null hypothesis, H0, is that the CS students are average, i.e., the mean IQ of the population of CS students is 100. What is the probability p of drawing the sample if H0 were true? If p small, then H0 probably false. Find the sampling distribution of the mean of a sample of size 25, from population with mean 100 Central Limit Theorem: The sampling distribution of the mean is given by the Central Limit Theorem The sampling distribution of the mean of samples of size N approaches a normal (Gaussian) distribution as N approaches infinity. If the samples are drawn from a population with mean µ and standard deviation σ, then the mean of the sampling distribution is µ and its standard deviation is σ x = σ N as N increases. These statements hold irrespective of the shape of the original distribution. 24

The sampling distribution for the CS student example If sample of N = 25 students were drawn from a population with mean 100 and standard deviation 15 (the null hypothesis) then the sampling distribution of the mean would asymptotically be normal with mean 100 and standard deviation 15 25 = 3 IQ: 100 135 The mean of the CS students falls nearly 12 standard deviations away from the mean of the sampling distribution Only ~1% of a normal distribution falls more than two standard deviations away from the mean The probability that the students are average is roughly zero The Z test Mean of sampling distribution Sample statistic Mean of sampling distribution Test statistic std=3 std=1.0 100 135 0 11.67 Z = x µ σ N = 135 100 15 25 = 35 3 = 11.67 25

Reject the null hypothesis? Commonly we reject the H0 when the probability of obtaining a sample statistic (e.g., mean = 135) given the null hypothesis is low, say <.05. A test statistic value, e.g. Z = 11.67, recodes the sample statistic (mean = 135) to make it easy to find the probability of sample statistic given H0. Reject the null hypothesis? We find the probabilities by looking them up in tables, or statistics packages provide them. For example, Pr(Z 1.67) =.05; Pr(Z 1.96) =.01. Pr(Z 11) is approximately zero, reject H0. 26

The t test Same logic as the Z test, but appropriate when population standard deviation is unknown, samples are small, etc. Sampling distribution is t, not normal, but approaches normal as samples size increases Test statistic has very similar form but probabilities of the test statistic are obtained by consulting tables of the t distribution, not the normal The t test Suppose N = 5 students have mean IQ = 135, std = 27 Estimate the standard deviation of sampling distribution using the sample standard deviation t = x µ 135 100 = s 27 N 5 = 35 12.1 = 2.89 Mean of sampling distribution Sample statistic Mean of sampling distribution Test statistic std=12.1 std=1.0 100 135 0 2.89 27

Summary of hypothesis testing H0 negates what you want to demonstrate; find probability p of sample statistic under H0 by comparing test statistic to sampling distribution; if probability is low, reject H0 with residual uncertainty proportional to p. Example: Want to demonstrate that CS graduate students are smarter than average. H0 is that they are average. t = 2.89, p.022 Have we proved CS students are smarter? NO! We have only shown that mean = 135 is unlikely if they aren t. We never prove what we want to demonstrate, we only reject H0, with residual uncertainty. And failing to reject H0 does not prove H0, either! 28

Confidence Intervals Just looking at a figure representing the mean values, we can not see if the differences are significant Confidence Intervals (σ known) Standard error from the sample standard deviation σ x = σ Population N 95 Percent confidence interval for normal distribution is about the mean x ±1.96 σ x 29

Confidence interval when (σ unknown) Standard error from the sample standard deviation ˆ σ x = s N 95 Percent confidence interval for t distribution (t 0.025 from a table) is x ± t 0.025 σ ˆ x Previous Example: Measuring the Dispersion of Data Quartiles, outliers and boxplots Quartiles: Q 1 (25 th percentile), Q 3 (75 th percentile) Inter-quartile range: IQR = Q 3 Q 1 Five number summary: min, Q 1, M, Q 3, max Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually Outlier: usually, a value higher/lower than 1.5 x IQR 30

Boxplot Analysis Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum Boxplot Data is represented with a box The ends of the box are at the first and third quartiles, i.e., the height of the box is IRQ The median is marked by a line within the box Whiskers: two lines outside the box extend to Minimum and Maximum Visualization of Data Dispersion: Boxplot Analysis 31

Visualizing one Variable Statistics for one Variable Joint Distributions Hypothesis testing Confidence Intervals Next: Noise Integration Data redundancy Feature selection 32