Mean, Median, Standard Deviation Prof. McGahagan Stat 1040



Mean = arithmetic average: add all the values and divide by the number of values.

Median = 50th percentile: sort the data and choose the middle value (or average the two middle values).

In a perfectly symmetric distribution, the mean and median will be the same. In a right-skewed distribution, the mean will be higher than the median. (Example: an income distribution with 9 individuals making $40,000 a year plus one individual making $10,000,000 a year. Median = $40,000; Mean = $10,360,000 / 10 = $1,036,000.)

The median is more robust than the mean: it is not as sensitive to outliers. (In the above example, suppose someone who makes $50,000,000 is added to the group. The median income remains $40,000; the mean income is $60,360,000 / 11 = $5,487,273.)

Deviation = difference of an individual value from the mean. We want some way of summarizing the average deviation. There are three candidates:

Mean absolute deviation = add the absolute values of all deviations from the mean and take their mean.
Median absolute deviation = the median value of the absolute deviations from the median.
Standard deviation = root mean square deviation: take the square root of the average of the squared deviations.

Example: a wage distribution with 4 individuals. Arthur makes $20 an hour, Beth makes $300 an hour, Charles makes $40 an hour, and Diana makes $40 an hour.

Note that if you asked the computer to draw a histogram of this data, you would get exactly the same histogram as we did with the example in our Data and Histograms handout. Try it; the commands would be:

wages <- c(20, 300, 40, 40)
hist(wages, breaks = c(0, 30, 50, 500))

And you can find the results given below with the commands:

mean(wages)
median(wages)
mean(abs(wages - mean(wages)))
median(abs(wages - median(wages)))

Mean = $400 / 4 = $100
Median = middle observation, or average of the two middle observations. Sorted data = ($20, $40, $40, $300); Median = ($40 + $40) / 2 = $40

The mean and median have some interesting properties, which can be explained after giving a little thought to a bet we might propose: guess the hourly wage of the next person who walks through the door (we know Arthur, Beth, Charles and Diana are in a meeting, and we are betting on who will walk out first). You are risk neutral, so you will take any bet which has a positive expected value. If you agree to play the game, I will give you $80 before you guess. However, you will lose the absolute value of the difference between your guess and the hourly wage of the person who actually walks through the door next.

If you guess $50 and Diana comes in, your payoff is $80 - |$40 - $50| = $80 - $10 = +$70.
If you guess $50 and Beth comes in, your payoff is $80 - |$300 - $50| = $80 - $250 = -$170.
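For reference, here is a minimal R sketch (it assumes only the wages vector defined above) that collects the three candidate measures of average deviation in one place; the values match the computations worked out later in this handout:

wages <- c(20, 300, 40, 40)
mean(abs(wages - mean(wages)))        # mean absolute deviation: 100
median(abs(wages - median(wages)))    # median absolute deviation: 10
sqrt(mean((wages - mean(wages))^2))   # standard deviation (population form): 115.7584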

Should you play the game at all? What guess should you make if you play? Consider the mean and median as logical guesses. You might also consider guessing Beth's wage, to avoid a really big miss if Beth is the one who walks in. To answer the question, you must calculate the expected value of the game for each guess.

For example, if you guess Beth's wage of $300, the expected value will be the average of the four payoffs, which we assume are equally likely:
If Arthur walks in, your payoff is $80 - |$300 - $20| = $80 - $280 = -$200
If Beth walks in, your payoff is $80 - |$300 - $300| = $80 - $0 = +$80
If Charles walks in, your payoff is $80 - |$300 - $40| = $80 - $260 = -$180
If Diana walks in, your payoff is $80 - |$300 - $40| = $80 - $260 = -$180
The expected value of the game is therefore (-200 + 80 - 180 - 180) / 4 = -$480 / 4 = -$120.

Would another guess make it profitable to play the game? First, try guessing the mean value of $100:
If Arthur walks in, your payoff is $80 - |$100 - $20| = $80 - $80 = $0
If Beth walks in, your payoff is $80 - |$100 - $300| = $80 - $200 = -$120
If Charles walks in, your payoff is $80 - |$100 - $40| = $80 - $60 = +$20
If Diana walks in, your payoff is $80 - |$100 - $40| = $80 - $60 = +$20
The expected value of the game is therefore (0 - 120 + 20 + 20) / 4 = -$80 / 4 = -$20. It is not worthwhile to play and guess the mean.

How about guessing the median value of $40?
If Arthur walks in, your payoff is $80 - |$40 - $20| = $80 - $20 = +$60
If Beth walks in, your payoff is $80 - |$40 - $300| = $80 - $260 = -$180
If Charles walks in, your payoff is $80 - |$40 - $40| = $80 - $0 = +$80
If Diana walks in, your payoff is $80 - |$40 - $40| = $80 - $0 = +$80
The expected value of the game is therefore (60 - 180 + 80 + 80) / 4 = $40 / 4 = +$10. If you are risk neutral, it is worthwhile to play the game and guess the median income: the expected value of the game for you is positive.

Deviations from the median result in a smaller sum of absolute errors, and hence a smaller average absolute error, than deviations from the mean. It clearly sounds like a good idea to minimize the sum of errors, whether in this betting game or in (say) estimating your minimum-cost level of output. So why bother with the mean? Isn't the median the better of the two?

Answer: not always. If you think that avoiding a big loss is also an important consideration, you might look more kindly on the mean. Even though the expected value is negative, you lose money in only one of the four possible events, and you take a smaller maximum loss ($120 rather than $180) than if you had guessed the median.
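If you want to check these expected values numerically, a short R sketch along the lines of the payoff function defined in Appendix 1 below reproduces them:

payoff <- function(guess, payment = 80, salaries = c(20, 300, 40, 40)) {
  mean(payment - abs(salaries - guess))   # average payoff over the four equally likely arrivals
}
payoff(300)   # guess Beth's wage: -120
payoff(100)   # guess the mean:    -20
payoff(40)    # guess the median:  +10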

The mean minimizes the sum of squared deviations, and so avoids the really big errors one might make with the median. Often small errors are not that bad, while large errors are really terrible.

Example: estimating cost functions. You should remember from microeconomics that the average cost curve is roughly U-shaped. Textbook average cost curves look nice and smooth, but estimating real ones from data is likely to produce a lot of points scattered about a U-shaped curve. Suppose you are interested in choosing the quantity of output which will minimize your firm's average cost, and the true average cost curve is AC = 1000 + (q - 500)². The minimum average cost will be at a quantity of 500, as is clear without calculus. (A more realistic AC curve would require some calculus; more realistic economics would note that firms should be interested in maximizing profit rather than minimizing average cost, but simplicity is better than realism here.) Here's the curve in red, with a horizontal blue line at the minimum cost of 1000.
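The original figure is not reproduced here, but a short R sketch along these lines recreates it (the plotting range of 0 to 1000 is an assumption; any range around q = 500 will do):

q <- seq(0, 1000)                 # candidate output levels
AC <- 1000 + (q - 500)^2          # the true average cost curve
plot(q, AC, type = "l", col = "red", lwd = 3,
     xlab = "Quantity", ylab = "Average cost")
abline(h = 1000, col = "blue")    # horizontal line at the minimum cost of 1000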

The important point here is not so much the optimum quantity of 500, but the fact that a small mistake in choosing that optimum quantity will not cost you very much, while a large mistake will. Let's define the COE (Cost of Error) as COE(Q) = AC(Q) - AC(500) = (500 - Q)², so that we can compute:

COE(500) = (500 - 500)² = 0
COE(501) = (500 - 501)² = 1
COE(510) = (500 - 510)² = 100
COE(600) = (500 - 600)² = 10,000

An error of 10 is not 10 times as costly as an error of 1, but 100 times as costly; an error of 100 is not 100 times as costly, but 10,000 times as costly. The cost of the error increases with the square of the error. So, if our real target is the cost of the error rather than the arithmetic magnitude of the error, we want an estimator which penalizes the squared error, not the absolute value of the error.

Other examples:

Medicine: giving a patient a little more or a little less of a drug may be quite harmless, but cutting the dose in half or doubling it may kill the patient.

Grading: a history teacher asking when Thomas Jefferson and John Adams died (both died the same day, July 4, 1826) might not penalize you very much if you said July 4, 1825, but might flunk you for the course if you said October 12, 1492.

Astronomy: if you were trying to estimate a missing asteroid's course on the basis of a few observations, a small error in the estimate might still leave the missing asteroid in the viewing field of your telescope, but a larger error would leave it invisible. This was exactly the problem Carl Friedrich Gauss was trying to solve when he published his method of least squares in 1809.

Does the mean value minimize the sum of squared errors in our example? Make your guess for the best estimator of average income in the list (20, 40, 40, 300).

Median = 40: Errors: (20 - 40), (40 - 40), (40 - 40), (300 - 40). Squared errors: 400, 0, 0, and 260 × 260 = 67,600. Sum of squared errors = 68,000.

Mean = 100: Errors: (20 - 100), (40 - 100), (40 - 100), (300 - 100). Squared errors: 80 × 80 = 6,400; 60 × 60 = 3,600; 60 × 60 = 3,600; 200 × 200 = 40,000. Sum of squared errors = 53,600.

To examine the other possibilities, see the Appendix. A calculus-based proof is also presented in the appendix. It will not be on the exam.
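A two-line R check (using the wages vector from earlier) confirms the two sums of squared errors:

wages <- c(20, 300, 40, 40)
sum((wages - median(wages))^2)   # guess the median: 68,000
sum((wages - mean(wages))^2)     # guess the mean:   53,600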

Computational details: mean/median absolute deviation and standard deviation.

Mean absolute deviation: ( |20 - 100| + |40 - 100| + |40 - 100| + |300 - 100| ) / 4. Note that the vertical lines are the math symbol for "take the absolute value." = (80 + 60 + 60 + 200) / 4 = $400 / 4 = $100

Median absolute deviation: median( |20 - 40|, |40 - 40|, |40 - 40|, |300 - 40| ) = median(20, 0, 0, 260) = ($0 + $20) / 2 = $10

Note: in R, the median absolute deviation (or MAD) is defined with an adjustment constant chosen so that it is in a predictable relation to a normal distribution. To get our results, you would set the adjustment constant to 1, with the command mad(wages, constant = 1).

Standard deviation:
First, compute the sum of squared deviations:
(20 - 100)² + (40 - 100)² + (40 - 100)² + (300 - 100)² = (-80)² + (-60)² + (-60)² + 200² = 6,400 + 3,600 + 3,600 + 40,000 = 53,600
Second, find the mean squared deviation = sum of squared deviations / number of observations. The mean squared deviation is also called the variance. Variance = 53,600 / 4 = 13,400
The standard deviation in our example is sqrt(13,400) = 115.7584

Note well: the text and Stark use what is called the population standard deviation = sqrt(sum of squared deviations / N) in their calculations throughout the book. There is a (very technical) argument that a slight change to the formula gives a more reliable estimate of the variance when dealing with small samples, and the sample standard deviation = sqrt(sum of squared deviations / (N - 1)) is used by some texts and computer programs (including R) when computing standard deviations. The very slight technical advantage of the more complicated formula in some situations does not, in my mind, justify inflicting it on beginning students. The basic idea that the standard deviation is one of three possible ways to measure the average deviation is watered down by forcing the formula on students, and it will not be used at all in this course. But keep it in mind if your calculator or computer program gives you a slightly different standard deviation than you calculated. In R, you can define

stdev <- function(x) sqrt(mean((x - mean(x))^2))

to get the text formula; the R command sd(x) will give the sample standard deviation.
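As a quick sanity check, a small R sketch (using the stdev definition just given) shows the difference between the two formulas on our wage data; the sample SD of roughly 133.67 is my own calculation of sqrt(53,600 / 3):

wages <- c(20, 300, 40, 40)
stdev <- function(x) sqrt(mean((x - mean(x))^2))
stdev(wages)               # population SD used in this course: 115.7584
sd(wages)                  # R's sample SD, divides by N - 1:   about 133.67
mad(wages, constant = 1)   # median absolute deviation:         10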

Properties of the mean and standard deviation:

Both the text and Stark explore at length what happens to the mean and SD if you add, subtract, multiply and/or divide a list of numbers by another number. Such a transformation of a list of numbers by simple arithmetic operations (not including squares, square roots, logarithms, or trigonometric functions) is known as a linear transformation or (in Stark's online text) as an affine transformation (a term I will not use on exams). Consider a few examples (and run the numbers through your calculator to confirm my statements and to get some practice in calculating):

Consider the list of numbers x = (3, 10, 7, 2, 3). Confirm that mean(x) = 5 and stdev(x) = 3.03315 (and that R's sd(x) = 3.391165).

Add 3 to each number in x to get another list, y = (6, 13, 10, 5, 6). Confirm that mean(y) = 8 and stdev(y) = 3.03315 (and R's sd(y) also remains unchanged).

Multiply each number in the original list x by -2 to get the list z = (-6, -20, -14, -4, -6). Confirm that mean(z) = -10 and stdev(z) = 6.0663 (and that sd(z) = 2 * sd(x)).

Answer the following questions (true/false, but explain any false statement or part of a statement). If necessary, work your own example to confirm or refute the statements. Notice that refutation only requires a single counterexample, but a confirmation could be accidental, so you should try two or three examples to check that a statement works for both fractions and whole numbers, positive and negative numbers. A short R check of the worked examples appears after the list.

1. Multiplying any list by a positive number k will lead to a list with a mean and SD k times greater.
2. Subtracting 10,000 from any list will decrease the mean by 10,000 but the SD by only 100, since in calculating the SD we take the square root.
3. Adding any number, whether positive or negative, to a list will not change the SD of a list.
4. Multiplying any list by a negative number will lead to a negative mean and negative standard deviation.
5. Squaring every number in a list will square both the mean and the SD.
6. Squaring every number in a list will lead to a larger mean and SD, but not necessarily to an exact square of either mean or SD.
7. If the standard deviation of a list is zero, every item on that list must be zero.
8. The standard deviation of the list (-3, -2, -1, 0, 1, 2, 3) is zero.

See the end of the appendix for answers.
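Here is a minimal R sketch confirming the three worked examples above (it re-defines the stdev function from the computational-details section so it runs on its own):

x <- c(3, 10, 7, 2, 3)
stdev <- function(v) sqrt(mean((v - mean(v))^2))
c(mean(x), stdev(x), sd(x))   # 5, 3.03315, 3.391165
y <- x + 3
c(mean(y), stdev(y))          # 8, 3.03315 (SD unchanged by adding a constant)
z <- -2 * x
c(mean(z), stdev(z))          # -10, 6.0663 (SD scaled by |-2| = 2)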

Appendix

1. Demonstration that the median minimizes the mean absolute deviation among all reasonable guesses.

Would any other strategy be better? We can see this graphically by looking at the outcome of all guesses between $1 and $100. Note: lines beginning with a prompt (>) indicate what you are supposed to type (don't type the prompt). Press return where I start a new line, and the computer will provide a + to indicate that you are continuing the line.

First, define an R function:

> payoff <- function(guess, payment = 80, salaries = c(20, 300, 40, 40)) {
+   mean(payment - abs(salaries - guess)) }
# defines payoff as a function of three arguments:
#   guess, which you must supply
#   payment, which defaults to 80
#   salaries, which defaults to the list of all the salaries we assumed

Then set guesses to all integers from 1 to 100:

> guesses <- seq(1, 100)

The computation is done simply:

> payoffs <- sapply(guesses, payoff)
# sapply is short for "simple apply": we apply the payoff function to all the guesses we made

> plot(guesses, payoffs, type = "l", col = "red", lwd = 3)   # draws the plot below
> abline(v = 40, col = "blue")                               # adds a vertical blue line at the median
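If you have just run the commands above, a quick check confirms where the curve peaks (this uses the guesses and payoffs vectors already in memory):

> guesses[which.max(payoffs)]   # the guess with the highest expected payoff: 40, the median
> max(payoffs)                  # the best attainable expected payoff: 10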

2. Demonstration that the mean minimizes the sum of squared errors

First, define two R functions, square and sse:

> square <- function(x) (x * x)
> sse <- function(guess, salaries = c(20, 300, 40, 40)) {
+   sum(square(salaries - guess)) }

Second, set the range of your guesses:

> guesses <- seq(40, 300)
# gives you the sequence of integers 40, 41, ..., 300

Confirm that the mean beats the median:

> sse(40)    # we guess the median, and should get 68,000
> sse(100)   # we guess the mean, and should get 53,600

Compute and save the sums of squared errors (note that R is case sensitive: SSEs is not the same as sses):

> SSEs <- sapply(guesses, sse)

Plot the results:

> plot(guesses, SSEs, col = "red", type = "l", lwd = 2, main = "Sum of Squared Errors")
> abline(h = 53600, col = "blue")   # horizontal line at the minimum sum of squared errors
> abline(v = 100)                   # vertical line at a guess of the mean

Change your guesses to seq(50, 150) to focus in on the critical area, or calculate sse(99.9), sse(100), and sse(100.1) to see that this is the real minimum.
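For the record, the suggested spot check (using the sse function defined above) gives:

> sse(99.9)    # 53,600.04
> sse(100)     # 53,600
> sse(100.1)   # 53,600.04
# the sum of squared errors rises as soon as the guess moves away from the mean in either direction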

3. Calculus-based proof that the mean minimizes the sum of squared errors

Start with a list of numbers x, with N numbers in the list. The SSE is defined as Σ (x - b)², where Σ is the summation sign (over all N observations) and b is any guess. We want the guess that minimizes this value, so we take the derivative of the SSE with respect to b and set it equal to zero.

Derivative of SSE with respect to b = Σ -2 (x - b) = -2 Σ x + 2 Σ b = -2 Σ x + 2 N b
(note that the sum of N b's is N times b)

Set the derivative equal to zero and solve: divide through by 2, so -Σ x + N b = 0; add Σ x to each side of the equation, so N b = Σ x; divide both sides by N, so b = Σ x / N.

Which is of course the arithmetic mean, but note that we were looking for any expression that would minimize the sum of squared errors.

4. Answers to the true/false questions on linear transformations and their effect on mean and SD

1. Multiplying any list by a positive number k will lead to a list with a mean and SD k times greater.
True, but note that if k is a fraction, "k times greater" means smaller in absolute terms.

2. Subtracting 10,000 from any list will decrease the mean by 10,000 but the SD by only 100, since in calculating the SD we take the square root.
False. The mean does decrease by 10,000, but the SD remains unchanged by adding or subtracting any number.

3. Adding any number, whether positive or negative, to a list will not change the SD of a list.
True. The location of all the numbers changes, but their spread does not.

4. Multiplying any list by a negative number will lead to a negative mean and negative standard deviation.
False. If you begin with a list of all negative numbers, multiplying by a negative number will give a list of all positive numbers, and hence a positive mean. And the standard deviation can never be negative: squaring all the deviations results in a list of non-negative numbers.

5. Squaring every number in a list will square both the mean and the SD.
False. We are no longer dealing with a linear transformation, so there is no simple rule for what happens to the mean and SD.

6. Squaring every number in a list will lead to a larger mean and SD, but not necessarily to an exact square of either mean or SD.
False. Take a list of positive fractions, say x = (0.3, 0.4, 0.5). The mean is 0.4 and the SD is 0.08165. The squared list is z = (0.09, 0.16, 0.25); its mean is 0.1667 and its SD is 0.06549. Both the mean and SD are smaller after squaring.

7. If the SD of a list is zero, every item in the list must be zero.
False. Every item must be the SAME, but not necessarily zero. Compute the SD of (8, 8, 8) if you don't believe this.

8. The standard deviation of the list (-3, -2, -1, 0, 1, 2, 3) is zero.
False. The mean is zero, but the SD will be non-zero whenever the items are not all the same. In this case, it is sqrt(28 / 7) = 2.
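As one last numerical check of the calculus result, a short R sketch (re-defining the sse function from Appendix 2, and using R's built-in one-dimensional optimizer over an interval I chose to cover the data) confirms that the minimizer of the sum of squared errors is the mean:

> sse <- function(guess, salaries = c(20, 300, 40, 40)) sum((salaries - guess)^2)
> optimize(sse, interval = c(0, 400))
# $minimum should be (very close to) 100, the mean of the salaries
# $objective should be 53,600, the minimum sum of squared errors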