The Math Part of the Course

1 The Math Part of the Course

2 Measures of Central Tendency
Mode: The value with the highest frequency in a dataset
Median: The middle value in an ordered dataset
Mean: The arithmetic average of the dataset

When to use each:
Mode: Good for non-numerical (categorical) data and for identifying the most frequent value
Median: Use when an outlier may significantly influence the mean
Mean: Use when the data have no likely outliers

Measures of Dispersion
Range: The difference between the largest and smallest values in a dataset (describes the extremes around the typical case)
Standard deviation: Shows how much variation there is from the mean. A low standard deviation indicates that the data points tend to be very close to the mean, whereas a high standard deviation indicates that the data are spread out over a large range of values.

Population Standard Deviation Formula: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample Standard Deviation Formula: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Solving for population standard deviation:
Assume the dataset: 1, 8, 14, 29, 46
Step one: Solve for μ: μ = (1 + 8 + 14 + 29 + 46) / 5 = 98 / 5 = 19.6
Step two: Solve for Σ(x − μ)²: 345.96 + 134.56 + 31.36 + 88.36 + 696.96 = 1297.2
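The population standard deviation calculation for this dataset (mean, then sum of squared deviations, then the final square root) can be sketched in Python as below. This is only an illustrative sketch; the variable names are not from the slides, and the sample version is included purely for comparison.

```python
import math
import statistics

data = [1, 8, 14, 29, 46]

# Step one: the population mean (mu).
mu = sum(data) / len(data)

# Step two: the sum of squared deviations from the mean.
sum_sq = sum((x - mu) ** 2 for x in data)

# Step three: divide by N (population formula) and take the square root.
sigma = math.sqrt(sum_sq / len(data))

# The sample formula divides by n - 1 instead of N.
s = math.sqrt(sum_sq / (len(data) - 1))

print(f"mu = {mu:.1f}, sum of squared deviations = {sum_sq:.1f}")
print(f"population sd = {sigma:.2f}, sample sd = {s:.2f}")

# Cross-check both results against the standard library.
assert math.isclose(sigma, statistics.pstdev(data))
assert math.isclose(s, statistics.stdev(data))
```

Because the sample formula divides by n − 1, it always gives a slightly larger value than the population formula for the same data.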

3 Step three: Solve the final equation: σ = √(1297.2 / 5) = √259.44 ≈ 16.11

The Normal Distribution

4 Say μ = 2 and σ = 1/3 in a normal distribution. The graph of the normal distribution is as follows:

[Graph: normal distribution with μ = 2, σ = 1/3]

The following graph represents the same information, but it has been standardized so that μ = 0 and σ = 1:

[Graph: standard normal distribution with μ = 0, σ = 1]

The two graphs have different μ and σ, but have the same shape (if we rescale the axes). The distribution of the normal random variable Z with mean 0 and variance 1 (and therefore standard deviation 1) is called the standard normal distribution. Standardizing the distribution like this makes it much easier to calculate probabilities.

Considering our example above where μ = 2 and σ = 1/3:
One-half standard deviation = σ/2 = 1/6
Two standard deviations = 2σ = 2/3

5 If we have mean μ and standard deviation σ, then

$Z = \frac{X - \mu}{\sigma}$

Since all the values of X falling between x1 and x2 have corresponding Z values between z1 and z2, the area under the X curve between X = x1 and X = x2 equals the area under the Z curve between Z = z1 and Z = z2. Hence, we have the following equivalent probabilities:

P(x1 < X < x2) = P(z1 < Z < z2)

So ½ s.d. to 2 s.d. to the right of μ = 2 will be represented by the area from x1 = 2 + 1/6 = 13/6 ≈ 2.167 to x2 = 2 + 2/3 = 8/3 ≈ 2.667. This area is graphed as follows:

[Graph: normal curve with μ = 2, σ = 1/3, shaded from x1 = 13/6 to x2 = 8/3]

The area above is exactly the same as the area from z1 = 0.5 to z2 = 2 in the standard normal curve:

[Graph: standard normal curve with μ = 0, σ = 1, shaded from z1 = 0.5 to z2 = 2]
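The equivalence between the two shaded areas can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `z_score` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist

mu, sigma = 2, 1 / 3
X = NormalDist(mu, sigma)  # the original distribution
Z = NormalDist()           # standard normal: mean 0, standard deviation 1

# Half a standard deviation and two standard deviations to the right of the mean.
x1 = mu + 0.5 * sigma
x2 = mu + 2 * sigma

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

z1 = z_score(x1, mu, sigma)
z2 = z_score(x2, mu, sigma)

# The area between x1 and x2 under X equals the area between z1 and z2 under Z.
area_x = X.cdf(x2) - X.cdf(x1)
area_z = Z.cdf(z2) - Z.cdf(z1)

print(f"z1 = {z1:.2f}, z2 = {z2:.2f}")
print(f"area under X: {area_x:.4f}, area under Z: {area_z:.4f}")
```

Both areas come out the same, which is why a single table for Z is enough for any normal distribution.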

6 Finding the Area Under the Normal Curve

In the standard normal curve, the mean is 0 and the standard deviation is 1.

[Graph: standard normal curve with the area between z = 0 and z = 1.45 shaded in green]

The green shaded area in the diagram represents the area between the mean and 1.45 standard deviations to the right of the mean. The area of this shaded portion is 0.4265 (or 42.65% of the total area under the curve).

To get this area of 0.4265, we read down the left side of the z-table for the first 2 digits of the number of standard deviations (the whole number and the first digit after the decimal point, in this case 1.4), then we read across the table for the "0.05" part (the top row represents the 2nd decimal place of the value we are interested in).

We have: 1.4 (left column) + 0.05 (top row) = 1.45 standard deviations

You can see how to find the value of 0.4265 in the full z-table below. Follow the "1.4" row across and the "0.05" column down until they meet at 0.4265.

7 [Table: z-table of areas under the standard normal curve between 0 and z]
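As a rough illustration of how a z-table entry is produced, the sketch below rebuilds the "1.4" row from the standard normal cumulative distribution function. It assumes the table lists the area between 0 and z, as in the 1.45 example; `table_area` is a hypothetical helper name.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def table_area(z):
    """Area under the standard normal curve between 0 and z."""
    return Z.cdf(z) - 0.5

# Left column of the table: the whole number and first decimal place.
row = 1.4
# Top row of the table: the second decimal place, 0.00 through 0.09.
columns = [i / 100 for i in range(10)]

row_values = [round(table_area(row + c), 4) for c in columns]
for c, value in zip(columns, row_values):
    print(f"z = {row + c:.2f}  area = {value:.4f}")

# The entry where the 1.4 row meets the 0.05 column.
area_145 = round(table_area(1.45), 4)
```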

8 Find the area under the standard normal curve for the following, using the z-table. Sketch each one.
(a) between z = 0 and z = 0.78
(b) between z = and z = 0
(c) between z = and z = 0.78
(d) between z = 0.44 and z = 1.50
(e) to the right of z =

(a) (b)

9 (c) = (d) = (e) =

10 It was found that the mean length of 100 parts produced by a lathe was 20.05 mm with a standard deviation of 0.02 mm. Find the probability that a part selected at random would have a length
(a) between 20.03 mm and 20.08 mm
(b) between 20.06 mm and 20.07 mm
(c) less than 20.01 mm

X = length of part

(a) 20.03 is 1 standard deviation below the mean; 20.08 is 1.5 standard deviations above the mean.
P(20.03 < X < 20.08) = P(−1 < Z < 1.5) = 0.3413 + 0.4332 = 0.7745
So the probability is 0.7745.

(b) 20.06 is 0.5 standard deviations above the mean; 20.07 is 1 standard deviation above the mean.
P(20.06 < X < 20.07) = P(0.5 < Z < 1) = 0.3413 − 0.1915 = 0.1498
So the probability is 0.1498.

(c) 20.01 is 2 standard deviations below the mean.
P(X < 20.01) = P(Z < −2) = 0.5 − 0.4772 = 0.0228
So the probability is 0.0228.
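The three lathe probabilities can be cross-checked without a table. The sketch below assumes the mean of 20.05 mm implied by the worked z-values (20.03 is one standard deviation of 0.02 mm below it) and uses the standard library's `NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Part length in mm; the mean is inferred from the worked z-values (assumption).
length = NormalDist(mu=20.05, sigma=0.02)

# (a) P(20.03 < X < 20.08), i.e. P(-1 < Z < 1.5)
p_a = length.cdf(20.08) - length.cdf(20.03)

# (b) P(20.06 < X < 20.07), i.e. P(0.5 < Z < 1)
p_b = length.cdf(20.07) - length.cdf(20.06)

# (c) P(X < 20.01), i.e. P(Z < -2)
p_c = length.cdf(20.01)

for label, p in (("a", p_a), ("b", p_b), ("c", p_c)):
    print(f"({label}) {p:.4f}")
```

Working directly in millimetres gives the same answers as standardizing first, because `cdf` performs the standardization internally.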

11 A company pays its employees an average wage of $3.25 an hour with a standard deviation of 60 cents. If the wages are approximately normally distributed, determine
(a) the proportion of the workers getting wages between $2.75 and $3.69 an hour;
(b) the minimum wage of the highest 5%.

(a) X = wage
P(2.75 < X < 3.69) = P(−0.833 < Z < 0.733) ≈ 0.566
So about 56.6% of the workers have wages between $2.75 and $3.69 an hour.

(b) x = minimum wage of the highest 5%
The top 5% lies to the right of z = 1.645 (from table), so
(x − 3.25) / 0.6 = 1.645
x − 3.25 = 0.987
x = 4.237
So the minimum wage of the top 5% of salaries is $4.24.
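Part (b) runs the table lookup in reverse: it starts from an area and recovers a wage. A minimal sketch of both parts, using `NormalDist.inv_cdf` for the reverse lookup (variable names are illustrative):

```python
from statistics import NormalDist

# Hourly wage in dollars: mean 3.25, standard deviation 0.60.
wage = NormalDist(mu=3.25, sigma=0.60)

# (a) Proportion of workers earning between $2.75 and $3.69.
proportion = wage.cdf(3.69) - wage.cdf(2.75)

# (b) The wage with 95% of workers below it, i.e. the floor of the top 5%.
cutoff = wage.inv_cdf(0.95)

# The same cutoff expressed as a z-score.
z_cutoff = (cutoff - wage.mean) / wage.stdev

print(f"(a) proportion = {proportion:.3f}")
print(f"(b) z = {z_cutoff:.3f}, cutoff wage = ${cutoff:.2f}")
```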

12 The average life of a certain type of motor is 10 years, with a standard deviation of 2 years. If the manufacturer is willing to replace only 3% of the motors that fail, how long a guarantee should he offer? Assume that the lives of the motors follow a normal distribution.

X = life of motor
x = guarantee period

[Graph: normal curve with μ = 10, σ = 2, bottom 3% shaded]

We need to find the value (in years) that will give us the bottom 3% of the distribution. These are the motors that we are willing to replace under the guarantee.

P(X < x) = 0.03

The area that we can find from the z-table is 0.5 − 0.03 = 0.47.
The corresponding z-score is z = −1.88 (negative because x lies below the mean).
Since $Z = \frac{X - \mu}{\sigma}$, we can write: (x − 10) / 2 = −1.88
Solving this gives x = 10 − 3.76 = 6.24
So the guarantee period should be 6.24 years.
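The guarantee calculation can be written as a small function that separates the two steps of the solution: find the z-score for the bottom 3%, then convert it back to years. `guarantee_period` is a hypothetical name used only for this sketch.

```python
from statistics import NormalDist

def guarantee_period(mean_life, sd_life, replace_fraction):
    """Return (z, x): the z-score cutting off the bottom fraction, and the
    lifetime x such that P(X < x) equals that fraction."""
    z = NormalDist().inv_cdf(replace_fraction)  # z-score for the lower tail
    x = mean_life + z * sd_life                 # undo the standardization
    return z, x

z, x = guarantee_period(mean_life=10, sd_life=2, replace_fraction=0.03)
print(f"z = {z:.2f}, guarantee = {x:.2f} years")
```

The function returns a slightly more precise z than the table value of −1.88, so the result agrees with 6.24 years to two decimal places.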

13 Measures of Association

Monkey Favorability Rating by Age Group (age groups run from youngest to oldest; the oldest group is >24)

Rating    Youngest   Middle   >24
Low           4         6      18
Medium        8         9       9
High         20         8       3

Lambda: An asymmetrical measure of association: the value varies depending on which variable is independent. Ranges from 0 to 1.

Formula: $\lambda = \frac{E_1 - E_2}{E_1}$

1. Calculate Row and Column Totals

Rating    Youngest   Middle   >24   Total
Low           4         6      18     28
Medium        8         9       9     26
High         20         8       3     31
Total        32        23      30     85

2. Calculate E1: Find the mode of the dependent variable (the attribute that occurs the most often) and subtract its frequency from N (sample size).
E1 = N − ƒ of the mode
E1 = 85 − 31 = 54

3. Calculate E2: Find the mode in each column (i.e., category of the independent variable). Subtract each mode from its column (category) total and add the results together.
E2 = (Column total − Column mode) + (Column total − Column mode) + ... for all attributes of the independent variable.
E2 = (32 − 20) + (23 − 9) + (30 − 18) = 12 + 14 + 12 = 38

4. Find lambda.
λ = (54 − 38) / 54 = 16 / 54 ≈ 0.30

Taking the voter's age into account reduces the errors in predicting monkey favorability by about thirty percent.
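The lambda steps can be sketched in a few lines of Python. The cell counts below are an assumption: they are inferred from the column totals (32, 23, 30), the column modes (20, 9, 18) and the pair products quoted in the worked steps, with columns ordered from the youngest to the oldest age group.

```python
# Rows: favorability rating. Columns: age groups, youngest to oldest.
# Cell counts are inferred from the totals in the worked steps (assumption).
counts = {
    "Low":    [4, 6, 18],
    "Medium": [8, 9, 9],
    "High":   [20, 8, 3],
}

n = sum(sum(row) for row in counts.values())
row_totals = {rating: sum(row) for rating, row in counts.items()}
columns = list(zip(*counts.values()))  # one tuple per age group

# E1: errors made when always guessing the modal rating, ignoring age.
e1 = n - max(row_totals.values())

# E2: errors made when guessing the modal rating within each age group.
e2 = sum(sum(col) - max(col) for col in columns)

# Lambda: proportional reduction in error from knowing the age group.
lam = (e1 - e2) / e1

print(f"N = {n}, E1 = {e1}, E2 = {e2}, lambda = {lam:.2f}")
```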

14 Gamma: A measure of association using ordinal variables. It is a symmetrical measure, therefore you don't need to specify the IV and DV. It compares pairs of observations that are positive (going in the same direction) and negative (going in the opposite direction). Ranges from −1 to +1.

Formula: $\gamma = \frac{N_s - N_d}{N_s + N_d}$
Ns = Count of same order pairs (positive); Nd = Count of inverse order pairs (negative)

Rating    Youngest   Middle   >24
Low           4         6      18
Medium        8         9       9
High         20         8       3

To find Ns: Start with the top left cell and multiply its frequency by the sum of all cells that are lower and to the right of that cell. Repeat for every cell that has cells below and to its right, then add the products.
Ns = 4(9+9+8+3) + 8(8+3) + 6(9+3) + 9(3)
Ns = 116 + 88 + 72 + 27 = 303

To find Nd: Start with the top right cell and multiply its frequency by the sum of all cells that are lower and to the left of that cell. Repeat for every cell that has cells below and to its left, then add the products.
Nd = 18(8+9+20+8) + 9(8+20) + 6(8+20) + 9(20)
Nd = 810 + 252 + 168 + 180 = 1410

γ = (303 − 1410) / (303 + 1410) = −1107 / 1713 ≈ −0.65

Interpret: Using age to predict monkey favorability results in a proportional reduction of error of 65%. There is an inverse or negative relationship: as age increases, favorability of monkeys decreases.
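Counting same-order and inverse-order pairs by hand is error-prone, so a short loop is a useful check. This sketch again assumes the cell counts inferred from the totals and products in the worked steps; with those counts the four same-order products sum to 303.

```python
# Rows: Low, Medium, High rating. Columns: age groups, youngest to oldest.
# Cell counts are inferred from the worked steps (assumption).
table = [
    [4, 6, 18],
    [8, 9, 9],
    [20, 8, 3],
]

def gamma(table):
    """Return (Ns, Nd, gamma) for a table whose rows and columns are ordered."""
    n_rows, n_cols = len(table), len(table[0])
    ns = nd = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):       # only cells in lower rows
                for m in range(n_cols):
                    if m > j:                    # lower and to the right
                        ns += table[i][j] * table[k][m]
                    elif m < j:                  # lower and to the left
                        nd += table[i][j] * table[k][m]
    return ns, nd, (ns - nd) / (ns + nd)

ns, nd, g = gamma(table)
print(f"Ns = {ns}, Nd = {nd}, gamma = {g:.2f}")
```

Pairs that tie on either variable (same row or same column) are ignored, which is what distinguishes gamma from measures that adjust for ties.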

15 Chi-Square: Chi-square is a statistical test commonly used to compare observed data with data we would expect to obtain according to a specific hypothesis. For example, if, according to Mendel's laws, you expected 10 of 20 offspring from a cross to be male and the actual observed number was 8 males, then you might want to know about the "goodness of fit" between the observed and expected. Were the deviations (differences between observed and expected) the result of chance, or were they due to other factors? How much deviation can occur before you, the investigator, must conclude that something other than chance is at work, causing the observed to differ from the expected? The chi-square test is always testing what scientists call the null hypothesis, which states that there is no significant difference between the expected and observed result.

Rating    Youngest   Middle   >24
Low           4         6      18
Medium        8         9       9
High         20         8       3

Hypotheses:
H0: Age and favorability are independent
H1: Age and favorability are related

First step: Calculate the expected values of each cell. Our null hypothesis would be that age has no bearing on favorability of monkeys. As a result, the null hypothesis would expect the distribution of favorability to be the same within each age group.

To calculate the expected value of a cell: $E = \frac{\text{row total} \times \text{column total}}{N}$

Observed counts, with expected values in parentheses:

Rating    Youngest      Middle      >24
Low       4 (10.54)     6 (7.58)    18 (9.88)
Medium    8 (9.79)      9 (7.04)    9 (9.18)
High      20 (11.67)    8 (8.39)    3 (10.94)

Second step: Calculate the chi-square calculated value.

Formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$ = 23.66

16 Third step: Determine the critical value

[Table: chi-square critical values by significance level and df]

To use this table, we need to first determine our level of significance. For the purposes of this class, let's always work on the assumption that we want 95% confidence (α = 0.05). Next, we need to figure out our degrees of freedom (df):

df = (number of rows − 1) × (number of columns − 1) = (3 − 1)(3 − 1) = 4

As a result, our critical value for .05 at df = 4 is 9.49.

Fourth step: Compare the calculated chi-square value with the critical value.
Chi-square calculated: 23.66; chi-square critical: 9.49
Because the calculated value exceeds the critical value, we REJECT the null. We can conclude that monkey favorability and age are related in some way.
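The four chi-square steps can be reproduced with plain Python. As before, the observed counts are an assumption inferred from the totals and expected values quoted in the slides; the critical value 9.49 is taken from the text rather than computed.

```python
# Observed counts. Rows: Low, Medium, High. Columns: age groups, youngest to oldest.
# Inferred from the totals and expected values in the slides (assumption).
observed = [
    [4, 6, 18],
    [8, 9, 9],
    [20, 8, 3],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# First step: expected count = row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Second step: chi-square = sum over all cells of (O - E)^2 / E.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Third step: degrees of freedom and the critical value quoted in the text.
df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_05_DF4 = 9.49

# Fourth step: compare the calculated value with the critical value.
reject_null = chi_square > CRITICAL_05_DF4

print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_null}")
```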

17 Two Sample T-Test

Purpose: To compare responses from two groups. These two groups can come from different experimental treatments, or different natural "populations".

Assumptions:
each group is considered to be a sample from a distinct population
the responses in each group are independent of those in the other group
the distributions of the variable of interest are normal

In a test of the hypothesis that females smile at others more than males, females and males were videotaped while interacting and the number of smiles emitted was recorded. Using the following number of smiles in the 5-minute interaction, test the null hypothesis that there are no gender differences in the number of smiles.

[Table: number of smiles recorded for males and females]

Step One: Calculate the Means of Each Group (Males, Females)
Step Two: Solve for the Variances of the Two Samples
Step Three: Solve for t

18 Step Four: Compare Calculated t-value with Critical t-value

To determine the critical t-value, we first need to determine the degrees of freedom (df). With t-tests, df = n1 + n2 − 2, so here df = 8.

[Table: critical t-values by df and confidence level (50%, 60%, 70%, 80%, 90%, 95%, 98%, 99%, 99.5%, 99.8%, 99.9%)]

At 95% confidence (α = 0.05), the critical t-value is consequently 2.306.

t-score calculated: 2.98; t-score critical: 2.306
Because the calculated value exceeds the critical value, we REJECT the null. We can conclude that gender and smiling are related in some way.
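The four t-test steps can be sketched as follows. Two things here are assumptions rather than content from the slides: the smile counts are hypothetical illustration data (the original numbers are not in the transcript), and the pooled-variance form of the t statistic is used because it matches df = n1 + n2 − 2. The critical value 2.306 is the standard two-tailed table value for df = 8 at α = 0.05.

```python
import math
from statistics import mean, variance

def pooled_t(sample_1, sample_2):
    """Two-sample t statistic with pooled variance; returns (t, df)."""
    n1, n2 = len(sample_1), len(sample_2)
    # Step one: the mean of each group.
    m1, m2 = mean(sample_1), mean(sample_2)
    # Step two: the sample variance of each group (n - 1 denominator).
    v1, v2 = variance(sample_1), variance(sample_2)
    # Step three: pool the variances and solve for t.
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / std_error, df

# Hypothetical smile counts, NOT the data from the slides.
females = [9, 7, 8, 10, 6]
males = [5, 4, 6, 3, 7]

t_value, df = pooled_t(females, males)

# Step four: compare with the two-tailed critical value for df = 8, alpha = 0.05.
T_CRITICAL = 2.306
reject_null = abs(t_value) > T_CRITICAL

print(f"t = {t_value:.2f}, df = {df}, reject H0: {reject_null}")
```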

19 Regression

Regression is a tool for describing how, how strongly, and under what conditions an independent and dependent variable are associated. It can be used to make causal inferences. The ordinary least squares regression formula is Y = a + bX and describes a straight line:
Y = dependent variable
a = y-intercept (or constant)
b = slope or coefficient
X = independent variable
If b is positive, the relationship is positive; if b is negative, the relationship is negative.

Interpreting Regression

Data are gathered on 40 countries to study variations in birth rate. Consider this equation:
Y = 32 − 0.0018X
r = −.78
Se b =
Where: Y = birth rate per 1000 population and X = per capita income

Identify the following: independent and dependent variables; regression coefficient; the constant; the correlation coefficient; the coefficient of determination; the standard error of the slope.

IV: Per capita income
DV: Birth rate per 1000 population
Regression coefficient: −0.0018 (for every drop of 1 in per capita income, we see an increase of .0018 in birth rate per 1000 population)
Constant: 32 (the predicted value of Y would be 32 if X = 0)
Correlation coefficient: −.78 (there is a strong, negative relationship)
Coefficient of determination: .6084 (−.78 × −.78)
Standard error of the slope:

What percent variation in birth rate is associated with per capita income? 60.84% (r² = −.78 × −.78 = .6084)
What is the direction of the relationship? Negative
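A small sketch of how the fitted line is read and used. The constant (32), the slope (−0.0018) and r (−.78) come from the answers on this slide; the function name and the example income of 2000 are illustrative choices.

```python
# Values taken from the interpretation on the slide.
a = 32.0      # constant: predicted birth rate when per capita income is 0
b = -0.0018   # slope: change in birth rate per one-unit rise in income
r = -0.78     # correlation coefficient

def predict_birth_rate(per_capita_income):
    """Predicted births per 1000 population from Y = a + bX."""
    return a + b * per_capita_income

# Coefficient of determination: share of variation in Y associated with X.
r_squared = r ** 2

# The sign of the slope gives the direction of the relationship.
direction = "positive" if b > 0 else "negative"

estimate = predict_birth_rate(2000)
print(f"r squared = {r_squared:.4f} ({r_squared:.2%} of variation)")
print(f"direction: {direction}")
print(f"predicted birth rate at income 2000: {estimate:.1f} per 1000")
```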

20 Calculate the t-ratio (t = b / Se b). What does this tell you? It allows us to test the hypothesis that b = 0. df = 38 (n − 2). The critical t-value at 95% confidence and df = 38 is
As a result, we REJECT the null. We can conclude that per capita income and birth rate are related in some way.

A country has a per capita income of $2000. Estimate its birth rate.
Y = 32 − 0.0018X
Y = 32 − 0.0018(2000)
Y = 32 − 3.6
Y = 28.4 births per 1000 population

Interpreting Multiple Regression

[Table: Model Summary. Columns: Model, R, R Square, Adjusted R Square, Std. Error of the Estimate]
a. Predictors: (Constant), ZZ11. PRE IWR OBS: R gender, Y6. Employment status, J1. Party ID: Does R think of self as Dem, Rep, Ind or what, Y1x. Age of Respondent, Y3. Highest grade of school or year of college R completed, C5ax. SUMMARY: R better/worse off than 1 year ago, F1ax. SUMMARY: economy better worse in last year, Y21a. Household income

R-Square is the proportion of variance in the dependent variable which can be predicted from the independent variables. This value indicates that 41% of the variance in the dependent variable can be predicted from the independent variables. Note that this is an overall measure of the strength of association, and does not reflect the extent to which any particular independent variable is associated with the dependent variable.

21 [Table: ANOVA. Columns: Model, Sum of Squares, df, Mean Square, F, Sig. Rows: Regression, Residual, Total]
a. Predictors: (Constant), ZZ11. PRE IWR OBS: R gender, Y6. Employment status, J1. Party ID: Does R think of self as Dem, Rep, Ind or what, Y1x. Age of Respondent, Y3. Highest grade of school or year of college R completed, C5ax. SUMMARY: R better/worse off than 1 year ago, F1ax. SUMMARY: economy better worse in last year, Y21a. Household income
b. Dependent Variable: B1j. Feeling Thermometer: Republican Party

The F value is the Mean Square Regression divided by the Mean Square Residual. The p value associated with this F value is very small (0.0000). These values are used to answer the question "Do the independent variables reliably predict the dependent variable?". The p value is compared to your alpha level (typically 0.05) and, if smaller, you can conclude "Yes, the independent variables reliably predict the dependent variable". You could say that the group of independent variables can be used to reliably predict the dependent variable. If the p value were greater than 0.05, you would say that the group of independent variables does not show a significant relationship with the dependent variable, or that the group of independent variables does not reliably predict the dependent variable.

Note that this is an overall significance test assessing whether the group of independent variables, when used together, reliably predicts the dependent variable. It does not address the ability of any particular independent variable to predict the dependent variable. The ability of each individual independent variable to predict the dependent variable is addressed in the table below, where each of the individual variables is listed.

22 [Table: Coefficients. Columns: Model, Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), t, Sig. Rows: (Constant); C5ax. SUMMARY: R better/worse off than 1 year ago; F1ax. SUMMARY: economy better worse in last year; J1. Party ID: Does R think of self as Dem, Rep, Ind or what; Y1x. Age of Respondent; Y3. Highest grade of school or year of college R completed; Y6. Employment status; Y21a. Household income; ZZ11. PRE IWR OBS: R gender]
a. Dependent Variable: B1j. Feeling Thermometer: Republican Party

Feeling thermometer Republican Party = constant + b1(Better/Worse Off) + b2(Economy) + b3(PartyID) + b4(Age) + b5(Education) + b6(Unemployed) + b7(Income) + b8(Gender)

B (unstandardized coefficients): These estimates tell you about the relationship between the independent variables and the dependent variable. They give the amount of increase in Feeling Thermometer Republican that would be predicted by a 1 unit increase in the predictor, holding the other predictors constant.

Beta (standardized coefficients): These are the values for a regression equation if all of the variables are standardized to have a mean of zero and a standard deviation of one. Because the standardized variables are all expressed in the same units, the magnitudes of the standardized coefficients indicate which variables have the greatest effects on the predicted value. This is not necessarily true of the unstandardized coefficients. Because the magnitudes of the unstandardized coefficients can largely depend on the units of the variables, the effects of the variable on the prediction can be difficult to gauge. While the standardized coefficients may vary significantly from the unstandardized coefficients in magnitude, the sign (positive or negative) of the coefficients is unchanged.

t and Sig.: These columns provide the t value and 2-tailed p value used in testing the null hypothesis that the coefficient is 0. Coefficients having p values less than alpha are significant. For example, if you chose alpha to be 0.05, coefficients having a p value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0).
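To illustrate why unstandardized coefficients depend on units while standardized ones do not, the sketch below fits a one-predictor regression twice, with income measured in dollars and in thousands of dollars. All data are hypothetical and the helper names are illustrative; this is not the survey data behind the table.

```python
from statistics import mean, stdev

def ols_slope(x, y):
    """Slope b of the least-squares line Y = a + bX."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical data: household income and a 0-100 feeling thermometer rating.
income_dollars = [1000, 2000, 3000, 4000, 5000]
income_thousands = [v / 1000 for v in income_dollars]
rating = [52, 55, 61, 60, 67]

# Unstandardized slopes change when the unit of income changes.
b_dollars = ols_slope(income_dollars, rating)
b_thousands = ols_slope(income_thousands, rating)

# Standardized slopes (betas) are the same whichever unit was used.
beta_dollars = ols_slope(standardize(income_dollars), standardize(rating))
beta_thousands = ols_slope(standardize(income_thousands), standardize(rating))

print(f"B per dollar = {b_dollars:.4f}, B per thousand dollars = {b_thousands:.2f}")
print(f"Beta = {beta_dollars:.3f} (dollars), {beta_thousands:.3f} (thousands)")
```

The unstandardized slope grows by a factor of 1000 when the unit changes, while the beta and the sign of the relationship stay the same.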