Chapter 7: Scatter Plots, Association, and Correlation

Scatter plots compare two quantitative variables in the same way that segmented bar charts compare two categorical variables. A scatter plot lets you observe the relationship between the two variables and see whether there is an association.
When we describe scatter plots, we describe their...
-Direction
-Form
-Strength
-Outliers
Direction:
-When the pattern of the data runs from the bottom left to the upper right, we say it has a POSITIVE direction
-When the pattern runs from the upper left to the bottom right, we say it has a NEGATIVE direction
-Relate this to the sign of the slope of a line
Form:
-When the pattern of the data roughly follows a straight path, we say the data is LINEAR
-If the data curves, it is NON-LINEAR
-If the data curves but maintains the same direction, it is possible to transform the data to make it more linear
-If the data changes direction, there is not much that can be done
Strength:
-If the data is clustered closely together and a pattern is easy to see, the association is stronger
-If the data is spread apart and a pattern is difficult to see, or (in the extreme case) the data looks like a cloud with no discernible pattern, the association is weaker
Outliers:
-Data that does not fit the overall pattern
-Unusual clusters or subgroups should also raise concerns
Converting the units of either variable in a scatter plot changes the numbers on the axes but does not change the direction, form, or strength of the pattern.
When describing which variables are on the horizontal axis and the vertical axis, we do NOT call them the "x" and "y" variables (or axes).
-The PREDICTOR or EXPLANATORY variable is placed on the horizontal axis
-The RESPONSE variable is placed on the vertical axis
You determine which is which by asking which variable is more likely to affect the other. Also use clues from the context of the scenario's description (the W's).
Which is the predictor and which is the response variable?
Ex1: When comparing the price of a new product to the number of units sold
Ex2: When comparing the number of cars damaged on a street to the number of potholes on the street
Ex3: When comparing the fat content of a candy bar to its sugar content
*If the roles are unclear, select whichever assignment of predictor and response you feel is most likely...
*We avoid the terms "x", "y", "dependent variable", and "independent variable" because we never want to give the impression that there is a CAUSE and EFFECT relationship rather than an association between them.
Correlation:
CORRELATION is a numerical measure of the strength and direction of the linear relationship between two quantitative variables. Before calculating such a value you must make sure that the following conditions are met.
1. The data must be quantitative
2. The scatter plot looks nearly linear
3. There are no outliers
**If there is an outlier, best practice is to calculate the value both with and without the outlier
The value we calculate is called the CORRELATION COEFFICIENT and is denoted with a lowercase "r"
-It is a value ranging anywhere from -1 to 1
-When r is closer to 0 you have a weaker linear association
-When r is closer to +/- 1 you have a stronger linear association
-The sign indicates direction
-It does not have any units (it is based on z-scores)
-Changing units will not change r
-You can calculate a correlation coefficient for any pair of quantitative variables, but if the relationship is not linear the value will be misleading
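Two of the points above (changing units does not change r, and r should be reported with and without an outlier) can be checked with a short Python sketch. The heights and weights are made-up numbers for illustration only, and the helper name `correlation` is an assumption of this sketch, not something from the chapter.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r, built from z-scores of each variable."""
    mx, sx = mean(xs), stdev(xs)
    my, sy = mean(ys), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up heights (inches) and weights (pounds), for illustration only
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [110, 120, 140, 155, 165, 180]
r_original = correlation(height_in, weight_lb)

# Change units: inches -> centimeters, pounds -> kilograms
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.4536 for w in weight_lb]
r_converted = correlation(height_cm, weight_kg)

# Add one made-up point that does not fit the pattern, then compare
r_with_outlier = correlation(height_in + [61], weight_lb + [230])

print(round(r_original, 4), round(r_converted, 4), round(r_with_outlier, 4))
```

The first two printed values should agree, while the third drops noticeably, which is why the value is reported both ways when an outlier is present.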
There IS a formula for calculating the correlation coefficient, but it is tedious to work by hand even with a small set of data, so we will rely on technology any time we need this value. The formula standardizes every data value (converts it to a z-score), multiplies each pair of z-scores together, adds up the products, and divides by n - 1.
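For anyone curious what the technology is doing, here is a minimal Python sketch of the z-score calculation, assuming the standard formula r = (sum of z_x * z_y) / (n - 1). The function names `z_scores` and `correlation` and the data values are made up for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """r = sum of the products of paired z-scores, divided by (n - 1)."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Made-up data, for illustration only
hours = [1, 2, 3, 4, 5]        # predictor
score = [52, 60, 61, 70, 78]   # response

r = correlation(hours, score)
print(round(r, 3))
```

Because every value is converted to a z-score before anything is multiplied, the units cancel out, which is why r has no units.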
Straightening Scatterplots
A section of this chapter talks about re-plotting the data to make a more linear graph. Generally speaking, you do 'something' to all values of a certain variable, then graph the result, e.g. graph (x, y^2) instead of (x, y). Chapter 10 is dedicated to this so, for now, just know that it can be done and we'll learn more later.
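As a rough preview only (not the method Chapter 10 develops), the Python sketch below squares every y value and compares r before and after. The data are made up so that y grows roughly like the square root of x, and the helper name `correlation` is an assumption of this sketch. In practice you would still look at the re-plotted scatter plot rather than rely on r alone.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation coefficient r, built from z-scores of each variable."""
    mx, sx = mean(xs), stdev(xs)
    my, sy = mean(ys), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

# Made-up curved data: y grows roughly like the square root of x
x = [1, 4, 9, 16, 25, 36]
y = [1.0, 2.1, 2.9, 4.0, 5.1, 6.0]

# Re-express: do the same thing (squaring) to every y value
y_squared = [v ** 2 for v in y]

r_curved = correlation(x, y)                  # (x, y): curved pattern
r_straightened = correlation(x, y_squared)    # (x, y^2): closer to a line
print(round(r_curved, 3), round(r_straightened, 3))
```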
Correlation vs Causation
Though tempting, even when you get a large r, you can never conclude from the correlation alone that the predictor variable CAUSED the response variable to change.
Ex: In infants, as vocabulary increases, so does appetite
*Words make you hungry
Ex: Studies show that as ice cream sales rise, so do shark attacks
*Eating ice cream makes you tastier to sharks
**Posted comments to an online article
Beware of Lurking Variables
A LURKING VARIABLE is a third variable that affects both of the variables in your scatter plot.
-Recall: Ice cream sales and shark attacks
-The temperature will cause people to eat more ice cream AND make people want to go to the beach
Calculators
To enter data into your calculator (same as before):
-STAT button
-EDIT menu
-1:Edit option
...then enter a set of data into a column (L1, L2...) (remember which is the predictor and which is the response)
To calculate the correlation coefficient:
...enter data
-STAT button
-CALC menu
-8:LinReg(a+bx) option
...back on the home screen, input the parameters, e.g. L1, L2
*If r is not shown, you need to enable a feature on the calculator
-Go to the CATALOG
-Scroll to 'DiagnosticOn'
-Press ENTER (twice)
To make a scatter plot:
-STAT PLOT button (second function of the Y= button)
-ZOOM then 9 will graph the data in the "best" viewing window