The Median-Median Line

Students use movie sales data to estimate and draw lines of best fit, bridging technology and mathematical understanding.

David C. Wilson

Graphing bivariate data in a scatter plot and drawing an approximate line of best fit for the data have become commonly recommended activities (NCTM 2000) or even, in some states, a standard (e.g., New York State Education Department 2005) for middle school and high school students. The graphing calculator has provided a mechanism for students both to approximate a best-fit line (e.g., using the Transform application on the TI-84) and to calculate the best-fit line using a built-in option (e.g., LinReg or Med-Med on the TI-84). Computer software such as Fathom (2007) offers similar options for students to use in their exploration of data. Frequently, the goals of such explorations focus on exploring the slope as a rate of change, discussing the meaning of the y-intercept, or interpolating and extrapolating within the given context. These goals all contribute to broadening student understanding, yet the initial task of drawing the best-fit line remains a bit of a mystery to most students.

Consider the two graphs in figure 1, which represent two possible responses from students when asked to draw a best-fit line for the scatter plot of data relating weight and length of a rubber band. What criteria did each student use to draw his or her line? Two common misperceptions of students when drawing the best-fit line are that the line should hit as many points as possible and that the line should divide the data set so that the number of points above and below the line is equal. It is difficult to counter these beliefs with mathematical reasoning when the mathematical process behind finding a line of best fit seems beyond the students' current knowledge.
To develop insight into this process, we will use the median-median line, which uses basic principles of coordinate geometry and linear equations to calculate the equation of the line of best fit. The process will provide students with a way of thinking about the appropriate criteria to consider when drawing best-fit lines and will reveal the mystery of what is happening in their calculator or computer when best-fit lines are generated. In addition, students will be able to compare and contrast the median-median line and least-squares regression line and make decisions about which may be more appropriate for a given set of data.

Mathematics Teacher, Vol. 104, No. 4, November 2010. Copyright 2010 The National Council of Teachers of Mathematics, Inc. www.nctm.org. All rights reserved. This material may not be copied or distributed electronically or in any other format without written permission from NCTM.


Fig. 1 (a), (b). Which student's best-fit line for this data set is more appropriate?

THE BEST-FIT LINE FOR THREE DATA POINTS

Determining a best-fit line for a data set consisting of two points is trivial. The addition of a third point complicates the task and calls for some analysis. Suppose that we begin by drawing a line through the outermost points, as shown in figure 2, with points A (2, 5), B (6, 4), and C (8, 2). If we take this line to be our initial attempt at a best-fit line, the task becomes one of accounting for point B in some way. One possibility would be to shift the line up a bit toward point B. The question is, How far up do we slide the line so that it takes point B into account in a reasonable way? For example, if we shift the line up 1 unit, then it passes through B, but this solution does not seem to be a reasonable best-fit line.

Fig. 2. Using the left and right data points to form a line of best fit is a good starting point.

To resolve this dilemma, it is necessary to introduce the concept of a residual. The residual (R) of a data point is the signed vertical distance between the data point and the line of best fit. It is calculated by subtracting the y-coordinate of the point on the line from the y-coordinate of the data point; the formula is R = y − yʹ, where yʹ is the y-coordinate of the point on the line directly above or below the data point. This way of thinking about the distance from a point to a line is different from the typical Euclidean interpretation and is easier to calculate. Figure 3 shows the residual for point B. This method of calculating residuals results in positive residuals for points above the line and negative residuals for points below the line. For point B (6, 4), the point on the line directly below it is (6, 3), so subtracting the y-coordinates as described yields 4 − 3 = 1. The value of the residual allows us to think quantitatively in trying to answer our question about how far up the line should be shifted to account for point B.
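The residual for point B can be checked with a few lines of Python. This is only an illustrative sketch: the helper names line_through and residual are hypothetical and are not part of the article or of any calculator software. Exact fractions are used so that the arithmetic matches hand calculation.

```python
from fractions import Fraction


def line_through(p, q):
    """Return (slope, intercept) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = Fraction(y2 - y1, x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def residual(point, slope, intercept):
    """Residual R = y - y', where y' is the line's y-value at the point's x."""
    x, y = point
    return y - (slope * x + intercept)


A, B, C = (2, 5), (6, 4), (8, 2)

# Initial attempt: the line through the two outermost points A and C.
slope, intercept = line_through(A, C)
print(slope, intercept)               # -1/2 6
print(residual(B, slope, intercept))  # 1
```

A positive result means that B lies above the line through A and C, which is why the line is shifted upward.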
(Take a moment to ponder this idea before reading on.)

Consider using one-third of the residual as the shift amount. Figure 4 shows the best-fit line for these three points when using this approach. Now consider the values of the residuals for the three points to see why 1/3 is the desired amount. Point B now has a residual of 2/3, while points A and C each have a residual of −1/3. Summing the residuals results in zero and prompts a new name for the line: the zero-residual line.

Fig. 3. The vertical red segment is the residual for the data point B.

Fig. 4. Shifting the line in figure 3 up by 1/3 produces a line whose residuals sum to zero.

The process for finding the equation of this line is readily accessible to most eighth- and ninth-grade students, making it very attractive to use. First, students find the equation of the line that goes through the outermost points A and C, in this case yielding y = −(1/2)x + 6. Because this line is then shifted by 1/3 of the residual toward B, the slope of the line will not change; only the y-intercept changes. That is, if 1/3 of the residual of B is added to the y-intercept of the line through A and C, then the resulting equation is the zero-residual line. In this case, the result is y = −(1/2)x + 19/3. This line is the median-median line for the three points. It can also be displayed by entering the coordinates of the points into lists on the TI-84 and generating the line of best fit by selecting Med-Med under the STAT-CALC menu. Figure 5 shows screen shots for this process. The last screen shot (fig. 5c) displays the residual values of points A, B, and C. These are generated automatically each time a user asks the calculator to find the Med-Med line. To view them, type RESID as the name of a list and press ENTER.

Fig. 5 (a), (b), (c). Using three points helps students understand the process.

THE MEDIAN-MEDIAN LINE

One question that may need to be addressed at this point is, Why is this best-fit line referred to as the median-median line? A second question arises as to what to do when the data set has more than three points. The answer to the second question yields the answer to the first. Consider the data set given in table 1, provided by the National Association of Theater Owners (http://www.natoonline.org/statistics.htm), on the average cost of admission to a movie (in U.S. dollars) and total annual movie attendance in the United States and Canada from 2002 through 2009.

Table 1. Movie Attendance Data, 2002-2009

Year    Cost (in U.S. dollars)    Attendance (in billions)
2002    5.80                      1.570
2003    6.03                      1.521
2004    6.21                      1.484
2005    6.41                      1.376
2006    6.55                      1.401
2007    6.88                      1.400
2008    7.18                      1.341
2009    7.50                      1.414

The process of finding the median-median line of best fit begins with reducing the data set to three summary points, thus enabling the use of the process described above.
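As a recap of the three-point process before it is applied to the movie data, the following Python sketch builds the zero-residual line for A, B, and C and confirms that the residuals sum to zero. The function name zero_residual_line is a hypothetical label chosen for this illustration, not a calculator or library routine.

```python
from fractions import Fraction


def zero_residual_line(left, middle, right):
    """Line through the outer points, shifted by 1/3 of the middle residual."""
    (x1, y1), (xm, ym), (x3, y3) = left, middle, right
    slope = Fraction(y3 - y1) / Fraction(x3 - x1)
    intercept = y1 - slope * x1
    middle_residual = ym - (slope * xm + intercept)
    # The shift changes only the y-intercept; the slope stays the same.
    return slope, intercept + middle_residual / 3


A, B, C = (2, 5), (6, 4), (8, 2)
slope, intercept = zero_residual_line(A, B, C)
residuals = [y - (slope * x + intercept) for x, y in (A, B, C)]

print(slope, intercept)             # -1/2 19/3
print([str(r) for r in residuals])  # ['-1/3', '2/3', '-1/3']
print(sum(residuals))               # 0
```

Changing the divisor 3 to another value is a quick way to see that no other shift makes the three residuals sum to zero.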
The first task is to order the data by the x-values (as we do when determining the median of a set of data) and then split the data into three groups. If the number of points in the data set is not divisible by 3, then the data are split so that the two outer groups contain equal quantities. For the movie data, this process yields two outer groups of three points and a middle group of two points. Table 2 displays the three groups of data. Students can use the table or the graph (draw three vertical lines) to identify the three groups.

Table 2. Grouped Movie Data

Group     Cost (in U.S. dollars)    Attendance (in billions)
Left      5.80                      1.570
Left      6.03                      1.521
Left      6.21                      1.484

Middle    6.41                      1.376
Middle    6.55                      1.401

Right     6.88                      1.400
Right     7.18                      1.341
Right     7.50                      1.414

The summary point for each group is found by calculating the median x-value and the median y-value, hence the name median-median. The summary point may or may not be part of the original data set. The summary points for the movie data are (6.03, 1.521), (6.48, 1.389), and (7.18, 1.400). The third summary point illustrates the fact that, even when a group has three points, the median x-value and median y-value need not come from the same point: 7.18 is the cost for 2008, whereas 1.400 is the attendance for 2007.

Once the summary points have been determined, the line of best fit can be calculated by following the three-point process described previously. The equation of the line through the outer points is y = −0.105x + 2.155. The residual of the middle point is −0.086. Taking 1/3 of the residual and adding it to the y-intercept of the equation yields y = −0.105x + 2.126 as the median-median line for the movie data. Again, the TI-84 can be used to display this (see fig. 6). The differences in values reflect the rounding decisions made during calculations. Note that the residual data displayed show the residuals for each of the points in the data set and not the residuals of the summary points.

Fig. 6. The median-median line for the movie data can be superimposed on a scatter plot of the data.
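The whole procedure (order by x, split into three groups, take medians, fit the outer summary points, shift by one-third of the middle residual) can be sketched in Python as follows. The function median_median_line is a hypothetical helper written for this illustration. The article states only that the two outer groups must be the same size; placing a single leftover point in the middle group is an assumption made here.

```python
from statistics import median


def median_median_line(points):
    """Return (slope, intercept, summary_points) for the median-median line."""
    pts = sorted(points)  # order the data by x-value
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three points")
    # Outer groups are equal in size. With a remainder of 2, each outer group
    # gets one extra point; with a remainder of 1, the middle group gets it
    # (an assumed convention).
    outer = n // 3 + (1 if n % 3 == 2 else 0)
    groups = [pts[:outer], pts[outer:n - outer], pts[n - outer:]]
    summary = [
        (median([x for x, _ in g]), median([y for _, y in g])) for g in groups
    ]
    (x1, y1), (xm, ym), (x3, y3) = summary
    slope = (y3 - y1) / (x3 - x1)
    intercept = y1 - slope * x1
    middle_residual = ym - (slope * xm + intercept)
    return slope, intercept + middle_residual / 3, summary


# (cost in U.S. dollars, attendance in billions), 2002 through 2009
movie_data = [
    (5.80, 1.570), (6.03, 1.521), (6.21, 1.484), (6.41, 1.376),
    (6.55, 1.401), (6.88, 1.400), (7.18, 1.341), (7.50, 1.414),
]

slope, intercept, summary = median_median_line(movie_data)
print(summary)
print(round(slope, 3), round(intercept, 3))
```

Because this sketch carries unrounded intermediate values, its y-intercept comes out near 2.127 rather than the 2.126 obtained above from rounded values, which is the kind of rounding difference already noted.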
The process of finding the median-median line, while requiring students to use procedures that they may be expected to be competent in, can still be a challenging one; real data seldom have integer values, and the multistep process initially may appear daunting. Students should work through several examples involving three points to become familiar with the process and understand the zero-residual line before working with larger data sets. Students should understand what they are doing when taking 1/3 of the residual to the middle point and how that leads to finding the median-median line; otherwise, this procedure will become rote.

CONNECTIONS TO RELATED TOPICS

The numerous connections between the median-median line and traditional mathematical content in eighth- and ninth-grade mathematics curricula are worth recognizing. The process of finding the median-median line involves the concepts of slope, intercept, parallel lines, vertical shifts of graphs, and median values as well as fundamental algebraic skills. Perhaps most significant, students will have a way of estimating where they should draw the best-fit line for a given set of data and better understand what the calculator or software is doing when they use those tools.

The task of finding a best-fit line also provides an opportunity to bring meaning to the slope and y-intercept values in the context of the data. The movie cost and attendance data are particularly well suited for this purpose. Moreover, it is helpful for students to step outside the mathematical analysis and consider what the values reveal about the data. Interpretations of slope should lead students to suggest statements reflecting that an increase of one dollar in the cost of a ticket is associated with a decrease in attendance of approximately 0.105 billion people.
Put into more meaningful terms (another important task!), an increase in ticket cost of one dollar is associated with a decrease in attendance of 105 million people, or an increase of 10 cents is associated with a decrease in attendance of 10.5 million people. Similarly, interpretations of the y-intercept should lead to discussions about attendance if the cost were zero and, naturally, how meaningful this value is and the need for caution when extrapolating from data.

The movie data set also leads students to conclude, reasonably but incorrectly, that attendance is related to ticket cost in a causal way. It is worth having a discussion about what was happening in the home-entertainment world and the economy during the years under examination. To students, it may seem counterintuitive that the data do not provide evidence of causality. To appreciate the distinction between association and causality more fully, students may explore causality within other data sets, such as the number of fire trucks responding to a fire versus the cost of the fire damage (in dollars). Another activity that uses Fathom to explore association versus causality focuses on shoe size and reading ability in an engaging way to develop reasoning regarding causality (Center for Technology and Teacher Education 2001).

A natural question that students might ask after discussing the median-median line would be how the least-squares regression line (LinReg on the TI-84) is different. A beneficial outcome for students of working with the median-median line is that they now have the necessary knowledge to understand how the least-squares process results in a different line. Note that the least-squares process uses the mean rather than the median as a starting point. That is, the mean x-value and the mean y-value for the data set are calculated, and the best-fit line is drawn through that point. The least-squares process also differs from the median-median process in its criterion: rather than making the residuals of three summary points sum to zero, it squares the residual of each data point (resulting in nonnegative values) and minimizes the sum of those squares, hence the name least squares.
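To make the comparison concrete, the sketch below computes a least-squares line for the movie data from the standard closed-form formulas and compares its sum of squared residuals with that of the median-median line y = −0.105x + 2.126 found earlier. The function names are hypothetical, and the code illustrates the idea rather than describing how LinReg or Fathom is implemented.

```python
def least_squares_line(points):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    # This choice of intercept forces the line through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(points, slope, intercept):
    """Sum of the squares of the residuals y - (slope * x + intercept)."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)


# (cost in U.S. dollars, attendance in billions), 2002 through 2009
movie_data = [
    (5.80, 1.570), (6.03, 1.521), (6.21, 1.484), (6.41, 1.376),
    (6.55, 1.401), (6.88, 1.400), (7.18, 1.341), (7.50, 1.414),
]

ls_slope, ls_intercept = least_squares_line(movie_data)
ssr_ls = sum_squared_residuals(movie_data, ls_slope, ls_intercept)
# Median-median line from the article, using its rounded coefficients.
ssr_mm = sum_squared_residuals(movie_data, -0.105, 2.126)

print("least-squares line:", ls_slope, ls_intercept)
print("sum of squares (least squares):", ssr_ls)
print("sum of squares (median-median):", ssr_mm)
```

By construction, no other line can have a smaller sum of squared residuals than the least-squares line, so the first sum printed is never larger than the second.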
Figure 7 displays the movie data graphed in Fathom with the least-squares regression line, the equation, and the sum of the squares of the residual values. The differences between the underlying processes that result in two lines of best fit can also spur discussion of when it might be more appropriate to use one or the other. As in the process of selecting an appropriate measure of center, outliers play a determining role. That is, the median-median line, because median-based summary points are used in its construction, is more resistant to extreme points than is the least-squares line.

Fig. 7. Fathom can display the data and the least-squares line as well as provide images of the squares of the residuals.

CONCLUSIONS

Technology can serve as a tool to generate discussion of and interest in underlying procedures. The mathematics underlying the median-median line is within the reach of most high school students and involves tasks that they may be expected to perform. This activity provides all students with a richer understanding of the processes underlying the task of finding lines of best fit and gives them tools to use when thinking about where best-fit lines should be drawn.

REFERENCES

Center for Technology and Teacher Education. The Correlation between Shoe Size and Reading Level. 2001. www.teacherlink.org/content/math/activities/ft-shoe/guide.html.

Key Curriculum Press. Fathom (version 2.1). Berkeley, CA: Key Curriculum Press, 2007.

National Council of Teachers of Mathematics (NCTM). Principles and Standards for School Mathematics. Reston, VA: NCTM, 2000.

New York State Education Department (NYSED). New York State Mathematics Core Curriculum. NYSED, 2005. www.emsc.nysed.gov/ciai/mst/mathcorepage.html.

DAVID C. WILSON, wilsondc@buffalostate.edu, a former high school teacher, is an associate professor of mathematics education at Buffalo State, SUNY. His work focuses on creating learning environments that foster inquiry and build connections and understanding.