CHAPTER 5
Follow Through Evaluation

DI, UNDISPUTED WINNER

The formal evaluation of Follow Through sponsors and Follow Through's overall performance came out in April of 1977. At that time scientists had recently discovered the cause of Legionnaires' disease; Jimmy Carter had become President and had pardoned Vietnam War draft evaders. The worst air collision to that time had occurred when two planes collided over the Canary Islands, killing 583 people. The Follow Through evaluation did not make headlines because it had not been officially announced or interpreted by the Office of Education.

The design of the evaluation was careful. To assure that the analysis was experimentally pristine, two organizations dealt with the data. Stanford Research Institute collected the data; Abt Associates, on the other side of the continent, analyzed it. The evaluation cost $30 million.

At the time of the evaluation, Bob Egbert was no longer the national director of Follow Through. He had been succeeded by Rosemary Wilson, who shared neither his vision nor his conception that if Follow Through permitted sponsors to implement in a conducive context, the evaluation would clearly identify winners and losers.

The analysis centered on the third-grade performance of the later cohorts that went through the various Follow Through programs. Forty thousand third-graders were tested. Part of the evaluation did not involve all of our sites, only most of the sites that started in kindergarten. Grand Rapids was one of the sites included in this part of the evaluation, although we had not worked with the site for three years and had received no money from
national Follow Through for sponsoring the site. Yet the analysis treated Grand Rapids as if it were one of our sites. Even so, we were confident that the data would show that we had won the horse race. This confidence was in defiance of the educational community's consensus that there would be no clear winners or losers.

Although sponsors were not permitted to publish data that could be construed as comparing performance of different models, National Follow Through analyzed the data and reported comparative data to sponsors as early as 1973, when a conference in Brookings, Oregon, presented results. The book Planned Variation in Education: Should We Give Up or Try Harder? drew the conclusion, "It already seems highly doubtful, however, that the results will provide clear-cut indications that one model is best." This conclusion was based on an earlier cohort, but there was data on a later cohort, which generated a far different picture. Kansas and our model were far ahead of the others. Also, the director of Follow Through research, Gary McDaniels, wrote, "Several sponsors looked very strong after the first year, while others did not. The strongest were those that emphasized short-term achievement effects" [in other words, Kansas and us].

The first published reports on the Follow Through performance were based on an analysis of sponsors conducted by Stallings and Kaskowitz, which appeared in Behavior Today. After making extensive observations of the various sponsors' classrooms, Stallings concluded that there were different winners that corresponded to different program emphases. According to Stallings's calculations, those approaches that focused on reading and spent more time on reading had better reading performance (DI and Behavior Analysis). The main problem with this conclusion was that we did not spend more time teaching reading than most of the models. In fact, we probably spent less than half the time provided by the Bank Street model and several others.

Stallings also concluded that different programs were creating children who were different in problem solving, responsibility, question asking, and cooperation. She concluded that children in High Scope and Open Education were high in these traits. She wrote: "Cooperation was marked in classrooms where a wide variety of activities occurred throughout the day and where children would explore and choose their groups." The problem was that she defined cooperation in terms of the activities. If children spent more time in activities that apparently involved cooperation, Stallings concluded that they were more cooperative. She also defined responsibility in terms of activities: a child or group engaged in any task without an adult. The definition has nothing to do with the amount of responsibility children learn. (Of course she assumed that if children spend time unsupervised, they must be learning more about
responsibility.) If institutionalized children had been included in the evaluation, they probably would have had responsibility scores even higher than Open Education or High Scope because they often have no supervision.

During the period before the Abt Report came out, I was not concerned with what the analysts said about the data. I didn't have either time or interest to debate whether the fat lady was singing yet. The fourth volume of the Abt Report presented data on Follow Through sponsors. It came out in 1977 and left no doubt about whether the fat lady had sung. The volume provided arias involving winners and losers, based on performance data. The report confirmed what we knew all along. No other model was close to ours in sophistication.

The achievement-test data and that of other tests were analyzed two ways, adjusted and unadjusted. The adjusted data were expressed as positive or negative outcomes. If a particular site had a score that was a standard number of points higher than its comparison group, the site received a plus (+). If the site had a score that was a standard number of points lower than its comparison group, the site received a minus (−). If the site was somewhere between a + and a −, the difference was considered educationally insignificant. For analyzing performance of sponsors, Abt threw out performance comparisons if the Follow Through site and the comparison groups differed by more than 50 percent on their entry scores. Several comparisons involving East St. Louis were thrown out, which was unfortunate because East St. Louis children were initially more than 50 percent lower than the children in the comparison groups but still outperformed them by enough to earn a +. There were other ways the analysis was bent to be unkind to DI, including the way some of the data were interpreted. Even so, the numbers didn't lie.

The evaluation had three categories: basic skills, cognitive (higher-order thinking) skills, and affective responses. The graph on page 226 shows the outcomes for the nine major sponsors.
[Figure: Number of Significant Outcomes for Basic Skills (B), Cognitive Skills (C), and Affective Measures (A). Vertical axis: Index of Significant Outcomes.]

The basic skills consisted of those things that could be taught by rote: spelling, word identification, math facts and computation, punctuation, capitalization, and word usage. DI was first of all sponsors in basic skills. Our average score was +297 (which means that we had a considerably larger number of significantly positive outcomes than the Title I comparison students). Only two other sponsors had a positive average. The remaining models scored deep in the negative numbers, which means they were soundly outperformed by children of the same demographic strata who did not go through Follow Through. The sites that Stallings glorified did poorly in this category. High Scope was −389 and Open Education was −433 (far more significantly negative outcomes than positive ones relative to the Title I comparisons). According to the Stallings predictions, this outcome might be expected for basic skills.
DI was not expected to outperform the other models on cognitive skills, which require higher-order thinking, or on measures of responsibility. Cognitive skills were assumed to be those that could not be presented as rote, but required some form of process or scaffolding of one skill on another to draw a conclusion or figure out the answer. In reading, children were tested on main ideas, word meaning based on context, and inferences. Math problem-solving and math concepts evaluated children's higher-order skills in math. Not only was the DI model number one on these cognitive skills; it was the only model that had positive scores for all three higher-order categories: reading, math concepts, and math problem-solving. DI had a higher average score on the cognitive skills (+354) than it did for the basic skills (+297). No other model had an average score in the positive numbers for cognitive skills. Cognitive Curriculum (High Scope) and Open Education performed in the negative numbers, at −333 and −450.

On the affective measures, which included a battery of tests that evaluated children's sense of responsibility and self-esteem, our model was first, followed by Kansas. The models that stressed affective development performed even below the Title I average. One of the affective tests described positive achievement experiences and negative experiences. DI children saw themselves as being more responsible for outcomes than children in any other model. On the test that assessed children's feelings about how they think other people view them and how they feel about school, DI children had the highest scores. Note that DI was over 250 points above the Title I norm and Open Education was over 200 points below the norm.

The Abt Report observed that the high performance of children in our model was unexpected because we did not describe affective outcomes as an objective. The reason was that we assume that children are fundamentally logical. If we do our job of providing them with experiences that show they are smart, they will conclude that they are smart. If they experience success in school that can also be measured in the neighborhood, those experiences serve as fuel for the conclusion that students are competent. At the time of the evaluation, I had heard more than 100 stories of our children helping older siblings learn to read or do homework. The children knew that they could do things the average kid on the street could not do.

The rest of the Abt results were expressed as percentiles. The performance of an average student taking an achievement test is the 50th percentile. The goal for Follow Through had been to achieve the 50th percentile with at-risk children. The average percentile for Title I students was the 20th percentile, less than half that of the average student. The 20th percentile was used as a measure of whether a model produced a positive
or negative effect. The farther above the 20th percentile a model is, the better it performs. The figure below shows the performance of the nine major sponsors in total reading, total math, spelling, and language.

[Figure: Percentile Comparisons for the 9 Major Follow Through Sponsors.]

The horizontal line indicates the 20th percentile. As the figure shows, the competition is closest for reading. All but three of the sponsors scored at or above the 20th percentile. For math, only two models were above the 20th percentile, ours and Kansas (Behavior Analysis). For language, our model was the only one above the 23rd percentile. High Scope (Cognitive Curriculum) had the lowest scores of all sponsors in math and language. Obviously, this data makes a mockery out of Stallings's notion that an untrained observer could make a few observations of classrooms, classify the activities, and draw any kind of valid conclusion about how successful each program was in inducing cognitive skills.

There's more: Not only were we first in adjusted scores and first in percentile scores for basic skills, cognitive skills, and perceptions children had of themselves, we were first in spelling, first with sites that had a Head Start preschool, first in sites that started in K, and first in sites that started in grade 1. Our third-graders who went through only three years
(grades 1-3) were, on average, over a year ahead of children in other models who went through four years, grades K-3. We were first with Native Americans, first with non-English speakers, first in rural areas, first in urban areas, first with Whites, first with Blacks, first with the lowest disadvantaged children, and first with high performers.

From a historical perspective, the performance of our children set important precedents.

1. For the first time in the history of compensatory education, DI showed that long-range, stable, replicable, and highly positive results were possible with at-risk children of different types and in different settings. Previously, the exemplary program was a phantom, something that was observed now but not a year from now. Most were a function of fortuitous happenings, a measurement artifact, or a hoax.

2. DI showed that relatively strong performance outcomes are achievable in all subject areas (not just reading) if the program is designed to effectively address the content issues of these areas. Also, this instruction created lively, smart children who had confidence in their abilities.

3. The performance of all the Follow Through children (but particularly DI children) clearly debunked all the myths about DI. DI did not turn children off or turn them into robots. DI children were smart and they knew it.

4. DI outcomes also debunked the myth that different programs are appropriate for children with different learning styles. The DI results were achieved with the same programs for all children, not one approach for higher performers and another for lower performers, or one for non-English speakers and another for English speakers. The programs were designed for any child who had the skills needed to perform at the beginning of a particular program. If the child is able to master the first lesson, she has the skills needed to master the next lesson and all subsequent lessons. The only variable is the rate at which children proceed through the lessons. That rate is based solely on the performance of the children. If it takes more repetition to achieve mastery, we provide more repetition, without prejudice.

5. The enormous discrepancies in performance between our model and all the others imply that we knew something they didn't know
about instructing the full range of children in the full range of academic skills. We did not buy into the current labels, explanations, or assumptions about learning and performance. The results suggest that our interpretation was right, and that the philosophies of the cognitive and affective models did not translate into effective instruction.

6. The performance of the sponsors clearly debunked the notion that greater funding alone would produce positive results. All sponsors had the same amount of funding, which was more than a Title I program received. DI performed well in this context; however, the same level of funding did not result in significant improvement for the other models. For all programs there were comprehensive services, which included breakfast, lunch, medical and dental care, and social services. In this context, the only reasonable cause for the failure of other models was that they used inferior programs and techniques.

7. The Direct Instruction model was the only one that was effective with extremely low performers. We showed that these children could uniformly be taught to read by the end of kindergarten and read pretty well by the end of first grade. Performance of this magnitude and consistency had never been demonstrated in the schools before Follow Through.

8. The relative uniformity of the DI sites implies that DI was better able to make the typical failed teacher successful. Teachers who had not been able to teach children to higher levels of performance were able to do it with our program.

9. Probably most important, the outcome showed that our focus on the moment-to-moment interactions between teachers and children was correct. Most of the other models viewed the problems of instruction in terms of broad interactions between teachers and children, not in terms of specific information delivered in moment-to-moment interactions.

We were not into celebrating, but after work on the day Volume 4 arrived, we had a little party, a couple of beers, and several rounds of congratulations to trainers, managers, and consultants who worked in the trenches to achieve this outcome. Wes reminded us that the game plan was for the winners to be widely disseminated. So we needed to think about tooling up to work with a far greater number of places. We needed to
convert some of our trainers to project managers, recruit some of the superteachers from our Follow Through sites, and get ready to work with lots of Title I programs. Several of our project managers were skeptical about this degree of acceptance, but the rest of us felt that there would be a payoff for the last nine years of work.

RECONSTRUCTING HISTORY AND LOGIC

With the Abt data published, the moratorium on comparative studies was lifted. Wes promptly prepared a long article for the Harvard Educational Review, "Teaching Reading and Language to the Disadvantaged: What We Have Learned From Research," which came out in 1977 and gave overviews of our approach and programs, our training, and our results. Wes anticipated that the article would stimulate great interest. Instead, there was almost no response: no revelations reported by readers who realized that the practices they espoused had led to unnecessary failure, or revelations that DI presented a better way to solve problems that had been haunting school districts since the Coleman Report. There were no frantic phone calls from people wanting to learn more about DI, nor calls from reporters asking about the astonishing results. Instead there was a handful of responses, and most were not positive but raised carping issues about the design of the study or the problems associated with accurately measuring cognitive outcomes. Those who carped had earlier accepted the idea that achievement tests documented the performance problems of at-risk students. Yet, when the same kind of achievement-test data showed that their favored programs produced children who failed as miserably as children summarized by the Coleman Report, they rejected the study.

THE GLASS HOUSE

We later discovered that the effort to trivialize Follow Through data had begun before Abt 4 had been released. The effort was initiated by the Ford Foundation, which had been supporting failed educational programs. In January 1977, the Ford Foundation awarded a grant to the Center for Instructional Research and Curriculum Evaluation at the University of Illinois to conduct a third-party evaluation of Follow Through results. Ernest House was project director. He assembled a panel of professionals with national reputations in their fields: Gene V. Glass of the University of Colorado, Leslie D. McLean of the Ontario Institute for Studies in
Education, and Decker F. Walker of Stanford University. This assemblage judged Follow Through data. The main purpose of the critique was to prevent the Follow Through evaluation results from influencing education policy. The panel's report asserted that it was inappropriate to ask, "Which model works best?" Rather, it should consider such other questions as "What makes the models work?" or "How can one make the models work better?"

Glass wrote another report for the National Institute of Education (NIE), which argued that it was not sound policy for NIE to disseminate the results of the FT evaluations, even though the data collection and analysis had cost over 30 million dollars. Here's that part of the abstract of Glass's report to the NIE:

Two questions are addressed in this document: What is worth knowing about Project FT? And, How should the National Institute of Education (NIE) evaluate the FT program? Discussion of the first question focuses on findings of past FT evaluations, problems associated with the use of experimental design and statistics, and prospects for discovering new knowledge about the program. With respect to the second question, it is suggested that NIE should conduct evaluation emphasizing an ethnographic, principally descriptive, case-study approach to enable informed choice by those involved in the program.

Again, this position is curious for one who apparently believed that data of the same type collected in the FT evaluation earlier documented the problem. Why would the data be adequate to document the problem but not appropriate for documenting outcomes of different approaches that address the problem?

The suggestion that case studies would enable informed choice is not very thoughtful. Qualitative studies work only if they are carefully underpinned with rules about quantities. I'm sure that if the game was for each sponsor to compile descriptions from their high-performing classrooms, DI would have a larger number of success stories than the other sponsors. Unless there are some number assumptions (like, How consistently do positive case histories occur?), the data is useless. Making it even more useless is the depth of description that would be needed to enable an informed choice. I would guess that each study would require many pages. It would be far more confusing to try to extract information about what works best from these documents than from a few tables that summarize the performance data. In fact, the suggestion for using ethnographic
studies was probably intended to make it impossible for readers to find out what worked best. Who would want to wade through possibly thousands of pages of case histories to distill a conclusion about which model was more effective?

Glass even argued that all evaluations involving measurable events and data are invalid: "The deficiencies of quantitative, experimental evaluation approaches are so thorough and irreparable as to disqualify their use." What is surprising about this statement is how anybody at NIE could have read it and not concluded that the author was loony. The Follow Through experiment was a teaching experiment involving not a few minutes in the lab, but nine years of cohorts in which students passed through four grades in actual classrooms. This study had huge numbers. Also, the evaluation tools that documented performance of students already included ethnographic descriptions of models (prepared by Nero and Associates).

Another argument that Glass presented involved the audience: "The audience for Follow Through evaluations is an audience of teachers to whom appeals to the need for accountability for public funds or the rationality of science are largely irrelevant." There are two major problems with this assertion:

1. The Follow Through study was not designed for teachers but for decision makers, the school districts and state and federal departments of education who serve as gatekeepers for what teachers do. The Coleman Report did not result in individual teachers organizing carpools to schlep children from the inner city to the suburbs. The Follow Through study was not founded on the assumption that teachers enjoyed some kind of democratic world in which every teacher was able to make independent decisions about what and how to teach. Teachers are not decision makers on policy. Policy makers and district officials are. They would be far better informed by the Follow Through results than by any other single data source because only Follow Through provided extensive comparative data of different approaches.

2. Glass could not have seriously believed that even district-level decision makers would read Abt 4. They wouldn't. Glass appealed to NIE because he was concerned about what NIE would say about Follow Through. The final NIE word would make a lot of news and create great interest. In effect, what NIE would say about the program
would become the truth about it. People, press, and historians would be greatly influenced by NIE's stance.

Also, if NIE followed Glass's recommendations, there would be no challenge to the current order of things in education. The Ford Foundation would save face and wouldn't be labeled as a corporate fool for funding foolish programs for years. People in teacher colleges and district administrators would be able to keep their prejudices about children, learning, and teachers. The publishers of elementary-grade instructional material would be happy because no tidal wave would sink sales of instructional material, and school districts would not have to face uncomfortable issues of overhauling both their belief systems and their machinery. College professors could continue to espouse developmental theories and discovery practices as they decry programs that would divest teachers of their individuality and creativity. With a statement that downplayed the Follow Through data, Open Education and High Scope, Bank Street College, and Nimnict's program would not be cast as losers, because ethnographic studies might feature one of their good sites. All the primitive but well-greased machinery on all levels, from state departments of education to classrooms, would remain solidly in place, with no challenge. Of course, somewhere in this political milieu were millions of kids whose lives would be greatly influenced by NIE's decision. But as the anti-number philosophy suggests, who was counting?

In 1978, House and Glass published an article in the Harvard Educational Review, "No simple answer: A critique of the Follow Through evaluation." Unlike Wes's earlier article, this one created quite a stir. A shortened version appeared in Educational Leadership. Although Gene Glass had been president of the American Educational Research Association, the flaws in the arguments the article presented were so conspicuous that they should have been obvious to the man on the street.

The article presented two main arguments to discredit the Office of Education evaluation. The main argument was based on a simple value judgment: sponsors should not be compared. Therefore, the Abt focus on the performance of individual sponsors was inappropriate. House and Glass contended that the evaluation was actually designed to show how the aggregate of models performed, not what individual sponsors achieved. The aggregate failed; therefore, the most definitive statement about Follow Through would simply be "Project Follow Through failed." In other words, the average performance of Follow Through students was no higher than that of comparable Title I students. Therefore, Follow Through failed, which means that every sponsor failed. The question of whether individual sponsors actually failed was not considered relevant because it's bad form to compare sponsors.
What seems most curious about this argument is how House and Glass could conclude that Follow Through as a whole failed. That judgment involves a comparison of programs, Follow Through and Title I. Why is it that sponsors can't be compared but larger programs like Follow Through can? Even more puzzling was how House and Glass could make the obvious distortion that Follow Through was never intended to evaluate the performance of different sponsors. Certainly, this assertion would be contradicted by Follow Through, unless there was collusion between several agencies, including the Office of Education.

The second House-Glass argument was that no approach was successful with all its sites. House and Glass pointed out that there was variability among each sponsor's sites, some performing well and others poorly. Therefore, no single sponsor should be identified as being successful. Again, if House and Glass argued that data could not be used to compare sponsors, by what ground rules were they able to compare sponsors with respect to their variability? Logically, Glass and House would have to throw out this argument or reveal themselves as cherry pickers who used comparative data when they needed it and rejected it when it worked against them.

The variability argument was particularly incredible because it was presented by professionals who were supposed to be experts in experimental designs. The most elementary fact about populations is that every observable population of anything has variation. In fact, populations vary across every measurable feature: the size of the lobes on black oak leaves, the shape of snowflakes, the age of computers. So it would be insane to throw out data simply because it shows that there is variability, particularly in this case because only one DI site varied greatly from the seventeen others, and only that site performed poorly. Grand Rapids had third graders performing a year lower than third graders in our other sites. The only possible evidence that House and Glass had about the failure of DI or the variability was based on one site that openly rejected the model's provisions and had no contact with the sponsor in more than three years. Furthermore, all the higher-echelon bureaucrats in Follow Through and in NIE knew that we hadn't worked with Grand Rapids for years.

SAD SONG OF THE REAL FAT LADY

The official statement that NIE issued was consistent with the recommendations by Glass and House: Project Follow Through did not significantly improve the performance of disadvantaged students over students in extant Title I programs. There was not a word about winners, losers, or
about the performance of individual sponsors, just a flat statement that Project Follow Through, as an aggregate, failed.

Functionally, this decision showed the priorities of the educational system. It was more palatable for educators to accept that their favored approach failed than it was to admit that an approach in disfavor succeeded. The educators' feelings and prejudices were functionally more important to them than evidence that there was a successful method for teaching at-risk children. Stated differently, these people showed that their beliefs were more important than the millions of failed children who could benefit from effective instruction. Make no mistake, they would not have gone through the various machinations they created if they believed their own rhetoric about how important it is to serve at-risk children.

In the end, sites from all the Follow Through models, including High Scope and the Open Classroom, were validated. So the status quo was maintained; the models that had horrible results would remain in good standing; all educational myths were perpetuated. If policy makers wanted to believe in instructional models based on student choice, extensive parent involvement, or discovery learning, they wouldn't have to face the pesky problem of how to support this notion with data. Their collective conscience was clear because all these approaches had been validated. Someone receiving this information would assume that validated means that the validated approach was replicable and sound.

PAPER TRAIL

The master plan for Follow Through and how information about Follow Through would be disseminated to other schools and agencies was complicated. The switch of emphasis from sponsors to individual school programs had begun as soon as Abt 4 came out in 1977. The Office of Education established the Joint Dissemination and Review Committee. Its purpose was to screen Follow Through schools that applied for funds to disseminate information about the school's successful program. The review committee scrutinized schools that applied through forms, letters, and interviews. Those schools that made it over these hurdles were validated, whether they were High Scope schools, Open Education schools, or DI sites. All received the same size validation.

Not all of our sites that applied for funds made the cut. At least four of them received a rating of B and at least two a rating of C, even though they had excellent performance data. It seemed obvious that there was a conscious effort to keep DI from having more representation than some of
the other major models. Once validated, a school would become a member of the National Diffusion Network (NDN), which consisted of 200 programs, each receiving funds to promote itself to other schools. Of the 200 dissemination schools, only 21 were Follow Through schools, and only 3 were Direct Instruction schools: Flint, Dayton, and East St. Louis. Most of the 200 schools came from a poor list of effective programs compiled by the Far West Regional Educational Lab. Very few of these schools actually had data of effectiveness, but neither did at least 14 of the 21 Follow Through sites that were now incorporated into the network. The National Diffusion Network was possibly well named because it did a good job of diffusing, in the sense of making the effect thin. Instead of being 3 of the 21 programs to be disseminated, DI was now 3 of 200, less than 2 percent of the total. And as usual, all validated sites had the same status.

Individual schools were eligible to disseminate, but sponsors weren't. The Joint Dissemination and Review Committee ruled that only schools could apply for validation, not models. When we received this news, I thought Wes would go into apoplexy. I had never seen him that angry. I was not a portrait of happiness, but Wes exploded. He quickly recovered and in less than an hour was on the phone, trying to contact senators, representatives, and others who might have some influence.

In October of 1977, Wes and I wrote a letter to Rosemary Wilson protesting Follow Through's position about dissemination. We wrote again in November, after it became apparent that nothing would change. That letter appears in whole, followed by her response. Our letter reiterates some of the points I have covered.