USING STATISTICAL EVIDENCE TO PROVE THE MALPRACTICE STANDARD OF CARE: BRIDGING LEGAL, CLINICAL, AND STATISTICAL THINKING

Michelle M. Mello*

Increasingly, there have been calls to supplement expert opinion testimony in medical malpractice cases with more objective empirical evidence of various kinds to establish the legal standard of care. Examples include proposals to incorporate clinical practice guidelines and the proposals made by Professor Meadow, Professor Hartz, and their colleagues in this Symposium to utilize physician surveys and epidemiologic studies of physician practice patterns. While empiricizing the standard of care is an intriguing and worthy goal, implementing these proposals poses some substantial challenges. A number of complexities arise from differences in the way that legal triers of fact, physicians, and researchers reason through problems, especially the different meanings that they attach to the concept of "significance." Additionally, there are many methodological and practical issues associated with conducting empirical studies of physician behavior and introducing them as evidence in a jury trial. The experience of attempting to integrate clinical practice guidelines into malpractice litigation suggests that practical and conceptual problems involved in merging the cultures of medicine, science, and law should not be underestimated.

I. INTRODUCTION

Both medicine and law are undergoing a period of transformation marked by the ascendancy of empiricism. Following landmark research conducted in the 1970s and 1980s demonstrating wide unexplained variations in medical care processes between different areas of the country, 1 interest developed among health care

* Assistant Professor of Health Policy and Law, Department of Health Policy and Management, Harvard School of Public Health; A.B. 1993, Stanford University; M.Phil. 1995, Oxford University; Ph.D. 1999, University of North Carolina at Chapel Hill; J.D. 2000, Yale Law School.
Able research assistance by Anne Claiborne is gratefully acknowledged. Address reprint requests to: Dr. Mello, Department of Health Policy and Management, Harvard School of Public Health, 677 Huntington Ave., Boston,

822 WAKE FOREST LAW REVIEW [Vol. 37

providers in achieving a greater degree of standardization in clinical practice through the use of practice guidelines, based on hard evidence about the efficacy and cost-effectiveness of various treatments. The "evidence-based medicine" movement 2 now has a large and growing following among practitioners and health care payers. At the root of this movement is the notion that clinical decision-making in individual cases should reflect an application of the best available evidence from systematic research reported in the scientific literature.

A growing attraction to the possible role of systematic research in the determination of individual cases is also present in a range of areas of tort law. Social science research has been used to prove facts and create rules of law in civil litigation since the days of Brown v. Board of Education, 3 but more recent times have seen increased reliance on other forms of research, most notably epidemiologic studies, to establish elements of the plaintiff's claim in tort cases. 4 The Supreme Court's decision in Daubert v. Merrell Dow

MA 02115; Phone: (617) ; Fax: (617) ; mmello@hsph.harvard.edu.

1. See, e.g., James M. Perrin et al., Variations in Rates of Hospitalization of Children in Three Urban Communities, 320 NEW ENG. J. MED. (1989) (comparing hospitalization rates for children in Rochester, Boston, and New Haven); John E. Wennberg & Alan Gittelsohn, Small Area Variations in Health Care Delivery, 182 SCIENCE (1973) (identifying population-based differences in the delivery of health care services); John E. Wennberg et al., Are Hospital Services Rationed in New Haven or Over-Utilised in Boston?, 1 LANCET 1185 (1987) (finding that hospitalization rates vary substantially even among demographically similar communities served by major academic medical centers).

2. See infra note 118 & accompanying text.
For background literature on the history and rationale of evidence-based medicine, see generally David L. Sackett et al., Evidence Based Medicine: What It Is and What It Isn't, 312 BMJ 71 (1996) (defining evidence-based medicine as the judicious use of current best evidence in determining patient care); Samuel Wiebe, The Principles of Evidence-Based Medicine, 20 CEPHALALGIA 10 (2000) (describing the principles, application, and limitations of evidence-based medicine).

3. U.S. 483 (1954); see also John Monahan & Laurens Walker, Social Authority: Obtaining, Evaluating, and Establishing Social Science in Law, 134 U. PA. L. REV. 477, 491 (1986) (citing several examples of the use of research, including the effects of school segregation on self-esteem, the effects of pornography on antisocial behavior, the effects of the death penalty on crime rates, and the effect of alleged trademark violations on consumer perceptions of product origins); id. at 477 n.2 (listing cases in which the Supreme Court has explicitly relied on social science research).

4. See Tom Christoffel & Stephen P. Teret, Epidemiology and the Law: Courts and Confidence Intervals, 81 AM. J. PUB. HEALTH 1661, 1661 (1991) (citing evidence of a "dramatic" increase in judicial reliance on epidemiology since the early 1980s); Sheri L. Gronhovd, Note, Social Science Statistics in the Courtroom: The Debate Resurfaces in McCleskey v. Kemp, 62 NOTRE DAME L. REV. 688, 688 n.3 (1987) (describing several factors contributing to the expanded role of statistics in litigation).

2002] LEGAL, CLINICAL, AND STATISTICAL THINKING 823

Pharmaceuticals, Inc. 5 was a response to the growing influence of epidemiologic and scientific evidence in the courtroom and the conundrums it poses for judges. The Federal Judicial Center has been active in educating judges in methods of statistics and epidemiology to prepare them for the challenge of serving as "gatekeepers" for this complex scientific evidence. 6

Medical malpractice litigation, lying at the intersection of law and medicine, is one area in which the drive towards greater empiricism would seem to have particular promise. Whereas physicians and judges have traditionally been content to rely on expert opinion informed by personal experience to benchmark the clinical and legal standards of care for physicians in malpractice cases, increasingly there have been calls to utilize various kinds of more objective empirical evidence to establish the standard of care. Since the late 1980s, for example, legal academics have been proposing that judges permit litigants to introduce clinical practice guidelines as evidence of the standard of care. 7 Some state legislatures, persuaded by arguments that the use of guidelines would bring greater certainty and uniformity to the determination of whether the standard of care has been breached, have enacted statutes codifying the evidentiary role that such guidelines will play in state court proceedings. 8

More recently, two groups of physicians and attorneys have suggested that other forms of objective empirical evidence could play a useful role in establishing the legal standard of care. William Meadow, Cass Sunstein, and John Lantos argue that epidemiologic studies of actual care processes would provide a more reliable indicator of contemporary medical custom than classic expert opinion testimony. 9 Arthur Hartz and colleagues argue for the use

5. U.S. 579 (1993).

6. See, e.g., FED. JUDICIAL CTR., REFERENCE MANUAL ON SCIENTIFIC EVIDENCE 5 (2d ed. 2000).
Judges' gatekeeping role arises from the Daubert decision. See Daubert, 509 U.S. at

7. See, e.g., Clark C. Havighurst, Practice Guidelines As Legal Standards Governing Physician Liability, 54 LAW & CONTEMP. PROBS. 87, 91 (1991); Richard E. Leahy, Comment, Rational Health Policy and the Legal Standard of Care: A Call for Judicial Deference to Medical Practice Guidelines, 77 CAL. L. REV. 1483, (1989).

8. See FLA. STAT. ANN. (West 2002) (repealed 2002); KY. REV. STAT. ANN. (8) (Michie 1997); ME. REV. STAT. ANN. tit. 24, (West 2000) (repealed 2002).

9. My critique is directly addressed to a working paper that served as the original impetus for this Symposium, but which was published elsewhere, see William Meadow & Cass R. Sunstein, Statistics, Not Experts, 51 DUKE L.J. 629 (2001), and William Meadow & John Lantos, A Proactive, Data-based Determination of the Standard of Medical Care in Pediatrics, 101 PEDIATRICS 1 (1998). My comments also apply in a general way to the paper published in this Symposium issue, William Meadow, Operationalizing the Standard of Medical Care: Uses and Limitations of Epidemiology to Guide Expert Testimony in Medical Negligence Allegations, 37 WAKE FOREST L. REV. 675 (2002), which I

of physician surveys involving individualized hypothetical vignettes that mirror the facts of the case at bar to establish what most physicians would do in those particular scenarios. 10

These proposals are supported by two compelling rationales. First, there are serious grounds on which to criticize exclusive reliance on expert opinion testimony. The existence of a lucrative commercial market for expert witnesses, situated within an adversarial system in which the goal is for each side to present the strongest possible case, creates significant potential for deliberate or unconscious distortions of the truth. Further, even the most honest expert is prone to a range of cognitive biases. Recall bias is a particular problem: Meadow and Lantos note that, when asked about past events, experts suffer from the human tendency to "consistently underestimate large numbers, overestimate small numbers, and skew responses in favor of outcomes deemed, in retrospect, more appropriate or desirable." 11 Also troubling is the optimistic bias that people tend to exhibit in predicting outcomes. Meadow and Sunstein present credible and worrying empirical evidence that physician experts' judgments suffer from this problem. 12 These considerations should make any observer interested in possible alternatives to reliance on unsupported expert opinion for crucial legal determinations.

Second, there exists a large and growing body of data on physician practice patterns that is not presently used in malpractice litigation. Exclusive reliance on expert opinion may have been justifiable at a point in history when there was no better information available, but is more difficult to defend in an era when health services researchers are publishing empirical studies of physician behavior at a rapid pace and when medicine itself is becoming increasingly evidence-driven. 13 If there are better data, why not use them?
The Meadow and Hartz proposals do an admirable job of challenging us to take this empirical imperative seriously. The proposals are both creative and sensible, demonstrating the unique

received after this manuscript was drafted. I hereinafter refer to the arguments in the three papers collectively as "the Meadow proposal."

10. Arthur Hartz et al., Physician Surveys to Assess Customary Care in Medical Malpractice Cases, 17 J. GEN. INTERNAL MED. 546 (2002). This paper is published in revised form as Tim Cramm et al., Ascertaining Customary Care in Malpractice Cases: Asking Those Who Know, 37 WAKE FOREST L. REV. 699 (2002).

11. Meadow & Lantos, supra note 9, at 3 (citing several studies from the psychology literature).

12. See Meadow & Sunstein, supra note 9, at

13. See id. at ("The best reason for the legal system's longstanding reliance on individual recollections has been historical--the simple absence of statistical evidence. But this is a gap that is rapidly being filled, and that is likely, in the next generation, to be replaced with a great deal of reliable information.").

contributions that an interdisciplinary team of scholars can bring to critiques of the legal process. The authors also are honest about both the shortcomings of the current system and the limitations of the proposed reforms. I find the general enterprise of the proposals inspiring. Tort law should be open to innovations that may bring it closer to the elusive goals of truth-discovery and justice, and thinking about ways in which health services research might inform malpractice law is a positive step toward that goal.

But the journey from thought experiment to legal reform is a long and difficult one. Perhaps deliberately, Meadow, Hartz, and their colleagues do not take us very far along that path. In my view, the arguments presented in favor of the proposed uses of statistical evidence rely very heavily on criticisms of the present system, rather than persuasively establishing the superiority of the alternatives. This Mae West approach to legal reform 14 invites scrutiny of the proposals and further consideration of some of the complexities that would be involved in implementing them.

In this Article, I begin that task. I open by discussing some important differences in the way that legal triers of fact, physicians, and statisticians and other researchers reason through problems. The distinct meanings that the concept of significance has in medicine, research, and the law, as well as other differences in clinical, scientific, and legal thinking, pose some important challenges for expanding the use of epidemiologic evidence in malpractice trials. In Part III, I identify a range of other methodological issues raised by the proposals. These issues also have been implicated in a parallel reform effort in malpractice law, the movement to base the standard of care on clinical practice guidelines. I discuss some of the lessons of the experience with practice guidelines for implementing the Meadow and Hartz proposals.
I conclude that this experience, as well as experience with statistical evidence in other areas of tort law, argues in favor of a cautious approach in moving towards greater empiricism in establishing the malpractice standard of care. The Meadow and Hartz impulses are right, but the practical and conceptual problems involved in merging the cultures of medicine, science, and law should not be underestimated.

II. TRANSLATIONS BETWEEN LEGAL, CLINICAL, AND STATISTICAL THINKING

Proposals for the greater use of empirical evidence to prove the

14. Mae West has been quoted as saying, "When choosing between two evils, I always like to try the one I've never tried before." See, e.g., Women's Voices: Quotations by Women, at http://womenshistory.about.com/library/qu/blquwesm.htm (last visited Aug. 23, 2002).

standard of care in malpractice cases bring together the approaches to truth-discovery from three different professional traditions. The legal approach consists of the processes of reasoning and truth-seeking engaged in by judges, juries, and litigants. The medical or clinical approach refers to reasoning and practices by physicians and other care providers in the processes of diagnosing and treating patients. Finally, what I will call the research approach, for want of a better term, is the method used by epidemiologists, statisticians, social scientists, and clinical researchers to investigate research questions. 15

In this Part, I will discuss unique facets of legal, clinical, and research reasoning and proof that pose challenges for proposals to use statistical evidence to prove the legal standard of care. I begin with some general observations about the nature of reasoning in these three professional fields and proceed to a discussion of some specific implications of these differences for implementing the Meadow and Hartz proposals.

A. Reasoning in Law, Medicine, and Research

1. The Concept of Significance

One challenge to finding the appropriate use of empirical evidence in proving the legal standard of care is that litigants, physicians, and researchers mean different things when they refer to the "significance" of a finding. That there is a difference between legal, clinical, and statistical significance does not, of course, mean that evidence from medical practice or epidemiologic or social science research has no place in the courtroom. It does, however, mean that the incorporation of this type of evidence into determinations of the legal standard of care must proceed with appropriate attention to the possibility of confusion on the part of judges and jurors.
A legally significant finding can be thought of as a finding of fact or law that determines, or helps to determine, the validity of the plaintiff's claim or the defendant's defense. This definition cleaves closely to the evidentiary concepts of relevancy and materiality, or probative value. A finding is relevant if it tends "to make the existence of any fact that is of consequence to the determination of the action more probable or less probable than it would be without

15. While these four groups of academic researchers employ a diverse set of study designs to investigate questions within their particular disciplinary fields, they overlap in their approaches in two respects. First, their investigations all have a group or population focus. Second, professional norms dictate heavy reliance on a common core of established methods of proof, including the design of studies according to principles of the scientific method and analysis of data according to standard statistical methods.

the evidence." 16 In order to be relevant, the evidence must possess a certain minimum probative value in the proof of some matter. While relevancy and materiality are properties that pertain to the proof of facts, the concept of legal significance is broader in that it also applies to findings of law. For example, in a negligence case, a finding that a "special relationship" existed between plaintiff and defendant giving rise to a duty of care has clear legal significance. Simply put, it is a finding that makes a difference to the outcome of the case.

Clinical significance also pertains to determinations that make a difference to outcomes. More specifically, that term generally refers to facts about the patient that affect the physician's judgments about diagnosis, prognosis, or appropriate therapeutic options; or facts about the care provided that affect the outcome for the patient. The concept of clinical significance also encompasses notions akin to relevancy and materiality. A clinical fact is not significant unless it bears some relation to the process of care for the particular medical condition at issue. A fact also lacks clinical significance if it does not provide at least a minimal amount of explanatory or predictive power vis-a-vis diagnosis, prognosis, appropriate courses of therapy, or outcomes.

When researchers refer to a "significant" finding, they mean something altogether different. Generally, they are speaking of statistical significance. A statistically significant finding emerges out of a statistical test, the purpose of which is to assess whether the observed data are consistent with a particular hypothesis called the null hypothesis. The test yields a p-value, which represents the probability of obtaining, in a sample of the size investigated, a mean as extreme as, or more extreme than, the observed sample mean, under the assumption that the null hypothesis is true. 17 If the p-value satisfies a certain threshold agreed upon by convention (generally < 0.05), then the researcher concludes that she may reject the null hypothesis. The p-value is therefore not the probability that the null hypothesis is true or false; it describes how surprising the data would be if the null hypothesis were true.

The element of instrumentalism that is present in the legal and clinical approaches (the idea that a finding is significant because it helps the decision-maker progress towards some broader determination) is not inherent in the notion of statistical significance. To say that a finding is significant in the context of a research study simply means that the researcher is able to reject the null hypothesis, not that the finding itself (the rejection of the null hypothesis) is important.

Several specific features of the way that legal decision-makers, physicians, and researchers talk about "significant" findings further

16. FED. R. EVID.

17. MARCELLO PAGANO & KIMBERLEE GAUVREAU, PRINCIPLES OF BIOSTATISTICS 234 (2d ed. 2000). I am grateful to Beverly Mellen for supplying this reference, along with a useful clarification of this point.

illustrate the distinctions between the three kinds of significance.

2. Binary vs. Nonbinary Thinking

For juries and physicians, significance determinations are usually framed in binary terms. Jurors are asked to answer questions such as: Did a special relationship exist between the plaintiff and defendant or not? Was the standard of care breached or not? Did the plaintiff suffer a compensable injury or not? The process of differential diagnosis and treatment selection proceeds along similar lines: Physicians tackle a series of yes/no questions about the patient's symptoms and contraindications for specific therapeutic alternatives. These questions so often resemble a classic decision tree or flow chart that the diagnostic process is increasingly referred to in terms of "pathways" with well-defined decision nodes. 18 Clinical judgments about these matters are typically probabilistic, 19 meaning that they are made under uncertainty and that physicians have corresponding levels of confidence in their judgments. The level of certainty required for physicians to be comfortable reaching a clinical inference may vary dramatically depending on the nature of the decision. 20 However, while the decision-making is probabilistic in this sense, the decisions themselves are often dichotomous: diagnose the patient with Disease A or Disease B, recommend Treatment X or Treatment Y.

Researchers, on the other hand, tend to think of both judgments and outcome variables as continuous, rather than dichotomous. With respect to judgments, they are seldom called upon to make yes/no decisions. In contrast, juries and physicians are both compelled to make yes/no judgments in the face of uncertainty. If a jury cannot answer "yes" to a question about the defendant's liability or guilt at the required level of confidence, then the answer it must give is "no."
Triers of fact in tort cases have a single standard of proof, a preponderance of evidence, that is required regardless of the particular nature of the plaintiff's situation or the amount of money or severity of injury involved. Physicians make decisions over a wider range of confidence levels, and may couch their judgments in terms of a level of certainty, rather than simply saying yes or no, but they must still reach a conclusion and act on it. Their judgment is their best guess, and may or may not represent a

18. See RAYMOND S. GREENBERG ET AL., MEDICAL EPIDEMIOLOGY (2d ed. 1996); Dominick A. Rizzi, Causal Reasoning and the Diagnostic Process, 15 THEORETICAL MED. 315, 323 (1994); Alison Round, Introduction to Clinical Reasoning, 7 J. EVALUATION CLINICAL PRAC. 109, 110 (2001).

19. See John M. Eisenberg, What Does Evidence Mean? Can the Law and Medicine Be Reconciled?, 26 J. HEALTH POL. POL'Y & L. 369, 375 (2001).

20. For instance, in recommending a course of treatment that involves risk to the patient, a physician may feel that one level of confidence in the efficacy of the treatment is required for a healthy patient, but a much lower level will suffice for a critically ill patient who has no other options.

certain threshold level of certainty.

Researchers' judgments are different. There is no requirement that they reach a yes/no decision about the question they are investigating; rather, they may report evidence that is "suggestive" of an inference but decline to draw a firm conclusion. Particularly in the realm of clinical research, the process of statistical proof of a proposition may be lengthy and evolutionary. Exploratory research begins with samples that may be too small or unrepresentative to give rise to conclusive proof, but that provide preliminary evidence in one direction. These early findings provoke other scientists to pursue more carefully designed (and expensive) investigations. The base of evidence improves in both quality and quantity over time, until finally a sufficient evidence base from well-controlled studies is accumulated for the scientific community to issue a reasonably authoritative pronouncement about the relationship under study. 21 Thus, while juries and physicians are called upon to make on-the-spot conclusive judgments, researchers have the luxury of inching toward judgments over time in conjunction with others in their scientific community. Only when reliable evidence from well-designed studies demonstrates a finding at a statistical significance level of 95% or higher will researchers pronounce a firm conclusion. 22 Any finding below this standard generally results in rather equivocating talk of a "trend" in the results.

The distinction between legal and research judgments is highlighted by an observation by the epidemiologist Bruce Charlton: "[E]pidemiological claims of causation cannot be refuted by any single, crucial contradictory item of evidence, no matter how strong or well replicated that counterevidence may be...
Contradictory findings cannot do more than alter the balance of probability of multifactorial epidemiological causation." 23 The reason is that a conclusion in epidemiology is probabilistic in nature, rather than yes/no, and is reached through consideration of a "mosaic" of pieces of evidence that together convey a broader picture, rather than a chain of evidence that is broken when one of its links wears thin. 24

21. For a concrete example of this process involving research into the effectiveness of high-dose chemotherapy plus autologous bone marrow transplant for the treatment of breast cancer, see Michelle M. Mello & Troyen A. Brennan, The Controversy Over High-Dose Chemotherapy With Autologous Bone Marrow Transplant for Breast Cancer, 20 HEALTH AFF. 101, (2001).

22. The 95% convention is attributed to the influence of British statistician Sir R.A. Fisher, though evidence suggests that Fisher's practice was to report results with p-values, rather than conclusory significance judgments. D.H. Kaye, Is Proof of Statistical Significance Relevant?, 61 WASH. L. REV. 1333, n.53 (1986) (citing R.A. Fisher, The Arrangement of Field Experiments, 33 J. MINISTRY AGRIC. GR. BRIT. 504 (1926)); Leonard J. Savage, On Rereading R.A. Fisher, 4 ANNALS STAT. 441, (1976).

23. Bruce G. Charlton, Attribution of Causation in Epidemiology: Chain or Mosaic?, 49 J. CLINICAL EPIDEMIOLOGY 105, 106 (1996).

24. Id.
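The way a p-value is set against the conventional threshold can be illustrated with a short sketch. It is offered only as an illustration under stated assumptions: a two-sided z-test with a known standard deviation stands in for whatever test a given study would actually use, and every number in it (the means, the standard deviation, and the sample sizes) is hypothetical rather than drawn from any study cited here.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """Probability, if the null hypothesis is true, of a sample mean at
    least as far from null_mean as the one observed (z-test, known sigma)."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Area in both tails of the standard normal distribution beyond |z|.
    return 2 * (1 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # the conventional threshold discussed in the text

# Hypothetical figures: the same observed difference at two sample sizes.
p_small = two_sided_p_value(sample_mean=5.3, null_mean=5.0, sigma=1.0, n=40)
p_large = two_sided_p_value(sample_mean=5.3, null_mean=5.0, sigma=1.0, n=50)

for label, p in (("n = 40", p_small), ("n = 50", p_large)):
    verdict = "reject the null" if p < ALPHA else "report only a 'trend'"
    print(f"{label}: p = {p:.3f} -> {verdict}")
```

With these made-up figures, an identical observed difference falls on opposite sides of the 0.05 line depending only on how many subjects were sampled, which is one reason a researcher's "significant" and a juror's "significant" cannot be treated as interchangeable.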

Contrast this to a legal case, in which a single piece of evidence can exculpate the defendant because it conclusively negates an element of the plaintiff's claim.

The binary/nonbinary distinction is also relevant to the way in which juries, doctors, and researchers think about outcome variables. As noted earlier, for juries, outcomes are almost always binary: guilty or not guilty; liable or not liable. 25 For physicians, too, decisions are often binary: cancer or no cancer; surgery or no surgery. 26 In contrast, statisticians tend to think of outcomes in terms of distributions. Even if the variable is investigated and statistically modeled as a binary variable, the statistician conceives of it as a representation of some other underlying construct that is continuous. For example, a patient's decision to see a doctor about her earache is a manifestation of a judgment about the utility of seeing a doctor. 27 More specifically, if we observe the patient choosing to see a doctor, then we know that she derives greater expected utility from the decision to see a doctor than she does from the decision not to see a doctor. The difference in the expected utilities is an unobservable continuous variable. 28 What we would like to be able to measure is this continuous variable. But what we can observe is only whether that variable takes on a value greater than zero or less than zero. The statistical model thus has a dichotomous dependent variable, and the association between that dichotomous outcome and the explanatory variables in the model is only an approximation of the relationship we might see if we were actually able to observe, measure, and model the underlying construct.

The gap between binary and nonbinary thinking has some important implications for implementing the Meadow and Hartz proposals, having to do with the difficulties of shoehorning epidemiologic evidence into a legal determination.
Before proceeding to a discussion of these implications, though, it is worth

25. Where comparative negligence is asserted, the liability finding is released from its classic binary structure and conceived of along a continuum.

26. See GREENBERG ET AL., supra note 18, at (presenting a concrete example involving the differential diagnosis and treatment of alcoholic hepatitis and cholangitis).

27. See generally R. Duncan Luce & Patrick Suppes, Preference, Utility, and Subjective Probability, in HANDBOOK OF MATHEMATICAL PSYCHOLOGY 249 (R. Duncan Luce et al. eds., 3d ed. 1965) (describing the economic theory of consumer choice); Daniel McFadden, Conditional Logit Analysis of Qualitative Choice Behavior, in FRONTIERS IN ECONOMETRICS 105 (Paul Zarembka ed., 1974) (placing consumer choice theory in an econometric framework).

28. Formally stated, let $U_{i1}$ = expected utility that individual $i$ derives from alternative 1 and let $U_{i2}$ = expected utility that individual $i$ derives from alternative 2. The unobservable variable $y_i^*$ represents the difference in the utilities: $y_i^* = U_{i1} - U_{i2}$. If we observe the individual choosing alternative 1, we know that $U_{i1} > U_{i2}$, so $y_i^* > 0$. If the individual chooses alternative 2, we know that $U_{i1} < U_{i2}$, so $y_i^* < 0$.

considering some additional disparities between legal, clinical, and statistical thinking.

3. Individualized vs. Group Assessments

A further challenge to bridging legal, clinical, and statistical reasoning is agreeing on what the unit of analysis is. Physicians operate in a dyadic setting: they make decisions with respect to a single unique patient and need not be concerned with the generalizability of their inferences. That is, an oncologist's decision that surgery is the best treatment for Mr. Jones's tumor in no way constrains her ability to make treatment choices for future patients. Her future decisions will be informed by past experience, but, to invoke legal terminology, she reviews each case de novo. Judges also issue individualized rulings tailored to the facts of each case. However, their determinations are clearly informed by the consciousness that each decision sets a precedent that will influence the resolution of future cases. Thus, while the particular unit of analysis is the individual case, the determination is constrained by what has come before and what will come after.

Research determinations are different in two crucial respects. First, the unit of analysis is always the group, rather than the individual. Second, it is not acceptable for the outcome of the research to be influenced by prior research results or by anxieties about the possible impact on future research or policy. Addressing the second point first, scientists must be careful to design their studies so that no external influences can affect the results. The use of randomized, controlled trial designs with double blinding is favored because researchers' preconceptions about whether or not the treatment is effective are less likely to influence the allocation of subjects into study groups, interactions with subjects, and measurement and recording of findings. 29
With respect to the first point, the process of scientific investigation involves the measurement and comparison of two samples or groups. All inferences are drawn with respect to the group, not with respect to particular individuals within the group. A finding that at the group or population level there is an association between risk factor X and disease Y absolutely does not support the inference that for a given individual who was exposed to X and developed Y, it was the exposure that led to the disease.' This false assumption is known to researchers as the "ecologic 29. In a double-blinded trial, neither the investigator nor the patient is aware of whether the patient has been randomized to the treatment group or to the control group. GREENBERG ET AL., supra note 18, at See Charlton, supra note 23, at 105 ("The imputed causal association [in epidemiology] is at the group level, and does not indicate the cause of disease in individual subjects.").

12 WAKE FOREST LAW REVIEW [Vol. 37 fallacy." 3 Courts considering epidemiologic studies in toxic-tort cases have dealt with the ecologic fallacy by requiring plaintiffs to prove causation through two distinct steps. 32 First, the plaintiff must establish general causation by invoking evidence from epidemiologic studies to prove that the chemical or environmental condition in question is associated, on a group level, with effects of the kind the plaintiff claims to suffer. 33 Second, the plaintiff must prove specific causation by showing that the chemical or condition in question was the actual cause of the instant plaintiffs injury. 34 Researchers are content to call it a day after completing step one. This is a fundamental difference between research and legal or clinical decision-making. Lawyers and physicians have to go the extra mile and apply information from generalized research settings to particular plaintiffs or patients. This additional task is something that makes researchers squirm. 4. Correlation vs. Causation Closely related to the individual/group distinction is the distinction between causation and correlation. Clinicians and juries are interested in causation, and then only in a particular kind of causation: specific or proximate causation. The question they seek to answer is, "What is the cause of this plaintiffs troubles?" In contrast, researchers' interest is broader. Their goal is often to obtain persuasive evidence of a causal process, but often they are content to discover and report mere correlations or even just variation among members of a population or between populations. Their question is, "What characteristics does this group of people afflicted with condition Y share that are different in people who do not have condition Y?" Researchers recognize that statistical methods are inherently incapable of affirmatively demonstrating causation-especially specific and proximate causation. 
The best that they can hope to do is establish a high probability that a randomly selected case of condition Y is one that would not have occurred without exposure to agent X. Epidemiologists have developed a set of criteria for determining whether an observed association rises to the level of a causal relationship, 35 but these criteria apply only at the group 31. See Leon Gordis, Scientific Methodology and Epidemiology, 9 KAN. J.L. & PUB. POLY 89, 100 (1999). 32. See Raphael Metzger, Epidemiology Can Be Your Friend: Using Epidemiology in the Courtroom, 2 ATLA ANN. CONVENTION REFERENCE MATERIALS 2815, 2815 (2001), WL 2 ANN.2001 ATLA-CLE Id. 34. Id. 35. See Paolo Vineis, Causality Assessment in Epidemiology, 12 THEORETICAL MED. 171, (1991) (reviewing the classic nine-point guidelines by Sir Austin Bradford Hill); see also Charlton, supra note 23, at 105

(general causation) level. When an epidemiologist talks about having proved causation, she does not mean quite the same thing as a lawyer. Both may say they have established a "mechanism" of causation, but the epidemiologist likely means a biological mechanism or dose-response relationship, while the lawyer knows she must prove that the particular dose to which the plaintiff was exposed was sufficient to trigger the response. 36 This difference, which is both linguistic and epistemic, does not bear directly on the question of how to prove the standard of care, but it further illustrates the breadth of the gap between legal and scientific thinking. To return to the Meadow and Hartz proposals, what are the implications of these several differences in processes of legal, clinical, and statistical reasoning?

B. Implications for the Proposals

1. Where Do You Draw the Line?

The fact that researchers tend to think of variables and judgments as falling along a continuum, while legal and clinical decision-makers tend to think in binary terms, creates some difficulties for the Meadow and Hartz proposals. One question is this: How do you create a binary breach-of-duty variable from an empirically-derived distribution of clinical behaviors? Wherever possible, researchers will try to model the "true" (continuous) outcome variable, rather than a binary representation. Thus, with respect to the central question put to the expert in a malpractice case (do you think the treatment rendered by the defendant was medically appropriate?), researchers' instinct would be to conceive of appropriateness judgments as continuous rather than binary. That is, they would ask survey respondents to rate appropriateness along a Likert scale or using some other score that is sensitive to variations in the strength of respondents' judgments of appropriateness. Litigants and jurors do not think this way.
They want to know: Was it appropriate or wasn't it? For this reason, the binary/nonbinary distinction can lead to problems translating research into legally meaningful evidence. David Faigman has described this phenomenon in relation to the use of psychology research in policymaking: (listing other suggested criteria). 36. See, e.g., Sutera v. Perrier Group of Am., 986 F. Supp. 655, (D. Mass. 1997) (rejecting epidemiologic evidence of a link between benzene and leukemia because the existence of a biological mechanism and dose-response relationship established neither that the threshold of exposure was necessary to trigger cancer, nor that the plaintiff had exposure at or above this level).

14 834 WAKE FOREST LAW REVIEW [Vol. 37 [N]oteworthy problems arise in many cases due to differences in sensitivity between the legally relevant outcome measure and alternative measures selected by social scientists. For example, lawmakers may want to know whether the number of guilty verdicts is affected by removing jurors from the guilt phase of capital cases because they would refuse to vote to impose the death penalty upon finding the defendant guilty. Psychologists, however, sometimes avoid dichotomous variables like "guilty/not guilty," preferring to substitute variables more sensitive to effects, such as multi-valued scales. While psychological research should not be faulted for its adaptation of psychological methods to legal questions, the lessons drawn from such research nonetheless must be tempered by the recognition that sometimes psychologists ask their subjects questions that are different from those which the lawyers had previously asked the psychologists. The problem Faigman describes is likely to occur if litigants in a malpractice case attempt to utilize previously conducted studies of physician practice patterns to lead juries to a liable/not liable judgment. The variable definitions in these studies frequently are modeled as ordinal categorical or continuous variables which may relate only in a general way to the binary question of liability. The problem would be less likely to occur if litigants in a malpractice case commissioned a custom survey instead of relying upon previously published research. But even with a custom survey, the problem may arise because of the need to accommodate multiple aspects of care in a given situation. A constellation of acts and omissions may be alleged to have led to the plaintiffs injury. Somehow, the researcher has to distill these into a binary judgment about whether or not the standard of care was breached. How, as a practical matter, would this be done? 
A common approach of researchers to such problems is to develop a multi-attribute index score. Data are gathered on an array of variables thought to make up some larger concept, and these data are then combined to produce an overall score. A prominent example of this practice in health service research is the use of "propensity scores" reflecting a patient's propensity to consume health services based on sociodemographic characteristics, financial variables, and variables reflecting the supply of services in the patient's geographic area. Many of the component variables are binary (e.g., Is the patient male? Does the patient have health insurance?), but the overall propensity score is a continuous variable. This is the methodology that would likely be followed in complex malpractice cases where several things appear to have gone 37. David L. Faigman, To Have and Have Not: Assessing the Value of Social Science to the Law As Science and Policy, 38 EMoRY L.J. 1005, 1064 (1989) (citations omitted).

wrong in the patient's care experience. In a missed or delayed diagnosis case, for instance, relevant variables may include: whether or not a particular diagnostic test was ordered; time to test order; time to test result; whether or not an appropriate history was taken; whether or not an appropriate physical examination was performed; whether or not specialist consultations were obtained; and time to specialist consultation. A researcher following the Meadow or Hartz methodologies would find answers to these individual questions (using evidence from chart reviews or physician surveys, respectively) and then derive an overall index score reflecting intensity of care. This methodology provokes two questions. First, how does one create this index score? If the component variables are all binary variables, one could simply add them up. But this raises the question of whether all of the components should be weighted equally in terms of their importance to our overall judgment about the adequacy of care provided. If they are not equal, how should weights be calculated? If one is dealing with a mix of binary and nonbinary variables, then the process is even more complex. The second question is what to do once the index score has been derived and plotted along a distribution. Meadow and Lantos assume that the distribution will be normal, but there is no particular reason to believe this will be the case. Even if it is, how does one draw a line dividing the curve into sections representing care that does and does not violate the legal standard of care? Meadow and Lantos cheerfully forecast that substandard care will "fall[] out neatly as behaviors lying outside the large majority of cases. Juries would be empowered (as they are currently) to determine exactly where on this curve substandard care lies..." 38 Thus, a jury presumably would be handed a charge like the following and asked to return answers: 1.
What point(s) on this curve represent the divide between care that does and does not fall below the customary standard of care for a physician in the defendant's specialty practicing in a similar community? (Draw a vertical line across one or more points on the horizontal axis)

[Figure: distribution curve; horizontal axis: care intensity score]

38. Meadow & Lantos, supra note 9, at 1.

2. Did the defendant's care fall within the portion(s) of the curve representing substandard care? (Answer yes or no)

It is hard to imagine that jurors would know what to make of this. Even an expert would have a difficult time drawing lines here, because there is no consensus about what percentage of practicing physicians it takes to generate a custom. An expert might be attracted to the notion of referring to one standard deviation below and above the mean score (carving out a group of sixty-eight percent of responses, for a normal distribution), but this choice would be arbitrary from a legal standpoint. In other contexts, the Supreme Court has opted not to mandate a line-drawing approach based on standard deviations. 39 In short, the cognitive task associated with the Meadow and Lantos proposal is much more demanding than the authors contemplate. Meadow and Lantos reasonably might respond that I have raised an extreme example to prove a point here. Many determinations in malpractice cases are simpler than the multi-attribute determination I discussed above. Many cases rise and fall on whether the physician failed to administer one particular test, for example, or whether he waited too long to administer a particular treatment. I will argue that the line-drawing problem persists even in these simple cases. Even in these cases, it is difficult to distinguish between statistical significance and clinical significance. Furthermore, it is unclear what the relationship between statistical significance and legal significance should be. I address each of these points in turn, using Meadow and Sunstein's useful example of a case of delayed administration of antibiotics to a child with bacterial meningitis.

2. What Is "Significant" Variation in Clinical Practice?
Meadow and Sunstein present the results of a medical chart review through which they found that the average time between initial presentation in the emergency room and administration of antibiotics for children diagnosed with bacterial meningitis at hospitals in Chicago, South Carolina, and California was 120 minutes." The authors also surveyed two groups of physician experts about their estimates of the timing of antibiotic 39. See Laurens Walker & John Monahan, Social Facts: Scientific Methodology As Legal Precedent, 76 CAL. L. REV. 877, (1988). Walker and Monahan note that in Castaneda v. Partida, 430 U.S. 482, 496 n.17 (1977), and Hazelwood v. United States, 433 U.S. 299, 309 n.14 (1977), the Court declined to establish "two or three standard deviations" as the standard of statistical proof of discrimination. In Hazelwood, the Court suggested that while the standard-deviation approach was reasonably precise, a Bayesian analysis might be more precise. 433 U.S. at 311 n See Meadow & Sunstein, supra note 9, at 638.

administration for children who present in the emergency room with bacterial meningitis. 41 They found that the experts' time estimates were much lower: forty-six minutes and eighty minutes for pediatric infectious disease specialists and emergency room physicians, respectively. 42 The authors do not present a statistical test of the magnitude of the difference between the expert judgments and actual treatment times (or information sufficient for the reader to conduct one), but the difference looks big. How would a plaintiff show how big? One approach would be to use a one-sample t-test to demonstrate where the defendant physician falls in the distribution of physicians in his community. Let us suppose that the defendant took 140 minutes to administer antibiotics to the plaintiff. The null hypothesis that the plaintiff would like to be able to reject is that this is no different from what most physicians would take; that is, that the population mean $\mu_0$ is equal to 140. The alternative hypothesis is that the defendant took too long; that is, $\mu_0 < 140$. The results of the chart review study indicate that the mean time for the sampled patients was 120 minutes ($\bar{x} = 120$). A t statistic could then be computed 43 and compared to the distribution of the t statistic on n-1 degrees of freedom. Using a one-tailed test 44 and a significance level of 0.05, the absolute value of the computed t statistic greatly exceeds the critical value. 45 Thus, the plaintiff would conclude that the null hypothesis could be rejected, and the difference between the mean time to antibiotic administration in Chicago hospitals and the time it took the defendant to give antibiotics is statistically significant at the 0.05 level. This sounds like compelling evidence of negligence. Statistical significance at this level means that the defendant was very far outside the range of practice of most of the physicians sampled in the chart review.
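To make the arithmetic of this hypothetical concrete, the short Python sketch below computes the one-sample t statistic and compares it with the approximate one-tailed critical value cited in footnote 45. It relies on the assumptions stated in footnote 43 (a sample standard deviation of 20 minutes and a sample of 93 charts), which are illustrative values rather than figures reported by Meadow and Sunstein. The function and variable names are hypothetical.

```python
import math


def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """Return the one-sample t statistic: (x_bar - mu_0) * sqrt(n) / s."""
    return (sample_mean - hypothesized_mean) * math.sqrt(n) / sample_sd


# Values from the hypothetical in the text; s and n are assumptions (footnote 43).
sample_mean = 120.0        # mean minutes to antibiotics in the chart review
hypothesized_mean = 140.0  # null hypothesis: population mean equals 140 minutes
sample_sd = 20.0           # assumed standard deviation of the sample times
n = 93                     # assumed number of charts reviewed

t_stat = one_sample_t(sample_mean, hypothesized_mean, sample_sd, n)

# Approximate one-tailed critical value at the 0.05 level (footnote 45).
critical_value = 1.65

# The alternative hypothesis is directional (mean below 140), so the null is
# rejected when t falls below the negative of the critical value.
reject_null = t_stat < -critical_value

print(f"t = {t_stat:.2f}, reject null: {reject_null}")
```

Under these assumed inputs the statistic is about -9.64, far beyond the critical value. A different assumed standard deviation or sample size would change the result, so the conclusion is only as good as those inputs.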
However, the statistical significance determination leaves an important question unanswered: So what? What is the clinical significance of this difference in time-to-antibiotic-administration? In this context, how long is too long to wait? Not all statistically significant differences in care lead to 41. Id. at Id. at 638. 43. The formula for the t statistic is $t = \frac{(\bar{x} - \mu_0)\sqrt{n}}{s}$, where s is the standard deviation of the sample times and n is the sample size. If we assume s=20 and n=93, then $t = \frac{(120 - 140)\sqrt{93}}{20} \approx -9.64$. 44. A one-tailed test is appropriate where, as here, the alternative hypothesis is directional in nature. 45. For n=93, the critical value is approximately 1.65.

18 838 WAKE FOREST LAW REVIEW [Vol. 37 appreciably different clinical outcomes." If there is no impact on clinical outcomes, the differences in clinical practice would seem to lack the elements of relevancy and materiality required for them to have legal significance. Thus, practice differences that are statistically meaningful may not be clinically meaningful, and if they are not clinically meaningful, it is difficult to see why they should be legally meaningful-that is, why they should constitute breaches of the legal standard of care. The objection I am raising here is by no means a devastating one. It is really just an observation about the difference between showing a breach of duty and proving causation. Meadow and Sunstein's evidence may establish that the physician did not do as most physicians do. However, it does not bear on the question of whether that deviation led to the plaintiffs injury. For that, additional evidence that the difference between the physician's behavior and the customary practice had clinical significance will be required. In principle, it would be possible to use another set of empirical studies to show that particular practice patterns are associated with differences in clinical outcomes. For example, the plaintiff could offer evidence from chart reviews showing that children with bacterial meningitis who received antibiotics within ninety minutes of presenting in the emergency room had significantly lower morbidity and mortality than those who did not receive antibiotics for more than two hours. This proof of general causation would need to be supplemented with expert opinion testimony to establish specific causation with respect to the instant plaintiffs injury. In practice, because empirical studies that fit the facts of the plaintiffs experience may not be available, plaintiffs are likely to rely on expert opinion testimony to prove both general and specific causation in malpractice cases. 
Unfortunately, that testimony is of precisely the kind that Meadow and Sunstein criticize-it calls for the expert to make a prediction. As the authors describe, rather serious cognitive problems and heuristic biases may attach to clinical predictions. 47 Because the expert in a malpractice case is required to make a counterfactual prediction, speculating about what would likely have 46. Cf GREENBERG ET AL., supra note 18, at 169 (distinguishing between statistical significance and clinical significance in the context of a research finding of a statistical association between dietary fat and breast cancer). 47. See Meadow & Sunstein, supra note 9, at 634 (noting that "'[v]irtually all' of the existing studies of physicians 'have documented frequent and large errors in predictions.' No study finds a high level of accuracy. The errors tend in a particular direction: 'physicians are prone to an optimistic bias.'" (citations omitted)); see also Round, supra note 18, at 110 (describing several forms of bias including: overconfidence; overweighting of small probabilities, losses, and potential regrets; asymmetric risk aversion with respect to losses and gains; and inconsistency in preference sets).

happened if the physician had acted differently than she did, the cognitive task is especially difficult and especially vulnerable to bias and error. The upshot is that the invocation in malpractice litigation of statistically significant differences between the defendant's behavior and the behavior of other physicians does not get one very far; most of the aspects of the expert's role that make Meadow and Sunstein uncomfortable persist.

3. Does Legal Significance Require Statistical Significance?

Another question raised by the Meadow and Hartz proposals is whether empirical findings must achieve statistical significance in order to be accepted by courts as satisfying the legal burden of proof. To use the time-to-antibiotic-administration example, should a plaintiff in a malpractice case have to demonstrate that the difference between the time it took the defendant-physician to administer antibiotics and the mean time-to-administration among physicians in his specialty and community is of sufficient magnitude to achieve statistical significance? Statisticians conventionally require that effects be significant at the 95% level or higher (p < 0.05). 48 The fact that courts in civil cases think about the preponderance-of-the-evidence standard as requiring proof at around the 51% level has created some confusion about how to treat this statistical convention when the case involves epidemiologic or social science research. An example of this confusion is the following:

[I]n civil cases the legal system places the "burden of proof" on the plaintiff. In order to prevail the plaintiff must prove his claim by a "preponderance of the evidence"-in other words, more than 50 percent of the evidence must be in the plaintiff's favor. Science under the circumstances would be neither willing nor able to declare a winner.
Administrative decisions generally call for a lower standard of proof, whereas criminal trials demand something closer to scientific certainty ("beyond reasonable doubt"). 49 This passage reflects a misguided conflation of legal adjudicators' confidence levels and the concept of statistical significance. 50 Implicitly, the author equates the preponderance 48. See GREENBERG ET AL., supra note 18, at SHEILA JASANOFF, SCIENCE AT THE BAR: LAW, SCIENCE, AND TECHNOLOGY IN AMERICA 10 (1995). 50. Professor Michael Green has collected a large number of other instances of this kind of confusion, by courts as well as commentators. Michael D. Green, Regulating Toxic Substances: A Philosophy of Science and the Law, 37 JURIMETRICS J. 205, 221 n.67 (1997) (book review) [hereinafter Green, Regulating Toxic Substances] (citing several examples); from Michael D. Green, Professor, Wake Forest University School of Law, to Michelle Mello, Assistant Professor of Health Policy and Law, Harvard School of Public Health

20 WAKE FOREST LAW REVIEW [Vol. 37 standard with a p-value of around 0.50 and the criminal standard with a p-value of around The confusion is perhaps wrought by statisticians' unfortunate tendency to refer to p-values as "confidence levels." 51 In fact, p-values do not represent confidence levels in the same sense that legal evidentiary standards do. The preponderance standard calls for a determination of whether, considering all the evidence before it, the judge or jury can conclude that at least 51% of the evidence falls on the side supporting the plaintiffs proposition. Statistical significance levels determine whether there is any reason to consider a particular analysis of data as evidence of the existence of the proposition under study. In this way, a p-value looks more like an admissibility decision than a verdictive decision. If the finding is statistically significant, the researcher can "admit" the evidence to the pool of scientific knowledge supporting the proposition. If it is not significant (or close), then it is not appropriate to put that finding in the pot. But it is also not appropriate to conclude that because the p-value is below the required level, the weight of the evidence shows that the proposition is false. A p-value of 0.49 does not mean that 51% of the evidence points to the falsity of the proposition. Some courts have recognized this distinction, but still have struggled with how to shoehorn evidence of statistical significance or nonsignificance into the process of proving a legal claim through a (Apr. 8, 2002) (on file with author) (citing Almeida v. Sec'y of the Dep't of Health & Human Servs., No V, 1999 U.S. Claims LEXIS 294 (Fed. Cl. 1999); RICHARD GOLDBERG, CAUSATION AND RISK IN THE LAW OF TORTS: SCIENTIFIC EVIDENCE AND MEDICINAL PRODUCT LIABILITY (1999); Erica Beecher-Monas, Blinded by Science: How Judges Avoid the Science in Scientific Evidence, 71 TEMP. L. REV. 
55, 71 n.110 (1998) (citing Richard Cranor "[for a discussion of the appropriateness of applying the 95% confidence interval to the regulatory context and tort law"); Jeff L. Lewin, The Genesis and Evolution of Legal Uncertainty About "Reasonable Medical Certainty," 57 MD. L. REV. 380, 400 (1998); Andrew A. Marino & Lawrence E. Marino, The Scientific Basis of Causality in Toxic Tort Cases, 21 U. DAYTON L. REV. 1, n.57 (1995); Leslie J. Sheffield & Ron Batagol, The Creation of Therapeutic Orphans-Or, What Have We Learnt from the Debendox Fiasco?, 143 MED. J. AUSTL. 143, 146 (1985); William M. Sage, Lessons from Breast Implant Litigation, 15 HEALTH AFF. 206, 209 (1996) (book review) (stating that the preponderance of the evidence standard "suggests a p-value of roughly 0.49 (or a 51 percent confidence interval)"); William Glaberson, The Courts vs. Scientific Certainty, N.Y. TIMES, June 27, 1999, 4 (Magazine), at 5 ("Science, which never stops searching for answers, has a high threshold for reaching conclusions: 95 percent certainty, some scientists say, is necessary to decide that one thing probably caused another. But the law must stop its search at the conclusion of each case. So juries in civil cases are told that a mere preponderance of the evidence-51 percent-is enough certainty to render a verdict.")). 51. See Kaye, supra note 22, at 1349 n.78 ("When a confidence interval is used in court,... it should not be denominated a 'confidence' interval because the confidence coefficient does not equal the subjective confidence that one should have in the truth of a relevant proposition.").