Linear regression analysis of censored medical costs




Biostatistics (2000), 1, 1, pp. 35–47
Printed in Great Britain

D. Y. LIN
Department of Biostatistics, Box 357232, University of Washington, Seattle, WA 98195, USA
danyu@biostat.washington.edu

SUMMARY

This paper deals with the problem of linear regression for medical cost data when some study subjects are not followed for the full duration of interest so that their total costs are unknown. Standard survival analysis techniques are ill-suited to this type of censoring. The familiar normal equations for the least-squares estimation are modified in several ways to properly account for the incompleteness of the data. The resulting estimators are shown to be consistent and asymptotically normal with easily estimated variance–covariance matrices. The proposed methodology can be used when the cost database contains only the total costs for those with complete follow-up. More efficient estimators are available when the cost data are recorded in multiple time intervals. A study on the medical cost for ovarian cancer is presented.

Keywords: Censoring; Cost analysis; Economic evaluation; Health economics; Incomplete data; Medical care; Survival analysis.

1. INTRODUCTION

The escalating cost of health care has posed serious challenges in the United States and other countries. The cost of intensive chemotherapy with peripheral stem cell support for immune function in stage II breast cancer approaches $100 000 in some treatment centers, and this cost must be balanced against the potential expense of salvage therapy for later recurrent disease. The initial costs of screening for various forms of cancer are reasonably well understood, but the exact costs of long-term care in this and other chronic diseases are largely unknown. Accurate understanding of the costs associated with alternative therapies can lead to substantial cost savings. Clinical trial data as well as patient series data from medical centers, disease registries and insurance companies present excellent opportunities to evaluate the cost of medical care.
Many multi-center clinical trials groups, such as the Eastern Cooperative Oncology Group, have formed special research teams to study medical cost so that the clinical benefit of newer therapies can be evaluated in light of cost.

© Oxford University Press (2000)

Despite the tremendous interest in the evaluation of medical cost, there has been little progress in the development of formal statistical methods for such evaluation. The main difficulty lies in the incompleteness of the available data. In long-term clinical or observational studies to collect cost data, it is inevitable that some patients are not followed until the endpoint of interest so that their medical costs are not fully observed. This phenomenon is referred to as censoring, which is well known for survival time data. Statistical methods for handling censoring in survival data have been well developed. Due to the similarity between censored medical costs and censored survival times, a number of authors have suggested that standard survival analysis methods, such as the Kaplan–Meier estimator and Cox regression, be used to analyse censored medical costs. As pointed out by Lin et al. (1997), however, this strategy is false because the inherent patient heterogeneity with respect to cost accumulation entails that the cumulative cost at the censoring time is positively correlated with the cumulative cost at the endpoint of interest even if the underlying censoring mechanism is purely random.

Lin et al. (1997) developed non-parametric methods for estimating the mean total cost based on censored data. Several unpublished reports have provided refinements of and alternatives to Lin et al.'s estimators. All these efforts, however, are confined to the one-sample problem. Currently, there does not exist any valid regression method for assessing the effects of covariates (e.g. therapies, insurance plans and patients' characteristics) on medical cost with censored data. The regression methodology would be particularly valuable in identifying cost-effective intervention/prevention programs. A more specific application of such a methodology is to devise risk-adjusted payment systems for Medicare or insurance companies which would reduce the incentives for hospitals to use unnecessarily expensive therapies and at the same time would avoid penalizing the hospitals that serve patients requiring more intensive care.

In the next section, we develop simple and valid methods for fitting linear regression models to censored cost data. In Section 3, we evaluate the performance of these methods by Monte Carlo simulation. In Section 4, we illustrate the proposed methods with the ovarian cancer study initially reported by Lin et al. (1997). Most of the technical material is relegated to the Appendix.

2. METHODS

2.1. Basic ideas

Suppose that one is interested in the total medical cost over the time period $[0, \tau]$, which may be 1 year or 5 years. For technical reasons, $\tau$ is no larger than the overall length of the study; without making stringent parametric assumptions, it would not be possible to estimate the total cost over $[0, \tau]$ if no subject were followed for $\tau$ years. Naturally, there is no further accumulation of medical cost after death.
Thus, the total cost over $[0, \tau]$ is the same as the cumulative cost at $T^* = \min(T, \tau)$, where $T$ is the survival time. If one is interested in the lifetime cost, then $\tau$ has to be larger than the support of the distribution of $T$. Let $Y$ be the cumulative cost at $\tau$ or $T$, whichever comes first, and let $Z$ be a $p \times 1$ vector of covariates whose effects on $Y$ are of interest. It is convenient to relate $Y$ to $Z$ through the linear regression model

$$Y = \beta' Z + \epsilon, \tag{1}$$

where $\beta$ is a $p \times 1$ vector of unknown regression parameters, and $\epsilon$ is a zero-mean error term with an unspecified distribution. We set the first component of $Z$ to 1 so that the first component of $\beta$ corresponds to the intercept.

As mentioned in Section 1, survival time and medical cost are normally subject to right censoring. Let $C$ be the censoring time. Write $X = \min(T, C)$, $\delta = I(C \ge T)$, and $\delta^* = I(C \ge T^*)$, where $I(\cdot)$ is the indicator function. Clearly, $Y$ is known if and only if $\delta^* = 1$, whereas $T$ is known if and only if $\delta = 1$. A subject whose survival time is censored at or after $\tau$ has a complete observation on the medical cost over $[0, \tau]$, while a subject whose survival time is censored before $\tau$ has an incomplete observation.

Suppose that we have a random sample of size $n$. For $i = 1, \ldots, n$, the variables corresponding to the $i$th subject are indexed by the subscript $i$. In this subsection, we assume that censoring occurs in a completely random fashion such that $C$ is independent of all other random variables. This assumption holds if, for example, all the censoring is caused by study termination, which is the so-called administrative censoring. Let $G(t) = \Pr(C \ge t)$, and let $\hat G(t)$ be the Kaplan–Meier estimator of $G(t)$ based on the data $(X_i, \bar\delta_i)$ $(i = 1, \ldots, n)$, where $\bar\delta_i = 1 - \delta_i$. We consider more general censoring mechanisms in the next subsection.

If no survival times are censored before $\tau$, then $\beta$ may simply be estimated by the least-squares normal equation

$$\sum_{i=1}^n (Y_i - \beta' Z_i) Z_i = 0. \tag{2}$$

In practice, there is some censoring before $\tau$. As mentioned above, $Y_i$ is known if and only if $\delta_i^* = 1$. Because $E\{\delta_i^* / G(T_i^*)\} = 1$, it seems natural to modify (2) as

$$\sum_{i=1}^n \frac{\delta_i^*}{G(T_i^*)} (Y_i - \beta' Z_i) Z_i = 0. \tag{3}$$

Only the subjects with complete cost data contribute non-zero terms to the summation in (3), but their contributions are weighted inversely by their probabilities of inclusion. Thus, the left-hand sides of (2) and (3) have the same expectation, which is zero. Since $G$ is unknown but can be consistently estimated by the Kaplan–Meier estimator $\hat G$, we replace $G$ in (3) with $\hat G$ to yield the following estimating equation for $\beta$

$$\sum_{i=1}^n \frac{\delta_i^*}{\hat G(T_i^*)} (Y_i - \beta' Z_i) Z_i = 0, \tag{4}$$

which has a closed-form solution

$$\hat\beta = \left\{ \sum_{i=1}^n \frac{\delta_i^*}{\hat G(T_i^*)} Z_i^{\otimes 2} \right\}^{-1} \sum_{i=1}^n \frac{\delta_i^*}{\hat G(T_i^*)} Y_i Z_i. \tag{5}$$

Here and in the sequel, we use the notation: $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, and $a^{\otimes 2} = a a'$. If $\delta_i^* = 1$ for all $i$, then (5) reduces to the ordinary least-squares estimator.

Remark. The above idea of weighting the complete observations by their inversed probabilities of inclusion was originated by Horvitz and Thompson (1952) in the context of sample surveys. The adaptation of this idea to the setting of censored survival data was initially considered by Koul et al. (1981), and later on by Robins and Rotnitzky (1992) and Lin and Ying (1993). Recently, Zhao and Tsiatis (1997) applied this idea to the problem of quality-adjusted survival time.

We show in the Appendix that $n^{1/2}(\hat\beta - \beta)$ converges in distribution to a $p$-variate zero-mean normal random vector with a covariance matrix which can be consistently estimated by $\hat A^{-1} \hat B \hat A^{-1}$, where

$$\hat A = n^{-1} \sum_{i=1}^n Z_i^{\otimes 2}, \tag{6}$$

$$\hat B = n^{-1} \sum_{i=1}^n \left[ \frac{\delta_i^* (Y_i - \hat\beta' Z_i) Z_i}{\hat G(T_i^*)} + \bar\delta_i Q(X_i) - \sum_{j=1}^n \frac{\bar\delta_j I(X_j \le X_i) Q(X_j)}{\sum_{l=1}^n I(X_l \ge X_j)} \right]^{\otimes 2}, \tag{7}$$

and

$$Q(t) = \sum_{i=1}^n \frac{I(T_i^* > t)\, \delta_i^* (Y_i - \hat\beta' Z_i) Z_i}{\hat G(T_i^*)} \Bigg/ \sum_{j=1}^n I(X_j \ge t). \tag{8}$$

Note that neither $\hat\beta$ nor its covariance matrix estimator involves the cost data from the subjects whose survival times are censored before $\tau$.
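The weighting scheme of Section 2.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the author's software: the function names `censoring_km` and `ipcw_least_squares` are hypothetical, censoring is assumed completely random, and the Kaplan–Meier product is taken over censoring times strictly before $t$ so that it estimates $\Pr(C \ge t)$. The sketch also uses the fact that $\delta^* = 1$ exactly when death was observed or follow-up reached $\tau$, in which case $T^* = \min(X, \tau)$.

```python
import numpy as np

def censoring_km(X, delta):
    """Kaplan-Meier estimate of G(t) = Pr(C >= t) from (X_i, 1 - delta_i)."""
    X = np.asarray(X, dtype=float)
    censored = 1 - np.asarray(delta, dtype=int)   # censoring is the "event" here
    times = np.unique(X[censored == 1])           # distinct censoring times
    at_risk = np.array([(X >= u).sum() for u in times], dtype=float)
    events = np.array([((X == u) & (censored == 1)).sum() for u in times],
                      dtype=float)
    factors = 1.0 - events / at_risk

    def G_hat(t):
        # Product over censoring times strictly before t (left-continuous).
        t = np.atleast_1d(np.asarray(t, dtype=float))
        return np.array([factors[times < s].prod() for s in t])

    return G_hat

def ipcw_least_squares(Y, Z, X, delta, tau):
    """Closed-form solution of the weighted normal equation (5)."""
    Y = np.asarray(Y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta, dtype=int)
    # delta* = 1: death observed, or follow-up reached tau.
    complete = (delta == 1) | (X >= tau)
    T_star = np.minimum(X, tau)        # equals min(T, tau) whenever `complete`
    G_hat = censoring_km(X, delta)
    w = np.zeros(len(Y))
    w[complete] = 1.0 / G_hat(T_star[complete])
    y = np.where(complete, Y, 0.0)     # incomplete costs receive weight zero
    Zw = Z * w[:, None]
    return np.linalg.solve(Zw.T @ Z, Zw.T @ y)

# Toy data (made up): subject 2 is censored at time 2, before tau = 10.
beta = ipcw_least_squares(Y=[10.0, 0.0, 20.0, 30.0], Z=np.ones((4, 1)),
                          X=[1.0, 2.0, 3.0, 6.0], delta=[1, 0, 1, 1], tau=10.0)
print(beta)  # intercept-only fit: (10 + 1.5*20 + 1.5*30) / 4 = 21.25
```

In the toy data the two subjects observed after the censoring time receive weight $1/\hat G = 1.5$, which compensates for the censored subject. With no censoring all weights equal 1 and the function returns the ordinary least-squares fit.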

2.2. Covariate-dependent censoring

It is possible to allow censoring to depend on measured covariates. If the covariates are all discrete with a limited number of values, then one may stratify the sample according to the covariate values and replace $\hat G$ in Section 2.1 with the stratum-specific Kaplan–Meier estimators; otherwise, it is convenient to formulate the effects of covariates on censoring through the proportional hazards model. Both of these two situations are encompassed by the following stratified proportional hazards model (Cox, 1972)

$$\lambda(t \mid V, W) = e^{\gamma' W(t)} \lambda_V(t), \tag{9}$$

where $V$ represents the stratification variables and $W$ the rest of the covariates, $\lambda(t \mid V, W)$ is the conditional hazard function of $C$ given $V$ and $W$, $\lambda_V(\cdot)$ is an unspecified baseline hazard function for stratum $V$, and $\gamma$ is a set of unknown regression parameters. Naturally, $V$ and $W$ may include part of $Z$. We assume that $C$ is independent of all other random variables conditional on $(V, W)$. Without this assumption, there would be a non-identifiability problem.

Let $V_i$ and $W_i$ be the values of $V$ and $W$ for the $i$th subject. The parameters in model (9) are estimated from the data $(X_i, \bar\delta_i, V_i, W_i)$ $(i = 1, \ldots, n)$ by the partial likelihood principle (Cox, 1975). The validity of the model can be verified by a number of existing methods; see Lin et al. (1993). Under model (9), the probability that $Y_i$ is known, i.e. $\delta_i^* = 1$, is estimated by

$$\hat G(T_i^* \mid V_i, W_i) = \exp\left\{ -\sum_{j=1}^n \frac{\bar\delta_j I(V_j = V_i, X_j < T_i^*)\, e^{\hat\gamma' W_i(X_j)}}{S^{(0)}(X_j; V_i, \hat\gamma)} \right\},$$

where $\hat\gamma$ is the maximum partial likelihood estimator of $\gamma$ (Lin et al., 1994). Here and in the sequel, we adopt the notation:

$$S^{(\rho)}(t; V, \gamma) = \sum_{i=1}^n I(V_i = V, X_i \ge t)\, e^{\gamma' W_i(t)}\, W_i(t)^{\otimes \rho}, \qquad \rho = 0, 1, 2.$$

The replacement of $\hat G(T_i^*)$ in (4) with $\hat G(T_i^* \mid V_i, W_i)$ yields the estimating equation

$$\sum_{i=1}^n \frac{\delta_i^*}{\hat G(T_i^* \mid V_i, W_i)} (Y_i - \beta' Z_i) Z_i = 0, \tag{10}$$

which again has an explicit solution

$$\hat\beta = \left\{ \sum_{i=1}^n \frac{\delta_i^*}{\hat G(T_i^* \mid V_i, W_i)} Z_i^{\otimes 2} \right\}^{-1} \sum_{i=1}^n \frac{\delta_i^*}{\hat G(T_i^* \mid V_i, W_i)} Y_i Z_i. \tag{11}$$

The asymptotic properties of $\hat\beta$ are described at the end of the next subsection.

2.3. Multiple time intervals

The estimators developed in the previous two subsections may be inefficient when there is heavy censoring because the cost data from the subjects whose survival times are censored prior to $\tau$ are not used at all. In many applications, including the ovarian cancer study to be presented in Section 4, the costs are recorded in certain time intervals, e.g. every month or every year, in which case it is possible to obtain more efficient estimators. The availability of the cost data in multiple time intervals also offers the opportunity to assess how the effects of covariates change over time. The idea of partitioning the entire study period into several time intervals to improve efficiency was initially explored by Lin et al. (1997) in the one-sample case, though they handled censoring with a very different approach.

Suppose that the entire time period of interest $[0, \tau]$ is divided into $K$ intervals by $0 \equiv t_0 < t_1 < \cdots < t_{K-1} < t_K \equiv \tau$. For the $i$th subject, let $Y_{ki}$ denote the cost incurred over the time interval $(t_{k-1}, t_k]$. The initial cost at $t = 0$ is included in the first time interval. We specify a linear regression model for each of the $K$ intervals:

$$Y_{ki} = \beta_k' Z_i + \epsilon_{ki}, \qquad k = 1, \ldots, K;\ i = 1, \ldots, n, \tag{12}$$

where $\beta_k$ $(k = 1, \ldots, K)$ are $p \times 1$ vectors of unknown regression parameters, and the error terms $\epsilon_{ki}$ are assumed to be independent among different subjects but allowed to be correlated within the same subject. This is a semiparametric marginal model for repeated measures in that only the marginal mean structure is modelled. By summing both sides of (12) over $k$, we obtain

$$Y_i = \beta' Z_i + \epsilon_i, \qquad i = 1, \ldots, n,$$

where $Y_i = \sum_{k=1}^K Y_{ki}$, $\beta = \sum_{k=1}^K \beta_k$, and $\epsilon_i = \sum_{k=1}^K \epsilon_{ki}$. This is the same as model (1). Neither model (1) nor (12) requires specification of the relationship between survival time and cost.

Define $T_{ki}^* = \min(T_i, t_k)$ and $\delta_{ki}^* = I(C_i \ge T_{ki}^*)$. Clearly, $Y_{ki}$ is known if and only if $\delta_{ki}^* = 1$. Mimicking equation (10), we propose the following estimating equation for $\beta_k$

$$\sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^* \mid V_i, W_i)} (Y_{ki} - \beta_k' Z_i) Z_i = 0, \tag{13}$$

which has an explicit solution

$$\hat\beta_k = \left\{ \sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^* \mid V_i, W_i)} Z_i^{\otimes 2} \right\}^{-1} \sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^* \mid V_i, W_i)} Y_{ki} Z_i. \tag{14}$$

The corresponding estimator of $\beta$ is $\hat\beta = \sum_{k=1}^K \hat\beta_k$, or

$$\hat\beta = \sum_{k=1}^K \left\{ \sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^* \mid V_i, W_i)} Z_i^{\otimes 2} \right\}^{-1} \sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^* \mid V_i, W_i)} Y_{ki} Z_i. \tag{15}$$

This estimator shares the spirit of the generalized estimating equations for repeated measures (Liang and Zeger, 1986). A subject will contribute a non-zero term to the left-hand side of (13) if he/she has complete cost data in the $k$th interval.
In other words, a subject whose survival time is censored in the $(k+1)$th interval contributes his/her cost data from the first $k$ time intervals to the estimation of $\beta$. By contrast, a subject whose survival time is censored before $\tau$ does not contribute any cost information to (10). Thus, (15) is expected to be more efficient than (11).

We prove in the Appendix that $n^{1/2}(\hat\beta_1 - \beta_1, \ldots, \hat\beta_K - \beta_K)$ is asymptotically zero-mean normal and that the limiting covariance matrix between $n^{1/2}(\hat\beta_k - \beta_k)$ and $n^{1/2}(\hat\beta_l - \beta_l)$ $(k, l = 1, \ldots, K)$ can be consistently estimated by $\hat A^{-1} \hat B_{kl} \hat A^{-1}$, where $\hat A$ is given in (6), $\hat B_{kl} = n^{-1} \sum_{i=1}^n \hat\xi_{ki} \hat\xi_{li}'$, and

$$\hat\xi_{ki} = \frac{\delta_{ki}^* (Y_{ki} - \hat\beta_k' Z_i) Z_i}{\hat G(T_{ki}^* \mid V_i, W_i)} + \bar\delta_i D_{ki}(X_i) - \sum_{j=1}^n \frac{\bar\delta_j I(V_j = V_i, X_j \le X_i)\, e^{\hat\gamma' W_i(X_j)} D_{ki}(X_j)}{S^{(0)}(X_j; V_i, \hat\gamma)}. \tag{16}$$

In expression (16),

$$D_{ki}(t) = Q_k(t; V_i) + R_k \hat\Omega^{-1} \{ W_i(t) - S^{(1)}(t; V_i, \hat\gamma) / S^{(0)}(t; V_i, \hat\gamma) \},$$

where

$$Q_k(t; V) = \sum_{i=1}^n \frac{I(V_i = V, T_{ki}^* > t)\, e^{\hat\gamma' W_i(t)}\, \delta_{ki}^* (Y_{ki} - \hat\beta_k' Z_i) Z_i}{\hat G(T_{ki}^* \mid V_i, W_i)\, S^{(0)}(t; V, \hat\gamma)}, \tag{17}$$

$$R_k = n^{-1} \sum_{i=1}^n \frac{\delta_{ki}^* (Y_{ki} - \hat\beta_k' Z_i) Z_i\, H'(T_{ki}^*; V_i, W_i)}{\hat G(T_{ki}^* \mid V_i, W_i)}, \tag{18}$$

$$\hat\Omega = n^{-1} \sum_{i=1}^n \bar\delta_i \left\{ \frac{S^{(2)}(X_i; V_i, \hat\gamma)}{S^{(0)}(X_i; V_i, \hat\gamma)} - \frac{S^{(1)}(X_i; V_i, \hat\gamma)^{\otimes 2}}{S^{(0)}(X_i; V_i, \hat\gamma)^2} \right\},$$

and

$$H(t; V, W) = \sum_{i=1}^n \bar\delta_i I(V_i = V, X_i < t)\, e^{\hat\gamma' W(X_i)} \left\{ W(X_i) - \frac{S^{(1)}(X_i; V, \hat\gamma)}{S^{(0)}(X_i; V, \hat\gamma)} \right\} \Bigg/ S^{(0)}(X_i; V, \hat\gamma).$$

It follows that $n^{1/2}(\hat\beta - \beta)$ is asymptotically zero-mean normal with a covariance matrix which can be consistently estimated by $\hat A^{-1} \hat B \hat A^{-1}$, where $\hat B = \sum_{k=1}^K \sum_{l=1}^K \hat B_{kl}$.

The above asymptotic results also apply to the estimator given in (11) since Section 2.2 is a special case of this subsection with $K = 1$. To calculate the variance estimator for (11), we set $K = 1$ and replace $T_{ki}^*$, $\delta_{ki}^*$, $Y_{ki}$ and $\hat\beta_k$ in (16)–(18) with $T_i^*$, $\delta_i^*$, $Y_i$ and $\hat\beta$, respectively.

If censoring occurs in a completely random fashion, we set $V_i = 1$ and $W_i = 0$ $(i = 1, \ldots, n)$, and replace $\hat G(T_{ki}^* \mid V_i, W_i)$ in (13)–(15) with the Kaplan–Meier estimator $\hat G(T_{ki}^*)$ $(k = 1, \ldots, K;\ i = 1, \ldots, n)$. Specifically, (15) becomes

$$\hat\beta = \sum_{k=1}^K \left\{ \sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^*)} Z_i^{\otimes 2} \right\}^{-1} \sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^*)} Y_{ki} Z_i. \tag{19}$$

The aforementioned asymptotic results continue to hold, but with

$$\hat\xi_{ki} = \frac{\delta_{ki}^* (Y_{ki} - \hat\beta_k' Z_i) Z_i}{\hat G(T_{ki}^*)} + \bar\delta_i Q_k(X_i) - \sum_{j=1}^n \frac{\bar\delta_j I(X_j \le X_i) Q_k(X_j)}{\sum_{l=1}^n I(X_l \ge X_j)},$$

where

$$Q_k(t) = \sum_{i=1}^n \frac{I(T_{ki}^* > t)\, \delta_{ki}^* (Y_{ki} - \hat\beta_k' Z_i) Z_i}{\hat G(T_{ki}^*)} \Bigg/ \sum_{j=1}^n I(X_j \ge t).$$

If we further set $K = 1$, then $\hat B_{kl}$ and $\hat B$ reduce to (7).

3. SIMULATION STUDIES

Monte Carlo simulation was conducted to assess the operating characteristics of the proposed methods in practical settings. The simulation scheme was a modification of that of Lin et al. (1997). To be specific, the survival times were generated from two distributions: uniform on (0, 10) years and exponential with a mean of six years. The censoring times were generated from the uniform(0, c) distribution. We set c = 20 and 15 years, resulting in censoring probabilities of approximately 25% and 35% under the uniform survival time distribution and 30% and 40% under the exponential distribution. The 10-year cost was considered to be the quantity of main interest. We postulated U-shaped sample paths for the costs. Specifically, the entire time period of interest [0, 10] was divided into 10 1-year intervals. Within each interval, there was a baseline cost of uniform(0, 1). In addition, there were a diagnostic cost of uniform(0, 1) at t = 0 and a uniform(0, 1) cost in the final year of life. The covariate was a treatment indicator with n/2 subjects in each of the two groups. The treatment assignment was independent of the costs so that the regression coefficient for the treatment indicator is 0.

Table 1. Summary statistics for the simulation studies

                      Single interval (K = 1)      Multiple intervals (K = 10)
T            c    n   Bias   SSE    SEE    CP      Bias   SSE    SEE    CP
Uniform      20   1   .37    .45    .392   .941    .5     .378   .368   .942
                  2   .3     .286   .28    .942    .6     .267   .263   .943
                  5   .31    .18    .178   .948    .24    .168   .167   .948
             15   1   .39    .458   .431   .93     .25    .43    .388   .939
                  2   .11    .321   .31    .935    .2     .285   .278   .94
                  5   .13    .21    .198   .947    .2     .179   .177   .949
Exponential  20   1   .18    .453   .44    .935    .23    .49    .4     .94
                  2   .5     .316   .314   .946    .36    .288   .285   .947
                  5   .15    .2     .199   .946    .11    .181   .181   .95
             15   1   .17    .526   .5     .925    .7     .435   .422   .936
                  2   .65    .368   .358   .937    .47    .36    .32    .944
                  5   .14    .228   .227   .948    .15    .192   .192   .949

Note: Bias is the mean of $\hat\beta$ minus $\beta$; SSE is the standard error of $\hat\beta$; SEE is the mean of the standard error estimator for $\hat\beta$; CP is the coverage probability of the 95% confidence interval for $\beta$. Each entry is based on 1 replicates.

Table 1 summarizes the results of simulation studies on the estimation of treatment difference based on estimator (5) and estimator (19) with K = 10. Both estimators appear to be virtually unbiased. In general, the standard error estimators adequately reflect the true variations of the parameter estimators and the associated confidence intervals have reasonable coverage probabilities.
Making use of the cost data in multiple time intervals not only enhances the efficiency of the estimation but also improves the accuracy of the asymptotic approximation in small samples.
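For readers who wish to experiment with a design of this kind, the following Python sketch generates one simulated data set with U-shaped cost paths. It is a hedged illustration rather than the code behind Table 1: the function name `simulate_trial`, the upper limits of the cost distributions (`base_hi`, `diag_hi`, `final_hi`), the pro-rating of the baseline cost over the time alive within an interval, and the even spreading of the final-year cost over the last year of life are all assumptions made here for concreteness.

```python
import numpy as np

def simulate_trial(n, c, rng, survival="uniform", tau=10.0, K=10,
                   base_hi=1.0, diag_hi=1.0, final_hi=1.0):
    """One replicate: survival, censoring, and full per-interval costs."""
    if survival == "uniform":
        T = rng.uniform(0.0, 10.0, n)          # uniform on (0, 10) years
    else:
        T = rng.exponential(6.0, n)            # exponential, mean six years
    C = rng.uniform(0.0, c, n)                 # uniform(0, c) censoring
    edges = np.linspace(0.0, tau, K + 1)
    lo, hi = edges[:-1], edges[1:]

    # Baseline cost per interval, pro-rated by time alive (an assumption).
    alive = np.clip(np.minimum(T, tau)[:, None] - lo, 0.0, hi - lo)
    cost = rng.uniform(0.0, base_hi, (n, K)) * alive / (hi - lo)

    # Diagnostic cost at t = 0 goes into the first interval.
    cost[:, 0] += rng.uniform(0.0, diag_hi, n)

    # Final-year cost, spread evenly over the last year of life (an assumption);
    # any part falling after tau is outside [0, tau] and is not counted.
    start = np.maximum(T - 1.0, 0.0)
    overlap = np.clip(np.minimum(hi, T[:, None]) - np.maximum(lo, start[:, None]),
                      0.0, None)
    final = rng.uniform(0.0, final_hi, n)
    cost += final[:, None] * overlap / np.maximum(T - start, 1e-12)[:, None]

    treatment = np.arange(n) % 2               # n/2 per arm when n is even
    Z = np.column_stack([np.ones(n), treatment])
    return {"T": T, "C": C, "X": np.minimum(T, C),
            "delta": (C >= T).astype(int), "cost": cost, "Z": Z}

sim = simulate_trial(n=200, c=20.0, rng=np.random.default_rng(1))
print(sim["cost"].shape, 1.0 - sim["delta"].mean())  # cost matrix shape, censoring rate
```

The returned `cost` matrix holds the full, uncensored interval costs; in an analysis only the entries with $\delta_{ki}^* = 1$ would be treated as observed. Because treatment is assigned independently of cost, the true coefficient of the treatment indicator is zero under this sketch as well.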

Fig. 1. Kaplan–Meier estimates of the survival probabilities for epithelial ovarian cancer patients (vertical axis: survival probabilities; horizontal axis: time after diagnosis, in years): solid curve, local stage; dotted curve, regional stage; dashed curve, distant stage.

4. OVARIAN CANCER STUDY

In this section, we use the linked SEER–Medicare database (Potosky et al., 1993) to study the medical cost for epithelial ovarian cancer among the Medicare enrollees in the United States. We consider the 3550 Medicare beneficiaries over the age of 65 who were diagnosed with ovarian cancer from 1984 to 1989. Among them, 540, 836 and 2174 subjects were diagnosed with local, regional and distant stages, respectively. The data on mortality and monthly medical costs were collected during the period of 1984 to 1990. From a public health point of view, it is important to assess how the stage at diagnosis affects the future survival and medical cost.

The survival times and medical costs are censored on the patients who were still alive at the end of 1990. The censoring times as measured from the times of diagnosis vary substantially among the subjects as the diagnoses were staggered over a period of 6 years. Because censoring was solely caused by the limited study duration, it is reasonable to assume that censoring is independent of all other random variables.

Figure 1 shows the Kaplan–Meier estimates of the survival probabilities separated by the three disease stages at diagnosis. Clearly, the patients with less aggressive disease have better survival experiences. The 5-year survival probabilities are approximately 70%, 20% and 10% for the local, regional and distant stages, respectively.

To evaluate how the stage of disease affects the 5-year post-diagnosis cost, we fit model (12) with $n = 3550$, $K = 60$ and three covariates: $Z_i = (1, 0, 0)'$, $(1, 1, 0)'$ or $(1, 0, 1)'$ if the $i$th patient was diagnosed with the local, regional or distant stages, respectively. By the definition of the covariates, the first component of each $\beta_k$ corresponds to the local stage, and the second and third components correspond to the differences of the regional and distant stages from the local stage. Table 2(a) summarizes the results for the overall regression parameters $\beta = (\beta_1, \beta_2, \beta_3)'$, while Figure 2 displays the cumulative costs for the three stages based on individual $\hat\beta_k$ $(k = 1, \ldots, 60)$.

Table 2. Regression estimates for the 5-year cost in ovarian cancer

Parameter   Estimate   St. error   Est/SE   95% confidence interval
(a) Using monthly costs
β1          32229      1241        25.97    (29797, 34662)
β2           5972      1683         3.55    (2673, 9271)
β3           4527      1402         3.23    (1779, 7275)
(b) Using 5-year costs
β1          33675      1689        19.93    (30365, 36986)
β2           6515      2182         2.99    (2239, 10790)
β3           3614      1832         1.97    (24, 7205)
(c) Naive complete-cases analysis
β1          34590      1640        21.09    (31375, 37805)
β2           3393      1969         1.72    (−466, 7253)
β3            736      1764         0.42    (−2723, 4194)

β1: local stage. β2: difference of regional stage from local stage. β3: difference of distant stage from local stage.

As is evident from Figure 2, the medical cost for the local stage is lower than those of the other 2 stages in the first 2 years after diagnosis, but the opposite is true in later years. This phenomenon is mainly due to the fact that the local-stage patients survived longer. The average 5-year cost for a local-stage patient is slightly over $32 000, while those of the regional- and distant-stage patients are about $5000 higher. The differences are statistically significant.

To demonstrate the efficiency gain of using multiple time intervals over using a single interval, we also apply the method of Section 2.1 to the total 5-year costs. The results are shown in Table 2(b). A comparison of Table 2(a) and (b) shows that the use of the monthly cost data yields considerable variance reduction. The results in Table 2(c) are obtained by applying the ordinary least-squares method to the cases with complete data on the 5-year cost.
Such an analysis is biased towards the costs of the patients with shorter survival times because longer survival times are more likely to be censored. The estimates in Table 2(c) differ appreciably from those of Table 2(a) and (b).
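The interval-by-interval analysis behind Table 2(a) can be sketched in Python as follows. This is a minimal sketch under assumptions, not the code used for the ovarian cancer study: the names `censoring_km` and `interval_ipcw` are hypothetical, censoring is taken to be completely random so that Kaplan–Meier weights as in (19) apply, and the toy numbers are invented. It relies on the fact that $\delta_{ki}^* = 1$ exactly when death was observed or follow-up reached $t_k$, in which case $T_{ki}^* = \min(X_i, t_k)$.

```python
import numpy as np

def censoring_km(X, delta):
    """Kaplan-Meier estimate of G(t) = Pr(C >= t); censoring is the event."""
    X = np.asarray(X, dtype=float)
    censored = 1 - np.asarray(delta, dtype=int)
    times = np.unique(X[censored == 1])
    at_risk = np.array([(X >= u).sum() for u in times], dtype=float)
    events = np.array([((X == u) & (censored == 1)).sum() for u in times],
                      dtype=float)
    factors = 1.0 - events / at_risk
    return lambda t: np.array([factors[times < s].prod()
                               for s in np.atleast_1d(np.asarray(t, dtype=float))])

def interval_ipcw(Y_k, Z, X, delta, cuts):
    """Solve (14) with Kaplan-Meier weights for each interval, then sum as in (19).

    Y_k  : n x K array, cost in interval (t_{k-1}, t_k] (ignored where unknown)
    cuts : t_1 < ... < t_K, with t_K = tau
    """
    Y_k = np.asarray(Y_k, dtype=float)
    Z = np.asarray(Z, dtype=float)
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta, dtype=int)
    G_hat = censoring_km(X, delta)
    betas = []
    for k, t_k in enumerate(cuts):
        known = (delta == 1) | (X >= t_k)      # delta*_{ki}
        T_k = np.minimum(X, t_k)               # T*_{ki} wherever `known`
        w = np.zeros(len(X))
        w[known] = 1.0 / G_hat(T_k[known])
        y = np.where(known, Y_k[:, k], 0.0)
        Zw = Z * w[:, None]
        betas.append(np.linalg.solve(Zw.T @ Z, Zw.T @ y))
    betas = np.array(betas)
    return betas, betas.sum(axis=0)

# Toy data (made up): subject 2 is censored at t = 2, the end of interval 1.
betas, total = interval_ipcw(
    Y_k=[[4.0, 0.0], [8.0, 0.0], [8.0, 4.0], [8.0, 12.0]],
    Z=np.ones((4, 1)), X=[1.0, 2.0, 3.0, 6.0], delta=[1, 0, 1, 1],
    cuts=[2.0, 6.0])
print(betas.ravel(), total)  # interval means 7 and 6, total 13
```

In the toy data the censored subject still contributes its first-interval cost with full weight, whereas a single-interval analysis of the total cost would discard that subject entirely. This is the source of the variance reduction seen when moving from Table 2(b) to Table 2(a).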

Fig. 2. Estimates of the average cumulative costs for epithelial ovarian cancer patients (vertical axis: average cumulative costs, in dollars; horizontal axis: time after diagnosis, in years): solid curve, local stage; dotted curve, regional stage; dashed curve, distant stage.

5. REMARKS

It might seem natural to apply the existing regression methods in the survival analysis literature, such as those based on the Cox proportional hazards and accelerated failure time models, to censored medical costs by treating $(\tilde Y_i, \delta_i, Z_i)$ $(i = 1, \ldots, n)$ as censored survival data, where $\tilde Y_i$ is the cumulative cost at $\min(T_i, C_i)$, i.e. the minimum of the cumulative costs at $T_i$ and $C_i$. As mentioned in Section 1, this approach is invalid because the cumulative cost at $T$ is positively correlated with the cumulative cost at $C$ even if $T$ and $C$ are independent. This phenomenon of dependent censoring is caused by the fact that the patients are heterogeneous such that those who accumulate cost at higher rates over time tend to generate higher cumulative costs at all time points, including $T$ and $C$, as compared with those with lower accumulation rates.

The approach taken in this paper allows arbitrary censoring patterns, whereas that of Lin et al. (1997) requires censoring to occur only at the cut-points $t_1, \ldots, t_K$. In the one-sample case, the proposed estimators do not reduce to those of Lin et al. (1997).

The main motivation for developing the regression methods is to handle a large number of continuous and discrete covariates. The ovarian cancer example given in Section 4 did not demonstrate the full power of the proposed regression methods because the available database does not contain any continuous covariates. The new methods are currently being applied to several ongoing clinical trials. The results of those applications will be reported in medical journals.

This paper focuses on the actual total medical cost, which acknowledges the fact that there is no further accumulation of cost after death. This quantity is highly important to public health and insurance industries. Because longer survival time tends to be associated with higher medical cost, different intervention/prevention programs should be compared not only with respect to medical cost but also with respect to survival time. In fact, the cost-effectiveness of a new program relative to the standard is normally measured by the increase in mean medical cost divided by the increase in mean survival time (Patrick and Erickson, 1993).

ACKNOWLEDGEMENTS

The author is grateful to the referees for their careful and speedy reviews of this paper, and to Drs Ruth Etzioni, Paula Diehr and Sean Sullivan for helpful discussions on related topics. This research was supported by the National Institutes of Health.

APPENDIX

Proofs of asymptotic results

In this appendix, we prove the asymptotic theory stated in Section 2.3. The theory given in Section 2.2 is a special case with $K = 1$, while that of Section 2.1 is a further special case with $K = 1$ and $V_i = 1$ and $W_i = 0$ for all $i = 1, \ldots, n$.

The left-hand side of (13) can be written as $U_k(\beta_k) = U_{k1}(\beta_k) + U_{k2}(\beta_k)$, where

$$U_{k1}(\beta_k) = \sum_{i=1}^n \frac{\delta_{ki}^*}{G(T_{ki}^* \mid V_i, W_i)} (Y_{ki} - \beta_k' Z_i) Z_i, \tag{A1}$$

$$U_{k2}(\beta_k) = \sum_{i=1}^n \frac{G(T_{ki}^* \mid V_i, W_i) - \hat G(T_{ki}^* \mid V_i, W_i)}{G(T_{ki}^* \mid V_i, W_i)\, \hat G(T_{ki}^* \mid V_i, W_i)}\, \delta_{ki}^* (Y_{ki} - \beta_k' Z_i) Z_i.$$

Because $E(\delta_{ki}^* \mid V_i, W_i, Y_{ki}, Z_i, T_{ki}^*) = E(\delta_{ki}^* \mid V_i, W_i, T_{ki}^*) = G(T_{ki}^* \mid V_i, W_i)$, the term $U_{k1}(\beta_k)$ consists of $n$ independent zero-mean random vectors. By a slight extension of the results of Lin et al. (1994),

$$\frac{n^{1/2} \{ G(t \mid V, W) - \hat G(t \mid V, W) \}}{G(t \mid V, W)} = n^{-1/2} \sum_{i=1}^n \int_0^t \frac{I(V_i = V)\, e^{\gamma' W(x)}}{s^{(0)}(x; V)}\, dM_i(x) + h'(t; V, W)\, \Omega^{-1} n^{-1/2} \sum_{i=1}^n \int_0^\infty \{ W_i(x) - w(x; V_i) \}\, dM_i(x) + o_p(1),$$

where $s^{(\rho)}(t; V) = \lim_{n \to \infty} n^{-1} S^{(\rho)}(t; V, \gamma)$ $(\rho = 0, 1, 2)$, $w(t; V) = s^{(1)}(t; V) / s^{(0)}(t; V)$,

$$h(t; V, W) = \int_0^t e^{\gamma' W(x)} \{ W(x) - w(x; V) \} \lambda_V(x)\, dx,$$

$$\Omega = \lim_{n \to \infty} n^{-1} \sum_{i=1}^n \bar\delta_i \left\{ \frac{s^{(2)}(X_i; V_i)}{s^{(0)}(X_i; V_i)} - w^{\otimes 2}(X_i; V_i) \right\},$$

$$M_i(t) = \bar\delta_i I(X_i \le t) - \int_0^t I(X_i \ge x)\, e^{\gamma' W_i(x)} \lambda_{V_i}(x)\, dx.$$

Thus,

$$n^{-1/2} U_{k2}(\beta_k) = n^{-1/2} \sum_{i=1}^n \int_0^\infty Q_k(t; V_i)\, dM_i(t) + R_k \Omega^{-1} n^{-1/2} \sum_{i=1}^n \int_0^\infty \{ W_i(t) - w(t; V_i) \}\, dM_i(t) + o_p(1),$$

where

$$Q_k(t; V) = n^{-1} \sum_{i=1}^n \frac{I(V_i = V, T_{ki}^* > t)\, e^{\gamma' W_i(t)}\, \delta_{ki}^* (Y_{ki} - \beta_k' Z_i) Z_i}{\hat G(T_{ki}^* \mid V_i, W_i)\, s^{(0)}(t; V)},$$

$$R_k = n^{-1} \sum_{i=1}^n \frac{\delta_{ki}^* (Y_{ki} - \beta_k' Z_i) Z_i\, h'(T_{ki}^*; V_i, W_i)}{\hat G(T_{ki}^* \mid V_i, W_i)}.$$

The law of large numbers, together with the consistency of $\hat G$, implies that $Q_k(t; V)$ and $R_k$ converge to well-defined limits, say $q_k(t; V)$ and $r_k$. Therefore,

$$n^{-1/2} U_{k2}(\beta_k) = n^{-1/2} \sum_{i=1}^n \int_0^\infty \left[ q_k(t; V_i) + r_k \Omega^{-1} \{ W_i(t) - w(t; V_i) \} \right] dM_i(t) + o_p(1). \tag{A2}$$

Combining (A2) with (A1), we have $n^{-1/2} U_k(\beta_k) = n^{-1/2} \sum_{i=1}^n \xi_{ki} + o_p(1)$, where

$$\xi_{ki} = \frac{\delta_{ki}^*}{G(T_{ki}^* \mid V_i, W_i)} (Y_{ki} - \beta_k' Z_i) Z_i + \int_0^\infty \left[ q_k(t; V_i) + r_k \Omega^{-1} \{ W_i(t) - w(t; V_i) \} \right] dM_i(t).$$

Because $(\xi_{1i}, \ldots, \xi_{Ki})$ $(i = 1, \ldots, n)$ are $n$ independent zero-mean random matrices, the multivariate central limit theorem implies that $n^{-1/2} \{ U_1(\beta_1), \ldots, U_K(\beta_K) \}$ converges in distribution to a zero-mean normal random matrix. The limiting covariance matrix between $n^{-1/2} U_k(\beta_k)$ and $n^{-1/2} U_l(\beta_l)$ is $B_{kl} \equiv \lim_{n \to \infty} n^{-1} \sum_{i=1}^n \xi_{ki} \xi_{li}'$ $(k, l = 1, \ldots, K)$.

By the Taylor series expansion, $n^{1/2} (\hat\beta_k - \beta_k) = \tilde A_k^{-1} n^{-1/2} U_k(\beta_k)$, where

$$\tilde A_k = n^{-1} \sum_{i=1}^n \frac{\delta_{ki}^*}{\hat G(T_{ki}^* \mid V_i, W_i)} Z_i^{\otimes 2},$$

which converges in probability to $A \equiv \lim_{n \to \infty} n^{-1} \sum_{i=1}^n Z_i^{\otimes 2}$. It then follows from the aforementioned asymptotic normality of $n^{-1/2} \{ U_1(\beta_1), \ldots, U_K(\beta_K) \}$ that $n^{1/2} (\hat\beta_1 - \beta_1, \ldots, \hat\beta_K - \beta_K)$ converges in distribution to a zero-mean normal random matrix and the limiting covariance matrix between $n^{1/2} (\hat\beta_k - \beta_k)$ and $n^{1/2} (\hat\beta_l - \beta_l)$ is $A^{-1} B_{kl} A^{-1}$. This convergence in distribution implies the consistency of $\hat\beta_k$ $(k = 1, \ldots, K)$. Replacing all the unknown quantities in $\xi_{ki}$ with their respective sample estimators, we obtain $\hat\xi_{ki}$ given in (16). The consistency of $\hat B_{kl}$ for $B_{kl}$ $(k, l = 1, \ldots, K)$ follows from the law of large numbers, together with the consistency of $\hat G$, $\hat\gamma$ and $\hat\beta_k$ $(k = 1, \ldots, K)$.

REFERENCES

COX, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B 34, 187–220.

COX, D. R. (1975). Partial likelihood. Biometrika 62, 269–276.

HORVITZ, D. G. AND THOMPSON, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 47, 663–685.

KOUL, H., SUSARLA, V. AND VAN RYZIN, J. (1981). Regression analysis with randomly right-censored data. The Annals of Statistics 9, 1276–1288.

LIANG, K.-Y. AND ZEGER, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.

LIN, D. Y., ETZIONI, R., FEUER, E. J. AND WAX, Y. (1997). Estimating medical costs from incomplete follow-up data. Biometrics 53, 419–434.

LIN, D. Y., FLEMING, T. R. AND WEI, L. J. (1994). Confidence bands for survival curves under the proportional hazards model. Biometrika 81, 73–81.

LIN, D. Y., WEI, L. J. AND YING, Z. (1993). Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80, 557–572.

LIN, D. Y. AND YING, Z. (1993). A simple nonparametric estimator of the bivariate survival function under univariate censoring. Biometrika 80, 573–581.

PATRICK, D. L. AND ERICKSON, P. (1993). Health Status and Health Policy: Allocating Resources to Health Care, pp. 52–53. New York: Oxford University Press.

POTOSKY, A. L., RILEY, G. F., LUBITZ, J. D., MENTNECH, R. M. AND KESSLER, L. G. (1993). Potential for cancer related health services research using a linked Medicare-tumor registry data base. Medical Care 31, 732–747.

ROBINS, J. M. AND ROTNITZKY, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology: Methodological Issues, Eds N. P. Jewell, K. Dietz and V. T. Farewell, pp. 297–331. Boston, MA: Birkhäuser.

ZHAO, H. AND TSIATIS, A. A. (1997). A consistent estimator for the distribution of quality-adjusted survival time. Biometrika 84, 339–348.

[Received May 14, 1999. Revised August 19, 1999]