Dealing with Health Care Data using the SAS System

Alan C. Wilson, UMDNJ-RWJMS, New Brunswick, New Jersey
Marge Scerbo, CHPDM, UMBC, Baltimore, MD

Overview

Embarking upon analysis of health care data without a reasonable understanding of their peculiar intricacies and eccentricities is risky business. Knowledge of medical coding practices, of variability in personal, time and place variables, and of estimates of data completeness and quality is essential. This paper provides an overview of the problems that might be encountered, as well as possible solutions, including integrating information by record linkage.

Introduction

In a perfect world, health care information would flow freely among consumers, providers and payers, allowing health care administrators, planners, researchers and reviewers easy access to the data necessary for their activities. In the world we live in, however, the data are collected and delivered in diverse formats, many of which have no common personal identifiers. This has given rise to a whole industry aimed at providing a uniform set of utilization and price data to federal agencies, state Medicaid agencies, state employers, and statewide data organizations. These data merchants develop custom collection tools to capture clinical, financial, and other health-related data, and serve as data collection hubs, accepting data from hundreds of sources in a variety of formats. They standardize the data, perform edit and verification checks, provide data correction tools to submitting organizations, and work with them to improve accuracy. The data are then compiled into databases to provide custom reports and to make the data available for epidemiological and biostatistical analysis. We are here to tell you that, with the aid of SAS, so can you.

Origins of Diversity

Births and deaths have been registered centrally for many years in a standard format by all 50 states, the District of Columbia, Puerto Rico, and U.S. territories.
The health care claims data we will be discussing are a different matter. Over the lifetime of an individual, typical health care billing claims areas will include:
- maternity and prenatal care
- children's health
- prevention/well visits
- acute illness
- chronic long-term health care
- hospice care

Across the continuum of health care providers, he or she will encounter the following:
- dentists
- pharmacists
- physicians
- other licensed health professionals
- ambulatory care centers
- acute care hospitals
- rehabilitation/chronic care centers
- psychiatric hospitals

Each of these will have a different reporting format. Add to this the differences in reporting requirements between states, payers and departments, and the problem becomes worse. The situation is not completely hopeless, though. Many of the minimal characteristics needed for billing are included in each reporting format and can be used to standardize and link the data.

Minimal variables
- patient identifier
- provider identifier
- date of visit
- place of service
- diagnosis (one or more)
- laboratory or radiology services
- medical or surgical services
- billed charges for each service

Items such as the patient's age, date of birth, sex, ZIP code, source of admission, type of admission, discharge date, procedures and their dates, disposition of patient, and expected source of payment are generally also available. Variables of particular value in hospital discharge data are the attending/operating physician identification numbers, and hospital
identification number. In some cases, racial and ethnic category, names, addresses, and other personal identifiers (Social Security or Medicare number) may be omitted from the database.

The claims files described above are just one of the available sources of information. The others are the beneficiary file (information obtained at enrollment or registration) and the provider file (information identifying and characterizing the health care providers). The particular disease or condition of interest (e.g., cancer, cardiovascular disease, accidents, procedures) will determine the appropriate data source, as will the setting and provider of the care.

The Record Layout

It is most important to know where the data came from. The completeness and quality of the information will depend upon the source (discussed later), but first you need to know the record layout. You also need to be aware of when periodic changes in the record layout took place, and adjust your data input statements accordingly.

In contrast to mortality data, statewide hospital discharge data are currently maintained by 36 states. Most states use one of two standard formats: the UB-92 (Uniform Billing) reporting format with the HCFA-1450 claim form, or the UHDDS (Uniform Hospital Discharge Data Set) and the HCFA-1500 form. The Department of Health, Education and Welfare initiated the UHDDS in 1974 to improve the uniformity of hospital discharge data for patients enrolled in the Medicaid and Medicare programs. Other states, like Maryland, use unique systems such as HSCRC (Health Services Cost Review Commission). Virginia has no statewide system currently in use.
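Because layouts change over time, it can help to keep each version of the record layout as data and to pick the version in force for a given file. The following is a minimal sketch in Python, used here only for illustration; the same idea applies to input statements in a SAS DATA step. The field names, column positions and effective years are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layouts: each maps a field name to (start, end) column
# positions (0-based, end exclusive). Real positions come from the
# documentation supplied with each file.
LAYOUTS: List[Tuple[int, Dict[str, Tuple[int, int]]]] = [
    # (first year the layout applies, field positions)
    (1990, {"patient_id": (0, 8), "sex": (8, 9), "dx1": (9, 14)}),
    (1995, {"patient_id": (0, 10), "sex": (10, 11), "dx1": (11, 16)}),
]


def layout_for_year(year: int) -> Dict[str, Tuple[int, int]]:
    """Return the most recent layout whose start year is <= year."""
    chosen = None
    for start_year, fields in sorted(LAYOUTS, key=lambda item: item[0]):
        if start_year <= year:
            chosen = fields
    if chosen is None:
        raise ValueError(f"no layout defined for year {year}")
    return chosen


def parse_record(line: str, year: int) -> Dict[str, str]:
    """Slice one fixed-width record using the layout for its year."""
    fields = layout_for_year(year)
    return {name: line[a:b].strip() for name, (a, b) in fields.items()}


# Made-up records, one under each hypothetical layout.
old = parse_record("00001234F4019 ", 1992)
new = parse_record("0000001234M25000", 1996)
print(old)
print(new)
```

Keeping the layouts in one table means a layout change is recorded in a single place, and a file from an undocumented year fails loudly instead of being read with the wrong columns.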
NESUG state: Connecticut, Delaware, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, Vermont, (Virginia)
System type: HSCRC, Unique, Not statewide

If you have access to vital status (death certificate) data tapes, the record layout appropriate to the file type must be used: either single (underlying) cause of death, or multiple (contributing) causes.

Variable Coding

In addition to the record layout, the coding practices current at the time that the data were entered must be known. This applies particularly to racial or ethnic grouping, which frequently differs between databases, and within databases over time. Others include:
- other personal variables
- hospital discharge data
- place variables
- time variables

ICD-9 Codes - Diagnoses

Throughout the U.S. and most of the world, diseases, injuries and causes of death are coded according to the World Health Organization (WHO) International Classification of Diseases, Ninth Revision. The Clinical Modification of the codes (ICD-9-CM) adds fourth and fifth digits to the rubric to allow more detailed reporting. One complication is that it is difficult to distinguish between an admitting diagnosis, a discharge diagnosis, and diagnosis codes for conditions that arose during the hospital stay.

ICD-9 Codes - Procedures

Institutional (in-hospital) charges for procedures are coded with ICD-9 four-digit codes to distinguish them from the physician's portion of the charges for procedures, which are billed with CPT codes.

CPT Codes

Physicians billing for outpatient and in-hospital services or procedures use the American Medical Association's Current Procedural Terminology (CPT) five-digit codes, which are revised each year. CPT coders for each department (ob/gyn, surgery, pediatrics, medicine, etc.) are trained in that specialty, and use a translation program to produce HCPCS (pronounced "Hicpiks") codes for submission to Blue Cross/Blue Shield, Medicare and state Medicaid payers.
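The structure of an ICD-9-CM diagnosis code described above, a three-digit rubric followed by optional fourth and fifth digits, can be separated out when codes need to be grouped at the rubric level. The following is a minimal Python sketch under stated assumptions: codes arrive as strings with or without the decimal point, only purely numeric codes are handled (E and V codes are deliberately left out), and the function name is hypothetical.

```python
from typing import Tuple


def split_icd9cm_dx(code: str) -> Tuple[str, str]:
    """Split a numeric ICD-9-CM diagnosis code into (rubric, extra digits).

    The rubric is the first three digits; the fourth and fifth digits,
    if present, are returned together as the second element.
    """
    digits = code.strip().replace(".", "")
    if not digits.isdigit() or not 3 <= len(digits) <= 5:
        raise ValueError(f"not a numeric ICD-9-CM diagnosis code: {code!r}")
    return digits[:3], digits[3:]


print(split_icd9cm_dx("250.00"))  # ('250', '00')
print(split_icd9cm_dx("4019"))    # ('401', '9')
print(split_icd9cm_dx("410"))     # ('410', '')
```

Rejecting anything that is not three to five digits keeps procedure codes, CPT codes and free text from being silently treated as diagnoses.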
HCPCS

HCFA (Health Care Financing Administration) uses HCPCS National Level II Codes (HCFA Common Procedure Coding System) for
Medicare claims. HCPCS is used increasingly by commercial insurers for pharmaceuticals and durable medical equipment (DME) charges. Dental billing and pharmacy charges have different coding systems, which we will not cover here. Revenue codes are also stored electronically in different forms.

Recoding Data

It may be necessary to recode variables from different databases if there is a conflict; for example, conversion may be needed from an alphabetic code to a numeric one. Another reason for recoding is to achieve a particular goal in the collection, aggregation, and analysis of the data.

Collapsing and Summarizing

HCFA does not pay by disease, but rather by a coding system. For inpatient hospital care, diagnosis-related groups (DRGs) cover costs. Physician/supplier fees are paid by CPT codes. Costs are not aggregated by disease, so in order to calculate the costs associated with specific diseases in the Medicare population you need to provide DRGs and/or CPT codes. Another option would be to purchase Public Use Files, specifically the Standard Analytical Files, and search these files for specific codes.

There are a variety of sophisticated programs for use in the hospital inpatient billing setting, from the basic HCFA (Medicare) DRGs to the All Patient DRGs (AP-DRGs) and the severity-adjusted All Patient Refined DRGs (APR-DRGs). A more recent development is the Ambulatory Patient Groups or Ambulatory Care Groups (APGs, ACGs) for the classification of outpatient encounters.

Composite Variables

Combined ranges of ICD-9-CM codes, e.g., diabetes 250.00-250.90 and hypertension 401.00-405.99, can be used to indicate the presence or absence of various disease entities.

DVS Cause of Death Codes

The NCHS (National Center for Health Statistics), which is part of the Centers for Disease Control (CDC) within the Department of Health and Human Services (DHHS), has several lists of cause-of-death codes that collapse the ICD-9 into more manageable categories, ranging from 16 to 282 diseases.
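The composite-variable idea described above, where a range of ICD-9-CM codes indicates the presence or absence of a condition, can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the two ranges are the ones given in the text, while the function names, the five-digit integer encoding and the sample diagnoses are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

# Inclusive ranges as given in the text, written as five-digit integers
# (250.00 -> 25000). The condition names are illustrative labels.
RANGES: Dict[str, Tuple[int, int]] = {
    "diabetes": (25000, 25090),
    "hypertension": (40100, 40599),
}


def code_key(code: str) -> int:
    """Turn a numeric ICD-9-CM code such as '401.9' into 40190.

    Returns -1 for codes that are not purely numeric (e.g. E or V codes).
    """
    digits = code.strip().replace(".", "")
    if not digits.isdigit() or not 3 <= len(digits) <= 5:
        return -1
    return int(digits.ljust(5, "0"))


def condition_flags(diagnoses: Iterable[str]) -> Dict[str, bool]:
    """Flag each condition as present if any diagnosis falls in its range."""
    keys = [code_key(code) for code in diagnoses]
    return {
        name: any(low <= key <= high for key in keys)
        for name, (low, high) in RANGES.items()
    }


# Made-up diagnosis lists for two hypothetical discharge records.
print(condition_flags(["410.71", "250.00", "401.9"]))
print(condition_flags(["V27.0", "486"]))
```

Padding every code to five digits lets three-, four- and five-digit codes be compared on the same scale, so "401.9" and "401.90" fall in the same place in the range.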
In these NCHS cause-of-death lists, the coding from the various revisions of the ICD is adjusted to avoid discontinuities.

Comorbidity or Severity Scores

Constructing composite outcomes or endpoints from several intermediate diagnoses can enhance the usefulness of the hospital discharge record. For example, in acute myocardial infarction (MI), the following have been correlated with a poor long-term outcome: recurrent MI, new-onset heart failure or cardiogenic shock, and left ventricular dysfunction. The number of diagnoses and/or the number of procedures must be considered. The number and type of diagnostic tests employed may also be an indicator of the number of possible diagnoses, and so of the number and types of problems addressed during the encounter.

Stratification or Aggregation

Subgroups of interest can be investigated by stratifying analyses by sex, race, age, county or other variables. Too small a number of events will reduce the reliability of the estimates of rates and other measures. Conversely, to increase the number of events, data across several strata or years can be combined or aggregated. However, this may mask trends in subgroups. There should be evidence that rates or trends are similar between groups before they are combined. Two grouping systems are used by the NCHS to stratify by age. A four-strata system and a ten-strata system are shown below.

Four strata     Ten strata
<15             0-4
15-44           5-14
45-64           15-24
65 or older     25-34
                35-44
                45-54
                55-64
                65-74
                75-84
                85 or older

Data Completeness and Quality

A good data entry system will have an up-to-date code dictionary that reflects revisions and inactive codes and uses the fifth digit in ICD-9-CM. This is supplemented by regular exploratory data analysis in the form of summaries and range checks. We should recognize that absolute perfection is highly suspect, and that a particular collection of data represents one of a series of possible sets of data.

Case Review

The Peer Review Organization (PRO) program in each state ensures that Medicare
beneficiaries receive appropriate, high-quality care and implements HCFA's Health Care Quality Improvement Program (HCQIP). The PROs' focus has been on individual case review.

Provider Profiles

Statistics on practice patterns, including admission and visit rates, procedure rates, etc., have been used to identify outliers for review. The case mix of each provider has to be used to adjust the rates. Some states have produced risk-adjusted Report Cards for particular procedures such as cardiac surgery. Other states have not, and some have even discontinued the reporting.

HMO/MCO Quality Review

HMOs (Health Maintenance Organizations) and MCOs (Managed Care Organizations) that have a health plan contract with a state Medicaid agency or the federal Medicare system, as well as those with contracts with commercial payers, fall under the quality review of HEDIS (Health Plan Employer Data and Information Set). HEDIS was developed by NCQA (National Committee for Quality Assurance). These reviews are applied to claims data as well as to actual patient charts.

Variability in Diagnosis, Personal and Time Variables

A random sample of 669 patient charts was selected in New Jersey to validate the diagnosis of acute myocardial infarction for a pilot registry program. The diagnosis was accurate (definite MI) in 76% of the cases (81% in men, 58% in women). On the other hand, it was wrong (no MI) in only 6% of the cases (5% in men, 9% in women). Classification accuracy of other variables is shown in the next table:

Variable                Number correct (of 669)   Percent accuracy
Sex                     664                       99.3
Age                     667                       99.7
Race                    661                       98.8
Length of stay          664                       99.3
Vital status            661                       98.8
Procedure (card cath)   668                       99.9
Admit date              667                       99.7

Integrating Information by Record Linkage

Matching people within or across files in order to achieve linked data is not an easy task in the absence of unique identifiers such as Social Security number or Medicare number. These are frequently omitted to maintain the confidentiality of the patient record.
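When no unique identifier is available, one simple approach is to link on a composite key built from several non-unique fields. The following is a minimal Python sketch of such an exact match. The field names (last_name, dob, sex), the record contents and the function names are all hypothetical, and real files would need more careful cleaning of names and dates than is shown here.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def link_key(rec: Record) -> Tuple[str, str, str]:
    """Build a composite key from normalized non-unique fields."""
    return (rec["last_name"].strip().upper(), rec["dob"], rec["sex"].upper())


def exact_link(file_a: List[Record], file_b: List[Record]) -> List[Tuple[Record, Record]]:
    """Pair each record in file_b with every file_a record sharing its key."""
    index: Dict[Tuple[str, str, str], List[Record]] = {}
    for rec in file_a:
        index.setdefault(link_key(rec), []).append(rec)
    pairs = []
    for rec in file_b:
        for match in index.get(link_key(rec), []):
            pairs.append((match, rec))
    return pairs


# Made-up records: two hospital discharges and one death record.
discharges = [
    {"last_name": "Smith ", "dob": "1931-04-02", "sex": "f", "dx": "410.71"},
    {"last_name": "Jones", "dob": "1940-11-30", "sex": "M", "dx": "250.00"},
]
deaths = [
    {"last_name": "SMITH", "dob": "1931-04-02", "sex": "F", "cause": "410"},
]
links = exact_link(discharges, deaths)
print(len(links))
```

Because the key is not unique, one record can pair with several candidates; the function returns all of them so that ambiguous links can be reviewed rather than silently resolved.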
For linkage, the data fields that contain useful information are first and last name; month, day and year of birth; ZIP code; city or county code; race; and sex. In general, the more value levels a variable has, the more useful it is in the linking procedure. Thus sex, with only 2 levels, contributes less than month of birth, with 12 levels. Matching records can be done by exact matching using key variables, or by probabilistic matching. The latter generates a score for each possible match, and sets thresholds for acceptance and for clerical checking of possible matches. In the NJ experience, the sensitivity and specificity of the algorithm matching records to death records were 98.16% and 98.17%, respectively.

Statistics

Once the database has been made available to the SAS System, there are various statistical procedures that are useful in analyzing or exploring health care data:
- Rates
- Frequencies
- Correlation
- Age adjustment
- Chi-square test
- Confidence intervals
- Logistic regression
- Cox regression for survival
- Kaplan-Meier plots

For reporting or presenting utilization rates, information about the population base is needed. This is available from census data or from the SEER system, the Surveillance, Epidemiology and End Results program of the National Cancer Institute (NCI). The effect of migration, either seasonal or to an adjacent major city, should be evaluated.

Data Presentation

Graphic display can be used to emphasize a point of interest. The following types of presentation are typical:
- tables instead of sentences
- horizontal bar chart for comparisons
- vertical bar chart for time trends
- line graph for time trends
- clustered bar chart
- stacked bar chart
- pie chart for proportions
- maps (convert ZIP codes to latitude/longitude)
The SAS System works well with all of these presentation types.

Conclusion - Limitations in Data

It should be kept in mind that these are billing data, regulated by the requirements of the third-party payer. Data quality refers primarily to the accuracy of the information entered by the billing clerks, and the clerks can only bill what the clinical provider writes in the patient chart. The admitting diagnosis and the discharge diagnosis, if the patient's condition changes during the stay, determine the outcome of the payment claim. That is the outcome of interest in these data, not the health care outcome of the patient. Certain diagnostic tests, repeated hospitalizations, and procedures may be limited by the regulations of the payer. Whether they occurred or not may be unknown.

A person beginning to work with health care data must be aware that the records are the result of the sometimes conflicting needs of administrative regulations, health economics, social sciences, medical records and clinical medicine. Diagnostic and coding practices are subject to change. The analyst should always evaluate how this will affect the results of his or her own analyses and research. A more accurate estimate may be obtained by using multiple data sources in combination.

The Future

Electronic media claims (EMC) submission using the electronic UB-92 form (HCFA-1450) requires coordination of the electronic transfer between the different computer systems; e.g., a referral must be present in an HMO database before a payment for a visit outside of the system is made. Apart from the clinical note, the claim is essentially paper-free. Information about the electronic UB-92 can be found at www.hcfa.gov/medicare/edi/edi5.htm under the category "Health Claims."

HCFA is developing the following:
- A guide for states to use in developing, implementing, and using encounter data systems.
- The Tape-to-Tape project
- The State Medicaid Uniform Research Files (SMRF) project
- The Hospital Cost and Utilization Projects (HCUP-1, 2, 3) for the Agency for Health Care Policy and Research
- A prototype ICD-10 procedural coding system
- A supplemental chapter for ICD-9-CM diagnosis coding (H-Codes) for vital signs, findings, laboratory results and functional health status
- An episodic grouper defining episodes of care, which will allow care to be evaluated over time and across the full scope of services

Information Sources

Your state health department, library or data center.

The CDC and the NCHS can be accessed through their web sites, www.cdc.gov and www.cdc.gov/nchswww, or you can reach the NCHS at (301) 436-7035. The number for the NCHS library is (301) 436-6147.

National Death Index: (301) 436-8951.

WONDER, the CDC's Wide-ranging Online Data for Epidemiologic Research, provides a wealth of data. Contact (404) 332-4569 or the CDC web site for account information.

National Association of Health Data Organizations: web site www.nahdo.org.

Agency for Health Care Policy and Research (AHCPR): the web page www.ahcpr.gov/data/ provides links to HCUP (Healthcare Cost and Utilization Project), which took discharge data from 17 states and organized it into one format, and to many other sites.

References and Acknowledgements

Centers for Disease Control. Using Chronic Disease Data: A Handbook for Public Health Practitioners. Atlanta: Centers for Disease Control, 1992.

Steinwachs DM, Weiner JP, Shapiro S. Management Information Systems and Quality. In: Goldfield N, Nash DB (eds.), Providing Quality Care: Future Challenges. Health Administration Press, 1995.

SAS is a registered trademark of SAS Institute Inc., Cary, NC, USA.

Contact

Alan Wilson, (732) 235-7857, awilson@umdnj.edu
Marge Scerbo, (410) 455-6870, scerbo@chpdm.umbc.edu
Acronym             Meaning
ACG                 Ambulatory Care Groups
AHCPR               Agency for Health Care Policy and Research
AP-DRG              All Patient DRG
APG                 Ambulatory Patient Groups
APR-DRG             All Patient Refined DRG
CDC                 Centers for Disease Control
CPT                 Current Procedural Terminology
DHHS                Department of Health and Human Services
DME                 Durable Medical Equipment
DRG                 Diagnosis Related Group
EDI                 Electronic Data Interchange enrollment
EMC                 Electronic Media Claims
HCFA                Health Care Financing Administration
HCFA-1450 (UB-92)   A claims form used by institutional providers for Medicare Part A claims
HCFA-1500           A claims form used by non-institutional providers for Medicare Part B claims
HCPCS               HCFA Common Procedure Coding System
HCQIP               HCFA Health Care Quality Improvement Program
HCUP                Hospital Cost and Utilization Projects
HEDIS               Health Plan Employer Data and Information Set
HIPAA               Health Insurance Portability and Accountability Act of 1996 (Kassebaum/Kennedy Bill)
HISSB               CDC Health Information and Surveillance Systems Board
HL7                 Health Level Seven, a data standard
HMO                 Health Maintenance Organization
HSCRC               Health Services Cost Review Commission
ICD-9               International Classification of Diseases, Ninth Revision
MCO                 Managed Care Organization
NAHDO               National Association of Health Data Organizations
NCHS                National Center for Health Statistics
NCI                 National Cancer Institute
NCQA                National Committee for Quality Assurance
PRO                 Peer Review Organization
SDO                 Standards Development Organization (HL7)
SEER                Surveillance, Epidemiology and End Results
SMRF                State Medicaid Uniform Research Files
SNOMED              A vocabulary
UB-92 (82)          Uniform Billing
UHDDS               Uniform Hospital Discharge Data Set
WHO                 World Health Organization
X12                 A data standard (ANSI Accredited Standards Committee X12)