Secondary Data Analysis: The Good, the Bad, and the Ugly
Linda Simoni-Wastila, PhD

Purpose of Today's Lecture
- What is secondary data?
- What are the different types of secondary data?
- What are some of the analytic issues specific to secondary data?
- How do you validate secondary data?
- What types of studies can you do with secondary data?
- Where can you get some secondary data?

In a nutshell...
- Secondary data is any data that has been collected and aggregated by anyone other than the user
- This differs from primary data, which is data that the user collects, such as:
  - Surveys
  - Focus groups
  - Medical records
  - Interviews

A blessing!
- Secondary data has been around about as long as we have had written records
- Secondary data for research purposes, however, is a much more recent advance
- Pharmaceutical data is an even more recent advance
- In the past 15 years, the number of sources of data with pharmaceutical information has exploded from a few to many

Before you google possible data sources, download a database, push a button, and complete your dissertation, there are a few things you need to consider:
- Does it make sense for you to collect your own data?
- If no, what data sources are available to you that will answer your research question(s)?
- What type of data can you hope and dream to procure?
- And where can you find it?

To collect or not to collect? That is the question
When you might want to collect primary data:
- Question is qualitative in nature
- Study design is observational
- Development/assessment of a data collection instrument
- Question has no available secondary data (or at least none that is available to you)

Pros:
- Cost to procure < cost to collect
- Large sample size/nationally representative
- Has population of interest
- Data are current
- Standard indices and measures
- Good data quality
- Available now
- Has documentation

Cons:
- Cost to procure > cost to collect
- Does not have population of interest
- Data are obsolete
- Questionable data quality
- Poor documentation
- May not have specific variable(s) or population(s) of interest

There are three types of secondary data:
- Public Use
- Administrative Claims
- Proprietary

Public-Use Data
- Secondary data that has been made available to the public (usually by a Federal agency)
- Cost usually minimal to user (though very expensive to collect)
- Made public because there is a Federal mandate to make federally funded data collection efforts available to any researcher
- Is often claims-based

Administrative Claims Data
- Data that capture transactions involving (usually) billed services
- These data can be public sector (e.g., Medicaid or Medicare), private sector (e.g., managed care, employer-sponsored insurance), or a combination (e.g., a compilation of various payor sources)

Administrative Claims Data (continued)
- Administrative claims files are generally very large and difficult for a first-time user to navigate
- HIPAA has introduced new patient and privacy rules that may make data more difficult to obtain, especially sensitive data (e.g., substance abuse/mental health services)
- Data quality varies by source and over time
- Data availability from Federal agencies is very good; you may need connections to obtain private sector data

Proprietary Data
- Data that is collected and owned by another party, who makes it available to those with the resources and/or the connections
- Generally quite expensive (there are often academic rates, and some data may be available for free for dissertation work)
- These data are usually claims data, though some are not (e.g., Tufts University CSDD Drugs In Development data)

Once you have determined you will use secondary data, how do you determine the type of data you need?
Two key considerations:
- File structure
- Source of the data

File Structure: are the data static in time (cross-sectional) or continuous in time (longitudinal)?
- If your research question involves determining outcomes associated with drug exposure, your data will need to allow assessment of temporality (i.e., you will need longitudinal data)
- If your research question involves examination of associations, cross-sectional data may be sufficient

Analytic Considerations
- Is the sample size sufficient?
- Is your population of interest included? In sufficient numbers?
- Is everyone included in the data you received?
- Are the outcomes of your research question available in your data? Are they measured sufficiently?
- What explanatory variables and covariates are included? How are they measured?
- What is the data quality? Are there missing variables? Values? How are missing data coded? Imputation?
- Is there documentation?
You WILL need to conduct data validation and cleaning steps with your data: there is no such thing as perfect data.

Comparison criteria (Public Use, Claims, Proprietary):
- Cost to obtain: inexpensive
- Cost to process/store

Are you able to do the analyses you want with the data? For example, if the study question involves determining outcomes associated with drug use, do your data allow you to assess causality?
Cross-sectional data generally will not allow you to assess causality, only the association between variables (exception: respondents may be asked when an event happened, so temporality can be assessed).
- Is your sample size sufficient?
- Does your research require national representativeness?

Comparison criteria, continued (Public Use, Claims, Proprietary):
- Cost: usually much less than collecting the data yourself
- Recency of data
- Has data variable of interest (no control over what is asked and how it is asked)
- Has population of interest (if nationally representative, it may, but samples may still be too small for analyses with adequate power, especially if you are interested in rare events)
- Standard indices and measures
- Data quality (may be excellent, but you have no control over how data are collected, recorded, coded, entered, aggregated, etc.; you don't know where the errors are)
- National representativeness
- Waiting time to obtain (generally faster than own collection; however, you may have to wait to get recent data, and claims are often difficult to process and clean)
- Documentation
[Table cell values not recoverable; surviving entries: N/A, Unknown, Unknown]

Assessing quality of secondary data
Secondary data varies in cost, timeliness, scope, available measures, and other domains
How to know if it's good data:
- Research literature that has used the data
- Examine data dictionary and other documentation
- Ask for a sample or demo data
- Discuss with users
- Have vendor do preliminary search for variables, power

How to validate secondary data
- Reliability varies across fields, respondents, and years
- Coding among institutions may vary
- Numbers can be miscoded
- If you have outliers, look at data problems first
- Do interim analyses very early on; validate every field in some way
- Calculate mean, median, minimums and maximums, skewness, kurtosis
- If data are missing, try to determine why. Skip patterns? Coded as 9 or 0? Or is it random?

Your study design will be largely dependent upon the type and quality of data and its intended use
Uses of secondary data:
- Exploratory/hypothesis generating
- Combine with primary data collection
- Hypothesis testing

Exploratory/Hypothesis generating:
- Can use secondary data to explore research questions prior to fielding your own primary data collection effort
- Can use to determine whether you will need to oversample specific populations, and to refine variable measurement, explore associations, and revise hypotheses

Combine with Primary data:
- Dynamic analysis that allows supplementation with in-depth interviews, focus groups, and surveys to provide context for quantitative analyses of secondary data

Hypothesis Testing:
- Most research in our field falls into this domain
- Cross-sectional data are most useful for exploratory studies and in combination with primary data, but can be used for hypothesis testing if the data are congruent with the research questions and hypotheses
- Particularly case-control studies
- Many cross-sectional data are collected annually, so you can pool them to increase sample size, and conduct trend analyses if you look at cross-sections over time
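
The validation checklist above (descriptive statistics for every field, plus a look at how missing values are coded) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the function name `summarize_field`, the sentinel codes in `MISSING_CODES`, and the example ages are all hypothetical, and the real missing-value codes must come from the data set's documentation.

```python
import math
import statistics
from collections import Counter

# Hypothetical sentinel codes for "missing"; the real codes must come from
# the data dictionary. A code such as 0 may also be a legitimate value,
# so it is deliberately not treated as missing here.
MISSING_CODES = frozenset({9, 99, 999})


def summarize_field(values, missing_codes=MISSING_CODES):
    """Describe one numeric field: missing-value counts plus basic statistics.

    Values that are None or equal to a sentinel code are counted as missing
    and excluded from the statistics.
    """
    observed = []
    missing = Counter()
    for v in values:
        if v is None or v in missing_codes:
            missing[v] += 1
        else:
            observed.append(v)
    if not observed:
        raise ValueError("no observed values to summarize")

    n = len(observed)
    mean = statistics.fmean(observed)
    sd = statistics.pstdev(observed)  # population standard deviation
    if sd > 0:
        # Moment-based (population) skewness and excess kurtosis.
        skew = sum((x - mean) ** 3 for x in observed) / (n * sd ** 3)
        kurt = sum((x - mean) ** 4 for x in observed) / (n * sd ** 4) - 3.0
    else:
        skew = kurt = math.nan

    return {
        "n_observed": n,
        "n_missing": sum(missing.values()),
        "missing_by_code": dict(missing),
        "mean": mean,
        "median": statistics.median(observed),
        "min": min(observed),
        "max": max(observed),
        "skewness": skew,
        "excess_kurtosis": kurt,
    }


# Made-up ages in which 999 is assumed to mean "not recorded".
ages = [34, 41, 29, 999, 52, None, 47, 38, 999, 45]
for name, value in summarize_field(ages).items():
    print(f"{name}: {value}")
```

Running a summary like this on every field early on makes implausible minimums and maximums, heavy skew, and unexpected missing-value codes visible before any modeling starts. Whether a given code really means "missing" is a question for the documentation, not the code.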

Case-Control Studies
- Rare outcomes
- If money were not an issue, we'd prefer prospective cohort studies
- Case-control studies are efficient cohort studies with a ready-made population from which to draw study subjects

Case-Control Studies: Pros and Cons
Strengths:
- Relatively inexpensive
- Good design for chronic conditions with long latency periods
- Rare diseases
- Can examine multiple factors
Weaknesses:
- Inefficient for evaluating rare exposures
- Cannot directly estimate incidence rates
- Can be more difficult to control for biases and confounding

Designs for Temporal Analysis
If your data are longitudinal, then you have more sophisticated options and can do robust hypothesis-testing studies
- Cohort: examine characteristics of cohorts at 2 or more points in time. Cohort = any group that experiences a major life event at the same time (e.g., birth cohort). Especially suited to studying aging and social, political, or cultural change.
- Panel: data collected at 2 or more points in time for the same persons. Only panel designs allow study of changes among respondents rather than simply populations.
- Other types: event history design.
- Time series: used to describe changing patterns of phenomena, explain sources of changes, and make predictions about future changes. Can often use cross-sectional data on the same measures but different respondents. Need many time periods (at least 30).

Where to Get Secondary Data
See Handout
- ICPSR: Inter-university Consortium for Political and Social Research, www.icpsr.umich.edu
- www.cdc.gov/nchs
- www.samhsa.gov and www.icpsr.umich.edu/sahda/
- www.cms.hhs.gov

Databases (classified in the original table as Public Use, Proprietary, or Claims/Encounter):
MCBS; MEPS/NMES; NHANES; NAMCS/NHAMCS; NHSDA/NSDUH; MTF; Marketscan; PBM; Medicaid Claims; NPA; Medicare Claims
Source(s):
AHRQ; CDC; CDC; SAMHSA, SAMHDA; NIDA, SAMHDA; Medstat; Various; CMS, also individual states; IMS; CMS

SOURCE
- AHRQ, Agency for Healthcare Research and Quality
  http://www.ahrq.gov/
  http://www.meps.ahrq.gov/
  http://www.meps.ahrq.gov/epset/hc/epsethc.asp (for online analysis)
  http://www.ahrq.gov/data/hcsusix.htm
- Centers for Medicare and Medicaid Services
  http://www.cms.hhs.gov/
  http://www.cms.hhs.gov/apps/mcbs/

DATA SET
- MEPS: Medical Expenditure Panel Survey
- HCSUS: HIV Cost and Services Utilization Study
- MCBS: Medicare Current Beneficiary Survey

GENERAL DESCRIPTION
The Medical Expenditure Panel Survey (MEPS) is designed to continually provide policymakers, health care administrators, businesses, and others with timely, comprehensive information about health care use and costs in the United States, and to improve the accuracy of their economic projections. MEPS collects data on the specific health services that Americans use, how frequently they use them, the cost of these services, and how they are paid for, as well as data on the cost, scope, and breadth of private health insurance held by and available to the U.S. population.

HCSUS is the first major research effort to collect information on a nationally representative sample of people in care for HIV infection. HCSUS is examining costs of care, utilization of a wide array of services, access to care, quality of care, quality of life, unmet needs for medical and nonmedical services, social support, satisfaction with medical care, and knowledge of HIV therapies.

The Medicare Current Beneficiary Survey (MCBS) is a continuous, multipurpose survey of a nationally representative sample of aged, disabled, and institutionalized Medicare beneficiaries. MCBS, which is sponsored by CMS, is the only comprehensive source of information on the health status, health care use and expenditures, health insurance coverage, and socioeconomic and demographic characteristics of the entire spectrum of Medicare beneficiaries.

SOURCE
- NCHS, National Center for Health Statistics
  http://www.cdc.gov/nchs/express.htm
  http://www.cdc.gov/nchs/about/major/nhis/hisdesc.htm
  http://www.cdc.gov/nchs/nhanes.htm
  http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm
  http://www.cdc.gov/nchs/lsoa.htm

DATA SET
- NHIS: National Health Interview Survey
- NHANES III: National Health and Nutrition Examination Survey
- NAMCS: National Ambulatory Medical Care Survey
- NHAMCS: National Hospital Ambulatory Medical Care Survey
- LSOA: Longitudinal Studies of Aging

GENERAL DESCRIPTION
NHIS: The National Health Interview Survey is a cross-sectional household interview survey. NHIS collects data each year in three areas: demographics, health status, and health care utilization. Data may be used to provide national estimates on the incidence of acute illness and injuries, prevalence of chronic conditions and impairments, the extent of disability, utilization of health care services, and other health-related topics.

NHANES III: Data focus is on chronic diseases and risk factors such as heart disease, diabetes, arthritis, infectious diseases, immunization status, growth and development of children, overweight, dental health, respiratory disease, osteoporosis, mental health, and others. The data can be used to provide national prevalence estimates.

NAMCS: The purpose of NAMCS is to gather and disseminate statistical data about the medical care provided by office-based physicians in the US.

NHAMCS: The purpose of NHAMCS is to produce statistics that are representative of the experience of the US population in hospital emergency departments (EDs) and outpatient departments (OPDs).

Contains online drug database search engine: http://www2.cdc.gov/drugs/

LSOA: LSOA is a multicohort study of persons 70 years of age and over, designed primarily to measure changes in the health, functional status, living arrangements, and health services utilization of two cohorts of Americans as they move into and through the oldest ages.