Using Ultra-Large Data Sets in Healthcare: New Questions, New Answers
  David Hartzband, D.Sc.
  Director, Technology Research, RCHN Community Health Foundation
  Lecturer, Engineering Systems Division, Massachusetts Institute of Technology
Big Data?
  Big Data is the management & use of ultra-large amounts of information, where:
    Management & use = efficient storage, search, analysis & visualization
    Ultra-large = more than 1 Petabyte of data
  1 byte = a single printed character
  5 million bytes (5 Megabytes or 5MB) = the complete printed works of Shakespeare
  4,000 times that (20 billion bytes, 20 Gigabytes) = the complete recorded works of Beethoven
  500 times that (10 trillion bytes, 10 Terabytes) = the printed works in the Library of Congress
  100 times that (1 quadrillion bytes, 1 Petabyte) is a lot!
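The scale ladder on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, so 1 MB is 10^6 bytes. The constant names and the `human` helper are illustrative and do not come from the original deck.

```python
# Scale ladder from the slide, assuming decimal (SI) byte units.
SHAKESPEARE = 5 * 10**6                 # 5 MB: complete printed works
BEETHOVEN = 4_000 * SHAKESPEARE         # 20 GB: complete recorded works
LIBRARY_OF_CONGRESS = 500 * BEETHOVEN   # 10 TB: printed works
PETABYTE = 100 * LIBRARY_OF_CONGRESS    # 1 PB


def human(n_bytes):
    """Format a byte count using decimal unit steps of 1000."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1000


for label, size in [
    ("Shakespeare", SHAKESPEARE),
    ("Beethoven", BEETHOVEN),
    ("Library of Congress", LIBRARY_OF_CONGRESS),
    ("Ultra-large threshold", PETABYTE),
]:
    print(f"{label}: {human(size)}")
```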
Is this Real?
  8 years ago, when I was a VP at EMC Corporation, Merck asked my group if we could manage a 1PB submission to the FDA
  Today Google has about 2PB of information under management for Google Earth
  A typical EHR record per patient (not counting images) ranges from 1MB for a healthy young person, to 40MB for a middle-aged person with some health issues, to 3-5GB for a person with several health issues, including images*
  * SearchStorage.com
Translate this to Kaiser
  Kaiser Permanente has 8.8M members*
  Based on the upper estimates of EHR record size (3-5GB per member), KP would have between 26.5PB & 44PB of patient data under management just from EHR data, including images & annotations
  By raw size alone, this is up to 4,400 Libraries of Congress, which is not a meaningful or imaginable concept
  By some estimates, the total size of digitized patient data in the US might be as large as 600PB-10EB (10 exabytes)
  * http://xnet.kp.org/newscenter/aboutkp/fastfacts.html, accessed 9/19/11
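The Kaiser estimate is simple multiplication, sketched here under two stated assumptions: decimal units, and every member's record falling in the 3-5GB range quoted for patients with several health issues. With those assumptions the low end comes out at 26.4PB, slightly under the 26.5PB on the slide, which may reflect rounding in the original. Variable names are illustrative.

```python
# Back-of-the-envelope check of the Kaiser figures (decimal units assumed).
GB = 10**9
TB = 10**12
PB = 10**15

MEMBERS = 8_800_000                 # Kaiser Permanente members, per the slide
LIBRARY_OF_CONGRESS = 10 * TB       # printed works, per the earlier slide

# Assumption: every member's record is at the 3-5 GB upper estimate.
low_bytes = MEMBERS * 3 * GB
high_bytes = MEMBERS * 5 * GB

low_pb = low_bytes / PB
high_pb = high_bytes / PB
libraries = high_bytes // LIBRARY_OF_CONGRESS

print(f"Low estimate:  {low_pb:.1f} PB")
print(f"High estimate: {high_pb:.1f} PB")
print(f"High estimate in Libraries of Congress: {libraries}")
```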
OK, This is Big, But
  Kaiser is never going to try to analyze all 44PB of data at once
  Analysis of any kind is typically done on cohorts of patients that number in the 1000s
  The San Diego Supercomputer Center currently has 16TB* of CMS data under management: Medicaid claims for the past 5 years (minus some States)
  But what if analysis could be done on much larger numbers?
    What kinds of questions could you ask?
    What kinds of analysis could you do?
    What could it tell you?
  * Natasha Balac, SDSC, personal communication
Questions?
  Say Kaiser (or HHS, or NY, etc.) wanted to look at how many patients had flu or flu-like symptoms, by analyzing patterns in EHR data even where flu was not diagnosed, on a weekly basis for 2006-2010 (260 weeks)
  What if they wanted to correlate the length of acute respiratory infections with administered doses of specific antibiotics?
  What if they wanted to model the course of seasonal respiratory infections & their response to different drug therapies?
  What if they wanted to enhance those results by using data from social media & other sources to develop new epidemiologic indicators?
  What if they wanted to determine the cost (to Kaiser) of those infections & correlate it with specific drug therapies?
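The first question, weekly counts of patients with flu-like symptoms regardless of diagnosis, can be sketched at toy scale. Everything here is an assumption for illustration: the record layout (patient ID, visit date, set of symptom strings), the symptom rule (fever plus cough or sore throat), and the sample data. A real analysis would work from coded EHR fields and run on distributed infrastructure.

```python
from collections import defaultdict
from datetime import date


def is_flu_like(symptoms):
    """Hypothetical rule: fever together with cough or sore throat."""
    return "fever" in symptoms and bool({"cough", "sore throat"} & symptoms)


def weekly_flu_like_counts(encounters):
    """Count distinct patients with flu-like symptoms per ISO (year, week)."""
    patients_by_week = defaultdict(set)
    for patient_id, visit_date, symptoms in encounters:
        if is_flu_like(symptoms):
            iso = visit_date.isocalendar()
            patients_by_week[(iso[0], iso[1])].add(patient_id)
    return {week: len(ids) for week, ids in sorted(patients_by_week.items())}


# Made-up sample encounters; no flu diagnosis is recorded anywhere.
encounters = [
    ("p1", date(2006, 1, 3), {"fever", "cough"}),
    ("p1", date(2006, 1, 5), {"fever", "sore throat"}),  # same patient, same week
    ("p2", date(2006, 1, 4), {"cough"}),                  # no fever: excluded
    ("p3", date(2006, 1, 10), {"fever", "cough", "fatigue"}),
]

counts = weekly_flu_like_counts(encounters)
print(counts)
```

Counting distinct patients rather than encounters keeps a patient with several visits in one week from inflating that week's total.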
New Process & Analyses
  Data acquisition from EHRs, PHRs, other structured clinical & demographic data, the Web & other unstructured sources such as social media
  Aggregation of an ultra-large analysis set, using new database & data transformation technologies such as NoSQL DBs, MapReduce, UIMA, etc.
  Use of new tools to define analyses or models, including R & Hadoop
  Requires new skills to design analyses & interpret results
  Leaders include Google, IBM, Amazon, EMC, MongoDB, Opera Solutions, 1010Data, Quantivo, Zillabyte
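For readers new to MapReduce, the pattern itself is small: a map step emits key/value pairs, a shuffle step groups values by key, and a reduce step combines each group. The sketch below is a single-process Python illustration of that pattern only. It does not use Hadoop's API, and the record fields and function names are hypothetical.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one (antibiotic, doses) pair per prescription record."""
    yield record["antibiotic"], record["doses"]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_doses(key, values):
    """Reduce step: combine the values for one key into a total."""
    return key, sum(values)


def total_doses_by_antibiotic(records):
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_doses(k, v) for k, v in shuffle(pairs).items())


# Made-up prescription records for illustration.
records = [
    {"patient": "p1", "antibiotic": "amoxicillin", "doses": 10},
    {"patient": "p2", "antibiotic": "azithromycin", "doses": 5},
    {"patient": "p3", "antibiotic": "amoxicillin", "doses": 14},
]

totals = total_doses_by_antibiotic(records)
print(totals)
```

In a real cluster the map and reduce steps run in parallel across many machines and the framework performs the shuffle, which is what lets the same logic scale to very large record sets.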
Hasn't This Been Done?
  Yes, for 1000s of patients, maybe even 10s of thousands of patients, but not for millions
  The difference is between a 90% (1 in 10 error rate) & a 99.9999% (1 in 1,000,000 error rate) confidence level
  Analysis using this much data produces results with close to certainty
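One way to see why cohort size matters is the standard normal-approximation margin of error for an estimated proportion, which shrinks with the square root of the sample size. This is a related illustration, not the calculation behind the slide's figures. It covers sampling error only and assumes a simple random sample. The 10% prevalence is a made-up value.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.95):
    """Normal-approximation margin of error for a proportion p estimated from n patients."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


PREVALENCE = 0.10  # hypothetical share of patients with the condition

for cohort in (1_000, 10_000, 1_000_000):
    moe = margin_of_error(PREVALENCE, cohort)
    print(f"n = {cohort:>9,}: +/- {moe * 100:.3f} percentage points")
```

Going from thousands of patients to millions narrows the interval by a factor of about 30, since the margin scales as one over the square root of n.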
Calling Dr. Watson
  Early medical expert systems: Mycin (Stanford) diagnosed bacterial infections using about 600 rules, with data entered for each diagnosis; 69% effective (10% better than human), but never used in practice
  Current: Active Health (Aetna) is still rule based (10,000s of rules), uses a knowledge base created by medical & IT staff, & produces medical alerts based on current research & best practice
  IBM Dr. Watson: adapted from the Watson hardware/software system; Deep Question Answering (DeepQA); content acquired from the Web or specified documents (EHR, etc.); analyzes questions, generates & evaluates hypotheses, generates & evaluates answers, proposes diagnosis & treatment (now allied with WellPoint, 33.3M members)
We are just Starting
  Systems now can be directed at ultra-large scale analysis & predictive modeling, not just diagnosis
  Many companies, big & small, are developing tools for data acquisition, analysis & modeling at this scale: IBM, Oracle, as well as 100s of start-ups & smaller companies
  Healthcare will benefit:
    Outcomes improved through discovery of new evidence-based practices
    Cost control through integrated clinical & financial analysis
    Public health improved through use of more accurate models
Continue the Discussion
  dhartzband@rchnfoundation.org
  dhartz@mit.edu