Paths to success in computer-assisted coding

By Philip Resnik, PhD, Strategic Technology Advisor, 3M Health Information Systems

Today healthcare organizations still need to help coders apply their knowledge and expertise, but the traditional coding process is not going to remain sustainable for much longer. With the advent of ICD-10, everything changes. Coders must refresh their knowledge and master a drastically larger and more complex coding system. NLP can help.

Executive summary

The advent of computer-assisted coding (CAC) technology and the mandate to implement ICD-10 have ignited an explosion of CAC solutions in the marketplace based on natural language processing (NLP). With this explosion has come inevitable confusion in the market and clamor over NLP technologies. As a result, it's not surprising that even IT-savvy healthcare professionals have difficulty digging out from under the avalanche of jargon, competing claims, and conflicting guidance about what to pay attention to when evaluating NLP-based solutions.

This paper has two goals. The first is to help healthcare professionals cut through confusion about NLP by providing a trustworthy, big-picture view of the field and answering some of the key questions people have about it. The second is to help organizations develop reasonable expectations of what NLP can do, and what they should reasonably expect to gain, as they transition their coding to ICD-10.

This paper also has two key takeaways. The first is a set of three success factors for natural language processing in any field or industry: comprehensive and reliable knowledge, large quantities of language data, and state-of-the-art machine learning techniques that put the two together. The second is that in addition to technology, experience and trust are crucial in the midst of rapid change. To achieve success with ICD-10 and beyond, healthcare organizations will need a coding software vendor that has not only the NLP technology, but also the commitment and experience to see them through this critical transition.
The new coding landscape

Medical coding used to be about human coders carefully reading clinical documentation and producing codes. For decades, this approach has worked well: Educate coders thoroughly and provide them with trustworthy tools like the 3M Codefinder Software, which can help them more efficiently leverage their own knowledge and expertise.

Today healthcare organizations still need to help coders apply their knowledge and expertise, but the traditional coding process is not going to remain sustainable for much longer. With the advent of ICD-10, everything changes. Coders must refresh their knowledge and master a significantly larger and more complex coding system, adding to the existing shortage of coders as the entire coding ecosystem evolves. Traditional coding productivity is expected to decline while coders get comfortable with the new system, documentation must be improved to meet the demands of the new code set, and HIM departments will need time to come fully up to speed.

Along with these challenges, healthcare organizations are looking to go beyond coding and make broader and more effective use of their clinical data. They need to capture the information required for billing and turn clinical documentation into a rich data repository that can be mined and analyzed in support of a wide range of downstream applications.

The powerful characteristic of medical record coding is that it combines data and knowledge: Coders take in text and other raw data in clinical records and, using their knowledge and expertise, transform that text and data into valuable information. When NLP technology is introduced into the coding world to enable CAC, it super-charges the most important elements of that coding process and provides the path from today's coding world to success in the new coding landscape. To understand why, consider the three essential elements for creating successful, state-of-the-art NLP:

1. Knowledge is required for NLP success: trustworthy, reliable knowledge about the subject matter. The early days of NLP were all about creating systems by programming in human rules and knowledge, hoping to build computers that would understand language the same way we do.

2. Language data is vital to NLP success, because when it comes to real language, approaches based solely on human-written knowledge and rules simply don't work. Around 1990, the NLP field underwent a major revolution as people realized that for large and complicated problems, no team of experts can capture every detail and complexity. Instead of relying solely on the knowledge in experts' heads, technologists developed algorithms to crunch through vast amounts of data, using automated analysis and statistics to help find order in the real-world chaos of how human language is actually used.

3. Machine learning consists of algorithms and models that allow machines to start with existing sources of knowledge, analyze new data, and improve their own capabilities. The last ten years of NLP have been described as "the rise of machine learning." 1

The sidebar on the next page captures this big picture visually, and its message is clear: If a vendor's solution is not using trustworthy knowledge sources, large-scale text data analysis, and machine learning, then it does not represent state-of-the-art NLP.
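As a purely illustrative aside, the sketch below shows, in miniature, how these three elements can fit together: an expert-written keyword rule (knowledge), a handful of example sentences (language data), and a statistical classifier trained on both (machine learning). Everything in it, the sentences, codes, keyword list, and model choice, is a hypothetical stand-in rather than a description of any vendor's product, and it assumes Python with scikit-learn and NumPy installed.

```python
# A minimal sketch: expert knowledge (a keyword rule) + language data
# (example sentences) + machine learning (a statistical classifier).
# All texts, codes, and keywords are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

# "Language data": a few toy sentences paired with toy diagnosis codes.
texts = [
    "patient has type 2 diabetes mellitus without complication",
    "elevated blood sugar, known diabetic on metformin",
    "acute exacerbation of asthma, wheezing on exam",
    "asthma well controlled, continue inhaler",
]
codes = ["E11.9", "E11.9", "J45.901", "J45.901"]  # hypothetical labels

# "Knowledge": an expert-written rule flagging diabetes-related wording.
diabetes_terms = {"diabetes", "diabetic", "mellitus"}
def rule_feature(text):
    return [1.0 if any(t in text.lower() for t in diabetes_terms) else 0.0]

# Data-driven features: simple bag-of-words counts.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts).toarray()

# Combine the rule feature with the learned features and train a model.
X = np.hstack([bow, [rule_feature(t) for t in texts]])
model = LogisticRegression(max_iter=1000).fit(X, codes)

# Suggest a code for a new (invented) sentence.
new_text = "follow-up for diabetes mellitus, sugars improving"
new_X = np.hstack([vectorizer.transform([new_text]).toarray(),
                   [rule_feature(new_text)]])
print(model.predict(new_X)[0])  # expected: E11.9
```

The point of the sketch is the division of labor: the rule contributes domain knowledge directly, while the classifier learns additional patterns from the data.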
[Figure 1: The rise of data-driven methods and machine learning in NLP. The chart plots, from 1970 to 2008, the percentage of papers using statistical methods in general NLP (ACL) and in biomedical NLP (AMIA), across the eras labeled along the bottom: 1970-mid 1980s, natural language understanding; 1988, data-driven approaches; 1994-1999, the field comes together; 2000 onward, the rise of machine learning.]

The rise of state-of-the-art NLP

NLP has evolved from a purely knowledge-based field to one dominated by statistical modeling and machine learning approaches. Figure 1 combines information from two scholarly sources to illustrate this evolution and how the three critical success elements (knowledge, language data and machine learning) fit into NLP and its historical timeline. 2

The stages of evolution in NLP identified along the bottom of Figure 1 (what happened when) come from today's leading NLP textbook, Speech and Language Processing, Second Edition, by Daniel Jurafsky and James Martin (Pearson Prentice Hall, 2008). The data points show the percentage of papers using statistical, data-driven methods and machine learning appearing in NLP's premier academic conferences, such as ACL (Association for Computational Linguistics) and AMIA (American Medical Informatics Association). As the figure shows, the statistical revolution in NLP began around 1990; in medical informatics the same revolution took hold ten years later, and many in the field have not yet caught up.

CAC and the coding vendor landscape

While many coding software vendors today offer NLP-driven CAC solutions, very few of them can actually put all three NLP success factors to work in their products. Some vendors market clinical terminologies and ontologies, which are expert-created knowledge sources, without offering a means of connecting them to real-world patterns of language use. Other vendors rely on claims about systems "understanding" what they read to mask their lack of expertise in modern machine learning. Still other vendors resort to vague statements about using modern statistical NLP, but never manage to actually say where it can be found in their products. Finally, some vendors may be strong in statistics or machine learning, but they lack the necessary depth of experience and knowledge in the clinical coding domain.
Understanding state-of-the-art NLP and what it means for coding

With all the product variety and competing claims in the marketplace, it can be difficult for organizations to understand what state-of-the-art NLP really looks like and connect what they are hearing about NLP to their real-world needs. This section summarizes and answers some key questions in a way that is useful for organizations needing to make informed decisions about CAC solutions.

What is an NLP engine?

The term NLP engine typically refers to a pipeline of technology components that perform tasks in language analysis: the step-by-step processing of language input to achieve the desired output. Raw text is broken into sentences, sentences are broken into words, words are tagged with their parts of speech, meaningful phrases are identified, and so forth. However, the most recent technological innovations in NLP do not rely on a single pipeline. Instead, an NLP platform combines evidence from a collection of different components. Instead of just one path, there are many different ways to arrive at an answer. The system considers all of them and then combines their results to achieve a more reliable conclusion than any one path or pipeline could have achieved alone. A compelling example of how well this approach can work is IBM Watson, which beat the human champions on JEOPARDY! Watson is based on a platform containing multiple paths followed by a synthesis of evidence that identifies the most accurate answer. (See Figure 2.) Similarly, most advanced machine translation systems collect multiple translation results and synthesize them into a single output.

Bottom line: An NLP engine is a pipeline architecture that takes language input through a sequence of processing steps to produce output. More recently, high-performance NLP platforms go beyond a single pipeline to permit multiple paths of analysis and evidence combination to produce the best answer.

[Figure 2: The DeepQA platform in IBM Watson. Processing moves from shared analysis (question analysis and query decomposition), through multiple ways of finding the answer (parallel paths of hypothesis generation, soft filtering, and hypothesis and evidence scoring), to evidence combination (synthesis, then final merging and ranking with trained models), producing an answer and a confidence. To beat the human champions on JEOPARDY!, Watson combined knowledge-driven NLP with large-scale data analysis and machine learning in a platform that combines evidence from multiple paths to arrive at reliable conclusions. 3]
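To make the pipeline idea described above concrete, here is a minimal sketch of the early stages (sentence splitting, tokenization, and part-of-speech tagging) using the open-source NLTK toolkit. It is a generic, hypothetical example rather than a depiction of any particular vendor's engine; it assumes NLTK is installed, and model resource names can vary slightly across NLTK versions.

```python
# A minimal sketch of early NLP pipeline stages: sentence splitting,
# tokenization, and part-of-speech tagging (generic example, not any
# specific CAC product). Assumes the NLTK package and its models.
import nltk

# One-time downloads of the sentence splitter and POS tagger models
# (resource names may differ in newer NLTK releases).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = ("The patient was admitted with chest pain. "
        "An ECG was performed and showed no acute changes.")

# Stage 1: break raw text into sentences.
sentences = nltk.sent_tokenize(text)

# Stage 2: break each sentence into word-level tokens.
tokenized = [nltk.word_tokenize(s) for s in sentences]

# Stage 3: tag each token with its part of speech.
tagged = [nltk.pos_tag(tokens) for tokens in tokenized]

for sentence_tags in tagged:
    print(sentence_tags)
# Later stages (phrase detection, concept labeling, relation finding)
# would build on this shared analysis.
```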
Can systems really "understand" or "comprehend" clinical text?

Well-known NLP-based systems act as if they understand human language (IBM Watson, Apple Siri, and Google Translate seem all too human to many of us), but systems like these are all designed to perform very specific tasks using the fundamental success elements identified above: reliable knowledge sources, large-scale data, and machine learning. Getting computers to actually understand human language has been a dream of researchers since the earliest days of computing, but technologists know we are a long way from approaching true human understanding. To suggest otherwise is at best incredibly optimistic, and at worst it's misleading.

Bottom line: No. Use of the word "understanding" tells you more about a vendor's marketing strategy than about its technology.

What is statistical natural language processing?

The term statistical NLP means NLP that includes learning quantitatively from data. This distinguishes it from pre-1990s NLP, which was based on knowledge and rules built by human experts. After a brief period of tumult in the research community, statistical NLP became the state of the art, as witnessed by the data points in Figure 1. Some of the earliest and most effective statistical NLP techniques involved matching or counting words and phrases; unfortunately, people who have not kept up with the field sometimes mistakenly use "statistical NLP" to refer specifically to those methods.

Bottom line: Today's state-of-the-art NLP is statistical NLP: the combination of expert-driven rules and knowledge, large-scale data analysis, and quantitative machine learning.

Don't systems that use machine learning require a long time for data collection and training?

A simplified picture of machine learning is that it first requires building up a large, coded training set (which could take a long time, especially with a new code set) and then running a training algorithm on that data. Fortunately, that's not the case in practice, because state-of-the-art NLP systems don't have to start their training from scratch. They can use expert rules and knowledge, along with large volumes of existing data (even records coded using a different code set), to start at a high level of performance and then further improve via continuous learning and adaptation as more and more data comes in.

Bottom line: Although in academic settings "clean room" experiments are typically used to develop and test machine learning by making it start from scratch, that's not how machine learning systems work in the real world. Data-driven feedback is dramatically more effective over time than tweaking by human experts.
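As a rough, hypothetical illustration of why such systems need not start from scratch, the sketch below first trains a simple statistical model on existing labeled examples (standing in for historical coded records) and then updates it incrementally as newly verified examples arrive, instead of retraining from zero. The texts, labels, feature choice, and model are all invented for illustration; the example assumes Python with scikit-learn.

```python
# Hedged sketch of continuous learning: start from existing labeled data,
# then update the model incrementally as newly coded records arrive.
# All texts and labels are invented; feature hashing keeps the feature
# space fixed so incremental updates are possible.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**16)
classes = ["DIABETES", "ASTHMA"]  # hypothetical label set, fixed up front

# Step 1: initial training on existing (historical) coded examples.
historical_texts = [
    "type 2 diabetes mellitus, on metformin",
    "asthma exacerbation treated with albuterol",
]
historical_labels = ["DIABETES", "ASTHMA"]
model = SGDClassifier(random_state=0)
model.partial_fit(vectorizer.transform(historical_texts),
                  historical_labels, classes=classes)

# Step 2: as coders verify new records, feed them back incrementally.
new_texts = ["diabetic patient, sugars poorly controlled"]
new_labels = ["DIABETES"]
model.partial_fit(vectorizer.transform(new_texts), new_labels)

# The updated model can now score the next incoming document.
print(model.predict(vectorizer.transform(
    ["wheezing consistent with asthma"]))[0])
```

In practice the starting point could also incorporate expert rules and knowledge sources, as described above; the sketch shows only the incremental-update mechanics.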
Why is PHI not a problem for data-driven NLP systems?

In this age of big data, privacy issues are, appropriately, a huge focus of attention and concern. Fortunately, when data-driven systems learn from clinical records, their job is to learn about clinical language and concepts, not about individual patients. This means that, as one of the very first steps in a machine learning process, any properly implemented NLP platform will disregard or discard data fields designated as protected health information (PHI), and the learning process itself will consist only of aggregated analysis, eliminating any possible identification with individual patient records. For example, a system might analyze tens of thousands of records to learn different ways the concept "diabetes mellitus" is described by physicians, but nowhere in the learning process does the system ever record that some particular patient is diabetic. In this regard, the automatic learning process in an NLP platform is very similar to the way researchers learn about drugs in a clinical trial. Personal identities and private information are not relevant to the analysis, and that information is generally not even accessible to the researchers. What matters, and what is retained, is information about large-scale patterns, not individuals.

Bottom line: From the perspective of data-driven NLP, all patients are anonymous. Data from a patient's record (terms, phrases, concepts, codes, anything extracted from the record during the learning process) ceases to be identifiable as belonging to any individual person or coming from any particular organization.

How is NLP technology evaluated?

Proper evaluation of language technology involves three key elements:

A dataset that adequately represents the real-world problem you are trying to solve (the test set),

The correct answers for that dataset (the ground truth), and

An evaluation measure, which quantifies how well a system produces correct answers. 4

Precision and recall are widely used evaluation measures, similar to specificity and sensitivity in medical diagnosis. Recall (identical to sensitivity) captures the extent to which you got all the codes you were supposed to, and precision captures the extent to which the codes you did produce were actually correct, rather than having additional incorrect codes mixed in. Crucially, any test set needs to be large enough to represent real-world data: a test set of a few hundred records cannot provide a thorough evaluation of system performance when there are tens of thousands of possible codes. Equally important, any evaluation of system performance must include human coders as a point of comparison, because the benchmark for human coding falls far below 100 percent even when using familiar coding systems. 5

Bottom line: Evaluation of language technology for coding requires a benchmark test set that is large enough to represent the real world, along with correct codes and an assessment of human performance to compare against. Small datasets produce untrustworthy results.
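To make precision and recall concrete, the short, self-contained sketch below compares a system's suggested codes against the ground-truth codes for a single record and computes both measures. The codes are invented examples, not evaluation data for any product.

```python
# Minimal sketch of precision and recall for code assignment.
# Codes below are hypothetical examples, not an evaluation of any product.

def precision_recall(predicted, gold):
    """Precision = correct predictions / all predictions.
    Recall = correct predictions / all gold-standard codes."""
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

# Ground truth: codes a human expert assigned to a record (invented).
gold_codes = {"E11.9", "I10", "J45.901"}

# System output: two correct codes, one incorrect extra, one missed.
predicted_codes = {"E11.9", "I10", "Z99.89"}

p, r = precision_recall(predicted_codes, gold_codes)
print(f"precision = {p:.2f}")  # 2 of 3 predictions correct -> 0.67
print(f"recall    = {r:.2f}")  # 2 of 3 gold codes found    -> 0.67
```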
The promise and prospects for CAC, NLP and ICD-10

Perhaps the biggest question on every healthcare organization's mind today is how CAC can assist them in making the transition to ICD-10. This section focuses on the vision and the realities of applying CAC and NLP to the challenge of coding in ICD-10.

The impact of ICD-10 on human coders

According to an ICD-10 end-to-end coding test reported on by Government Health IT in the autumn of 2013, trained ICD-10 coders of the human variety achieved between 55 and 63 percent accuracy in dual coding of 20 peer-reviewed scenarios. 6 In addition, the same test showed that, in comparison with ICD-9, coder productivity was literally cut in half, even for the best coders. 7

In context, neither a loss of accuracy nor decreased productivity should be too surprising, given the differences between these two classification systems. When carefully evaluated, expert inter-coder agreement for ICD-9 codes is typically below 80 percent. 8 And it stands to reason that some coders may want or need to use reference manuals in the first months of working in a new system, so they may be more methodical in their coding process. Moreover, the requirements for adequate documentation will change with the introduction of ICD-10, since more detailed information is needed to conclusively select a code in the more granular coding system. Queries and feedback loops may cause further productivity losses as coders take time getting necessary information and clarifications from providers.

3M recommendations for the human side of the ICD-10 transition

Train coders as early as possible on ICD-10; do not expect initial dual-coding and CAC software to be your first line of training.

Budget for a productivity impact in the first six months; during the first several months of ICD-10, human coders will likely be slower than usual.

Plan to perform concurrent CDI (clinical documentation improvement) to mitigate any documentation gaps that affect ICD-10 coding.

Specifically address inter-coder agreement in your organization; develop an internal process for detecting areas of disagreement and establish an internal coding clinic to facilitate sharing of expertise and learning throughout your coding team.
How CAC and NLP can help with ICD-10

Although CAC and NLP will not magically make the impacts of ICD-10 go away, they can help with the transition in several clearly identifiable ways.

First, NLP can help coders make more effective use of trusted knowledge resources in the less familiar ICD-10 setting. For example, thousands of coders in the U.S. are already using the 3M Codefinder Software to manage the complex rules and terminology of coding by incorporating mandated rules, principles, and guidelines for coding. The 3M 360 Encompass System seamlessly integrates automatic analysis of clinical language with the familiar 3M Codefinder Software.

Second, NLP-enabled CAC can propose useful codes for human coders to verify or edit. Gray areas in coding can be expected to have a greater impact on all CAC systems in ICD-10's early days, because ICD-10 is larger and more complex and has not yet benefited from the years of documentation improvement and guideline development that have gone into ICD-9 coding. However, systems built on the three success elements for NLP, like the 3M 360 Encompass System, can integrate feedback from the human coding process to quickly improve as human coders themselves improve.

Third, NLP can facilitate the critical process of documentation improvement as providers and coders adapt to the new system. Automatic analysis of clinical text can identify gaps in communication that must be filled to code accurately. It can also provide early warning indicators to providers when documentation is insufficient, and auto-suggest physician queries to enable effective and immediate communication between coders, CDI specialists, physicians and care coordination teams.

3M's NLP platform

In 2012, 3M acquired CodeRyte following several years of successful partnering on CAC applications. Unlike other coding industry consolidations, this union brings together all three NLP success elements in a single organization. The two organizations and their technologies have been effectively integrated, so today 3M's NLP platform combines superlative knowledge sources, language data, and the machine-learning expertise to use them effectively.

On the knowledge side, 3M has 30 years of healthcare industry experience in coding expertise and clinical knowledge. Since the late 1970s, 3M Codefinder Software has brought detailed knowledge about clinical terminology, coding rules and guidelines to the fingertips of thousands of coders at thousands of healthcare organizations worldwide. In addition, the 3M Healthcare Data Dictionary (HDD) provides the core technology to enable semantic interoperability for the joint U.S. Department of Defense and Department of Veterans Affairs integrated electronic health record.

On the data side, 3M's coding technology achieved a leading industry position in the outpatient world with 3M CodeRyte CodeAssist, using an approach based on modern data-driven NLP methods that brings rule-based methods together with large volumes of data and machine learning. The technologists who created that software were instrumental in the technological revolution in NLP illustrated earlier in Figure 1. An entire industry is about to go through the same learning curve, but rapid progress in using NLP to improve ICD-10 coding will be possible using software that combines effective machine learning technology, the clinical language data to learn from, and deep subject matter expertise.
Figure 3 provides a high-level view of 3M's integrated NLP platform. Like IBM Watson (shown in Figure 2) and other state-of-the-art NLP frameworks, 3M's NLP platform adopts an approach fundamentally based on the idea of using multiple approaches internally to analyze the input and propose solutions, leading to a combination of evidence that produces confidently accurate responses. 9

As in Figure 2, Figure 3 distinguishes three main stages of processing. 10 At the far left, a pipeline of standard language processing steps takes highly variable forms of input and produces a standard format for further processing, such as identifying distinct regions within the documentation, locating where sentences begin and end, identifying the word-level units within the text, and so on. In the center, a multitude of knowledge-based and data-driven components add information to that shared analysis, including detecting relevant features within the text, labeling clinical concepts, and identifying conceptual relationships that are likely to be relevant. Crucially, the idea of confidence is pervasive throughout this multi-path architecture: In state-of-the-art NLP, components don't just produce outputs; they also provide an assessment of how confident one should be that the output is correct.

[Figure 3: 3M's NLP platform. A shared analysis stage (region detection, sentence breaking, tokenization) feeds multiple ways of finding the answer (text features, clinical concepts, expert rules, learned models), followed by evidence combination (synthesis, then merging and ranking with learned models), producing answers and confidence.]
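As a toy illustration of evidence combination with confidence (purely hypothetical, not 3M's actual algorithm), the sketch below has several independent components each propose codes with confidence scores, then merges the proposals by accumulating confidence per code and ranking the candidates. Component names, codes, and scores are invented.

```python
# Toy sketch of multi-path evidence combination: several components each
# propose (code, confidence) pairs; the platform merges and ranks them.
# Component names, codes, and scores are invented for illustration.
from collections import defaultdict

# Proposals from three hypothetical analysis paths for one document.
proposals = {
    "rule_based_matcher":  [("E11.9", 0.70), ("I10", 0.40)],
    "statistical_model":   [("E11.9", 0.85), ("J45.901", 0.30)],
    "concept_normalizer":  [("I10", 0.65)],
}

def combine_evidence(proposals):
    """Accumulate confidence across components and rank candidate codes."""
    scores = defaultdict(float)
    for component, candidates in proposals.items():
        for code, confidence in candidates:
            scores[code] += confidence
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for code, score in combine_evidence(proposals):
    print(f"{code}: combined evidence {score:.2f}")
# E11.9 ranks highest because two independent paths agree on it.
```

A real platform would combine evidence with learned models rather than a simple sum, but the structure, many paths feeding one ranked, confidence-bearing answer, is the same idea shown in Figure 3.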
Conclusion

The landscape for coding, and more generally for health care, has entered a period of extraordinarily rapid change. Where things will finally land is a matter of speculation, but this much seems evident: The organizations that succeed will be the ones that tackle these dramatic developments head-on and well in advance, recognizing that they need to adapt some of their most fundamental processes, including how clinicians document, how clinical documentation is transformed into codes and other usable information, and how effectively that information is then used.

NLP can make things easier, if it is done right. But when it comes to state-of-the-art approaches, many NLP technologists in the clinical setting have struggled to keep up. As Figure 1 shows, trends in the clinical NLP community are a solid ten years behind the field in general.

In contrast, 3M has become an NLP leader in health care by concentrating on the three fundamental success factors:

1. Knowledge: Trustworthy, reliable knowledge about the subject matter and deep experience in coding and the clinical domain, unrivaled in the industry and supported by 3M's unique Nosology team along with tools and resources like the 3M Codefinder Software and the 3M Healthcare Data Dictionary (HDD).

2. Language data: Not just claims data, or raw EHR records, or even transcriptions, but the secure, scalable combination of an organization's metadata, discrete data, and text richly annotated with semantically interoperable representations.

3. Machine learning: A body of expertise in machine learning and NLP that is at the cutting edge of the field, not ten years behind.

Much of this paper has discussed NLP technology. But for organizations struggling with decisions about the path forward, especially when faced with variable and confusing messaging about unfamiliar kinds of technology, the real bottom line in the ICD-10 transition must also include experience and trust. 3M is the home of coding knowledge pioneers who defined the DRG standards; the teams who designed and delivered the 3M Codefinder Software, which coders use daily for accurate, complete and compliant coding of medical records; and a nosology support staff that even clients who use competing vendors come to for guidance. 11 3M has kept clients up-to-date with regulatory changes and updates for over 30 years, never once missing an update. Its computer-assisted coding product, the 3M 360 Encompass System, leads the market with over 1,000 clients in the United States and growing, and 3M has a long-standing history of working with clients to make sure that they get what they need.

Putting it all together, 3M offers a clear path forward to ICD-10 with state-of-the-art NLP and a solid track record of experience and trust.

To learn more

If you would like additional information on 3M's approach to NLP and CAC technology, as well as our new 3M 360 Encompass System, please contact your 3M representative today. You may also call us toll-free at 800-367-2447 or explore our website at www.3mhis.com.
Footnotes

1 This discussion draws on Daniel Jurafsky and James H. Martin, Speech and Language Processing, Second Edition (Pearson Prentice Hall, 2008), the definitive textbook on NLP. In their introduction, Jurafsky and Martin describe the early "language understanding" origins of NLP; the post-1993 period when "the field comes together" after a lengthy conflict over the roles of knowledge and data, arriving at the recognition that both are indispensable; and finally "the rise of machine learning." A book edited by Judith Klavans and Philip Resnik, The Balancing Act: Combining Symbolic and Statistical Approaches to Language (MIT Press, 1996), was a key milestone in that process. For a concise overview of NLP's history and milestones, refer to 3M's 2013 white paper, Auto-Coding and Natural Language Processing, by Richard Wolniewicz, PhD, Director, NLP Advanced Technology for 3M Health Information Systems (available at http://ow.ly/tp0zr).

2 The graph lines and measurements in Figure 1 are from Figure 9.1 in John Pestian, Louise Deleger et al., "Natural Language Processing: The Basics," Chapter 9 in John Hutton (ed.), Pediatric Biomedical Informatics: Computer Applications in Pediatric Research, Springer, 2012: http://ow.ly/uendg. See also Ken Church, "Speech and language processing: Where have we been and where are we going?" Eurospeech, Geneva, Switzerland, 2003, and Claire Cardie and Ray Mooney, Machine Learning and Natural Language, Kluwer Academic Publishers, 1999.

3 Figure adapted from Wikimedia Commons: http://commons.wikimedia.org/wiki/file:deepqa.svg.

4 For a thorough discussion of evaluation in NLP, see Philip Resnik and Jimmy Lin, "Evaluation of NLP Systems," The Handbook of Computational Linguistics and Natural Language Processing, Wiley, 2010.

5 Philip Resnik, Michael Niv, et al., "Using Intrinsic and Extrinsic Metrics to Evaluate Accuracy and Facilitation in Computer Assisted Coding," Perspectives in Health Information Management, CAC Proceedings, Fall 2006: http://ow.ly/tuw04.

6 Erin McCann, "The ICD-10 pilot that was just plain scary," Government Health IT, October 14, 2013, http://ow.ly/tuw4q.

7 On October 21, 2013, HIMSS and WEDI released a report of the results from their ICD-10 National Pilot Program; their findings showed that coders who averaged two medical records per hour in ICD-10 had averaged four records per hour under ICD-9, resulting in a 50 percent decline in productivity. For the complete report, see http://ow.ly/tuw7d.

8 Resnik, Niv, et al. (see footnote 5): http://ow.ly/tuw04.

9 IBM's technical papers refer to this as hypothesis and evidence combination; in the technical literature one can also find the same ideas described using terms like system combination and ensemble methods.

10 The view in the figure is somewhat simplified; for example, evidence combination can actually take place at multiple points within the platform.

11 See the "Our Story Stands" video by 3M Nosologist Michelle Taylor, who describes how clients using another vendor's coding software call the 3M Nosology help line for coding assistance: http://ow.ly/tpbhw
3M Health Information Systems

Best known for our market-leading coding system and ICD-10 expertise, 3M Health Information Systems delivers innovative software and consulting services designed to raise the bar for clinical documentation improvement, computer-assisted coding, mobile physician applications, case mix and quality outcomes reporting, and document management. Our robust healthcare data dictionary and terminology services also support the expansion and accuracy of your electronic health record (EHR) system. With 30 years of healthcare industry experience and the know-how of more than 100 credentialed 3M coding experts, 3M Health Information Systems is the go-to choice for 5,000+ hospitals worldwide that want to improve quality and financial performance.

For more information on how our solutions can assist your organization, contact your 3M sales representative, call us toll-free at 800-367-2447, or visit us online at www.3mhis.com.

Health Information Systems, 575 West Murray Boulevard, Salt Lake City, UT 84123 U.S.A., 800 367 2447, www.3mhis.com

3M, Codefinder and 360 Encompass are trademarks, and CodeRyte and CodeAssist are service marks, of 3M Company. The International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10) is copyrighted by the World Health Organization, Geneva, Switzerland, 1992-2008. IBM and Watson are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. JEOPARDY! is a registered trademark of Jeopardy Productions, Inc. Google and Google Translate are trademarks of Google, Inc.

© 3M 2014. All rights reserved. Published 08/14. 70-2009-9301-5