Approaches to Text Mining for Clinical Medical Records




Xiaohua Zhou and Hyoil Han
College of Information Science and Technology
Drexel University
Philadelphia, PA 19104
xiaohua.zhou@drexel.edu
hhan@cis.drexel.edu

Isaac Chankai, Ann Prestrud and Ari Brooks
College of Medicine
Drexel University
Philadelphia, PA 19102
{ic36, ann.prestrud}@drexelmed.edu
ari.brooks@drexelmed.edu

Abstract

Clinical medical records contain a wealth of information, largely in free-text form. Extracting structured information from free-text records is an important research endeavor. In this paper, we describe a MEDical Information Extraction (MedIE) system that extracts and mines a variety of information on patients with breast complaints from free-text clinical records. MedIE is part of a medical text mining project being conducted at Drexel University. Three approaches are proposed to solve different IE tasks, and very good performance (precision and recall) was achieved. A graph-based approach that uses the parsing result of the link grammar parser was devised for relation extraction; high accuracy was achieved. A simple but efficient ontology-based approach was adopted to extract medical terms of interest. Finally, an NLP-based feature extraction method coupled with an ID3-based decision tree was used to perform text classification.

Categories and Subject Descriptors
I.2.7 [Natural Language Processing]: Text Analysis

General Terms
Experimentation

Keywords
Clinical Records, Information Extraction, Relation Extraction, Ontology

1. Introduction

Patient medical records contain a wealth of information that can prove invaluable for the conduct of clinical research. Clinical records are largely maintained in free-text form. Thus, a reliable and efficient method that uses information extraction techniques to pull structured information out of free text for later data mining may greatly benefit research endeavors.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'06, April 23-27, 2006, Dijon, France. Copyright 2006 ACM 1-59593-108-2/06/0004 $5.00.

We report here on the development of a MEDical Information Extraction (MedIE) system that extracts and mines a variety of information on patients with breast complaints from free-text clinical records. MedIE is part of a large research project on breast cancer being conducted at Drexel University College of Medicine. Before researchers can conduct any analysis or mining, they must first code textual patient records and save this structured information into the database. A total of 125 separate initial consultation notes were mined by our system. Results were then compared to a medical student's independent manual processing of the same consultation notes.

The technique used in this paper is an extension of [19]. In [19], three approaches to numeric attribute filling, medical term identification, and text classification were described, and an evaluation of 13 extraction tasks on a small collection of 50 patient records was reported. We extended the graph-based approach for numeric attribute filling in [19] into a generic relation extraction technique capable of performing the majority of information extraction tasks in the project; we also solved several technical problems in using the Link Grammar parser [14] to build graphs, which made the system more robust. We improved the term extraction approach of [19] through extensive use of the ontology and adoption of an NLP-based term prediction technique.
In short, the extended MedIE system is more generic, robust, and effective in terms of knowledge extraction; the evaluation of 23 extraction tasks on a larger collection of 125 patient records is more representative and convincing.

The remainder of this paper is organized as follows: in Section 2, we review related work; in Section 3, we present our approaches to extraction of the three types of information; and Section 4 evaluates system performance. A short conclusion finishes the article.

2. Related Work

One line of research related to ours is Named Entity Recognition (NER) in free text. Though most NER methods cannot handle medical terms directly, their concepts, such as pattern matching, can be borrowed. The General Architecture for Text Engineering (GATE) [1] uses patterns written as regular expressions to implement all its components, such as tokenization and named entity recognition. It also provides the Java Annotation Patterns Engine (JAPE) [2], by which users can extend the NER component to identify entities of interest. However, because medical terms are full of synonyms and morphologic variants, an ontology is necessary to achieve high extraction accuracy. A research project, "Acquiring Medical and Biological Information from Text" (AMBIT) [5], led by a research group at the University of Sheffield, aims to build such a large database of medical terminology for information extraction from clinical records. In this particular project, we adopt the Unified Medical Language System (UMLS) (see footnote 1) as the domain ontology to identify medical terms.

Pattern-based template filling is a common technique for information extraction. AutoSlog [12], PALKA [6], CRYSTAL [16] and WHISK [17] can automatically induce linguistic patterns from training examples. However, supervised pattern learning is expensive because training examples must be prepared. Instead, we use an unsupervised approach, which makes use of the parsing results of the link grammar parser [14], to extract a good portion of the knowledge in the project.

Other research applies the link grammar parser to information extraction. Madhyastha et al. reported the use of the link grammar parser for event extraction [9], and Ding et al. applied link grammar to the extraction of biomedical interactions [4]. Both works achieved the goal of information extraction by analyzing the meaning of important links in the sentence. Differing from their approaches, we first transform the parsed sentence into a graph formalism and then perform concept association based on the generated graph.

Another line of related research is text classification. Decision trees are a frequently used technique for text classification. Lehnert et al. [8] present an ID3-based decision tree for classification, which uses learned keywords as features. Kuhn and De Mori propose the application of semantic classification trees (SCT) to natural language understanding [7]. SCT is an extension of decision trees that use words as features. Unlike [8] and [7], Riloff and Lehnert [13] describe an approach to text classification that represents a compromise between word-based techniques and in-depth natural language processing. It takes polysemy, synonyms, phrases, and local context into account during feature extraction.

3. Tasks and Methods

The extraction tasks in our project can be roughly classified into three groups. The first is extraction of medical terms (e.g., past medical history and past surgical history). The second is text classification.
For example, a patient can be classified as a former smoker, a current smoker, or a non-smoker. The last group concerns relations between two terms (e.g., symptoms and human body parts). We propose three approaches to address these extraction tasks, respectively.

3.1 Ontology-based Term Extraction

Medical term extraction is a frequent task in patient record processing. For example, clinicians are always interested in the medical history and surgical history of patients. Medical term extraction essentially belongs to the task of named entity recognition. However, medical terms are full of synonyms and morphologic variants, so an ontology is necessary for high-accuracy extraction of medical terms from clinical records. Medical terms are often multi-word phrases; therefore, it is not efficient to look up every combination of consecutive words in the sentence in the ontology. Instead, we follow the method in [19], using part-of-speech patterns to generate term candidates and then checking whether the candidate terms exist in the ontology.

Footnote 1: http://www.nlm.nih.gov/research/umls/

In UMLS, each term may belong to more than one concept, and at least one semantic type is assigned to each concept. According to the possible semantic types, we can determine whether an extracted medical term is of interest. In this particular project, we also need to group medical terms. For example, clinicians have particular interest in certain predefined diseases such as hypertension. We then need to identify synonyms of these predefined diseases (e.g., high blood pressure is a synonym of hypertension). This task is completed simply by looking up synonyms in the ontology.

The ontology-based approach for medical term extraction achieves high precision and acceptable recall. But it still fails to retrieve a portion of the terms of interest, simply because of the incompleteness of the ontology or typos made by doctors. We alleviate the problem by predicting some terms based on the idea that elements in coordinating structures should have similar semantic types.
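
The candidate-generation and lookup steps of this subsection can be sketched in Python as follows. This is a minimal illustration, not the MedIE implementation: the dictionary `TOY_ONTOLOGY` is a hypothetical stand-in for UMLS, its entries and semantic-type assignments are made up for the example, the part-of-speech tags are supplied by hand, and the pattern (a run of adjectives or nouns ending in a noun) is only an assumed example of a part-of-speech pattern.

```python
from typing import Dict, Iterator, List, Set, Tuple

Tagged = List[Tuple[str, str]]  # (word, part-of-speech tag)

# Hypothetical stand-in for UMLS: surface form -> (concept, semantic type).
# Entries and type assignments are illustrative only.
TOY_ONTOLOGY: Dict[str, Tuple[str, str]] = {
    "hypertension": ("Hypertension", "Disease or Syndrome"),
    "high blood pressure": ("Hypertension", "Disease or Syndrome"),
    "splenectomy": ("Splenectomy", "Therapeutic or Preventive Procedure"),
    "history": ("History", "Intellectual Product"),
}
# Semantic types treated as "of interest" (an assumption of this sketch).
TYPES_OF_INTEREST: Set[str] = {
    "Disease or Syndrome",
    "Therapeutic or Preventive Procedure",
}


def candidates(tagged: Tagged) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans whose tags match the assumed pattern (JJ|NN)* NN."""
    n = len(tagged)
    for start in range(n):
        for end in range(start + 1, n + 1):
            tags = [tag for _, tag in tagged[start:end]]
            if all(t in ("JJ", "NN") for t in tags) and tags[-1] == "NN":
                yield start, end


def extract_terms(tagged: Tagged) -> List[Tuple[str, str]]:
    """Return (phrase, concept) pairs, preferring longer matches, in text order."""
    spans = sorted(candidates(tagged), key=lambda s: s[0] - s[1])  # longest first
    used: Set[int] = set()
    found = []
    for start, end in spans:
        if used & set(range(start, end)):
            continue  # overlaps a longer term that was already accepted
        phrase = " ".join(word.lower() for word, _ in tagged[start:end])
        entry = TOY_ONTOLOGY.get(phrase)
        if entry and entry[1] in TYPES_OF_INTEREST:
            found.append((start, phrase, entry[0]))
            used.update(range(start, end))
    return [(phrase, concept) for _, phrase, concept in sorted(found)]


TAGGED: Tagged = [
    ("She", "PRP"), ("has", "VBZ"), ("high", "JJ"), ("blood", "NN"),
    ("pressure", "NN"), ("and", "CC"), ("a", "DT"), ("history", "NN"),
    ("of", "IN"), ("splenectomy", "NN"),
]
print(extract_terms(TAGGED))
```

In this toy run, "high blood pressure" is mapped to the same concept as "hypertension" through the synonym entry, while "history" is found in the dictionary but dropped because its semantic type is not in the set of interest.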
In the following example, we recognize splenectomy and gallbladder cholecystectomy as surgeries; gunshot then has a good chance of being a surgery name even though it is not explicitly defined in UMLS.

    Gunshot wound in 1989, splenectomy in 1992, and gallbladder cholecystectomy in 1990

The ontology-based approach for medical term extraction can achieve higher performance than general named entity recognition approaches. However, it requires intensive searching, even though we adopt part-of-speech patterns to minimize the number of term candidates.

3.2 Graph-based Relation Extraction

Relation extraction refers to the task of finding pairs of terms in text (usually in a sentence or a couple of consecutive sentences) that are semantically or syntactically related to each other. Most information extraction (IE) tasks in this project are relation extraction or can be transformed into relation extraction problems.

One type of information to extract is numeric attributes, such as the blood pressure, pulse, age, and weight of a patient. Because this project targets patients with breast cancer, clinicians are also concerned about menarche age, number of pregnancies, and number of live births. Extraction of these numbers is equivalent to associating medical concepts (e.g., blood pressure) with numeric values. Another type of information is the association of diseases or symptoms with persons (e.g., father, mother, aunt) or parts of the body (e.g., right breast or left breast). For example, tracing the family history of cancer concerns the association of a disease with a person; breast examination concerns the association of symptoms with a part of the human body.

Some extraction tasks can be transformed into relation extraction problems. For example, clinicians are interested in the menopausal status of the patient. This is a typical classification problem. But on browsing patient records, we found that if the date of the last menstrual period is known, then menopausal status can be determined. Thus the problem is transformed into the association of a medical term (last menstrual period) with a date.
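
As a small illustration of this transformation, once a date has been associated with the term "last menstrual period", a status can be derived by a simple rule. The rule itself is not given in this paper; the 12-month cutoff, the function name, and the status labels below are assumptions made only for the sketch.

```python
from datetime import date
from typing import Optional


def menopausal_status(last_period: Optional[date], note_date: date,
                      cutoff_months: int = 12) -> str:
    """Derive a status label from an extracted last-menstrual-period date.

    The cutoff and the labels are assumptions of this sketch, not values
    taken from the paper.
    """
    if last_period is None:
        return "unknown"  # no date could be associated with the term
    months = ((note_date.year - last_period.year) * 12
              + (note_date.month - last_period.month))
    if note_date.day < last_period.day:
        months -= 1  # the final month has not fully elapsed
    return "post-menopausal" if months >= cutoff_months else "pre-menopausal"


print(menopausal_status(date(2004, 1, 15), date(2005, 6, 1)))
print(menopausal_status(date(2005, 4, 20), date(2005, 6, 1)))
```

The point of the sketch is only that the classification step becomes trivial once the date association has been extracted.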
The procedure of relation extraction comprises two major steps. The first step is the extraction of various terms, including diseases, symptoms, human body parts, persons, numbers, dates, etc., as described in Section 3.1. Co-reference resolution is required for relation extraction because doctors may use pronouns or abbreviations to refer to previous terms while writing patient records. We use a shallow method [3] to find the real entities that pronouns or abbreviations refer to.

The second step is to find pairs of terms that are semantically or syntactically related to each other. Judging semantic relations is simple because the semantic type of each term is already given during extraction and the possible relations between any two semantic types are predefined in the ontology. However, determining syntactic relations is difficult because in the majority of cases a sentence contains more than two terms. In the first sentence of the example below, there are four medical metrics and four numbers. In the second sentence, there are two human body parts and two symptoms.

    Blood pressure is 144/90, pulse of 84, temperature of 98.3, and weight of 154 pound.
    There is no other mass palpable in the right breast while the left breast is free of any lesions

We propose a graph-based approach for the extraction of syntactic relations based on the linkage information produced by the Link Grammar parser [14]. Link Grammar is an original sentence parser, producing not only a constituent tree, as most parsers yield, but also a linkage diagram that consists of links between pairs of words. In the example shown in Figure 1, there are nine links. The link between "is" and "144/90" represents a verb-object relation (denoted by the notation O).

Some researchers have explored the use of link grammar in information extraction. Madhyastha et al. reported the use of the link grammar parser for event extraction [9], and Ding et al. applied link grammar to the extraction of biomedical interactions [4]. Both works reached the goal of information extraction by analyzing the meaning of important links in the sentence. Differing from their approaches, we first transform the parsed sentence into a graph formalism and then perform concept association based on the generated graph.

Figure 1. An Example of a Linkage Diagram (see footnote 2)

Suppose a node represents a word and an edge represents a link. Then the linkage diagram of a valid sentence can be viewed as a connected graph. Furthermore, each edge can be weighted according to the type of link, depending on the application (e.g.,
we penalize the links connecting two clauses). Thus, the distance between any term pair can be calculated from the graph. Intuitively, the distance between a term pair is a good measure of their syntactic relationship. The task of syntactic relation extraction is then equivalent to searching a (weighted) graph, for a given node, for the nearest node (or the nodes within a distance threshold [4]) of a certain semantic type.

For some extraction tasks, we need to pay attention to the occurrence of negating words or phrases. In the following example, left breast is linked with the symptom lesions, but because of the negative phrase "be free of", they actually have no association at all.

    There is no other mass palpable in the right breast while the left breast is free of any lesions

Footnote 2: The diagram was produced by the online Link Grammar parser at http://www.link.cs.cmu.edu/link/.

This approach provides a generic framework for relation extraction, but has several technical limitations in practice. First, the link grammar parser cannot parse text fragments without verbs (e.g., "blood pressure: 144/90"). For this reason, we also implemented a pattern-based approach; if the parser fails to parse the sentence, the pattern-based approach takes over. Second, the link grammar parser was originally developed for conversational English and makes many errors when parsing text in the biomedical domain, most likely because it lacks syntactic information about biomedical vocabulary. Third, the link grammar parser can process single-word terms but cannot deal with multi-word terms. Regarding the last two problems, Szolovits [18] presented a heuristic method to augment the lexicon of the link grammar parser with the UMLS Specialist lexicon. We plan to adopt this technique in future versions. In the current project, we used a simple method to alleviate this problem. After medical term identification, we replaced these terms in sentences with placeholders and then submitted the modified sentence to the parser. After our method is applied, the sentence in the last example is converted to the placeholder form shown below.
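
The nearest-node search on a weighted link graph described in this subsection can be sketched in Python as follows. It is a sketch under stated assumptions, not the MedIE code: the links in `LINKS` are written by hand for a fragment like "Blood pressure is 144/90, pulse of 84" and are not real parser output, the label `CLAUSE` is a made-up marker for a link that joins two clauses, and the weights (1.0 by default, 5.0 for the clause penalty) are arbitrary.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Link = Tuple[str, str, str]  # (word_a, word_b, link_label)
Graph = Dict[str, List[Tuple[str, float]]]

# Hypothetical, hand-written links; NOT real Link Grammar output.
LINKS: List[Link] = [
    ("blood_pressure", "is", "S"),
    ("is", "144/90", "O"),
    ("144/90", "pulse", "CLAUSE"),  # made-up label for a cross-clause link
    ("pulse", "of", "M"),
    ("of", "84", "J"),
]
SEM_TYPES = {"blood_pressure": "metric", "pulse": "metric",
             "144/90": "number", "84": "number"}
PENALTIES = {"CLAUSE": 5.0}  # assumed penalty for links between clauses


def build_graph(links: List[Link], penalties: Dict[str, float],
                default_weight: float = 1.0) -> Graph:
    """Turn word-to-word links into an undirected weighted graph."""
    graph: Graph = {}
    for a, b, label in links:
        weight = penalties.get(label, default_weight)
        graph.setdefault(a, []).append((b, weight))
        graph.setdefault(b, []).append((a, weight))
    return graph


def distances(graph: Graph, source: str) -> Dict[str, float]:
    """Dijkstra's algorithm: shortest distance from source to every node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


def nearest(graph: Graph, source: str, sem_types: Dict[str, str],
            wanted: str, max_dist: Optional[float] = None) -> Optional[str]:
    """Return the closest node of the wanted semantic type, if any."""
    best: Optional[Tuple[str, float]] = None
    for node, d in distances(graph, source).items():
        if node == source or sem_types.get(node) != wanted:
            continue
        if max_dist is not None and d > max_dist:
            continue
        if best is None or d < best[1]:
            best = (node, d)
    return best[0] if best else None


GRAPH = build_graph(LINKS, PENALTIES)
print(nearest(GRAPH, "blood_pressure", SEM_TYPES, "number"))
print(nearest(GRAPH, "pulse", SEM_TYPES, "number"))
```

With these toy weights, each metric is paired with the number in its own clause because the penalized cross-clause link makes the other number farther away. Negation handling, which this subsection also calls for, is left out of the sketch.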
The link grammar parser cannot recognize the meaning of the placeholders, but it is able to figure out the part of speech they represent and successfully parses the converted sentence:

    There is no other symptom1 in part1 while part2 is free of any symptom2

In this sub-section, we introduced a graph-based approach for relation extraction. In comparison with the pattern-based approach, it is more flexible and robust. This approach comprises the following five components: term extraction, co-reference resolution, medical term replacement, link grammar parsing, and graph building.

3.3 Decision Tree Based Text Classification

Text classification is another type of information extraction task in our project. For instance, patients fall into three classes with regard to smoking behavior: non-smoker, former smoker, or current smoker. The following texts are examples describing different smoking behaviors.

    She quit smoking five years ago (former)
    She is currently a smoker (current)
    None (never)
    She has never smoked (never)

For high accuracy, most of the literature recommends an analytic NLP approach. Usually, pattern-based semantic analysis is performed to classify cases. However, the analytic approach demands large amounts of domain knowledge and is consequently difficult to generalize. Conversely, a machine learning technique does not depend on domain knowledge, and the approach can easily be generalized. In this project, we employed an ID3-based decision tree [11] for categorical fields. According to information theory, the information gain (mutual information) between the predictor and the dependent variable is a good measure of the predictor's discriminating ability. Thus, the ID3 decision tree is expected to use fewer features than other decision tree algorithms. (For details, see [19].)

4. Experiment

The MedIE system is implemented in Java. The Link Grammar parser is used to produce both linkage information for relation extraction and constituent trees for feature extraction during text
classification. WordNet is mainly used to get the lemma (uninflected form) of each word in a sentence. GATE (General Architecture for Text Engineering) is used for tokenization and part-of-speech tagging. UMLS serves as the domain ontology for medical term identification. For the sake of efficiency, we downloaded the UMLS data and installed it in a local DB2 database. Data is accessed through JDBC. We implemented the ID3-based decision tree algorithm for text classification.

We evaluated our MedIE system on a collection of 125 patient records, each of which describes a subject with breast complaints. The format of the patient records is the same as in [19]. One record comprises multiple sections, each of which begins with a fixed string. Therefore, it is easy to split the whole record into sections. Each section is written in natural language.

Table 1. Result of extraction using concept association

Attributes extracted   Precision (Recall)   Attributes extracted         Precision (Recall)
Blood pressure         100.0%               Menopause                    94.0%
Weight                 100.0%               Palpable nodule              86.0%
Pulse                  100.0%               Breast Mass                  86.0%
Age of menarche        100.0%               Number of pregnancies        100.0%
Age of first child     100.0%               Number of live births        100.0%
Auxiliary Nodes        100.0%               Family History of cancer     92.0%
Nipple D/C             100.0%               The reason to visit doctor   92.0%

We use precision and recall to evaluate performance. Extraction of the fourteen attributes listed in Table 1, based on the relation extraction method, achieves extremely high precision (recall). In [19], only the first seven tasks listed in Table 1 were performed. By examining all 125 records manually, we found that the extremely high precision is, in part, attributable to the consistent dictation style (all records were provided by the same clinician, author Ari D. Brooks, MD). If the size of the data set increases or the writing style varies, performance may degrade.

Table 2. Result of text classification

Classification Tasks   Precision (Recall)
Smoking behavior       92.2%
Alcohol use            89.4%
Appearance             93.7%

The ID3-based decision tree is evaluated on three text classification tasks: smoking behavior, alcohol use, and appearance. Five-fold cross validation was applied; that is, the whole data set is split into five subsets. In each round, four subsets are treated as training data and the remaining one as testing data. We ran the five-fold cross validation ten times, randomly shuffling the data set each time. Average precision (recall) is then calculated (see Table 2). It is worth noting that [19] evaluated only the smoking behavior classification task.

Table 3. Result of medical term extraction

Attribute Name                    Precision [ours]   Recall [ours]   Precision [19]   Recall [19]
Predefined Past Medical History   96.7%              96.7%           96.7%            96.7%
Other Past Medical History        88.1%              89.4%           76.1%            86.4%
Predefined Past Surgical History  92.3%              94.2%           77.8%            35%
Other Past Surgical History       87.5%              92.3%           62.0%            75%

Clinicians in our project are also interested in the medical and surgical history of patients. Because these attributes may contain multiple values (medical terms), the recall and precision for the i-th patient are defined, respectively, as

\[ R_i = \frac{ETrue_i}{TInst_i} \quad \text{and} \quad P_i = \frac{ETrue_i}{ETotal_i}. \]

Recall and precision for the whole collection are defined, respectively, as

\[ R = \frac{\sum_i ETrue_i}{\sum_i TInst_i} \quad \text{and} \quad P = \frac{\sum_i ETrue_i}{\sum_i ETotal_i}, \]

where:
ETrue_i: number of extracted true terms in the i-th subject.
ETotal_i: number of extracted terms in the i-th subject.
TInst_i: number of total true terms in the i-th subject.

We revised the approach for medical term identification and achieved significant gains in precision and recall. For details of the performance improvement, refer to Table 3, which lists the performance of medical term extraction in [19] and in our experiment. The performance improvement is mainly attributed to the extensive use of the domain ontology. After medical term identification, we need to further classify terms into predefined terms or other terms. In [19], the authors failed to recognize synonyms of predefined terms.
We corrected this problem, which increases the recall of predefined terms and the precision of other terms. The authors of [19] simply treated all terms existing in the domain ontology as medical terms of interest. In fact, a small portion of the extracted terms, such as "history" and "human", were not wanted. We filtered out these terms in the new version by additionally examining the semantic type of the term, which increases term precision. Due to the incompleteness of the domain ontology (UMLS) or typos, the ontology-based approach failed to extract some terms not defined in the ontology. In the current version, we use the coordinating structure to predict some medical terms, which increases the recall of medical term extraction.

5. Conclusion

In this paper, we implemented a MEDical Information Extraction (MedIE) system that extracts and mines a variety of information from clinical medical records. Good performance was achieved. The information extraction tasks in this project can be roughly classified into three classes. The first is extraction of medical terms. The second, also the major one, is relation extraction. The last is text classification. We propose three approaches to address those three different IE tasks.

A graph-based approach that uses the parsing result of the link grammar parser was devised for relation extraction, and high accuracy was achieved. A simple but efficient ontology-based approach was adopted to extract medical terms of interest. Finally, an NLP-based feature extraction method coupled with an ID3-based decision tree was used to perform text classification. This preliminary approach to categorical fields has, so far, proven to be quite effective.

However, the data set used is small. When more diversified writing styles are introduced into patient records, the performance of IE may degrade. We plan to use a larger data set to evaluate and tune our future work. In addition, the link grammar parser makes many errors when parsing text in the biomedical domain. We will alleviate this problem by augmenting the lexicon with the UMLS Specialist lexicon in future versions. Moreover, we will try to make the system more flexible and robust.

The approaches proposed in this paper may offer a new means by which clinician-researchers can extract large volumes of data from patient medical records. To date, this resource is untapped, as there is no effective means to extract data. We hope to continue this work, refining our approach, to expand its utility.

6. References

[1] Cunningham, H., "GATE, a General Architecture for Text Engineering", Computers and the Humanities, 2002, Vol. 36, pp. 223-254.
[2] Cunningham, H., Maynard, D., and Tablan, V., "JAPE: a Java Annotation Patterns Engine (Second Edition)", Technical report CS--00--10, University of Sheffield, Department of Computer Science, 2000.
[3] Dimitrov, M., Bontcheva, K., Cunningham, H., and Maynard, D., "A Light-weight Approach to Coreference Resolution for Named Entities in Text", Proceedings of the Fourth Discourse Anaphora and Anaphor Resolution Colloquium (DAARC), Lisbon, 2002.
[4] Ding, J., Berleant, D., Xu, J., and Fulmer, A.W., "Extracting Biochemical Interactions from MEDLINE Using a Link Grammar Parser", In the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03), 2003.
[5] Gaizauskas, R., Hepple, M., Davis, N., Guo, Y., Harkema, H., Roberts, A., and Roberts, I., "AMBIT: Acquiring Medical and Biological Information from Text", ISMB/ECCB, Poster, 2004.
[6] Kim, J.T. and Moldovan, D.I., "Acquisition of Linguistic Patterns for Knowledge-Based Information Extraction", IEEE Transactions on Knowledge and Data Engineering, Volume 7, Issue 5, 1995, pp. 713-724.
[7] Kuhn, R. and De Mori, R., "Application of Semantic Classification Trees to Natural Language Understanding", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1995, Vol. 17, No. 5.
[8] Lehnert, W., Soderland, S., Aronow, D., Feng, F., and Shmueli, A., "Inductive Text Classification for Medical Applications", Journal for Experimental and Theoretical Artificial Intelligence, 1994, 7(1), pp. 49-80.
[9] Madhyastha, H.V., Balakrishnan, N., and Ramakrishnan, K.R., "Event Information Extraction Using Link Grammar", 13th International Workshop on Research Issues in Data Engineering: Multi-lingual Information Management (RIDE'03), 2003.
[10] Miller, G. et al., "WordNet: an On-line Lexical Database", International Journal of Lexicography, 1990, pp. 235-245.
[11] Quinlan, J.R., "Induction of Decision Trees", Machine Learning, 1986, No. 1, pp. 81-106.
[12] Riloff, E., "Automatically Constructing a Dictionary for Information Extraction Tasks", Proceedings of the Eleventh National Conference on Artificial Intelligence, AAAI Press/The MIT Press, 1993, pp. 811-816.
[13] Riloff, E. and Lehnert, W., "Information Extraction as a Basis for High-Precision Text Classification", ACM Transactions on Information Systems, 1994, Vol. 12, No. 3, pp. 296-333.
[14] Sleator, D. and Temperley, D., "Parsing English with a Link Grammar", Third International Workshop on Parsing Technologies, 1993.
[15] Soderland, S., Aronow, D., Fisher, D., Aseltine, J., and Lehnert, W., "Machine Learning of Text Analysis Rules for Clinical Records", CIIR Technical Report, University of Massachusetts Amherst, 1995.
[16] Soderland, S., Fisher, D., Aseltine, J., and Lehnert, W., "CRYSTAL: Inducing a Conceptual Dictionary", Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1995, pp. 1314-1319.
[17] Soderland, S., "Learning Information Extraction Rules for Semi-structured and Free Text", Machine Learning, Vol. 34, 1998, pp. 233-272.
[18] Szolovits, P., "Adding a Medical Lexicon to an English Parser", Proc. AMIA 2003 Annual Symposium, 2003.
[19] Zhou, X., Han, H., Chankai, I., Prestrud, A.A., and Brooks, A.D., "Converting Semi-structured Clinical Medical Records into Information and Knowledge", In the International Workshop on Biomedical Data Engineering in conjunction with the 21st International Conference on Data Engineering (ICDE), Tokyo, Japan, April 3-4, 2005.
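
As a supplementary illustration of the information-gain criterion that Section 3.3 names as the basis of the ID3 decision tree [11], the following Python sketch computes the gain of two keyword features. It is not the MedIE implementation: the four training rows, the binary features "quit" and "never", and the function names are made up for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows: List[Dict[str, int]], labels: List[str],
                     feature: str) -> float:
    """Entropy of the labels minus the entropy left after splitting on feature."""
    by_value: Dict[int, List[str]] = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder


# Made-up training data: does the text contain the keyword (1) or not (0)?
ROWS = [
    {"quit": 1, "never": 0},
    {"quit": 0, "never": 1},
    {"quit": 0, "never": 0},
    {"quit": 1, "never": 0},
]
LABELS = ["former", "never", "current", "former"]

GAINS = {feature: information_gain(ROWS, LABELS, feature)
         for feature in ("quit", "never")}
BEST = max(GAINS, key=GAINS.get)  # ID3 splits on the highest-gain feature
print(GAINS, BEST)
```

On this toy data the feature "quit" has a gain of 1.0 bit and "never" about 0.81 bits, so an ID3-style tree would split on "quit" first and then recurse on each branch.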