Data Mining in Health Informatics




Abstract

In this paper we present an overview of the applications of data mining in administrative, clinical, research, and educational aspects of Health Informatics. The current or potential applications of various data mining techniques in Health Informatics are illustrated through a series of case studies from published literature. The paper also provides a detailed discussion of how clinical data warehousing in combination with data mining can improve various aspects of Health Informatics. Finally, we point out a number of unique challenges of data mining in Health Informatics.

1. Introduction

Health Informatics is a rapidly growing field that is concerned with applying Computer Science and Information Technology to medical and health data. With the aging population on the rise in developed countries and the increasing cost of healthcare, governments and large health organizations are becoming very interested in the potential of Health Informatics to save time, money, and human lives. Human errors cause the deaths of between 44,000 and 98,000 American patients annually [30 as cited in 9]. Furthermore, in the United States alone, drug-related morbidity and mortality cost more than $136 billion per year [26 as cited in 9]. Electronic patient records; computer-based alerting, reminder, and predictive systems; and adaptive training tools for healthcare professionals can help reduce both the human and financial costs of healthcare.

As a relatively new field, Health Informatics does not yet have a universally accepted definition. The American Medical Informatics Association defines Health Informatics as "all aspects of understanding and promoting the effective organization, analysis, management, and use of information in health care" [1]. Similarly, Canada's Health Informatics Association defines Health Informatics as the "Intersection of clinical, IM/IT and management practices to achieve better health" [2].
These are both broad definitions that cover a wide range of technologies, from developing electronic patient record data warehouses to installing wireless networks in hospitals. A more specific definition is provided by the National Library of Medicine, which defines Health Informatics as "the field of information science concerned with the analysis and dissemination of medical data through the application of computers to various aspects of health care and medicine" [3]. Note that here, Health Informatics is limited to "analysis and dissemination of medical data", and would not cover pure IT practices such as installing a network in a hospital. Zaiane provides an even more specific definition, which divides Health Informatics into four subfields:

"Health Informatics is the computerization of health information to support and optimize (1) administration of health services; (2) clinical care; (3) medical research; and (4) training. It is the application of computing and communication technologies to optimize health information processing by collection, storage, effective retrieval (in due time and place), analysis and decision support for administrators, clinicians, researchers, and educators of medicine."

In this survey, we present an overview of the applications of data mining in various subfields of Health Informatics. For each subfield of Health Informatics, we provide a number of published papers as case studies of the current and potential applications of data mining. We also present how clinical data warehousing in combination with data mining can help administrative, clinical, research and educational aspects of Health Informatics. Finally, we discuss a number of unique challenges of data mining in Health Informatics.

2. An Overview of Health Informatics and Applications of Data Mining

As mentioned in the introduction, Health Informatics can be divided into four main subfields:

1. Administration of health services
2. Clinical care
3. Medical research
4. Training

The following subsections present an overview of each subfield of Health Informatics, and how data mining is, or can be, applied to extend and improve each subfield.

2.1 Clinical Care

Physicians and nurse practitioners make diagnostic decisions and treatment recommendations based on history, medical imaging, lab results and other text or multimedia records of patients. Health Informatics allows doctors to have faster access to more relevant information, and thus make better decisions. For instance, a centralized patient record database will allow a physician in a local clinic to access all the relevant medical records of the patient, anywhere in the country. Furthermore, applying data mining techniques on the centralized database will give doctors analytical and predictive tools that go beyond what is apparent from the surface of the data. For instance, a new practitioner can query for all the decisions that previous practitioners have made on a similar case. Similarly, a predictive model can advise doctors whether a certain case would be better treated as an outpatient or an inpatient.

2.1.1 Clinical Decision Support Systems

The applications of Health Informatics in clinical care decision-making are known as (computer-based) Clinical Decision Support Systems (CDSS); see Note 1. Shortliffe defines a decision support system as "any computer program that is designed to help health professionals to make clinical decisions" [44 as cited in 34]. Applications of Clinical Decision Support Systems can be categorized into:

- Information retrieval: CDSS can offer search capabilities for medical queries. For instance, the "antibiotic assistant" of the HELP system (introduced in section 2.1.1.1) allows doctors to query the hospital's experience with previous infections over the last five years [9].
- Alerting systems: A useful application of CDSS is to monitor inputs and check them for predetermined triggers [21]. These alert systems can be simple, like predefined drug-drug or drug-allergy conflicts, or complex, such as alerts based on analysis of various lab results and comparison with expected result protocols.
- Reminders: Unlike alerts, which are triggered by a specific change in input data, reminders are triggered by the passage of time and are used for periodic tasks such as immunization or diabetes tests [21].
- Suggestion systems: Unlike alerts, which indicate predetermined conditions in input data, suggestion systems are interactive processes that suggest action-oriented messages based on their medical knowledge base.
- Prediction models: CDSS prediction models can be categorized into diagnosis (defined as "aiding in the determination of the existence or nature of a disease" [4]) and prognosis (defined as "the forecast of the probable outcome of an illness" [4]) [21]. An example of a diagnosis predictor is a model that detects nosocomial (hospital-acquired) infections based on information from the microbiology laboratory, nurse charting, and other sources. APACHE, introduced in section 2.1.1.2, is an example of a prognosis predictor, which predicts ICU mortality based on a number of physiological variables.

Note 1: As we will discuss in section 2.1.3, this is not a universal view of CDSS. Some experts believe that CDSS include other aspects of Health Informatics, like administrative decision support, research, and training.

The following subsections describe a number of Clinical Decision Support Systems currently in use in clinics and hospitals.

2.1.1.1 Case Study: HELP system

The Health Evaluation through Logic Processing (HELP) system is an example of a Clinical Decision Support System that includes alerting systems, suggestion systems, and prediction models [9]. An example of an alerting system used in HELP is a model that monitors patient laboratory results and uses simple rule-based triggers to detect anomalies. A suggestion system included in HELP is a set of computerized protocols for managing the care of Adult Respiratory Distress Syndrome (ARDS) patients. Both alerting and suggestion systems in HELP are rule-based models, developed by physicians, nurses, and specialists in medical informatics.

HELP includes two types of prediction models. The first type is rule-based, such as the model used in the Adverse Drug Events (ADE) detection system, which predicts the possibility of a drug reaction based on patient history and a set of predefined protocols. Aside from rule-based models, some prediction models in HELP use logistic regression, e.g. the model that predicts nosocomial infections based on a number of risk factors. The HELP system has been developed and tested for more than 25 years, and it is currently in use in many of the 20 hospitals operated by Intermountain Healthcare (IHC) [31 as cited in 9].

2.1.1.2 Case Study: APACHE series of models

The Acute Physiology and Chronic Health Evaluation (APACHE) series of models was developed to predict an individual patient's risk of hospital death in the ICU, based on a number of physiological variables. The original APACHE model was developed in 1981 as an expert-based scoring system. The later versions are based on logistic regression models. The models were trained on 17,000 cases in more than 40 hospitals [21].
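To make the idea of a logistic regression prognosis model concrete, the following is a minimal sketch of how such a model turns patient variables into a probability. The variable names, coefficients, and intercept are invented for illustration; they are not the APACHE variables or weights, which are not given in this paper.

```python
import math

# Hypothetical coefficients for illustration only; NOT the APACHE weights.
INTERCEPT = -6.0
COEFFICIENTS = {
    "age_years": 0.04,
    "heart_rate_bpm": 0.01,
    "creatinine_mg_dl": 0.50,
}


def predicted_risk(patient):
    """Return a probability in (0, 1) from a logistic regression model."""
    # Linear predictor: intercept plus the weighted sum of the inputs.
    z = INTERCEPT + sum(w * patient[name] for name, w in COEFFICIENTS.items())
    # The logistic (sigmoid) function maps the linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-z))


# Two made-up patients; the second has higher values on every variable.
younger = {"age_years": 40, "heart_rate_bpm": 80, "creatinine_mg_dl": 1.0}
older = {"age_years": 80, "heart_rate_bpm": 120, "creatinine_mg_dl": 3.0}
print(f"{predicted_risk(younger):.3f}  {predicted_risk(older):.3f}")
```

In a real system the coefficients would be estimated from historical cases rather than written by hand; the scoring step itself would look much the same.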
2.1.1.3 Case Study: Pneumonia Severity of Illness Index

The Pneumonia Severity of Illness Index is another logistic regression model that predicts the risk of death within 30 days for adult patients with pneumonia. The model was developed by the Pneumonia Patient Outcome Research Team (PORT) in 1997 and was validated over 50,000 patients in 275 hospitals in the US and Canada. The developers claim that by using this model, up to 30% of pneumonia patients can be treated safely as outpatients, resulting in annual savings of 1.2 billion dollars [21].

2.1.2 Data Mining in Clinical Decision Support Systems

Aside from some use of logistic regression in predictive models, there are currently limited or no applications of data mining in Clinical Decision Support Systems. Most of the current systems are rule-based and are developed manually by experts. Data mining can extend and improve all categories of CDSS, as illustrated by the following examples.

In information retrieval systems, data mining can be applied to query multimedia records. Image and video mining, along with applications of natural language processing techniques, will allow physicians to effectively search through patients' medical imagery, laboratory results, and other medical records.

Data mining can be used to automatically discover and update thresholds used in alerting and reminder systems. Maintaining and updating the underlying knowledge of rules is one of the important challenges that limit the adoption of CDSS by health organizations [21]. In the most basic form, data mining algorithms can be applied to monitor the thresholds used in alerting and reminder systems, and either automatically update them or alert human experts that the current thresholds should be reconsidered.
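The most basic form of threshold monitoring described above could look something like the sketch below. It assumes, purely for illustration, that an alert threshold should sit near a chosen percentile of historical values and that a relative difference above a tolerance warrants expert review; the function names, the percentile, the tolerance, and the data are all hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile of values using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def review_threshold(current, history, pct=90.0, tolerance=0.10):
    """Compare a hand-set alert threshold with one derived from past data.

    Assumes current is non-zero and history is non-empty.
    """
    proposed = nearest_rank_percentile(history, pct)
    # Relative difference between the data-driven and the hand-set threshold.
    drift = abs(proposed - current) / abs(current)
    return {"proposed": proposed, "drift": drift, "needs_review": drift > tolerance}


# Toy lab values from past patients (made-up numbers).
history = [4.1, 4.3, 4.0, 4.6, 4.4, 4.2, 4.8, 4.5, 5.2, 4.7]
report = review_threshold(current=5.5, history=history)
print(report)
```

Here the function only reports a proposed value and a review flag; whether to update the threshold automatically or to hand the decision to a human expert is left to the caller, mirroring the two options mentioned in the text.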

In suggestion systems, instead of depending on experts to manually develop the underlying protocols, data mining approaches can be applied to automatically generate these protocols based on historic data. A team of human experts can then review the generated protocols before deploying them in the final suggestion systems.

Prediction models are the most evident and straightforward targets for applying data mining algorithms. There has been extensive research on the applications of supervised and unsupervised learning algorithms on medical data in the machine learning and data mining communities. However, as we will discuss in section 4, most of these algorithms are not well understood or accepted in the medical community. The more advanced prediction models developed in the data mining community have the potential to increase the accuracy of the current models used in CDSS.

2.1.3 A broader view of Clinical Decision Support Systems

Some experts present a broader view of CDSS that is not limited to the clinical care subfield of Health Informatics. Ledbetter and Morgan state that CDSS capabilities are useful in all phases of the clinical process: (a) assessment, (b) planning, (c) intervention, and (d) evaluation [32]. Table 1, taken from their article, describes the potential applications of CDSS for the cases of a patient-specific focus as well as a population-specific (or aggregation-based) focus.

Table 1: Potential applications of CDSS (taken from [32])

Curtright et al. have developed a list of core requirements for CDSS tools; the major requirements discussed in their article are listed below [14]. CDSS tools need to:

- Have enhanced networking and distributive features
- Be used at all decision-making levels in an organization
- Be used in both real-time and retrospective modes
- Enable predictive capabilities using classical statistics
- Utilize white box (openly disclosed but protected) methodologies for prediction and detailed support to provide the kind of accuracy rates required for health care decision making

Note that these requirements are not limited to the clinical care subfield of Health Informatics. In addition to the above, we feel that in the broader view, CDSS tools also need to:

- Apply AI techniques for disease prediction
- Use other techniques such as spatial data mining and spatio-temporal data mining to assist in health care decision-making
- Be able to provide feedback to the decision makers regarding the efficiency of the system
- Have graphics and graphing capabilities so as to be able to present the data in several formats such as tables, bar charts, pie charts, graphs, etc.
- Have tighter security and access controls in order to prevent personal data from falling into malicious hands

In the longer term, it is expected that clinical data can be used to assess episodes of risk [14], wherein CDS systems will help in early identification of risk factors such as diet, exercise, travel, and air and water standards. It is also expected that in the future CDS systems will help in performance benchmarking, continuing medical education of clinicians through the use of their own data, identification of best practices, creation and utilization of standard terminology, etc. [14].

2.2 Administration of Health Services

Administrators of health care organizations make hundreds of critical decisions on a daily basis. As in any administrative position, the quality of these decisions directly depends on the quality of the information that the decisions are based on. For example, the administrators in a hospital need to decide on the amount of supplies and the number of staff and free beds required for an upcoming month.
To make this decision, the administrators require an accurate prediction of the number of patients to expect during the coming month, and an approximation of how long each patient will remain in the hospital. As another example, the federal and provincial health administrators need to decide whether a disease outbreak is in progress, and if so, what preventive measures will be most effective against it. To make these decisions, the administration requires a system that can accurately predict a disease outbreak, and also model the cost and benefit of different preventive measures.

The following case study illustrates the applications of data mining techniques on epidemic detection. More examples of administrative decision support will be discussed in Section 3, where electronic patient records and various data warehousing techniques are introduced.

2.2.1 Case Study: detecting disease outbreaks

In "Decision Theoretic Analysis of Improving Epidemic Detection", Izadi and Buckeridge introduce a method to improve existing threshold-based epidemic detection methods by using POMDPs (Partially Observable Markov Decision Processes) [24]. The main idea is that the potential costs and effects of intervention can be quantified and used to optimize the alarm function. Furthermore, the intermediate investigation steps, such as asking for more systematic studies or more investigation by a human expert, can also be quantified in terms of cost and effect. Based on these costs and effects, the system can learn to recommend the optimal action. While the paper concludes that POMDPs can improve the accuracy of current outbreak detection methods, the current level of false alarms (3 false alarms in every 100 days) seems to be unacceptable for practical use.

Similarly, Cooper et al. investigate the use of Bayesian networks for outbreak detection, focusing on modeling non-contagious outbreak diseases, such as airborne anthrax [13]. The Bayesian network is divided into three groups: global (G), interface (I) and people (P). Furthermore, in order to make the algorithm scalable, people with the same attributes are grouped in the same class. The network is evaluated on data generated by a simulator. Given weather conditions taken from historical meteorological conditions for a region, and parameters for the location and amount of airborne anthrax, a Gaussian plume model derives the concentration of anthrax spores estimated to exist in each zip code. The authors compare a non-spatial model with a spatial model and conclude that with spatial data they can get better results in terms of false positive rate.

2.3 Medical Research

Most current successful applications of data mining in Health Informatics are in the subfield of medical research. The reason is that most current health-related data are stored in small datasets scattered across various clinics, hospitals, and research centers. Most applications of data mining in clinical and administrative decision support systems, however, require homogeneous and centralized data warehouses (see section 3). Data mining methods, on the other hand, can still be successfully applied to small and scattered datasets, and help researchers extract insightful patterns, cause and effect relationships, and predictive scoring systems from currently available data. The following subsections introduce a number of examples of data mining techniques applied to small datasets for medical research.

2.3.1 Case Study: drug exposure side effects from mining pregnancy data

Chen et al. investigate the possible effects of multiple drug exposures at different stages of pregnancy on preterm birth, using SmartRule, a data mining technique for generating associative rules [11]. In this work, two subsets of the Danish National Birth Cohort (DNBC) dataset are used. The first subset contains 4454 records, including 1000 women who were depressed and/or exposed to various active drugs. This set is used for finding the side effects of antidepressant drugs. The second subset contains 6231 records, including 414 preterm cases. This set is used for finding side effects of multiple types of drugs.
The authors develop a hierarchical tree model for organizing the generated rules, in order to ease the recognition of interesting rules by human experts. Using this system, the authors claim that they are able to find novel and interesting rules.

2.3.2 Case Study: Automatic in vivo microscopy video mining for leukocytes

Zhang et al. introduce a framework for video mining of in vivo microscopy images [47]. The goal is to track leukocytes in order to predict inflammatory response. In vivo microscopy allows researchers to capture images of the cellular and molecular processes in a living organism. However, automatic mining of the imagery is challenging due to severe noise, background movement of the living organism, and change of contrast in different frames. Zhang et al. first apply a frame alignment technique, using RANSAC, to correct the camera-subject movement, and then apply a number of probabilistic methods to detect moving leukocytes. Adherent leukocytes are detected, after the moving ones are removed, by finding thresholds for contrast values. The experimental results show 1% false positives and 50% recall on detecting moving leukocytes, and 2% false positives and 95% recall on detecting adherent leukocytes.

2.3.3 Case Study: Knowledge-based analysis of microarray gene expression data using Support Vector Machines

Brown et al. apply Support Vector Machines (SVMs) to gene expression data to classify genes based on functionality [10]. This is based on previous experiments suggesting that genes with similar functionality have similar patterns in microarray data. The authors claim that SVMs are well suited to the problem of microarray gene classification, because they perform well in extremely high-dimensional feature spaces.

A training set is generated by combining the DNA microarray data of a set of genes that have a certain functionality (i.e. positive labels) and a set of genes known not to be members of this functional class (i.e. negative labels). Once an SVM is trained on this training set, it can determine whether or not a new gene belongs to that functional class.
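The training and prediction steps can be sketched in a few lines. The example below is a simplified stand-in, not the authors' setup: it trains a linear soft-margin SVM by subgradient descent on the regularised hinge loss, with no kernel, on made-up two-dimensional "expression profiles". The function names, hyperparameters, and data are assumptions chosen only to keep the sketch self-contained.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Misclassified or inside the margin: move towards the sample.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only apply the L2 shrinkage.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 (member of the functional class) or -1 (not a member)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up two-condition expression profiles; real profiles have far more dimensions.
profiles = [(2.0, 2.1), (1.8, 2.5), (2.2, 1.9), (-1.9, -2.0), (-2.2, -1.7), (-1.6, -2.4)]
in_class = [1, 1, 1, -1, -1, -1]  # +1 = positive label, -1 = negative label
w, b = train_linear_svm(profiles, in_class)
print(predict(w, b, (2.0, 2.0)), predict(w, b, (-2.0, -2.0)))
```

A kernelised SVM replaces the dot product with a kernel function and is normally trained with a dedicated solver; the linear version is used here only because it fits in a short, dependency-free example.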
The authors apply SVMs, with a number of different kernels, to gene expression data from the budding yeast Saccharomyces cerevisiae, with 5 predefined functional classes. The prediction performance of the SVMs is compared to predictions by a number of other classification methods, including decision trees, Fisher's linear discriminant, and Parzen windows. The authors claim that SVMs outperform all the other classification methods.

2.3.4 Case Study: Association rules and decision trees for disease prediction

Ordonez applies two kinds of classifiers, an associative classifier and decision trees, to predict the percentage of vessel narrowing (LAD, RCA, LCX and LM) compared to a healthy artery [35]. The dataset contains 655 patient records with 25 medical attributes. Three main issues about mining association rules in medical datasets are mentioned in this work. A significant fraction of association rules are irrelevant, and most relevant rules with high quality metrics appear only at low support. On the other hand, the number of discovered rules becomes extremely large at low support. Hence, association rules are used with constraints. Each item corresponds to the presence or absence of one categorical value or one numeric interval. The first constraint is a limit on the maximum itemset size. Second, the items are grouped, and each association contains at most one item from each group. The third constraint is that each item can appear only in the antecedent or only in the consequent. The result from the associative classifier is compared with two decision tree algorithms: C4.5 and CART. The author demonstrates that association rules can do better than decision trees for predicting diseased arteries.

2.4 Education and Training

The fourth subfield of Health Informatics is related to educating new healthcare professionals, and retraining and keeping current staff up-to-date with recent advances in technology. The education and training subfield of Health Informatics can be viewed as an instance of the rapidly growing field of e-learning. An increasing interest in applying data mining techniques to e-learning has emerged in recent years, and some of the early applications show promising results [38]. Data mining techniques can benefit all three groups of people who are in contact with a learning system: students, educators, and administrators [38].
Data mining techniques can monitor the success of students at various learning tasks, and recommend relevant resources, materials, and learning paths to achieve a more successful learning experience. For educators, data mining techniques can provide objective feedback on the structure and the content of a course, discover the learning patterns of the students, and cluster learners into smaller groups that have similar educational habits and needs. Administrators benefit from data mining techniques by learning about the behavior of their users, so they can optimize the servers, distribute network traffic, and learn about the overall effectiveness of the offered educational programs.

The following two case studies present an overview of a relatively new Health Informatics e-learning tool called HOMER, and a data mining technique to find relevant articles for a particular gene.

2.4.1 Case Study: Homer, an online learning community

Homer is a centralized e-learning system and an Internet community, developed for the medical students of the University of Alberta [5]. Homer provides online access to a variety of learning materials, including medical dictionaries, demonstration videos, and faculty presentations. One important feature of Homer is the lifetime membership, which grants medical students continued access to learning materials after graduation [18].

2.4.2 Case Study: Finding relevant references to genes and proteins in Medline using a Bayesian approach

Leonard et al. apply a Bayesian approach to find cross-references between the symbols of genes and proteins and Medline articles [33]. The authors extract gene and protein symbols from article titles and abstracts, using a dictionary of gene and protein symbols and a dictionary of English words along with a set of rules. A different set of rules is used to find new gene and protein symbols that are not included in the gene and protein symbol dictionary. After assigning articles to identified genes and proteins, a Bayesian estimated probability (EP) based on word frequency is used to find the relevancy of each assigned article to each gene or protein. Hence, only the relevant articles are chosen for each gene or protein, and the result is a set of relevant references for each gene or protein.
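The exact form of the estimated probability is not given here, so the sketch below uses a generic stand-in: a naive Bayes estimate with add-one smoothing, computed from word frequencies in example texts labeled relevant or irrelevant. The function names, the smoothing choice, and the toy word lists are assumptions made for illustration and should not be read as the authors' method.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Count word frequencies per class from (tokens, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    n_docs = {True: 0, False: 0}
    for tokens, is_relevant in labeled_docs:
        counts[is_relevant].update(tokens)
        n_docs[is_relevant] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_docs, vocab


def relevance_probability(tokens, model):
    """Estimate P(relevant | tokens); assumes both classes have training docs."""
    counts, n_docs, vocab = model
    total_docs = n_docs[True] + n_docs[False]
    log_post = {}
    for label in (True, False):
        total_words = sum(counts[label].values())
        score = math.log(n_docs[label] / total_docs)  # class prior
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((counts[label][tok] + 1) / (total_words + len(vocab)))
        log_post[label] = score
    # Normalise in log space to avoid underflow on long texts.
    top = max(log_post.values())
    e_rel = math.exp(log_post[True] - top)
    e_irr = math.exp(log_post[False] - top)
    return e_rel / (e_rel + e_irr)


# Made-up training texts, already tokenised.
training = [
    (["kinase", "mutation", "expression"], True),
    (["gene", "expression", "kinase"], True),
    (["clinic", "schedule", "staff"], False),
    (["staff", "budget", "clinic"], False),
]
model = train(training)
print(relevance_probability(["kinase", "expression"], model))
```

Articles whose estimated probability falls below a chosen cut-off would then be dropped from the reference set for that gene or protein; the cut-off itself is a tuning decision not covered by this sketch.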

3. Data Warehusing in Health Infrmatics This sectin demnstrates hw clinical data warehusing in cmbinatin with data mining can help each f the fur subfields in Health Infrmatics discussed in sectin 2. In particular, we will fcus n hw clinical data warehuses supprt the fllwing: Imprvement in Clinical Care Better administratin f health services Aiding medical research, and enhancing its quality Cheaper and mre effective training In the present times Electrnic Patient Recrd (EPR) has becme a buzzwrd in the field f E-health. Ledbetter [32] defines EPR as an electrnically maintained (cmputerized) patient recrd system with pint-f-care tls that supprt clinical care. Accrding t Ledbetter, in an ideal situatin an EPR shuld supprt all episdes f care t create a cmplete lngitudinal patient recrd. Kim et al. define EPR as an electrnic cllectin f diagnstic reprts f an individual patient s entire medical histry. These reprts can have varied frmats such as text, multimedia, etc. where multimedia itself wuld encmpass Digital Image and Cmmunicatin (DICOM), 3D Image set, Vice recrding, Health level 7 (HL7) types [27]. EPR based recrds hld several advantages ver the paper-based recrds that are currently being phased ut. Sme f these features are: (a) Simultaneus access by multiple users (b) n-line infrmatin prcessing fr clinical and administrative decisin (c) access t data frm multiple surces (d) csteffectiveness/apart frm the initial investment (e) data representatin and richness f the cntent f data (f) reliability and ease f distributin f data and (g) security. It is wrth emphasizing that all f the abve wuld nt have been pssible withut the great strides made in the field f Infrmatin Technlgy, Cmputing, Data mining, Infrmatin Security, and als the advent and prliferatin f the Wrld Wide Web (WWW). The use and strage data in the electrnic frm has created pprtunities fr applying data mining techniques t extract the hidden knwledge in the data. Frawley et al. 
define data mining as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [17]. Unfortunately, the electronic data resides on different, heterogeneous systems, with the result that integration becomes a challenging task. Data warehouses allow us to perform this complex task of integrating the heterogeneous data; at the same time they act as central repositories for the data. The data warehouses used in Health Informatics are somewhat different in nature (more complex), hence they are called clinical data warehouses (as discussed later).

3.1 Data Warehouses vs. Real-time Databases

Real-time decision-making processes rely on Online Transaction Processing (OLTP) systems that are patient specific, while Online Analytical Processing (OLAP) systems carry out aggregate analysis based on data for a group of people. The OLTP and the OLAP systems together contribute to the success of a CDS system. OLTP systems need to handle the large volume of transactions required by patient-care systems, such as patient registration, clinical documentation, order entry, results review and clinical alerting. Ledbetter [32] argues that for this reason the performance of an OLTP system may suffer if the system is used for aggregate analysis. OLAP systems, on the other hand, do not have any data of their own, and rely on OLTP systems for their data feed. These systems are always off-line, as they lag behind the OLTP systems, sometimes by a day or so and sometimes by months. Systems that employ OLAP techniques are called Data Warehouses (DW). In this section we look at Data Warehouses from the standpoint of their theoretical foundation and their functionality; the issues related to design and construction are dealt with later. Inmon [23] defines a Data Warehouse as a repository for keeping data in a subject-oriented, integrated, time-variant and non-volatile manner that facilitates decision support. A Data Warehouse transforms the

OLTP data in a way that makes mining the information from that data much easier. The standard data structure is a multi-dimensional cube (figure 1) that allows the user to rapidly change the dimensions by which a report is filtered, sorted or grouped [32]. Of course, a researcher could drill down to the patient record level if they so desire; however, it would be much easier and faster to get the same information by querying an OLTP system. From the point of view of design, a data warehouse consists of fact tables and dimension tables, an arrangement sometimes called a star schema. The fact tables could include measurements, orders, and observations along with events such as admissions, discharges and transfers, while the dimension tables could include patients, diagnoses, medications, supplies, clinical units etc.

Figure 1: Pictorial view of a typical cube in data warehousing (taken from [6])

Curtright et al. believe that CDS systems need to go beyond simple flags and alarms at the point of care [14]. They state that the OLTP systems should aim for what is really required by the clinicians at the point of care, such as a likely clinical outcome trajectory for a patient and the optimal course of treatment in the short and the long term. Clinicians should also be able to make mid-course corrections, and to model the impacts of different clinical decisions as the patient's clinical course changes. The information that clinicians really require for making informed decisions consists of factors such as the patient's risk characteristics, diagnostic and therapeutic interventions, and clinical outcomes. On the other hand, OLAP techniques can be used to analyze those business rules and clinical actions that affect the profitability, resource planning and productivity of the healthcare institution. Curtright et al. are of the view that OLAP techniques should provide an easily interpretable single-value indexed score for any assessment, and that this score should incorporate measures such as cost, health status etc. [14].
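As a minimal sketch of such a single-value score, assume a simple weighted combination of normalized cost and health status. The weights, the 0-1 health scale and the pathway figures below are hypothetical and are not taken from Curtright et al.

```python
from dataclasses import dataclass


@dataclass
class Pathway:
    name: str
    cost: float           # expected cost, arbitrary currency units (assumed positive)
    health_status: float  # expected outcome on a 0-1 scale, higher is better


def index_score(pathway, max_cost, w_cost=0.4, w_health=0.6):
    """Combine cost and health status into one 0-1 score (hypothetical weights)."""
    cost_term = 1.0 - pathway.cost / max_cost  # cheaper pathways score higher
    return w_cost * cost_term + w_health * pathway.health_status


def best_pathway(pathways):
    """Return the pathway with the highest index score."""
    max_cost = max(p.cost for p in pathways)
    return max(pathways, key=lambda p: index_score(p, max_cost))


# Hypothetical candidate pathways for one assessment.
PATHWAYS = [
    Pathway("A", cost=10000, health_status=0.9),
    Pathway("B", cost=4000, health_status=0.7),
    Pathway("C", cost=2000, health_status=0.4),
]
```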
The chosen health care pathway should be the one that maximizes the above-mentioned score. It is well established that aggregate analysis helps in improving the quality of healthcare delivered to patients; however, what has not been observed is its impact on the clinicians. Centralizing the databases provides the clinicians with a broad insight into the actual clinical practices. Furthermore, aggregate analysis can be very useful for performance benchmarking of clinicians, since it utilizes the same clinical data that is used for making healthcare decisions. Such systems can also help clinicians determine the statistical impact of individual clinical decisions, which, in the long run, can help them come up with their own clinical pathways and compare those with the standard practices, thus

fuelling and aiding medical research with greater ease. The ease of comparison of different pathways helps in improving the quality of medical research.

3.2 Chalk and Cheese: Data Warehousing and Clinical Data Warehousing

A Clinical Data Warehouse (CDW), as defined by Gray [20], is a place where healthcare providers gain access to clinical data gathered in the patient care process. An appropriate question at this point would be: how is a CDW different from other Data Warehouses? The answer lies in the fact that everything, from the planning process for building a data warehouse to its design components, the software employed in the ETL (Extraction, Transformation, Loading) phase, and the extent of the essential background knowledge of the architect, is vastly different between the two kinds of Data Warehouses. A CDW is immensely complex to build and maintain when compared to other Data Warehouses. Herein, we discuss some of the differences and the complexities of a CDW.

In order to speed up time-consuming queries, DW architects employ a very common practice: building materialized views based on aggregate values. However, according to Gray [20], a lot of the data going into the CDW is not additive at all, e.g. vital signs of patients such as blood pressure, heart rate measurements etc. These form a large volume of the patient data. As a result, no aggregation can be done even if only one such non-additive column is present in a table, thus precluding the possibility of speeding up queries by using materialized views.

The process that moves the data from the source to the CDW should have a minimum impact on the operations of the transactional system (OLTP). Also, the time taken to transform and store the data in the CDW should be as short as possible. For ordinary data warehouses the transformation step is carried out in the evening when there is little or no activity; however, with the CDW being operational 24/7, the assigned time budget is never more than one hour.
Since the transformation is highly CPU intensive and can affect the query performance of the CDW itself, the real transformation budget is only 10 minutes. A CDW needs to integrate data from multiple sources, and hence synchronization and consistency of this data is very important. If a database loading a part of the data goes down, such that loading the rest of the data would force the CDW into an inconsistent state, then the decision whether to proceed or wait becomes difficult in light of the fact that the time window allowed for loading and transformation is very small.

While understanding how an organization operates, with its business rules and business logic, is never an easy task, it is all the more difficult in the case of hospitals. It is compounded by the fact that individual hospitals can follow different practices, and tend to have different terminologies for the same task. As a result, not only do off-the-shelf (generic) CDWs not work, but the architects who design these DWs also have to be conversant with the terminologies and the practices. Finding such architects is not an easy job.

3.3 Evidence-based medicine: Data Warehouses for Healthcare Decision-making

Evidence-based medicine is the use of the latest, most respected, and well-judged evidence for making informed choices about the diagnosis and treatment of a patient. Stolba and Tjoa define the task of evidence-based medicine as one that "complement[s] the existing clinical decision-making process with the most accurate and the most efficient research evidence" [45]. Another definition, given by Sackett et al., describes evidence-based medicine as "the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients" [39]. In their article, Stolba and Tjoa [45] present an example of a diabetic patient suffering from progressive liver disease, where the clinician needs to find the most effective therapy for the patient's condition that does not conflict with their diabetic treatment.
This is achieved by the clinician searching through the evidence-based guidelines to find the most recent and most effective treatment for the liver disease, and then using another query to make sure that the treatment for the liver disease does not conflict with the one for diabetes.
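The two-step lookup in this example can be sketched as a filter over a small guideline table. This is only an illustration under stated assumptions: the `Guideline` fields, the therapy names, years and effectiveness values are hypothetical, and a real guideline base would be far richer than a list of records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guideline:
    treatment: str
    condition: str
    year: int                 # year the supporting evidence was published
    effectiveness: float      # 0-1, higher is better (hypothetical scale)
    conflicts_with: frozenset = frozenset()  # conditions whose treatment it conflicts with


# Hypothetical guideline entries for illustration only.
GUIDELINES = [
    Guideline("therapy_a", "liver disease", 2007, 0.80, frozenset({"diabetes"})),
    Guideline("therapy_b", "liver disease", 2006, 0.70),
    Guideline("therapy_c", "liver disease", 2003, 0.75),
]


def recommend(condition, existing_conditions, guidelines):
    """Most recent, then most effective, treatment that conflicts with nothing the patient has."""
    existing = set(existing_conditions)
    candidates = [
        g for g in guidelines
        if g.condition == condition and not (g.conflicts_with & existing)
    ]
    candidates.sort(key=lambda g: (g.year, g.effectiveness), reverse=True)
    return candidates[0] if candidates else None
```

For the diabetic patient, `recommend("liver disease", {"diabetes"}, GUIDELINES)` skips the newest therapy because of its conflict and returns the next most recent one.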

Figure 2: Generation of evidence-based guidelines (taken from [45])

The combination of data mining and evidence-based medicine is propelling Health Informatics into exciting and novel avenues. The key to practising evidence-based medicine lies in creating rules that are based on evidence from aggregate analysis, and are thoroughly studied and well researched by experts. Also, these rules need to be delivered to the clinicians at the point of care in the form of alerts. The relevant data sources for evidence-based medicine as outlined by Stolba and Tjoa [45] are:

- Evidence-based guidelines (in the form of rules)
- Clinical data (pharmaceutical data, patient data, medical treatments, length of stay)
- Administrative data (staff skills, nursing care hours, staff leave, overtime)
- Financial data (drug costs, treatment costs, staff salaries, accounting)
- Organisational data (facilities, equipment, room occupancy)

The following is after [45]. Figure 2 is a schematic diagram of the steps followed by the rule generation process, wherein data is gathered from varied sources in the first step, and is cleaned and transformed in the second step. Next, medical associations are found by applying data mining in a data warehouse environment, where the data warehouse itself contains patient data, pharmaceutical and clinical data. The associations thus found are sieved through by knowledge workers to isolate those that represent hidden knowledge in the data. All such associations are collected, and rules based on these are created in the form of laboratory tests, therapies, recommended drugs, or medical treatments. The rules thus created need to be examined by a committee of experts who can either approve or reject a rule; the approved rules are added to the database along with the evidence-based guidelines. The clinical impact of each new automated rule needs to be evaluated after a certain period of time, typically six months after it is introduced.
Also, all the rules need to be evaluated at least on an annual basis, and those that are found to be no longer valid in the light of current research need to be discarded. The schematic diagram (figure 3) shows the role of data warehousing in facilitating evidence-based medicine at the point of care. The following is after [45]. Initially the clinician defines a clinical question based on the patient's disease. After that he/she uses standard reports, queries etc. to query the data warehouse. The evidence-based guidelines output results in the form of medical treatments, drugs etc.,

which then need to be matched with the patient's health history, the existing clinical equipment, and the availability of staff. Based on all of this analysis the best-fitting rule is chosen and presented to the clinician.

Figure 3: Determination of line of treatment (taken from [45])

In recent times the cost of providing quality healthcare has increased tremendously. At the same time, due to greater longevity and the afflictions thereof, the total number of patients who need healthcare (as a fraction of the total population) has also increased. As a result, healthcare institutions and medical insurance companies are being forced to adopt cost-cutting measures. Healthcare institutions are also being forced to enlarge their facilities to cater to the rising volumes. Apart from financial considerations, staffing problems, best utilization of available resources, and maintaining the quality of healthcare under such conditions are the major issues. Most case studies, some of them even tracking projects from their inception to the end, have found that in the longer run data mining in combination with data warehousing has resulted in faster healing rates, reduction in treatment costs, better quality of care, and very few clinical mistakes (if any) on the part of the nursing staff [14], [20], [36], [47], [46]. These case studies justify the initial investment made in constructing these systems. Stolba and Tjoa mention that the use of a data warehouse results in avoiding the duplication of examinations, time saved through automation of routine tasks, and the simplification of accounting and administrative procedures [45].

3.4 Early Intervention: Data Warehousing in Disease Management

It has been observed that data mining, when applied in combination with a clinical data warehouse, has been very successful at extracting the early predictors of some diseases such as asthma, diabetes, cardiovascular diseases etc.
Once the early predictors of a disease are extracted and the patients at risk are identified, they can be invited to join awareness campaigns, signed up for disease management programs etc. Disease management programs have been shown to improve patient care, lower the disease occurrence rates, and also lower the healthcare costs. Ramick recommends data marts (smaller data warehouses specific to a department) for disease management programs because they reduce the maintenance issues and require fewer financial resources for deployment. It is important to note that, whether the healthcare institution uses a data mart or a clinical data warehouse, the data cleansing process in the case of disease management programs is very different from that of normal data warehouses. Ramick also points out that in the case of disease management the data in the CDW performs two functions: stratifying the patients by risk level for targeted medical conditions, and tracking patients' progress through the disease management program [37]; hence, one needs to proceed with caution when eliminating data during the data cleaning phase, because a patient's address,

their occupation etc. could be significant. For example, an asthmatic patient's address could reveal the environmental hazards in the area that they live in. Ramick discusses case studies where disease management programs are being implemented based on data mining in clinical data warehouses. In one such case U.S. Quality Algorithms (USQA) collects administrative data from pharmacies, laboratory claims etc. Certain ailments that can be controlled by disease management are flagged in the CDW, and patients at risk are identified based on the data related to diagnoses, procedures, laboratory tests, and drug prescriptions in the CDW for each patient. Another company, Blue Cross and Blue Shield, is in a unique position: their data warehouse contains not only the information about the lab tests ordered, but also the test results. The time lag associated with this data is only a couple of days, so that information can be analyzed in time and have the greatest impact when it is needed.

In our opinion, apart from cube querying, other data mining techniques can be very useful for the data in the CDW. Spatial data mining and spatio-temporal data mining could reveal great insights into a patient's condition based on their geographic location. Once a cause is identified, deeper studies are needed for all the member patients residing in that geographic location. Machine learning techniques such as clustering, building a classifier, contrast-set mining based on demographics etc. are some of the other data mining techniques that can extract interesting and previously hidden patterns in the CDW data. Data mining on the disease management program data that resides in the CDW not only helps in early detection and prevention of diseases and efficient targeting of resources; it can also aid current medical research by identifying the most common diseases affecting the general population that are the result of society, lifestyle, environmental factors or personal choices.
At the same time it can also act as an early warning system for health administrators and the general public alike.

3.5 Monitoring the Monitor: Data Warehouses for Supervising Feedback Integration

In Health Informatics one often tends to wonder whether the clinical practices are being followed properly by the clinicians, or whether the practices can be improved further. In most other fields the intention behind applying data mining techniques is to extract previously hidden nuggets of information; in Health Informatics, however, one of the aspects of data mining is to monitor the changes made to the processes based on the hidden information obtained from the data, i.e. to check whether the changes made to a previously faulty practice produce positive results. The EPR data can and should be used for providing feedback for process improvement as well as for finding the deficiencies in the system [7]. Grant et al. define feedback as a source of objective information on the process and outcome of patient care [19]. In their view, feedback should enable itemized review by a patient care team and critique with respect to best evidence, be a primary source of information for consensual practice improvement, and support education for students and the team. They provide an example wherein a system was developed to provide feedback to the clinical teams before the installation of their clinical Data Warehouse. The system was used while conducting a clinical study investigating whether there was excess use of blood gas measurements in the ICU. Two physicians and two nurses formed a committee of experts for studying this problem. During the evidence phase (figure 4) the committee considered evidence in the form of data to find the relation between blood-gas requests and special events like surgery, time of the day etc.

Figure 4: The autocontrol practice change methodology (taken from [19])

In the critique phase the model was improved upon, and solutions and causes were discussed. In the construction phase a plan was laid out for a change in the practice. Finally, in the integration phase the changes were adopted and the practice data evaluated to measure the failure of the change that was brought into force. While the above does not require the use of a CDW, Grant et al. are of the view that the use of a CDW would optimize the previous task by using various tools and approaches for using practice data as feedback for practice change, optimization, and innovation [19]. Grant et al. [19] discuss the use of the dashboard concept for enhancing, critiquing, and understanding the data. The term dashboard is usually used to describe a system that provides a human-computer interface to the user by employing a set of windows that can be used for dynamic querying of data within certain partitions, ranges, and combinations, such that the results of the queries are portrayed in the form of graphs, tables, charts etc. Since the queries can be constructed and run dynamically, the dashboard allows the decision maker to run a set of such queries, some of them based on the results of the previous queries, in order to assess a certain situation from all angles. The dashboard concept is similar to that of performance indicators and benchmarks that can provide feedback at the same time for practice evaluation and change.

The use of audit and feedback as a tool for quality assurance has been studied widely [25]. However, the use of feedback as a tool for bringing about change in faulty practices has been severely limited for several reasons, one of the biggest being resistance to change. It has also been reported that some individuals misconstrue the whole purpose of feedback as something that might be used against them. A review study regarding the unnecessary use of lab tests found no evidence of the success of implementation of feedback.
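The dynamic querying that a dashboard offers can be illustrated with a small sketch. It assumes a hypothetical list of blood-gas request records with made-up fields (`unit`, `hour`, `post_surgery`) and is not the software described by Grant et al.; it only shows how filters on exact values and ranges can be combined with a grouping dimension.

```python
from collections import Counter

# Hypothetical blood-gas request records.
REQUESTS = [
    {"unit": "ICU", "hour": 2, "post_surgery": True},
    {"unit": "ICU", "hour": 3, "post_surgery": True},
    {"unit": "ICU", "hour": 14, "post_surgery": False},
    {"unit": "ICU", "hour": 15, "post_surgery": True},
    {"unit": "ER", "hour": 2, "post_surgery": False},
    {"unit": "ER", "hour": 23, "post_surgery": False},
]


def matches(record, filters):
    """A filter is either an exact value or an inclusive (low, high) range."""
    for key, wanted in filters.items():
        value = record[key]
        if isinstance(wanted, tuple):
            low, high = wanted
            if not (low <= value <= high):
                return False
        elif value != wanted:
            return False
    return True


def query(records, group_by, **filters):
    """Count the records that pass every filter, grouped by one dimension."""
    return Counter(r[group_by] for r in records if matches(r, filters))
```

A decision maker could first run `query(REQUESTS, "post_surgery", unit="ICU")` and then, based on that result, narrow the view with `query(REQUESTS, "unit", hour=(0, 6))`.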
3.6 Data Warehouse: Construction and Design

The task of designing and constructing a Data Warehouse is very complex: it involves many technical issues related to a number of fields and subfields. Sen and Sinha [43] discuss about fifteen methodologies for this purpose. Sahama and Croll [40] explore the pros and cons of some of these methodologies for the purpose of designing their own Data Warehouse. Batini et al. [8] discuss various strategies such as top-down, bottom-up, inside-out, and mixed strategy. For the purposes of a broad classification Hackney [22] classifies the design philosophies into two categories, viz. Enterprise-wide Data Warehouse design and Data-Mart design. Sen and Sinha make an exhaustive comparison of various such infrastructure-based philosophies in table 2 of their article. The data mart design philosophy was first discussed by Kimball [28], wherein a combination of the top-down approach and the bottom-up approach is presented, and the union of all such data marts forms a data warehouse. The metadata part of a data warehouse is much more

voluminous than that of OLTP systems. Sen and Sinha advise the use of metadata management for this purpose. For the comparison of the different data warehousing methodologies the reader is referred to the article by Sen and Sinha [43]. Herein we discuss two important points from their article. The first point is related to change management. Company diversification, mergers, acquisitions etc. may lead to a redefinition of business objectives, priorities, and business rules. Also, the design of the data warehouse needs to incorporate the inherent dynamicity in the data, such as new products, new sales regions, customers' address changes etc. The authors stress the fact that change management is an important issue that is often neglected by the vendors.

The final point that we would like to discuss regarding the design and construction of a data warehouse is the Extraction, Transformation and Load (ETL) step. Ironically, while the more complex technological tasks have been solved, the seemingly simple task of extracting the data from different sources (that may involve different platforms, file types etc.), cleaning it, and integrating it before loading it into the CDW turns out to be the most challenging one. Sahama and Croll report that as much as 90% of the effort could be spent in this step alone; however, a better modeling process can save a lot of time and effort during this phase. Unlike other decision systems, in the case of a CDW coming up with all the business requirements at the beginning of the project is not feasible, as the users are not aware of the hidden knowledge in the data. At the same time they are also not aware of the capabilities of the CDW. Sen and Sinha, as well as Inmon [23], advise against using a Software Development Life Cycle strategy for implementation. Other techniques such as spiral development approaches have also been proposed.
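A toy version of the ETL step shows why even a small integration involves several kinds of cleaning. The sketch below assumes two hypothetical sources (a CSV lab export and a JSON pharmacy feed), made-up field names, and an in-memory SQLite database standing in for the CDW; the glucose conversion factor of about 18.016 mg/dL per mmol/L follows from the molar mass of glucose.

```python
import csv
import io
import json
import sqlite3

# Hypothetical source extracts with inconsistent formats, units and field names.
LAB_CSV = "patient_id,test,value,unit\n 17 ,glucose,5.4,mmol/L\n18,glucose,110,mg/dL\n"
PHARMACY_JSON = '[{"PatientID": "17", "drug": "Metformin"}, {"PatientID": "", "drug": "Aspirin"}]'


def extract():
    """Read each source in its native format."""
    labs = list(csv.DictReader(io.StringIO(LAB_CSV)))
    prescriptions = json.loads(PHARMACY_JSON)
    return labs, prescriptions


def transform(labs, prescriptions):
    """Normalize identifiers and units; drop rows without a patient identifier."""
    lab_rows = []
    for row in labs:
        value = float(row["value"])
        if row["unit"] == "mg/dL":
            value = round(value / 18.016, 2)  # glucose: mg/dL -> mmol/L
        lab_rows.append((int(row["patient_id"]), row["test"], value))
    rx_rows = [
        (int(p["PatientID"]), p["drug"].lower())
        for p in prescriptions
        if p["PatientID"].strip()
    ]
    return lab_rows, rx_rows


def load(lab_rows, rx_rows):
    """Load the cleaned rows into an in-memory database standing in for the CDW."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE lab_result (patient_id INTEGER, test TEXT, value_mmol_l REAL)")
    db.execute("CREATE TABLE prescription (patient_id INTEGER, drug TEXT)")
    db.executemany("INSERT INTO lab_result VALUES (?, ?, ?)", lab_rows)
    db.executemany("INSERT INTO prescription VALUES (?, ?)", rx_rows)
    db.commit()
    return db
```

Running `load(*transform(*extract()))` yields a database in which both lab values share one unit and the prescription without a patient identifier has been dropped.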
Figure 5: Different warehousing methodologies (taken from [43])

3.7 Case Studies

Here we present case studies from research papers to support the already established view that data mining, in combination with clinical data warehousing, can play an important role in the administrative, clinical, research and training aspects of Health Informatics.

3.7.1 OLAP for Claims Processing

Verma and Harper discuss the claims processing system for PriMed Management, Inc., which is a management services company for Hill Physicians Medical Group [46]. Until 1996 PriMed was using its transactional system for processing claims as well as for input of daily authorization requests for medical procedures. They found that the reporting jobs run on the transactional system were significantly slowing down other jobs such as claims processing and authorization processing. Hence they decided to construct a data warehouse that could, apart from speeding things up, help the senior management track the trends in the data. After the data warehouse was built, data mining was applied, and it was observed that the Health Data Analysis Department became very effective at responding to typical requests for information. They were able to initiate new reporting tools in the form of standardized monthly or quarterly reports for the purposes of: (1) physician compensation analysis, (2) physician profiling, (3) utilization (facility) reporting, (4) disease state management reporting, and (5) analysis of contract viability.

3.7.2 Clinical Data Warehouse for University Health Network

Ledbetter and Morgan discuss the details of their project that involved the construction of a Clinical Data Warehouse (CDW) for the University Health Network [32]. University Health Network (UHN) comprises three Toronto hospitals: Toronto General Hospital, Toronto Western Hospital, and Princess Margaret Hospital. They have a transaction processing system called Patient 1 that is used for admission-transfer-discharge (ADT), order entry, and results review. This system contains information for more than 300 million patients with twelve million visits spanning a period of ten years. Patient 1 is available to the clinicians 24 hours a day. In order to identify the opportunities for quality improvement, such as cutting down on unnecessary clinical testing, optimizing anti-microbial therapy etc., UHN embarked on building a CDW, Decision 1.
Now, Decision 1 is being used by the CDS system to issue clinical alerts to the clinicians if a requested investigation is a duplicate, or to suggest changing antibiotics from intravenous to oral etc. The CDW is also being used to monitor the effectiveness and impact of the alerts, and for monitoring resource utilization as quality improvement targets. In terms of the diagnostic tests prescribed by the clinicians, it was found that of the five tests that constitute 95% of all the hematology examinations, namely complete blood count (CBC), activated partial thromboplastin time (APTT), prothrombin time (PT), blood film review, and fibrinogen, at least 10%, and sometimes as much as 25%, of the tests were redundant based on widely accepted time periods. The reader is advised to consult the original paper for all the advantages derived from the use of data mining techniques coupled with a Clinical Data Warehouse. The future endeavors include (1) flagging high-risk patients based on the risk factors discovered by the CDW, (2) preventing medication errors, (3) offering cost advisement on antibiotics when lower-cost alternatives exist, (4) providing clinical reminders to clinicians to help them comply with standard protocols etc.

3.7.3 Feedback Integration

Sherbrooke University Hospital (CHUS) in Montreal has a transaction processing system called ARIANE, and also a Clinical Data Warehouse called CIRESSS. Prior to the installation of the CDW, complex software was designed in order to provide feedback to the clinical teams. Grant et al. [19] discuss one such piece of software that was designed to monitor the use of blood gas measurements in intensive care units. Apart from the fact that such software can be immensely complex to build, there are other problems such as minimal code re-use: often software needs to be redesigned from scratch in order to cater to a different problem. Both problems were overcome by the use of a CDW. Two dashboards were designed for the use of feedback: one for the emergency department, and the other for the clinical biochemistry department.
The design process included suggestions from the end users. While the post-feedback results have not been published yet, streamlining of the processes was observed from the day the software went into operation.

3.7.4 Data Warehousing for Disease Management

The following is after [37]. U.S. Quality Algorithms (USQA) uses a data warehouse to collect administrative data from pharmacies and laboratory claims. Certain ailments

such as diabetes, cardiovascular diseases, asthma etc. that respond well to disease management programs are flagged by using algorithms that examine the diagnoses, lab procedures, and drugs used by the patients. These patients are then targeted for applicable member mailings, or are placed in disease management programs. Once these patients are in the program(s) they generate more data that can be analyzed further to single out patients for whom the disease is under control vs. those for whom this is not the case. The proper deployment of successful data warehouses in disease management programs benefits both the organization and the member patient. Another company, Horizon Mercy, found that the most common diagnosis amongst pediatric patients was asthma. Placing these patients in asthma management programs allowed the company to make key patient interventions, create educational programs, and save on the high emergency room costs for this segment of the population.

3.7.5 Shared Patient Records: A Means for Conducting Nationwide Research

In this day and age the data may need to be transferred across geographically different locations before data mining can be applied to it. Knaup et al. introduce an architecture called eardap for shared electronic patient records [29]. The architecture was implemented for pediatric oncology in Germany, whereby about 20 clinical trial centers spread throughout Germany require data regarding the treatment of each patient. The authors claim that the architecture is extensible for new research questions, and that it can reuse data for multiple purposes. eardap places special emphasis on the information systems of the EPR's source hospital, and also on the security issues. Multiple documentation, laboratory examinations etc. are avoided by sharing the data, and thus there is a huge saving in cost.
There are two main features for data use in eardap: (1) general functions of the EPR such as patient administration, reporting and analysis, and (2) answering research questions such as those on therapy optimization or epidemiologic questions. Also, the amount of data can be enormous and it can be complex. It has been found that the use of eardap has resulted in a smooth process of data transfer between different research partners, and it works well in heterogeneous environments. None of this would have been possible without the use of Electronic Patient Records.

3.8 Summary

The use of Clinical Data Warehouses is on the rise, and health institutions and insurance companies alike are reaping the benefits in terms of reduced cost of operations, timely treatment of patients, streamlining of operations, and better education and training opportunities. However, it is important to realize that the field is still in the development phase, and that many challenges lie ahead. As such there are a lot of exciting opportunities ahead. Ebadollahi et al. [15] report a new development in the field of electronic health records. They present the idea of using concept-based multimedia health records to better organize the health records at the information level. Schabetsberger et al. [42] report on the secure regional healthcare network being developed in Austria. Sartipi et al. [41] report on a new architecture called Service Oriented Architecture that provides standards for sharing data and services; they model the components in the system in the form of workflows.

4. Challenges of Data Mining in Health Informatics

In this section, we give an overview of a number of challenges faced in both research and practice of data mining in Health Informatics.

4.1 Heterogeneity of Health Data

Currently, there are limited or no centralized databases of health informatics data. A large portion of potentially relevant health information is not stored electronically. The fraction that is stored electronically is scattered in hundreds of small databases through different clinics, hospitals, and laboratories.
This data can be in many different formats (e.g. text, image, video) and is collected from various sources, such as patient records, doctor comments, and laboratory test results.

Many of the potential applications of data mining in health informatics, discussed in previous sections, require centralized databases that integrate different formats of health data from various sources. While there is a recent push from the governments of Canada and the United States for developing such centralized databases, these projects are still in their infancy, and many applications of data mining in health informatics are not attainable until the centralized databases are more fully developed.

4.2 Disconnect between computer science and medical communities

A main challenge in applying data mining techniques to health informatics is a disconnect between the computer science and medical communities. At the most basic level, while simple practical issues such as placing computers in sterile areas or training physicians to use various software packages are often assumed to be trivial in the computer science community, in reality they are very difficult challenges. For instance, one of the main reasons reported by health care professionals for not utilizing a Clinical Decision Support system was the extreme difficulty physicians had in electronic ordering and interacting with electronic records [20]. There is also a disconnect between the types of learning algorithms utilized by the machine learning and data mining communities, and the algorithms that the medical community feels comfortable using. In particular, while the data mining community is interested in applying the latest and most complex algorithms to medical datasets to achieve the highest accuracy possible, non-academic clinicians prefer "simple, understandable models" [20]. Matheny and Ohno-Machado claim that the more sophisticated machine learning techniques (such as SVM, Neural Networks and decision trees) have limited or no representation in health informatics applications. On the other hand, simpler models, such as linear and logistic regression and scoring systems, are popular.
Matheny and Ohn-Machad explain the ne reasn fr unppularity f the mre cmplex mdels is that these mdels are nt well disseminated r well evaluated in the bimedical cmmunity. 4.3 Legal and Ethical Issues Data wnership, fear f lawsuits, and privacy cncerns are ther challenges that currently cnstrain the extended use f data mining in health infrmatics. Bellw is a shrt summary f each issue as described by Cis and Mre [12]. Data wnership: There is an unsettled questin f wnership f patient data. In particular, it is unclear whether the patients, the physicians, the labratries, r the insurance cmpanies wn the data cllected frm patients. There have been a number f lawsuits and cngressinal inquiries t address these issues [16], but the questin f health data wnership is still unsettled. Fear f lawsuits: In medical cmmunities, particularly in the Unites States, there is a fear f malpractice and ther cstly lawsuits that adds t the challenges f applying data mining in health infrmatics. Ptential lawsuits, that may be triggered by discvering anmalies in patient medical histries, leave medical prfessinals unwilling t share patient data with researchers. Privacy issues: Prtecting patient privacy and dctr-patient cnfidentially adds anther sets f challenges t data mining in health infrmatics. Administratrs and researchers shuld pay utmst attentin t privacy and security when transferring, string, r mining patient data. In many cases, patient recrds needs t be annymus (i.e. patient identities are remved at the time f infrmatin cllectin), annymized (that is patient identities are remved after the data is cllected), r de-identified (i.e. patient identities are encrypted and can be restred under certain institutinal plicies).
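The difference between anonymized and de-identified records described above can be illustrated with a minimal Python sketch. It is not taken from the cited work: the field names, the secret key, and the `escrow` lookup table are all hypothetical. The sketch also substitutes a keyed HMAC pseudonym plus a separately held lookup table for the encryption mentioned in the text, because the Python standard library does not include a symmetric cipher.

```python
import hashlib
import hmac

# Hypothetical identifying fields; real records carry many more identifiers.
IDENTIFYING_FIELDS = ("name", "health_card_number")


def anonymize(record):
    """Drop identifying fields after collection; the link to the patient is lost."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def de_identify(record, secret_key, escrow):
    """Replace identifying fields with a keyed pseudonym.

    The original identity is stored in `escrow`, a mapping that only an
    authorized custodian would hold, so it can be restored under policy.
    """
    identity = {k: record[k] for k in IDENTIFYING_FIELDS}
    message = "|".join(str(identity[k]) for k in IDENTIFYING_FIELDS).encode("utf-8")
    pseudonym = hmac.new(secret_key, message, hashlib.sha256).hexdigest()[:16]
    escrow[pseudonym] = identity
    cleaned = anonymize(record)
    cleaned["pseudonym"] = pseudonym
    return cleaned


def re_identify(record, escrow):
    """Restore the identity of a de-identified record from the escrow table."""
    restored = {k: v for k, v in record.items() if k != "pseudonym"}
    restored.update(escrow[record["pseudonym"]])
    return restored


if __name__ == "__main__":
    patient = {"name": "Jane Doe", "health_card_number": "0000-000-000",
               "diagnosis": "hypertension"}
    escrow = {}
    print(anonymize(patient))
    coded = de_identify(patient, b"demo-key-not-for-production", escrow)
    print(coded)
    print(re_identify(coded, escrow) == patient)
```

An anonymized record cannot be traced back to the patient, whereas a de-identified record can be traced back by whoever holds the escrow table. Removing direct identifiers alone does not guarantee privacy, since the remaining fields may still single out an individual.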

5. Conclusion

We have provided an overview of applications of data mining in the administrative, clinical, research, and educational aspects of Health Informatics. We established that while the current practical use of data mining in health-related problems is limited, there exists a great potential for data mining techniques to improve various aspects of Health Informatics. Furthermore, the inevitable rise of clinical data warehouses will increase the potential for data mining techniques to improve the quality and decrease the cost of healthcare.

References

[1] American Medical Informatics Association, http://www.amia.org/informatics/.
[2] Canada's Health Informatics Association, http://www.coachorg.com/.
[3] National Library of Medicine, http://www.nlm.nih.gov/tsd/acquisitions/cdm/subjects58.html.
[4] Canadian Institute of Health Research, http://www.mshri.on.ca/colorectalcancer/definitions.html, 05/25/2008.
[5] Homer Learning Community, https://homer.med.ualberta.ca/.
[6] http://www.info-source.us/data_warehousing_mining/data-mining-and-data-warehousing-in-biology-Medicine-and-Health-Care/image004.jpg
[7] Allard R.D. The clinical laboratory data warehouse: An overlooked diamond mine. Am. J. Clin. Pathol. 817-819, 2003.
[8] Batini C., Ceri S., Navathe S. "Conceptual Database Design: An Entity-Relationship Approach". Addison-Wesley. Spanish, 1991; ISBN 0-201-60120-6.
[9] Berner E., "Clinical Decision Support Systems". Springer Science+Business Media, 2007.
[10] Brown MP, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M Jr, Haussler D., "Knowledge-based Analysis of Microarray Gene Expression Data using Support Vector Machines". Proceedings of the National Academy of Sciences, 2007.
[11] Chen Y., Henning Pedersen L., Wesley W. Chu, Olsen J., "Drug Exposure Side Effects from Mining Pregnancy Data". ACM SIGKDD Explorations Newsletter, 2007.
[12] Cios K., Moore GW., "Uniqueness of Medical Data Mining". Artificial Intelligence in Medicine, 2002.
[13] Cooper G F., Dash D H., Levander J D., Wong W K, Hogan W R., Wagner M M., "Bayesian Biosurveillance of Disease Outbreaks". ACM International Conference Proceeding Series; Vol. 70, 2004.
[14] Curtright C., Crawford R., Klubert D. Criteria for Developing Clinical Decision Support Systems. CBMS '01: Proceedings of the Fourteenth IEEE Symposium on Computer-Based Medical Systems, 2001.
[15] Ebadollahi S., Coden A., Tanenblatt M., Chang S., Syeda-Mahmood T., Amir A. Concept-based electronic health records: Opportunities and Challenges. MULTIMEDIA '06: Proceedings of the 14th annual ACM international conference on Multimedia, 997-1006, 2006.
[16] Fienberg S., "Sharing Statistical Data in the Biomedical and Health Sciences: Ethical, Institutional, Legal, and Professional Dimensions". Annual Review of Public Health, Vol. 15: 1-18, 1994.
[17] Frawley W., Piatetsky-Shapiro G., Matheus C. Knowledge Discovery in Databases: An Overview. AI Magazine. Vol 13, number 3, 57-70, 1992.
[18] Geoff McMaster, Goodbye old school, hello HOMER, Express News, University of Alberta, January 24, 2008. Available at: http://www.expressnews.ualberta.ca/article.cfm?id=9025
[19] Grant A., Moshyk A., Diab H., Caron P., Lorenzi F., Bisson G., Menard L., Lefebvre R., Gauthier P., Grondin R., Desautels M. Integrating feedback from a clinical data warehouse into practice organisation. International Journal of Medical Informatics. Volume 75, Issues 3-4, Pages 232-239, March-April 2006.
[20] Gray G. Challenges of building clinical data analysis solutions, Journal of Critical Care, Volume 19, Issue 4, December 2004, Pages 264-270.
[21] Greenes R., "Clinical Decision Support". Elsevier Inc., 2007.
[22] Hackney D. Understanding and Implementing Successful Data Marts. Addison-Wesley Longman Publishing Co., Inc., 1997.
[23] Inmon W. What is a Data Warehouse? Sunnyvale, Calif.: Prism Solutions Inc., 1995.
[24] Izadi MT, Buckeridge DL, "Decision Theoretic Analysis of Improving Epidemic Detection". American Medical Informatics Association, 2007.
[25] Jamtvedt G., Young J., Kristoffersen D., O'Brien M., Oxman A. Audit and feedback: effects on professional practice and health care outcomes. Cochrane Database of Systematic Reviews, 1998.
[26] Johnson JA., Bootman HL., Drug-related morbidity and mortality: a cost of illness model. Arch Intern Med 1995; 266:2847-2851.
[27] Kim J., Feng D., Cai T., Eberl S. A solution to the distribution and standardization of multimedia medical data in E-Health. Proc. Pan-Sydney Area Workshop on Visual Information Processing - Volume 1, 2001.
[28] Kimball R., Ross M. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd edition. Wiley, New York, 2002.
[29] Knaup P., Garde S., Merzweiler A., Graf N., Schilling F., Weber R., Haux R. Towards shared patient records: An architecture for using routine data for nationwide research. International Journal of Medical Informatics, Volume 75, Issue 3-4, 191, 2005.
[30] Kohn LT., Corrigan JM., Donaldson MS., eds. To err is human. Washington D.C.: National Academy Press, 1999.
[31] Kuperman GJ, Gardner RM, Pryor TA, "HELP: A dynamic hospital information system". Springer-Verlag, 1991.
[32] Ledbetter C., Morgan M. Toward Best Practice: Leveraging the Electronic Patient Record as a Clinical Data Warehouse. Journal of Healthcare Information Management, Vol 15, Part 2, pages 119-132, 2001.
[33] Leonard JE, Colombe JB, Levy JL, "Finding relevant references to genes and proteins in Medline using a Bayesian approach". Bioinformatics Vol. 18, no. 11, 2002.