Data Mining: Introduction and a Health Care Application

Prem Swaroop (pswaroop@rhsmith.umd.edu); Dr. Bruce Golden (bgolden@rhsmith.umd.edu)
Robert H. Smith School of Business, University of Maryland, College Park

Case: Medicare and Hospitalization Costs
Medicare has announced that it will no longer reimburse hospitals for errors or for nosocomial infections. To reduce the occurrence of resistant infections, one 1000-bed hospital wants to implement a protocol to prevent them. High-risk patients admitted for elective surgery will be identified. They will be admitted to the hospital 24 hours prior to surgery (the usual protocol has them admitted after surgery) and placed on IV vancomycin. The antibiotic will be continued until discharge. Risk is not uniform and is based on a combination of patient demographics, diagnoses, and procedures. Further, the cost to treat infection is not uniform across patients:
- A superficial wound generally will not add to the overall length of stay.
- A deep skin wound requires ten days to three weeks of IV antibiotics (the date of discharge will vary).
- An infection in the bone requires six weeks of antibiotics, and carries the additional risk to the patient of limb amputation or death.
- A resistant infection in the lungs is life-threatening, and the patient will be moved to the ICU, where the daily costs to treat increase substantially.

Case: Breast Cancer Detection
Breast cancer is a disease in which malignant (cancer) cells form in the tissues of the breast. It is the second leading cause of cancer deaths in women today (after lung cancer) and is the most common cancer among women, except for skin cancers. About 1.3 million women are expected to be diagnosed with breast cancer annually worldwide, and about 465,000 will die from the disease. In the United States alone, an estimated 240,510 women were expected to be diagnosed with breast cancer in 2007, and 40,460 were expected to die from it.
Screening is looking for cancer in asymptomatic people, i.e., before a person has any symptoms of the disease. Cancer screening can help find cancer at an early stage. When abnormal tissue or cancer is found early, it is often easier to treat. By the time symptoms appear, cancer may have begun to spread. The good news is that breast cancer death rates have been dropping steadily since 1990, both because of earlier detection via screening and because of better treatments.
Steps for confirming cancer:
1. Screening mammography
2. Diagnostic mammography (5-10% of screened women)
3. Biopsy (3-10 cases per 1000 screened women)
Undetected cancers at the screening stage:
- Multiple reasons
- Double reading implemented at many sites: increased detection by 4-15%, but expensive
- Computer Aided Detection (CAD): high rate of detection
Improvements would help tremendously in:
- Sensitivity: (predicted) true positives / actual positives
- Specificity: (predicted) true negatives / actual negatives
4-stage CAD system:
1. Candidate generation
2. Feature extraction
3. Classification (detection)
4. Visual presentation
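The two detection metrics named on this slide can be computed directly from labeled screening outcomes. The sketch below is only an illustration: the function name and the ten example labels are made up, with 1 standing for "cancer" and 0 for "no cancer".

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    sensitivity = tp / (tp + fn)  # detected share of actual positives
    specificity = tn / (tn + fp)  # correctly cleared share of actual negatives
    return sensitivity, specificity


# Hypothetical outcomes for ten screened cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy data, one missed cancer lowers sensitivity and two false alarms lower specificity, which is the trade-off a CAD system has to manage.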

Data Mining: Introduction
- Motivation: Why mine data?
- Definition: What is data mining?
- Functionality: Key data mining tasks
- Classification: Multi-dimensional view
- Techniques and Example Applications
- Challenges
- Software Tools
- References

Motivation: Why mine data? Commercial viewpoint
- Lots of data is being collected and warehoused:
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management)

Motivation: Why mine data? Scientific viewpoint
- Data is collected and stored at enormous speeds (GB/hour):
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists:
  - in classifying and segmenting data
  - in hypothesis formation

Motivation: Why mine data? Summary
- There is often information hidden in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining is the automated analysis of massive data sets

Definition: What is data mining? Many definitions
- Knowledge discovery from data
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Definition: What is data mining Computational Knowledge Discovery

Definition: What is data mining? Origins
[Figure: data mining at the center, drawing on machine learning, pattern recognition, statistics, applications, visualization, algorithms, database technology, and high-performance computing]

Functionality: Key data mining tasks. KDD process (database view)
- This is a view from typical database systems and data warehousing communities
- Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process stages: data cleaning and data integration, data warehouse, selection, task-relevant data, data mining, pattern evaluation]

Functionality: Key data mining tasks. KDD process (machine learning and statistics view)
- This is a view from typical machine learning and statistics communities
[Figure: input data, then data pre-processing, then data mining, then post-processing]
- Data pre-processing: data integration, normalization, feature selection, dimension reduction
- Data mining: pattern discovery, association and correlation, classification, clustering, outlier analysis
- Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

Functionality: Key data mining tasks. KDD process (business intelligence view)
- This is a view from business intelligence communities
[Figure: pyramid with increasing potential to support business decisions toward the top]
Layers, from top to bottom:
- Decision making
- Data presentation: visualization techniques
- Data mining: information discovery
- Data exploration: statistical summary, querying, and reporting
- Data preprocessing/integration, data warehouses
- Data sources: paper, files, Web documents, scientific experiments, database systems
Roles, from top to bottom: end user, business analyst, data analyst, DBA

Classification: Multi-dimensional view
- General functionality:
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications:
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

Classification: Multi-dimensional view
- Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
- Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels
- Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted: retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Techniques and Example Applications
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
- Structure and Network Analysis

Techniques: Classification
- Given a collection of records (training set), each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
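The train/test procedure described above can be sketched in a few lines. The slide does not name a particular classifier, so this sketch assumes a 1-nearest-neighbour model for simplicity; the records, the helper names, and the split size are all hypothetical.

```python
import math
import random


def split_records(records, n_test, seed):
    """Shuffle the records reproducibly and hold out n_test of them as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def predict_1nn(train, attributes):
    """Assign the class of the closest training record (Euclidean distance)."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the known class."""
    correct = sum(1 for attrs, label in test if predict_1nn(train, attrs) == label)
    return correct / len(test)


# Hypothetical records: ((attribute 1, attribute 2), class).
records = [((i * 0.1, i * 0.2), "no") for i in range(10)]
records += [((10 + i * 0.1, 10 + i * 0.2), "buy") for i in range(10)]

train, test = split_records(records, n_test=6, seed=0)
acc = accuracy(train, test)
print(f"train={len(train)} test={len(test)} accuracy={acc:.2f}")
```

The test set is never used to build the model here, which is the point of the split: accuracy on held-out records is the estimate of how well unseen records will be classified.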

Example Application: Classification
Direct Marketing
- Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.

Techniques: Clustering
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity measures:
  - Euclidean distance if attributes are continuous.
  - Other problem-specific measures.
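As one concrete way to find such clusters, the sketch below uses k-means with Euclidean distance. The slide does not prescribe an algorithm, so k-means is an assumption made for illustration, as are the sample points and the choice of initial centroids.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-attribute data points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

For attributes that are not continuous, `math.dist` would be replaced by a problem-specific similarity measure, as the slide notes.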

Example Application: Clustering
Market Segmentation
- Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Technique & Example Application: Association Rules
- Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that will predict the occurrence of an item based on occurrences of other items.
Marketing and Sales Promotion
- Let the rule discovered be {Bagels, ...} --> {Potato Chips}
- Potato Chips as consequent: can be used to determine what should be done to boost its sales.
- Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
- Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!
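How strongly a rule such as {Bagels} --> {Potato Chips} holds is commonly measured by its support and confidence. These two measures are not defined on the slide, so the sketch below is a supplementary illustration, and the five market-basket transactions are invented.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent --> consequent."""
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    support = n_both / len(transactions)  # share of all baskets containing both sides
    confidence = n_both / n_antecedent    # share of antecedent baskets that also hold the consequent
    return support, confidence


# Hypothetical market-basket transactions.
transactions = [
    {"Bagels", "Potato Chips", "Milk"},
    {"Bagels", "Potato Chips"},
    {"Bagels", "Juice"},
    {"Milk", "Potato Chips"},
    {"Bagels", "Potato Chips", "Juice"},
]
support, confidence = support_and_confidence(transactions, {"Bagels"}, {"Potato Chips"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

A rule-discovery algorithm would enumerate many candidate rules and keep those above chosen support and confidence thresholds; this sketch only scores a single rule.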

Technique & Example Application: Sequential Patterns
- Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
- Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
- In telecommunications alarm logs:
  (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
- In point-of-sale transaction sequences:
  - Computer bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  - Athletic apparel store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
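Discovering such patterns starts with counting how many timelines contain a candidate pattern in order. The sketch below does only that counting step and ignores the timing constraints mentioned above; the four purchase timelines and the item "Other" are hypothetical.

```python
def contains_pattern(sequence, pattern):
    """True if every pattern element is a subset of some event, in the given order."""
    position = 0
    for element in pattern:
        # Greedily take the earliest later event that covers this element.
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next element must match a strictly later event
    return True


def pattern_support(sequences, pattern):
    """Fraction of timelines that contain the pattern."""
    return sum(1 for s in sequences if contains_pattern(s, pattern)) / len(sequences)


# Hypothetical purchase timelines; each event is a set of items bought together.
sequences = [
    [{"Intro_To_Visual_C"}, {"C++_Primer"}, {"Perl_for_dummies", "Tcl_Tk"}],
    [{"C++_Primer"}, {"Intro_To_Visual_C"}],
    [{"Intro_To_Visual_C", "Other"}, {"Other"}, {"C++_Primer", "Tcl_Tk"}],
    [{"Perl_for_dummies"}],
]
pattern = [{"Intro_To_Visual_C"}, {"C++_Primer"}]
seq_support = pattern_support(sequences, pattern)
print(seq_support)
```

The second timeline holds both books but in the wrong order, so it does not count toward the pattern's support.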

Technique & Example Application: Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
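For the linear case with a single predictor, the dependency can be fitted by ordinary least squares. The sketch below assumes that simple setting; the advertising and sales figures are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising expenditure vs. sales amount.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_line(advertising, sales)
predicted_sales = intercept + slope * 6.0
print(intercept, slope, predicted_sales)
```

With several predictors (temperature, humidity, air pressure) or a nonlinear dependency, a library routine would normally replace this closed-form fit.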

Technique & Example Application: Deviation Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit card fraud detection
  - Network intrusion detection
  - Auto insurance: ring of collisions
  - Money laundering: suspicious monetary transactions
  - Medical insurance:
    - Professional patients, ring of doctors, and ring of references
    - Unnecessary or correlated screening tests
  - Telecommunications: phone-call fraud
    - Phone call model: destination of the call, duration, time of day or week
    - Analyze patterns that deviate from an expected norm
  - Retail industry: analysts estimate that 38% of retail shrink is due to dishonest employees
  - Anti-terrorism
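A very simple way to flag deviations from an expected norm is a z-score test on a single attribute, such as call duration in the phone-call model. The z-score rule and the threshold of 2 are assumptions chosen for this sketch, not something the slide specifies, and the durations are invented.

```python
import statistics


def flag_deviations(values, threshold=2.0):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / spread > threshold]


# Hypothetical call durations in minutes for one subscriber.
durations = [5, 6, 5, 7, 6, 5, 6, 60]
flagged = flag_deviations(durations)
print(flagged)
```

Because the outlier itself inflates the mean and standard deviation, robust variants (for example, based on the median) are often preferred in practice.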

Technique: Structure and Network Analysis
- Graph mining:
  - Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
- Information network analysis:
  - Social networks: actors (objects, nodes) and relationships (edges), e.g., author networks in CS, terrorist networks
  - Multiple heterogeneous networks: a person could be in multiple information networks (friends, family, classmates, ...)
  - Links carry a lot of semantic information: link mining
- Web mining:
  - The Web is a big information network: from PageRank to Google
  - Analysis of Web information networks: Web community discovery, opinion mining, usage mining, ...

Challenges in Data Mining
- Efficiency and scalability of data mining algorithms
- Parallel, distributed, stream, and incremental mining methods
- Handling high dimensionality
- Handling noise, uncertainty, and incompleteness of data
- Incorporation of constraints, expert knowledge, and background knowledge in data mining
- Pattern evaluation and knowledge integration
- Mining diverse and heterogeneous kinds of data, e.g., bioinformatics, Web, software/system engineering, information networks
- Application-oriented and domain-specific data mining
- Invisible data mining (embedded in other functional modules)
- Protection of security, integrity, and privacy in data mining

Data Mining Software Tools
Top ten most used packages as per the KDnuggets survey (May 2007):
1. SPSS / SPSS Clementine
2. Salford Systems CART/MARS/TreeNet/RF
3. Yale (now RapidMiner) (open source)
4. SAS / SAS Enterprise Miner
5. Angoss Knowledge Studio / Knowledge Seeker
6. KXEN
7. Weka (open source)
8. R (open source)
9. Microsoft SQL Server
10. MATLAB
Source: the-data-mine.com; kdnuggets.com

References
- Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006
- Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Pearson Addison Wesley, 2005
- Ian Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann, 2005

INFORMS Data Mining Contest 2008
Cost Model Formulation for Treatment of Patients at Risk of Nosocomial (Hospital-Acquired) Infections using Data Mining
Prem Swaroop

Overview
- Problem Description
- Cost Model:
  - Motivation and Definitions
  - Formulation
  - Parameter Estimation

Problem Description
- Objective: given two years of patient data, including whether or not the patient contracted an infection during a surgical procedure and the cost to treat that infection, determine an optimal strategy for choosing patients from the test group to minimize the total cost of medication.
- Part 1 (classification): for an unseen patient dataset, determine the probability that each patient will be diagnosed with MRSA.
- Part 2 (policy): develop and justify a realistic cost model (including the cost of prophylactic treatment, the cost of MRSA treatment, and the probabilities from your predictive model), and use it to maximize the total cost savings of the proposed strategy.

Problem Description: Summary
- Proposed strategy: patients to be administered preventive antibiotics 24 hours prior to the start of treatment
- Realistic cost model:
  - Predicted probabilities
  - Costs of treatment
- Policy design: maximize total cost savings

Classification

Classification
- 4 datasets
- Cleaning: merge, remove redundant fields, handle noise
- Preprocessing: association rules to identify the most frequently occurring diseases in history and current diagnosis together with nosocomial infection
- Classification: run multiple algorithms, evaluate prediction results, select the best predicting ones
- Support Vector Machines

Policy

Cost Model: Motivation
[Figure: decision tree]
- Prophylactic treatment (predicted probability > a threshold value): the patient either contracts a nosocomial infection or does not; each outcome carries a cost
- No prophylactic treatment: the patient either contracts a nosocomial infection or does not; each outcome carries a cost

Cost Model: Definitions (Probabilities)
[Figure: decision tree; each leaf carries a cost]
- Prophylactic treatment: probability = Ppred
  - Contracts nosocomial infection: probability = 0
  - Does not contract: probability = Ppred
- No prophylactic treatment: probability = 1 - Ppred
  - Contracts nosocomial infection: probability = Pnos
  - Does not contract: probability = 1 - Pnos

Cost Model: Definitions (Total Costs)
- Prophylactic treatment (probability = Ppred): cost = Creg + Cpro
- No prophylactic treatment (probability = 1 - Ppred):
  - Contracts nosocomial infection (probability = Pnos): cost = Creg + Cnos
  - Does not contract (probability = 1 - Pnos): cost = Creg

Cost Model: Definitions
- Regular treatment cost, Creg:
  - Hospitalization cost, Chos:
    - Facilities cost, Cfc
    - Physician charges, Cmd
  - Medication cost, Crx
- Prophylactic treatment cost, Cpro: antibiotic cost, Canti, times (length of stay, LOS, + 1)
- Treatment cost of nosocomial infection, Cnos:
  - Similar components as Creg
  - Random in nature

Formulation: Objective
Minimize expected incremental cost
- Prophylactic treatment (probability = Ppred): incremental cost = Cpro
- No prophylactic treatment (probability = 1 - Ppred):
  - Contracts nosocomial infection (probability = Pnos): incremental cost = Cnos
  - Does not contract (probability = 1 - Pnos): incremental cost = 0

Formulation
Minimize expected incremental cost = Σ_{i = 1..N patients} {(Ppred[i] x Cpro[i]) + ((1 - Ppred[i]) x Pnos[i] x Cnos[i])}
- Prophylactic treatment (probability = Ppred): incremental cost = Cpro
- No prophylactic treatment (probability = 1 - Ppred), contracts nosocomial infection (probability = Pnos): incremental cost = Cnos

Formulation
Minimize expected incremental cost = Σ_{i = 1..N patients} {(Ppred[i] x Cpro[i]) + ((1 - Ppred[i]) x Pnos[i] x Cnos[i])}
subject to, for all patients i = 1..N:
  Ppred[i] = f(patient i's risk rating for contracting nosocomial pneumonia)
  Cpro[i] = (LOS[i] + 1) x Canti
  Cnos[i] = φ(severity of complications due to nosocomial pneumonia)
  Ppred[i], Pnos[i] ∈ [0, 1], real
  Canti, Cpro[i], Cnos[i] > 0, real
  LOS[i] >= 0, integer
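The objective above can be evaluated directly once the per-patient parameters are known. The following sketch assumes each patient is a dictionary with hypothetical keys `p_pred`, `p_nos`, `c_nos`, and `los`; the two patients and the daily antibiotic cost are made-up numbers, not contest data.

```python
def prophylaxis_cost(los_days, c_anti):
    """Cpro = (LOS + 1) x Canti: antibiotics from the day before surgery until discharge."""
    return (los_days + 1) * c_anti


def expected_incremental_cost(patients, c_anti):
    """Sum over patients of Ppred x Cpro + (1 - Ppred) x Pnos x Cnos."""
    total = 0.0
    for p in patients:
        if not (0.0 <= p["p_pred"] <= 1.0 and 0.0 <= p["p_nos"] <= 1.0):
            raise ValueError("probabilities must lie in [0, 1]")
        c_pro = prophylaxis_cost(p["los"], c_anti)
        total += p["p_pred"] * c_pro + (1.0 - p["p_pred"]) * p["p_nos"] * p["c_nos"]
    return total


# Hypothetical patients and a hypothetical daily antibiotic cost (Canti).
patients = [
    {"p_pred": 0.5, "p_nos": 0.10, "c_nos": 10000.0, "los": 3},
    {"p_pred": 0.0, "p_nos": 0.25, "c_nos": 8000.0, "los": 5},
]
total_cost = expected_incremental_cost(patients, c_anti=100.0)
print(total_cost)
```

For the first patient the two terms are 0.5 x 400 = 200 and 0.5 x 0.10 x 10000 = 500; the second patient receives no prophylaxis and contributes 0.25 x 8000 = 2000.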

Parameter Estimation
Ppred: probability of the patient contracting nosocomial infection
- Estimated AFTER the physician's diagnosis AND the recommendation that the patient be admitted
- Use a data mining classifier: Ppred is the confidence score of the classifier for positive predictions
- Possible improvements:
  - Adjust Ppred for the classifier's performance on an independent test set or using n-fold cross-validation
  - Instead of taking all positive predictions, use a threshold prediction to select patients, and truncate Ppred at the threshold
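The thresholding improvement can be read in more than one way. The sketch below shows one possible interpretation, stated as an assumption: patients whose classifier confidence falls below the threshold are not selected and their Ppred is set to 0, while the others keep their score. The function name, scores, and threshold value are hypothetical.

```python
def truncate_at_threshold(scores, threshold):
    """Keep confidence scores at or above the threshold; set the rest to 0 (not selected)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return [s if s >= threshold else 0.0 for s in scores]


# Hypothetical classifier confidence scores for five positive predictions.
scores = [0.95, 0.55, 0.72, 0.61, 0.88]
p_pred = truncate_at_threshold(scores, threshold=0.7)
selected = [i for i, p in enumerate(p_pred) if p > 0.0]
print(p_pred, selected)
```

The threshold itself would be tuned against the cost model, since raising it trades prophylaxis cost for a higher expected cost of untreated infections.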

Parameter Estimation
Cpro: cost of prophylactic treatment
- Cpro = (LOS + 1) x Canti
- LOS: length of stay, taken as the physician's estimate of LOS upon diagnosis
- Possible improvement: incorporate physicians' estimation errors over the long term

Parameter Estimation
Canti: cost of antibiotics per day
- Antibiotics to treat regular pneumonia
- Patients infected with nosocomial pneumonia would have had additional treatments
Steps:
1. From the hospital dataset, identify patients who had pneumonia, and separate them from those with nosocomial pneumonia.
2. From the medications dataset, identify the set of common drugs across the regular pneumonia patients, and validate this against an external information set (e.g., www.drugs.com). Compute the total cost for these drugs per patient in the given year.
3. In the hospital dataset, for those regular pneumonia patients whose NUMNIGHX has been recorded, compute the cost of drugs per day by dividing the total cost of the stated drugs by NUMNIGHX.
4. Take the mean of the valid costs per day to be Canti.
Possible improvement: use domain knowledge of medical experts
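Steps 3 and 4 above amount to a per-patient division followed by a mean over the valid results. The sketch below assumes the records have already been filtered to regular pneumonia patients (steps 1 and 2); `NUMNIGHX` is the field named in the slides, while `drug_cost` and the four records are hypothetical.

```python
import statistics


def estimate_c_anti(regular_pneumonia_patients):
    """Mean drug cost per night over patients with a recorded, positive NUMNIGHX."""
    costs_per_day = [
        p["drug_cost"] / p["NUMNIGHX"]
        for p in regular_pneumonia_patients
        if p["NUMNIGHX"] is not None and p["NUMNIGHX"] > 0
    ]
    if not costs_per_day:
        raise ValueError("no patient has a valid NUMNIGHX")
    return statistics.fmean(costs_per_day)


# Hypothetical records: yearly cost of the common pneumonia drugs and nights in hospital.
regular_pneumonia_patients = [
    {"drug_cost": 300.0, "NUMNIGHX": 3},
    {"drug_cost": 500.0, "NUMNIGHX": None},  # nights not recorded: skipped
    {"drug_cost": 200.0, "NUMNIGHX": 4},
    {"drug_cost": 150.0, "NUMNIGHX": 0},     # zero nights: skipped to avoid division by zero
]
c_anti = estimate_c_anti(regular_pneumonia_patients)
print(c_anti)
```

Treating a zero value of NUMNIGHX as invalid is a choice made for this sketch; the slides only say that the mean is taken over "valid" costs per day.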

Parameter Estimation
Pnos: probability of a low-risk patient contracting nosocomial pneumonia
- Corresponds to the classifier's false negative predictions
- Use an evaluation of the classifier's performance on an independent test set or n-fold cross-validation
- Possible improvement: keep tuning the classifier!

Parameter Estimation
Cnos: treatment cost of nosocomial pneumonia
- Has components: hospitalization and medication
- Random in nature
Steps:
1. From the hospital dataset, identify patients who had pneumonia, and separate them from those with nosocomial pneumonia.
2. From the medications dataset, identify the set of common drugs across the nosocomial pneumonia patients, and validate this against an external information set (e.g., www.drugs.com). Compute the total cost for these drugs per patient in the given year.
3. In the hospital dataset, note the total hospitalization costs for the nosocomial pneumonia patients, and subtract from each the mean hospitalization cost for regular patients with the respective primary condition.
4. Add the medication and hospitalization costs for each nosocomial pneumonia patient to arrive at the distribution of Cnos; use its mean as the Cnos estimate for the population.
Possible improvement: use regression or another data mining approach to predict severity, and therefore cost of treatment, for a given patient profile
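Steps 3 and 4 above can be sketched as an excess-cost calculation per patient followed by a mean. The record layout, the condition names, and every cost figure below are hypothetical, and the medication costs are assumed to have been computed already in step 2.

```python
import statistics
from collections import defaultdict


def mean_cost_by_condition(regular_patients):
    """Mean hospitalization cost of regular patients, per primary condition."""
    costs = defaultdict(list)
    for condition, hosp_cost in regular_patients:
        costs[condition].append(hosp_cost)
    return {condition: statistics.fmean(values) for condition, values in costs.items()}


def cnos_distribution(nosocomial_patients, regular_means):
    """Per patient: excess hospitalization cost over the regular mean, plus medication cost."""
    return [
        (p["hosp_cost"] - regular_means[p["condition"]]) + p["drug_cost"]
        for p in nosocomial_patients
    ]


# Hypothetical regular patients: (primary condition, hospitalization cost).
regular_patients = [
    ("hip_replacement", 19000.0),
    ("hip_replacement", 21000.0),
    ("bypass", 50000.0),
]
# Hypothetical nosocomial pneumonia patients.
nosocomial_patients = [
    {"condition": "hip_replacement", "hosp_cost": 32000.0, "drug_cost": 1500.0},
    {"condition": "bypass", "hosp_cost": 65000.0, "drug_cost": 2500.0},
]
regular_means = mean_cost_by_condition(regular_patients)
distribution = cnos_distribution(nosocomial_patients, regular_means)
c_nos = statistics.fmean(distribution)
print(distribution, c_nos)
```

Keeping the whole distribution, rather than only its mean, leaves room for the suggested improvement of predicting Cnos per patient profile.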