In this presentation, you will be introduced to data mining and the relationship with meaningful use.




Data mining refers to the art and science of intelligent data analysis. It is the application of machine learning algorithms to large data sets with the primary aim of discovering meaningful insights and knowledge from that data.

Data mining is essentially the construction of data models that instantiate a machine learning algorithm on specific data elements. The model captures the essence of the discovered knowledge and helps us understand the world. Oftentimes, these models are predictive. For instance, data mining models have been applied to healthcare data to predict readmissions, risk of disease, and efficacy of medications.

Modeling is the process of turning all that data into some structured form, or model, that reflects the supplied data in a useful way. The aim of modeling is to explore the data to address a specific problem by modeling or mimicking the real world. For instance, a lot of research has been done in modeling the way in which we make decisions. Machine learning algorithms that use artificial intelligence develop models that closely represent how a human would make a decision. The same methods can be applied to healthcare data, where we attempt to model decision making. For instance, we might want to develop a model to predict drug relapse in patients with a history of drug addiction. The machine learning algorithms, using artificial intelligence, would look at all of the data elements to estimate the likelihood that a patient will relapse. Unfortunately, no model can perfectly represent the world. For instance, we might find that our model predicts a patient will relapse even if the patient does not have a history of drug addiction. In the real world, we would never make this mistake, but due to the rules governing the machine learning algorithm, such mistakes are possible.

To ensure that the model is constructed in a way that limits such mistakes and represents the real world as closely as possible, there is a set of eight steps that can be followed. First, you must have a clear understanding of the data and the business of healthcare. If you do not know what the data mean, it is likely that your model will not make sense. Second, you must partition your data into training, validation, and testing datasets when building, tuning, and evaluating your model. This way, three different sets of data are used to validate your model. Third, build multiple models and compare their performance. You may favor one model, such as a neural network, but that model may not be the most effective; comparing the performance of multiple models will yield the most effective end product. Fourth, if you end up developing a perfect model, something went wrong. Healthcare data is messy and complex, so it is unlikely that you will develop a model that makes perfect decisions; the laws of probability suggest that your model will at times make mistakes. Fifth, don't overlook how the model is to be deployed. Some algorithms are very difficult to deploy. For instance, neural networks are a black box and difficult to automate into a system, whereas rule-based algorithms such as decision trees are very simple to deploy. Sixth, your models should be repeatable and efficient. That is, if you were to take a different set of data and apply your model, you should get similar results. Also, your model shouldn't take three days to run; it should be almost instantaneous, otherwise it is unlikely that it can be implemented in a healthcare setting where everything is fast-paced. Seventh, let the data talk to you, but not mislead you. If you doubt the results of your analysis, question them. Don't assume that the results are the truth; test them, test them again, and again. Lastly, after you have constructed and tested your model, communicate your discoveries effectively and visually.
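The model-comparison step above (building several candidate models and keeping the best performer on a held-out validation set) can be sketched in a few lines. The two rule-based "models" and the relapse records below are hypothetical stand-ins, not real clinical models or data:

```python
# Sketch of comparing candidate models on a held-out validation set.
# Both models and all records are hypothetical illustrations.

def model_a(record):
    # Flags relapse risk for any prior relapse.
    return record["prior_relapses"] > 0

def model_b(record):
    # Flags relapse risk only after more than two prior relapses.
    return record["prior_relapses"] > 2

def accuracy(model, validation):
    # Fraction of validation records the model labels correctly.
    hits = sum(model(r) == r["relapsed"] for r in validation)
    return hits / len(validation)

validation = [
    {"prior_relapses": 0, "relapsed": False},
    {"prior_relapses": 1, "relapsed": True},
    {"prior_relapses": 2, "relapsed": True},
    {"prior_relapses": 1, "relapsed": False},
    {"prior_relapses": 3, "relapsed": True},
]

scores = {m.__name__: accuracy(m, validation) for m in (model_a, model_b)}
best = max(scores, key=scores.get)  # the model to carry forward
```

Note that neither candidate is perfect on this data, which is exactly what the fourth step above leads us to expect.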

There are many tools available for data mining and constructing models. One of the most popular is SAS Enterprise Miner; the platform is powerful and relatively easy to use. Weka is an open-source platform that supports the development of a variety of different algorithms. Rattle is a package available in the open-source analytics environment R and is also very powerful and diverse. Rattle also supports the Predictive Model Markup Language (PMML) for deploying data mining models. There are many other applications available.

Data mining has some terminologies that should be understood. A dataset is a collection of data. Oftentimes, a dataset will have multiple columns and many rows. In mathematical terms, this is referred to as a matrix, while in database terms it would be referred to as a table. The observations make up the rows of data, while the variables make up the columns. The dimension of a dataset is the number of observations, or rows, by the number of variables, or columns.
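The rows-by-columns idea can be made concrete with a tiny hypothetical dataset (all values below are made up for illustration):

```python
# A tiny hypothetical dataset: each row is an observation, each key a
# variable (column), so the dimension is rows x columns.
dataset = [
    {"age": 54, "weight_kg": 81, "systolic_bp": 132},
    {"age": 61, "weight_kg": 92, "systolic_bp": 148},
    {"age": 47, "weight_kg": 70, "systolic_bp": 118},
]

n_observations = len(dataset)      # number of rows
n_variables = len(dataset[0])      # number of columns
dimension = (n_observations, n_variables)
```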

Input variables are the measured data items. They can take many different forms: text, or numbers on a nominal, ordinal, interval, or ratio scale. Other names for input variables include predictors, covariates, independent variables, observed variables, or descriptive variables. Examples would be systolic blood pressure, diastolic blood pressure, medications, weight, age, gender, and so on. Output variables are those that are influenced by the input variables. They are also known as target, response, or dependent variables. An example might be a diagnosis of hypertension. We build models to predict the output variables in terms of the input variables. So if we were given data that includes systolic blood pressure, diastolic blood pressure, medications, weight, age, and gender, we could use that data as inputs for predicting the output of a diagnosis of hypertension.
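The hypertension example can be sketched as a single observation split into its inputs and its output (all field values here are hypothetical):

```python
# One hypothetical observation. Every field except the target is an input
# (predictor); the target is the output variable the model predicts.
record = {
    "systolic_bp": 148,
    "diastolic_bp": 95,
    "weight_kg": 92,
    "age": 61,
    "gender": "F",
    "hypertension": True,  # output (target) variable
}

target = "hypertension"
inputs = {k: v for k, v in record.items() if k != target}
output = record[target]
```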

There's one caveat. Some data mining models may not have any output variables. These are referred to as descriptive models; an example is clustering. We will get to these in a moment.

Identifiers are unique variables for a particular observation. They may include a patient's name or a patient ID. Categorical variables are ones that take on a single value and are discrete. They can be nominal, where there is no order to them (for example, eye color), or ordinal, where there is a natural order (for example, age groups). Numeric variables, also known as continuous variables, are values that are integers or real numbers (for example, weight).

There are three datasets that are used when constructing a model: training, validation, and testing datasets. The training dataset is the data that you use to build the initial models. The validation dataset assesses the performance of the model that you develop using the training dataset; this step helps fine-tune the model as appropriate. The testing dataset applies the refined model and assesses its expected performance on future datasets.

When developing a data mining model, you start with one large dataset and partition it into training, validation, and testing datasets. The partitioning is done by randomly assigning observations to one of the three datasets. The training set typically has more data than the others. For instance, we can partition a large dataset as follows: 70% of the observations go to the training dataset, 15% to the validation dataset, and 15% to the testing dataset.
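The 70/15/15 random partition just described can be sketched in pure Python; the 100 record IDs are stand-ins for real observations:

```python
import random

# Randomly partition one dataset into training (70%), validation (15%),
# and testing (15%) sets. The observations are hypothetical record IDs.
random.seed(42)  # fixed seed so the partition is repeatable

observations = list(range(100))  # e.g. 100 patient record IDs
random.shuffle(observations)     # random assignment via a shuffle

n_train = int(len(observations) * 0.70)
n_valid = int(len(observations) * 0.15)

training = observations[:n_train]
validation = observations[n_train:n_train + n_valid]
testing = observations[n_train + n_valid:]
```

Because the split is done by slicing one shuffled list, every observation lands in exactly one of the three datasets.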

The data mining process that is widely accepted is known as CRISP-DM, or the CRoss Industry Standard Process for Data Mining. The process includes six steps, from understanding the business all the way to deploying a model.

The slide on your screen shows a description of the six steps. The first step emphasizes business understanding: planning your data mining project so that it aligns with the organization's goals. The second is data understanding, so that you can assess the quality of the data and define each data element. Data preparation is next, where you select the relevant data, clean the data up, carry out basic descriptive statistics, and reformat the data as necessary. Modeling is next, where you construct a data model or several models. Evaluation is the step where you evaluate the performance of each of the models constructed and choose the best-performing model. Last is deployment, where you determine how you will deploy your model and present the findings to the necessary parties.

The CRISP-DM process relates very well to specific data mining tasks. For instance, business understanding relates to developing questions about the data and data selection. The data understanding step is where we explore the data. The data preparation step is where the data is transformed. The modeling step is where we choose and build a model. The evaluation step is where we validate and test our model. Finally, deployment is where we export the model.

When building a model, there are two main categories. The first is descriptive models, also known as unsupervised learning. These are models that are constructed when we do not have a target variable; they provide a representation of the knowledge discovered without necessarily modeling a specific outcome. An example of a descriptive model is a clustering analysis. Predictive models, or supervised learning, are those that can be developed when we have a target variable: we can predict the target variable with our given set of input variables. The goal of a predictive model is to extract knowledge from historic data and represent it in such a form that we can apply the resulting model to new situations. In that way, we are predicting the occurrence of an event of interest. The historic data will already be associated with the outcome, and the model can learn to make this association on future data. Common predictive algorithms include decision trees, boosting, and neural networks.
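The descriptive side can be illustrated with a minimal one-dimensional k-means clustering sketch: there is no target variable, and the algorithm simply discovers groups. The ages and starting centers below are hypothetical:

```python
# Minimal 1-D k-means sketch: a descriptive (unsupervised) model with no
# target variable. Data points are hypothetical patient ages.
def kmeans_1d(points, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

ages = [21, 23, 25, 62, 64, 68]
centers, clusters = kmeans_1d(ages, centers=[20.0, 70.0])
# Two groups emerge: younger patients and older patients.
```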

If the model is found effective and ready for use in real time, the next step is deployment. One method to deploy models is through a language called the Predictive Model Markup Language (PMML). It is an XML-based standard that is supported by many major commercial data mining vendors and many open-source data mining tools.
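To make the idea concrete, here is a sketch of scoring a record against a heavily simplified, PMML-style tree model. Real PMML documents carry considerably more structure (a PMML root element, Header, DataDictionary, MiningSchema), so treat this fragment, its field names, and its threshold as illustrative only:

```python
import xml.etree.ElementTree as ET

# A heavily simplified, PMML-style tree model embedded as XML.
# Illustrative only; real PMML documents carry more metadata.
MODEL = """
<TreeModel>
  <Node score="no_hypertension">
    <Node score="hypertension">
      <SimplePredicate field="systolic_bp" operator="greaterOrEqual" value="140"/>
    </Node>
  </Node>
</TreeModel>
"""

def score(record, xml_model):
    # Start at the root node's default score, then descend into any child
    # node whose predicate matches the record.
    root = ET.fromstring(xml_model).find("Node")
    result = root.get("score")
    for child in root.findall("Node"):
        pred = child.find("SimplePredicate")
        if pred.get("operator") == "greaterOrEqual" and \
           record[pred.get("field")] >= float(pred.get("value")):
            result = child.get("score")
    return result
```

Because the model travels as plain XML, the same file can be exported by one tool (for example Rattle) and scored by another, which is the point of the standard.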

Descriptive and/or predictive models can be used on specific datasets. Different models and algorithms have advantages and disadvantages. Therefore, it is recommended to construct multiple models and choose the best. Deployment of a successful model can be simple using PMML.

When considering the role of Health IT and Meaningful Use and the implications for data mining, data mining techniques have great potential for the development of clinical decision support systems and outbreak detection to foster better patient outcomes. Also, as the government invests more in health IT, the adoption of data mining approaches will become more of a priority. New ways of analyzing and interpreting the data will be sought after, and it is anticipated that data mining will take center stage.