OLSUG Workshop Oracle Data Mining


Charlie Berger
Sr. Director of Product Mgmt, Life Sciences and Data Mining
Oracle Corporation
charlie.berger@oracle.com

Dr. Lutz Hamel
Asst. Professor, Computer Science
University of Rhode Island
hamel@cs.uri.edu

Carolyn K. Hamm, Ph.D.
Chief, Decision Support Center
Walter Reed Army Medical Center
Washington, DC 20307
202 356-1012 x 40166
Carolyn.Hamm@NA.AMEDD.ARMY.MIL

Oracle Data Mining Workshop
  - Oracle Data Mining overview
  - Data mining process & example use cases
  - Explore, build, test, cluster, etc.
  - Clustering and more at URI

Oracle Data Mining
  Platform for data mining
    - PL/SQL API
    - Java API
    - Oracle Data Miner (GUI)
  Wide range of algorithms
    - Classification: Support Vector Machines, Naïve Bayes, Adaptive Bayes Networks
    - Attribute Importance
    - Association Rules
    - Clustering: Enhanced K-Means, Orthogonal Clustering
    - Nonnegative Matrix Factorization (feature extraction)
    - BLAST (sequence similarity search & alignment)

Oracle Data Mining Algorithms & Example Applications
  Attribute Importance
    - Identify the most influential attributes for a target attribute
    - Factors associated with a disease
    - Promising leads
    (Figure: attributes A1 to A7)
  Classification and Prediction
    - Predict who is most likely to act or respond:
      doctors who prescribe a new drug, patients who respond to a treatment
  Regression
    - Predict a numeric value
    - Predict by how much a tumor's size will be reduced

Oracle Data Mining Algorithms & Example Applications
  Clustering
    - Find naturally occurring groups
    - Gene clusters
    - Find disease subgroups
    - Distinguish normal from non-normal behavior
  Association Rules
    - Find co-occurring items
    - Suggest interactions
  Feature Extraction
    - Reduce a large dataset into representative new attributes
    - Useful for clustering and text mining
    (Figure: features F1 to F4)

Oracle Data Mining Algorithms & Example Applications
  Text Mining
    - Combine data and text for better models
    - Add unstructured text (e.g. physician's notes) to structured data (e.g. age, weight, height) to predict outcomes
    - Classify and cluster documents
    - Combined with Oracle Text to develop advanced text mining applications, e.g. Medline
  BLAST
    - Sequence matching and alignment
    - Find genes and proteins that are similar
    (Figure: example sequences)
      ATGCAATGCCAGGATTTCCA
      CTGCAAGGCCAGGAAGTTCCA
      ATGCGTTGCCAC ATTTCCA
      GGC..TGCAATGCCAGGATGACCA
      ATGCAATGTTAGGACCTCCA

10g Statistics & SQL Analytics
  Ranking functions
    - rank, dense_rank, cume_dist, percent_rank, ntile
  Window Aggregate functions (moving and cumulative)
    - avg, sum, min, max, count, variance, stddev, first_value, last_value
  LAG/LEAD functions
    - Direct inter-row reference using offsets
  Reporting Aggregate functions
    - sum, avg, min, max, variance, stddev, count, ratio_to_report
  Statistical Aggregates
    - Correlation, linear regression family, covariance
  Linear regression
    - Fits an ordinary-least-squares regression line to a set of number pairs
    - Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
  Descriptive Statistics
    - average, standard deviation, variance, min, max, median (via percentile_cont), mode, group-by & roll-up
    - DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- 3 sigma values, top/bottom 5 values
  Correlations
    - Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
  Cross Tabs
    - Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
  Hypothesis Testing
    - Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
  Distribution Fitting
    - Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test
    - Normal, Uniform, Weibull, Exponential
  Pareto Analysis (documented)
    - 80:20 rule, cumulative results table
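To make the linear regression and correlation items above concrete, here is a minimal Python sketch of the arithmetic behind an ordinary-least-squares fit and a Pearson correlation coefficient. It is only an illustration of the formulas, not Oracle's implementation; the function name `ols_and_corr` and the sample pairs are hypothetical.

```python
from math import sqrt

def ols_and_corr(pairs):
    """Return (slope, intercept, pearson_r) for a list of (x, y) pairs.

    Illustrative only: assumes at least two pairs and non-constant x and y.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of squared deviations and cross-deviations
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # line passes through the means
    pearson_r = sxy / sqrt(sxx * syy)      # Pearson correlation coefficient
    return slope, intercept, pearson_r

# Hypothetical number pairs lying exactly on y = 2x + 1
pairs = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
slope, intercept, pearson_r = ols_and_corr(pairs)
```

In the database the same quantities would come from the SQL aggregates listed on the slide, so the data never has to leave the table.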

Statistics
  - Enables analytic pipelines without moving data out to statistical packages for simple analyses (e.g. hypothesis testing)

Workshop Outline
  Explore the data
    - View data, simple graphs, ranges, etc.
    - Cluster the data (undirected) & look for interesting patterns
  Determine problem to be solved
    - What factors are associated with target 1, target 2, etc.
    - Predict patients likely to respond to treatment
  Data transformations
  Building Models
    - Classification Models: Build, Test, Apply
    - Throw out attributes, e.g. 100% correlations, etc.
    - Classification Models w/ unstructured data (text)
  Mining Activity Guides
  10gR2 Preview
    - Decision Trees (10gR2)
    - Anomaly Detection (10gR2)

Explore the data
  - View data, simple graphs, ranges, etc.
    - Lymphoma_7 data
    - Bpress default
    - Relative value data (WRMC)
  - Cluster the data (undirected) looking for interesting patterns

Determine problem
  - State the problem in terms of data mining
  - What factors are associated with target 1, target 2, etc.
  - Predict patients likely to respond to treatment
  - Find new disease subgroups

Building Models
  - Classification: Build, Test, Apply
    - Use SVM on Brain Tumor data w/o TEXT and SVM on Brain Tumor w/ TEXT
  - Throw out attributes, e.g. 100% correlations, etc.
    - Use Diabetes data
  - Classification Models w/ unstructured data (text)

Mining Activity Guides
  - Step-by-step guidance to achieve a goal and increase the likelihood of successful data mining

"The future of data mining lies in predictive analytics."
  Lou Agosta, "The Future of Data Mining: Predictive Analytics," DM Review Magazine, August 2004 issue
  http://www.dmreview.com/article_sub.cfm?articleid=1007209

What is Predictive Analytics?
  - "One click" data mining
    - Automatically selects appropriate algorithm
    - Automates all advanced algorithm settings
    - Automates Train, Test, and Apply steps
  - Power data analysts can use
    - Oracle Data Miner (wizard-driven GUI)
    - PL/SQL API
    - Java API
  - Concept: performing predictive analytics is better than doing nothing

Oracle Data Mining Algorithms & Example Applications
  Attribute Importance (PA "Easy Button": Explain)
    - Identify the most influential attributes for a target attribute
    - Factors associated with a disease
    - Promising leads
    (Figure: attributes A1 to A7)
  Classification and Prediction (PA "Easy Button": Predict, for classification & regression)
    - Predict who is most likely to act or respond:
      doctors who prescribe a new drug, patients who respond to a treatment
  Regression
    - Predict a numeric value
    - Predict the size of a tumor reduction

Life Sciences Oracle Data Miner 10g Release 2 Preview

Oracle Data Mining Decision Trees
  Decision Trees
    - Popular algorithm
    - Human-readable rules
    - Builds classification trees in the database
    - Parallel implementation
  Problem: Find profiles of high-risk patients
  (Figure: decision tree splitting on Status, Age, Temp, Gender, and Days ICU, with leaves Risk = 0 or Risk = 1)
  Example rule:
    IF (Age > 45 AND Status = Infection AND Temp > 100)
    THEN P(High Risk = 1) = .77, Support = 250
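The example rule on this slide can be read as a simple predicate over a patient record. The Python sketch below applies just that one rule; the record layout and the function name `high_risk_probability` are assumptions made for illustration, and the remaining branches of the tree are not reproduced because the slide does not spell them out.

```python
def high_risk_probability(patient):
    """Apply the single rule quoted on the slide.

    Returns 0.77 when the rule fires, or None when this rule does not
    cover the patient (other branches of the tree are not modeled here).
    """
    if (patient["age"] > 45
            and patient["status"] == "Infection"
            and patient["temp"] > 100):
        return 0.77
    return None

# Hypothetical patient records
covered = {"age": 62, "status": "Infection", "temp": 101.3}
not_covered = {"age": 30, "status": "Infection", "temp": 101.3}

p_covered = high_risk_probability(covered)
p_not_covered = high_risk_probability(not_covered)
```

Rules in this form are what makes a tree "human readable": each root-to-leaf path becomes one IF/THEN statement with a probability and a support count.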

Oracle Data Mining 10g Release 2 New Features
  Anomaly Detection
    - One-Class Classification
    - Builds SVM classification models where only one class (e.g. 0's) exists
    - Network intrusion detection
    - Disease outbreaks
    - Outlier detection
    - Rare events, true novelty
  Problem: Detect rare cases
  (Figure: scatter plot over X1 and X2)
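The idea of one-class classification is to learn a profile from examples of a single class and then flag anything that falls outside it. The sketch below is deliberately not a one-class SVM: it swaps in a much simpler centroid-and-radius profile, purely to show the train-on-normal, flag-the-rest workflow. All names and data points are hypothetical.

```python
from math import dist  # available in Python 3.8+

def fit_normal_profile(points):
    """Learn a centroid and enclosing radius from 'normal' points only."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    radius = max(dist(p, centroid) for p in points)
    return centroid, radius

def is_anomaly(point, centroid, radius):
    """Flag a point that lies farther from the centroid than any training point."""
    return dist(point, centroid) > radius

# Hypothetical training data: only the normal class is available
normal_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.0, 1.2)]
centroid, radius = fit_normal_profile(normal_points)

rare_case_flag = is_anomaly((5.0, 5.0), centroid, radius)
typical_case_flag = is_anomaly((1.0, 1.0), centroid, radius)
```

A one-class SVM plays the same role but learns a flexible boundary around the normal data instead of a single sphere.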

Oracle Data Mining 10g Release 2 New Features (Continued)
  Oracle Predictive Analytics PL/SQL Packages (available now on OTN)
    - EXPLAIN and PREDICT PL/SQL packages completely automate data mining
    - Oracle Spreadsheet Add-In for Predictive Analytics on OTN
      http://www.oracle.com/technology/products/bi/odm/pa-addin/odm_pred_analytics_addin.html
  Prediction Operator: SQL-Level Data Mining Capability
    - Fast, SQL-level data mining prediction ("Apply") functions that can be used to pipeline predictions, e.g.
      Select customers where Churner_predicted > .80 AND Customer_value_prediction > $500 AND Response_likehood > .6
  Java Data Mining (JDM) Compliant Java API
    - Oracle Database 10g R2 provides a Java Data Mining (JDM) JSR-73 compliant Java API
    - Implemented on top of the DBMS_DATA_MINING PL/SQL API; it unifies the overall product, enabling interoperability of mining models between APIs
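The "pipelining" example on this slide combines several model outputs in one selection. As a rough illustration of that logic only (not of the SQL prediction operators themselves), the Python sketch below filters hypothetical customer rows on three precomputed scores using the thresholds quoted on the slide. The field names and sample values are assumptions.

```python
# Hypothetical rows: each score is assumed to have been produced by a separate model
customers = [
    {"id": 1, "churner_predicted": 0.91, "customer_value_prediction": 750.0, "response_likelihood": 0.72},
    {"id": 2, "churner_predicted": 0.85, "customer_value_prediction": 300.0, "response_likelihood": 0.90},
    {"id": 3, "churner_predicted": 0.40, "customer_value_prediction": 900.0, "response_likelihood": 0.65},
]

def select_targets(rows):
    """Keep customers who pass all three thresholds from the slide."""
    return [
        row["id"]
        for row in rows
        if row["churner_predicted"] > 0.80
        and row["customer_value_prediction"] > 500
        and row["response_likelihood"] > 0.6
    ]

target_ids = select_targets(customers)
```

In the database, the equivalent filter would sit in a WHERE clause, so scoring and selection happen in a single query.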

Oracle Data Mining 10g Release 2
  Updated Oracle Data Miner (GUI)
    - Ability to mine text columns
    - Anomaly detection
    - Decision Trees
    - Predictive Analytics ("one click" data mining)

Oracle Data Mining 10g Release 2 Decision Trees (10gR2) Anomaly Detection (10gR2)


Questions and Answers

Additional Life Sciences Use Case Slides

Life Sciences Use Cases
  1. Gene expression analysis
  2. Clinical treatment outcome analysis
  3. Classification of Multiple Tumor Types
  4. Medline text mining

Oracle Data Mining in the Life Sciences: Gene expression analysis
  Problem 1
    Given thousands of gene expression values for each patient, can a small subset of the expressions be identified that can be used to distinguish one type of leukemia from another?
  Solution
    - Apply ODM's Attribute Importance algorithm to the data to decrease the size of the problem
    - Build an Adaptive Bayes Network classification model to predict disease type from the gene expressions

Oracle Data Mining in the Life Sciences: Gene expression analysis
  Top Genes (of ~7000) for Classifying Leukemia

  Gene Expression    Relative Importance
  V00594_s_at        0.298955976210004
  D43950_at          0.292217965904811
  U34038_at          0.227177556507829
  J03827_at          0.227177556507829
  U64863_at          0.227177556507829
  S85655_at          0.175469338594625
  L07758_at          0.17031674247889
  U19345_at          0.17031674247889
  U89336_cds4_at     0.125995412839
  U79295_at          0.125995412839
  HG311-HT311_at     0.125995412839
  V00599_s_at        0.125995412839

Oracle Data Mining in the Life Sciences: Gene expression analysis
  ABN Model Predictions: Lymphoid Leukemia (LL) vs. Myeloid Leukemia (ML)

                 Predicted LL   Predicted ML
  Actual LL           19              1
  Actual ML            2             12

  Test set accuracy: 91.2%
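The reported test set accuracy follows directly from the confusion matrix: correct predictions (the diagonal) divided by all test cases. A short Python check, using only the four counts from the slide (the variable names are illustrative):

```python
# Confusion matrix from the slide, keyed by (actual, predicted)
confusion = {
    ("LL", "LL"): 19,
    ("LL", "ML"): 1,
    ("ML", "LL"): 2,
    ("ML", "ML"): 12,
}

# Diagonal cells are the correctly classified cases
correct = sum(count for (actual, predicted), count in confusion.items()
              if actual == predicted)
total = sum(confusion.values())
accuracy = correct / total
accuracy_percent = round(accuracy * 100, 1)
```

Here 31 of 34 test cases are classified correctly, which rounds to the 91.2% shown on the slide.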

Oracle Data Mining in the Life Sciences: Clinical treatment outcome analysis
  Problem 2
    Is it possible to classify treatments that are most effective in causing improvement in clinical patients suffering from a given disease?
  Solution
    - Use Attribute Importance to rank the treatment factors
    - Use Association Rules to establish correlations between treatment and outcome
  Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004

Oracle Data Mining in the Life Sciences: Clinical treatment outcome analysis
  Factors associated with positive diabetes outcomes
    1. DRUG_TYPE
    2. COMPLETE_HISTORY_RECORDED (Scorecard)
    3. NUM_HOSPITAL_ADMISSIONS
    4. GENDER
    5. NUM_VISITS_TO_PROVIDER
    6. INSURANCE_TYPE
    7. BLOOD_PRESSURE_GOAL
    8. WEIGHT_GOAL
    9. LDL_GOAL
    10. PROVIDER_TYPE
  Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004

Oracle Data Mining in the Life Sciences: Clinical treatment outcome analysis
  Sample Association Rules

  If                                                                | Then OUTCOME   | Percentage of Cases
  NUM_HOSPITAL_ADMISSIONS=0                                         | NO_IMPROVEMENT | 0.5463472
  NUM_VISITS_TO_PROVIDER>5                                          | IMPROVEMENT    | 0.37195602
  NUM_HOSPITAL_ADMISSIONS=0 and NUM_VISITS_TO_PROVIDER>5            | IMPROVEMENT    | 0.36252946
  DRUG_GROUP=2 and NUM_HOSPITAL_ADMISSIONS=0                        | NO_IMPROVEMENT | 0.3267871
  DRUG_GROUP=2 and NUM_VISITS_TO_PROVIDER>5                         | NO_IMPROVEMENT | 0.30911234
  NUM_HOSPITAL_ADMISSIONS=0 and GENDER=FEMALE                       | NO_IMPROVEMENT | 0.28711703
  COMPLETE_HISTORY_RECORDED=NO                                      | NO_IMPROVEMENT | 0.28436762
  NUM_HOSPITAL_ADMISSIONS=0 and COMPLETE_HISTORY_RECORDED=Yes       | IMPROVEMENT    | 0.22663

  Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004
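Association rules of this "If ... Then ..." form are usually scored by support (the fraction of all cases containing both sides) and confidence (the fraction of cases matching the "If" side that also match the "Then" side). The "Percentage of Cases" column appears to correspond to support, although the slide does not say so; treat that reading as an assumption. The Python sketch below computes both measures on a tiny, made-up set of patient records (not the Walter Reed data).

```python
# Hypothetical patient records, each a set of attribute=value items
records = [
    {"NUM_VISITS_TO_PROVIDER>5", "IMPROVEMENT"},
    {"NUM_VISITS_TO_PROVIDER>5", "IMPROVEMENT"},
    {"NUM_VISITS_TO_PROVIDER>5", "NO_IMPROVEMENT"},
    {"NUM_HOSPITAL_ADMISSIONS=0", "NO_IMPROVEMENT"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    return sum(itemset <= row for row in rows) / len(rows)

def confidence(antecedent, consequent, rows):
    """Of the rows matching the antecedent, the fraction also matching the consequent."""
    return support(antecedent | consequent, rows) / support(antecedent, rows)

rule_if = {"NUM_VISITS_TO_PROVIDER>5"}
rule_then = {"IMPROVEMENT"}

rule_support = support(rule_if | rule_then, records)
rule_confidence = confidence(rule_if, rule_then, records)
```

On this toy data the rule holds in 2 of 4 records (support 0.5) and in 2 of the 3 records that match the "If" side (confidence about 0.67).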

Oracle Data Mining in the Life Sciences: Classification of Multiple Tumor Types
  DNA Microarray Data
    - Multiple examples of tumor tissue (public data from Broad Institute/MIT)
    - We feed multiple cancer types data into the Oracle DB: 16,063 genes, 144 cancer patients and 10 samples per class
  Oracle Data Mining
    - We mine the data using Support Vector Machines and create the confusion matrix
    - 78.25% accuracy

  Confusion matrix (Actual\Predicted; Green = Correct, Red = Errors)
  Predicted columns: BR PR LU CO LY BL ML UT LE RE PA OV MS BR
  Nonzero counts per actual class (column positions were not preserved):
    BREAST-BR         1 1
    PROSTATE-PR       1 1
    LUNG-LU           1 2
    COLON-CO          3
    LYMPHOMA-LY       6
    BLADDER-BL        1 2
    MELANOMA-ML       1 1
    UTERUS-UT         2
    LEUKEMIA-LE       1 5
    RENAL-RE          3
    PANCREAS-PA       1 2
    OVARY-OV          1 2
    MESOTHELIOMA-MS   3
    BRAIN-BR          4

Oracle Data Mining in the Life Sciences: Classification of Multiple Tumor Types
  - Multiple examples of 14 tumor types
  - Training set: 144 samples. Test set: 46 samples
  - Microarray gene expression profiles: 7,129 genes (features)
  - Can we build a model to distinguish between multiple tumor types?

  Tumor Class          # Train   # Test
  Breast (BR)              8        3
  Prostate (PR)            8        2
  Lung (LU)                8        3
  Colorectal (CO)          8        5
  Lymphoma (LY)           16        6
  Bladder (BL)             8        3
  Melanoma (ML)            8        2
  Uterus (UT)              8        2
  Leukemia (LE)           24        6
  Renal (RE)               8        3
  Pancreas (PA)            8        3
  Ovary (OV)               8        3
  Mesothelioma (MS)        8        3
  Brain (BR)              16        4

Oracle Data Mining in the Life Sciences: Classification of Multiple Tumor Types
  Workflow (input: Multi-Tumor Dataset; output: Prediction Results)

  Step                                                   | Oracle Task
  Read into RDBMS as table                               | SQLLDR
  Data preparation (scaling)                             | SQL query
  Build SVM model (training), using train tumor labels   | ODM Model Build
  Evaluate model on test set, using test tumor labels    | ODM Model Apply

Oracle Data Mining in the Life Sciences: Classification of Multiple Tumor Types
  - The datasets were downloaded from the web site and stored in flat files prior to loading them into the Oracle database
  - The data was loaded using SQLLDR to create a fact table of the following format:

      column   type
      sid      NUMBER
      gene     VARCHAR2(30)
      expr     NUMBER

  - Rescaling: the values were divided by a constant (10000) to make them into small numbers near 1 (to keep the dot products between all samples in the dataset inside the [-1, 1] range)
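The rescaling step is a plain division by a constant, applied so that the dot products the SVM computes between samples stay small. The Python sketch below shows the effect on two short, made-up expression vectors; the values and names are hypothetical, and real samples have thousands of genes, so whether every dot product lands inside [-1, 1] depends on the actual data.

```python
SCALE = 10000.0  # the constant quoted on the slide

def rescale(values):
    """Divide every raw expression value by the scaling constant."""
    return [v / SCALE for v in values]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical raw expression values for two samples (three genes each)
sample_a = [1200.0, -300.0, 4500.0]
sample_b = [800.0, 150.0, 2000.0]

raw_dot = dot(sample_a, sample_b)                       # large magnitude
scaled_dot = dot(rescale(sample_a), rescale(sample_b))  # small magnitude
```

Because both vectors are divided by the same constant, the scaled dot product is the raw one divided by 10000 squared, which is what pulls it into a numerically convenient range.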

Oracle Data Mining in the Life Sciences: Classification of Multiple Tumor Types
  - Entire methodology implemented in Oracle Database
  - The SVM model works with all 7,129 input features (genes) and does not require feature selection
  - The SVM model is relatively fast: 9 minutes training time on a 500 MHz Netra
  - The SVM is very accurate for multi-tumor molecular classification: 78.25% accuracy
  - Comparable to the published results in Ramaswamy et al., PNAS 2001, which also reported k-NN = 63% and Weighted Voting = 46% accuracy

Oracle Data Mining in the Life Sciences: Classification of Multiple Tumor Types
  Results: 78.25% accuracy
    - Oracle Data Mining's SVM models are able to accurately predict the multi-class tumor problem with 78.25% accuracy

  Confusion matrix (Actual\Predicted; Green = Correct, Red = Errors)
  Predicted columns: BR PR LU CO LY BL ML UT LE RE PA OV MS BR
  Nonzero counts per actual class (column positions were not preserved):
    BREAST-BR         1 1
    PROSTATE-PR       1 1
    LUNG-LU           1 2
    COLON-CO          3
    LYMPHOMA-LY       6
    BLADDER-BL        1 2
    MELANOMA-ML       1 1
    UTERUS-UT         2
    LEUKEMIA-LE       1 5
    RENAL-RE          3
    PANCREAS-PA       1 2
    OVARY-OV          1 2
    MESOTHELIOMA-MS   3
    BRAIN-BR          4