OLSUG Workshop Oracle Data Mining Charlie Berger Sr. Director of Product Mgmt, Life Sciences and Data Mining Oracle Corporation charlie.berger@oracle.com Dr. Lutz Hamel Asst. Professor, Computer Science University of Rhode Island hamel@cs.uri.edu Carolyn K. Hamm, Ph.D. Chief, Decision Support Center Walter Reed Army Medical Center Washington, DC 20307 202 356-1012 x 40166 Carolyn.Hamm@NA.AMEDD.ARMY.MIL
Oracle Data Mining Workshop Oracle Data Mining overview Data mining process & example use cases Explore, build, test, cluster, etc. Clustering and more at URI
Oracle Data Mining Platform for data mining PL/SQL API Java API Oracle Data Miner (GUI) Wide range of algorithms Classification Support Vector Machines, Naïve Bayes, Adaptive Bayes Networks Attribute Importance Association Rules Clustering Enhanced K-Means, Orthogonal Clustering Nonnegative Matrix Factorization (feature extraction) BLAST (Sequence similarity search & alignment)
Oracle Data Mining Algorithms & Example Applications Attribute Importance Identify most influential attributes for a target attribute Factors associated a disease Promising leads Classification and Prediction Predict most likely to: Regression Doctors who prescribe a new drug Patients who respond to a treatment Predict a numeric value Predict a value Predict the size tumor will be reduced A1 A2 A3 A4 A5 A6 A7
Oracle Data Mining Algorithms & Example Applications Clustering Find naturally occurring groups Gene clusters Find disease subgroups Distinguish normal from non-normal behavior Association Rules Find co-occurring items Suggest interactions Feature Extraction Reduce a large dataset into representative new attributes Useful for clustering and text mining F1 F2 F3 F4
Oracle Data Mining Algorithms & Example Applications Text Mining Combine data and text for better models Add unstructured text e.g. physician s notes to structured data e.g. age, weight, height, etc., to predict outcomes Classify and cluster documents Combined with Oracle Text to develop advanced text mining applications e.g. Medline BLAST Sequence matching and alignment Find genes and proteins that are similar ATGCAATGCCAGGATTTCCA CTGCAAGGCCAGGAAGTTCCA ATGCGTTGCCAC ATTTCCA GGC..TGCAATGCCAGGATGACCA ATGCAATGTTAGGACCTCCA
10g Statistics & SQL Analytics Ranking functions rank, dense_rank, cume_dist, percent_rank, ntile Window Aggregate functions (moving and cumulative) Avg, sum, min, max, count, variance, stddev, first_value, last_value LAG/LEAD functions Direct inter-row reference using offsets Reporting Aggregate functions Sum, avg, min, max, variance, stddev, count, ratio_to_report Statistical Aggregates Correlation, linear regression family, covariance Linear regression Fitting of an ordinary-least-squares regression line to a set of number pairs. Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions. Descriptive Statistics average, standard deviation, variance, min, max, median (via percentile_count), mode, group-by & roll-up DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- 3 sigma values, top/bottom 5 values Correlations Pearson s correlation coefficients, Spearman's and Kendall's (both nonparametric). Cross Tabs Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa Hypothesis Testing Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov- Smirnov test, One-way ANOVA Distribution Fitting Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi- Squared Test, Normal, Uniform, Weibull, Exponential Pareto Analysis (documented) 80:20 rule, cumulative results table
Statistics Enables analytic pipelines without removing data to statistical packages for simple analyses (e.g. hypothesis testing)
Workshop Outline Explore the data View data, simple graphs, ranges, etc. Cluster the data (undirected) & look for interesting patterns Determine problem to be solved What factors are associated with target 1, target 2, etc. Predict patients likely to respond to treatment Data transformations Building Models Classification Models Build, Test, Apply Throw out attributes e.g. 100% correlations etc. Classification Models w/ unstructured data (text) Mining Activity Guides 10gR2 Preview Decision Trees (10gR2) Anomaly Detection (10gR2)
Explore the data View data, simple graphs, ranges, etc. Lymphoma_7 data Bpress default Relative value data (WRMC) Cluster the data (undirected) looking for interesting patterns
Determine problem State the problem in terms of data mining What factors are associated with target 1, target 2, etc. Predict patients likely to respond to treatment Find new disease subgroups
Building Models Classification Build, Test, Apply Use SVM on Brain Tumor data w/o TEXT and SVM on Brain Tumor w/ TEXT Throw out attributes e.g. 100% correlations etc. Use Diabetes data Classification Models w/ unstructured data (text)
Mining Activity Guides Step by step guidance to achieve a goal and increase the likelihood of successful data mining
The future of data mining lies in predictive analytics. The Future of Data Mining Predictive Analytics Article published in DM Review Magazine August 2004 Issue By Lou Agosta http://www.dmreview.com/article_sub.cfm?articleid=1007209
What is Predictive Analytics? One click data mining Automatically selects appropriate algorithm Automates all advanced algorithm settings Automates Train, Test, and Apply steps Power data analysts can use Oracle Data Miner (wizard driven gui) PL/SQL API Java API Concept of performing predictive analytics is better than doing nothing
Oracle Data Mining Algorithms & Example Applications Attribute Importance Identify most influential attributes for a target Explain attribute PA Easy Button Factors associated Attribute Importance a disease Promising leads A1 A2 A3 A4 A5 A6 A7 Classification and Prediction Predict most likely to: Doctors who prescribe a new drug Patients who respond to a treatment Regression Predict PA Easy Button Classification & Regression Predict a numeric value Predict a value Predict the size tumor reduction
Life Sciences Oracle Data Miner 10g Release 2 Preview
Oracle Data Mining Decision Trees Decision Trees Popular algorithm Human readable rules Builds classification trees in Database Parallel implementation Status Age >45 <45 Age No Infection Infection >35 <=35 Temp Gender Days ICU <100 >100 F M >4 Problem: Find profiles of high risk patients Risk = 0 Risk = 1 Risk = 0 Risk = 1 Risk = 0 IF (Age > 45 AND Status = Infection AND Temp = >100) THEN P(High Risk=1) =.77 Support = 250 <=4 Risk = 1
Oracle Data Mining 10g Release 2 New Features Anomaly Detection One-Class Classification Builds SVM classification models where only one class e.g. 0 s exists Network intrusion detection Disease outbreaks Outlier detection Rare events, true novelty X2 X1 Problem: Detect rare cases
Oracle Data Mining 10g Release 2 New Features (Continued) Oracle Predictive Analytics PL/SQL Packages (Available now on OTN) EXPLAIN and PREDICT PL/SQL packages completely automate data mining Oracle Spreadsheet Add-In for Predictive Analytics on OTN http://www.oracle.com/technology/products/bi/odm/pa-addin/odm_pred_analytics_addin.htmlprediction Operator SQL-Level Data Mining Capability Prediction Operator SQL-Level Data Mining Capability Fast, SQL-level data mining prediction ( Apply ) functions that can be used to pipeline predictions e.g. Select customers where Churner_predicted >.80 AND Customer_value_prediction > $500 AND Response_likehood >.6 Java Data Mining (JDM) Compliant Java API Oracle Database 10g R2 provides a Java Data Mining (JDM) JSR-73 compliant Java API Implemented on top of the DBMS_DATA_MINING PL/SQL API and unifies the overall product, enabling interoperability of mining models between APIs
Oracle Data Mining 10g Release 2 Updated Oracle Data Miner (GUI) Ability to mine text column Anomaly detection Decision Trees Predictive Analytics ( one click data mining)
Oracle Data Mining 10g Release 2 Decision Trees (10gR2) Anomaly Detection (10gR2)
Oracle Data Mining 10g Release 2 Decision Trees (10gR2) Anomaly Detection (10gR2)
Q U E S T I O N S A N S W E R S
Additional Life Sciences Use Case Slides
Life Sciences Use Cases 1. Gene expression analysis 2. Clinical treatment outcome analysis 3. Classification of Multiple Tumor Types 4. Medline text mining
Oracle Data Mining in the Life Sciences Gene expression analysis Problem 1 Given thousands of gene expression values for each patient, can a small subset of the expressions be identified that can be used to distinguish one type of leukemia from another? Solution Apply ODM s Attribute Importance algorithm to the data to decrease the size of the problem Build an Adaptive Bayes Network Classification model to predict disease type from the gene expressions
Oracle Data Mining in the Life Sciences Gene expression analysis Top Genes (of ~7000) for Classifying Leukemia Gene Expression Relative Importance V00594_s_at 0.298955976210004 D43950_at 0.292217965904811 U34038_at 0.227177556507829 J03827_at 0.227177556507829 U64863_at 0.227177556507829 S85655_at 0.175469338594625 L07758_at 0.17031674247889 U19345_at 0.17031674247889 U89336_cds4_at 0.125995412839 U79295_at 0.125995412839 HG311-HT311_at 0.125995412839 V00599_s_at 0.125995412839
Oracle Data Mining in the Life Sciences Gene expression analysis ABN Model Predictions Lymphoid Leukemia vs. Myeloid Leukemia Predicted LL ML Actual LL 19 1 ML 2 12 Test set accuracy: 91.2%
Oracle Data Mining in the Life Sciences Clinical treatment outcome analysis Problem 2 Is it possible to classify treatments that are most effective in causing improvement in clinical patients suffering from a given disease? Solution Use Attribute Importance to rank the treatment factors Use Association Rules to establish correlations between treatment and outcome Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004
Oracle Data Mining in the Life Sciences Clinical treatment outcome analysis Factors associated with positive diabetes outcomes 1. DRUG_TYPE 2. COMPLETE_HISTORY_RECORDED (Scorecard) 3. NUM_HOSPITAL_ADMISSIONS 4. GENDER 5. NUM_VISITS_TO_PROVIDER 6. INSURANCE_TYPE 7. BLOOD_PRESSURE_GOAL 8. WEIGHT_GOAL 9. LDL_GOAL 10.PROVIDER_TYPE Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004
Oracle Data Mining in the Life Sciences Clinical treatment outcome analysis Sample Association Rules If Then OUTCOME Percentage of Cases NUM_HOSPITAL_ADMISSIONS=0 NO_IMPROVEMENT 0.5463472 NUM_VISITS_TO_PROVIDER>5 IMPROVEMENT 0.37195602 NUM_HOSPITAL_ADMISSIONS =0 and NUM_VISITS_TO_PROVIDER>5 IMPROVEMENT 0.36252946 DRUG_GROUP=2 and NUM_HOSPITAL_ADMISSIONS =0 NO_IMPROVEMENT 0.3267871 DRUG_GROUP=2 and NUM_VISITS_TO_PROVIDER>5 NO_IMPROVEMENT 0.30911234 NUM_HOSPITAL_ADMISSIONS =0 and GENDER=FEMALE NO_IMPROVEMENT 0.28711703 COMPLETE_HISTORY_RECORDED=NO NO_IMPROVEMENT 0.28436762 NUM_HOSPITAL_ADMISSIONS =0 and COMPLETE_HISTORY_RECORDED=Yes IMPROVEMENT 0.22663 Source: Walter Reed Medical Center, Dr. Carolyn Hamm, presentation at Oracle Life Sciences User Group Meeting, June 2004
Oracle Data Mining in the Life Sciences Classification of Multiple Tumor Types DNA Microarray Data We feed multiple cancer types data into the Oracle DB: 16,063 genes, 144 cancer patients and 10 samples per class. Oracle Data Mining Actual\Predicted BR PR LU CO LY BL ML UT LE RE PA OV MS BR BREAST-BR 1 1 PROSTATE-PR 1 1 LUNG-LU 1 2 We mine the data using Support Vector Machines and create the confusion matrix COLON-CO 3 LYMPHOMA-LY 6 BLADDER-BL 1 2 78.25% accuracy MELANOMA-ML 1 1 UTERUS-UT 2 LEUKEMIA-LE 1 5 RENAL-RE 3 PANCREAS-PA 1 2 OVARY-OV 1 2 MESOTHELIOMA- 3 MS BRAIN-BR 4 Green=Correct Red=Errors Multiple Examples of tumor tissue (public data from Broad Institute/MIT)
Oracle Data Mining in the Life Sciences Classification of Multiple Tumor Types Multiple examples of 14 tumor types Training set: 144 samples. Test set: 46 samples Microarrays gene expression profiles: 7,129 genes (features) Can we build a model to distinguish between multiple tumor types? Tumor Class # Train # Test Tumor Class # Train # Test Breast (BR) 8 3 Uterus (UT) 8 2 Prostate (PR) 8 2 Leukemia (LE) 24 6 Lung (LU) 8 3 Renal (RE) 8 3 Colorectal (CO) 8 5 Pancreas (PA) 8 3 Lymphoma (LY) 16 6 Ovary (OV) 8 3 Bladder (BL) 8 3 Mesothelioma (MS) 8 3 Melanoma (ML) 8 2 Brain (BR) 16 4
Oracle Data Mining in the Life Sciences Classification of Multiple Tumor Types Multi-Tumor Dataset Oracle Task Read into RDMS as Table SQLLDR Data Preparation (Scaling) SQL query Tumor Labels (Train) Build SVM Model (Training) ODM Model Build Tumor Labels (Test) Evaluate Model on Test Set ODM Model Apply Prediction Results
Oracle Data Mining in the Life Sciences Classification of Multiple Tumor Types The datasets were downloaded from the web site and stored in flat files prior to loading them to the Oracle database The data was loaded using SQLLDR to create a fact table of the following format: column type sid gene expr NUMBER VARCHAR2(30) NUMBER Rescaling: the values were divided by a constant (10000) to make them into small numbers near 1 (to keep the dot products between all samples in the dataset inside the [-1, 1] range
Oracle Data Mining in the Life Sciences Classification of Multiple Tumor Types Entire methodology implemented in Oracle Database The SVM model works with all 7,129 input features (genes) genes and do not require feature selection. The SVM model is relatively fast: 9 minutes training time on 500MHz Netra. The SVM is very accurate for multi-tumor molecular classification: 78.25% accuracy Comparable to published results in Ramaswamy et al PNAS 2001 paper, they also found that k-nn = 63% and Weighted Voting = 46% accuracy
Oracle Data Mining in the Life Sciences Classification of Multiple Tumor Types Results: 78.25% accuracy Actual\Predicted BR PR LU CO LY BL ML UT LE RE PA OV MS BR BREAST-BR 1 1 PROSTATE-PR 1 1 LUNG-LU 1 2 COLON-CO 3 LYMPHOMA-LY 6 BLADDER-BL 1 2 MELANOMA-ML 1 1 UTERUS-UT 2 Green=Correct LEUKEMIA-LE 1 5 RENAL-RE 3 PANCREAS-PA 1 2 OVARY-OV 1 2 Red=Errors Oracle Data Mining s SVM models are able to accurately predict the multi-class tumor problem with 78.25% accuracy. MESOTHELIOMA- 3 MS BRAIN-BR 4