TURKISH ORACLE USER GROUP Data Mining in 30 Minutes Husnu Sensoy Global Maksimum Data & Information Tech. Founder VLDB Expert
Agenda: Who am I?; Different problems of Data Mining; In-database data mining?!?; German Credit; Ad-hoc Attacks; DBMS_PREDICTIVE_ANALYTICS; Attribute Importance; Training/Validation Set; Training; Evaluation; Conclusion.
Who am I? Co-Founder of TROUG; Oracle ACED on BI; DBA of the Year 2009; Senior Member of Oracle DWH CAB; Exadata Implementation Specialist.
Data Mining: Data, Information, Knowledge. In DWH 1.0 we accumulated a sufficient number of columns and rows. Classical reporting is nothing but rotating, folding, cutting, and pasting the same data again and again. It is just DATA TRANSFORMATION; the user has to infer the information and knowledge, if lucky. Data Mining is all about creating information and insight about your business. Data Scientists are, or will be, the actual founders of the BI environment we had in mind a few decades ago.
Different Problems of Data Mining: Classification; Regression; Outlier Detection; Basket Analysis; Social Network Analysis; Sentiment Analysis.
In-Database Data Mining: 90% of data mining is about finding the correct inputs. Contrary to common belief, using fancy algorithms will not improve your results by large factors. Finding the correct inputs is a matter of Join, Group By, and Densification. Database management systems are still the best place to handle those operations.
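To make the three operations concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of an Oracle database. The tables (customers, months, payments) and their rows are hypothetical and invented purely for illustration. The cross join densifies the sparse payment data to one row per customer and month, the left join attaches the payments, and the group by aggregates them into candidate model inputs.

```python
import sqlite3


def build_inputs():
    """Return one densified row per (customer, month) with aggregated payments."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema and data, invented for this sketch.
    con.executescript("""
        CREATE TABLE customers (cust_id INTEGER PRIMARY KEY);
        CREATE TABLE months (month TEXT PRIMARY KEY);
        CREATE TABLE payments (cust_id INTEGER, month TEXT, amount REAL);
        INSERT INTO customers VALUES (1), (2);
        INSERT INTO months VALUES ('2012-01'), ('2012-02'), ('2012-03');
        INSERT INTO payments VALUES
            (1, '2012-01', 100.0), (1, '2012-01', 50.0), (1, '2012-03', 70.0),
            (2, '2012-02', 20.0);
    """)
    rows = con.execute("""
        SELECT c.cust_id,
               m.month,
               COUNT(p.amount)              AS n_payments,
               COALESCE(SUM(p.amount), 0.0) AS total_amount
        FROM customers c
        CROSS JOIN months m                 -- densification: every customer x month
        LEFT JOIN payments p                -- join: attach payments where they exist
               ON p.cust_id = c.cust_id AND p.month = m.month
        GROUP BY c.cust_id, m.month         -- group by: one input row per pair
        ORDER BY c.cust_id, m.month
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in build_inputs():
        print(row)
```

Months without any payment still appear, with a count and total of zero, which is the point of densification: the model sees "no activity" as an explicit input rather than a missing row.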
German Credit Scoring SOLVING A SAMPLE PROBLEM
Details of Sample Data: 20 different inputs. A few examples: status of existing checking account, credit history, purpose, credit amount. Details: http://archive.ics.uci.edu/ml/datasets/statlog+%28german+credit+data%29 Classification target: 1 for good credit, 2 for bad credit.
Ad-hoc Attacks: The first trials are always (and should be) ad hoc. What is the distribution of good and bad candidates? (Prior) Do we need any stratified sampling? What is the distribution of good and bad candidates given that a variable X takes value Y? (Posterior) What is the correlation between each variable and the target value?
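The prior and posterior questions above can be sketched in a few lines of plain Python. The rows in SAMPLE are invented for illustration and are not taken from the German Credit data; the attribute name "checking" and its values are hypothetical, while the target coding (1 for good, 2 for bad) follows the slide.

```python
from collections import Counter, defaultdict

# Invented toy rows; "checking" and its values are hypothetical.
SAMPLE = [
    {"checking": "negative", "class": 2},
    {"checking": "negative", "class": 2},
    {"checking": "negative", "class": 1},
    {"checking": "none", "class": 1},
    {"checking": "none", "class": 1},
    {"checking": "none", "class": 1},
    {"checking": "positive", "class": 1},
    {"checking": "positive", "class": 2},
]


def prior(rows, target):
    """Distribution of the target over all rows: P(target)."""
    counts = Counter(r[target] for r in rows)
    n = sum(counts.values())
    return {label: c / n for label, c in counts.items()}


def posterior(rows, attr, target):
    """Distribution of the target for each attribute value: P(target | attr = value)."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[attr]][r[target]] += 1
    return {
        value: {label: c / sum(counts.values()) for label, c in counts.items()}
        for value, counts in by_value.items()
    }


if __name__ == "__main__":
    print(prior(SAMPLE, "class"))
    print(posterior(SAMPLE, "checking", "class"))
```

A large gap between the prior and a posterior for some value suggests the attribute carries signal, and a heavily skewed prior is a hint that stratified sampling may be worth considering.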
Data Mining at the Speed of Light: DBMS_PREDICTIVE_ANALYTICS functions allow us to perform mining activity very quickly. PREDICT: a Support Vector Machine (SVM) model to perform credit score prediction. PROFILE: a Decision Tree based explanatory model. EXPLAIN: a Minimum Description Length (MDL) based attribute importance algorithm.
Attribute Importance: Some mining problems may contain an extremely high number of attributes. Amazon Access Sample: 20000 attributes; Amazon Commerce Reviews Set: 10000 attributes; URL Reputation: 3231961 attributes. Reducing the number of attributes before any analysis will let you see the trees in the forest, move quickly, and use fewer resources.
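As a rough illustration of attribute ranking, the sketch below scores each attribute by its mutual information with the target. This is a swapped-in, simpler technique chosen for brevity; it is not the MDL-based algorithm behind EXPLAIN, and the function names and toy rows are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_attributes(rows, attrs, target):
    """Return (attribute, score) pairs sorted from most to least informative."""
    ys = [r[target] for r in rows]
    scores = {a: mutual_information([r[a] for r in rows], ys) for a in attrs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy rows: "a" determines the target, "b" is independent of it.
    rows = [
        {"a": 0, "b": 0, "y": 0},
        {"a": 0, "b": 1, "y": 0},
        {"a": 1, "b": 0, "y": 1},
        {"a": 1, "b": 1, "y": 1},
    ]
    print(rank_attributes(rows, ["a", "b"], "y"))
```

Attributes scoring near zero are candidates for removal before model building; continuous attributes would need binning first, which this sketch does not cover.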
Training vs. Validation Set: In order to deliver unbiased performance results for data models, the training and validation sets should be mutually exclusive. Different techniques are used in the literature: X% validation vs. (100-X)% training; K-fold cross validation. Ensure that your method is suitable for your problem type and statistically stable and sound.
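Both techniques can be sketched with the Python standard library. The function names, the 30% default, and the fixed seed are assumptions made for this example. The holdout split is stratified per class so that the good/bad ratio is preserved in both sets; the k-fold generator is a plain, non-stratified variant.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.3, seed=42):
    """Split row indices into disjoint train/validation sets, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, valid = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_valid = round(len(idxs) * validation_fraction)
        valid.extend(idxs[:n_valid])
        train.extend(idxs[n_valid:])
    return sorted(train), sorted(valid)


def k_fold(n, k=5, seed=42):
    """Yield (train, validation) index lists; each row is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


if __name__ == "__main__":
    labels = [1] * 70 + [2] * 30  # invented class mix for the demo
    train, valid = stratified_split(labels)
    print(len(train), len(valid))
    for tr, va in k_fold(10, k=5):
        print(sorted(va))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models against the same validation rows.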
Model Build: Oracle Data Miner offers several algorithms for data modeling: Naive Bayes, Decision Tree, Generalized Linear Model (GLM), Support Vector Machine. Remember that every model requires a unique identifier in the data set.
Evaluation & Scoring: The final question is how well you did with your model. This step is usually called evaluation. Once you are sure that your model is sufficiently accurate, the final step is to score a given customer for credit, either in batch or in real time.
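A minimal sketch of the evaluation step, assuming the actual and predicted classes of the validation set are available as two lists. The function names and the toy labels are made up for this example; the class coding (1 good, 2 bad) follows the sample data.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count of rows for each (actual, predicted) pair."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of rows where the prediction matches the actual class."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


def precision_recall(actual, predicted, positive):
    """Precision and recall for the class treated as positive."""
    cm = confusion_matrix(actual, predicted)
    tp = cm[(positive, positive)]
    fp = sum(c for (a, p), c in cm.items() if p == positive and a != positive)
    fn = sum(c for (a, p), c in cm.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    actual = [1, 1, 1, 2, 2]      # invented validation labels
    predicted = [1, 1, 2, 2, 1]   # invented model output
    print(confusion_matrix(actual, predicted))
    print(accuracy(actual, predicted))
    print(precision_recall(actual, predicted, positive=2))
```

For credit scoring, looking at precision and recall for the bad class is usually more telling than accuracy alone, since the two kinds of mistakes rarely cost the same.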
Conclusion: Remember that 90% of data modeling is about ad-hoc attacks. That makes in-database mining very appealing. A crude understanding of your data might save a huge amount of time. Some problems may ask for input set reduction. DBMS_PREDICTIVE_ANALYTICS is the ad-hoc way of data modeling. For model evaluation and scoring, use the prediction and prediction_probability operators of SQL.
THANK YOU Husnu Sensoy http://husnusensoy.wordpress.com