Oracle's In-Database Statistical Functions




Charlie Berger, Sr. Director Product Management, Data Mining Technologies, Oracle Corporation
charlie.berger@oracle.com
[Diagram: Oracle 11g DB with Data Warehousing, ETL, OLAP, Statistics, Data Mining]

Synopsis
Oracle has delivered on a multi-year strategy to transform the database from a data repository into an analytical database by bringing the analytics (data mining, text mining, and statistical functions) to the data.
This analytical database, integrated with Oracle Business Intelligence EE, opens new doors for better BI:
  Why did something happen? What corrective actions should be taken?
  Which factors are influencing your business's key performance indicators?
  Which things should I target?
  What will happen in the future, and where should you focus limited resources?
Overview of the SQL statistical capabilities embedded in Oracle Database
A "repeat what I was shown" hands-on session

Agenda
  Introduction
  Oracle's in-database Statistical Functions
  Several Simple Demonstrations
  Opportunities for Use Cases
  Hands-on Exercises
  User Stories A B C

Market Trends: Analytics Provide Competitive Value
Competing on Analytics, by Tom Davenport: "Some companies have built their very businesses on their ability to collect, analyze, and act on data. Although numerous organizations are embracing analytics, only a handful have achieved this level of proficiency. But analytics competitors are the leaders in their varied fields: consumer products, finance, retail, and travel and entertainment among them."
"Organizations are moving beyond query and reporting." (IDC, 2006)
Super Crunchers, by Ian Ayres: "In the past, one could get by on intuition and experience. Times have changed. Today, the name of the game is data." (Steven D. Levitt, author of Freakonomics)
"Data-mining and statistical analysis have suddenly become cool... Dissecting marketing, politics, and even sports, stuff this complex and important shouldn't be this much fun to read." (Wired)

Market Trends: Analytics Save Lives
Super Crunchers, by Ian Ayres: "In December 2004, [Berwick] brazenly announced a plan to save 100,000 lives over the next year and a half. The 100,000 Lives Campaign challenged hospitals to implement six changes in care to prevent avoidable deaths. He noticed that thousands of ICU patients die each year from infections after a central line catheter is placed in their chests. About half of all intensive care patients have central line catheters, and ICU infections are deadly (carrying mortality rates of up to 20 percent). He then looked to see if there was any statistical evidence of ways to reduce the chance of infection. He found a 2004 article in Critical Care Medicine that showed that systematic hand-washing (combined with a bundle of improved hygienic procedures such as cleaning the patient's skin with an antiseptic called chlorhexidine) could reduce the risk of infection from central-line catheters by more than 90 percent. Berwick estimated that if all hospitals just implemented this one bundle of procedures, they might be able to save as many as 25,000 lives per year."
Source: New York Times, August 23, 2007, "Attack of the Super Crunchers: Adventures in Data Mining," by Melissa Lafsky

Competitive Advantage of BI & Analytics
[Chart: competitive advantage (vertical axis) versus degree of intelligence (horizontal axis), rising through the levels below, listed from highest to lowest]
Analytics:
  Optimization: What's the best that can happen?
  Predictive Modeling: What will happen next?
  Forecasting/Extrapolation: What if these trends continue?
  Statistical Analysis: Why is this happening?
Access & Reporting:
  Alerts: What actions are needed?
  Query/drill down: Where exactly is the problem?
  Ad hoc reports: How many, how often, where?
  Standard Reports: What happened?
Source: Competing on Analytics, by T. Davenport & J. Harris

Oracle Data Mining & Statistical Functions

Definition: Statistics
"There are three kinds of lies: lies, damned lies, and statistics." [1]
[1] This well-known saying is part of a phrase attributed to Benjamin Disraeli and popularized in the U.S. by Mark Twain. http://en.wikipedia.org/wiki/statistics

Definition: Statistics
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities. Statistics are also used for making informed decisions, and misused for other reasons, in all areas of business and government.
Source: http://en.wikipedia.org/wiki/statistics

Definitions: Statistics
Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called inferential statistics. Both descriptive and inferential statistics comprise applied statistics.
Source: http://en.wikipedia.org/wiki/statistics

Statistical Concepts

Statistics & SQL Analytics
Ranking functions: rank, dense_rank, cume_dist, percent_rank, ntile
Window Aggregate functions (moving and cumulative): avg, sum, min, max, count, variance, stddev, first_value, last_value
LAG/LEAD functions: direct inter-row reference using offsets
Reporting Aggregate functions: sum, avg, min, max, variance, stddev, count, ratio_to_report
Statistical Aggregates: correlation, linear regression family, covariance
Linear regression: fitting of an ordinary-least-squares regression line to a set of number pairs; frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
Descriptive Statistics: average, standard deviation, variance, min, max, median (via percentile_cont), mode, group-by & roll-up
DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
Correlations: Pearson's correlation coefficients, Spearman's and Kendall's (both nonparametric)
Cross Tabs: enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
Hypothesis Testing: Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
Distribution Fitting: Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
Note: Statistics and SQL Analytics are included in Oracle Database Standard Edition
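As a rough illustration of several of the function families listed above, the following sketch runs against a small inline data set instead of a real table. The MONTHLY_SALES name, its columns, and its values are hypothetical and exist only for this example; the functions themselves are standard Oracle SQL.

```sql
-- Hypothetical data: six months of sales, built inline from DUAL.
WITH monthly_sales AS (
  SELECT 1 AS mon, 100 AS amount FROM dual UNION ALL
  SELECT 2, 120 FROM dual UNION ALL
  SELECT 3, 90  FROM dual UNION ALL
  SELECT 4, 150 FROM dual UNION ALL
  SELECT 5, 170 FROM dual UNION ALL
  SELECT 6, 160 FROM dual
)
SELECT mon,
       amount,
       -- Ranking function: 1 = largest amount
       RANK() OVER (ORDER BY amount DESC) AS amount_rank,
       -- Window aggregate: moving average over the current and two preceding months
       AVG(amount) OVER (ORDER BY mon
                         ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3,
       -- LAG: value from the previous row (NULL for the first month)
       LAG(amount, 1) OVER (ORDER BY mon) AS prev_amount,
       -- Reporting aggregate: each month's share of the total
       RATIO_TO_REPORT(amount) OVER () AS share_of_total
FROM   monthly_sales
ORDER  BY mon;

-- Statistical aggregates and descriptive statistics over the same hypothetical data.
WITH monthly_sales AS (
  SELECT 1 AS mon, 100 AS amount FROM dual UNION ALL
  SELECT 2, 120 FROM dual UNION ALL
  SELECT 3, 90  FROM dual UNION ALL
  SELECT 4, 150 FROM dual UNION ALL
  SELECT 5, 170 FROM dual UNION ALL
  SELECT 6, 160 FROM dual
)
SELECT REGR_SLOPE(amount, mon)     AS slope,        -- least-squares slope of amount on mon
       REGR_INTERCEPT(amount, mon) AS intercept,
       CORR(amount, mon)           AS pearson_r,
       MEDIAN(amount)              AS median_amount,
       -- Same median expressed through PERCENTILE_CONT
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_via_percentile_cont
FROM   monthly_sales;
```

The first argument of the REGR_* functions is the dependent variable and the second the independent one, so the slope here is the fitted change in amount per month.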

Descriptive Statistics: MEDIAN & MODE
Median: takes numeric or datetime values and returns the middle value
Mode: returns the most common value
    A. SELECT STATS_MODE(EDUCATION) FROM CD_BUYERS;
    B. SELECT MEDIAN(ANNUAL_INCOME) FROM CD_BUYERS;
    C. SELECT EDUCATION, MEDIAN(ANNUAL_INCOME) FROM CD_BUYERS GROUP BY EDUCATION;
    D. SELECT EDUCATION, MEDIAN(ANNUAL_INCOME) FROM CD_BUYERS GROUP BY EDUCATION ORDER BY MEDIAN(ANNUAL_INCOME) ASC;

DBMS_STAT_FUNCS Package: SUMMARY procedure
The SUMMARY procedure summarizes a numerical column (here ADM_PULSE); the summary is returned as a record of type summarytype.
    DECLARE
      v_ownername   varchar2(8);
      v_tablename   varchar2(50);
      v_columnname  varchar2(50);
      v_sigma_value number;
      type n_arr1 is varray(5) of number;
      type num_table1 is table of number;
      s1 dbms_stat_funcs.summarytype;   -- receives the summary record
    BEGIN
      v_ownername   := 'CBERGER';       -- upper case, matching the call on the next slide
      v_tablename   := 'LYMPHOMA';
      v_columnname  := 'ADM_PULSE';
      v_sigma_value := 3;               -- values beyond +/- 3 sigma count as extreme
      dbms_stat_funcs.summary(p_ownername   => v_ownername,
                              p_tablename   => v_tablename,
                              p_columnname  => v_columnname,
                              p_sigma_value => v_sigma_value,
                              s             => s1);
    END;
    /

DBMS_STAT_FUNCS Package: SUMMARY procedure
The SUMMARY procedure summarizes a numerical column (here ADM_PULSE); the summary is returned as a record of type summarytype.
    set echo off
    connect CBERGER/CBERGER@ora10gr2
    set serveroutput on
    set echo on
    declare
      s DBMS_STAT_FUNCS.SummaryType;
    begin
      DBMS_STAT_FUNCS.SUMMARY('CBERGER','LYMPHOMA','ADM_PULSE',3,s);
      -- Print each field of the summary record
      dbms_output.put_line('SUMMARY STATISTICS');
      dbms_output.put_line('count: ' || s.count);
      dbms_output.put_line('min: ' || s.min);
      dbms_output.put_line('max: ' || s.max);
      dbms_output.put_line('range: ' || s.range);
      dbms_output.put_line('mean: ' || round(s.mean));
      dbms_output.put_line('mode Count: ' || s.cmode.count);
      dbms_output.put_line('mode: ' || s.cmode(1));
      dbms_output.put_line('variance: ' || round(s.variance));
      dbms_output.put_line('stddev: ' || round(s.stddev));
      dbms_output.put_line('quantile 5 ' || s.quantile_5);
      dbms_output.put_line('quantile 25 ' || s.quantile_25);
      dbms_output.put_line('median ' || s.median);
      dbms_output.put_line('quantile 75 ' || s.quantile_75);
      dbms_output.put_line('quantile 95 ' || s.quantile_95);
      dbms_output.put_line('extreme Count: ' || s.extreme_values.count);
      dbms_output.put_line('extremes: ' || s.extreme_values(1));
      dbms_output.put_line('top 3: ' || s.top_5_values(1) || ',' || s.top_5_values(2) || ',' || s.top_5_values(3));
      dbms_output.put_line('bottom 3: ' || s.bottom_5_values(5) || ',' || s.bottom_5_values(4) || ',' || s.bottom_5_values(3));
    end;
    /

DBMS_STAT_FUNCS Package: SUMMARY procedure
A subset of the data returned after executing the PL/SQL block illustrates the output of the SUMMARY procedure.

Summary Statistics and Histograms
Oracle Data Miner (GUI for the Oracle Data Mining option) provides graphical histograms with summary statistics.

Hypothesis Testing: Parametric Tests
Parametric tests make assumptions about the data, typically that the data is normally distributed, among other assumptions.
Oracle 10g parametric hypothesis tests include:
  T-test
  F-test
  One-Way ANOVA

T-Test
T-tests are used to measure the significance of a difference of means. T-tests include the following:
  One-sample T-test
  Paired-samples T-test
  Independent-samples T-test (pooled variances)
  Independent-samples T-test (unpooled variances)

Basic Example
Compare the difference in blood pressure between people who eat meat frequently and those who don't.

One-Sample T-Test: STATS_T_TEST_*
The t-test functions are:
  STATS_T_TEST_ONE: a one-sample t-test
  STATS_T_TEST_PAIRED: a two-sample, paired t-test (also known as a crossed t-test)
  STATS_T_TEST_INDEP: a t-test of two independent groups with the same variance (pooled variances)
  STATS_T_TEST_INDEPU: a t-test of two independent groups with unequal variance (unpooled variances)
http://download-west.oracle.com/docs/cd/b19306_01/server.102/b14200/functions157.htm

One-Sample T-Test
Query compares the mean of SURVIVAL_TIME_MO to the assumed value of 35:
    SELECT avg(survival_time_mo) group_mean,
           stats_t_test_one(survival_time_mo, 35, 'STATISTIC') t_observed,
           stats_t_test_one(survival_time_mo, 35) two_sided_p_value
    FROM LYMPHOMA;
Returns the observed t value and its related two-sided significance.

Paired Samples T-Test
Query compares the mean of LOGWT for pig weights in Week 3 to Week 8, grouped by diet:
    SELECT substr(diet,1,1) as diet,
           avg(logwt3) logwt3_mean,
           avg(logwt8) logwt8_mean,
           stats_t_test_paired(logwt3, logwt8, 'STATISTIC') t_observed,
           stats_t_test_paired(logwt3, logwt8) two_sided_p_value
    FROM CBERGER.PIGLETS3
    GROUP BY ROLLUP(DIET)
    ORDER BY 5 ASC;
Returns the observed t value and its related two-sided significance.

Independent Samples T-Test (Pooled Variances)
Query compares the mean of AMOUNT_SOLD between men and women within CUST_INCOME_LEVEL ranges:
    SELECT substr(cust_income_level,1,22) income_level,
           avg(decode(cust_gender,'M',amount_sold,null)) sold_to_men,
           avg(decode(cust_gender,'F',amount_sold,null)) sold_to_women,
           stats_t_test_indep(cust_gender, amount_sold, 'STATISTIC', 'F') t_observed,
           stats_t_test_indep(cust_gender, amount_sold) two_sided_p_value
    FROM sh.customers c, sh.sales s
    WHERE c.cust_id = s.cust_id
    GROUP BY rollup(cust_income_level)
    ORDER BY 1;
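The unpooled variant listed earlier, STATS_T_TEST_INDEPU, takes the same arguments. A minimal sketch, assuming a hypothetical inline data set (the OBS name, its columns, and its values are invented for illustration):

```sql
-- Hypothetical measurements for two groups, A and B, built inline from DUAL.
WITH obs AS (
  SELECT 'A' AS grp, 5.1 AS val FROM dual UNION ALL
  SELECT 'A', 4.9 FROM dual UNION ALL
  SELECT 'A', 5.6 FROM dual UNION ALL
  SELECT 'A', 5.8 FROM dual UNION ALL
  SELECT 'B', 6.9 FROM dual UNION ALL
  SELECT 'B', 4.2 FROM dual UNION ALL
  SELECT 'B', 7.8 FROM dual UNION ALL
  SELECT 'B', 5.5 FROM dual
)
SELECT AVG(DECODE(grp, 'A', val)) AS mean_a,
       AVG(DECODE(grp, 'B', val)) AS mean_b,
       -- Fourth argument names the group whose mean comes first in the difference
       STATS_T_TEST_INDEPU(grp, val, 'STATISTIC', 'A') AS t_observed,
       STATS_T_TEST_INDEPU(grp, val, 'DF')             AS degrees_of_freedom,
       -- With no third argument the two-sided significance is returned
       STATS_T_TEST_INDEPU(grp, val)                   AS two_sided_p_value
FROM   obs;
```

The unpooled form is the safer choice when the two groups may have different variances, as group B does in this made-up data.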


F-Test
Query compares the variance in SIZE_TUMOR_MM between males and females:
    SELECT variance(decode(gender,'0', SIZE_TUMOR_MM, null)) var_tumor_men,
           variance(decode(gender,'1', SIZE_TUMOR_MM, null)) var_tumor_women,
           stats_f_test(gender, SIZE_TUMOR_MM, 'STATISTIC', '1') f_statistic,
           stats_f_test(gender, SIZE_TUMOR_MM) two_sided_p_value
    FROM CBERGER.LYMPHOMA;
Returns the observed F value and its two-sided significance.

F-Test
Query compares the mean SIZE_REDUCTION across TREATMENT_PLAN values, grouped by GENDER, using the F ratio from a one-way ANOVA:
    SELECT GENDER,
           stats_one_way_anova(treatment_plan, SIZE_REDUCTION, 'F_RATIO') f_ratio,
           stats_one_way_anova(treatment_plan, SIZE_REDUCTION, 'SIG') p_value,
           AVG(SIZE_REDUCTION)
    FROM CBERGER.LYMPHOMA
    GROUP BY GENDER
    ORDER BY GENDER;
Returns the observed F ratio and its significance for each gender.

One-Way ANOVA
In statistics, analysis of variance (ANOVA, or sometimes A.N.O.V.A.) is a collection of statistical models, and their associated procedures, in which the observed variance is partitioned into components due to different explanatory variables.
Example: Group A is given vodka, Group B is given gin, and Group C is given a placebo. All groups are then tested with a memory task. A one-way ANOVA can be used to assess the effect of the various treatments (that is, the vodka, gin, and placebo).
Source: http://en.wikipedia.org/wiki/statistics

One-Way ANOVA
Query compares the average SIZE_REDUCTION across different TREATMENT_PLAN values, grouped by LYMPH_TYPE:
    SELECT LYMPH_TYPE,
           stats_one_way_anova(treatment_plan, SIZE_REDUCTION, 'F_RATIO') f_ratio,
           stats_one_way_anova(treatment_plan, SIZE_REDUCTION, 'SIG') p_value
    FROM CBERGER.LYMPHOMA
    GROUP BY LYMPH_TYPE
    ORDER BY 1;
Returns the one-way ANOVA F ratio and significance, split by LYMPH_TYPE.

Hypothesis Testing (Nonparametric)
Nonparametric tests are used when certain assumptions about the data are questionable, for example when comparing samples that are not normally distributed. All tests involving ordinal scales (in which data is ranked) are nonparametric.
Nonparametric tests supported in Oracle Database 10g:
  Binomial test
  Wilcoxon Signed Ranks test
  Mann-Whitney test
  Kolmogorov-Smirnov test
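The slides give no query for these tests, so here is a minimal sketch of how the corresponding SQL functions can be called. The inline data sets (PAIRED, SAMPLES) and their column names and values are hypothetical; the function names and return-value keywords follow the Oracle Database SQL Reference.

```sql
-- Wilcoxon Signed Ranks test on hypothetical paired before/after scores.
WITH paired AS (
  SELECT 12 AS before_score, 15 AS after_score FROM dual UNION ALL
  SELECT 14, 13 FROM dual UNION ALL
  SELECT 11, 16 FROM dual UNION ALL
  SELECT 10, 14 FROM dual UNION ALL
  SELECT 13, 17 FROM dual UNION ALL
  SELECT 15, 18 FROM dual
)
SELECT STATS_WSR_TEST(before_score, after_score, 'STATISTIC')     AS z_statistic,
       STATS_WSR_TEST(before_score, after_score, 'TWO_SIDED_SIG') AS two_sided_p_value
FROM   paired;

-- Mann-Whitney, Kolmogorov-Smirnov, and binomial tests on two hypothetical groups.
WITH samples AS (
  SELECT 'A' AS grp, 4.1 AS measurement FROM dual UNION ALL
  SELECT 'A', 5.3 FROM dual UNION ALL
  SELECT 'A', 4.8 FROM dual UNION ALL
  SELECT 'A', 5.0 FROM dual UNION ALL
  SELECT 'B', 6.2 FROM dual UNION ALL
  SELECT 'B', 5.9 FROM dual UNION ALL
  SELECT 'B', 7.1 FROM dual UNION ALL
  SELECT 'B', 6.5 FROM dual
)
SELECT -- First argument classifies rows into the two groups; second holds the values
       STATS_MW_TEST(grp, measurement, 'STATISTIC')     AS mw_z_statistic,
       STATS_MW_TEST(grp, measurement, 'TWO_SIDED_SIG') AS mw_p_value,
       STATS_KS_TEST(grp, measurement, 'STATISTIC')     AS ks_d_statistic,
       STATS_KS_TEST(grp, measurement, 'SIG')           AS ks_p_value,
       -- Tests whether the observed proportion of group 'A' differs from 0.5
       STATS_BINOMIAL_TEST(grp, 'A', 0.5, 'TWO_SIDED_PROB') AS binomial_p_value
FROM   samples;
```

With samples this small the significance values are only illustrative; the point is the calling pattern, which mirrors the t-test functions shown earlier.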

Customer Example
"Our experience suggests that Oracle 10g Statistics and Data Mining features can reduce development effort of analytical systems by an order of magnitude."
Sumeet Muju, Senior Member of Professional Staff, SRA International (SRA supports NIH bioinformatics development projects)

Correlation Functions
The CORR_S and CORR_K functions support nonparametric or rank correlation (finding correlations between expressions that are ordinal scaled). Correlation coefficients take on a value ranging from -1 to 1, where:
  1 indicates a perfect relationship
  -1 indicates a perfect inverse relationship
  0 indicates no relationship
The following query determines whether there is a correlation between the AGE and WEIGHT of people within each treatment plan, using Spearman's correlation:
    select CORR_S(AGE, WEIGHT) coefficient,
           CORR_S(AGE, WEIGHT, 'TWO_SIDED_SIG') p_value,
           substr(treatment_plan, 1, 15) as TREATMENT_PLAN
    from CBERGER.LYMPHOMA
    GROUP BY TREATMENT_PLAN;
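To show how the rank correlations differ from Pearson's coefficient, the sketch below applies CORR, CORR_S, and CORR_K to a hypothetical inline data set (the PAIRS name and its values are invented) in which y grows with x, but not linearly.

```sql
-- Hypothetical pairs where y = x cubed: strictly increasing, but not linear.
WITH pairs AS (
  SELECT 1 AS x, 1 AS y FROM dual UNION ALL
  SELECT 2, 8   FROM dual UNION ALL
  SELECT 3, 27  FROM dual UNION ALL
  SELECT 4, 64  FROM dual UNION ALL
  SELECT 5, 125 FROM dual
)
SELECT CORR(x, y)                     AS pearson_r,     -- below 1: the relationship is not linear
       CORR_S(x, y)                   AS spearman_rho,  -- 1: the ranks agree perfectly
       CORR_K(x, y)                   AS kendall_tau,   -- 1: every pair of rows is concordant
       CORR_K(x, y, 'TWO_SIDED_SIG')  AS kendall_p_value
FROM   pairs;
```

Because both rank coefficients depend only on ordering, any strictly increasing relationship scores 1, which is why they suit ordinal data.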

Cross Tabulations
This query analyzes the strength of the association between TREATMENT_PLAN and GENDER, grouped by LYMPH_TYPE, using a cross tabulation:
    SELECT LYMPH_TYPE,
           stats_crosstab(gender, TREATMENT_PLAN, 'CHISQ_OBS') chi_squared,
           stats_crosstab(gender, TREATMENT_PLAN, 'CHISQ_SIG') p_value,
           stats_crosstab(gender, TREATMENT_PLAN, 'PHI_COEFFICIENT') phi_coefficient
    FROM CBERGER.LYMPHOMA
    GROUP BY LYMPH_TYPE
    ORDER BY 1;
Returns the observed chi-squared value, its significance (p_value), and the phi coefficient for each LYMPH_TYPE.

Cross Tabulations
The STATS_CROSSTAB function takes as arguments two expressions (the two variables being analyzed) and a value that determines which statistic to return. These values include the following:
  CHISQ_OBS (observed value of chi-squared)
  CHISQ_SIG (significance of observed chi-squared)
  CHISQ_DF (degrees of freedom for chi-squared)
  PHI_COEFFICIENT (phi coefficient)
  CRAMERS_V (Cramer's V statistic)
  CONT_COEFFICIENT (contingency coefficient)
  COHENS_K (Cohen's kappa)
The function returns the value specified by the third argument (default is CHISQ_SIG).

Distribution-Fitting Functions
Distribution-fitting functions in Oracle Database 10g (part of the DBMS_STAT_FUNCS package) include the following:
  NORMAL_DIST_FIT function
  UNIFORM_DIST_FIT function
  POISSON_DIST_FIT function
  WEIBULL_DIST_FIT function
  EXPONENTIAL_DIST_FIT function
These functions test how well a sample of values fits a particular distribution.
An IN parameter of each function (the test type) specifies which of the tests to use to measure the fit.
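As a sketch of how one of these routines is called, the block below tests the ADM_PULSE column used in the earlier SUMMARY slides against a normal distribution. It assumes the CBERGER.LYMPHOMA table from those slides exists; the variable names are invented for illustration, and the positional argument order (owner, table, column, test type, mean, standard deviation, significance) follows the DBMS_STAT_FUNCS documentation.

```sql
SET SERVEROUTPUT ON
DECLARE
  v_mean  NUMBER;   -- sample mean, supplied to the fit as an input
  v_stdev NUMBER;   -- sample standard deviation, supplied as an input
  v_sig   NUMBER;   -- significance of the fit, returned by the procedure
BEGIN
  -- The fit expects the distribution parameters, so compute them first.
  SELECT AVG(adm_pulse), STDDEV(adm_pulse)
  INTO   v_mean, v_stdev
  FROM   cberger.lymphoma;

  -- 'SHAPIRO_WILKS' is the default test type; 'CHI_SQUARED', 'KOLMOGOROV_SMIRNOV',
  -- and 'ANDERSON_DARLING' are the documented alternatives.
  DBMS_STAT_FUNCS.NORMAL_DIST_FIT('CBERGER', 'LYMPHOMA', 'ADM_PULSE',
                                  'SHAPIRO_WILKS', v_mean, v_stdev, v_sig);

  DBMS_OUTPUT.PUT_LINE('Significance of normal fit: ' || v_sig);
END;
/
```

A small significance value suggests the column departs from normality, which is one way to decide between the parametric and nonparametric tests described earlier.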

Opportunities for Use Cases
Control charts: set flags on your data, e.g. when a value is above 3 sigma

Opportunities for Use Cases
Construction of a Control Chart:
1. Calculate means and ranges for each sample
2. Chart
3. Apply out-of-control rules, e.g. outside of 3 sigma
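The control-chart steps above can be sketched in SQL roughly as follows. This is a simplified illustration, not a full X-bar/R chart: the READINGS data, the choice of samples 1 to 4 as the baseline, and the use of the standard deviation of the baseline sample means as sigma are all assumptions made for this example.

```sql
-- Hypothetical readings: five samples of two measurements each, built inline from DUAL.
WITH readings AS (
  SELECT 1 AS sample_id, 10.0 AS measurement FROM dual UNION ALL
  SELECT 1, 10.2 FROM dual UNION ALL
  SELECT 2, 9.9  FROM dual UNION ALL
  SELECT 2, 10.1 FROM dual UNION ALL
  SELECT 3, 10.1 FROM dual UNION ALL
  SELECT 3, 10.3 FROM dual UNION ALL
  SELECT 4, 9.8  FROM dual UNION ALL
  SELECT 4, 10.0 FROM dual UNION ALL
  SELECT 5, 11.9 FROM dual UNION ALL
  SELECT 5, 12.1 FROM dual
),
-- Step 1: mean and range for each sample
sample_stats AS (
  SELECT sample_id,
         AVG(measurement)                    AS sample_mean,
         MAX(measurement) - MIN(measurement) AS sample_range
  FROM   readings
  GROUP  BY sample_id
),
-- Center line and sigma taken from an assumed in-control baseline (samples 1 to 4)
limits AS (
  SELECT AVG(sample_mean)    AS center_line,
         STDDEV(sample_mean) AS sigma
  FROM   sample_stats
  WHERE  sample_id <= 4
)
-- Steps 2 and 3: values to chart, plus the 3-sigma out-of-control flag
SELECT s.sample_id,
       s.sample_mean,
       s.sample_range,
       l.center_line,
       l.center_line + 3 * l.sigma AS upper_control_limit,
       l.center_line - 3 * l.sigma AS lower_control_limit,
       CASE WHEN ABS(s.sample_mean - l.center_line) > 3 * l.sigma
            THEN 'OUT OF CONTROL' ELSE 'OK' END AS flag
FROM   sample_stats s CROSS JOIN limits l
ORDER  BY s.sample_id;
```

With this data only sample 5 is flagged. Taking the limits from a separate baseline matters: if a small set of samples supplied both the limits and the points being judged, no single point could ever sit more than 3 sigma from the mean.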


http://www.oracle.com/technology/products/bi/stats_fns/index.html

In-Database Statistics Advantages
  Data remains in the database at all times, with appropriate access security control mechanisms: fewer moving parts
  Straightforward inclusion within interesting and arbitrarily complex queries
  Real-world scalability, available for mission-critical applications
[Diagram: Oracle 10g DB with Data Warehousing, ETL, OLAP, Statistics, Data Mining]

Industry Analysts
PREDICTIVE ANALYTICS: Extending the Value of Your Data Warehousing Investment, by Wayne W. Eckerson
"According to our survey, most organizations plan to significantly increase the analytic processing within a data warehouse database in the next three years, particularly for model building and scoring, which show 88% climbs. The amount of data preparation done in databases will only climb 36% in that time, but it will be done by almost two-thirds of all organizations (60%), double the rate of companies planning to use the database to create or score analytical models. ... it's surprising that about one-third of organizations plan to build analytical models in databases within three years. 'We leverage the data warehouse database when possible,' says one analytics manager. He says most analysts download a data sample to their desktop and then upload it to the data warehouse once it's completed. 'Ultimately, however, everything will run in the data warehouse,' the manager says."
Source: http://download.101com.com/pub/tdwi/files/pa_report_q107_f.pdf

Analytics: In-Database Analytics Engine vs. External Analytical Engine
[Diagram: Oracle 11g DB with Data Warehousing, ETL, OLAP, Statistics, Data Mining]
In-Database:
1. In-Database Analytics Engine: Basic Statistics (free), Data Mining, Text Mining
2. Costs (ODM: $20K per CPU): simplified environment, single server, security
3. IT Platform: SQL (standard), Java (standard)
External:
1. External Analytical Engine: Basic Statistics, Data Mining, Text Mining (separate: SAS EM for Text), Advanced Statistics
2. Costs (SAS EM: $150K/5 users): duplicates data, Annual Renewal Fee (AUF) (~45% each year)
3. IT Platform: SAS code (proprietary)


SAS In-Database Processing: 3-Year Road Map
"The goal of the SAS In-Database initiative is to achieve deeper technical integration with database providers, but also blends the best SAS data integration and analytics with the core strengths of databases. ... Like all DBMS client applications, the SAS engine often must load and extract data over a network to and from the DBMS. This presents a series of challenges:
  Network bottlenecks between SAS and the DBMS constrain access to large volumes of data.
  The best practice today is to read data into the SAS environment for processing. For highly repeatable processes, this might not be efficient because it takes time to transfer the data and resources are used to temporarily store in the SAS environment.
  In some cases, the results of the SAS processing must be transferred back to the DBMS for final storage, which further increases the cost.
Addressing this challenge can result in improved resource utilization and enable companies to answer business questions more quickly."
Source: SAS In-Database Processing White Paper, October 2007
Oracle Data Mining is available today.

SAS In-Database Processing: 3-Year Road Map
"It boils down to this simple equation: Less data movement = faster analytics, and faster analytics = faster delivery of real-time BI throughout an enterprise."
Source: http://www.teradata.com/t/pdf.aspx?a=83673&b=178909
"Use SAS to get more power out of your database: move key components of BI, analytics and data integration processes from the server or desktop to inside the database and help shorten your time to intelligence."

IDC Worldwide Business Analytics Software Oracle http://www.oracle.com/corporate/analyst/reports/infrastructure/bi_dw/208699e.pdf

References
1. Back to Basics: Understanding and Visualising Variation in Data. Pete Ceuppens, Robert Shaw, Zhiping You. AstraZeneca R&D.
2. QuickStart: Oracle Statistics Release 10gR2. Charlie Berger, Oracle Corporation. April 2007.
3. Oracle Database SQL Reference 10g Release 2 (10.2). Part Number B14200-02. December 2005.
4. Applied Linear Statistical Models. John Neter, William Wasserman, Michael H. Kutner. Irwin, 1985.
5. Mathematical Statistics with Applications. Mendenhall, Scheaffer, Wackerly. Duxbury Press, Boston, MA. 1981.
6. Oracle Database Data Warehousing Guide 10g Release 2 (10.2). Part Number B14223-02. December 2005.
7. Oracle Technology Network: http://www.oracle.com/technology/products/bi/stats_fns/index.html
Source: Oracle 10gR2 Statistics Functions, OLSUG08 Workshop, Henri B. Tuthill, AstraZeneca & Charlie Berger, Oracle

Hands-on Exercises: Quick Start Statistics

More Information:
  Oracle Data Mining 10g: oracle.com/technology/products/bi/odm/index.html
  Oracle Statistical Functions: http://www.oracle.com/technology/products/bi/stats_fns/index.html
  Oracle Business Intelligence Solutions: oracle.com/bi
Contact Information:
  Email: Charlie.berger@oracle.com

Questions & Answers

This presentation is for informational purposes only and may not be incorporated into a contract or agreement.