SQL - the best analysis language for Big Data!

Similar documents
Oracle Big Data SQL Architectural Deep Dive

Oracle Big Data SQL Konference Data a znalosti 2015

Exadata V2 + Oracle Data Mining 11g Release 2 Importing 3 rd Party (SAS) dm models

Statistical Analysis of Gene Expression Data With Oracle & R (- data mining)

The Oracle Data Mining Machine Bundle: Zero to Predictive Analytics in Two Weeks Collaborate 15 IOUG

OLSUG Workshop Oracle Data Mining

Seamless Access from Oracle Database to Your Big Data

Oracle Data Mining In-Database Data Mining Made Easy!

Oracle's In-Database Statistical Functions

Blazing BI: the Analytic Options to the Oracle Database. ODTUG Kscope 2013

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

Semantic and Data Mining Technologies. Simon See, Ph.D.,

Big Data Analytics with Oracle Advanced Analytics In-Database Option

Predictive Analytics for Better Business Intelligence

Introducing Microsoft SQL Server 2012 Getting Started with SQL Server Management Studio

Querying Microsoft SQL Server

Course 20461C: Querying Microsoft SQL Server Duration: 35 hours

Querying Microsoft SQL Server 2012

Course ID#: W 35 Hrs. Course Content

Course 10774A: Querying Microsoft SQL Server 2012

Course 10774A: Querying Microsoft SQL Server 2012 Length: 5 Days Published: May 25, 2012 Language(s): English Audience(s): IT Professionals

Querying Microsoft SQL Server 20461C; 5 days

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Oracle SQL. Course Summary. Duration. Objectives

Data warehousing in Oracle. SQL extensions for data warehouse analysis. Available OLAP functions. Physical aggregation example

Querying Microsoft SQL Server Course M Day(s) 30:00 Hours

Instant SQL Programming

MOC QUERYING MICROSOFT SQL SERVER

An Oracle White Paper June Patterns everywhere Find them fast! SQL pattern matching in Oracle Database 12c

Oracle Database: SQL and PL/SQL Fundamentals NEW

Querying Microsoft SQL Server 2012

Netezza SQL Class Outline

STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE

Saskatoon Business College Corporate Training Centre

Oracle Database: SQL and PL/SQL Fundamentals

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

SPSS Tests for Versions 9 to 13

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

The Dummy s Guide to Data Analysis Using SPSS

Additional sources Compilation of sources:

Oracle Database: SQL and PL/SQL Fundamentals

Projects Involving Statistics (& SPSS)

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Analyzing Research Data Using Excel

AV-005: Administering and Implementing a Data Warehouse with SQL Server 2014

Querying Microsoft SQL Server Querying Microsoft SQL Server D. Course 10774A: Course Det ails. Co urse Outline

Normality Testing in Excel

Statistical tests for SPSS

Directions for using SPSS

Instructions for SPSS 21

Introduction Course in SPSS - Evening 1

Oracle Database 10g: Introduction to SQL

How to Build MicroStrategy Projects on Top of Big Data Sources in the Cloud

Programming with SQL

Oracle Database 12c: Introduction to SQL Ed 1.1

DATA ANALYSIS. QEM Network HBCU-UP Fundamentals of Education Research Workshop Gerunda B. Hughes, Ph.D. Howard University

DBMS / Business Intelligence, SQL Server

Data Preprocessing. Week 2

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

E x c e l : Data Analysis Tools Student Manual

Data Analysis Tools. Tools for Summarizing Data

Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner

MOC 20461C: Querying Microsoft SQL Server. Course Overview

extreme Datamining mit Oracle R Enterprise

Data analysis process

Oracle Database: SQL and PL/SQL Fundamentals NEW

New SQL Features in Firebird 3

Simple Predictive Analytics Curtis Seare

SPSS Explore procedure

Introduction to Statistical Computing in Microsoft Excel By Hector D. Flores; and Dr. J.A. Dobelman

Oracle 10g PL/SQL Training

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

SPSS (Statistical Package for the Social Sciences)

OLAP Systems and Multidimensional Expressions I

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Analysing Questionnaires using Minitab (for SPSS queries contact -)

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA

Study Guide for the Final Exam

SPSS TUTORIAL & EXERCISE BOOK

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

STA-201-TE. 5. Measures of relationship: correlation (5%) Correlation coefficient; Pearson r; correlation and causation; proportion of common variance

Using Excel in Research. Hui Bian Office for Faculty Excellence

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Advanced Big Data Analytics with R and Hadoop

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

SQL Databases Course. by Applied Technology Research Center. This course provides training for MySQL, Oracle, SQL Server and PostgreSQL databases.

STATISTICA Formula Guide: Logistic Regression. Table of Contents

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

Performing Queries Using PROC SQL (1)

Oracle Advanced Analytics - Option to Oracle Database: Oracle R Enterprise and Oracle Data Mining. Data Warehouse Global Leaders Winter 2013

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS


Multiple Linear Regression

Transcription:

SQL - the best analysis language for Big Data! NoCOUG Winter Conference 2014 Hermann Bär, hermann.baer@oracle.com Data Warehousing Product Management, Oracle 1

The On-Going Evolution of SQL Introduction of Window functions Statistical functions SQL model clause Partition Outer Join Data mining I Pattern matching Top N clause Data Mining III 4 5 Enhanced Window functions (percentile,etc) Rollup, grouping sets, cube Data mining II SQL Pivot Recursive WITH ListAgg, Nth value window 2

SQL for Analysis and Reporting Benefits Dramatically enhanced analysis capabilities with SQL Native support, e.g. OBI EE Embedding into SQL views Simplified development Investment protection through ANSI standard compliance Increased performance New language constructs enable more efficient plans Internal optimizations 3

Fundamental Concepts 4

Key Concepts Same for all functions - Unified SQL Partitions Groupings of rows within a query result set Orderings Rows can be ordered within a partition Windows (logical or physical) A moving group of rows within a partition Defines the range of an aggregate calculation Current Row 5

Reporting Aggregates Compare total sales of regions with total sales SELECT person, region, sales SUM (sales) OVER (PARTITION BY (region) ) r_sales SUM (sales) OVER () t_sales FROM sales_table ORDER BY region, s_rank; PERSON REGION SALES R_SALES T_SALES Adams East 200 530 1130 Connor East 180 530 1130 Baker East 150 530 1130 Donner West 300 600 1130 Edward West 200 600 1130 Witkowski West 100 600 1130 6

Lag/Lead Functions How does sales compare versus this month last year? SELECT timekey, sales, LAG(sales,12)OVER(ORDER BY timekey) as sales_last_year, (sales - sales_last_year) as sales_change FROM sales; TIMEKEY SALES SALES LAST YEAR SALES CHANGE 2009-01 1100 - -.. 2010-01 2000 1100 900 2010-02 1800 1200 600 2011-01 1900 2000-100 7

SQL for Analysis and Reporting Simplified SQL development Without analytical function select avg(b.event_datetime - a.prior_event_datetime) as avg_wait from (select a.*, rownum as row_number from (select order_id, event_datetime as prior_event_datetime, new_event_cd as prior_event_cd from order_pipeline_events where warehouse_id = 'RNO1' and event_datetime > sysdate - 2 order by order_id, prior_event_datetime, prior_event_cd ) a ) a, (select b.*, rownum as row_number from (select order_id, event_datetime, new_event_cd as event_cd from order_pipeline_events where warehouse_id = 'RNO1' and event_datetime > sysdate - 2 order by order_id, event_datetime, event_cd ) b ) b where a.order_id = b.order_id and a.prior_event_cd = '1001' and b.event_cd = '1002' and a.row_number = b.row_number - 1; With analytical function Select avg(event_datetime - prior_event_datetime) as avg_wait from (select new_event_cd as event_cd, event_datetime, lag(new_event_cd) over (partition by order_id order by event_datetime, new_event_cd) as prior_event_cd, lag(event_datetime) over (partition by order_id order by event_datetime, new_event_cd) as prior_event_datetime from order_pipeline_events where warehouse_id = 'RNO1' and event_datetime > sysdate - 2 ) where prior_event_cd = '1001' and event_cd = '1002'; 8

SQL Pattern Matching What s this about? 9

Pattern Matching in Sequences of Rows The Challenge a real-world business problem detect if a phone card went from phone A to phone B to phone C... and back to phone A within N hours... and detect if pattern above occurs at least N times within 7 days Prior to Oracle Database 12c pattern recognition in SQL is difficult Use multiple self joins (not good for *) T1.handset_id <> T2.handset_id <>T3.handset_id AND. T1.sim_id= X AND T2.time BETWEEN T1.time and T1.time+2. Use recursive query for * (WITH clause, CONNECT BY) Use Window Functions (likely with multiple query blocks) 10

> 1 min Pattern Matching in Sequences of Rows Objective Provide native SQL language construct Find one or more event A followed by one B followed by one or more C in a 1 minute interval Align with well-known regular expression declaration (PERL) Apply expressions across rows EVENT TIME LOCATION A 1 SFO A 1 SFO A 2 ATL A 2 LAX B 2 SFO C 2 LAX A+ B C+ (perl) A 2 ATL A 2 LAX B 2 SFO Soon to be in ANSI SQL Standard C 3 LAS A 3 SFO B 3 NYC C 4 NYC C 2 LAX 11

Finding Patterns in Big Data Typical use cases in today s world of fast exploration of big data Big Data Fraud Tracking Stock Market Financial Services Money Laundering Fraud Utilities Network Analysis Unusual Usage Law & Order Monitoring Suspicious Activities Retail Buying Patterns Sessionization Returns Fraud Telcos SIM Card Money Fraud Laundering Call Quality 12

SQL Pattern Matching Conceptual Example 13

SQL Pattern Matching in Action Basic steps for building the SQL command 1. Define the partitions/buckets and ordering needed to identify the stream of events you are analyzing 2. Define the pattern of events and pattern variables identifying the individual events within the pattern 3. Define measures: source data points, pattern data points and aggregates related to a pattern 4. Determine how the output will be generated 14

SQL Pattern Matching in Action Example: Find A Double Bottom Pattern (W-shape) in ticker stream Find a W-shape pattern in a ticker stream: Stock price Output the beginning and ending date of the pattern Calculate average price in the second ascent Find only patterns that lasted less than a week days 15

SQL Pattern Matching in Action Example: Find W-Shape* Stock price New syntax for discovering patterns using SQL: MATCH_RECOGNIZE ( ) SELECT... FROM ticker MATCH_RECOGNIZE (... ) days 16 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching in Action Example: Find W-Shape* Find a W-shape pattern in a ticker stream: Stock price Set the PARTITION BY and ORDER BY clauses SELECT FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time days We will continue to look at the black stock only from now on 17 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching in Action Example: Find W-Shape* Find a W-shape pattern in a ticker stream: Stock price Define the pattern the W-shape SELECT FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time days PATTERN (X+ Y+ W+ Z+) 18 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching in Action Example: Find W-Shape* Find a W-shape pattern in a ticker stream: X Stock price Define the pattern the first down part of the Wshape SELECT FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time days PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), 19 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching in Action Example: Find W-Shape* Find a W-shape pattern in a ticker stream: X Y Stock price Define the pattern the first up part of W-shape SELECT FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time days PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), 20 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching in Action Example: Find W-Shape* Find a W-shape pattern in a ticker stream: X Y W Z Stock price Define the pattern the second down (w) and the second up(z) of the Wshape SELECT FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time days PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price))) 21 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching in Action Example: Find W-Shape* Find a W-shape pattern in a ticker stream: X Stock price Z Define the measures to output once a pattern is matched: FIRST: beginning date LAST: ending date SELECT FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price))) days 22 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching in Action Example: Find W-Shape* Stock price First_x Last_z 1 9 13 19 Find a W-shape pattern in a ticker stream: Output one row each time we find a match to our pattern 1 9 13 19 days SELECT first_x, last_z FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z ONE ROW PER MATCH PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price))) 23 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching Example: Find W-Shape lasts < 7 days* Stock price Can refer to previous variables X Z Find a W-shape pattern in a ticker stream: Extend the pattern to find W-shapes that lasted less than a week 1 9 13 19 days SELECT first_x, last_z FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z ONE ROW PER MATCH PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price) AND z.time - FIRST(x.time) <= 7 )) 24 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching Example: Find average price within W-Shape* Stock price Average stock price: $52.00 Find a W-shape pattern in a ticker stream: Calculate average price in the second ascent 1 9 13 19 days SELECT first_x, last_z, avg_price FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z, AVG(z.price) AS avg_price ONE ROW PER MATCH PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price) AND z.time - FIRST(x.time) <= 7 )))) 25 * For conceptual clarity, the statement is simplified and ignores an always-true start event. See the notes or documentation for further explanation

SQL Pattern Matching Declarative Pattern Matching 1. Define the partitions/buckets and ordering needed to identify the stream of events you are analyzing Matching within a stream of events (ordered partition of data) MATCH_RECOGNIZE (PARTITION BY stock_name ORDER BY time MEASURES 2. Define the pattern of events and pattern variables identifying the individual events within the pattern Use framework of Perl regular expressions (conditions on rows) PATTERN (X+ Y+ W+ Z+) Define matching using Boolean conditions on rows DEFINE X AS (price > 15) 26

SQL Pattern Matching, cont. Declarative Pattern Matching 3. Define measures: source data points, pattern data points and aggregates related to a pattern MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z, AVG(z.price) AS avg_price 4. Determine how the output will be generated ONE ROW PER MATCH 27

SQL Pattern Matching MATCH_RECOGNIZE Syntax <table_expression> := <table_expression> MATCH_RECOGNIZE ( [ PARTITION BY <cols> ] ) [ ORDER BY <cols> ] [ MEASURES <cols> ] [ ONE ROW PER MATCH ALL ROWS PER MATCH ] [ SKIP_TO_option ] PATTERN ( <row pattern> ) [ SUBSET <subset list> ] DEFINE <definition list> 28

SQL Pattern Matching I ll get my hands dirty 30

ACME Data Set 31

SQL Pattern Matching Real world use cases 32

Hot Off the Press OTN Forum SQL and PL/SQL, 01/31/2014 33

Hot Off the Press OTN Forum SQL and PL/SQL, 01/31/2014 34

if (linenext == null) { next = ""; } else { next = linenext.getquantity(); } Pattern if (!q.isempty() && (prev.isempty() (eq(q, Matching prev) && gt(q, next)))) { with SQL Analytics state = "S"; return state; } Java vs. SQL: Stock Markets - Searching for W Patterns in Trade Data if (gt(q, prev) && gt(q, next)) { state = "T"; return state; } } if (lt(q, prev) && lt(q, next)) { state = "B"; return state; } if (!q.isempty() && (next.isempty() (gt(q, prev) && eq(q, next)))) { state = "E"; return state; } if (q.isempty() eq(q, prev)) { state = "F"; return state; } return state; private boolean eq(string a, String b) { if (a.isempty() b.isempty()) { return false; } return a.equals(b); } private boolean gt(string a, String b) { if (a.isempty() b.isempty()) { return false; } return Double.parseDouble(a) > Double.parseDouble(b); } private boolean lt(string a, String b) { if (a.isempty() b.isempty()) { return false; } return Double.parseDouble(a) < Double.parseDouble(b); } SELECT first_x, last_z FROM ticker MATCH_RECOGNIZE ( PARTITION BY name ORDER BY time MEASURES FIRST(x.time) AS first_x, LAST(z.time) AS last_z ONE ROW PER MATCH PATTERN (X+ Y+ W+ Z+) DEFINE X AS (price < PREV(price)), Y AS (price > PREV(price)), W AS (price < PREV(price)), Z AS (price > PREV(price) AND z.time - FIRST(x.time) <= 7 )) public String getstate() { return this.state; } } BagFactory bagfactory = BagFactory.getInstance(); 250+ Lines of Java and PIG 12 Lines of SQL @Override public Tuple exec(tuple input) throws IOException { long c = 0; String line = ""; String pbkey = ""; V0Line nextline; V0Line thisline; V0Line processline; V0Line evalline = null; V0Line prevline; boolean nomorevalues = false; String matchlist = ""; ArrayList<V0Line> linefifo = new ArrayList<V0Line>(); boolean finished = false; DataBag output = bagfactory.newdefaultbag(); 20x less code, 5x faster 45 if (input == null) { return null; } if (input.size() == 0) { return null;

Analytical SQL in the Database Summary Ranking functions rank, dense_rank, cume_dist, percent_rank, ntile Window Aggregate functions (moving and cumulative) Avg, sum, min, max, count, variance, stddev, first_value, last_value LAG/LEAD functions Direct inter-row reference using offsets Reporting Aggregate functions Sum, avg, min, max, variance, stddev, count, ratio_to_report Statistical Aggregates Correlation, linear regression family, covariance Linear regression Fitting of an ordinary-least-squares regression line to a set Distribution Fitting of number pairs. Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions Descriptive Statistics DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values Correlations Pearson s correlation coefficients, Spearman's and Kendall's (both nonparametric). Cross Tabs Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa Hypothesis Testing Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential 46

Summary New Database 12c SQL Analytics Comprehensive analysis with SQL out of the box ANSI compliant features with some additional extensions Common language SQL speeds up adoption Widely known and used Common syntax reduces learning curve Comprehensive support for SQL based pattern matching Supports a wide range of use cases Simplifies application development Simplifies existing SQL code 47

SQL - the best development language for Big Data? Yes, because SQL is. SIMPLER RICHER FASTER 48

Where to get more information SQL Analytics Home Page on OTN http://www.oracle.com/technetwork/database/bidatawarehousing/sql-analytics-index-1984365.html Oracle By Example Pattern matching Podcasts for pattern matching and SQL analytics Data Sheet Whitepapers Patterns Everywhere - Find then fast! Patterns Everywhere - Find then fast! (Apple ibook) Data Warehouse and SQL Analytics blog http://oracle-big-data.blogspot.co.uk/ 49

50

51

52