TECH TUTORIAL: EMBEDDING ANALYTICS INTO A DATABASE USING SOURCEPRO AND JMSL



Similar documents
Chapter 9 Java and SQL. Wang Yang wyang@njnet.edu.cn

The JAVA Way: JDBC and SQLJ

Database Access from a Programming Language: Database Access from a Programming Language

Database Access from a Programming Language:

SQL and Java. Database Systems Lecture 19 Natasha Alechina

How To Create A C++ Web Service

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

An In-Depth Look at In-Memory Predictive Analytics for Developers

1 File Processing Systems

What is ODBC? Database Connectivity ODBC, JDBC and SQLJ. ODBC Architecture. More on ODBC. JDBC vs ODBC. What is JDBC?

Why Is This Important? Database Application Development. SQL in Application Code. Overview. SQL in Application Code (Contd.

Course Objectives. Database Applications. External applications. Course Objectives Interfacing. Mixing two worlds. Two approaches

Developing Stored Procedures In Java TM. An Oracle Technical White Paper April 1999

Integrating VoltDB with Hadoop

How To Use A Sas Server On A Java Computer Or A Java.Net Computer (Sas) On A Microsoft Microsoft Server (Sasa) On An Ipo (Sauge) Or A Microsas (Sask

CPLEX Tutorial Handout

Effective Java Programming. efficient software development

CSE 373: Data Structure & Algorithms Lecture 25: Programming Languages. Nicki Dell Spring 2014

PHP Language Binding Guide For The Connection Cloud Web Services

Move Data from Oracle to Hadoop and Gain New Business Insights

Hadoop Project for IDEAL in CS5604

Java and Databases. COMP514 Distributed Information Systems. Java Database Connectivity. Standards and utilities. Java and Databases

CS 377 Database Systems SQL Programming. Li Xiong Department of Mathematics and Computer Science Emory University

What s Cooking in KNIME

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

JDBC (Java / SQL Programming) CS 377: Database Systems

Fast Analytics on Big Data with H20

COSC344 Database Theory and Applications. Java and SQL. Lecture 12

Embedded SQL. Unit 5.1. Dr Gordon Russell, Napier University

Introduction to SQL for Data Scientists

Tax Fraud in Increasing

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

CS346: Database Programming.

Analytic Modeling in Python

Report Vertiefung, Spring 2013 Constant Interval Extraction using Hadoop

Architectures for Big Data Analytics A database perspective

Data Tool Platform SQL Development Tools

Database Programming. Week *Some of the slides in this lecture are created by Prof. Ian Horrocks from University of Oxford

COC131 Data Mining - Clustering

Oracle to MySQL Migration

extreme Datamining mit Oracle R Enterprise

Java (12 Weeks) Introduction to Java Programming Language

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

LSINF1124 Projet de programmation

Embedded SQL programming

Developing SQL and PL/SQL with JDeveloper

Creating and Using Databases for Android Applications

Internals of Hadoop Application Framework and Distributed File System

How To Solve The Kd Cup 2010 Challenge

Agenda. ! Strengths of PostgreSQL. ! Strengths of Hadoop. ! Hadoop Community. ! Use Cases

SQL and programming languages

! E6893 Big Data Analytics:! Demo Session II: Mahout working with Eclipse and Maven for Collaborative Filtering

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Accesssing External Databases From ILE RPG (with help from Java)

GPSQL Miner: SQL-Grammar Genetic Programming in Data Mining

RapidMiner Radoop Documentation

Comparing SQL and NOSQL databases

Services. Relational. Databases & JDBC. Today. Relational. Databases SQL JDBC. Next Time. Services. Relational. Databases & JDBC. Today.

Big Data Integration: A Buyer's Guide

A Technical Review of TIBCO Patterns Search

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package Data Federation Administration Tool Guide

Creating PL/SQL Blocks. Copyright 2007, Oracle. All rights reserved.

Oracle8/ SQLJ Programming

On a Hadoop-based Analytics Service System

An Oracle White Paper June Migrating Applications and Databases with Oracle Database 12c

Oracle Data Miner (Extension of SQL Developer 4.0)

SQL Programming. CS145 Lecture Notes #10. Motivation. Oracle PL/SQL. Basics. Example schema:

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Data Mining mit der JMSL Numerical Library for Java Applications

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to:

Data Mining Tools. Jean- Gabriel Ganascia LIP6 University Pierre et Marie Curie 4, place Jussieu, Paris, Cedex 05 Jean-

How To Handle Big Data With A Data Scientist

SAS Enterprise Data Integration Server - A Complete Solution Designed To Meet the Full Spectrum of Enterprise Data Integration Needs

Database Programming with PL/SQL: Learning Objectives

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

RevoScaleR Speed and Scalability

Advanced Big Data Analytics with R and Hadoop

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Introducing Microsoft SQL Server 2012 Getting Started with SQL Server Management Studio

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Numerical Analysis in the Financial Industry

ABSTRACT 1. INTRODUCTION. Kamil Bajda-Pawlikowski

Chapter 13. Introduction to SQL Programming Techniques. Database Programming: Techniques and Issues. SQL Programming. Database applications

A Next-Generation Analytics Ecosystem for Big Data. Colin White, BI Research September 2012 Sponsored by ParAccel

Programming Database lectures for mathema

Product: DQ Order Manager Release Notes

Financial Data Access with SQL, Excel & VBA

Active Learning SVM for Blogs recommendation

Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

CHAPTER 5 INTELLIGENT TECHNIQUES TO PREVENT SQL INJECTION ATTACKS

Abstract. Introduction. Web Technology and Thin Clients. What s New in Java Version 1.1

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Object Oriented Database Management System for Decision Support System.

An Oracle White Paper October In-Database Map-Reduce

Applets, RMI, JDBC Exam Review

Transcription:

TECH TUTORIAL: EMBEDDING ANALYTICS INTO A DATABASE USING SOURCEPRO AND JMSL

This white paper describes how to implement embedded analytics within a database using SourcePro and the JMSL Numerical Library, a native Java library from Rogue Wave Software. The benefits of using embedded analytics include: Real time analysis reports results without data synchronization delay Faster results eliminates the need to move the data across the network Accuracy co-locates the data and the analytics to avoid potential user errors Accessibility allows invoking the data and analytics from any programming language or application that can connect to the database Better quality of data allows preprocessing and cleaning of data before it s stored Higher security data used as input to the analytics never leaves the database and, if necessary, user access is limited to running the stored routines without access to the underlying data Using embedded JMSL offers: Trusted technology JMSL is a known and proven product, built on over 40 years of experience Minimal risk JMSL works with many existing databases as is, so there is no change of platform required We describe in detail how to implement embedded JMSL using a particular relational database management system (RDBMS). We then query the embedded JMSL in RDMBS using SourcePro DB. 2

OVERVIEW While variety is one of the characteristics of big data, it is also a characteristic of how one responds to the challenges to big data. That is, the overall strategy when working with big data needs to be composed of specific tactics and tools for specific challenges. Use of Hadoop and the MapReduce framework is one such tactic 1. Hadoop is ideal for storing and processing very large data sets stored on computer clusters built from inexpensive hardware. But it is not a solution to all the challenges 23. Hadoop s MapReduce embodies one of the fundamental changes when working with big data. Previously, for advanced analysis and processing, data moved from the database to a client machine and results were uploaded back to the database. Now the paradigm is to join the algorithms with the data. The work described in this white paper brings the JMSL algorithms to the data and orchestrates the processing of the data using SourcePro DB. The JMSL Numerical Library is a pure Java numerical library, providing a broad range of advanced mathematics, statistics, and charting for the Java environment. It extends core Java numerics and allows developers to seamlessly integrate advanced analytics into their Java applications. For data mining and predictive analytics applications, JMSL provides algorithms supporting all stages of the data analytical process, including processing and cleaning, exploring and visualizing, data mining, building models, predicting, optimizing, validating, and monitoring. Example algorithms include Naive Bayes, Apriori, K-means clustering, Decision Trees, Neural Nets, Maximum Likelihood Estimation (MLE), Time Series Forecasting with ARMA, Holt-Winters, AutoARIMA, and many others. SourcePro is a complete enterprise C++ development platform. It provides fundamental C++ components as well as robust, reliable cross-platform C++ tools in the areas of analysis, networking, and databases. SourcePro DB, a part of the SourcePro product suite, is a library of C++ classes that provide a high-level, object-oriented, cross-database abstraction over the native C APIs of various database vendors. It provides a high-performance, consistent API encapsulating the database-specific differences in behavior and syntax. EMBEDDING JMSL Relational databases have long had the ability to store and call routines in the native SQL. These routines are usually referred to as stored procedures. In addition, many RDBMSs now implement a JVM within the database. Java, in turn, implements datatypes that are one-to-one mappings of fundamental SQL datatypes. This allows seamless embedding of Java routines. These routines run in the database and eliminate the unnecessary movement of data. Finally, besides single Java classes, one can load Java archives (jar files) directly into the database, allowing any user with sufficient privileges to write Java stored procedures that use classes in the archive. To demonstrate how to embed JMSL, we describe an explicit case using an Oracle 12c database. To demonstrate an example of embedded predictive analytics, we show a complete implementation of the Naïve Bayes classifier algorithm, and create tables in a database containing Fisher iris data (a common data set containing samples for three species of the iris flower) to use as a test case. We then use SourcePro DB to create and execute stored procedures that invoke the embedded Naïve Bayes implementation. The complete Java code is in the appendix. 1 Using JMSL in Hadoop MapReduce applications, Rogue Wave Software, 2015. 2 F1: A Distributed SQL Database that Scales, Research at Google, accessed May 2016. 3 Hadoop is not enough for big data, InfoWorld, October 2013. 3

Loading JMSL The first step is to load the JMSL library into the database. Oracle has a command-line utility called loadjava for adding Java source files, classes, and libraries (jar files) to the database. > loadjava u sys@orcl resolve jmsl.jar Here, the u flag specifies the user and database, and the resolve flag checks for any dependencies of the loaded object. Implementation of a JMSL stored procedure In general, when implementing Java stored procedures on an Oracle database there are three steps: 1. Write the Java class. 2. Load the class into the database with loadjava. 3. Publish the Java class. Write the Java class The Naïve Bayes classifier example has two primary input parameters (four altogether). The first is the training set that contains data categorized into known classes. The second is a data set of unknown type that needs classification. For now, we show just the entry point method to our class: public static int Classify ( ResultSet rst, ResultSet rsc, ResultSet rsrows, ResultSet rsclasses ) Classify returns the previously unknown data s class as an integer. The input parameters are all of type java.sql.resultset. This datatype maps directly to the SQL CURSOR variable type, SYS_REFCURSOR. The inputs are, in order: the training data, the unknown data, the size of the training set, and the number of possible types. This method is part of the RWNaiveBayes.java class that has a complete listing in the appendix. Load the class We use the same command-line utility used for loading the JMSL library: > loadjava verbose u sys@orcl resolve RWNaiveBayes.class We are loading the compiled class rather than the source because our experience tells us that loading source files and compiling them in the database can add complications. In addition to checking for dependencies with the resolve flag, we use verbose output to get full information on the upload. Publish the class methods For each Java method that s called from SQL, you must write a call specification to expose the top-level entry point of the method to Oracle. We will use SourcePro DB to publish the class methods in Oracle and execute them. 4

First, let s connect to the Oracle database. To establish a database connection, we first request an RWDBDatabase instance from the RWDBManager and then instantiate an RWDBConnection object from the database instance. RWDBDatabase db = RWDBManager::database( ORACLE_OCI, <SERVER NAME>, <USER>, <PASSWORD>, ); RWDBConnection conn = db.connection(); if(conn.isvalid()) { std::cout << Connected! << std::endl; else { std::cout << conn.status().message() << std::endl; Now, we create a stored function, RWNaiveBayes, which invokes the JMSL routine with passed in parameters. RWCString rwnaivebayestext( CREATE or REPLACE FUNCTION RWNaiveBayes( \ curs_train IN SYS_REFCURSOR, \ curs_class IN SYS_REFCURSOR, \ curs_numrows IN SYS_REFCURSOR, \ curs_numclass IN SYS_REFCURSOR) \ RETURN NUMBER \ AS LANGUAGE JAVA \ NAME RWNaiveBayes.Classify( \ java.sql.resultset, \ java.sql.resultset, \ java.sql.resultset, \ java.sql.resultset ) return int ; ); if (!conn.executesql(rwnaivebayestext).isvalid()) { std::cout << Failed to create rwnaivebayestext stored function ; return 0; Next, we create a stored procedure, Invoke_RWNaiveBayes, that passes values from the tables iris_trn and faux_iris to the RWNaiveBayes stored function. RWCString stproctext( create or replace procedure Invoke_RWNaiveBayes ( \ val out INTEGER) \ IS \ BEGIN \ SELECT CT into val FROM (SELECT RWNaiveBayes( CURSOR(SELECT classification, sepallength, sepalwidth, petallength, petalwidth FROM iris_trn ), \ CURSOR( SELECT * FROM faux_iris OFFSET 8 ROWS FETCH FIRST 1 ROWS ONLY), \ CURSOR( SELECT COUNT(*) FROM iris_trn ), \ CURSOR( SELECT MAX(classification) FROM iris_trn )) \ AS CT FROM dual); \ END; ); if (!conn.executesql(stproctext).isvalid()) { std::cout << Failed to create stored procedure ; return 0; 5

For demonstration purposes, we have used a single query to classify unknown data and return its type. In practice, one would most likely use stored views rather than the subqueries as in this example. The subquery in line six selects values from the Fisher iris data that has been stored in the table iris_trn. The four selected parameters from the training set are the defining characteristics of the iris data. Line seven selects the eighth row from the table of iris test data stored in faux_iris. Lines eight and nine select the number of rows and the number of classes from the table iris_trn. There are three levels of SELECT being used. The first level gets the classification type from the alias CT. The second level is the function call, SELECT RWNaiveBayes. The third level selects information from the database to pass to the Bayes classifier. Lines six through nine are the four inputs to the Bayes classifier. Typically, when passing queries to any stored procedure, they re in the form of CURSOR expressions5. This is the reason for the syntax: CURSOR( SELECT ). Our point with this example is to show an embedded analytics call solely using basic SQL syntax. It is quite general and can be used with training data that has very many or very few parameters (columns), as long as they are a numeric type. The limit on the size of the training set (rows) is essentially the amount of memory available. Many other algorithms in the JMSL library can be implemented in the same manner. Finally, we invoke the stored procedure we just created and retrieve the output value. int output = 0; RWDBStoredProc stproc = db.storedproc( Invoke_RWNaiveBayes, conn); stproc << &output; if (stproc.execute(conn).isvalid()) { std::cout << Output Classification Value is: << output << std::endl; SUMMARY In this white paper, we have shown explicit steps to embed JMSL in a database to implement a particular algorithm. The Naïve Bayes classifier is just one of many algorithms available in the JMSL library. Using embedded JMSL doesn t require a platform change, so it will work with many existing RDBMS systems. SourcePro DB, with its intuitive API, makes it easy to connect to RDBMS systems and query JMSL algorithms. There is essentially no risk and the only startup cost is writing simple wrapper code to the robust and proven JMSL algorithms. Eliminating the data transfer to separate analysis machines offers increased security and huge throughput improvements. It also removes the possibility of corruption of data due to movements from and to the database. 6

APPENDIX In this appendix, we show the structure of tables used and full implementation of the Java class used in the classifier example. For further information on the algorithm, see the NaiveBayesClassifier entry in the JMSL documentation. We first show the class overview and required imports, and then show the code for the three class methods. Table iris_trn ID INT # ID, used to identify the training pattern SEPALLENGTH DECIMAL(5,1) # sepal length SEPALWIDTH DECIMAL(5,1) # sepal width PETALLENGTH DECIMAL(5,1) # petal length PETALWIDTH DECIMAL(5,1) # petal width SPECIESOFIRIS VARCHAR(20) # name of Iris species CLASSIFICATION INT # Iris species classification Table faux_iris SEPALLENGTH SEPALWIDTH PETALLENGTH PETALWIDTH The class overview DECIMAL(5,1) # sepal length DECIMAL(5,1) # sepal width DECIMAL(5,1) # petal length DECIMAL(5,1) # petal width import java.sql.resultset; import java.sql.resultsetmetadata; import java.sql.sqlexception; import com.imsl.datamining.*; import com.imsl.stat.normaldistribution; public class RWNaiveBayes { // entry point public static int Classify ( ResultSet rst, ResultSet rsc, ResultSet rsrows, ResultSet rsclasses ) throws SQLException { // function over-ride of the entry point method public static int Classify ( ResultSet rst, ResultSet rsc, int nrows, int nclasses ) throws SQLException { // train public static NaiveBayesClassifier Train(ResultSet rs, int nrows, int nclasses) throws SQLException { 7

Entry point public static int Classify ( ResultSet rst, ResultSet rsc, ResultSet rsrows, ResultSet rsclasses ) throws SQLException { int nclass = -1; try { rsrows.next(); int nrows = rsrows.getint(1); rsclasses.next(); int nclasses = rsclasses.getint(1); nclass = Classify( rst, rsc, nrows, nclasses); catch (SQLException e) { System.err.println(e); e.printstacktrace(); return nclass; throw(e); Override of the entry point public static int Classify ( ResultSet rst, ResultSet rsc, int nrows, int nclasses )throws SQLException { int nclass = -1; ResultSetMetaData rsmd = null; try { NaiveBayesClassifier nbtrainer = Train( rst, nrows, nclasses); rsmd = rsc.getmetadata(); int ncontinuous = rsmd.getcolumncount(); double[] continuousinput = new double[ncontinuous]; rsc.next(); for (int jj=0; jj<ncontinuous; jj++) continuousinput[jj]=rsc.getdouble(jj+1); classifiedprobabilities = new double[nclasses]; double[] classifiedprobabilities = nbtrainer.probabilities(continuousinput, null); nclass = nbtrainer.predictclass(continuousinput, null) + 1; catch (SQLException e) { System.err.println(e); e.printstacktrace(); return nclass; throw(e); 8

The trainer public static NaiveBayesClassifier Train(ResultSet rs, int nrows, int nclasses) throws SQLException { NaiveBayesClassifier nbtrainer = null; ResultSetMetaData rsmd = null; try { rsmd = rs.getmetadata(); int ncolumns = rsmd.getcolumncount(); int ncontinuous = ncolumns-1; int nnominal =0; /* no nominal input attributes */ int[] classification = new int [nrows]; double[][] continuousdata = new double[nrows][ncontinuous]; int ii=0; while( rs.next() ) { // get the class type classification[ii] = rs.getint(1)-1; // get the attributes for (int jj=0; jj<ncontinuous; jj++) continuousdata[ii][jj] = rs.getdouble(jj+2); ii++; nbtrainer = new NaiveBayesClassifier(nContinuous, nnominal, nclasses); for (int i=0; i<ncontinuous; i++) nbtrainer.createcontinuousattribute(new NormalDistribution()); // train the classifier nbtrainer.train(continuousdata, classification ); catch (SQLException e) { System.err.println(e); e.printstacktrace(); throw(e); return nbtrainer; 9

Rogue Wave provides software development tools for mission-critical applications. Our trusted solutions address the growing complexity of building great software and accelerates the value gained from code across the enterprise. The Rogue Wave portfolio of complementary, cross-platform tools helps developers quickly build applications for strategic software initiatives. With Rogue Wave, customers improve software quality and ensure code integrity, while shortening development cycle times. 2016 Rogue Wave Software, Inc. All rights reserved.