Universal PMML Plug-in for EMC Greenplum Database




Delivering Massively Parallel Predictions

Zementis, Inc.
info@zementis.com
USA: 6125 Cornerstone Court East, Suite #250, San Diego, CA 92121, T +1 (619) 330-0780
Asia: 19/F, Unit A, Ho Lee Commercial Bldg., 38-44 D'Aguilar Street, Central, Hong Kong, T +852 2868 0878

Delivering Massively Parallel Predictions

As advanced analytics becomes pervasive across the enterprise to drive better business decisions, the need for efficient execution of predictive models is paramount. Zementis and Greenplum have joined forces to help companies easily bring predictive models into their database and score huge amounts of data in place and in parallel. This joint product combines the Zementis Universal PMML Plug-in for execution of predictive models with the power and scale of the EMC Greenplum Database. The result is an end-to-end solution that enhances Greenplum's large-scale analytics processing capabilities with scoring of standards-based predictive models on a massively parallel architecture. By embedding predictive analytics directly in the database, this solution minimizes the movement of data and enables the efficient in-place processing of very large data sets. In this whitepaper, we demonstrate how to deploy and execute predictive models from several statistical tools, including IBM SPSS and the open source R program.

Predictive Model Markup Language (PMML)

As the de facto standard for data mining models, PMML provides tremendous benefits for business, IT, and the data mining industry in general. Developed by the Data Mining Group (DMG, http://www.dmg.org), an independent, vendor-led consortium, PMML increases business agility by eliminating the need for proprietary solutions or custom code development. Today, it is supported by all the major data mining tools, commercial and open source. As an open standard, it enables project stakeholders to standardize on one common representation for data mining models. It practically eliminates the barriers and gaps between development and production deployment of predictive analytics. In effect, it minimizes the complexity, cost, and time needed to turn predictive models into operable IT and business assets.
Because PMML is the lingua franca for predictive analytics, data mining models can be easily exchanged between PMML-compliant applications. In this way, a model may be built in one statistical tool and easily moved to another for production deployment or visualization. PMML also serves as a bridge between all the teams involved in the data mining process inside a company, since it can be used to disseminate knowledge and best practices. In a world in which sensors and data gathering are becoming more and more pervasive, predictive analytics and standards such as PMML make it possible for organizations to benefit from smart solutions that will truly revolutionize their business.

Zementis Universal PMML Plug-in

The Universal PMML Plug-in (Figure 1) builds on the heritage of Zementis's flagship product, the ADAPA Decision Engine, a web services-based framework for the execution of predictive analytics and rules, available on site or as a cloud computing platform. The Universal PMML Plug-in is a highly optimized, in-database scoring engine for predictive models that fully supports the PMML standard. With PMML, the Plug-in delivers a wide range of predictive analytics for high-performance scoring. It shortens time to market for predictive models and empowers users through instant deployment of predictive models.

Figure 1: The Universal PMML Plug-in. Data in, predictions out.

In the context of in-database scoring, it allows us to execute predictive models from all major commercial and open source data mining tools within the database, minimizing data movement and maximizing processing efficiency. Very large data sets can be easily scored against a variety of predictive models, including neural network models, regression models, support vector machines, and decision trees (as well as a host of other advanced analytic techniques).

Besides the models themselves, the Universal PMML Plug-in also supports data pre- and post-processing. That is because the latest version of the PMML standard includes built-in functions for arithmetic calculations, string manipulation, and logic operations. An entire predictive solution, one that operates from raw data all the way to predictions, can be represented in PMML and used directly in the Universal PMML Plug-in for data scoring.

The Universal PMML Plug-in supports not only the latest version of PMML but also older versions. In fact, it is version agnostic, since it incorporates a converter that automatically converts older versions of PMML to the newest one.
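To make "version agnostic" a little more concrete: a PMML document states its version in the `version` attribute of its root `PMML` element, so a converter can decide from that attribute alone whether a file needs upgrading. The Python sketch below only reads and checks that attribute. It is not the Zementis converter; the helper names `pmml_version` and `needs_conversion` and the tiny sample document are made up for illustration, and the set of accepted versions is the one this paper lists for the converter.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down PMML document; a real file also holds a model element.
SAMPLE_PMML = """<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <Header description="illustrative sample"/>
  <DataDictionary numberOfFields="0"/>
</PMML>"""

# Versions this whitepaper lists as accepted by the converter.
ACCEPTED_VERSIONS = {"2.0", "2.1", "3.0", "3.1", "3.2", "4.0"}


def pmml_version(pmml_text: str) -> str:
    """Return the value of the root element's `version` attribute."""
    root = ET.fromstring(pmml_text)
    # Drop the XML namespace: "{http://www.dmg.org/PMML-3_2}PMML" -> "PMML".
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name != "PMML":
        raise ValueError(f"not a PMML document: root element is {local_name!r}")
    return root.attrib["version"]


def needs_conversion(pmml_text: str, newest: str = "4.0") -> bool:
    """True if the document is an accepted but older version than `newest`."""
    version = pmml_version(pmml_text)
    if version not in ACCEPTED_VERSIONS:
        raise ValueError(f"unsupported PMML version: {version}")
    return version != newest


print(pmml_version(SAMPLE_PMML), needs_conversion(SAMPLE_PMML))
```

An actual conversion rewrites the document's elements as well; this sketch covers only the detection step.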
EMC Greenplum Database Architecture

The EMC Greenplum Database utilizes a shared-nothing MPP (massively parallel processing) architecture that has been designed from the ground up for BI and analytical processing using commodity hardware. In this architecture, data is automatically partitioned across multiple 'segment' servers, and each segment owns and manages a distinct portion of the overall data. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e., it is a 'shared-nothing' architecture).

Most of today's general-purpose relational database management systems (e.g., Oracle, Microsoft SQL Server) were originally designed for Online Transaction Processing (OLTP) applications. These databases utilize 'shared-disk' or 'shared-everything' architectures that are optimized for high transaction rates at the expense of individual query performance and parallelism.

Greenplum's shared-nothing MPP architecture (Figure 2) provides every segment with a dedicated, independent high-bandwidth channel to its disk. The segment servers are able to process every query in a fully parallel manner, use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate. The degree of parallelism and overall scalability that this allows far exceeds that of general-purpose database systems.

Figure 2: Greenplum's shared-nothing MPP architecture.

Universal PMML Plug-in for the EMC Greenplum Database

The Universal PMML Plug-in for the EMC Greenplum Database enables execution of standards-based predictive analytics directly within the Greenplum Database. It seamlessly embeds the Universal PMML Plug-in into Greenplum's shared-nothing, massively parallel processing (MPP) architecture. The Plug-in's own shared-nothing design philosophy and replication flexibility fit naturally into multi-server environments. With Greenplum, each individual server (with a dedicated, independent, high-bandwidth channel connection to local disks) houses a separate Universal PMML Plug-in instance that can take full advantage of these local resources (Figure 3). The net result is the ability to leverage the power of standards-based predictive analytics on a massive scale, right where the data resides.

The EMC Greenplum PMML Plug-in not only delivers high-performance model execution but does so in an easy and seamless manner. With a couple of simple steps, PMML models are distributed to all segments of the Greenplum installation and are made available for execution. Each model is presented as a separate SQL function that can be used in any query. The name, input parameters, and outputs of each function match the name, input fields, and output fields of the corresponding model as defined in its PMML file. This way, scoring a data set with one or more models becomes as simple as writing a SQL statement on that data set. Predictions (scores, probabilities, categories, clusters, etc.) can be just as easily written back to the database, become part of a report, or be passed on to an application.

Figure 3: Each individual server houses a separate Universal PMML Plug-in instance.

In addition, the Universal PMML Plug-in includes the popular Zementis PMML Converter. This means that it accepts PMML models of all versions (2.0, 2.1, 3.0, 3.1, 3.2, and 4.0) generated by any of the major commercial and open source mining tools.

Example: Use IBM SPSS and R Models in Greenplum

The Universal PMML Plug-in for the EMC Greenplum Database ships with several sample PMML models. A number of these predictive models were created with the well-known El Nino data set. This data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Nino/Southern Oscillation (ENSO) cycles (see http://kdd.ics.uci.edu). Here, we discuss two of these models:

- A neural network model built in IBM SPSS Statistics;
- A linear regression model built in R.
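Before the deployment steps, it may help to picture the per-segment scoring described in the previous section. The Python simulation below is a toy model, not Greenplum or Plug-in code: the modulo-based distribution on an `id` column, the segment count, and the `score` function with made-up coefficients are all assumptions chosen for illustration. It shows only the shape of the computation: every row belongs to exactly one segment, each segment scores its own rows, and only the predictions are combined afterwards.

```python
from collections import defaultdict

NUM_SEGMENTS = 4  # assumed segment count, for illustration only


def score(row):
    """Stand-in for a deployed model; the coefficients are made up."""
    return 2.0 * row["x"] + 1.0


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Assign each row to exactly one segment by its distribution key."""
    segments = defaultdict(list)
    for row in rows:
        segments[row["id"] % num_segments].append(row)
    return segments


def score_in_place(segments):
    """Each segment scores only the rows it owns; no rows move between segments."""
    return {
        seg_id: [(row["id"], score(row)) for row in seg_rows]
        for seg_id, seg_rows in segments.items()
    }


def gather(per_segment_scores):
    """Combine the per-segment predictions, as a coordinating node would."""
    merged = [pair for scores in per_segment_scores.values() for pair in scores]
    return sorted(merged)


rows = [{"id": i, "x": float(i)} for i in range(10)]
segments = distribute(rows)
results = gather(score_in_place(segments))
print(results[:3])
```

In this picture, adding segments shortens each segment's list of rows while the scoring function stays unchanged, which is the intuition behind placing one scoring instance next to each portion of the data.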

After being built, all models were exported directly into PMML, since IBM SPSS and R provide comprehensive support for the PMML standard. The steps to install and use these models in Greenplum using the PMML Plug-in are:

1. Prepare and copy PMML files into the Greenplum segments
2. Run the automatically generated script to define the corresponding SQL functions
3. Run queries using the new SQL functions

Each step is described in detail below.

Prepare and Copy PMML Files

In the first step, a script needs to be run to validate the provided PMML files, copy them into the Greenplum segments, and generate a SQL script containing the function definitions for all the provided models. Below we present an excerpt from the SQL script generated for the two sample models.

    CREATE FUNCTION SPSS_Neural_Network_ElNino(float8,float8,float8,float8,float8,float8) RETURNS float8 AS
    CREATE FUNCTION R_LinearRegression_ElNino(float8,float8,float8,float8,float8,float8) RETURNS float8 AS

To put these definitions in context, below we present the code for the PMML data dictionary and mining schema for the IBM SPSS Neural Network model as listed in the corresponding PMML file.
    <DataDictionary numberOfFields="7">
      <DataField name="humidity" optype="continuous" dataType="double"/>
      <DataField name="latitude" optype="continuous" dataType="double"/>
      <DataField name="longitude" optype="continuous" dataType="double"/>
      <DataField name="mer_winds" optype="continuous" dataType="double"/>
      <DataField name="s_s_temp" optype="continuous" dataType="double"/>
      <DataField name="zon_winds" optype="continuous" dataType="double"/>
      <DataField name="airtemp" optype="continuous" dataType="double"/>
    </DataDictionary>
    <NeuralNetwork functionName="regression" activationFunction="logistic" modelName="SPSS Neural Network - ElNino">
      <MiningSchema>
        <MiningField name="humidity" usageType="active" optype="continuous"/>
        <MiningField name="latitude" usageType="active" optype="continuous"/>
        <MiningField name="longitude" usageType="active" optype="continuous"/>
        <MiningField name="mer_winds" usageType="active" optype="continuous"/>
        <MiningField name="s_s_temp" usageType="active" optype="continuous"/>
        <MiningField name="zon_winds" usageType="active" optype="continuous"/>
        <MiningField name="airtemp" usageType="predicted" optype="continuous"/>
      </MiningSchema>

In the SQL script, each model is presented as a function with six numeric parameters; both functions work on the same input data and return one numeric value. The name of the SQL function is created from the name of the model (SPSS_Neural_Network_ElNino). The six numeric parameters correspond to the six input (active mining) fields of type double defined in the PMML file (humidity, latitude, longitude, mer_winds, s_s_temp, and zon_winds). Finally, the numeric return value of the SQL function reflects the predicted output field of type double (airtemp).

Run SQL Script to Create SQL Functions

The second step is to run the generated SQL script to create the new functions. After the new functions are created, the predictive models are ready to be used in SQL queries like any other built-in or custom function.

Execute Queries to Score Data

With the installation steps completed, the predictive models can be easily used in SQL queries. Below is an example of such a query:

    SELECT buoy_day_id,
           SPSS_Neural_Network_ElNino(latitude, longitude, zon_winds, mer_winds, humidity, s_s_temp) AS airtemp
    FROM elnino_input

Getting predictions from the two models at the same time is just as easy:

    SELECT buoy_day_id,
           R_LinearRegression_ElNino(latitude, longitude, zon_winds, mer_winds, humidity, s_s_temp) AS airtemp_r,
           SPSS_Neural_Network_ElNino(latitude, longitude, zon_winds, mer_winds, humidity, s_s_temp) AS airtemp_nn
    FROM elnino_input

Advantages of the Universal PMML Plug-in for the EMC Greenplum Database

Zementis and EMC Greenplum bring together two essential technologies, offering the best combination of open standards and scalability for the in-database application of predictive analytics. The Universal PMML Plug-in delivers instant and scalable scoring for big data while retaining compatibility with most major data mining tools through the PMML standard.
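As an illustration of the model-to-function mapping used in the example above, the Python sketch below derives a SQL function signature from a PMML data dictionary and mining schema. It is a guess at the kind of logic involved, not the Plug-in's actual script: the helper name `sql_signature`, the `double`-to-`float8` mapping table, the rule for turning a model name into an identifier, and the assumption that arguments follow the mining schema order are all hypothetical. The embedded PMML is a trimmed, well-formed version of the excerpt shown earlier.

```python
import re
import xml.etree.ElementTree as ET

FIELDS = ["humidity", "latitude", "longitude", "mer_winds", "s_s_temp", "zon_winds"]

# Trimmed PMML built from the excerpt in this paper (namespace omitted for brevity).
PMML_EXCERPT = (
    '<PMML version="4.0"><DataDictionary numberOfFields="7">'
    + "".join(f'<DataField name="{n}" optype="continuous" dataType="double"/>'
              for n in FIELDS + ["airtemp"])
    + '</DataDictionary>'
    + '<NeuralNetwork functionName="regression" modelName="SPSS Neural Network - ElNino">'
    + "<MiningSchema>"
    + "".join(f'<MiningField name="{n}" usageType="active"/>' for n in FIELDS)
    + '<MiningField name="airtemp" usageType="predicted"/>'
    + "</MiningSchema></NeuralNetwork></PMML>"
)

# Assumed type mapping; only `double` occurs in the example.
SQL_TYPES = {"double": "float8"}


def sql_signature(pmml_text):
    """Return (signature, input_field_names) for the first model in the document."""
    root = ET.fromstring(pmml_text)
    data_types = {f.get("name"): f.get("dataType") for f in root.iter("DataField")}
    # The model element is the top-level child that carries a MiningSchema.
    model = next(el for el in root if el.find("MiningSchema") is not None)
    fields = model.find("MiningSchema").findall("MiningField")
    # In PMML, a MiningField without usageType defaults to "active".
    inputs = [f.get("name") for f in fields if f.get("usageType", "active") == "active"]
    outputs = [f.get("name") for f in fields if f.get("usageType") == "predicted"]
    # Hypothetical naming rule: collapse runs of non-alphanumerics into "_".
    name = re.sub(r"[^0-9A-Za-z]+", "_", model.get("modelName")).strip("_")
    args = ",".join(SQL_TYPES[data_types[n]] for n in inputs)
    returns = SQL_TYPES[data_types[outputs[0]]]
    return f"{name}({args}) RETURNS {returns}", inputs


signature, input_fields = sql_signature(PMML_EXCERPT)
print("CREATE FUNCTION " + signature)
```

Under these assumptions the derived signature agrees with the generated `CREATE FUNCTION` excerpt for the SPSS model: six `float8` arguments, one per active mining field, and a `float8` return value for the predicted field.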
In summary, the Universal PMML Plug-in for the EMC Greenplum Database:

- Integrates advanced analytical algorithms directly into the database engine for high-performance scoring in a massively parallel environment;
- Supports the PMML standard to avoid time-consuming and expensive one-off predictive analytics projects;
- Executes predictive models from all major commercial and open source data mining tools;
- Minimizes data movement to enable efficient processing of very large data sets; and
- Reduces total cost of ownership (TCO) of the analytical environment by means of streamlined and platform-independent data mining processes.

About Greenplum and the EMC Data Computing Products Division

EMC's new Data Computing Products Division is driving the future of data warehousing and analytics with breakthrough products including Greenplum Database 4.1, Greenplum Data Computing Appliance (DCA), Greenplum Database Single-Node Edition, Greenplum Community Edition, and Greenplum Chorus. The division's products embody the power of open systems, cloud computing, virtualization, and social collaboration, enabling global organizations to gain greater insight and value from their data than ever before possible. For more information, please visit http://www.greenplum.com

About Zementis

Zementis, Inc. is a leading software company focused on the operational deployment and integration of predictive analytics and data mining solutions. Its ADAPA decision engine successfully bridges the gap between science and engineering. ADAPA and the Universal PMML Plug-in are designed from the ground up to benefit from open standards and to significantly shorten the time-to-market for predictive analytics in any industry. For more information, please visit http://www.zementis.com