Universal PMML Plug-in for EMC Greenplum Database
Delivering Massively Parallel Predictions

Zementis, Inc.
info@zementis.com
USA: 6125 Cornerstone Court East, Suite #250, San Diego, CA 92121, T +1 (619) 330-0780
Asia: 19/F, Unit A, Ho Lee Commercial Bldg., 38-44 D'Aguilar Street, Central, Hong Kong, T +852 2868 0878
Delivering Massively Parallel Predictions

As advanced analytics becomes pervasive across the enterprise to drive better business decisions, the need for efficient execution of predictive models is paramount. Zementis and Greenplum have joined forces to help companies easily bring predictive models into their database and score huge amounts of data in place and in parallel. This joint product combines the Zementis Universal PMML Plug-in for execution of predictive models with the power and scale of the EMC Greenplum Database. The result is an end-to-end solution that enhances Greenplum's large-scale analytics processing capabilities with scoring of standards-based predictive models on a massively parallel architecture. By embedding predictive analytics directly in the database, this solution minimizes the movement of data and enables the efficient in-place processing of very large data sets. In this whitepaper, we demonstrate how to deploy and execute predictive models from several statistical tools, including IBM SPSS and the open source R program.

Predictive Model Markup Language (PMML)

As the de facto standard for data mining models, PMML provides tremendous benefits for business, IT, and the data mining industry in general. Developed by the Data Mining Group (DMG, http://www.dmg.org), an independent, vendor-led consortium, PMML increases business agility by eliminating the need for proprietary solutions or custom code development. Today, it is supported by all the major data mining tools, commercial and open source. As an open standard, it enables project stakeholders to standardize on one common representation for data mining models. It practically eliminates the barriers and gaps between development and production deployment of predictive analytics. In effect, it minimizes the complexity, cost, and time needed to turn predictive models into operable IT and business assets.
Because PMML is the lingua franca for predictive analytics, data mining models can be easily exchanged between PMML-compliant applications. In this way, a model may be built in one statistical tool and easily moved to another for production deployment or visualization. PMML also serves as a bridge between all the teams involved in the data mining process inside a company, since it can be used to disseminate knowledge and best practices. In a world in which sensors and data gathering are becoming more and more pervasive, predictive analytics and standards such as PMML make it possible for organizations to benefit from smart solutions that will truly revolutionize their business.
Zementis Universal PMML Plug-in

The Universal PMML Plug-in (Figure 1) builds on the heritage of Zementis's flagship product, the ADAPA Decision Engine, a web services-based framework for the execution of predictive analytics and rules, available onsite or as a cloud computing platform. The Universal PMML Plug-in is a highly optimized, in-database scoring engine for predictive models, fully supporting the PMML standard. With PMML, the Plug-in delivers a wide range of predictive analytics for high-performance scoring. It shortens time to market for predictive models and empowers users through instant deployment of predictive models.

Figure 1: The Universal PMML Plug-in. Data in, predictions out.

In the context of in-database scoring, it allows us to execute predictive models from all major commercial and open source data mining tools within the database, minimizing data movement and maximizing processing efficiency. Very large data sets can be easily scored against a variety of predictive models, including neural network models, regression models, support vector machines, and decision trees (as well as a host of other advanced analytic techniques).

Besides the models themselves, the Universal PMML Plug-in also supports data pre- and post-processing. This is possible because the latest version of the PMML standard includes built-in functions for arithmetic calculations, string manipulations, and logic operations. An entire predictive solution, one that operates from raw data all the way to predictions, can be represented in PMML and directly used in the Universal PMML Plug-in for data scoring.

The Universal PMML Plug-in supports not only the latest version of PMML but also older versions. In fact, it is version agnostic, since it incorporates a converter which automatically converts older versions of PMML to the newest one.
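To make the pre-processing idea concrete, the following Python sketch evaluates a small PMML derived field built from the standard's arithmetic built-in functions. It is only an illustration under stated assumptions: the field names temp_f and temp_c and the helper functions evaluate and derive are hypothetical, the evaluator handles only the four binary arithmetic operators, and it does not represent how the Universal PMML Plug-in itself is implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML fragment: a derived field that converts a Fahrenheit
# reading (temp_f) to Celsius (temp_c) using the built-in "-" and "/" functions.
DERIVED_FIELD = """
<DerivedField name="temp_c" optype="continuous" dataType="double">
  <Apply function="/">
    <Apply function="-">
      <FieldRef field="temp_f"/>
      <Constant>32</Constant>
    </Apply>
    <Constant>1.8</Constant>
  </Apply>
</DerivedField>
"""

# Only the four binary arithmetic built-ins are covered in this sketch.
ARITHMETIC = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def evaluate(node, record):
    """Recursively evaluate a Constant, FieldRef, or Apply element."""
    if node.tag == "Constant":
        return float(node.text)
    if node.tag == "FieldRef":
        return float(record[node.attrib["field"]])
    if node.tag == "Apply":
        function = node.attrib["function"]
        if function not in ARITHMETIC:
            raise ValueError(f"unsupported function in this sketch: {function}")
        left, right = (evaluate(child, record) for child in node)
        return ARITHMETIC[function](left, right)
    raise ValueError(f"unsupported element in this sketch: {node.tag}")


def derive(xml_text, record):
    """Return (derived field name, value) for one input record."""
    field = ET.fromstring(xml_text)
    expression = field[0]  # the single expression inside the DerivedField
    return field.attrib["name"], evaluate(expression, record)


if __name__ == "__main__":
    print(derive(DERIVED_FIELD, {"temp_f": 212.0}))
```

Calling derive(DERIVED_FIELD, {"temp_f": 212.0}) returns the field name temp_c together with a value of approximately 100. In a real PMML file such a derived field would sit alongside the model so that raw inputs can be transformed before scoring.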
EMC Greenplum Database Architecture

The EMC Greenplum Database utilizes a shared-nothing MPP (massively parallel processing) architecture that has been designed from the ground up for BI and analytical processing using commodity hardware. In this architecture, data is automatically partitioned across multiple 'segment' servers, and each 'segment' owns and manages a distinct portion of the overall data. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e., it is a 'shared-nothing' architecture).

Most of today's general-purpose relational database management systems (e.g., Oracle, Microsoft SQL Server) were originally designed for Online Transaction Processing (OLTP) applications. These databases utilize 'shared-disk' or 'shared-everything' architectures that are optimized for high transaction rates at the expense of individual query performance and parallelism.

Greenplum's shared-nothing MPP architecture (Figure 2) provides every segment with a dedicated, independent high-bandwidth channel to its disk. The segment servers are able to process every query in a fully parallel manner,
use all disk connections simultaneously, and efficiently flow data between segments as query plans dictate. The degree of parallelism and overall scalability that this allows far exceeds that of general-purpose database systems.

Figure 2: Greenplum's shared-nothing MPP architecture

Universal PMML Plug-in for the EMC Greenplum Database

The Universal PMML Plug-in for the EMC Greenplum Database enables execution of standards-based predictive analytics directly within the Greenplum Database. It seamlessly embeds the Universal PMML Plug-in into Greenplum's shared-nothing, massively parallel processing (MPP) architecture. The Universal PMML Plug-in's own shared-nothing design philosophy and replication flexibility fit naturally into multi-server environments. With Greenplum, each individual server (with a dedicated, independent, high-bandwidth channel connection to local disks) houses a separate Universal PMML Plug-in instance that can take full advantage of these local resources (Figure 3). The net result is the ability to leverage the power of standards-based predictive analytics on a massive scale, right where the data resides.

The Universal PMML Plug-in for the EMC Greenplum Database not only delivers high-performance model execution but also does so in an easy and seamless manner. With a couple of simple steps, PMML models are distributed to all segments of the Greenplum installation and are made available for execution. Each model is presented as a separate SQL function that can be used in any query. The name, input parameters, and outputs of each function match the name, input fields, and output fields of the corresponding model as defined in its PMML file. This way, scoring a
data set with one or more models becomes as simple as writing a SQL statement on that data set. Predictions (scores, probabilities, categories, clusters, etc.) can be just as easily written back to the database, become part of a report, or passed on to an application.

Figure 3: Each individual server houses a separate Universal PMML Plug-in instance.

In addition, the Universal PMML Plug-in includes the popular Zementis PMML Converter. This means that it accepts PMML models of all versions (2.0, 2.1, 3.0, 3.1, 3.2, and 4.0) generated by any of the major commercial and open source mining tools.

Example: Use IBM SPSS and R Models in Greenplum

The Universal PMML Plug-in for the EMC Greenplum Database ships with several sample PMML models. A number of these predictive models were created with the well-known El Nino data set. This data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. The data is expected to aid in the understanding and prediction of El Niño/Southern Oscillation (ENSO) cycles (see http://kdd.ics.uci.edu). Here, we discuss two of these models:

- A neural network model built in IBM SPSS Statistics
- A linear regression model built in R
After being built, all models were directly exported into PMML, since IBM SPSS and R provide comprehensive support for the PMML standard. The steps to install and use these models in Greenplum using the PMML Plug-in are:

1. Prepare and copy PMML files into the Greenplum segments
2. Run the automatically generated script to define the corresponding SQL functions
3. Run queries using the new SQL functions

Each step is described in detail below.

Prepare and Copy PMML Files

In the first step, a script needs to be run to validate the provided PMML files, copy them into the Greenplum segments, and generate a SQL script containing the function definitions for all the provided models. Below we present an excerpt from the SQL script generated for the two sample models.

CREATE FUNCTION SPSS_Neural_Network_ElNino(float8,float8,float8,float8,float8,float8) RETURNS float8 AS
CREATE FUNCTION R_LinearRegression_ElNino(float8,float8,float8,float8,float8,float8) RETURNS float8 AS

To put these definitions in context, below we present the code for the PMML data dictionary and mining schema for the IBM SPSS Neural Network model as listed in the corresponding PMML file.
<DataDictionary numberOfFields="7">
  <DataField name="humidity" optype="continuous" dataType="double"/>
  <DataField name="latitude" optype="continuous" dataType="double"/>
  <DataField name="longitude" optype="continuous" dataType="double"/>
  <DataField name="mer_winds" optype="continuous" dataType="double"/>
  <DataField name="s_s_temp" optype="continuous" dataType="double"/>
  <DataField name="zon_winds" optype="continuous" dataType="double"/>
  <DataField name="airtemp" optype="continuous" dataType="double"/>
</DataDictionary>
<NeuralNetwork functionName="regression" activationFunction="logistic" modelName="spss Neural Network - ElNino">
  <MiningSchema>
    <MiningField name="humidity" usageType="active" optype="continuous"/>
    <MiningField name="latitude" usageType="active" optype="continuous"/>
    <MiningField name="longitude" usageType="active" optype="continuous"/>
    <MiningField name="mer_winds" usageType="active" optype="continuous"/>
    <MiningField name="s_s_temp" usageType="active" optype="continuous"/>
    <MiningField name="zon_winds" usageType="active" optype="continuous"/>
    <MiningField name="airtemp" usageType="predicted" optype="continuous"/>
  </MiningSchema>

In the SQL script, each model is presented as a function with six numeric parameters; both functions work on the same data and return one numeric value. The name of the SQL function is created from the name of the model (SPSS_Neural_Network_ElNino). The six numeric parameters correspond to the six input (active mining) fields of
type double defined in the PMML file (humidity, latitude, longitude, mer_winds, s_s_temp, and zon_winds). Finally, the numeric return value of the SQL function reflects the predicted output field of type double (airtemp).

Run SQL Script to Create SQL Functions

The second step is to run the generated SQL script to create the new functions. After the new functions are created, the predictive models are ready to be used in SQL queries like any other built-in or custom function.

Execute Queries to Score Data

With the installation steps completed, the predictive models can be easily used in SQL queries. Below is an example of such a query:

SELECT buoy_day_id,
       SPSS_Neural_Network_ElNino(latitude, longitude, zon_winds, mer_winds, humidity, s_s_temp) AS airtemp
FROM elnino_input

Getting predictions from the two models at the same time is just as easy:

SELECT buoy_day_id,
       R_LinearRegression_ElNino(latitude, longitude, zon_winds, mer_winds, humidity, s_s_temp) AS airtemp_r,
       SPSS_Neural_Network_ElNino(latitude, longitude, zon_winds, mer_winds, humidity, s_s_temp) AS airtemp_nn
FROM elnino_input

Advantages of the Universal PMML Plug-in for the EMC Greenplum Database

Zementis and EMC Greenplum bring together two essential technologies, offering the best combination of open standards and scalability for the in-database application of predictive analytics. The Universal PMML Plug-in delivers instant and scalable scoring for big data while retaining compatibility with most major data mining tools through the PMML standard.
In summary, the Universal PMML Plug-in for the EMC Greenplum Database:

- Integrates advanced analytical algorithms directly into the database engine for high-performance scoring in a massively parallel environment
- Supports the PMML standard to avoid time-consuming and expensive one-off predictive analytics projects
- Executes predictive models from all major commercial and open source data mining tools
- Minimizes data movement to enable efficient processing of very large data sets
- Reduces total cost of ownership (TCO) for the analytical environment by means of streamlined and platform-independent data mining processes
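As a closing illustration of the El Nino example, the correspondence between a model's mining schema and its SQL function can be sketched in a few lines of Python. This is a hedged illustration rather than the plug-in's actual script generator: the helper names sql_function_name and describe_function are hypothetical, the rule for turning a model name into an identifier is an assumption, every field is assumed to be of type double and mapped to float8, and parameters are assumed to follow mining-schema order, which this paper does not confirm (the sample queries list the columns in a different order, so the generated script remains the authority).

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the mining schema shown in the example section.
MINING_SCHEMA = """
<NeuralNetwork functionName="regression" modelName="spss Neural Network - ElNino">
  <MiningSchema>
    <MiningField name="humidity" usageType="active"/>
    <MiningField name="latitude" usageType="active"/>
    <MiningField name="longitude" usageType="active"/>
    <MiningField name="mer_winds" usageType="active"/>
    <MiningField name="s_s_temp" usageType="active"/>
    <MiningField name="zon_winds" usageType="active"/>
    <MiningField name="airtemp" usageType="predicted"/>
  </MiningSchema>
</NeuralNetwork>
"""


def sql_function_name(model_name):
    # Assumed rule: every run of non-alphanumeric characters becomes one underscore.
    return re.sub(r"[^0-9A-Za-z]+", "_", model_name).strip("_")


def describe_function(xml_text):
    """Return (signature, input field names, predicted field names)."""
    model = ET.fromstring(xml_text)
    fields = model.find("MiningSchema").findall("MiningField")
    # In PMML, a MiningField without usageType is treated as active.
    inputs = [f.attrib["name"] for f in fields
              if f.attrib.get("usageType", "active") == "active"]
    outputs = [f.attrib["name"] for f in fields
               if f.attrib.get("usageType") == "predicted"]
    name = sql_function_name(model.attrib["modelName"])
    # Assumption: all fields are PMML doubles, mapped here to float8.
    parameters = ",".join(["float8"] * len(inputs))
    return f"{name}({parameters}) RETURNS float8", inputs, outputs


if __name__ == "__main__":
    signature, inputs, outputs = describe_function(MINING_SCHEMA)
    print(signature)
    print("inputs:", inputs)
    print("predicted:", outputs)
```

For the sample schema, the sketch produces a six-parameter float8 signature with airtemp as the predicted field. Greenplum, like PostgreSQL, folds unquoted identifiers to lower case, so the capitalization of the derived function name does not affect how it is called in a query.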
About Greenplum and the EMC Data Computing Products Division

EMC's new Data Computing Products Division is driving the future of data warehousing and analytics with breakthrough products including Greenplum Database 4.1, Greenplum Data Computing Appliance (DCA), Greenplum Database Single-Node Edition, Greenplum Community Edition, and Greenplum Chorus. The division's products embody the power of open systems, cloud computing, virtualization, and social collaboration, enabling global organizations to gain greater insight and value from their data than ever before possible. For more information, please visit http://www.greenplum.com

About Zementis

Zementis, Inc. is a leading software company focused on the operational deployment and integration of predictive analytics and data mining solutions. Its ADAPA decision engine successfully bridges the gap between science and engineering. ADAPA and the Universal PMML Plug-in are designed from the ground up to benefit from open standards and to significantly shorten the time-to-market for predictive analytics in any industry. For more information, please visit http://www.zementis.com