Components for Analytic Prototyping and Production Deployment
PyIMSL Studio Features
A White Paper by Visual Numerics
August 2009
www.vni.com
by Visual Numerics, a Rogue Wave Software Company

© 2009 by Visual Numerics, Inc. All Rights Reserved. Printed in the United States of America.

Publishing History:
February 2009: Initial publication
August 2009: Update

Trademark Information

Visual Numerics, IMSL and PV-WAVE are registered trademarks. JMSL, TS-WAVE, JWAVE, and PyIMSL are trademarks of Visual Numerics, Inc., in the U.S. and other countries. All other product and company names are trademarks or registered trademarks of their respective owners.

The information contained in this document is subject to change without notice. Visual Numerics, Inc. makes no warranty of any kind with regard to this material, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Visual Numerics, Inc. shall not be liable for errors contained herein or for incidental, consequential, or other indirect damages in connection with the furnishing, performance, or use of this material.
TABLE OF CONTENTS

Abstract
Prototyping versus Production
Why Prototype in a Dynamic Language
Why Python
PyIMSL Studio Components
  Simple Installation
  Python Language
  NumPy
  Industry Standard Analytics
  Data Tools
  Other Python Components
  Documentation
  Support
Summary
Abstract

This paper explores the advantages of using the Python dynamic language and the PyIMSL Studio development environment for analytic model development, and how this can improve productivity for both those who build prototype models and those who deploy analytic applications into production. PyIMSL Studio is the first and only commercially available numerical analysis application development environment designed for deploying mathematical and statistical prototype models into production applications. This paper is intended for users familiar with creating production applications that leverage analytical models, math or statistics, and who typically deploy in Python code or C code using the IMSL C Numerical Library.

Prototyping versus Production

Many organizations developing analytic applications today follow a process that includes creating a prototype model before developing the production application. Many development and numerical analysis tools are available for the prototype stage, where the goal is to explore and refine analytic techniques using sample data, develop requirements for production, validate the viability of the solution, and achieve sign-off to proceed with production development.

Production deployment often involves additional development to package the prototype code into an application or embeddable component that handles data from production sources, has the performance desired for production, properly handles errors, and is fully tested.

As an open language supported on many platforms, Python can be used for production deployment. Features and components are available that allow standalone or web-based applications to be developed and deployed in Python. However, many prototype models eventually become part of production applications written in development languages like C, C++, C#/.NET, Java or Fortran.
To transform a model into a component of a production application in these languages, the modeler often must rewrite the prototype in a development language or hand off the prototype to an implementation team to do the work. In many cases, the rewrite includes using a separate native library for the analytics in the production application. Several testing steps are then needed to ensure that the numeric results from the prototype match the numeric results in the production application.

This paper shows how Python and PyIMSL Studio are a good choice for such prototype modeling work, and how PyIMSL Studio can also be leveraged for production deployment, in Python or other languages.
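The testing step described above, checking that prototype and production results agree numerically, can be sketched in a few lines of Python. This is only an illustration of how such a check might be organized: the function name `results_match`, the tolerance values, and the sample numbers are hypothetical and do not come from PyIMSL Studio.

```python
import math

def results_match(prototype, production, rel_tol=1e-9, abs_tol=1e-12):
    """Return True if two result sequences agree element by element
    within the given relative and absolute tolerances."""
    if len(prototype) != len(production):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(prototype, production)
    )

# Hypothetical outputs from a prototype model and its production rewrite.
prototype_out = [0.1 + 0.2, 1.0 / 3.0, 2.0 ** 0.5]
production_out = [0.3, 0.3333333333333333, 1.4142135623730951]

print(results_match(prototype_out, production_out))
```

Comparing with a tolerance rather than exact equality matters because two implementations of the same algorithm can differ in the last few bits of a floating-point result.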
Why Prototype in a Dynamic Language

Dynamic languages are well suited for prototype work for a number of reasons:

- Dynamic languages offer rapid code development, often with a 2-5X reduction in the lines of code needed to produce the same functionality as languages like C/C++, Java and C#.
- There is less syntactic decoration in dynamic languages. Variables are dynamically typed and do not require formal type declarations or function templates. Dynamic languages are higher level and do not involve pointer manipulation and referencing, and operators often work on a wider range of data objects. In the end the code can be much more readable. This comes at a cost, of course: usually some computing overhead, and subtle programming problems that would be caught at compilation instead appear at runtime.
- Dynamic languages can usually be run from an interactive command prompt, allowing ad-hoc use as a calculator or interactive debugging of code.
- Dynamic languages do not require the edit, compile, test cycle of development. Code can be changed and immediately executed or entered interactively.

All of these features allow for rapid development, code that is easy to read and maintain, and a shorter learning curve for the domain experts usually involved in prototype modeling, who are typically not professional developers.

Why Python

Python is a leading open source dynamic language well suited for analytic prototype modeling for a number of reasons:

- It is a well rounded language, which can be used for either procedural or object-oriented development. Other dynamic languages are often more special purpose, with features that address certain kinds of problems but are not balanced for general programming.
- It is an open language and not a proprietary language, allowing for greater sharing of tools and analytic code across a wider audience of users.
- There are a large number of open-source toolkits for analytical modeling with Python.
This is the result of more than a decade of strong adoption and contributions by the scientific community.
- It is a dynamically typed language with simple syntax that makes it easy to read and understand.
- The industry-standard NumPy package for Python transforms it into a language for array-based operations, suitable for efficient storage and manipulation of large multidimensional arrays. NumPy includes a simple syntax to index, subset and perform operations on arrays, and is efficient in memory use and performance.

While there are a number of open source analytic libraries and tools available for Python, the PyIMSL wrappers (part of the PyIMSL Studio environment) offer the most comprehensive collection of rich analytic and statistical techniques for both prototype modeling and production deployment, regardless of the final deployed hardware or operating system.
Python Weaknesses

There are a number of general challenges in adopting open source languages and tools, and Python is no exception. These include:

1. Installation of the many components available for Python can be tricky, and may involve compiling code (especially on non-Windows platforms).
2. Interoperability among different open source components is complicated by frequent releases, which sometimes makes it difficult to maintain a stable environment.
3. Documentation of open source components varies widely in quality. Online resources and postings often describe features or problems in out-of-date versions of components.
4. The lack of dedicated support is a risk when Python and its many open source tools are used in mission-critical applications.

PyIMSL Studio Components

PyIMSL Studio combines the Python language and a selection of robust Python tools with the advanced analytics from the IMSL C Numerical Library. It addresses the Python weaknesses described above by providing a tested, documented, fully supported and easy to install Python environment. Gaps in data I/O and cleansing are filled with additional functionality from Visual Numerics. The components in PyIMSL Studio provide the functionality needed by prototype modelers as well as the analytic functionality in C libraries needed to deploy into production environments. Components and services in PyIMSL Studio include:

Simple Installation

A single installation program allows you to install Python and all PyIMSL Studio components. They are precompiled and fully tested on each supported platform.

Python Language

The Python language is integrated as part of PyIMSL Studio. A simple example of code to calculate prime numbers shows what the language looks like:
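A minimal sketch of such a prime-number calculation, written for illustration and not taken from the paper's own listing, could read as follows. The function name `primes_up_to` and the choice of trial division are assumptions.

```python
def primes_up_to(limit):
    """Return all primes <= limit using simple trial division."""
    primes = []
    for candidate in range(2, limit + 1):
        # A candidate is prime if no smaller prime up to its
        # square root divides it evenly.
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
    return primes

print(primes_up_to(30))  # prints [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Note the absence of type declarations, braces, and a compile step: the function can be pasted into an interactive Python prompt and called immediately.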
NumPy

The included NumPy package provides data objects and a set of modules for powerful and efficient data array manipulation. It is the de facto standard for array and matrix algebra in Python. This example shows multiplying every element in an array by a scalar with one operation:

Industry Standard Analytics

The PyIMSL package within PyIMSL Studio provides wrappers to the IMSL C Library. These wrappers provide a simple and flexible interface to the underlying C functionality and handle all translation of Python data types and structures into the correct representation used in C. The API for these functions mirrors the C language API, making it easy to translate Python code to C code for production use.
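Returning to the NumPy scalar multiplication mentioned in the NumPy section above, a minimal sketch might look like this. It assumes only that NumPy is installed; the sample values are made up for illustration.

```python
import numpy as np

# Illustrative sample data (not from the paper).
values = np.array([1.0, 2.5, 4.0, 5.5])

# One expression scales every element; no explicit loop is required.
scaled = values * 2.0

print(scaled)
```

The same expression works unchanged on arrays of any number of dimensions, which is what makes NumPy code compact compared with explicit nested loops.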
Here is a Python code fragment that trains a neural network and creates a forecast:

Here is the code to accomplish the same task in C using the IMSL C Library:

Functional areas in the IMSL Libraries include these high-level areas and represent more than 450 available routines:

Mathematics
- Matrix Operations
- Linear Algebra
- Eigensystems
- Interpolation & Approximation
- Quadrature
- Differential Equations
- Nonlinear Equations
- Optimization
- Special Functions
- Finance & Bond Calculations
- Genetic Algorithms

Statistics
- Basic Statistics
- Time Series & Forecasting
- Nonparametric Tests
- Correlation & Covariance
- Data Mining
- Regression
- Analysis of Variance
- Transforms
- Goodness of Fit
- Distribution Functions
- Random Number Generation
- Neural Networks
Data Tools

Tools to import data from ASCII files, Excel spreadsheets and ODBC sources, along with tools to filter and cleanse data, are included. Tools developed by Visual Numerics are available both in Python and C. These data tools include:

asciiread: A PyIMSL Studio routine to read ASCII data files oriented in rows or columns. It has many keyword options to make it the most flexible tool of its kind in Python. It is available for both Python and C. Here is a Python example to read 2 columns from a file:

impute: A PyIMSL Studio routine to locate missing values and optionally replace them with estimated values using one of six available algorithms. It is available for both Python and C. This example shows how to replace missing values with the geometric mean of their nearest four neighbors:

PyODBC: A Python module for database access using the Microsoft ODBC interface. This example shows how to read a column from a database table:

xlrd: A Python module for reading data from Microsoft Excel files. This example shows operations to inquire about sheets and read the data from a spreadsheet:
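To make the imputation idea described for impute concrete, here is a minimal pure-Python sketch of replacing a missing value with the geometric mean of its four nearest neighbors. It is not the PyIMSL Studio impute routine and does not reproduce its API. The function name `impute_geometric_mean`, the use of `None` as the missing-value marker, and the reading of "nearest" as index distance within a one-dimensional series are all assumptions made for illustration.

```python
import math

def impute_geometric_mean(series, neighbors=4):
    """Replace missing entries (None) with the geometric mean of the
    nearest `neighbors` observed values, measured by index distance."""
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    result = list(series)
    for i, v in enumerate(series):
        if v is None:
            # Rank observed points by distance from the gap;
            # the sort is stable, so ties keep index order.
            nearest = sorted(observed, key=lambda item: abs(item[0] - i))[:neighbors]
            if not nearest:
                continue  # nothing observed at all; leave the gap in place
            # Geometric mean computed via logs; assumes strictly positive values.
            log_sum = sum(math.log(value) for _, value in nearest)
            result[i] = math.exp(log_sum / len(nearest))
    return result

# Hypothetical sample series with one missing observation.
data = [1.0, 2.0, None, 8.0, 16.0, 32.0]
filled = impute_geometric_mean(data)
print(filled)
```

In this sample the four nearest observed values to the gap are 2, 8, 1 and 16, whose geometric mean is 4 (up to floating-point rounding). A geometric mean suits data that grows multiplicatively, but it is only defined for positive values, so real data would need validation first.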
Other Python Components

matplotlib/pylab: Python analytical charting components. matplotlib is an extensive plotting module for Python that can create publication-quality graphs. Notable features of this library include alpha transparency layers, anti-aliased graphics, the ability to integrate into Wx and Tkinter GUIs, and the ability to create a variety of output formats including PostScript, SVG and PDF. matplotlib has an object-oriented interface and a simplified procedural interface (pylab). Built-in interactivity includes the ability to zoom, pan and export graphics. Here is a sample line plot created with pylab:

Tkinter/wxPython: two popular toolkits for creating Python user interfaces. Below is an example of a demo application provided with PyIMSL Studio that was built using wxPython widgets:
IPython/Eclipse: a powerful command line interface and a full-featured Integrated Development Environment (IDE) for Python. Sometimes an interactive shell environment is desired, and the IPython interface enhances the functionality available from a basic Python shell. Eclipse with the PyDev plug-in for Python provides a more formal development environment with step-through debugging, command completion, syntax highlighting and many other useful features.

Documentation

The PyIMSL Studio User Guide provides an introduction to Python, the included components and how to use them together. It has in-depth tutorials on using Python for prototype analytic development with real-world problems. It offers an easy way to become productive quickly with prototyping analytic code in Python, and serves as a quick reference and source of example code.
Additionally, complete API documentation is provided for the PyIMSL wrappers to the IMSL C Numerical Library, including API descriptions, background mathematics, references and example code.

Support

World-class technical support for both the Visual Numerics components and the bundled Python language and open source components is provided through phone support, email and online forums.

Summary

This paper described how Python and PyIMSL Studio address the needs of prototype modeling in a stable, tested, supported and documented environment. Deployment of code into production environments is enhanced through the use of PyIMSL Studio, which removes the gap introduced when different analytics are used in prototype and production work. With PyIMSL Studio, prototype models become part of production applications more quickly and with less rework, cost, risk and complexity.

For more information or to request an evaluation copy, visit the PyIMSL Studio area of the Visual Numerics website: http://www.vni.com/products/imsl/pyimslstudio.php