Open source framework for data-flow visual analytic tools for large databases

D5.6 v1.0

WP5 Visual Analytics: D5.6 Open source framework for data flow visual analytic tools for large databases

Dissemination Level: Public
Lead Editor: Janez Demsar
Date: 30 April 2015
Status: Final

Description from Description of Work:
T5.5. Data flow visual analytic tools for large data bases. We will investigate ways in which the existing data flow model can work within large databases. We will provide a pilot reimplementation of Orange in which the data flow operators (e.g. Orange's widgets) will communicate by passing SQL queries and/or server side processing using PL/Python instead of retrieved data. Most operations will modify the processing instructions instead of retrieving and modifying the data. We will also move as many data operations as possible to the database server.

Contributors: Anze Staric, UL; Janez Demsar, UL
Internal Reviewer(s): BSC: Adrian Cristal, Nehir Sonmez; 2ndQ: Tomas Vondra

Version History
Version  Date          Authors             Sections Affected
0.1      Apr 25, 2015  A Staric, J Demsar  initial draft
1.0      Apr 30, 2015  J Dix               Final version change

Table of Contents
1. Summary
2. Description
3. Demonstration
4. Conclusion

List of Figures
1 Example schema
2 Selecting columns in the Select Columns widget
3 Filtering using the Select Rows widget

2ndQ and other members of the AXLE consortium, 2015

1. Summary

The goal of this task is to create a framework similar to existing data flow based data mining frameworks, except that the flow consists not of data but of metadata that can be translated into actual SQL queries when needed. In this report, we first provide a short description of the work done, followed by an example schema and the corresponding queries.

2. Description

Data flow models consist of units (Orange calls them widgets) that manipulate the data. As the data passes through multiple units, manipulations (e.g. filters, value transformations and similar) pile up.

The task of reimplementing the widgets so that they pass metadata instead of data has been greatly simplified by the work done within T5.1. With the categories defined in T5.1, data flow visual analytic tools can be used on large databases if two conditions hold: the computation of (aggregate) data in those categories can be implemented efficiently, and each unit's functionality can be expressed in terms of these categories. The former, efficient implementation of data manipulation, is the core of the AXLE project and is researched in WPs 2-4. The latter was explored within T5.5 and is presented in this deliverable.

In parallel with AXLE and in part supported by it, we developed a new major version of Orange that allows for different types of data storage. Besides the in-memory storage, widgets can now pass data of other types, such as SQL queries. The storage also defines the basic operations on the data, such as various aggregations, filtering and so forth. In the case of SQL data, basic data manipulation, like filtering or feature construction, is implemented by changing the column selection or adding conditions to the WHERE clause in the SELECT statement. Only small amounts of aggregated data are actually passed to the widgets on the client side.

The task description states that "We will provide a pilot reimplementation of Orange in which the data flow operators (e.g. Orange's widgets) will communicate by passing SQL queries and/or server side processing using PL/Python instead of retrieved data." An initial study showed that PL/Python scripts do not offer any advantages over standard SQL queries. However, the work done within T5.1 allowed us to go beyond the pilot reimplementation of Orange: in the new version of Orange, most widgets related to data processing and visualization already work with data stored in databases that are accessible over non-local connections.

Widgets for supervised and unsupervised machine learning and statistics mostly require actual data and not just aggregates; if the data is small enough, these widgets can transfer it from the database and compute locally. If the data fits into working memory, the data transfer can be avoided by using Remote Orange; see the example with Principal Component Analysis (PCA) in deliverable D5.5. If the data is extremely large and does not fit into working memory, machine learning and statistical methods need to be replaced with iterative algorithms. This is not a part of the AXLE project; AXLE, however, provides the necessary architecture, as shown in the PCA example, which indeed uses an iterative PCA instead of the standard one.

3. Demonstration

We demonstrate how the workflow works with a simple schema (Figure 1). We connect to the database (SQL Table), select a few columns and define their roles (Select Columns), and filter out some rows (Select Rows). We discretize the data, induce a naive Bayesian classifier and show a Mosaic Display. On the original data, we compute PCA and observe the projections with a Heat map.

Figure 1. Example schema

Widgets communicate with each other by passing metadata, which can be used to construct the SQL query that would retrieve the data. The database is accessed only when needed, and none of the widgets in the schema actually retrieves any row data from the original table to the desktop client. To demonstrate this principle, we show the SQL queries that would be executed if the Data Table widget, which can be used to show row data, were attached to various widgets.
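The principle of passing a query description rather than rows can be illustrated with a minimal sketch. The class `QueryDescription`, its methods, and the table and column names below are hypothetical and are not Orange's actual API; the sketch only shows how each step could return a new description while the SQL text is produced on demand.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class QueryDescription:
    """Hypothetical stand-in for the metadata a widget passes downstream."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    conditions: Tuple[str, ...] = ()

    def select(self, *columns: str) -> "QueryDescription":
        # Narrow the column list; no database access happens here.
        return replace(self, columns=tuple(columns))

    def where(self, condition: str) -> "QueryDescription":
        # Append a condition; again only the description changes.
        return replace(self, conditions=self.conditions + (condition,))

    def to_sql(self) -> str:
        # The SQL text is built only when some consumer actually asks for it.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return sql


# Each "widget" derives a new description from the one it received.
source = QueryDescription("some_table")
selected = source.select("x", "y")
filtered = selected.where("x > 0")
print(filtered.to_sql())
```

Because the description is immutable, an upstream widget's output is never altered by a downstream one, which is one plausible way to keep several branches of a schema independent.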

We ran the schema on one of the common benchmark data sets from the UCI ML Repository, the Wine data set. For the purpose of these experiments, the data set size was increased (by resampling and adding some artificial noise) to 100 million data rows.

SQL Table analyzes the data and creates a corresponding instance of the Domain class in Orange. The query that retrieves sufficient data for this analysis is as follows:

The widget outputs data that would effectively translate into the following query:

In the Select Columns widget we selected the first four columns and designated Wine as the target variable.

Figure 2. Selecting columns in the Select Columns widget

The corresponding query is simplified to:

The fact that Wine is the target variable is not reflected in the query itself.

In the Select Rows widget, we selected the wine samples with alcohol above 13 % and malic_acid above 2.3 %.
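The row filter just described can be sketched as a list of structured conditions that is rendered into a WHERE clause only when rows are requested. This is an illustration under assumptions, not the query Orange generates: the table name `wine`, the helper `render_where`, and the toy rows are hypothetical, and an in-memory SQLite database stands in for the database server.

```python
import sqlite3

# Structured filters as a widget might pass them: (column, operator, value).
filters = [("alcohol", ">", 13.0), ("malic_acid", ">", 2.3)]
ALLOWED_OPERATORS = {">", ">=", "<", "<=", "=", "<>"}


def render_where(filters):
    """Return a WHERE clause with placeholders and the list of bound values."""
    clauses, values = [], []
    for column, operator, value in filters:
        if operator not in ALLOWED_OPERATORS or not column.isidentifier():
            raise ValueError(f"unsupported filter: {column} {operator}")
        clauses.append(f"{column} {operator} ?")
        values.append(value)
    return " AND ".join(clauses), values


where, params = render_where(filters)
query = f"SELECT wine, alcohol, malic_acid FROM wine WHERE {where}"

# Toy data in an in-memory SQLite database, standing in for the real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (wine INTEGER, alcohol REAL, malic_acid REAL)")
conn.executemany(
    "INSERT INTO wine VALUES (?, ?, ?)",
    [(1, 14.2, 1.7), (1, 13.2, 2.4), (2, 12.4, 3.0), (3, 13.5, 3.1)],
)

# Only at this point is the query executed and are rows transferred.
rows = conn.execute(query + " ORDER BY alcohol", params).fetchall()
print(query)
print(rows)
```

Keeping the values as bound parameters rather than pasting them into the SQL text is a design choice of this sketch; it avoids quoting problems when filter values come from user input.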

Figure 3. Filtering using the Select Rows widget.

The filter adds the corresponding conditions to the query, still without actually executing (or even actually constructing) it.

The Discretization widget defines categorical variables corresponding to the original continuous columns so that each bin contains a roughly equal number of samples. The thresholds are computed using the quantile function. The widget also supports other types of discretization: discretization into bins of equal width and, in particular, the entropy-MDL-based discretization, which requires contingency tables retrieved by the following query:

The output of that widget, if translated to a query, would look as follows:

The naive Bayesian classifier is induced by computing cross tables from this data. Currently, a separate query is executed for each column.
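How a discretized column and a per-column cross table could be obtained without transferring rows can be sketched as follows. Everything here is an assumption made for illustration: the thresholds are hypothetical constants (in the workflow they come from the quantile function), the helpers `bin_expression`, `contingency` and `conditional_probability` are invented names, Laplace smoothing is a choice of this sketch, and SQLite stands in for the database server.

```python
import sqlite3
from collections import defaultdict


def bin_expression(column, thresholds):
    """Build a CASE expression mapping a continuous column to a bin index."""
    parts = [f"WHEN {column} < {t} THEN {i}" for i, t in enumerate(thresholds)]
    return f"CASE {' '.join(parts)} ELSE {len(thresholds)} END"


def contingency(conn, table, target, column, thresholds):
    """Run one GROUP BY query for one column; return counts[class][bin]."""
    sql = (
        f"SELECT {target}, {bin_expression(column, thresholds)} AS bin_index, "
        f"COUNT(*) FROM {table} GROUP BY {target}, bin_index"
    )
    counts = defaultdict(dict)
    for cls, bin_index, n in conn.execute(sql):
        counts[cls][bin_index] = n
    return counts


def conditional_probability(counts, cls, bin_index, n_bins):
    """P(bin | class) with Laplace smoothing (an assumption of this sketch)."""
    class_total = sum(counts[cls].values())
    return (counts[cls].get(bin_index, 0) + 1) / (class_total + n_bins)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine (wine INTEGER, alcohol REAL)")
conn.executemany(
    "INSERT INTO wine VALUES (?, ?)",
    [(1, 14.2), (1, 13.8), (1, 13.0), (2, 12.1), (2, 12.4), (2, 13.1)],
)

thresholds = [12.5, 13.5]  # hypothetical values, giving three bins
counts = contingency(conn, "wine", "wine", "alcohol", thresholds)
p_high_given_1 = conditional_probability(counts, 1, 2, len(thresholds) + 1)
print(dict(counts), p_high_given_1)
```

Only the aggregated counts leave the database, so the amount of transferred data depends on the number of classes and bins rather than on the number of rows.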

The corresponding queries for the Mosaic plot are similar.

PCA is computed iteratively using Remote Orange, described in deliverable D5.5. It retrieves actual data rows in batches and updates the projection until convergence or until the client aborts the computation. The output of the PCA widget is a transformed table in which each column corresponds to one principal component. We set the widget to output only two components, resulting in the following output query.
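Independently of that output query, the batch-wise computation described above can be illustrated with a small sketch. It is not the algorithm used by Remote Orange: as a stand-in, it accumulates running sums per batch, derives a covariance matrix from them, and finds the first component by power iteration. The function names and the toy batches are hypothetical, and the batches are assumed to be non-empty overall.

```python
import math


def top_component(cov, iterations=200):
    """Power iteration for the dominant eigenvector of a small matrix."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def batch_pca(batches, tolerance=1e-6):
    """Update running sums per batch; stop once the component stabilises."""
    n, sums, cross = 0, None, None
    component, previous = None, None
    for batch in batches:
        if not batch:
            continue
        for row in batch:
            if sums is None:
                d = len(row)
                sums = [0.0] * d
                cross = [[0.0] * d for _ in range(d)]
            n += 1
            for i, x in enumerate(row):
                sums[i] += x
                for j, y in enumerate(row):
                    cross[i][j] += x * y
        # Covariance from the sufficient statistics gathered so far.
        means = [s / n for s in sums]
        cov = [[cross[i][j] / n - means[i] * means[j] for j in range(d)]
               for i in range(d)]
        component = top_component(cov)
        if previous is not None:
            # Compare directions; the sign of an eigenvector is arbitrary.
            change = 1.0 - abs(sum(a * b for a, b in zip(component, previous)))
            if change < tolerance:
                break
        previous = component
    return component


# Toy batches lying on the line y = 2x, so the direction is known in advance.
batches = [[(1, 2), (2, 4), (3, 6)], [(4, 8), (5, 10), (6, 12)], [(7, 14), (8, 16)]]
component = batch_pca(batches)
print(component)
```

In this toy run the direction is already stable after the second batch, so the third batch is never requested, which mirrors the idea of stopping at convergence.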

Finally, this is the beginning of the query that computes the heat map:
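Independently of the generated query, the kind of aggregation behind a heat map can be illustrated with a rough sketch; it is not the query Orange produces. The table `projection`, the columns `pc1` and `pc2`, the helper `heatmap_query` and the grid parameters are all hypothetical, and SQLite stands in for the database server.

```python
import sqlite3


def heatmap_query(table, x, y, x_min, y_min, cell):
    """Count rows per grid cell; only the counts leave the database.

    The integer cast truncates toward zero, which equals flooring as long
    as x_min and y_min are lower bounds of the respective columns.
    """
    return (
        f"SELECT CAST(({x} - {x_min}) / {cell} AS INTEGER) AS gx, "
        f"CAST(({y} - {y_min}) / {cell} AS INTEGER) AS gy, COUNT(*) "
        f"FROM {table} GROUP BY gx, gy ORDER BY gx, gy"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projection (pc1 REAL, pc2 REAL)")
conn.executemany(
    "INSERT INTO projection VALUES (?, ?)",
    [(0.1, 0.2), (0.4, 0.3), (1.6, 0.1), (1.9, 1.8), (0.2, 1.7), (1.7, 1.9)],
)

# A 2 x 2 grid with unit cells over the toy data.
cells = conn.execute(
    heatmap_query("projection", "pc1", "pc2", 0.0, 0.0, 1.0)
).fetchall()
print(cells)
```

The result has at most one row per non-empty cell, so its size is bounded by the grid resolution and not by the number of rows in the table.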

This demonstration shows how Orange widgets process SQL data by modifying the query instead of retrieving and transforming the data. In particular, the last query contains the modifications added by all upstream widgets (some of which are obscured by the use of a temporary sample table).

4. Conclusion

All functionality presented in this report is already included in the working version of Orange, available on its website (http://orange.biolab.si/orange3/). The source code is released under the BSD license (except for the GUI part, which is under the GPL due to its dependency on PyQt) and is available on GitHub (https://github.com/biolab/orange3).