Open source framework for data-flow visual analytic tools for large databases




D5.6 v1.0
WP5 Visual Analytics: D5.6 Open source framework for data flow visual analytic tools for large databases

Dissemination Level: Public
Lead Editor: Janez Demsar
Date: 30 April 2015
Status: Final

Description from Description of Work:
T5.5. Data flow visual analytic tools for large databases. We will investigate ways in which the existing data flow model can work within large databases. We will provide a pilot reimplementation of Orange in which the data flow operators (e.g. Orange's widgets) will communicate by passing SQL queries and/or server side processing using PL/Python instead of retrieved data. Most operations will modify the processing instructions instead of retrieving and modifying the data. We will also move as many data operations as possible to the database server.

Contributors:
Anze Staric, UL
Janez Demsar, UL

Internal Reviewer(s):
BSC: Adrian Cristal, Nehir Sonmez
2ndQ: Tomas Vondra

Version History
Version   Date           Authors               Sections Affected
0.1       Apr 25, 2015   A Staric, J Demsar    initial draft
1.0       Apr 30, 2015   J Dix                 Final version change

Table of Contents
1. Summary
2. Description
3. Demonstration
4. Conclusion

List of Figures
1 Example schema
2 Selecting columns in the Select Columns widget
3 Filtering using the Select Rows widget

2ndQ and other members of the AXLE consortium, 2015

1. Summary

The goal of this task is to create a framework similar to existing data-flow-based data mining frameworks, except that the flow consists not of data but of metadata that can be translated into actual SQL queries when needed. In this report, we first provide a short description of the work done, followed by an example schema and the corresponding queries.

2. Description

Data flow models consist of units (Orange calls them widgets) that manipulate the data. As the data passes through multiple units, manipulations (e.g. filters, value transformations and similar) pile up.

The task of reimplementing the widgets so that they pass metadata instead of data has been greatly simplified by the work done within T5.1. With the categories defined in T5.1, data flow visual analytic tools can be used on large databases if we can efficiently implement the computation of (aggregate) data in those categories, and if each unit's functionality can be expressed in terms of these categories. The former, efficient implementation of data manipulation, is the core of the AXLE project and is researched in WPs 2-4. The latter was explored within T5.5 and is presented in this deliverable.

In parallel with AXLE and in part supported by it, we developed a new major version of Orange that allows for different types of data storage. Besides the in-memory storage, widgets can now pass data of other types, such as SQL queries. The storage also defines the basic operations on the data, such as various aggregations, filtering and so forth. In the case of SQL data, basic data manipulation, like filtering or feature construction, is implemented by changing the column selection or adding conditions to the WHERE clause of the SELECT statement. Only small amounts of aggregated data are actually passed to the widgets on the client side.

The task description states that "We will provide a pilot reimplementation of Orange in which the data flow operators (e.g. Orange's widgets) will communicate by passing SQL queries and/or server side processing using PL/Python instead of retrieved data." An initial study showed that PL/Python scripts do not offer any advantages over standard SQL queries. However, the work done within T5.1 allowed us to go beyond the pilot reimplementation of Orange: in the new version of Orange, most widgets related to data processing and visualization already work with data stored in databases that are accessible over non-local connections.

Widgets for supervised and unsupervised machine learning and statistics mostly require actual data and not just aggregates; if the data is small enough, these widgets can transfer it from the database and compute locally. If the data fits into working memory, the data transfer can be avoided by using Remote Orange; see the example with Principal Component Analysis (PCA) in deliverable D5.5. If the data is extremely large and does not fit into working memory, machine learning and statistical methods need to be replaced with iterative algorithms. This is not a part of the AXLE project; AXLE, however, provides the necessary architecture, as shown in the PCA example, which indeed uses an iterative PCA instead of the standard one.

3. Demonstration

We demonstrate how the workflow works with a simple schema (Figure 1). We connect to the database (SQL Table), select a few columns and define their roles (Select Columns), and filter out some rows (Select Rows). We discretize the data, induce a naive Bayesian classifier and show a Mosaic Display. On the original data, we compute PCA and observe the projections with a Heat Map.

Figure 1. Example schema

Widgets communicate with each other by passing metadata, which can be used to construct the SQL query that would retrieve the data. The database is accessed only when needed, and none of the widgets in the schema actually retrieves any row data from the original table to the desktop client. To demonstrate this principle, we will show the SQL queries that would be executed if the Data Table widget, which can be used to show row data, were attached to various widgets.
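The principle of passing metadata instead of rows can be illustrated with a minimal sketch. It is not Orange's implementation: the `QuerySpec` class, its methods, and the table and column names (`samples`, `x`, `y`) are all hypothetical, chosen only to show how each unit could derive a new description from the one it receives, with SQL text produced only on request.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class QuerySpec:
    """Hypothetical metadata object passed between units instead of rows."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    conditions: Tuple[str, ...] = ()

    def select(self, *columns: str) -> "QuerySpec":
        # Narrowing the column list only changes metadata; nothing is executed.
        return replace(self, columns=tuple(columns))

    def where(self, condition: str) -> "QuerySpec":
        # Each filter is appended as one more WHERE condition.
        return replace(self, conditions=self.conditions + (condition,))

    def to_sql(self) -> str:
        # The SQL text is assembled only when a unit actually needs the data.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self.conditions)
        return sql


# Illustrative names only; they do not come from the report.
source = QuerySpec(table="samples")
selected = source.select("x", "y")
filtered = selected.where("x > 0").where("y < 10")

print(filtered.to_sql())
# SELECT x, y FROM samples WHERE (x > 0) AND (y < 10)
```

Because every step returns a new immutable object, an upstream unit's description is never altered by downstream units, which is one plausible way to let several branches of a schema share the same source.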

We ran the schema on one of the common benchmark data sets from the UCI ML Repository, the Wine data set. For the purpose of these experiments, the data set size was increased (by resampling and adding some artificial noise) to 100 million rows.

The SQL Table widget analyzes the data and creates a corresponding instance of Orange's Domain class. The query that retrieves the data sufficient for this analysis is as follows.

The widget outputs data that would effectively translate into the following query.

In the Select Columns widget, we selected the first four columns and designated Wine as the target variable.

Figure 2. Selecting columns in the Select Columns widget

The corresponding query is simplified to

The fact that Wine is the target variable is not reflected in the query itself.

In the Select Rows widget, we selected the wine samples with alcohol above 13 % and malic_acid above 2.3 %.
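The remark that the target role is not reflected in the query can be made concrete with a small sketch. The `ColumnRoles` class and the table name `wine_data` are assumptions made for illustration (this is not Orange's Domain class); only the column names `wine`, `alcohol` and `malic_acid` are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ColumnRoles:
    """Hypothetical client-side record of which column plays which role."""
    attributes: Tuple[str, ...]
    target: Optional[str] = None

    def columns(self) -> Tuple[str, ...]:
        # For SQL purposes the target is just one more selected column.
        return self.attributes + ((self.target,) if self.target else ())


def select_sql(table: str, roles: ColumnRoles) -> str:
    """Render the column selection; role information is deliberately absent."""
    return f"SELECT {', '.join(roles.columns())} FROM {table}"


# "wine_data" is an assumed table name.
with_target = ColumnRoles(attributes=("alcohol", "malic_acid"), target="wine")
without_target = ColumnRoles(attributes=("alcohol", "malic_acid", "wine"))

# Both role assignments produce the same query text.
print(select_sql("wine_data", with_target))
print(select_sql("wine_data", with_target) == select_sql("wine_data", without_target))
```

The role assignment stays on the client as metadata, so changing the target requires no change to the query and no database access.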

Figure 3. Filtering using the Select Rows widget.

This adds the corresponding conditions to the query, still without executing (or even constructing) it.

The Discretization widget defines categorical variables corresponding to the original continuous columns so that each bin contains a roughly equal number of samples. The thresholds are computed using the quantile function. The widget also supports other types of discretization: discretization into bins of equal width and, in particular, entropy-MDL-based discretization, which requires contingency tables retrieved by the following query:

The output of that widget, if translated to a query, would look as follows.

A naive Bayesian classifier is induced by computing cross tables from this data. Currently, a separate query is executed for each column.
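The discretization and cross-table steps described above can be sketched as follows. This is an illustration under several assumptions, not the queries used in the report: an in-memory SQLite database stands in for the database server, the table `wine_data` and its six rows are made up, and the threshold is computed on the client with Python's `statistics.quantiles`, whereas the report computes thresholds with a quantile function.

```python
import sqlite3
from statistics import quantiles

# Toy stand-in for the server-side table; names and values are made up.
rows = [(1, 12.1), (1, 12.4), (1, 12.9), (2, 13.2), (2, 13.6), (2, 14.1)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wine_data (wine INTEGER, alcohol REAL)")
conn.executemany("INSERT INTO wine_data VALUES (?, ?)", rows)

# Equal-frequency cut points (client-side here, purely for brevity).
values = [r[0] for r in conn.execute("SELECT alcohol FROM wine_data")]
thresholds = quantiles(values, n=2)  # one cut point, hence two bins


def case_expression(column, cuts):
    """Render bin membership as a SQL CASE expression (cuts must be sorted)."""
    parts = [f"WHEN {column} < {cut!r} THEN {i}" for i, cut in enumerate(cuts)]
    return f"CASE {' '.join(parts)} ELSE {len(cuts)} END"


# One cross table (class value x bin -> count) for one column; a classifier
# with several columns would issue one such query per column.
binned = case_expression("alcohol", thresholds)
query = (
    f"SELECT wine, {binned} AS bin, COUNT(*) "
    "FROM wine_data GROUP BY wine, bin ORDER BY wine, bin"
)
crosstab = conn.execute(query).fetchall()

# Conditional relative frequencies P(bin | class), the building block of a
# naive Bayesian classifier (no smoothing in this sketch).
class_totals = {}
for wine, _, count in crosstab:
    class_totals[wine] = class_totals.get(wine, 0) + count
cond_prob = {(wine, b): count / class_totals[wine] for wine, b, count in crosstab}

print(crosstab)
print(cond_prob)
```

Only the aggregated counts leave the database, which matches the statement that widgets receive small amounts of aggregated data rather than rows.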

The corresponding queries for the Mosaic Display are similar.

PCA is computed iteratively using Remote Orange, described in deliverable D5.5. It retrieves actual data rows in batches and updates the projection until convergence or until the client aborts the computation. The output of the PCA widget is a transformed table in which each column corresponds to one principal component. We set the widget to output only two components, resulting in the following output query.
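To give a feeling for how a projection can be updated from batches of rows until convergence, here is a minimal sketch. It is not the algorithm used by Remote Orange, which is described in D5.5; it substitutes a plain power iteration that finds only the first principal component and rereads all batches on every pass. The function names and the toy data are hypothetical.

```python
import math
from typing import Iterator, List, Sequence

Row = Sequence[float]


def fetch_batches(rows: List[Row], size: int) -> Iterator[List[Row]]:
    """Hypothetical stand-in for retrieving row batches from the server."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def first_component(rows: List[Row], batch_size: int = 2,
                    tol: float = 1e-9, max_passes: int = 100) -> List[float]:
    """Power iteration over batches; returns a unit-length direction."""
    dim = len(rows[0])

    # First pass: column means, accumulated one batch at a time.
    sums, count = [0.0] * dim, 0
    for batch in fetch_batches(rows, batch_size):
        for row in batch:
            count += 1
            for j, value in enumerate(row):
                sums[j] += value
    mean = [s / count for s in sums]

    direction = [1.0] + [0.0] * (dim - 1)  # arbitrary start vector
    for _ in range(max_passes):
        # Accumulate (scatter matrix) x direction without forming the matrix.
        acc = [0.0] * dim
        for batch in fetch_batches(rows, batch_size):
            for row in batch:
                centered = [value - m for value, m in zip(row, mean)]
                dot = sum(c * d for c, d in zip(centered, direction))
                for j, c in enumerate(centered):
                    acc[j] += c * dot
        norm = math.sqrt(sum(a * a for a in acc))
        if norm == 0.0:  # no variance along the current direction
            break
        updated = [a / norm for a in acc]
        change = max(abs(u - d) for u, d in zip(updated, direction))
        direction = updated
        if change < tol:  # converged
            break
    return direction


# Made-up points lying close to the diagonal.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]
component = first_component(data)
print(component)
```

At no point does the sketch hold more than one batch plus a few vectors of length `dim`, which is the property that makes batch-wise computation attractive when the table does not fit into memory.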

Finally, this is the beginning of the query that computes the heat map:
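Independently of the report's actual query, the general idea of computing a heat map inside the database can be sketched as a two-dimensional binned count. Everything in this example is an assumption: SQLite stands in for the database server, and the `projection` table, its columns `pc1` and `pc2`, the value ranges and the helper `heatmap_query` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projection (pc1 REAL, pc2 REAL)")
points = [(0.1, 0.2), (0.4, 0.3), (0.6, 0.9), (0.9, 0.8), (0.95, 0.85)]
conn.executemany("INSERT INTO projection VALUES (?, ?)", points)


def heatmap_query(table, x, y, x_min, x_max, y_min, y_max, bins):
    """Build a query that counts rows per grid cell.

    Values equal to x_max or y_max would land in an extra bin with index
    `bins`; a real implementation would need to clamp them.
    """
    x_width = (x_max - x_min) / bins
    y_width = (y_max - y_min) / bins
    return (
        f"SELECT CAST(({x} - {x_min}) / {x_width} AS INTEGER) AS x_bin, "
        f"CAST(({y} - {y_min}) / {y_width} AS INTEGER) AS y_bin, COUNT(*) "
        f"FROM {table} GROUP BY x_bin, y_bin ORDER BY x_bin, y_bin"
    )


query = heatmap_query("projection", "pc1", "pc2", 0.0, 1.0, 0.0, 1.0, bins=2)
cells = conn.execute(query).fetchall()
print(cells)
```

The client receives one row per non-empty cell, so the amount of transferred data depends on the grid resolution rather than on the number of rows in the table.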

This demonstration shows how Orange widgets process SQL data by modifying the query instead of retrieving and transforming the data. In particular, the last query contains the modifications added by all upstream widgets (some of which are obscured by the use of a temporary sample table).

4. Conclusion

All functionality presented in this report is already included in the working version of Orange, available on its website (http://orange.biolab.si/orange3/). The source code is released under the BSD license (except for the GUI part, which is under the GPL due to its dependency on PyQt) and is available on GitHub (https://github.com/biolab/orange3).