DATA MINING USING PENTAHO / WEKA



Similar documents
Open Source Business Intelligence Intro

DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7

An Introduction to WEKA. As presented by PACE

Introduction Predictive Analytics Tools: Weka

Breadboard BI. Unlocking ERP Data Using Open Source Tools By Christopher Lavigne

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence

The Prophecy-Prototype of Prediction modeling tool

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

DBTech Pro Workshop. Knowledge Discovery from Databases (KDD) Including Data Warehousing and Data Mining. Georgios Evangelidis

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Open Source Business Intelligence Tools: A Review

8. Business Intelligence Reference Architectures and Patterns

Creation of a BI environment from scratch with open source software

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

GeoKettle: A powerful open source spatial ETL tool

Open Source Business Intelligence

The BIg Picture. Dinsdag 17 september 2013

Data Mining Solutions for the Business Environment

Data Mining + Business Intelligence. Integration, Design and Implementation

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo Database And Data Mining Research Group

IBM DB2 Universal Database Data Warehouse Edition. Version 8.1.2

ANALYTICS CENTER LEARNING PROGRAM

Make Better Decisions Through Predictive Intelligence


DATA MINING ALPHA MINER

Web Data Mining: A Case Study. Abstract. Introduction

Beyond Traditional Management Reporting IBM Corporation

Fast and Easy Delivery of Data Mining Insights to Reporting Systems

Open Source meets Business Intelligence Seminar Business Intelligence Winter Term 06/07

A DATA WAREHOUSE SOLUTION FOR E-GOVERNMENT

Data Integration with Talend Open Studio Robert A. Nisbet, Ph.D.

Discussion on Airport Business Intelligence System Architecture

This Symposium brought to you by

Operationalise Predictive Analytics

Pentaho Data Integration 4 and MySQL. Matt Casters: Pentaho's Chief Data Integration Kettle Project Founder

A Look at Self Service BI with SAP Lumira Natasha Kishinevsky Dunn Solutions Group SESSION CODE: 1405

Introduction. A. Bellaachia Page: 1

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

Business Intelligence. A Presentation of the Current Lead Solutions and a Comparative Analysis of the Main Providers

IBM SPSS Modeler Professional

Introduction to Data Mining

Solve your toughest challenges with data mining

Solve your toughest challenges with data mining

Online Courses. Version 9 Comprehensive Series. What's New Series

Business Intelligence for Big Data

Business Intelligence. Advanced visualization. Reporting & dashboards. Mobile BI. Packaged BI

DATA MINING AND WAREHOUSING CONCEPTS

Open source geospatial Business Intelligence (BI) in action!

IBM SPSS Modeler Professional

Business Intelligence at Albert Heijn

Enterprise Solutions. Data Warehouse & Business Intelligence Chapter-8

Alejandro Vaisman Esteban Zimanyi. Data. Warehouse. Systems. Design and Implementation. ^ Springer

ORACLE DATA INTEGRATOR ENTERPRISE EDITION

Data Search. Searching and Finding information in Unstructured and Structured Data Sources

SPSS Modeler Integration with IBM DB2 Analytics Accelerator

ENTERPRISE EDITION ORACLE DATA SHEET KEY FEATURES AND BENEFITS ORACLE DATA INTEGRATOR

Predictive Analytics Powered by SAP HANA. Cary Bourgeois Principal Solution Advisor Platform and Analytics

III JORNADAS DE DATA MINING

Sunnie Chung. Cleveland State University

Palo Open Source BI Suite

Workday Big Data Analytics

Real-time Data Replication

NEMUG Feb Create Your Own Web Data Mart with MySQL

Class 10. Data Mining and Artificial Intelligence. Data Mining. We are in the 21 st century So where are the robots?

Importance or the Role of Data Warehousing and Data Mining in Business Applications

Cloud Computing. With MySQL and Pentaho Data Integration. Matt Casters Chief Data Integration at Pentaho Kettle project founder

PARAMETRIC COMPARISON OF DATA MINING TOOLS

Fusion Applications Overview of Business Intelligence and Reporting components

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Data Warehousing and Data Mining in Business Applications

Pentaho Data Mining Last Modified on January 22, 2007

DATA WAREHOUSE AND DATA MINING NECCESSITY OR USELESS INVESTMENT

PREDICTIVE ANALYTICS IN HIGHER EDUCATION NOVEMBER 6, 2014

Data Mining & Data Stream Mining Open Source Tools

COURSE RECOMMENDER SYSTEM IN E-LEARNING

IBM SPSS Modeler 15 In-Database Mining Guide

Data Mining and Data Warehousing on US Farmer s Data

Why include analytics as part of the School of Information Technology curriculum?

IBM AND NEXT GENERATION ARCHITECTURE FOR BIG DATA & ANALYTICS!

Data Mining for Everyone

Data Mart/Warehouse: Progress and Vision

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

CUSTOMER Presentation of SAP Predictive Analytics

Bussiness Intelligence and Data Warehouse. Tomas Bartos CIS 764, Kansas State University

IBM WebSphere DataStage Online training from Yes-M Systems

Business Intelligence on a Budget: Open Source BI. Paul O Rorke

Transcription:

DATA MINING USING PENTAHO / WEKA Yannis Angelis Channels & Information Exploitation Division Application Delivery Sector EFG Eurobank 1

Agenda BI in Financial Environments Pentaho Community Platform Weka Platform Integration with Pentaho Platform Data mining using Weka 2

A brief (BI) history 2010 2000 1990 1980 1970 Open-source BI suites Open-source OLAP RDBMS support for DW Desktop OLAP Mainframe OLAP MySQL Codd s 12 Rules of OLAP Spreadsheets JRubik Mondrian Oracle partitions, bitmap indexes Essbase Sybase IQ Pentaho JPivot Business Objects Cognos PowerPlay Comshare APL BEE Palo MicroStrategy Express OpenI Microsoft OLAP Services Informix MetaCube

Banking flavors of BI Data warehouse Data integration / cleansing (ETL) Reporting Operational Analytical Data Mining Risk related areas Fraud detection / prediction Credit scoring / scorecards Marketing related areas Next best activity Churn management Anti-attrition 4

Eurobank specificities IBM DW implementation (blueprint & localized) Essbase & MS OLAP engines Cognos & MS Reporting Services (BO in some cases) Different DB flavors in data marts (MS SQL, Oracle, DB2) SPSS & SAS analytics Excel mania Open Source BI Solutions 5

Pentaho explained Pentaho Community - BI Platform - Reporting (PRD) - Mondrian (OLAP engine) - Kettle (PDI - ETL) - Weka (Data mining) All build under Java platform 6

Pentaho in action 7

What is Weka? The Weka or woodhen (Gallirallus australis) is an endemic bird of New Zealand. (Source: WikiPedia) 8

Weka platform WEKA (Waikato Environment for Knowledge Analysis) Funded by the NZ government (for more than 10 years) Develop an open-source state-of-the-art workbench of data mining tools Explore fielded applications Develop new fundamental methods Became part of Pentaho platform in 2006 (PDM Pentaho Data Mining) 9

Weka Functionality & Architecture Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods data pre-processing tools classification/regression algorithms clustering algorithms attribute/subset evaluators & search algorithms for feature selection algorithms for finding association rules Graphical user interfaces (incl. data visualization) Environment for comparing learning algorithms Modular, object-oriented architecture Packages for different types of algorithms (filters, classifiers, clusterers, associations, attribute selection etc.) Sub-packages group components by functionality or purpose E.g. classifiers.bayes, filters.unsupervised.attribute Algorithms are Java Beans (implementing specific interface) GUIs use introspection/reflection to dynamically generate editor dialogs at runtime 10

Weka UI 11

Weka ESI (Extensibility Standards Interoperability) Extensibility Plugin mechanisms allow WEKA to be extended without modifying the classes in the WEKA distribution New algorithms can be implemented and added Standards & Interoperability PMML Support (import / export) LibSVM / SVM-Light data format support 12

Weka / Pentaho Integration Main point of integration is with Kettle (PDI) project Kettle can export data in ARFF format High-volume, low memory consumption Kettle gets WEKA-specific transformation steps WekaScoring: score data using a pre-constructed WEKA model (classification, regression or clustering) or PMML model as part of an ETL transformation KnowledgeFlow: execute arbitrary Knowledge Flow processes as part of an ETL transformation WekaForecasting: load or import a time series forecasting model (created in Weka's time series analysis and forecasting environment) as part of an ETL transformation Only WekaScoring can be used (is free and downloadable) with the community edition of Pentaho 13

Data mining with Weka (One-of-the-many) Definition: Extraction of implicit, previously unknown, and potentially useful information from data Goal: improve marketing, sales, and customer support operations, risk assessment etc. Who is likely to remain a loyal customer? What products should be marketed to which prospects? What determines whether a person will respond to a certain offer? How can I detect potential fraud? Central idea: historical data contains information that will be useful in the future (patterns => generalizations) Data Mining employs a set of algorithms that automatically detect patterns and regularities in data 14

Eurobank s Case Problem: Prediction (Probability Score) of a Corporate Customer Delinquency (or default) in the next year Customer historical data used include: Customer footings behavior (assets & liabilities) Customer delinquencies (rates and time data) Business Sector behavioral data 15

Implementation Thoughts Variable selection using the Information Value (IV) criterion. IV = Value of IV less than 0.02 i (G i B i ) ln( G i B i ) Statistical strength a very weak statistical relation 0.02 0.1 a weak statistical relation 0.1 0.3 an average statistical relation 0.3 0.5 a strong statistical relation greater than 0.5 an extremely strong statistical relation Automatic Binning of continuous data variables was used (Chi-merge). Manual corrections were made to address particularities in the data distribution of some variables (using again IV) 16

Weka Case Screenshots 17

Visualizations 18

Weka Limitations Traditional algorithms need to have all data in (main) memory => big datasets are an issue Solution: Incremental schemes Stream algorithms MOA (Massive Online Analysis) http://moa.cs.waikato.ac.nz/ 19

Conclusion Try it yourself WEKA is available at http://www.cs.waikato.ac.nz/ml/weka A list of projects based on WEKA include applications for natural language processing, text mining etc. 20