testo dello schema Secondo livello Terzo livello Quarto livello Quinto livello



Similar documents
Indesit Smart Energy Management

An Introduction to WEKA. As presented by PACE

Introduction to Data Mining

Model Deployment. Dr. Saed Sayad. University of Toronto

Predicting & Preventing Banking Customer Churn By Unlocking Big Data.

Introduction Predictive Analytics Tools: Weka

not possible or was possible at a high cost for collecting the data.

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining Solutions for the Business Environment

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

An Introduction to Data Mining

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Social Media Mining. Data Mining Essentials

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

D A T A M I N I N G C L A S S I F I C A T I O N

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Data Mining. SPSS Clementine Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

Didacticiel Études de cas. Association Rules mining with Tanagra, R (arules package), Orange, RapidMiner, Knime and Weka.

Introduction. A. Bellaachia Page: 1

Chapter 12 Discovering New Knowledge Data Mining

Data Mining Part 5. Prediction

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

DATA MINING ALPHA MINER

Master s Program in Information Systems

Data Mining: STATISTICA

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

life science data mining

Course Syllabus For Operations Management. Management Information Systems

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Azure Machine Learning, SQL Data Mining and R

How To Use Neural Networks In Data Mining

Audit Analytics. --An innovative course at Rutgers. Qi Liu. Roman Chinchila

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Data, Measurements, Features

Data Mining and Machine Learning in Bioinformatics

COMPARING NEURAL NETWORK ALGORITHM PERFORMANCE USING SPSS AND NEUROSOLUTIONS

Predictive Dynamix Inc

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Chapter 6. The stacking ensemble approach

Data Analytics at NICTA. Stephen Hardy National ICT Australia (NICTA)

Microsoft Azure Machine learning Algorithms

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Make Better Decisions Through Predictive Intelligence

Learning is a very general term denoting the way in which agents:

DATA MINING TECHNIQUES AND APPLICATIONS

Pentaho Data Mining Last Modified on January 22, 2007

INTRODUCTION TO DATA MINING SAS ENTERPRISE MINER

IBM SPSS Modeler Premium

REVIEW ON PREDICTION SYSTEM FOR HEART DIAGNOSIS USING DATA MINING TECHNIQUES

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence

Data Mining + Business Intelligence. Integration, Design and Implementation

Data Mining. Nonlinear Classification

Hexaware E-book on Predictive Analytics

Using multiple models: Bagging, Boosting, Ensembles, Forests

SAS JOINT DATA MINING CERTIFICATION AT BRYANT UNIVERSITY

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE. Luigi Grimaudo Database And Data Mining Research Group

Predictive Modeling of Titanic Survivors: a Learning Competition

Name: Srinivasan Govindaraj Title: Big Data Predictive Analytics

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Predicting Customer Default Times using Survival Analysis Methods in SAS

Prerequisites. Course Outline

Some vendors have a big presence in a particular industry; some are geared toward data scientists, others toward business users.

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

8. Machine Learning Applied Artificial Intelligence

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Introduction to Data Mining

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Predictive Analytics Certificate Program

Data quality in Accounting Information Systems

Data Mining using Artificial Neural Network Rules

CHURN PREDICTION IN MOBILE TELECOM SYSTEM USING DATA MINING TECHNIQUES

Meta-learning. Synonyms. Definition. Characteristics

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

The Data Mining Process

Experiments in Web Page Classification for Semantic Web

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Hadoop s Advantages for! Machine! Learning and. Predictive! Analytics. Webinar will begin shortly. Presented by Hortonworks & Zementis

Transcription:

Extracting Knowledge from Biomedical Data through Logic Learning Machines and Rulex Marco Muselli Institute of Electronics, Computer and Telecommunication Engineering National Research Council of Italy, Genova, Italy marco.muselli@ieiit.cnr.it

Extracting knowledge from data Basic problem: Infer some knowledge about a biological phenomenon of interest starting from a sample of data. Type of knowledge: Correlation, statistical measures Feature ranking, analysis of relevance Prediction, clustering, risk analysis Intelligible model (rules) K N O W L E D G E Marco Muselli 2

Rule generation methods Extract models described by a set of intelligible rule in if-then form If Pressure > 115 and Heart_rate < 100 then Disease = Yes Divide-and-conquer approach Aggregative approach Emphasis on differences! Marco Muselli 3 Emphasis on similarities!

Statistical vs. Machine learning methods Fare Statistical clic methods per modificare stili del Simpler to be used with huge experience Plenty of commercial and free tools available Limited quantity of knowledge extracted A priori hypotheses on probability distributions Machine learning methods Their application is not straightforward and experience is not so big Commercial tools are often extensions of statistical packages; free programs are not so friendly Relevant quantity of knowledge extracted No a priori hypothesis is required Marco Muselli 4

Machine learning software Commercial software Fare clic per modificare Free stili Software del SAS Enterprise Miner (www.sas.com/technologies/analytics/ datamining/miner) IBM SPSS Statistics Software (www-01.ibm.com/software/analytics/ spss/products/statistics) Salford Systems Data Mining Suite (www.salford-systems.com) Statistica Data Miner (www.statsoft.com/products/datamining-solutions) WEKA (www.cs.waikato.ac.nz/ml/weka) RapidMiner (rapid-i.com) Orange (orange.biolab.si) Machine Learning & Statistical Learning in R language (cran.r-project.org/web/views/ MachineLearning.html) Marco Muselli 5

RULEX Suite The suite RULEX (contraction of RULe Extraction) developed by Impara Srl (www.impara-ai.com), a spin-off of the National Research Council of Italy, offers a new simple and powerful tools for extracting knowledge from real world data. The name RULEX is the contraction of RULe Extraction since it is especially devoted to generate intelligible Secondo rules, although livello a wide range of statistical and machine learning approaches will be made available. An intuitive graphical interface allows to easily apply standard and advanced algorithms for analyzing any dataset of interest, providing solution to classification, regression and clustering problems. The software suite is in rapid evolution; therefore, the number and the functionalities of available tasks increase every day. Marco Muselli 6

RULEX GUI Dataset panel Component panel testo Tasks dello schema Source Stage Marco Muselli 7

Logic Learning Machine Besides standard techniques, such as: Decision trees Neural networks Rulex offers the possibility of applying an original proprietary approach, named Logic learning machine (LLM) Logistic K-nearest-neighbor which represents an efficient implementation of the switching neural network model (Muselli, 2006). Marco Muselli 8

Logic Learning Machine LLM allows to solve classification problems producing sets of intelligible rules capable of achieving an accuracy comparable or superior to that of best machine learning methods. The approach of LLM is based on monotone Boolean function synthesis (Shadow Clustering) and Secondo adopts an aggregative livello policy: at any iteration some patterns belonging to the same output class are clustered to produce an intelligible rule. Since the training process occurs in a binary projected space, the application of LLM must be preceded by a discretization task that finds proper cutoffs for ordered (continuous and discrete) input variables. Marco Muselli 9

An application in biomedical analysis The functionalities of Rulex have been verified by analyzing three biomedical datasets included in the Statlog benchmark: Diabetes: it concerns the problem of diagnosing diabetes starting from 8 input variables; all the 768 considered patients are females at least 21 years old of Pima Indian heritage: 268 of them are cases and 500 are controls. Dna: it has the aim of recognizing Terzo acceptors livello and donors sites in a primate gene sequences with length 60 (basis); the dataset consists of 3186 sequences, subdivided into three classes: acceptor, donor, none. Heart: it deals with the detection of heart disease from a set of 13 input variables concerning patient Quinto status; the total livello sample of 250 elements is formed by 120 cases and 150 controls. Marco Muselli 10

An application of Rulex (results) Five classification algorithms have been considered: LLM, DT, NN, LOGIT, and KNN. Results obtained on an independent test set including 30% of data has been compared both in terms of accuracy and of quantity of knowledge extracted Secondo (number of livello rules and average number of conditions). LLM DT NN LOGIT KNN Accuracy # Rules # Cond. Accuracy # Rules # Cond. Accuracy Accuracy Accuracy Diabetes 77.40% 14 3.00 73.04% 56 4.02 75.22% 77.23% 69.13% Dna 94.01% 64 10.86 90.04% 67 6.26 88.69% 92.57% 40.68% Heart 85.19% 19 5 81.48% 18 3.67 80.25% 83.95% 80.25% Marco Muselli 11

Conclusions A new suite, called Rulex, for the analysis of biomedical datasets through conventional and advanced machine learning techniques has been presented. It is able to solve classification, regression and clustering problems. An intuitive graphical interface allows to construct complex analysis processes through the composition Secondo of elementary tasks. livello Facilities for displaying and managing datasets are also provided. Besides standard methods, like logistic, k-nearest-neighbor, neural networks and decision trees, Rulex makes available a new approach, logic learning machines (LLM), whose models are described by intelligible rules. Results obtained for the analysis Quinto of three biomedical livello datasets belonging to the Statlog benchmark point out the good quality of LLM, which achieves an excellent accuracy while providing understandable knowledge about the problem at hand. Marco Muselli 12

Work in progress Version 2.0 of Rulex is currently under beta testing. Several features have been added with the intent of giving researchers a simple but powerful tool for analyzing their own datasets. Functionalities are continuously added to Rulex to improve the versatility of the suite. Suggestions arising Secondo from researchers are livello extremely important, since they allow us to offer a product satisfying the real needs of users. To this aim, we are searching for researchers interested to try the Rulex suite, signaling bugs and providing us advices for improving each part of the product. If you are interested to test Rulex for your specific application, please send me an email (m.muselli@impara-ai.com) and we will provide you a fully functional copy of Rulex. Marco Muselli 13

Thanks testo for dello your schema attention! www.impara-ai.com