Terminology. Machine Learning and Data Mining An Introduction with WEKA. What s a concept? Classification learning

Similar documents
Social Media Mining. Data Mining Essentials

Data Mining with Weka

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Chapter 12 Discovering New Knowledge Data Mining

Experiments in Web Page Classification for Semantic Web

Active Learning SVM for Blogs recommendation

Content-Based Recommendation

Analysis Tools and Libraries for BigData

Web Document Clustering

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Knowledge-based systems and the need for learning

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

An Introduction to Data Mining

8. Machine Learning Applied Artificial Intelligence

Data Mining of Web Access Logs

Introduction to Machine Learning Using Python. Vikram Kamath

Data Mining Algorithms Part 1. Dejan Sarka

Customer Classification And Prediction Based On Data Mining Technique

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News

Supervised Learning (Big Data Analytics)

Big Data Analytics CSCI 4030

Data Mining Techniques for Prognosis in Pancreatic Cancer

Monday Morning Data Mining

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Open-Source Machine Learning: R Meets Weka

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Data Mining Practical Machine Learning Tools and Techniques

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

DATA MINING TECHNIQUES AND APPLICATIONS

Machine Learning in Spam Filtering

Practical Introduction to Machine Learning and Optimization. Alessio Signorini

What is Data Mining? Data Mining (Knowledge discovery in database) Data mining: Basic steps. Mining tasks. Classification: YES, NO

Mining an Online Auctions Data Warehouse

The Data Mining Process

Clustering Connectionist and Statistical Language Processing

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

Role of Social Networking in Marketing using Data Mining

Machine learning for algo trading

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Comparison of K-means and Backpropagation Data Mining Algorithms

More Data Mining with Weka

Data Mining Essentials

Data mining techniques: decision trees

Introduction to Data Mining

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

How To Predict Web Site Visits

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

Mining a Corpus of Job Ads

Decision-Tree Learning

High-dimensional labeled data analysis with Gabriel graphs

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Support Vector Machines with Clustering for Training with Very Large Datasets

dm106 TEXT MINING FOR CUSTOMER RELATIONSHIP MANAGEMENT: AN APPROACH BASED ON LATENT SEMANTIC ANALYSIS AND FUZZY CLUSTERING

MS1b Statistical Data Mining

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Machine Learning Techniques for Data Mining

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Removing Web Spam Links from Search Engine Results

Knowledge Discovery from patents using KMX Text Analytics

Supervised Learning Evaluation (via Sentiment Analysis)!

E-Commerce Web Site Trust Assessment Based on Text Analysis

Machine Learning Final Project Spam Filtering

Model Combination. 24 Novembre 2009

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Data Mining Analysis (breast-cancer data)

Data Mining with R. Decision Trees and Random Forests. Hugh Murrell

Exam in course TDT4215 Web Intelligence - Solutions and guidelines -

Pentaho Data Mining Last Modified on January 22, 2007

Data Mining for Knowledge Management. Classification

Facilitating Business Process Discovery using Analysis

Data Mining Techniques

Data Mining and Visualization

Final Project Report

Visualizing class probability estimators

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

Machine Learning using MapReduce

Collaborative Filtering. Radek Pelánek

Big Data Analytics and Healthcare

Predicting Student Performance by Using Data Mining Methods for Classification

Music Mood Classification

Azure Machine Learning, SQL Data Mining and R

Data Mining. Supervised Methods. Ciro Donalek Ay/Bi 199ab: Methods of Sciences hcp://esci101.blogspot.

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Data Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario jdu43@uwo.ca

Introduction Predictive Analytics Tools: Weka

Transcription:

Machine Learning and Data Mining An Introduction with WEKA AHPCRC Workshop - 8/8/ - Dr. Martin Based on slides by Gregory Piatetsky-Shapiro from Kdnuggets http://www.kdnuggets.com/data_mining_course/ Terminology Components of the input: Concepts: kinds of things that can be learned Aim: intelligible and operational concept description Instances: the individual, independent examples of a concept Note: more complicated forms of input are possible Attributes (Features): measuring aspects of an instance We will focus on nominal and numeric ones What s a concept? Data Mining Tasks (Styles of learning): Classification learning: predicting a discrete class Association learning: detecting associations between features Clustering: grouping similar instances into clusters Numeric prediction: predicting a numeric quantity Concept: thing to be learned Concept description: output of learning scheme Classification learning Example problems: attrition prediction, using DNA data for diagnosis, weather data to predict play/not play Classification learning is supervised Scheme is being provided with actual outcome Outcome is called the class of the example Success can be measured on fresh data for which class labels are known ( test data) In practice success is often measured subjectively Association learning Examples: supermarket basket analysis -what items are bought together (e.g. milk+cereal, chips+salsa) Can be applied if no class is specified and any kind of structure is considered interesting Difference with classification learning: Can predict any attribute s value, not just the class, and more than one attribute s value at a time Hence: far more association rules than classification rules Thus: constraints are necessary Minimum coverage and minimum accuracy Clustering Examples: customer grouping Finding groups of items that are similar Clustering is unsupervised The class of an example is not known Success often measured subjectively 2 5 52 0 02 Sepal length 5. 4.9 7.0 6.4 6.3 5.8 Sepal width 3.5 3.0 3.2 3.2 3.3 2.7 Petal length 4.7 4.5 6.0 5. Petal width 0.2 0.2.5 2.5.9 Type Iris setosa Iris setosa Iris versicolor Iris versicolor Iris virginica Iris virginica

Numeric prediction Classification learning, but class is numeric Learning is supervised Scheme is being provided with target value Measure success on test data Outlook Sunny Sunny Overcast Rainy Temperature Mild Humidity Normal Windy True Play-time 5 0 55 40 Today: Focus on Classification Learn Classifier (function, rule, hypothesis) Supervised: learn from training data Feature vector, class Apply to new data Feature vector -> class ( x!, y ),...( x! n,y n ) Now a Real Example Training Data Machine Learning Algorithm Classifier Data collected from Medical Web Pages Goal: learn two classes Reliability (trustworthiness of information) Type of Page New Data Instance Classifier Class Basic Steps Build a training corpus of web pages Tag instances with classes Extract Features Learn a classifier Test on new data Data MMED000 Corpus 000 pages First 00 hits on Google for 0 topics Adrenoleukodistrophy Alzheimer s Endometriosis Fibromyalgia Obesity Pancreatic cancer colloidal silver irritable bowel syndrome late lyme disease lower back pain Spectrum of: Agreement condition diagnosis treatment cause How common 2

254 Features Link-Based Inlinks, Outlinks, PageRank, Domain Server host name, secure HTML markup Symbols, metadata, JavaScript Bold, italics, underline, font Counts and frequencies Text properties 254 Features LSA vector length, coherence, words per paragraph Personal pronouns, punctuation, unique words Lists of words Whole text, outlinks, anchor text Criteria, medical, commercial, alternative Disclaimer, diagnosis, shopping cart, miracle Features Most Strongly Correlated with Reliable Pages! in anchor text - negatively Outlinks with same server host name and.uk Frequency of font changes - negatively medicine in anchor text Words in text clinical diagnosis disease medication medicine patient symptom Features Most Strongly Correlated with Unreliable Pages! in text Secure outlinks in the same server host name - negatively Number of font attributes Words in anchor text price products testimonial Words in Text doctor I, me, my, them, us, we, you, your miracle natural prevention price products purchase to order testimonial therapist Algorithms Selecting features Information Gain Statistics correlation, regression Classification Naïve Bayes Decision Tree (C4.5) SVM N-Closest Using SPSS and Weka "Data Mining: Practical machine learning tools with Java implementations," by Ian H. Witten and Eibe Frank, Morgan Kaufmann, San Francisco, 2000. http://www.cs.waikato.ac.nz/ml/weka/ Decision Trees Nodes are the features Maximize information gain Expected reduction in entropy by partitioning examples according to given attribute "# p i log 2 p i Entropy = 0 when all in same class when equal number in each class 3

Decision Tree: U-O Naïve Bayes <= O(48) <= 6 OutOff_other <= 0.02652 Whole_testimonial <= 0 Whole_we > 6 U(4/) > Freq_header_words > 0.02652 > 0 U(7/) Want to calculate P(C f) C = class, f = set of features observed Use Bayes Rule: P(C f) = (P(C)p(f C))/p(f) Reduces to computing p(f C) Assume features conditionally independent given the class p(f C) = product(p( x i C)) O(8/) U(2) Support Vector Machines Maximum Margin Hyperplane Optimal separation between classes Kernel method Nonlinear transformation between dot product spaces Nonlinear space transformed to linear space http://www.ucl.ac.uk/oncology/microcore/html _resource/images/svm_.jpg Latent Semantic Analysis Provides measures of the semantic relatedness, quality, and quantity of information contained in discourse Implementation: Four Basic Steps Term by document (context) matrix Convert matrix entries to weights Singular Value Decomposition (SVD) performed on matrix Reduce Rank of matrix all but the k highest singular values are set to 0 produces k-dimensional approximation of the original matrix (in least-squares sense) this is the semantic space N-Closest Version of K-Nearest Neighbor algorithm Finds closest pages in LSA semantic space to page P Checks their classes Sums cosines for each class Predicts Ps class based on largest sum Standard Performance Measures Accuracy: percent correct For classifier: (a+d)/(a+b+c+d) Precision: portion of selected items that the system got right For class R: a/(a+b) Recall: portion of the target items that the system selected For class R: a/(a+c) F-Measure: (2*precision*recall)/(precision+recall) R predicted O predicted R is correct a c O is correct b d 4

Kappa Statistic Now Let s Look at the.arff file Chance Corrected Measure of Agreement Cohen 960 κ = (P(O) - P(E)) / ( - P(E)) P(O) = proportion of agreement observed P(E) = proportion of agreement expected by chance 5