Data Mining Individual Assignment report




Björn Þór Jónsson (bjrr@itu.dk)

This report outlines the implementation and results of the Data Mining methods of preprocessing, supervised learning, frequent pattern mining and clustering, applied to questionnaire results submitted by students in a 2014 Data Mining class. The implementation is split into Java packages, one for each Data Mining method, and the package names accompany each section name below, for easy reference. Comments may be sparse, but descriptive method and variable names should make up for that; that is a coding style I have come to appreciate, since metadata in code comments can do more harm than good when it is not maintained as the code changes and becomes outdated. I hope the implementation proves readable. Plots of generated data are made with simple R scripts that can be found in the plots directory within the project root.

Preprocessing (code namespace: is.bthj.itu.datamining.preprocessing)

The attributes chosen from the data are: age, programming skill, years at university, preferred operating system, favorite programming languages, whether more mountains should be in Denmark, whether one is fed up with the winter, and favorite color. Cleaning the data consists of normalization, in the form of inferring consistent values from ones that are considered the same and clamping numerical values to a defined range. After that process, tuples that still have unknown values are removed.
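As a minimal sketch of these cleaning rules (with hypothetical names and synonym lists for illustration, not the project's actual CSVFileReader or QuestionairePreProcessor code):

```java
import java.util.Map;

// Sketch of the cleaning rules described above: range checks for age,
// clamping for programming skill, and synonym normalization for the
// operating system answers. Names and synonym lists are illustrative.
public class CleaningSketch {

    // Accept ages only inside the plausible 18..120 range; -1 marks an
    // unknown value, so the tuple can be dropped later.
    static int cleanAge(String raw) {
        try {
            int age = Integer.parseInt(raw.trim());
            return (age >= 18 && age <= 120) ? age : -1;
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // Clamp programming skill into the 1..10 questionnaire scale.
    static int clampSkill(int skill) {
        return Math.max(1, Math.min(10, skill));
    }

    // Map alternative spellings to one consistent value, in the spirit
    // of the OSSynonyms enumeration (entries here are made up).
    static final Map<String, String> OS_SYNONYMS = Map.of(
            "osx", "MacOS", "mac os x", "MacOS", "macos", "MacOS",
            "win", "Windows", "windows 7", "Windows",
            "ubuntu", "Linux", "debian", "Linux", "linux", "Linux");

    static String cleanOS(String raw) {
        return OS_SYNONYMS.getOrDefault(raw.trim().toLowerCase(), "unknown");
    }

    public static void main(String[] args) {
        System.out.println(cleanAge(" 25 "));   // 25
        System.out.println(cleanAge("250"));    // -1, out of range
        System.out.println(clampSkill(12));     // 10
        System.out.println(cleanOS("OSX"));     // MacOS
    }
}
```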
Specifically, age values are accepted only if they are between 18 and 120, inclusive; programming skill is clamped to the range 1 to 10; years-at-university values are accepted as they are if they prove to be a known numerical value; preferred operating system answers are set to consistent values inferred from a list of alternative spellings, as can be seen in OSSynonyms; values from the list of favorite programming languages are similarly set to consistent ones inferred from lists of synonyms in the enumeration ProgrammingLanguages; the boolean attributes about mountains and winter in Denmark are set to either Yes or No by comparison with many different synonyms for those words, in the enumeration BooleanSynonyms; and favorite color is set to the closest match found in the list of color names in BasicColorNames. Cleaning the data in this way and writing it to disk can be done by running the main method of CSVFileReader in the .preprocessing package; the results can be seen in the file

cleaned dataset.csv in the project's root. In the rest of the project, the cleaning method QuestionairePreProcessor.getCleanedQuestionaires is called directly in code instead of reading from this file, favoring ease over efficiency.

Supervised learning: classification (is.bthj.itu.datamining.classification)

For classification with supervised learning, the kNN method was chosen, with the target attribute: Do you think there should be more mountains in Denmark? Different combinations of the other attributes, both numerical and nominal, were tried for computing the distance between tuples (by commenting out different parts of ClassificationKNN.distanceBetweenTwoTuples; that could indeed have been done in a more elegant way). The implementation can be tested by running the main method in the ClassificationKNN class. Plots of classification accuracy for a few of the different combinations can be seen below, where the Favorite color attribute alone proves to be best for classifying the tuples; k = 11 gives 89% accuracy.

[Plots of classification accuracy, with the distance metric computed from: the color attribute; the age attribute; age, programming skill and operating system; all attributes; years at university.]
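The core idea can be sketched as follows, assuming (as a simplification of ClassificationKNN) that the distance between two tuples is a scaled numeric difference plus a 0/1 penalty per nominal mismatch, with a majority vote among the k nearest neighbours; the data and attribute names here are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal kNN sketch over mixed numerical and nominal attributes.
// A simplified stand-in for ClassificationKNN, not the project's code.
public class KNNSketch {

    record Tuple(double age, String color, String label) {}

    // Scaled numeric distance plus a 0/1 penalty for a nominal mismatch.
    static double distance(Tuple a, Tuple b) {
        double d = Math.abs(a.age() - b.age()) / 100.0; // rough scaling
        d += a.color().equals(b.color()) ? 0.0 : 1.0;
        return d;
    }

    // Majority vote among the k nearest neighbours of the query tuple.
    static String classify(List<Tuple> data, Tuple query, int k) {
        List<Tuple> sorted = new ArrayList<>(data);
        sorted.sort(Comparator.comparingDouble(t -> distance(t, query)));
        Map<String, Integer> votes = new HashMap<>();
        for (Tuple t : sorted.subList(0, Math.min(k, sorted.size())))
            votes.merge(t.label(), 1, Integer::sum);
        return Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Tuple> data = List.of(
                new Tuple(22, "blue", "Yes"),
                new Tuple(24, "blue", "Yes"),
                new Tuple(35, "red", "No"),
                new Tuple(40, "red", "No"));
        Tuple query = new Tuple(23, "blue", "?");
        System.out.println(classify(data, query, 3)); // Yes
    }
}
```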

Frequent pattern / association mining (is.bthj.itu.datamining.association)

For finding frequent patterns with a given support, and association rules with a given minimum confidence, the Apriori algorithm was implemented and targeted at the Favorite programming languages attribute. The implementation can be tested by running the main method in the Apriori class. To test and validate the implementation, data was used from Example 6.3 and Table 6.1 in the textbook Data Mining: Concepts and Techniques, 3rd edition (see method Apriori.getTextBookTransactionalData). That proved to be a good idea, as comparison with the results in Example 6.3 uncovered errors in the implementation. One error was in the frequent itemset search, where support for candidate sets was found by only comparing the first elements of the set with the first elements of each set in the data, in other words depending on the compared elements occurring in the same order, instead of searching for the existence of each element of the candidate set anywhere in each data record set (see method Apriori.countSupport).
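The corrected counting amounts to an order-independent subset test, which can be sketched like this (illustrative data, not the project's Apriori.countSupport itself):

```java
import java.util.List;
import java.util.Set;

// Sketch of the corrected support count: a candidate itemset is
// supported by a transaction when every element of the candidate
// occurs somewhere in the transaction, regardless of order.
public class SupportSketch {

    static int countSupport(Set<String> candidate, List<Set<String>> transactions) {
        int count = 0;
        for (Set<String> t : transactions)
            if (t.containsAll(candidate)) // order-independent subset test
                count++;
        return count;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("Java", "CSharp", "PHP"),
                Set.of("PHP", "Java"),
                Set.of("CSharp", "Scala"));
        System.out.println(countSupport(Set.of("Java", "PHP"), transactions)); // 2
    }
}
```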
Another uncovered error was in the generation of association rules, where the confidence calculation was flawed: confidence(A => B) was computed as support_count(B) / support_count(A) instead of support_count(A U B) / support_count(A) (see method Apriori.printAssociationRules). Output from the implementation, by running the main method in the Apriori class with support set to 2 and minimum confidence set to 70%, is the following:

***Frequent itemsets with minimum support: 2
[C, CSharp, Java]
[CPlusPlus, CSharp, Java]
[CSharp, FSharp, Java]
[CSharp, FSharp, Scala]
[CSharp, Java, JavaScript]
[CSharp, Java, PHP]
[CSharp, Java, Python]
[CSharp, JavaScript, Python]

***Association rules with minimum confidence = 70%
C,CSharp => Java, confidence = 2/2 = 100%
C,Java => CSharp, confidence = 2/2 = 100%
CPlusPlus,CSharp => Java, confidence = 2/2 = 100%
FSharp,Scala => CSharp, confidence = 2/2 = 100%
Java,JavaScript => CSharp, confidence = 3/4 = 75%
CSharp,PHP => Java, confidence = 7/8 = 88%
Java,PHP => CSharp, confidence = 7/8 = 88%
PHP => CSharp,Java, confidence = 7/10 = 70%

From this we can, for example, say that a Java and JavaScript preference implies a CSharp preference, with 75% confidence.
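The corrected confidence formula can be sketched as follows, with a small illustrative transaction list chosen so that the Java,JavaScript => CSharp rule comes out at 3/4 = 75% (this is not the project's Apriori.printAssociationRules code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the corrected confidence computation for a rule A => B:
// confidence = support_count(A U B) / support_count(A).
public class ConfidenceSketch {

    static int supportCount(Set<String> items, List<Set<String>> transactions) {
        int count = 0;
        for (Set<String> t : transactions)
            if (t.containsAll(items))
                count++;
        return count;
    }

    static double confidence(Set<String> a, Set<String> b,
                             List<Set<String>> transactions) {
        Set<String> union = new HashSet<>(a); // A U B, the rule's full itemset
        union.addAll(b);
        return (double) supportCount(union, transactions)
                / supportCount(a, transactions);
    }

    public static void main(String[] args) {
        List<Set<String>> tx = List.of(
                Set.of("Java", "JavaScript", "CSharp"),
                Set.of("Java", "JavaScript", "CSharp"),
                Set.of("Java", "JavaScript", "CSharp"),
                Set.of("Java", "JavaScript"));
        // {Java, JavaScript} => {CSharp}: 3/4 = 0.75
        System.out.println(confidence(Set.of("Java", "JavaScript"),
                Set.of("CSharp"), tx));
    }
}
```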

Clustering (is.bthj.itu.datamining.clustering)

To cluster the tuples into k partitions, the k-Means technique was implemented. Only one dimension of the data was used, to partition by age, but more dimensions could easily be added by expanding the method KMeans.getTupleValue. The implementation can be tested by running the main method in the KMeans class. To measure the quality of the clusters formed in this dimension for different values of k, the sum of squared errors for each partition count k was computed; and as initial cluster centroids are chosen at random, an average of errors from 10 computations was taken for each k:

Average of 10 sums of squared errors for partition size k = 2: 22.065603
Average of 10 sums of squared errors for partition size k = 3: 11.402188
Average of 10 sums of squared errors for partition size k = 4: 13.638395
Average of 10 sums of squared errors for partition size k = 5: 5.0326624
Average of 10 sums of squared errors for partition size k = 6: 1.6928288
Average of 10 sums of squared errors for partition size k = 7: 2.0382862
Average of 10 sums of squared errors for partition size k = 8: 6.4158945
Average of 10 sums of squared errors for partition size k = 9: 1.112445
Average of 10 sums of squared errors for partition size k = 10: 0.49283415

[Plots of the average error for k = 2...10 and for k = 2...30.]

From this it can be seen that k = 6 gives a comparatively low local minimum of error with a reasonably low number of partitions, so k = 6 seems to be a good choice when clustering the tuples by values of the age attribute. Though clustering is unsupervised, and so has no predefined classes, it could be interesting to look at how well this clustering method performs as a classifier, for example by measuring how dominantly similar single nominal values, like Favorite color, are within each cluster, as a measure of goodness; but I will let the sum of squared errors suffice as a measure for now.
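A one-dimensional k-Means with its sum-of-squared-errors measure can be sketched like this; a simplified stand-in for the project's KMeans class, with deterministic initial centroids spread over the (sorted) input rather than the random initialization and 10-run averaging used in the report:

```java
// One-dimensional k-Means sketch with sum of squared errors (SSE),
// along the lines of clustering by the age attribute. Illustrative
// code, not the project's KMeans class: initial centroids are spread
// deterministically over the sorted input instead of chosen at random.
public class KMeans1DSketch {

    static double sse(double[] values, int k, int iterations) {
        double[] centroids = new double[k];
        for (int i = 0; i < k; i++) // spread initial centroids over the input
            centroids[i] = values[i * (values.length - 1) / Math.max(1, k - 1)];
        int[] assign = new int[values.length];
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < values.length; i++) { // assignment step
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (Math.abs(values[i] - centroids[c])
                            < Math.abs(values[i] - centroids[best]))
                        best = c;
                assign[i] = best;
            }
            for (int c = 0; c < k; c++) {             // update step
                double sum = 0;
                int n = 0;
                for (int i = 0; i < values.length; i++)
                    if (assign[i] == c) { sum += values[i]; n++; }
                if (n > 0) centroids[c] = sum / n;
            }
        }
        double sse = 0;                               // quality of the result
        for (int i = 0; i < values.length; i++)
            sse += Math.pow(values[i] - centroids[assign[i]], 2);
        return sse;
    }

    public static void main(String[] args) {
        double[] ages = {21, 22, 22, 23, 30, 31, 31, 45, 46};
        System.out.println(sse(ages, 2, 20)); // SSE for k = 2
        System.out.println(sse(ages, 3, 20)); // lower SSE as k grows
    }
}
```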

Conclusion

It has been interesting to get acquainted with these Data Mining methods, and I can foresee using them in my future game development.

IT University of Copenhagen, spring 2014
Björn Þór Jónsson (bjrr@itu.dk)