CSC 177 Fall 2014 Team Project Final Report




Project Title: Data Mining on Farmers Market Data
Instructor: Dr. Meiliu Lu
Team Members: Yogesh Isawe, Kalindi Mehta, Aditi Kulkarni

CSc 177 DM Project Cover Page
Due 12-15-14 5pm (submit it to the CSC Department office before 5pm 12/15/14, or to the instructor at 5:15pm in RVR 5029)
Student(s) Name: Aditi Kulkarni, Kalindi Mehta, Yogesh Isawe
Grade:
Title of the project: Data Mining on Farmers Market
Hand-in check list:
- A hardcopy of the final report (without appendix) with cover page for the term project
- An electronic copy on a CD including all of the important writings of your term project
- Project oral presentation PowerPoint file, with improvements made based on comments of the class and instructor during the oral presentation
Project final report (100%) containing the following parts, font >= 11:
1. objective statement of the term project (1/3-1/2 page);
2. background information (1 page);
3. design principle of your data mining system / scope of study (1/3-1/2 page);
4. implementation issues and solutions / survey results / diagrams / tables (3-5 pages);
5. summary of learning experience such as experiments and readings (1/2-1 page);
6. references (authors, title, publishing source, date of publication, URL); each reference should be cited in the report text;
7. appendix (optional) containing supporting material such as examples, sample demo sessions, and any information that reflects your effort on the project.

TABLE OF CONTENTS
1. OBJECTIVE
2. BACKGROUND INFORMATION
3. DESIGN PRINCIPLES
4. IMPLEMENTATION ISSUES AND SOLUTIONS
5. SUMMARY OF LEARNING EXPERIENCE
6. FUTURE SCOPE
7. REFERENCES

1. Abstract
The data set consists of the locations of U.S. farmers markets and the goods available at each market by season. We created a data mart that provides this information and answers questions designed for two types of users: consumers and government officials. For the data mining project, we work on the same data to find patterns.

2. Objective
Use the data mining tool WEKA to carry out a multi-step data mining exercise: interpret the data well, understand the structure of the data using one or more data mining algorithms, and present the findings. Mine the data to extract knowledge from it, and explore alternative data mining tools such as RapidMiner.

3. Background Information
In this data mining project we mine US Farmers Market data to extract knowledge, using the WEKA tool. The data source is http://catalog.data.gov/dataset/farmers-markets-geographic-data. The original dataset consists of 8000 records with 41 different attributes related to farmers markets. Our primary goal is to use different mining tools to apply classification and clustering algorithms.

4. Design Principles
The design of this project centers on data cleaning and preprocessing. The first phase is cleaning the data and making it compatible with the data mining tool; the next phase is applying data mining algorithms to obtain classification and clustering results and to study those algorithms. The data was cleaned and pre-processed manually by checking all attribute entries and making changes in Microsoft Office Excel. Using the WEKA data mining tool, and based on the structure and type of the database, we applied the following algorithms:
1. Classification algorithms:
   a. Logistic algorithm
   b. J48 (decision tree)
2. Clustering algorithms:
   a. Expectation Maximization (EM) algorithm
   b. K-Means algorithm

5. Implementation

To mine the data we followed the KDD process. The steps were:
1. Data preprocessing: Since this is real-world data, it is noisy and needs preprocessing. To make it easier to handle, we trimmed the original data to 1907 rows and use 35 of the 41 attributes. The Season attribute was not consistent throughout the data; in some records it was given as a date or as a duration in months, so to make it consistent we added two columns, Season start and Season end. Some special characters in the data are not accepted by Weka, so we removed them or replaced them with appropriate ones.
2. Import the preprocessed data into Weka (a small CSV-to-ARFF import sketch is shown below, after the attribute sets).
3. Apply the classification and clustering algorithms described below. Based on the structure of the data set and the type of database, only specific algorithms can yield results that interpret the data well.

6. Classification Algorithms
We used the same database for the data mining project and the data warehousing project. The database is large and widely spread, with many independent and only a few dependent attributes. After analyzing the database, we concluded that applying different data mining algorithms to different sets of attributes would interpret the data best. The two broad attribute sets formed for the data mining project are:
1. Goods prediction and clustering: Location + Season Information + Goods Available

Basic classification histogram
In the above diagram we can select any good as the class attribute and visualize the distribution of that good across all states or seasons.
Red: the selected good is available
Blue: the selected good is not available

2. Nutrition program prediction and clustering: Location + Season Information + Nutrition Programs
For the nutrition programs we find out which program is available at which market location and during which season.
Red: the nutrition program is available
Blue: the nutrition program is not available
All the instances in the dataset are visualized based on these two conditions for each of the above attributes, i.e. whether the nutrition program is available (red) or not (blue).
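Before either attribute set is mined, the cleaned spreadsheet has to be imported into Weka (step 2 above). The following is a minimal sketch of that conversion using the Weka Java API instead of the Explorer GUI; the file names are hypothetical.

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the manually cleaned spreadsheet export (hypothetical file name).
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("farmers_market_cleaned.csv"));
        Instances data = loader.getDataSet();

        // Save in ARFF format so it can be reused in the Explorer or from code.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("farmers_market.arff"));
        saver.writeBatch();

        System.out.println("Wrote " + data.numInstances() + " instances with "
                + data.numAttributes() + " attributes.");
    }
}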

6.1 Logistic Algorithm
Logistic regression is a highly regarded classical statistical technique for making predictions. The algorithm assigns a weight to each attribute in the data set and uses the logistic regression formula to predict how accurately a particular attribute value can be determined for future instances. Using related (interdependent) attributes therefore increases prediction capability, as opposed to using all of the available data, because unrelated attributes would distort the weights used to compute the prediction accuracy. To apply logistic classification to the Goods data, only the set of relevant (dependent) attributes is used; the logistic algorithm then assigns a weight to every attribute in this set.

These weights are then run through the logistic regression formula to predict the attribute under consideration, in this example Wine.

Logistic Algorithm for class Wine
From the above diagram we interpret that the logistic classification algorithm can predict the next/future instance of Wine with 88.8% accuracy, given the dependent relations among all the attributes used in this example (location + season + all goods). Similarly, for the nutrition programs we use the location + season + nutrition program attributes; in the following example the algorithm predicts a future instance of SFMNP with 83.4% accuracy.

Logistic Algorithm for class SFMNP
Logistic Algorithm for class WICcash
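A minimal sketch of how such a logistic model can be trained and evaluated with the Weka Java API; the runs in this report were done in the Explorer GUI, and the ARFF file name and class attribute below are hypothetical.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticExample {
    public static void main(String[] args) throws Exception {
        // Load the goods attribute set (hypothetical file name).
        Instances data = new DataSource("farmers_market_goods.arff").getDataSet();
        // Predict the Wine attribute (hypothetical attribute name).
        data.setClassIndex(data.attribute("Wine").index());

        Logistic logistic = new Logistic();

        // 10-fold cross-validation and percentage of correctly classified instances,
        // the same summary figure reported by the Explorer.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(logistic, data, 10, new Random(1));
        System.out.printf("Correctly classified: %.1f%%%n", eval.pctCorrect());
    }
}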

6.2 J48 Algorithm (Decision Tree)
Where the logistic algorithm relied on a hand-picked set of related attributes, the J48 algorithm can handle both nominal and numeric attribute values and itself selects the most relevant attributes from the dataset to determine the predictions. It is therefore better to give J48 all of the attributes rather than only the relevant ones, as we did for the logistic algorithm; using the whole data set increases its prediction accuracy. J48 visualizes its result as a decision tree in which the most relevant attributes are used to predict the future value of a particular attribute, and rules can be read off this tree.
J48 Algorithm on Bake-goods
From the above diagram, Bake-goods can be predicted with 94% accuracy using the attribute Vegetables, which J48 determined to be the most relevant.
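As a rough sketch of the same run with the Weka Java API (the file and attribute names are hypothetical; the report's runs were done in the Explorer GUI), a J48 tree for Bake-goods could be built, printed, and evaluated like this. The printed tree is where if/then rules such as those listed below come from.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Example {
    public static void main(String[] args) throws Exception {
        // Load the full goods attribute set (hypothetical file name).
        Instances data = new DataSource("farmers_market_goods.arff").getDataSet();
        // Predict the Bake-goods attribute (hypothetical attribute name).
        data.setClassIndex(data.attribute("Bakedgoods").index());

        // Build the tree on the full data and print its text form.
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);

        // Estimate accuracy with 10-fold cross-validation on a fresh tree.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("Correctly classified: %.1f%%%n", eval.pctCorrect());
    }
}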

Decision Tree for Bake goods
The attribute Vegetables is not used alone to predict Bake-goods; other relevant attributes such as Prepared and Soap are used as well. The rules that can be read from the above decision tree are:
1. If Vegetables = Yes then Bake-goods = Yes
2. If Vegetables = No and Prepared = Yes then Bake-goods = Yes
3. If Vegetables = No and Prepared = No and Soap = Yes then Bake-goods = Yes
4. If Vegetables = No and Prepared = No and Soap = No then Bake-goods = No
The next diagram shows prediction of an instance of Herbs with 90.8% accuracy.

J48 Algorithm for class Herbs
For Herbs, J48 again chooses the most relevant attribute, Vegetables, but other attributes from the dataset are also used to form the rules: Jams, Eggs, Seafood, and Prepared. Rules can be formed as in the previous case from the following decision tree.

Decision Tree for Herbs class

J48 Algorithm for class SNAP

Decision Tree for SNAP

J48 Algorithm for class WIC

Decision Tree for WIC

7. Clustering Algorithms
Clustering algorithms are applied to sets of similar data to interpret the data well. We created two sets of attributes:
1. All Goods
2. Nutrition Programs
Each attribute has two distinct values, Yes/No (Y/N), so the number of clusters used for both the EM and K-Means algorithms is two.

Basic clustering histogram for Goods
Basic clustering histogram for Nutrition Program

7.1 EM Algorithm

Weka exposes properties for each clustering algorithm through which we can specify values so that the data is interpreted well:
numClusters: the number of clusters to form. For the EM algorithm we do not need to specify this number; EM determines the number of clusters from the data. The value is therefore left at -1, which means the algorithm will choose the number of clusters based on the dataset.
seed: the seed of the random initialization used to choose the initial centre around which the algorithm forms clusters. Because of the size and spread of the data, we keep this value at 100.
From the above diagram, EM forms two clusters; the likely reason is the two distinct values of the attributes in the dataset.
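A minimal sketch of the EM run with the Weka Java API, under the assumption that the ARFF file below contains only the nutrition-program attribute set and has no class attribute set (the report's runs were done in the Explorer GUI).

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EMExample {
    public static void main(String[] args) throws Exception {
        // Load the nutrition-program attribute set (hypothetical file name).
        Instances data = new DataSource("farmers_market_nutrition.arff").getDataSet();

        EM em = new EM();
        em.setNumClusters(-1); // -1: let EM pick the number of clusters from the data
        em.setSeed(100);       // same seed value as in the Explorer run
        em.buildClusterer(data);

        // Summarize how the instances fall into the discovered clusters.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}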

EM Algorithm Applied for Nutrition Program

7.2 Simple K-Means
The second clustering algorithm we used is Simple K-Means. Its properties are:

numClusters: unlike EM, K-Means requires the number of clusters to be specified. We set it to two so that the results can be compared with those of the EM algorithm, which determined from the dataset that two clusters should be formed.
seed: to compare the K-Means result with the EM result, and for a better chance at forming good clusters, we also set this value to 100.
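A corresponding sketch for Simple K-Means under the same assumptions (hypothetical file name, two clusters, seed 100):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansExample {
    public static void main(String[] args) throws Exception {
        // Load the nutrition-program attribute set (hypothetical file name).
        Instances data = new DataSource("farmers_market_nutrition.arff").getDataSet();

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2); // K-Means needs the number of clusters up front
        km.setSeed(100);      // same seed as the EM run, for comparison
        km.buildClusterer(data);

        // Report cluster sizes and assignments for comparison with EM.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}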

Simple K-Means Applied for Nutrition Program
Comparing the two clustering results for the Nutrition Program data, we get nearly the same outcome: roughly 70% of the instances fall in one cluster and roughly 30% in the other. The following diagrams show the clustering algorithms applied to the Goods data.

EM Algorithm Applied for Goods: 1st cluster 51% of instances, 2nd cluster 49% of instances
Simple K-Means Applied for Goods: 1st cluster 57% of instances, 2nd cluster 43% of instances

Here the clustering results are not as similar as in the case of the Nutrition Program data. This may be an effect of the size and spread of the Goods dataset.

8. Summary of Learning Experience
- Learned a data mining tool, WEKA.
- Gained a better understanding of classification algorithms such as J48 and logistic regression.
- Learned different clustering algorithms such as EM and Simple K-Means.
- Learned the real-world application of these algorithms and how to analyze their results.
- Experienced the advantages of team work.
- Read many articles to get a clear idea of how to do data mining.

9. References
Data source: http://catalog.data.gov/dataset/farmers-markets-geographic-data
Weka tutorial: http://youtu.be/m7kpibgedki
RapidMiner tutorial: https://www.youtube.com/watch?v=eyyghzsvzpm&list=pllyinnlbo1evvz2WJLWfbp_JWgg5It1O6