CSC 177 Fall 2014 Team Project Final Report
Project Title: Data Mining on Farmers Market Data
Instructor: Dr. Meiliu Lu
Team Members: Yogesh Isawe, Kalindi Mehta, Aditi Kulkarni
Cover Page (due 12/15/14)
Student(s) Name: Aditi Kulkarni, Kalindi Mehta, Yogesh Isawe
Title of the project: Data Mining on Farmers Market
TABLE OF CONTENTS
1. OBJECTIVE
2. BACKGROUND INFORMATION
3. DESIGN PRINCIPLES
4. IMPLEMENTATION ISSUES AND SOLUTIONS
5. SUMMARY OF LEARNING EXPERIENCE
6. FUTURE SCOPE
7. REFERENCES
1. Abstract
The dataset describes the locations of U.S. farmers markets and the goods available at each market by season. We created a data mart that provides this information and answers questions designed for two types of users: consumers and government officials. For the data mining project, we work on the same data to find patterns.

2. Objective
Use the data mining tool WEKA to carry out a multi-step data mining exercise: interpret the data well, understand its structure using one or more data mining algorithms, and present the findings. Mine the data to extract knowledge, and explore alternative data mining tools such as RapidMiner.

3. Background Information
In this project we mine U.S. Farmers Market data with the WEKA tool to extract knowledge. The data source is http://catalog.data.gov/dataset/farmers-markets-geographic-data. The original dataset consists of about 8,000 records with 41 attributes related to farmers markets. Our primary goal is to use different mining tools to apply classification and clustering algorithms.

4. Design Principles
The design of this project has two phases. The first phase is cleaning and preprocessing the data to make it compatible with the data mining tool; the second is applying data mining algorithms to obtain classification and clustering results and studying those algorithms. The data was cleaned and preprocessed manually by checking all attribute entries and making changes in Microsoft Office Excel. Based on the structure and type of the database, we applied the following algorithms in WEKA:
1. Classification algorithms:
   a. Logistic algorithm
   b. J48 (decision tree)
2. Clustering algorithms:
   a. Expectation Maximization (EM)
   b. K-Means

5. Implementation
To mine the data we followed the KDD process, with these steps:
1. Data preprocessing: Because this is real-world data, it is noisy and needs preprocessing. To make it easier to handle, we trimmed the original data to 1,907 rows and use 35 of the 41 attributes. The Season attribute was not consistent throughout the data: in some records it was given as a date, in others as a range of months. To make it consistent we added two columns, Season start and Season end. Some records contained special characters that WEKA does not accept, so we removed them or replaced them with appropriate ones.
2. Import the preprocessed data into WEKA.
3. Apply the classification and clustering algorithms described below. Based on the structure of the dataset and the type of database, only suitable algorithms yield results that interpret the data well.

6. Classification Algorithms
We used the same database for the data mining and data warehousing projects. The database is very large and widely distributed, with many independent attributes and only a few dependent ones. After analyzing it, we concluded that applying different data mining algorithms to different subsets of attributes would interpret the data best. The two broad attribute sets formed for the data mining project are:
1. Goods prediction and clustering: Location + Season information + Goods available
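The preprocessing in step 1 above can be sketched in plain Python. This is a minimal sketch, not our actual cleaning script: the separator words, month formats, and the set of characters WEKA rejects are assumptions for illustration.

```python
import re

MONTHS = {m: i + 1 for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June",
     "July", "August", "September", "October", "November", "December"])}

def split_season(season):
    """Turn an inconsistent Season value such as 'June to October'
    into the (Season start, Season end) pair described in step 1."""
    parts = re.split(r"\s+to\s+|\s*-\s*", season.strip())
    if len(parts) == 2 and parts[0] in MONTHS and parts[1] in MONTHS:
        return parts[0], parts[1]
    return season, season  # single month or unparseable: duplicate as-is

def clean_cell(value):
    """Strip special characters (assumed set) that the ARFF loader
    rejects, keeping word characters, spaces, and safe punctuation."""
    return re.sub(r"[^\w ./-]", "", value)
```

A record with Season "June to October" would yield Season start "June" and Season end "October", while a cell like "Joe's Market #3!" is reduced to "Joes Market 3" before import.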
Basic Classification Histogram
In the above diagram we can select a good as the class and visualize the distribution of that good across all states or seasons.
Red: the selected good is available.
Blue: the selected good is not available.
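The tally behind such a histogram is a simple count of available (red) versus unavailable (blue) markets per state. The sketch below assumes records are dictionaries with a "State" key and Y/N goods columns; the sample rows are invented for illustration.

```python
from collections import Counter

def availability_histogram(records, good):
    """Count, per (state, value) pair, how many markets do ('Y', red)
    and do not ('N', blue) offer the given good."""
    counts = Counter()
    for rec in records:
        counts[(rec["State"], rec[good])] += 1
    return counts

# hypothetical sample rows in the shape of the cleaned dataset
markets = [
    {"State": "CA", "Wine": "Y"},
    {"State": "CA", "Wine": "N"},
    {"State": "CA", "Wine": "Y"},
    {"State": "NY", "Wine": "N"},
]
hist = availability_histogram(markets, "Wine")
```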
2. Nutrition program prediction and clustering: Location + Season information + Nutrition programs
For the nutrition programs we determine which program is available at which market location and during which season.
Red: the nutrition program is available.
Blue: the nutrition program is not available.
All instances in the dataset are visualized on two conditions for each of the above attributes, i.e. whether the nutrition program is available (red) or not (blue).

6.1 Logistic Algorithm
Logistic regression is a highly regarded classical statistical technique for making predictions. The logistic algorithm assigns a weight to each attribute in the dataset and uses the logistic regression formula to predict how accurately a particular attribute value can be determined for future instances. Using related (interdependent) attributes therefore increases prediction capability, as opposed to using all available data, because independent attributes distort the weights from which the prediction accuracy is computed. To classify the goods dataset with the logistic algorithm, only the set of relevant (dependent) attributes is used. The algorithm assigns a weight to each attribute in this set, and these weights are then run through the logistic regression formula to predict the attribute under consideration, in this example Wine.
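The weight-and-formula idea can be shown with a from-scratch sketch. This is a toy gradient-descent version on invented 0/1 rows, not WEKA's Logistic implementation, which uses a more sophisticated fitting procedure.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit one weight per attribute plus a bias by gradient descent on
    the logistic loss; return a 0/1 predictor for new rows."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))   # the logistic formula
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err

    def predict(xi):
        z = b + sum(wj * xj for wj, xj in zip(w, xi))
        return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0
    return predict

# hypothetical rows: 1/0 for (Vegetables, Cheese); target: Wine available
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0]  # linearly separable toy labels
predict = train_logistic(X, y)
```

Because the toy labels are linearly separable, the learned weights classify all four rows correctly; on real data the analogous figure is the accuracy percentage WEKA reports.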
Logistic Algorithm for class Wine
From the above diagram we interpret that the logistic classification algorithm can predict the next (future) instance of Wine with 88.8% accuracy, given the dependent relations among all the attributes used in this example (location + season + all goods). Similarly, for the nutrition programs we use the location + season + nutrition program dataset; in the following example the algorithm predicts a future instance of SFNMP with 83.4% accuracy.
Logistic Algorithm for class SFNMP
Logistic Algorithm for class WICcash

6.2 J48 Algorithm (Decision Tree)
The logistic algorithm cannot predict numeric values, whereas J48 can predict both nominal and numeric attribute values. J48 selects the most relevant attributes from the dataset to determine its predictions, so it is better to give it all the attributes rather than only the relevant ones, as we did for the logistic algorithm: using the full dataset increases J48's prediction accuracy. J48 visualizes its result as a decision tree in which the most relevant attributes predict the value of a future instance of the target attribute, and rules can be read directly from this tree.

J48 Algorithm on Bake-goods
From the above diagram, Bake-goods can be predicted with 94% accuracy using the attribute Vegetables, which J48 determined to be the most relevant.
Decision Tree for Bake-goods
The attribute Vegetables is not used alone to predict Bake-goods; other relevant attributes such as Prepared and Soap also appear. The rules that can be formed from the above decision tree are:
1. If Vegetables = Yes then Bake-goods = Yes
2. If Vegetables = No and Prepared = Yes then Bake-goods = Yes
3. If Vegetables = No and Prepared = No and Soap = Yes then Bake-goods = Yes
4. If Vegetables = No and Prepared = No and Soap = No then Bake-goods = No
The next diagram shows prediction of an instance of Herbs with 90.8% accuracy.
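The four rules read from the Bake-goods tree translate directly into a small predictor, shown here only to make the rule-from-tree idea concrete (the function name and Y/N encoding are our own):

```python
def predict_bake_goods(vegetables, prepared, soap):
    """Apply the four rules from the J48 decision tree for Bake-goods.
    Each argument is 'Y' or 'N', matching the dataset's values."""
    if vegetables == "Y":
        return "Y"      # rule 1
    if prepared == "Y":
        return "Y"      # rule 2
    if soap == "Y":
        return "Y"      # rule 3
    return "N"          # rule 4
```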
J48 Algorithm for class Herbs
For Herbs, J48 again chooses Vegetables as the most relevant attribute, but other attributes from the dataset (Jams, Eggs, Seafood, Prepared) appear in the rules. Rules can be formed from the following decision tree in the same way as in the previous case.
Decision Tree for Herbs class
J48 Algorithm for class SNAP
Decision Tree SNAP
J48 Algorithm for class WIC
Decision Tree for WIC

7. Clustering Algorithms
Clustering algorithms are applied to sets of similar data to interpret the data well. We created two attribute sets:
1. All goods
2. Nutrition programs
Each attribute has two distinct values, Yes/No (Y/N), so we used two clusters for both the EM and K-Means algorithms.
Basic clustering histogram for goods
Basic clustering histogram for Nutrition Program

7.1 EM Algorithm
These are the properties to set when applying the clustering algorithm, through which we can specify values so that the algorithm interprets the data well.
numClusters: the number of clusters to form. For EM we do not need to specify it; EM determines the number of clusters from the data. The value -1 means the algorithm will choose the number of clusters based on the dataset.
seed: provides the randomization used to choose the initial center values around which the algorithm forms clusters. Given the size and distributed nature of the data, we keep this value at 100.
From the above diagram, EM forms two clusters; the likely reason is the two distinct values (Y/N) in the dataset.
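The EM idea on Y/N data can be sketched from scratch as a two-cluster Bernoulli mixture: the E-step computes each cluster's responsibility for each row, and the M-step re-estimates the cluster weights and per-attribute probabilities. This toy version assumes 1/0 attribute flags and a fixed cluster count, unlike WEKA's EM, which can also estimate the number of clusters.

```python
import math
import random

def em_bernoulli(data, k=2, iters=50, seed=100):
    """EM for a mixture of k Bernoulli clusters over 1/0 attributes.
    Returns the most-likely cluster index for each row."""
    random.seed(seed)
    d = len(data[0])
    pi = [1.0 / k] * k                     # mixing weights
    mu = [[random.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    for _ in range(iters):
        # E-step: responsibility of each cluster for each row
        resp = []
        for x in data:
            like = [pi[c] * math.prod(
                        mu[c][j] if x[j] else 1 - mu[c][j] for j in range(d))
                    for c in range(k)]
            total = sum(like)
            resp.append([l / total for l in like])
        # M-step: re-estimate weights and per-attribute probabilities
        for c in range(k):
            nc = sum(r[c] for r in resp)
            pi[c] = nc / len(data)
            mu[c] = [min(0.99, max(0.01,
                     sum(r[c] * x[j] for r, x in zip(resp, data)) / nc))
                     for j in range(d)]
    return [max(range(k), key=lambda r=r: r[1]) if False else
            max(range(k), key=r.__getitem__) for r in resp]

# two obviously different market profiles over six Y/N attributes
group_a = [[1, 1, 1, 0, 0, 0]] * 5
group_b = [[0, 0, 0, 1, 1, 1]] * 5
labels = em_bernoulli(group_a + group_b)
```

On this invented data the two profiles end up in different clusters, mirroring how EM split the real dataset into two groups.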
EM Algorithm Applied for Nutrition Program

7.2 Simple K-Means
The second clustering algorithm we used is Simple K-Means. Its properties are:
numClusters: for K-Means we do have to specify the number of clusters to form. We input two here so the results can be compared with EM, which determined from the dataset that two clusters should be formed.
seed: to compare the EM result with the K-Means result, and for a better chance of forming good clusters, we set this value to 100.
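Simple K-Means itself is short enough to sketch: assign each row to its nearest centroid, then move each centroid to the mean of its cluster. The rows below are invented to reproduce a 70/30-style split; for determinism this sketch initializes centroids from the first k distinct rows rather than from a random seed.

```python
def kmeans(data, k=2, iters=20):
    """Simple K-Means on Y/N (1/0) attribute vectors; returns the
    cluster sizes as percentages of the dataset."""
    # initialize centroids with the first k distinct rows
    centroids = []
    for x in data:
        if x not in centroids:
            centroids.append([float(v) for v in x])
        if len(centroids) == k:
            break
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:  # assignment step: nearest centroid
            best = min(range(k), key=lambda c: sum(
                (xi - ci) ** 2 for xi, ci in zip(x, centroids[c])))
            clusters[best].append(x)
        for c in range(k):  # update step: move centroid to cluster mean
            if clusters[c]:
                centroids[c] = [sum(col) / len(clusters[c])
                                for col in zip(*clusters[c])]
    return [round(100 * len(cl) / len(data)) for cl in clusters]

rows = [[1, 1, 0]] * 7 + [[0, 0, 1]] * 3  # toy 1/0 rows
split = kmeans(rows)  # [70, 30]
```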
Simple K-Means Applied for Nutrition Program
Comparing the two clustering results for the nutrition programs, we get nearly the same outcome: ~70% of instances in one cluster and ~30% in the other. The following diagrams show the clustering algorithms applied to the goods data.
EM Algorithm Applied for Goods
1st cluster: 51% of instances; 2nd cluster: 49% of instances
Simple K-Means Applied for Goods
1st cluster: 57% of instances; 2nd cluster: 43% of instances
Here we do not get clustering as similar as in the nutrition program case. This might be an effect of the size and distributed nature of the goods dataset.

8. Summary of Learning Experience
- Learned a data mining tool, WEKA
- Gained a better understanding of classification algorithms such as J48 and logistic regression
- Learned clustering algorithms such as EM and Simple K-Means
- Learned real-world application of these algorithms and analysis of their results
- Experienced the advantages of teamwork
- Read many articles to get a clear idea of how to do data mining

9. References
Data source: http://catalog.data.gov/dataset/farmers-markets-geographic-data
WEKA tutorial: http://youtu.be/m7kpibgedki
RapidMiner tutorial: https://www.youtube.com/watch?v=eyyghzsvzpm&list=pllyinnlbo1evvz2WJLWfbp_JWgg5It1O6