HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM
Ms. Barkha Malay Joshi, M.E. Computer Science and Engineering, Parul Institute of Engineering & Technology, Waghodia, India
Prof. G. B. Jethava, Information Technology Dept., Parul Institute of Engineering & Technology, Waghodia
Prof. Hetal B. Bhavsar, Information Technology Dept., SVIT, Vasad

ABSTRACT: Feature selection is a process which selects a subset of attributes from the original dataset by removing irrelevant and redundant attributes. Clustering is a data mining technique which groups similar objects into one cluster and dissimilar objects into other clusters. Some clustering techniques do not support high dimensional datasets; applying feature selection as a preprocessing step for clustering makes it possible to handle such datasets. Feature selection greatly reduces computational time, because of the reduced feature subset, and also improves clustering quality. Feature selection methods are available for both supervised and unsupervised learning. This paper studies the working of feature selection as realised by different feature selection algorithms. The results show that feature selection through feature clustering removes more attributes than standard feature selection algorithms such as the Relief and Fisher filters.

KEYWORDS: Unsupervised feature selection, feature similarity measure, clustering based feature selection.

Introduction

Real world data may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. Keeping such attributes burdens the mining algorithm employed, can result in discovered patterns of poor quality, and slows down the mining process. Feature selection is an important and frequently used technique in data pre-processing for data mining [11]. Feature selection, or dimensionality reduction, is a pre-processing technique which selects the relevant features from the dataset and removes the redundant and irrelevant ones. The goal of feature selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes [2][3].

Clustering is a topic of interest in many research fields, including data mining, statistics, machine learning, pattern recognition and biology. Clustering is the technique that puts objects of high similarity into one cluster and objects of low similarity into different clusters, so that objects within the same cluster are more similar to one another than to objects in other clusters. Many clustering algorithms, such as k-means and k-medoids, do not cope well with high dimensionality and feature sparseness. It is therefore highly desirable to reduce the dimensionality of the feature space. Two main techniques deal with this problem: feature selection and feature extraction.

The available feature selection methods can be classified as supervised and unsupervised. Feature selection methods such as document frequency (DF) and term strength (TS) can easily be applied to clustering [3]. Supervised feature selection methods using information gain (IG) and the χ2 statistic (CHI) can be applied to text classification, where the class labels of the documents are available [1][3]. Sometimes, different data mining analyses (supervised and unsupervised learning) may need to be applied to the same data. A subset selected by a supervised feature selection method may not be a good one for unsupervised learning, and vice versa, so it is very desirable to have a feature selection method that works well for both. Clustering based feature selection does not need the class label information in the dataset and is suitable for both supervised and unsupervised learning. This paper reviews several methods of feature selection, and experimental results are reported for the Relief filter, the Fisher filter and the feature selection through feature clustering algorithm applied to cancer datasets.

The rest of this paper is organized as follows: Section I covers unsupervised feature selection and its different approaches, Section II covers feature similarity measures, Section III describes clustering based feature selection, Section IV gives the dataset characteristics, Section V presents the experimental results, Section VI gives the conclusion, Section VII outlines future work, and Section VIII lists the references.
I. Unsupervised Feature Selection

Unsupervised feature selection (UFS) has received attention in data mining and machine learning as clustering of high dimensional data sets becomes an essential and routine task. Unsupervised feature selection is becoming an essential pre-processing step because it not only reduces computational time greatly, due to the reduced feature subset, but also improves clustering quality, since no redundant features that could act as noise are involved in the unsupervised learning. The following approaches are used for unsupervised feature selection.

1. Wrapper approach

The wrapper approach uses the induction algorithm itself to estimate the value of a feature subset, and it generally provides better performance than the filter approach. Figure 2 shows the basic process of the wrapper approach [7][12]. The wrapper approach divides the task into three components: (1) feature search, (2) clustering algorithm, and (3) feature subset evaluation; a minimal sketch of how the three interact is given after the component descriptions below.

Figure 2: Wrapper approach.

(1) Feature search. Perform the feature search with a greedy strategy such as sequential forward selection or backward elimination. Forward search starts with zero features and adds one feature at a time, while backward elimination starts with all the features and removes the worst attribute remaining in the set. The procedure may employ a threshold on the evaluation measure to determine when to stop the attribute selection process.

(2) Clustering algorithm. Different clustering algorithms can be used to generate the clusters; k-means, k-medoids and EM are examples.

(3) Feature subset evaluation. Evaluate the candidate features against the generated clusters and remove the redundant and irrelevant attributes. This step selects the interesting features from the clusters and gives a measure of the quality of the candidate feature subset.
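As an illustration of how these three components fit together, the following is a minimal Python sketch, not the authors' implementation: a greedy forward search wrapped around k-means, with the silhouette score standing in for the feature subset evaluation criterion. All function and parameter names here are assumptions.

```python
# Hypothetical sketch of a wrapper-style unsupervised feature selection loop:
# greedy forward search (feature search), k-means (clustering algorithm),
# silhouette score (feature subset evaluation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def wrapper_forward_selection(X, n_clusters=2, max_features=10):
    """Greedily add the feature that most improves clustering quality."""
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    while len(selected) < max_features:
        candidate_best, candidate_score = None, best_score
        for f in range(n_features):
            if f in selected:
                continue
            cols = selected + [f]
            labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, cols])
            score = silhouette_score(X[:, cols], labels)
            if score > candidate_score:
                candidate_best, candidate_score = f, score
        if candidate_best is None:      # no remaining feature improves the criterion
            break
        selected.append(candidate_best)
        best_score = candidate_score
    return selected, best_score
```

The search stops either when the requested number of features is reached or when no remaining feature improves the evaluation measure, which plays the role of the stopping threshold mentioned above.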
2. Filter approach

The filter approach uses intrinsic properties of the data for feature selection. It is an unsupervised feature selection approach which performs feature selection without using an induction algorithm, as shown in Figure 3 [11]. The method relies on a transformation of the variable space, which requires the collation and computation of all the features before the dimension reduction can be achieved.

Figure 3: Filter approach.

The filter approach is computationally simple, fast and scalable. Several filter-based feature selection techniques are already available, such as information gain (IG), Relief-F (RF), the t-test (T), the chi-squared test (CS) and the Fisher filter.

II. Feature Similarity Measure

An unsupervised algorithm can use feature similarity for redundancy reduction while requiring no feature search. Feature similarity between two random variables is based on their linear dependency. Feature similarity approaches fall into two categories: non-parametric tests of the closeness of the probability distributions of the variables, and measures of functional dependency between the variables. The non-parametric closeness tests are sensitive to the location and dispersion of the distributions. Functional dependency captures the linear dependence between variables, which makes the redundancy easy to remove. For linearly dependent attributes, feature similarity can be calculated using three measures: a) the correlation coefficient, b) the least square regression error, and c) the maximum information compression index (MICI). A small sketch computing all three measures follows the definitions below.

a) Correlation coefficient

The correlation coefficient between two random variables x and y is defined as

    ρ(x, y) = cov(x, y) / sqrt(var(x) · var(y)),

where cov denotes covariance and var denotes variance. This measure is not always desirable, since the variance itself carries high information content, and the measure's sensitivity to rotation is also undesirable in many applications.

b) Least Square Regression Error

The least square regression error e(x, y) between two variables x and y is the mean square error of the best linear fit of y on x:

    e(x, y) = var(y) · (1 − ρ(x, y)²).

This measure is also sensitive to rotation of the scatter diagram in the x–y plane.

c) Maximum Information Compression Index

The MICI method is used for finding the similarity of attributes and does not require a feature search. The maximum information compression index λ2(x, y) is the smallest eigenvalue of the covariance matrix of x and y, i.e.

    2 λ2(x, y) = var(x) + var(y) − sqrt( (var(x) + var(y))² − 4 var(x) var(y) (1 − ρ(x, y)²) ).

It measures the minimum amount of information lost when the pair of variables is compressed into a single variable, so the pair with the smallest λ2 is the most redundant among a large set of attributes.
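The following numpy sketch is illustrative only; the function names are chosen for this paper's notation rather than taken from it, and it simply transcribes the three formulas above.

```python
# Illustrative numpy sketch of the three feature-similarity measures above.
# A lower MICI value means the two features are more redundant with each other.
import numpy as np

def correlation(x, y):
    """rho(x, y) = cov(x, y) / sqrt(var(x) * var(y))."""
    return np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

def least_square_error(x, y):
    """Mean square error of the best linear fit of y on x: var(y) * (1 - rho^2)."""
    return np.var(y, ddof=1) * (1.0 - correlation(x, y) ** 2)

def mici(x, y):
    """Smallest eigenvalue of the 2x2 covariance matrix of x and y."""
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    rho = correlation(x, y)
    disc = np.sqrt((vx + vy) ** 2 - 4.0 * vx * vy * (1.0 - rho ** 2))
    return 0.5 * (vx + vy - disc)

# Example: two nearly collinear features give a small MICI (highly redundant).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)
print(correlation(x, y), least_square_error(x, y), mici(x, y))
```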
III. Clustering Based Feature Selection

The absence of class label information makes unsupervised feature selection challenging, since it is hard to distinguish relevant features from irrelevant ones. Moreover, some real applications may require several data mining functions to be applied to the same data, so it is desirable to have a feature selection method that can serve both supervised and unsupervised learning. The clustering based feature selection algorithm does not depend on class label information in the dataset; it is therefore an unsupervised feature selection method, yet it works for both supervised and unsupervised learning. It does not require any feature search and removes the redundant and irrelevant data from the dataset.

Feature clustering

Feature clustering is done in two steps:
i. The features are partitioned into different clusters based on feature similarity.
ii. A representative feature is selected from each cluster.

The partitioning of the features is based on the k-NN principle using one of the similarity measures above. Let D be the number of original features and O = {F1, ..., FD} the original feature set, and let R denote the reduced feature subset. Let S(Fi, Fj) denote the dissimilarity between features Fi and Fj, and let r_k(Fi) denote the dissimilarity between feature Fi and its kth nearest-neighbour feature in R. The partitioning procedure is then [9]:

1) Choose k ≤ D − 1 and initialise the reduced feature subset R to the original feature set O.
2) For each feature Fi in R, compute r_k(Fi).
3) Find the feature Fi' for which r_k(Fi') is minimum. Retain Fi' in R and discard the k nearest features of Fi'.
4) If k > cardinality(R) − 1, set k = cardinality(R) − 1.
5) If k = 1, return the feature set R as the reduced set.
6) Otherwise set k = k − 1.
7) If k = 1, return the feature set R as the reduced set.
8) Go to step 2.
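The sketch below is an illustrative rendering of this k-NN based partitioning, not the authors' code; the dissimilarity is passed in as a function (for example the mici function sketched earlier), and the helper names are assumptions.

```python
# Illustrative sketch of the k-NN based feature partitioning of [9]; `dissimilarity`
# is any of the Section II measures applied to two columns of the data matrix X.
import numpy as np

def knn_feature_partition(X, k, dissimilarity):
    """Keep one representative feature per neighbourhood, discard its k nearest
    (most similar) neighbours, shrink k, and repeat until k reaches 1."""
    R = list(range(X.shape[1]))                      # start from the full feature set O
    k = min(k, len(R) - 1)
    while k > 1 and len(R) > 1:
        def r_k(i):                                  # dissimilarity of feature i to its kth nearest neighbour in R
            return sorted(dissimilarity(X[:, i], X[:, j]) for j in R if j != i)[k - 1]
        best = min(R, key=r_k)                       # step 3: the most compressible neighbourhood
        neighbours = sorted((j for j in R if j != best),
                            key=lambda j: dissimilarity(X[:, best], X[:, j]))[:k]
        R = [j for j in R if j not in neighbours]    # retain `best`, drop its k nearest features
        k = min(k - 1, len(R) - 1)                   # steps 4 and 6: adjust and decrement k
    return R

# Example usage with the MICI sketch from Section II:
# reduced = knn_feature_partition(X, k=20, dissimilarity=mici)
```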
Feature selection through feature clustering (FSFC) proceeds in the following steps:
1. Select the dataset, which contains a large number of features.
2. Calculate C(Si, Sj), the MICI between each pair of clusters Si and Sj, using the maximum information compression index defined in Section II (with var the variance and cov the covariance of the two variables x and y).
3. Select the minimum C(Si, Sj) and merge Si and Sj. The process continues until all feature information is merged into a single cluster.
4. Select the top k clusters in the hierarchical cluster tree.

MICI, the maximal information compression index, is the dissimilarity used for this feature selection. The algorithm is generic in nature and has the capability of multi-scale representation of the data. The merging process stops once all the clusters have been merged into one cluster; the complexity of the algorithm is a function of the dimension D. This feature selection approach, based on hierarchical clustering, generates the selected feature attributes without much computational overhead. Feature selection through feature clustering (FSFC) requires no feature search, and feature selection occurs only once, so the running time is reduced and accurate clusters are generated with little overhead. The algorithm is fast compared to traditional unsupervised feature selection algorithms. An illustrative sketch of the hierarchical merging is given below.
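As an illustration only, the following sketch performs the hierarchical merging on a precomputed pairwise MICI matrix, cuts the tree into the top k clusters, and keeps one representative feature per cluster. The average-linkage rule and the medoid-like choice of representative are assumptions made for this sketch, not details taken from the paper.

```python
# Illustrative sketch of FSFC-style hierarchical feature clustering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def fsfc_select(dist, k):
    """dist: symmetric (D x D) matrix of pairwise MICI values between features."""
    Z = linkage(squareform(dist, checks=False), method="average")   # merge smallest C(Si, Sj) first
    labels = fcluster(Z, t=k, criterion="maxclust")                  # top k clusters of the tree
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        # representative: the member closest on average to the rest of its cluster
        rep = members[np.argmin(dist[np.ix_(members, members)].mean(axis=1))]
        selected.append(int(rep))
    return sorted(selected)

# Building `dist` with the `mici` sketch from Section II:
# dist = np.array([[0.0 if i == j else mici(X[:, i], X[:, j])
#                   for j in range(X.shape[1])] for i in range(X.shape[1])])
```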
IV. Dataset Characteristics

We used four publicly available cancer datasets.

1. Colon: Alon et al.'s colon cancer dataset, which contains information on 62 samples over 2000 genes. The samples belong to tumour and normal colon tissues.
2. Leukaemia: the total number of genes to be tested is 7129 and the number of samples is 72; all are acute leukaemia patients, with either acute lymphoblastic leukaemia (ALL) or acute myelogenous leukaemia (AML).
3. Lymphoma: distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.
4. Breast cancer: the training data contains 78 patient samples.

Table 1: Dataset characteristics

Dataset         No. of Instances   No. of Original Features   Classes
Colon           62                 2000                       2
Leukaemia       72                 7129                       2
Lymphoma        —                  4027                       —
Breast Cancer   78                 —                          —

V. Experimental Results

Simulation on the cancer datasets produces results for the different feature selection methods. Many standard feature selection algorithms are available in the simulation tool. In this paper we simulated the Relief and Fisher filter algorithms on the cancer datasets, generated the reduced feature sets, and compared the results with the feature selection through feature clustering (FSFC) algorithm.

Relief is among the most successful algorithms for reducing features, but it does not consider the redundancy among the input attributes. The preconditions for the Relief filter are that at least two attributes must be available and that the target attribute is discrete. The Relief filter uses a ranking search method to reduce the original attribute set.

The Fisher filter is another standard algorithm based on the filter approach, in which the selection process is independent of the learning algorithm. The preconditions for this algorithm are at least one discrete attribute and one or more continuous attributes. Its feature search is also based on ranking the attributes; a small sketch of a Fisher-score style ranking is given at the end of this section.

The Relief and Fisher filter feature selection methods were applied to the cancer datasets to generate reduced feature sets. Table 2 shows the results for the Relief filter, the Fisher filter and the feature selection through feature clustering (FSFC) algorithm. The results show that the FSFC algorithm removes more attributes than the standard algorithms such as the Relief and Fisher filters.

Table 2: Simulation of feature selection (number of reduced attributes per algorithm)

Dataset         Original Attributes   Relief   Fisher Filter   FSFC
Colon           2000                  —        53              11
Leukaemia       7130                  —        286             7
Lymphoma        4027                  —        180             15
Breast Cancer   —                     —        86              25

Figure 4: Feature selection algorithms on the cancer datasets.
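For completeness, the sketch below computes a generic Fisher score for each feature and keeps the top-ranked ones. It is offered as an illustration of a ranking-based filter; it is not necessarily the exact Fisher filter implemented in the simulation tool used for Table 2, and the function names are assumptions.

```python
# Generic Fisher-score ranking filter (illustrative). Higher scores indicate
# better separation of the class means relative to the within-class variance.
import numpy as np

def fisher_scores(X, y):
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    numerator = np.zeros(X.shape[1])
    denominator = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        numerator += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        denominator += len(Xc) * Xc.var(axis=0)
    return numerator / (denominator + 1e-12)    # small constant avoids division by zero

def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring features."""
    return np.argsort(fisher_scores(X, y))[::-1][:k]
```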
VI. Conclusion

Feature selection is a pre-processing step which reduces the feature subset and thus supports high dimensional datasets. In this paper I have discussed several feature selection algorithms along with the feature selection through feature clustering algorithm. The available algorithms, the Relief filter and the Fisher filter, and the feature selection through feature clustering algorithm were applied to cancer data, and the results show that feature selection through feature clustering removes more attributes than the Relief filter and the Fisher filter.

VII. Future Work

The Relief and Fisher filters leave a larger number of attributes and do not remove redundant data. The clustering based feature selection algorithm removes the redundancy among attributes, provides the reduced set of required attributes from the original attribute set, and supports high dimensional datasets. In future work I plan to apply the clustering based feature selection algorithm together with clustering methods that do not themselves support high dimensional datasets, and to show that it generates more accurate clusters with minimum error.

VIII. References

[1] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proc. of Int'l Conf. on Machine Learning.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed.
[3] T. Liu, S. Liu, Z. Chen, and W. Ma, "An Evaluation on Feature Selection for Text Clustering," Proc. of Int'l Conf. on Machine Learning.
[4] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1.
[5] M. Dash and H. Liu, "Feature Selection for Classification," Intelligent Data Analysis, vol. 1, no. 3.
[6] Yanjun Li, Congnan Luo, and Soon M. Chung, "Text Clustering with Feature Selection by Using Statistical Data," IEEE Transactions on Knowledge and Data Engineering.
[7] Jennifer G. Dy and Carla E. Brodley, "Feature Selection for Unsupervised Learning," Journal of Machine Learning Research 5 (2004).
[8] Guangrong Li, Xiaohua Hu, Xiajiong Shen, Xin Chen, and Zhoujun Li, "A Novel Unsupervised Feature Selection Method for Bioinformatics Data Sets through Feature Clustering," Proc. IEEE International Conference on Granular Computing (GrC).
[9] P. Mitra, C. A. Murthy, and Sankar K. Pal, "Unsupervised Feature Selection Using Feature Similarity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24.
[10] Y. Yang, "Noise Reduction in a Statistical Approach to Text Categorization," Proc. of Annual ACM SIGIR Conf. on Research and Development in Information Retrieval.
[11] Asha Gowda Karegowda, M. A. Jayaram, and A. S. Manjunath, "Feature Subset Selection Problem using Wrapper Approach in Supervised Learning," International Journal of Computer Applications, vol. 1, no. 7, 2010.
[12] Barkha H. Desai, Hetal B. Bhavsar, and Nisha V. Shah, "Supervised and Unsupervised Feature Selection based Clustering Algorithms," NCTM-2012, Visnagar.