E-commerce Product Classification: Our Participation at Cdiscount 2015 Challenge
Ioannis Partalas, Viseo R&D, France
Georgios Balikas, University of Grenoble Alpes, France

Abstract

This report describes our participation in the cdiscount 2015 challenge, where the goal was to classify product items into a predefined taxonomy of products. Our best submission yielded an accuracy score of 64.20% on the private part of the leaderboard, ranking us 10th out of 175 participating teams. We followed a text classification approach employing mainly linear models. The final solution was a weighted voting system which combined a variety of trained models.

1 Introduction

In this report we present our participation in the cdiscount 2015 product classification challenge, which was organised on the datascience.net platform. The organisers provided a large collection of product items, each consisting mainly of a textual description. We followed a text classification approach using mainly linear models (such as SVMs) as our base models. Our final solution consisted of a weighted voting system which combined a variety of base models.

In Section 2 we provide a brief description of the cdiscount task and data, and Section 3 describes our approach. Section 4 presents the results we obtained, and Section 5 concludes with a discussion and the lessons learnt.

2 e-commerce Product Classification

Item categorization is fundamental to many aspects of an item's life cycle for e-commerce sites, such as search, recommendation and catalog building. It can be formulated as a supervised classification problem where the categories are the target classes and the features are the words composing some textual description of the items. In the context of the cdiscount 2015 challenge [1] the organisers provided the descriptions of e-commerce items, and the goal was to develop a system that would perform automatic classification of those products. Table 1 presents some training instances. There were 15,786,886 product descriptions in the training set and 35,066 instances to be classified in the test set. The target categories (classes) were organised in a hierarchy of 3 levels: the top level had 52 nodes, the middle level 536 and the lowest level 5,789. The goal of the challenge was to predict the lowest-level category for each test instance.

It is worth noting that most of the classes were represented in the training set with only a few examples, while only a few classes had many examples. For instance, 40% of the training instances belong to the 10 most common classes, and around 1,500 classes contain from 1 to 30 product items.

3 Our approach

We followed a text classification approach working with the textual information provided in the training data. In this context, x ∈ R^d represents a document in a vector space and y ∈ Y = {1, ..., K} its associated class label, where |Y| > 2. The large number of classes, together with the scarcity of data for the minority classes, is a typical situation in large-scale systems; related challenges, where in contrast the available text per instance is much larger, include LSHTC (Partalas et al., 2015) and BioASQ (Balikas et al., 2014).

[1] https://www.datascience.net/fr/challenge/20/details#tab_brief
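To make the imbalance figures above concrete, here is a small sketch (ours, not part of the challenge material; the file name and the assumption of a CSV dump are hypothetical, with Categorie3 being the target field shown in Table 1) that computes such statistics with pandas:

```python
import pandas as pd

# Hypothetical file name and layout; Categorie3 is the class to be predicted.
train = pd.read_csv("cdiscount_train.csv", usecols=["Categorie3"])
counts = train["Categorie3"].value_counts()

print(f"{len(counts)} distinct classes")
print(f"top-10 classes cover {counts.head(10).sum() / counts.sum():.1%} of instances")
print(f"{counts.between(1, 30).sum()} classes have between 1 and 30 items")
```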
Table 1: Part of the training data. We only present the fields Description, Libelle and Marque that we used in our implementations. The Categorie3 value was the class to be predicted.

Description                                          | Libelle                  | Marque
De Collectif aux éditions SOLESMES                   | Benedictions de l eglise | …
De Collectif aux éditions SOLESMES                   | Notice de st benoit      | …
lot de or 750, poids : 3.45gr, diamants : 0.26carats | Bague or et diamants     | AUCUNE
Champagne Brut - Champagne-Vendu à l unité-1 x 75cl  | Mumm Brut                | AUCUNE

3.1 Setup

Data Pre-processing. Since we used only the textual information that the organisers provided for each training/test instance, our first task was to clean the data. The cleaning pipeline included: removal of non-ASCII characters, removal of non-printable characters, removal of HTML tags, accent removal, punctuation removal and lower-casing. We also split words consisting of a textual part and a numerical part into two distinct parts; for instance, 12cm would become 12 and cm. We did not perform stemming, lemmatization or stop-word removal, because the text spans were short and such operations would result in loss of information (Shen et al., 2012). Finally, we tokenized the remaining text into words using white space as the delimiter.

Vectorization. Having cleaned the text, we generated vectors using a one-hot-encoding approach. We experimented with binary one-hot-encoded vectors (i.e., whether a word occurs or not), with term frequency vectors (how many times each word occurs in the instance) and with the tf-idf (term frequency, inverse document frequency) weighting scheme. In the early stages of the challenge we found that the latter performed best in our experiments, and we used it exclusively. To calculate the tf-idf vectors we smoothed the idf and applied the sublinear tf scaling 1 + log(tf). Finally, each vector was normalized to a unit vector.

Classification of short documents can benefit from successful feature selection of n-grams. The size of the vocabulary of our cleaned dataset, combined with the large number of training instances, made it prohibitive to use all the n-grams with n = 1, 2, 3. On the other hand, selecting too many features may lead to over-fitting. As a result, selecting a representative subset of those n-grams required careful tuning.

Apart from the Description and Libelle fields, which we concatenated, we also used the Marque field. We examined two ways of integrating this information in our pipeline: concatenating its value with the existing text of Description and Libelle, and generating binary flags for each value of the field seen in the training set. Either way benefited our models compared to feature generation using only Description and Libelle.

After generating the vectors we applied the α-power normalization, where each vector x = (x_1, x_2, ..., x_d) is transformed to x_power = (x_1^α, x_2^α, ..., x_d^α). This normalisation has been used in computer vision tasks (Jegou et al., 2012). The main intuition is that it reduces the effect of the most common words. After the transformation the vector is again normalized to the unit vector. We found that taking the square root (α = 0.5) of the feature values consistently benefited the performance.
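As an illustration, here is a condensed sketch of this setup (our reconstruction, not the exact challenge code; the cleaning rules are simplified and the feature cap is an arbitrary example). scikit-learn's TfidfVectorizer directly supports the smoothed idf and sublinear tf described above, and the α-power step can be applied to the sparse matrix afterwards:

```python
import re
import unicodedata

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def clean(text):
    # Strip accents and non-ASCII characters, drop HTML tags and punctuation,
    # split number/letter compounds (12cm -> 12 cm), lower-case.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"(\d+)([a-zA-Z]+)", r"\1 \2", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return text.lower()

# Smoothed idf, sublinear tf scaling 1 + log(tf), L2-normalized vectors.
vectorizer = TfidfVectorizer(preprocessor=clean, ngram_range=(1, 2),
                             max_features=250_000,
                             smooth_idf=True, sublinear_tf=True, norm="l2")

def alpha_power(X, alpha=0.5):
    # Element-wise x -> x^alpha on the sparse tf-idf matrix, then re-normalize.
    X = X.copy()
    X.data = np.power(X.data, alpha)
    return normalize(X, norm="l2")

docs = ["Bague or 750 et diamants 0.26carats", "Champagne Brut 75cl"]
X = alpha_power(vectorizer.fit_transform(docs))
```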
Subsampling. As the training dataset was highly imbalanced, the learned models would be biased towards the large classes. For this reason, and also in order to speed up the vectorization process as well as the training of the systems, we randomly sampled the data by downsampling the majority classes. This procedure improved the performance of all the single systems (around +2.5% for our best single system) and also reduced the training time of the base models. The sizes of the vocabularies for the several sub-samples ranged from around 1 million unigrams, for around half of the data, to 1.6 million unigrams for the full dataset.
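A minimal sketch of this downsampling step (the per-class cap of 5,000 items and the column name are illustrative assumptions, not the values we used):

```python
import pandas as pd

def downsample(df: pd.DataFrame, label_col: str = "Categorie3",
               cap: int = 5_000, seed: int = 0) -> pd.DataFrame:
    # Keep at most `cap` random items per class; minority classes stay intact.
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=min(len(g), cap), random_state=seed)))
```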
Tuning and Validation Strategy. Tuning the hyper-parameters of our models was an important aspect of our approach. Performing k-fold cross-validation on 15 million training instances with thousands of features (corresponding to the vocabulary size of the training set) was prohibitive given our CPU resources. We overcame this problem by performing hyper-parameter tuning locally, on a subsample of the training set consisting of the instances of 1,500 randomly selected classes. This approach allowed us to accelerate the calculations. Note that we also validated the applicability of this tuning strategy using the public part of the leaderboard: the decisions that improved the accuracy of our models locally had the same effect on our scores on the public leaderboard. A trick that we found useful was to use our current best submission on the public board as a gold standard against which to validate newly trained models; this helped us avoid unnecessary submissions.

3.2 Training and Prediction

We relied on linear models as our base learners due to their efficiency on high-dimensional tasks like text classification. Support Vector Machines (SVMs) are well known for achieving state-of-the-art performance in such tasks. For learning the base models we used the Liblinear library, which supports training linear models on high-dimensional datasets (Fan et al., 2008). We tried two main strategies: a) flat classifiers, which ignore the hierarchical structure among the classes, and b) hierarchical top-down classification. For flat classification we followed the One-Versus-All (OVA) approach, whose complexity is O(K) in the number of classes. For top-down classification we trained a multi-class classifier for each parent node in the hierarchy of products; during prediction we started from the root, selected the best class according to the current node's multi-class classifier, and proceeded iteratively until reaching a leaf node. Note that top-down hierarchical classification has logarithmic complexity in the number of classes, which accelerates the training and prediction processes significantly.

Trying to explore different learning methods that would also help to diversify the ensemble, we experimented with k-Nearest Neighbors classifiers, Rocchio classifiers (also known as Nearest-Centroid classifiers) and online stochastic gradient descent methods. Although widely studied, these methods did not give satisfactory results: for instance, our 3-Nearest Neighbors runs using 200 thousand unigram features with the tf-idf representation achieved an accuracy score of 41.62% on the public leaderboard. We also experimented with text embeddings using the word2vec tool (Mikolov et al., 2013); generating text representations in a low-dimensional space (200 dimensions) and predicting categories with 3-NN improved over the tf-idf representation, but was still far from competitive. For the above reasons we only report results for SVMs in the rest of the report. In the majority of the models we used L2-regularized L2-loss SVMs in the dual, and we set C = 1.0 with a bias term (Liblinear option -B 1).
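The following sketch illustrates the two strategies (scikit-learn's LinearSVC wraps Liblinear; the toy data and the top-down helper are our own reconstruction, not the code used in the challenge):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-ins; in the challenge these would be tf-idf vectors and leaf labels.
X_train = np.random.rand(40, 5)
y_leaf = np.random.randint(0, 4, size=40)

# Flat one-versus-all: a single multi-class linear SVM over all leaf classes
# (L2-regularized squared-hinge loss in the dual, C=1.0, with a bias term).
flat = LinearSVC(loss="squared_hinge", dual=True, C=1.0, fit_intercept=True)
flat.fit(X_train, y_leaf)

def predict_top_down(x, node_clf, root):
    """Top-down prediction: node_clf maps each internal node to a fitted
    multi-class classifier over its children; walk from the root until the
    predicted node has no classifier of its own, i.e. it is a leaf."""
    node = root
    while node in node_clf:
        node = node_clf[node].predict(x.reshape(1, -1))[0]
    return node
```

With a roughly balanced taxonomy, each top-down prediction evaluates O(log K) node classifiers instead of one pass over all K one-versus-all models.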
3.3 Ensembling

Our final solutions were based on averaging the models of an ensemble which contained 103 models. We experimented with simple voting as well as with weighted voting schemes. Simple voting always improved performance when it was computed over a fraction of the whole ensemble containing only the best-performing models. A simple approach was to order the models according to their performance (in terms of accuracy) with respect to the current best model on the leaderboard; a simple majority vote was then applied over around 20%-30% of the ordered classifiers. This procedure creates a homogeneous sub-ensemble that is likely to reduce the variance. We also employed a weighted voting scheme that weighted a few of our best single models more heavily. Our final best submission (64.20% on the private board) was a weighted voting ensemble giving larger weights to the two best models. Weighted voting consistently improved accuracy by about 1.2%-1.8%.
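A minimal sketch of the weighted vote (the predictions and weights below are placeholders; in our submissions the two best single models simply received the largest weights):

```python
from collections import Counter

def weighted_vote(predictions, weights):
    """predictions: one list of labels per model; weights: one float per model."""
    out = []
    for i in range(len(predictions[0])):
        scores = Counter()
        for preds, w in zip(predictions, weights):
            scores[preds[i]] += w          # each model adds its weight to its label
        out.append(scores.most_common(1)[0][0])
    return out

# Example: three base models, the first two weighted more heavily.
preds = [["a", "b", "b"], ["a", "b", "c"], ["c", "c", "c"]]
print(weighted_vote(preds, weights=[2.0, 2.0, 1.0]))  # ['a', 'b', 'c']
```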
3.4 Software and Hardware

During the challenge we used scikit-learn (Pedregosa et al., 2011) as well as our own scripts to pre-process the raw data. For training the models we experimented with Liblinear, scikit-learn and Vowpal Wabbit. We had full access to a machine with 4 cores at 2.4 GHz and 16 GB of RAM, and limited access to a shared machine with 24 cores at 3.3 GHz and 128 GB of RAM. On the first machine we ran the experiments with up to 270 thousand features, and on the second the experiments with more features that required more memory.

4 Results

We provide in Table 2 a subset of our submissions along with the public and private scores we obtained. Comparing the pairs of submissions (1),(2) and (9),(10), it is clear that the α-power transformation improves accuracy. From submissions (1), (3) and (4) one can see that increasing the number of unigram features increases performance; however, at around 300 thousand features the improvements become negligible. The pairs of submissions (3),(5) and (2),(6) demonstrate the advantage of adding bigrams to the unigram feature set. Adding trigrams on top of unigrams and bigrams improves performance further, as indicated by comparing submissions (8), (9) and (10) with the rest of the table. Note that in each of the above cases the features were selected according to their frequency: for instance, in submission (1) we selected the 200K most common unigrams, and in submission (9) the most common features among all unigrams, bigrams and trigrams.

We also tested several hierarchical models using 200,000 unigrams for each parent node in the hierarchy. The best such submission, with the first level of the hierarchy pruned, achieved 60.61% on the private board, while the fully hierarchical model got only 58.88%. Note that in the case of the pruned hierarchy we remove a step during prediction and thus reduce the propagated errors. Several other hierarchical models were used to increase the variability of the ensemble.

Table 3 presents the best single system, which was trained on approximately half of the data with the Marque field concatenated directly into the description. We also present the two best weighted voting systems, each using a total of 103 base models. Additionally, we report the coverage of each system, that is, the number of distinct classes appearing in its predictions.

5 Conclusion and Discussion

Participating in the cdiscount 2015 challenge was a pleasant and interesting experience with several lessons learnt. We briefly discuss the most important ones below.

Data Preparation. Although short-text classification for e-commerce products is not a new task, the cdiscount problem had two particularities: the huge number of training instances and the large vocabulary, even after careful cleaning of the dataset. We found that feature selection and feature engineering, performed by studying the statistics of the dataset and the predictions of the base models and by trying to identify patterns for the classes, played an important role. As an example we highlight the Marque field: we found in the late stages of the challenge that our models could benefit significantly from this information.

Learning tools. From the early stages of the challenge we decided to use SVMs, which are known to perform well in such problems. It is the case, however, that the more tools one can use efficiently the better, since no single strategy works in every setting. Unfortunately, we did not perform an exhaustive hyper-parameter tuning of the Vowpal Wabbit online learning system, which had the potential to provide a high-performing system.

Ensemble methods. Base models such as SVMs can obtain satisfactory performance in classification tasks. In the framework of a challenge, however, ensemble methods have the potential to decide the winner. Ensemble methods (stacking, feature-weighted linear stacking, etc.)
with a pool of strong and diversified models can improve performance by several points of accuracy. Here we relied on weighted voting; we believe that more sophisticated techniques would have helped us further.

Acknowledgements

We would like to thank the AMA team of the University of Grenoble Alpes for providing the machines on which we ran our algorithms.
Table 2: A subset of our base model submissions with their scores on the public and private leaderboards.

#  | Description                                                | Public | Private
1  | 200K unigrams                                              | …      | …
2  | …K unigrams, α=…                                           | …      | …
3  | …K unigrams                                                | …      | …
4  | …K unigrams                                                | …      | …
5  | …K unigrams, 250K bigrams                                  | …      | …
6  | …K unigrams, 400K bigrams, α=…                             | …      | …
7  | …K unigrams, 400K bigrams, α=0.5, Marque as binary feature | …      | …
8  | 1.2M unigrams, bigrams, trigrams, α=…                      | …      | …
9  | …M unigrams, bigrams, trigrams                             | …      | …
10 | …M unigrams, bigrams, trigrams, α=…                        | …      | …

Table 3: Our best submissions with their scores on the public and private leaderboards.

# | Description                     | Public | Private | Coverage
1 | 270K unigrams, α=0.5, half data | …      | …       | …,208
2 | Weighted voting                 | …      | …       | …,128
3 | Weighted voting                 | …      | …       | …,116

References

[Balikas et al., 2014] George Balikas, Ioannis Partalas, Axel-Cyrille Ngonga Ngomo, Anastasia Krithara, Eric Gaussier, and George Paliouras. Results of the BioASQ track of the Question Answering Lab at CLEF 2014.

[Fan et al., 2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

[Jegou et al., 2012] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704-1716, September 2012.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[Partalas et al., 2015] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Galinari. LSHTC: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581.

[Pedregosa et al., 2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

[Shen et al., 2012] Dan Shen, Jean-David Ruvini, and Badrul Sarwar. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM.
