# Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007


## Transcription


2 Outline

- Naïve Bayes: Components, ML vs. MAP, Benefits
- Feature Preparation: Filtering, Decay
- Extended Examples: Spell Checking, Spam Filtering
- Ensemble Learning

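3 Bayes Rule

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$

Three parts: the likelihood $P(d \mid c)$, the class prior $P(c)$, and the denominator $P(d)$. What can be said about the third part for classification tasks?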

4 Bayes Rule

Three parts. What can be said about the third part for classification tasks?

- Unnecessary if we only care about the classification, not the probability estimate.
- May result in division by zero in domains where previously unseen features arise. Can you think of such a domain?

So the denominator is either ignored entirely or treated as a constant in tasks where we need the estimates.

5 Bayes Rule

What about the class priors P(C)? How do they affect the probability estimates? Do we need the class priors?

6 Bayes Rule: ML and MAP

- ML (Maximum Likelihood) selects the class that maximizes $P(d \mid c)$.
  - Class priors are uniform, or ignored.
- MAP (Maximum a Posteriori) selects the class that maximizes $P(d \mid c)\,P(c)$.
- Both are embodiments of Ockham's razor.
- ML may be problematic when the data set is small.
- MAP may be less appropriate when the class priors are suspect.
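To make the difference concrete, here is a minimal sketch in which the two rules disagree. The likelihood and prior values are hypothetical numbers chosen for illustration, and the function names `ml_class` and `map_class` are not from the lecture.

```python
def ml_class(likelihoods):
    """Maximum likelihood: pick the class with the largest P(d | c)."""
    return max(likelihoods, key=likelihoods.get)


def map_class(likelihoods, priors):
    """Maximum a posteriori: pick the class with the largest P(d | c) * P(c)."""
    return max(likelihoods, key=lambda c: likelihoods[c] * priors[c])


# Hypothetical values for a single document d.
likelihoods = {"spam": 0.30, "ham": 0.20}  # P(d | c)
priors = {"spam": 0.10, "ham": 0.90}       # P(c)

ml_choice = ml_class(likelihoods)
map_choice = map_class(likelihoods, priors)
```

With these numbers ML picks `spam` because its likelihood is higher, while MAP picks `ham` because the prior outweighs the likelihood gap.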

7 Bayes Rule

Finally, if we assume conditional independence of the features $f_1, \dots, f_n$ of $d$ given the class,

$$P(d \mid c) = \prod_{i=1}^{n} P(f_i \mid c)$$

Is this assumption reasonable?

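8 Bayes Rule: Naïve Bayes

And, finally, we arrive at Naïve Bayes. MAP Naïve Bayes, where $f_1, \dots, f_n$ are the features of the example:

$$c_{MAP} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)$$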

9 Smoothing
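The slide gives only the topic, so the following is a sketch of one common choice, additive (add-one, or Laplace) smoothing. Treat it as an assumption about what was meant. The function name, the `alpha` parameter, and the toy vocabulary are all hypothetical.

```python
from collections import Counter


def smoothed_likelihoods(class_tokens, vocabulary, alpha=1.0):
    """Estimate P(f | c) for one class with additive smoothing.

    Every feature in the vocabulary gets a pseudo-count of `alpha`, so a
    feature never seen in this class still has a non-zero probability.
    """
    counts = Counter(class_tokens)
    total = sum(counts[f] for f in vocabulary) + alpha * len(vocabulary)
    return {f: (counts[f] + alpha) / total for f in vocabulary}


vocab = ["free", "pizza", "lisp"]
probs = smoothed_likelihoods(["free", "free", "pizza"], vocab)
```

Here `lisp` never occurs in the class, yet its estimate is 1/6 rather than zero, so a single unseen token cannot zero out the product of feature probabilities.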

10 Time/Space Complexities

- Training: O(examples*features)
- Decision Tree: O(examples*features^2)
- What about space?

11 Feature Preparation

- Filtering
  - TFIDF (Lift)
  - Mutual Information
- Time Decay
- Tokenization

12 Feature Filtering

Why?

- Efficiency
  - Text classification often involves a huge number of features
  - Remove features while maintaining accuracy
  - Features which are independent of the class provide no information
- Accuracy
  - Helps prevent over-fitting

13 Feature Filtering: Lift

The lift of a feature value is the ratio of the confidence of the feature value to the expected confidence of the feature value.

- Local (individual example) confidence vs. global (all examples) confidence

How do we use lift?

- Order features by lift
- Keep the top X features, or the features above a certain threshold

14 Feature Filtering: TFIDF

TFIDF is one lift measure which is useful in text classification tasks.

TFIDF (Term Frequency, Inverse Document Frequency)

- Intuitively, it's how important a word is to a document in a collection
- Has its own Wikipedia page
- TFIDF, or TF/DF, is the term frequency of a word in the document divided by its document frequency in the collection: $\text{lift} = tf / df$

Examples (word = tf/df = lift):

| Word | tf | df | lift |
|---|---|---|---|
| some | .005 | .8 | .006 |
| a | .01 | 1 | .01 |
| football | .01 | .05 | .2 |
| Packers | .01 | .01 | 1 |
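The tf/df examples above can be turned into a filter with a few lines of code. This is a minimal sketch: the function names and the `stats` dictionary layout are hypothetical, and the numbers are the ones from the slide.

```python
def lift(tf, df):
    """Lift of a term: term frequency divided by document frequency."""
    return tf / df


def top_features(stats, k):
    """Return the k terms with the highest lift.

    `stats` maps each term to a (tf, df) pair.
    """
    ranked = sorted(stats, key=lambda term: lift(*stats[term]), reverse=True)
    return ranked[:k]


stats = {
    "some": (0.005, 0.8),
    "a": (0.01, 1.0),
    "football": (0.01, 0.05),
    "Packers": (0.01, 0.01),
}
kept = top_features(stats, 2)
```

Keeping the top two terms retains `Packers` and `football` and drops the common words `a` and `some`. A threshold on the lift value would work the same way.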

15 Feature Filtering: TFIDF

TFIDF filtering benefits:

- Accuracy (2.2% increase, Yahoo!)
- Speed and memory (fewer features)

16 Feature Filtering: Mutual Information Another manner of filtering is to measure how well a feature discriminates between classes.
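One way to make "how well a feature discriminates between classes" concrete is the mutual information between a binary feature and the class label, computed from a contingency table of counts. The sketch below assumes that standard definition. The function name and input layout are hypothetical.

```python
import math


def mutual_information(counts):
    """Mutual information (in bits) between a feature and the class.

    `counts` maps (feature_present, class_label) to a number of examples.
    """
    total = sum(counts.values())
    p_f, p_c = {}, {}
    for (f, c), n in counts.items():
        p_f[f] = p_f.get(f, 0.0) + n / total
        p_c[c] = p_c.get(c, 0.0) + n / total

    mi = 0.0
    for (f, c), n in counts.items():
        if n == 0:
            continue  # a zero cell contributes nothing to the sum
        p_fc = n / total
        mi += p_fc * math.log2(p_fc / (p_f[f] * p_c[c]))
    return mi


# A feature that appears in every spam and in no non-spam message.
perfect = mutual_information({(True, "spam"): 50, (False, "ham"): 50})

# A feature that appears equally often in both classes.
useless = mutual_information({
    (True, "spam"): 25, (True, "ham"): 25,
    (False, "spam"): 25, (False, "ham"): 25,
})
```

The perfectly discriminating feature scores 1 bit and the class-independent feature scores 0, so ranking features by this value and keeping the top ones filters out the uninformative ones.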

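17 Feature Decay

We already saw how we can weight the individual feature values with lift. We can also weight an example as a whole. Often we want to reduce the contribution of an example after it gets old.

decay = reduce the contribution of an example to the classifier over time $t$

Use the chemistry formula for exponential decay:

$$N_t = N_0 \, e^{-\lambda t}$$

where the decay constant $\lambda$ follows from the chosen half-life $t_{1/2}$ (for example, a 7-day half-life): $\lambda = \ln 2 / t_{1/2}$.

Example: 180-day half-life, and the example is 30 days old. A weight of 1.0 is decayed to 0.89.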
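The half-life decay can be checked numerically. This is a minimal sketch, assuming the usual exponential decay law with rate ln 2 / half-life. The function and parameter names are hypothetical.

```python
import math


def decay_weight(age_days, half_life_days, initial=1.0):
    """Weight of an example after `age_days`, given a half-life in days."""
    rate = math.log(2) / half_life_days  # decay constant
    return initial * math.exp(-rate * age_days)


w = decay_weight(age_days=30, half_life_days=180)
```

For the slide's example (180-day half-life, 30 days old), `w` rounds to 0.89, and an example exactly one half-life old gets a weight of 0.5.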

18 Tokenization

- Add phrases as features
- Use a sliding window
  - Example spam: "Mr. Holloway, I invite you to use our consolidated student loan services. We can save you \$50,000 on your student loans"
  - Window of size 2, new features: "Mr. Holloway", "Holloway I", "I invite", "invite you", "you to", "to use", and so on.
- Use lift to weed out poor combinations

Why? If we know of dependencies, but want to keep the independence assumption, explicitly adding the dependent features as a new feature may improve performance.
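The sliding window is short to write down. In this sketch the input is assumed to be already split into tokens. The function name `window_features` is hypothetical.

```python
def window_features(tokens, size=2):
    """Join every run of `size` consecutive tokens into one phrase feature."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = ["Mr.", "Holloway", "I", "invite", "you", "to", "use"]
phrases = window_features(tokens)
```

For the start of the example message this yields "Mr. Holloway", "Holloway I", "I invite", "invite you", "you to" and "to use", which would then be ranked by lift like any other feature.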

19 Example 1: Spell Checker

20 Spell Checker

From Peter Norvig. Source code provided in Python, Scheme, Perl, C, Java, Haskell, F#, Ruby, Erlang, and Rebol.

21 Spell Checker

$P(c)$, the language model, is the probability of a proposed correction $c$ standing on its own.

- Intuitively: "How likely is c to appear in an English text?"
- P("the") would have a relatively high probability; P("zxzxzxzyyy") would be near zero.
- Should we use words or phrases or something else?

$P(w \mid c)$, the error model, is the probability that $w$ would be typed in a text when the author meant $c$.

- Intuitively: "How likely is it that the author would type w by mistake when c was intended?"

22 Spell Checker

Where does P(c) come from?

- Read in a bunch of books, webpages, Wikipedia, etc.
- Google makes available its phrase counts data (24 GB compressed, just to warn you)
- What about unseen classes?

23 Spell Checker

Where does $P(w \mid c)$ come from?

- Trivial model: use edit distance to generate and score possibilities
- Consider only possibilities that have already been seen (real words / phrases)
- Can you think of another way to get these probabilities?
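The trivial model can be sketched in a few lines, in the spirit of Norvig's spell checker but not a copy of it. The tiny `CORPUS` is a hypothetical stand-in for a real language model, and every known word one edit away is treated as an equally likely correction, so the choice is driven by P(c) alone.

```python
import string
from collections import Counter

# Toy corpus (hypothetical); a real language model needs far more text.
CORPUS = "the quick brown fox jumps over the lazy dog the end".split()
COUNTS = Counter(CORPUS)


def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + ch + b[1:] for a, b in splits if b for ch in letters]
    inserts = [a + ch + b for a, b in splits for ch in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the known candidate with the highest count, i.e. highest P(c)."""
    if word in COUNTS:
        return word
    candidates = [c for c in edits1(word) if c in COUNTS]
    if not candidates:
        return word  # nothing known within one edit; leave the word alone
    return max(candidates, key=lambda c: COUNTS[c])
```

With this corpus, `correct("teh")` returns `"the"` and `correct("fax")` returns `"fox"`, while a string with no known neighbour is returned unchanged.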

24 Spell Checker

Can you think of another way to get these probabilities?

- Get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters.
- Incorporate feedback from users

26 Example 2: Spam Filter

27 Spam Filter

From Paul Graham's essays:

- "A Plan for Spam"
- "Better Bayesian Filtering"
  - Better tokenization (more separators)

Note: These are non-personalized filters.

28 Spam Filter

Feature Preparation

1. Gather spam and non-spam emails
2. Convert the emails to sets of features (sometimes called "bag of words")
   - Tokenize
   - Use TFIDF to remove common words
   - Remove duplicates (Should we do this?)

Example: "The CSGA is meeting for lunch today. Free pizza will be served at the meeting." => CSGA, meeting, lunch, today, free, pizza, served
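The feature preparation step might look like the sketch below. Two simplifications are assumptions, not part of the lecture: a small fixed `STOP_WORDS` set stands in for TFIDF-based removal of common words, and all tokens are lowercased. Returning a set is what removes duplicates.

```python
import re

# Hypothetical stand-in for "use TFIDF to remove common words".
STOP_WORDS = {"the", "is", "for", "will", "be", "at"}


def email_to_features(text):
    """Tokenize an email, drop common words, and remove duplicates."""
    tokens = re.findall(r"[a-z0-9$']+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}


features = email_to_features(
    "The CSGA is meeting for lunch today. "
    "Free pizza will be served at the meeting."
)
```

For the example sentence this leaves csga, meeting, lunch, today, free, pizza and served, with the second "meeting" collapsed into the first.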

29 Spam Filter

"I get a lot of email containing the word "Lisp", and (so far) no spam that does."

P(C)

- C is binary (spam, not spam)
- Graham uses an equal number of spam and non-spam messages, so the priors are uniform (ML)
- What are the conditions under which we should think seriously about this parameter? (Remember the ML vs. MAP discussion.)

30 Spam Filter

$P(F \mid c)$

- Just count the tokens and divide by the number of emails in the class
- Any observations?

Examples of $P(f \mid \text{spam})$: perl 0.01, python 0.01, tcl 0.01, scripting 0.01, morris 0.01, graham, guarantee, cgi, paul, quite, pop, various, prices, managed

31 Spam Filter

How to use the spam filter

1. A new email arrives. It is converted to tokens as the training examples were.
2. For each token in the new email, we look up (constant time) the probability, and multiply them together.
3. We then have the probability that it's spam and the probability that it's not spam. We choose the greater of the two (MAP) and filter the email appropriately.
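The three steps above, together with the count-and-divide estimate of P(f | c), can be sketched as follows. This is an illustration, not Graham's implementation. The add-one smoothing, the use of log probabilities to avoid underflow, and the toy training data are all assumptions.

```python
import math
from collections import Counter


def train(emails_by_class):
    """emails_by_class maps a class label to a list of feature sets."""
    total = sum(len(emails) for emails in emails_by_class.values())
    vocab = set().union(*(f for emails in emails_by_class.values() for f in emails))
    priors, likelihoods = {}, {}
    for c, emails in emails_by_class.items():
        priors[c] = len(emails) / total
        counts = Counter(f for features in emails for f in features)
        # Count emails containing the token, divide by emails in the class.
        # Add-one smoothing (an assumption) keeps unseen tokens above zero.
        likelihoods[c] = {f: (counts[f] + 1) / (len(emails) + 2) for f in vocab}
    return priors, likelihoods


def classify(features, priors, likelihoods):
    """Return the class with the largest P(c) * product of P(f | c)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for f in features:
            if f in likelihoods[c]:  # tokens never seen in training are skipped
                score += math.log(likelihoods[c][f])
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical training data, already converted to feature sets.
training = {
    "spam": [{"free", "prices", "guarantee"}, {"free", "cgi"}],
    "ham": [{"lisp", "python"}, {"lisp", "meeting", "lunch"}],
}
priors, likelihoods = train(training)
label_a = classify({"free", "guarantee"}, priors, likelihoods)
label_b = classify({"lisp", "lunch"}, priors, likelihoods)
```

Summing logarithms gives the same ranking as multiplying the probabilities, and each lookup is a constant-time dictionary access as in step 2.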

32 Spam Filter

Improvements

1. Add bias
   - We would rather misclassify as not spam than as spam
2. Personalize
   - How do we do this?

Any other ideas?

33 Ensemble Version

Using AdaBoost:

- Increase weights of misclassified examples
- Use weights directly with Bayes
- Generate a fixed number of classifiers
- Does not change the runtime or space complexities
- May be similar to learning in humans: learning a boosted naive Bayesian classifier can be done by rehearsing past experiences (Elkan 1997)
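"Increase weights of misclassified examples" can be illustrated with the textbook AdaBoost update for one boosting round. This sketch assumes that standard update and may differ in detail from the variant in Elkan's paper. The function name is hypothetical.

```python
import math


def adaboost_reweight(weights, misclassified):
    """One round of the standard AdaBoost weight update.

    `weights` sum to 1; `misclassified[i]` is True if example i was wrong.
    The weighted error must lie strictly between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)  # vote of this classifier
    updated = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated], alpha


new_weights, alpha = adaboost_reweight(
    [0.25, 0.25, 0.25, 0.25], [True, False, False, False]
)
```

After the update the single misclassified example carries half of the total weight, so the next naive Bayes classifier, trained on weighted counts, concentrates on it.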

34 Ensemble Approaches

- Diabetes in Pima Indians
- German Credit

Elkan, C. Boosting and Naive Bayesian Learning

35 Summary

- From Bayes Rule to Naïve Bayes
- MAP vs. ML
- Practicality
  - Spell Checker
  - Spam Filter
  - Ensemble Version

37 Sources
