4.5 Symbol Table Applications



Similar documents
dictionary find definition word definition book index find relevant pages term list of page numbers

cs2010: algorithms and data structures

Hashing. hash functions collision resolution applications. References: Algorithms in Java, Chapter 14.

1.4 Arrays Introduction to Programming in Java: An Interdisciplinary Approach Robert Sedgewick and Kevin Wayne Copyright /6/11 12:33 PM!

Achieve more with less

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

API for java.util.iterator. ! hasnext() Are there more items in the list? ! next() Return the next item in the list.

AP Computer Science Java Mr. Clausen Program 9A, 9B

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Anti Spamming Techniques

Adaption of Statistical Filtering Techniques

CompSci 125 Lecture 08. Chapter 5: Conditional Statements Chapter 4: return Statement

4.2 Sorting and Searching

Apache Lucene. Searching the Web and Everything Else. Daniel Naber Mindquarry GmbH ID 380

Spam Detection A Machine Learning Approach

Building a Multi-Threaded Web Server

6. Cholesky factorization

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP

Lecture 2 Matrix Operations

Spam Detection Using Customized SimHash Function

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

Using Files as Input/Output in Java 5.0 Applications

Webapps Vulnerability Report

Question 2: How do you solve a matrix equation using the matrix inverse?

Sample CSE8A midterm Multiple Choice (circle one)

CS 141: Introduction to (Java) Programming: Exam 1 Jenny Orr Willamette University Fall 2013

Chapter 8: Bags and Sets

Bayesian Filtering. Scoring

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Part 1: Link Analysis & Page Rank

Securepoint Security Systems

Qlik REST Connector Installation and User Guide

Simple Language Models for Spam Detection

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Big Data & Scripting Part II Streaming Algorithms

Machine Learning Final Project Spam Filtering

J a v a Quiz (Unit 3, Test 0 Practice)

Bayesian Spam Filtering

Smart E-Marketer s Guide

Spam Filtering based on Naive Bayes Classification. Tianhao Sun

Moving from CS 61A Scheme to CS 61B Java

Master of Sciences in Informatics Engineering Programming Paradigms 2005/2006. Final Examination. January 24 th, 2006

Handout 1. Introduction to Java programming language. Java primitive types and operations. Reading keyboard Input using class Scanner.

First Java Programs. V. Paúl Pauca. CSC 111D Fall, Department of Computer Science Wake Forest University. Introduction to Computer Science

(Eng. Hayam Reda Seireg) Sheet Java

Translating to Java. Translation. Input. Many Level Translations. read, get, input, ask, request. Requirements Design Algorithm Java Machine Language

Visual Basic Programming. An Introduction

System.out.println("\nEnter Product Number 1-5 (0 to stop and view summary) :

1 Determinants and the Solvability of Linear Systems

Common Patterns and Pitfalls for Implementing Algorithms in Spark. Hossein

CMSC 202H. ArrayList, Multidimensional Arrays

12-6 Write a recursive definition of a valid Java identifier (see chapter 2).

DNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT

The world s largest matrix computation. (This chapter is out of date and needs a major overhaul.)

Savita Teli 1, Santoshkumar Biradar 2

General Framework for an Iterative Solution of Ax b. Jacobi s Method

Glossary of Object Oriented Terms

Lecture 9. Semantic Analysis Scoping and Symbol Table

This material is built based on, Patterns covered in this class FILTERING PATTERNS. Filtering pattern

Spam Filtering Based on Latent Semantic Indexing

dedupe Documentation Release Forest Gregg, Derek Eder, and contributors

How To Filter Spam Image From A Picture By Color Or Color

International Journal of Research in Advent Technology Available Online at:

Link Analysis. Chapter PageRank

Introduction to Object-Oriented Programming

MIDTERM 1 REVIEW WRITING CODE POSSIBLE SOLUTION

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

Evaluation. Copy. Evaluation Copy. Chapter 7: Using JDBC with Spring. 1) A Simpler Approach ) The JdbcTemplate. Class...

Building Java Programs

Basics of Java Programming Input and the Scanner class

Search Engines. Stephen Shaw 18th of February, Netsoc

Unit 6. Loop statements

CS170 Lab 11 Abstract Data Types & Objects

Immunity from spam: an analysis of an artificial immune system for junk detection

Machine Learning in Spam Filtering

Web Forensic Evidence of SQL Injection Analysis

Install Java Development Kit (JDK) 1.8

The following program is aiming to extract from a simple text file an analysis of the content such as:

Hadoop WordCount Explained! IT332 Distributed Systems

LAB4 Making Classes and Objects

CS 348: Introduction to Artificial Intelligence Lab 2: Spam Filtering

LINKED DATA STRUCTURES

Using an Access Database

AP Computer Science Java Subset

How does the Excalibur Technology SPAM & Virus Protection System work?

1. Classification problems

Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools

Using ilove SharePoint Web Services Workflow Action

Spam filtering. Peter Likarish Based on slides by EJ Jung 11/03/10

Introduction to Java

CSE 1223: Introduction to Computer Programming in Java Chapter 2 Java Fundamentals

Transcription:

Set ADT 4.5 Symbol Table Applications Set ADT: unordered collection of distinct keys. Insert a key. Check if set contains a given key. Delete a key. SET interface. addkey) insert the key containskey) is given key present? removekey) remove the key iterator) return iterator over all keys Q. How to implement? Java library: java.util.hashset. Set Client: Remove Duplicates More Set Applications Remove duplicates. [e.g., from commercial mailing list] Read in a key. If key is not in set, insert and print it out. Application Purpose Key Spell checker Identify misspelled words Word Browser Highlight previously visited pages URL public class DeDup { public static void mainstring[] args) { SET<String> set = new SET<String>); while StdIn.isEmpty)) { String key = StdIn.readString); if set.containskey)) { set.addkey); System.out.printlnkey); Chess Detect repetition draw Board position Spam blacklist Prevent spam IP addresses of spammers 3 4

Vectors and Matrices Inverted Index Inverted index. Given a list of documents, preprocess so that you can quickly find all documents containing a given query. Ex 1. Book index. Ex. Web search engine index. Key. Query word. Value. Set of documents. 6 Inverted Index Implementation Inverted Index Implementation cont) public class InvertedIndex { public static void mainstring[] args) { ST<String, SET<String>> st; st = new ST<String, SET<String>>); use SET to avoid duplicates // create inverted index for String filename : args) { In in = new Infilename); while in.isempty)) { String word = in.readstring); if st.containsword)) st.putword, new SET<String>)); SET<String> set = st.getword); set.addfilename); // read queries from standard input In stdin = new In); while stdin.isempty)) { String query = in.readstring); if st.containsquery)) System.out.println"NOT FOUND"); else { SET<String> set = st.getquery); for String filename : set) System.out.printfilename + " "); System.out.println); 7 8

Inverted Index Extensions. Ignore stopwords: the, on, of, etc. Multi-word queries: set intersection AND) set union OR). Record position and number of occurrences of word in document. Sparse Vectors and Matrices 9 Vectors and Matrices Sparsity Vector. Ordered sequence of N real numbers. Matrix. N-by-N table of real numbers. Def. An N-by-N matrix is sparse if it has ON) nonzeros entries. Empirical fact. Large matrices that arise in practice are usually sparse. [ ], b = ["1 ] [ ] a = 0 3 15 a + b = "1 5 17 a o b = 0 # "1) + 3 # ) + 15 # ) = 36 Matrix representations. D array: space proportional to N. Goal: space proportional to number of nonzeros, without sacrificing fast access to individual elements. A = # 0 1 1& 4 ", B = $ 0 3 15' # 1 0 0& 0 0, A + B = $ 0 0 3' # 1 1 1& 6 " $ 0 3 18' Ex. Google performs matrix-vector product with N = 4 billion # 0 1 1& 4 " ) $ 0 3 15' #"1& $ ' = # 4& $ 36' 11 1

Sparse Vector Implementation Sparse Vector Implementation cont) public class SparseVector { private final int N; private ST<Integer, Double> st = new ST<Integer, Double>); public SparseVectorint N) { this.n = N; public void putint i, double value) { if value == 0.0) st.removei); else st.puti, value); public double getint i) { if st.containsi)) return st.geti); else return 0.0; a[i] = value a[i] // return a b public double dotsparsevector b) { SparseVector a = this; double sum = 0.0; for int i : a.st) if b.st.containsi)) sum += a.geti) * b.geti); return sum; // return c = a + b public SparseVector plussparsevector b) { SparseVector a = this; SparseVector c = new SparseVectorN); for int i : a.st) c.puti, a.geti)); for int i : b.st) c.puti, b.geti) + c.geti)); return c; 13 14 Sparse Matrix Implementation Sparse Matrix Implementation cont) public class SparseMatrix { private final int N; private SparseVector[] rows; public SparseMatrixint N) { this.n = N; rows = new SparseVector[N]; for int i = 0; i < N; i++) rows[i] = new SparseVectorN); public void putint i, int j, double value) { rows[i].putj, value); public double getint i, int j) { return rows[i].getj); N-by-N matrix of all zeros A[i][j] = value A[i][j] // return b = A*x public SparseVector timessparsevector x) { SparseMatrix A = this; SparseVector b = new SparseVectorN); for int i = 0; i < N; i++) b.puti, rows[i].dotx)); return b; // return C = A + B public SparseMatrix plussparsematrix B) { SparseMatrix A = this; SparseMatrix C = new SparseMatrixN); for int i = 0; i < N; i++) C.rows[i] = A.rows[i].plusB.rows[i]); return C; 15 16

A Plan for Spam A Plan for Spam Bayesian spam filter. Filter based on analysis of previous messages. User trains the filter by classifying messages as spam or ham. Parse messages into tokens alphanumeric, dashes, ', $) Build data structures. Hash table A of tokens and frequencies for spam. Hash table B of tokens and frequencies for ham. Hash table C of tokens with probability p that they appear in spam. double h =.0 * ham.freqword); double s = 1.0 * spam.freqword); double p = s/spams) / h/hams + s/spams); bias probabilities to avoid false positives Reference: http://www.paulgraham.com/spam.html 18 A Plan for Spam Identify incoming email as spam or ham. Find 15 most interesting tokens difference from 0.5). Combine probabilities using Bayes law. which data structure? p 1 " p " L " p 15 p1 " p " L" p15) + 1 p1) " 1 p ) " L" 1 p15) ) Declare as spam if threshold > 0.9. Details. Words you've never seen. Words that appear in ham corpus but not spam corpus, vice versa. Words that appear less than 5 times in spam and ham corpuses. Update data structures. 19