# An Analysis of Factors Used in Search Engine Ranking

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 An Analysis of Factors Used in Search Engine Ranking Albert Bifet 1 Carlos Castillo 2 Paul-Alexandru Chirita 3 Ingmar Weber 4 1 Technical University of Catalonia 2 University of Chile 3 L3S Research Center 4 Max-Planck-Institute for Computer Science Adversarial Information Retrieval Workshop AIRWeb 14th IWWW Conference Tokyo

2 Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

3 Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

4 Search Engine Ranking. Un buscador es alguien que busca, no necesariamente alguien que encuentra. JORGE BUCAY Why it s Search Engine Ranking so important? 88% of the time we use it, given a new task E-commerce depends on a high ranking How we can get a good ranking? Ranking depends on certain features Don t forget : CONTENT

5 An Analysis of Factors Used in Search Engine Ranking Influence of different page features on the ranking of search engine results: We use Google Binary classification problem Linear and non-linear methods Training set and a test set

6 Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

7 IR Model Definition An information retrieval model is a quadruple [D, Q, F, R(q i, d j )] where 1 D is a set composed of logical representations for the documents in the collection 2 Q is a set composed of logical representations for the user information needs (Queries) 3 F is a framework for modeling documents representations, queries, and their relationships 4 R(q i, d j ) is a ranking function wich associates a real number with a query q i Q and a document representation d j D

8 IR Model Definition Let t be the number of index terms in the system and k i be a generic index term. 1 K = {k 1,..., k t } is the set of all index terms 2 A weight w i,j is associated with each k i of a document d j 3 With the document d j is associated an index term vector d j = (w 1,j, w 2,j,..., w t,j ) Example 1 Document 1 =" My tailor is rich", Document 2 ="Your chair is red" 2 K = {"tailor","chair","rich","red" } 3 d1 = (1, 0, 1, 0), d 2 = (0, 1, 0, 1)

9 Vector Model Definition For the vector model, the weight w i,j associated with a pair (k i, d j ) is positive and non-binary. The index terms in the query are also weighted 1 Let w i,q be the weight associated with the pair [k i, q] 2 We define the query vector q as q = (w 1,q, w 2,q,..., w t,q ) 3 With the document d j is associated an index term vector d j = (w 1,j, w 2,j,..., w t,j ) 4 Sim(d j, q) = d j q d j q = i w i,j w i,q i w 2 i,j i w 2 i,q

10 Vector Model Example 1 Document 1 ="My tailor is rich" 2 Document 2 ="Your chair is red" 3 K = { "tailor", "chair", "rich" } 4 Query= "rich" Example 1 q = (0, 0, 1) 2 d1 = (1, 0, 2), d 2 = (0, 1, 0) 5, 3 d Sim(d 1, q) = 1 q d = 2 1 q Sim(d 2, q) = 0

11 Vector Model Definition 1 Let N be the total number of documents in the system 2 Let n i be the number of documents in wich k i appears 3 The best known term-weighting schemes use weights wich are given by w i,j = f i,j idf i 4 The inverse document frequency is idf i = log N n i

12 PageRank The Anatomy of a Search Engine Sergey Brin and Lawrence Page Definition We assume page A has pages T 1...T n which point to it. d is a damping factor between 0 and 1. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1 d) + d( PR(T 1) C(T 1 ) PR(T n) C(T n ) ) Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages PageRanks will be one.

13 PageRank

14 Features Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

15 Features Content Features Content features, query independent: FNEN Fraction of terms in the documents which can not be found in an English dictionary NBOD Number of bytes of the original document RFFT Relative frequency of the more frequent term ATLE Average term length Content features, query dependent: SIMT Similarity of the term to the document. AMQT Average matches of the query terms FATT Anchor text term frequency

16 Features Formatting, Link and Metadata Features Formatting features, query-dependent. TMKY Term in the meta keywords or description (N) Link features, query-independent: ILNK Number of pages linking to a page, in-degree approximated using Google API link: queries PRNK PageRank of the page, or the approximation of the PageRank in a 0-10 scale obtained from Google s toolbar. Metadata features: TURL Term is in the page s URL or not. TDIR Term is listed in a web directory or not.

17 Estimating a Ranking Function Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

18 Estimating a Ranking Function Data Set Arts : Albertinelli, Bacchiacca, Botticelli States : Arizona, Arkansas, Connecticut Spam : buy cds, buy dvds, cheap software Multiple : anova bootstrap feature missing principal squared, analysis frequent likelihood misclassification pruning statistical, Training Set (7 queries): learn a linear scoring function or a decision tree. Validation Set (2 queries): used for feature selection and pruning the decision tree. Test Set (3 queries): To estimate the generalization error of our ranking function,

19 Estimating a Ranking Function Binary classification problem Let q be a query, and u, v be feature vector of pages. Let u < q v represent the ordering returned by the ranking function of a search engine for the given query. We want to find f such that f (u) < f (v) whenever u < v. If we assume that f is linear, then there exists a vector w such that f (u) = w u. Then f (u) < f (v) w u < w v w (v u) > 0.

20 Estimating a Ranking Function Logistic Regression and SVM Logistic regression models the posterior probabilities of the classes. In the case of only two classes and + the model has the form P(class = + X = x) log P(class = X = x) = β 0 + w x (1) Support Vector Machines typically use linear decision boundaries in a transformed feature space. In this higher dimension space data becomes separable by hyperplanes

21 Estimating a Ranking Function Binary classification trees f might not be linear Search engine might use several layers of índices A classification tree is built through binary recursive partitioning. Pruning the tree leads to a better performance on general data

22 Architecture of the system Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

23 Architecture of the system Architecture. Downloader: software that executes a query and downloads the returned pages using the data set queries Feature Extractor: software that computes the features of the pages downloaded. Analyzer: software that analyzes the features of the returned pages and estimates a function

24 Main Results Outline 1 Motivation 2 Retrieval Information Features Estimating a Ranking Function 3 Experimental Results Architecture of the system Main Results

25 Main Results Precision values obtained using only individual features. Feature Arts States Spam Multiple NBOD (-)53.8% 54.4% 51.1% 50.9% FNEN (-)59.4% 53.5% 51.6% (-)59.8% RFFT 52.1% (-)54.3% 50.0% 53.9% ATLE (-)54.2% 54.0% 50.0% 54.4% FATT 56.2% 50.3% 53.0% 51.8% AMQT 55.4% 50.5% 52.1% 56.9% SIMT (N) 56.5% (-)52.0% 52.7% 59.0% SIMT 55.4% (-)50.9% 52.6% 69.7% TMKY (N) 51.4% 51.4% 54.5% 50.0% TMKY 53.0% 51.4% 55.1% 50.0% ILNK 58.0% 66.3% 57.0% 53.9% PRNK 58.7% 60.6% 55.0% 57.2% TDIR 52.3% 53.5% 54.3% 51.9%

26 Main Results Best precision achieved on all. Table: Best precision achieved on all, shifted and top pairs. We include the performance on the test data as well as on the whole data set, including training, validation and test sets. % all % shifted % top pairs correct pairs correct pairs correct Dataset Test All Test All Test All Best model Arts 63.7% 61.8% 69.1% 66.4% 47.6% 48.0% Log. regr., strongest 3 features States 64.6% 66.3% 73.2% 73.8% 97.6% 98.5% Class. tree, only ILINK feature Spam 62.5% 59.5% 70.5% 62.1% 98.2% 74.8% Log. regr., strongest 10 features Multiple 67.5% 70.9% 78.1% 81.3% 81.0% 87.0% Log. reg., strongest 3 features

27 Main Results Relevant Features hidden The query logs, which Google obtains through its toolbar. The age of the incoming links and other information related to web link dynamics. The rate of change at which a website changes, obtained by repeated web crawls. The true number of ingoing links, as Google s link:www.abc.com only gives a lower bound. The true PageRank used by Google, as the one displayed in its toolbar is only an approximation, and furthermore, seems to be too strongly correlated to the number of in-links.

28 Summary Influence of different page features on the ranking of search engine results: We use Google Binary classification problem Training set and a test set Ranking only according to the strongest feature for a category gives is able to predict the order in which any pair of pages will appear in the results with a precision of between 57% and 70%.

### An Analysis of Factors Used in Search Engine Ranking

An Analysis of Factors Used in Search Engine Ranking Albert Bifet Technical University of Catalonia abifet@lsi.upc.es Paul-Alexandru Chirita L3S Research Center chirita@l3s.de Carlos Castillo University

### An Analysis of Factors Used in Search Engine Ranking

An Analysis of Factors Used in Search Engine Ranking Albert Bifet Technical University of Catalonia abifet@lsi.upc.es Paul-Alexandru Chirita L3S Research Center chirita@l3s.de Carlos Castillo University

### Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

### 1 o Semestre 2007/2008

Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 Outline 1 2 3 4 5 Exploiting Text How is text exploited? Two main directions Extraction Extraction

### International Journal of Engineering Research-Online A Peer Reviewed International Journal Articles are freely available online:http://www.ijoer.

RESEARCH ARTICLE SURVEY ON PAGERANK ALGORITHMS USING WEB-LINK STRUCTURE SOWMYA.M 1, V.S.SREELAXMI 2, MUNESHWARA M.S 3, ANIL G.N 4 Department of CSE, BMS Institute of Technology, Avalahalli, Yelahanka,

### Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects Mohammad Farahmand, Abu Bakar MD Sultan, Masrah Azrifah Azmi Murad, Fatimah Sidi me@shahroozfarahmand.com

### Web Search Engines: Solutions

Web Search Engines: Solutions Problem 1: A. How can the owner of a web site design a spider trap? Answer: He can set up his web server so that, whenever a client requests a URL in a particular directory

### Search and Information Retrieval

Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

### IJREAS Volume 2, Issue 2 (February 2012) ISSN: 2249-3905 STUDY OF SEARCH ENGINE OPTIMIZATION ABSTRACT

STUDY OF SEARCH ENGINE OPTIMIZATION Sachin Gupta * Ankit Aggarwal * ABSTRACT Search Engine Optimization (SEO) is a technique that comes under internet marketing and plays a vital role in making sure that

### AN EXPLORATORY STUDY ON SEARCH ENGINE OPTIMIZATION Meghna Bhatia 1 Prayga Gulati 2 Anu Thomas 3 Sies (Nerul) College of Arts. Science and Commerce

AN EXPLORATORY STUDY ON SEARCH ENGINE OPTIMIZATION Meghna Bhatia 1 Prayga Gulati 2 Anu Thomas 3 Sies (Nerul) College of Arts. Science and Commerce Abstract: Search Engine Optimization (SEO) has become

### Content-Based Recommendation

Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

### Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

### Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

### Search engine ranking

Proceedings of the 7 th International Conference on Applied Informatics Eger, Hungary, January 28 31, 2007. Vol. 2. pp. 417 422. Search engine ranking Mária Princz Faculty of Technical Engineering, University

### Web Search. 2 o Semestre 2012/2013

Dados na Dados na Departamento de Engenharia Informática Instituto Superior Técnico 2 o Semestre 2012/2013 Bibliography Dados na Bing Liu, Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### Search Engine Architecture I

Search Engine Architecture I Software Architecture The high level structure of a software system Software components The interfaces provided by those components The relationships between those components

### RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS

ISBN: 978-972-8924-93-5 2009 IADIS RANKING WEB PAGES RELEVANT TO SEARCH KEYWORDS Ben Choi & Sumit Tyagi Computer Science, Louisiana Tech University, USA ABSTRACT In this paper we propose new methods for

### Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

### Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

### Social Business Intelligence Text Search System

Social Business Intelligence Text Search System Sagar Ligade ME Computer Engineering. Pune Institute of Computer Technology Pune, India ABSTRACT Today the search engine plays the important role in the

### Removing Web Spam Links from Search Engine Results

Removing Web Spam Links from Search Engine Results Manuel EGELE pizzaman@iseclab.org, 1 Overview Search Engine Optimization and definition of web spam Motivation Approach Inferring importance of features

### Web Search Engines. Search Engine Characteristics. Web Search Queries. Chapter 27, Part C Based on Larson and Hearst s slides at UC-Berkeley

Web Search Engines Chapter 27, Part C Based on Larson and Hearst s slides at UC-Berkeley http://www.sims.berkeley.edu/courses/is202/f00/ Database Management Systems, R. Ramakrishnan 1 Search Engine Characteristics

### Building Enriched Document Representations using Aggregated Anchor Text

Building Enriched Document Representations using Aggregated Anchor Text Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy Yahoo! Labs 1 http://www.search-engines-book.com/ http://research.microsoft.com/

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Review of Computer Engineering Research WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM. Geeta R.B.* Shobha R.B.

Review of Computer Engineering Research journal homepage: http://www.pakinsight.com/?ic=journal&journal=76 WEB PAGES CATEGORIZATION BASED ON CLASSIFICATION & OUTLIER ANALYSIS THROUGH FSVM Geeta R.B.* Department

### Software Engineering 4C03

Software Engineering 4C03 Research Paper: Google TM Servers Researcher: Nathan D. Jory Last Revised: March 29, 2004 Instructor: Kartik Krishnan Introduction The Google TM search engine is a powerful and

### Enhancing the Ranking of a Web Page in the Ocean of Data

Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India hkshitesh@gmail.com In today

Finding Advertising Keywords on Web Pages Scott Wen-tau Yih Joshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University Contextual Ads 101 Publisher s website Digital Camera Review The

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Machine Learning Final Project Spam Email Filtering

Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE

### Removing Web Spam Links from Search Engine Results

Removing Web Spam Links from Search Engine Results Manuel Egele Technical University Vienna pizzaman@iseclab.org Christopher Kruegel University of California Santa Barbara chris@cs.ucsb.edu Engin Kirda

### PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

### Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

### A COMPREHENSIVE REVIEW ON SEARCH ENGINE OPTIMIZATION

Volume 4, No. 1, January 2013 Journal of Global Research in Computer Science REVIEW ARTICLE Available Online at www.jgrcs.info A COMPREHENSIVE REVIEW ON SEARCH ENGINE OPTIMIZATION 1 Er.Tanveer Singh, 2

### Introduction to Information Retrieval http://informationretrieval.org

Introduction to Information Retrieval http://informationretrieval.org IIR 7: Scores in a Complete Search System Hinrich Schütze Center for Information and Language Processing, University of Munich 2014-05-07

### Forecasting stock markets with Twitter

Forecasting stock markets with Twitter Argimiro Arratia argimiro@lsi.upc.edu Joint work with Marta Arias and Ramón Xuriguera To appear in: ACM Transactions on Intelligent Systems and Technology, 2013,

### So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

### Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

### Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

### Make search become the internal function of Internet

Make search become the internal function of Internet Wang Liang 1, Guo Yi-Ping 2, Fang Ming 3 1, 3 (Department of Control Science and Control Engineer, Huazhong University of Science and Technology, WuHan,

Optimize Your Content Need to create content that is both what search engines need, and what searchers want to see. This chapter covers: What search engines look for The philosophy of writing for search

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### SEO BASICS. March 20, 2015

SEO BASICS March 20, 2015 1. Historical SEO 2. SEO 101 3. Live Site Reviews 4. Current Landscape 5. The Future of SEO 6. Resources 7. Q&A AGENDA 2 SEO Basics Augusta Arts HISTORICAL SEO Search Engine Optimization

### An Introduction to Machine Learning and Natural Language Processing Tools

An Introduction to Machine Learning and Natural Language Processing Tools Presented by: Mark Sammons, Vivek Srikumar (Many slides courtesy of Nick Rizzolo) 8/24/2010-8/26/2010 Some reasonably reliable

### Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

### Corso di Biblioteche Digitali

Corso di Biblioteche Digitali Vittore Casarosa casarosa@isti.cnr.it tel. 050-315 3115 cell. 348-397 2168 Ricevimento dopo la lezione o per appuntamento Valutazione finale 70-75% esame orale 25-30% progetto

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### LCs for Binary Classification

Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

### Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

### An Information Retrieval using weighted Index Terms in Natural Language document collections

Internet and Information Technology in Modern Organizations: Challenges & Answers 635 An Information Retrieval using weighted Index Terms in Natural Language document collections Ahmed A. A. Radwan, Minia

### HydroDesktop Overview

HydroDesktop Overview 1. Initial Objectives HydroDesktop (formerly referred to as HIS Desktop) is a new component of the HIS project intended to address the problem of how to obtain, organize and manage

### Improving Search Engines via Classification

Improving Search Engines via Classification Zheng Zhu May 2011 A Dissertation Submitted to Birkbeck College, University of London in Partial Fulfillment of the Requirements for the Degree of Doctor of

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network

Forecasting Trade Direction and Size of Future Contracts Using Deep Belief Network Anthony Lai (aslai), MK Li (lilemon), Foon Wang Pong (ppong) Abstract Algorithmic trading, high frequency trading (HFT)

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Homework 2. Page 154: Exercise 8.10. Page 145: Exercise 8.3 Page 150: Exercise 8.9

Homework 2 Page 110: Exercise 6.10; Exercise 6.12 Page 116: Exercise 6.15; Exercise 6.17 Page 121: Exercise 6.19 Page 122: Exercise 6.20; Exercise 6.23; Exercise 6.24 Page 131: Exercise 7.3; Exercise 7.5;

### Disambiguating Implicit Temporal Queries by Clustering Top Relevant Dates in Web Snippets

Disambiguating Implicit Temporal Queries by Clustering Top Ricardo Campos 1, 4, 6, Alípio Jorge 3, 4, Gaël Dias 2, 6, Célia Nunes 5, 6 1 Tomar Polytechnic Institute, Tomar, Portugal 2 HULTEC/GREYC, University

### Semantic Search in Portals using Ontologies

Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br

### Online Marketing Optimization Essentials

Online Marketing Optimization Essentials Bilal Saleh Principal Partner E-Nor Inc. May 20, 2014 Agenda 2 E-Nor Overview Search Engine Optimization (SEO) Paid search Web Analytics Q&A Graphics by: http://www.iconarchive.com/show/seo-icons-by-designbolts.html

### Folksonomies versus Automatic Keyword Extraction: An Empirical Study

Folksonomies versus Automatic Keyword Extraction: An Empirical Study Hend S. Al-Khalifa and Hugh C. Davis Learning Technology Research Group, ECS, University of Southampton, Southampton, SO17 1BJ, UK {hsak04r/hcd}@ecs.soton.ac.uk

### Web Usage in Client-Server Design

Web Search Web Usage in Client-Server Design A client (e.g., a browser) communicates with a server via http Hypertext transfer protocol: a lightweight and simple protocol asynchronously carrying a variety

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Practical Graph Mining with R. 5. Link Analysis

Practical Graph Mining with R 5. Link Analysis Outline Link Analysis Concepts Metrics for Analyzing Networks PageRank HITS Link Prediction 2 Link Analysis Concepts Link A relationship between two entities

### Hyperlink Analysis for the Web

Hyperlink Analysis for the Web Hyperlink analysis algorithms allow search engines to deliver focused results to user queries.this article surveys ranking algorithms used to retrieve information on the

### Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2

Optimization of Search Results with Duplicate Page Elimination using Usage Data A. K. Sharma 1, Neelam Duhan 2 1, 2 Department of Computer Engineering, YMCA University of Science & Technology, Faridabad,

### WHAT IS THE TEMPORAL VALUE OF WEB SNIPPETS?

WHAT IS THE TEMPORAL VALUE OF WEB SNIPPETS? Ricardo Campos 1, 2, 4 Gaël Dias 2, Alípio Jorge 3, 4 1 Tomar Polytechnic Institute, Tomar, Portugal 2 Centre of Human Language Tecnnology and Bioinformatics,

### Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### An ontology-based approach for semantic ranking of the web search engines results

An ontology-based approach for semantic ranking of the web search engines results Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name

### Pixels Description of scene contents. Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT) Banksy, 2006

Object Recognition Large Image Databases and Small Codes for Object Recognition Pixels Description of scene contents Rob Fergus (NYU) Antonio Torralba (MIT) Yair Weiss (Hebrew U.) William T. Freeman (MIT)

### How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

### CSC321 Introduction to Neural Networks and Machine Learning. Lecture 21 Using Boltzmann machines to initialize backpropagation.

CSC321 Introduction to Neural Networks and Machine Learning Lecture 21 Using Boltzmann machines to initialize backpropagation Geoffrey Hinton Some problems with backpropagation The amount of information

### BIG DATA What it is and how to use?

BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

### Let the data speak to you. Look Who s Peeking at Your Paycheck. Big Data. What is Big Data? The Artemis project: Saving preemies using Big Data

CS535 Big Data W1.A.1 CS535 BIG DATA W1.A.2 Let the data speak to you Medication Adherence Score How likely people are to take their medication, based on: How long people have lived at the same address

### Automatic Web Page Classification

Automatic Web Page Classification Yasser Ganjisaffar 84802416 yganjisa@uci.edu 1 Introduction To facilitate user browsing of Web, some websites such as Yahoo! (http://dir.yahoo.com) and Open Directory

Definition of Web Spam "Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve." "Web spam taxonomy" by Zoltan Gyöngyi, PhD (Research Scientist

### CS4047 Multimedia Industry Perspectives Search Engine Optimisation (SEO) Introduction to SEO. Martina Skelly Activate Marketing.

CS4047 Multimedia Industry Perspectives Search Engine Optimisation (SEO) Martina Skelly Introduction to SEO Martina Skelly Activate Marketing 18 November 2011 Agenda The Elements of a successful SEO strategy

### BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

### Mining Text Data: An Introduction

Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo

### Measuring the Utilization of On-Page Search Engine Optimization in Selected Domain

JIOS, VOL. 39, NO. 2 (2015) SUBMITTED 07/15; ACCEPTED 10/15 UDC 001.81:004.774 Original Scientific Paper Measuring the Utilization of On-Page Search Engine Optimization in Selected Domain Goran Matošević

### DATA MINING Introductory and Advanced Topics Part I

DATA MINING Introductory and Advanced Topics Part I Source : Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides for the text by Dr. M.H.Dunham,

### The Need for Training in Big Data: Experiences and Case Studies

The Need for Training in Big Data: Experiences and Case Studies Guy Lebanon Amazon Background and Disclaimer All opinions are mine; other perspectives are legitimate. Based on my experience as a professor

### Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive

### SEO 101. Learning the basics of search engine optimization. Marketing & Web Services

SEO 101 Learning the basics of search engine optimization Marketing & Web Services Table of Contents SEARCH ENGINE OPTIMIZATION BASICS WHAT IS SEO? WHY IS SEO IMPORTANT? WHERE ARE PEOPLE SEARCHING? HOW

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

### Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data

Parallel Data Selection Based on Neurodynamic Optimization in the Era of Big Data Jun Wang Department of Mechanical and Automation Engineering The Chinese University of Hong Kong Shatin, New Territories,

### Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Oracle9i Data Warehouse Review Robert F. Edwards Dulcian, Inc. Agenda Oracle9i Server OLAP Server Analytical SQL Data Mining ETL Warehouse Builder 3i Oracle 9i Server Overview 9i Server = Data Warehouse

### Exploring Adaptive Window Sizes for Entity Retrieval

Exploring Adaptive Window Sizes for Entity Retrieval Fawaz Alarfaj, Udo Kruschwitz, and Chris Fox School of Computer Science and Electronic Engineering University of Essex Colchester, CO4 3SQ, UK {falarf,udo,foxcj}@essex.ac.uk

### The Data Mining Process

Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max