5/6/2008. Data mining the web. Oscar Djupfeldt, Avraz Hirori, Christoffer Kullman Group 5. Introduction. Structure mining Content mining Usage mining



Problem
  There is a huge amount of information on the web
  The information available on the web is very diverse
  Information on the web exists in almost all types of formats
  The web is semi-structured because of HTML's nested structure
  The web is linked
  Lots of redundant information

Web Crawling
  Getting data about web pages
  Web search
  Done in two ways:
    Uninformed graph search
    Guided, informed search
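A minimal sketch of the uninformed graph search option, assuming breadth-first traversal as the search strategy. The link graph `LINKS` and the function `crawl_bfs` are hypothetical stand-ins for real pages and a real fetcher; nothing here touches the network.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
}


def crawl_bfs(start, links):
    """Visit pages breadth-first and return them in the order they were reached."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Skip pages already queued, so cycles such as c -> a terminate.
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append(target)
    return visited


order = crawl_bfs("a.html", LINKS)
```

A guided, informed search would replace the plain queue with a priority queue ordered by some relevance estimate for each unvisited page.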

Structure Mining
  Analyzes which places a web page points to and which places point to a web page
  To find relevant data about a web page
  Used for web search and social networks

Document Structure
  Look at the HTML or XML structure of a web page
  The structure can reveal which data in a document is relevant, as well as providing context for the information

Document Structure
  <title>Web Mining</title>
  <meta name="Author" content="John Doe">
  --
  <h2><big>Web Mining</big></h2>
  --
  <a href="presentation.html">Web Mining</a>

Document Structure
  What do we get from this?
  Tagging and indexing
    Expand main index.
    Small, separate indices for faster access.
  Relevance ranking
    Term location affects ranking.
    Different tags = different weights.
    Good at natural relevance ranking.
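One way to picture "different tags = different weights" is a parser that scores each term by the tag it appears in. This is only a sketch: the weights in `TAG_WEIGHTS`, the default weight, and the class name `WeightedTermParser` are assumptions made for illustration, not values from the slides.

```python
from html.parser import HTMLParser

# Hypothetical weights; the slides only state that tags carry different weights.
TAG_WEIGHTS = {"title": 3.0, "h2": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermParser(HTMLParser):
    """Accumulate a score per term, weighted by the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open tags that carry a weight
        self.scores = {}     # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Text takes the highest weight among its enclosing weighted tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags), default=DEFAULT_WEIGHT)
        for term in data.lower().split():
            self.scores[term] = self.scores.get(term, 0.0) + weight


PAGE = """<title>Web Mining</title>
<meta name="Author" content="John Doe">
<h2><big>Web Mining</big></h2>
<a href="presentation.html">Web Mining</a>
<p>Patterns on the web</p>"""

parser = WeightedTermParser()
parser.feed(PAGE)
parser.close()
scores = parser.scores
```

Terms that appear in the title, heading, and anchor text end up with much higher scores than terms that appear only in body text.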

Document Structure Problems
  Invisible words
  Misleading text in titles, metatags, etc.
  Missing or incorrect information
  Anchor text
    Most significant in use due
    Additional information

Link Analysis - Hyperlinks
  Independent evaluation of web page popularity or authority
  Idea is from social networks
    Popularity
    Authority
    Prestige
  Ranking
  Link analysis algorithms:
    PageRank
    HITS

Link Analysis - PageRank
  Democratic: a links to b = a votes for b.
  The voter is also analyzed
  Designed for the random clicking user: the likelihood that a user will reach a certain page
  Iterative process

Link Analysis - PageRank
  The damping factor
    Random page switch
  Sink pages
    No outbound links?
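The iterative process can be sketched in a few lines. This is an illustrative version, not the production algorithm: the damping value 0.85, the fixed iteration count, the choice to spread the rank of sink pages evenly over all pages, and the example graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict mapping page -> outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Sink pages have no outbound links; share their rank among all pages.
        sink_rank = sum(rank[p] for p in pages if not links[p])
        new_rank = {}
        for p in pages:
            # Each page q linking to p "votes" with an equal share of its own rank.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            # (1 - damping) / n models the random page switch.
            new_rank[p] = (1 - damping) / n + damping * (incoming + sink_rank / n)
        rank = new_rank
    return rank


# Hypothetical four-page graph; "d" is a sink page.
GRAPH = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

ranks = pagerank(GRAPH)
```

The ranks always sum to 1, so each value can be read as the likelihood that the random clicking user is on that page.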

Content Mining
  Focused on text mining
  Video
  Audio
  Images
  Structured records like tables
  Information retrieval

Information Retrieval
  Information retrieval mainly uses two types of measurement: precision and recall.
  Precision: the proportion of correctly returned pages out of all the returned pages.
  Recall: the proportion of all correct pages that are returned.
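As a small worked example of the two measures, assuming pages are identified by hypothetical labels such as `p1`:

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned pages and a set of correct pages."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # correct pages that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four pages returned, three pages are correct, two of them overlap.
precision, recall = precision_recall(
    returned=["p1", "p2", "p3", "p4"],
    relevant=["p1", "p2", "p5"],
)
```

Here precision is 2/4 and recall is 2/3. Returning every page on the web would give perfect recall with very low precision; returning a single correct page would give perfect precision with very low recall.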

Information Retrieval
  We typically have one of the following:
    Perfect recall and very low precision
    Perfect precision and very low recall

Content Mining
  Web search engines
  Web directories

Content Mining
  Classification
  Genetic algorithms
  Memory-based reasoning or k-nearest neighbors (k-NN) based classification
  Vector space model

Vector Space Model
  Page similarity
  Weighted vectors
  Term frequency-inverse document frequency (TF-IDF) method
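A sketch of page similarity in the vector space model, assuming whitespace tokenization, the plain `tf * log(N / df)` weighting, and cosine similarity as the comparison. The three example pages are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} vector using TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (count / len(tokens)) * math.log(n / df[t]) for t, count in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical pages: the first two share vocabulary, the third does not.
PAGES = [
    "web mining finds patterns in web data",
    "web usage mining studies web logs",
    "cooking pasta with tomato sauce",
]

vectors = tfidf_vectors(PAGES)
similar = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
```

A k-NN classifier can reuse the same vectors by assigning a page the majority label of its k most similar labelled pages.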

Clustering
  Used to further help users find what they are looking for
  Refine searches on a more general keyword

Clustering

Usage Mining
  Web logs
    File (or several files) automatically created and maintained by a server, recording the activity it performs.
    Typically recorded:
      Client IP address
      Request date/time
      Page requested
      HTTP code
      Bytes served
      User agent
      Referer

Usage Mining
  Web usage mining consists of three phases:
    Preprocessing
    Pattern discovery
    Pattern analysis
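The listed fields match what many servers write in the widely used combined log format, so a parser might look like the sketch below. The exact log layout is an assumption (servers can be configured differently), and the sample line is invented.

```python
import re

# Assumes the common "combined" log layout; real servers may be configured differently.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample entry, for illustration only.
SAMPLE = (
    '192.0.2.1 - - [06/May/2008:10:15:32 +0200] '
    '"GET /index.html HTTP/1.1" 200 5120 '
    '"http://example.com/start.html" "Mozilla/5.0"'
)

entry = parse_line(SAMPLE)
```

Lines that do not match return `None`, which gives the preprocessing phase a simple way to discard malformed entries.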

Preprocessing
  To clean up the data
  User identification
  Session identification
  Path completion

Pattern Discovery
  Clustering algorithm
  Dependency modeling
  Classification
  Association rules
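Session identification is often done with a simple inactivity timeout. The sketch below assumes that heuristic, a 30-minute threshold, and a user key built from IP address plus user agent; none of these choices come from the slides, and the requests are invented.

```python
from datetime import datetime, timedelta

# Assumption: a gap of more than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(requests):
    """Group (user, time, page) requests into per-user lists of sessions."""
    sessions = {}   # user -> list of sessions, each a list of pages
    last_seen = {}  # user -> time of that user's previous request
    for user, when, page in sorted(requests, key=lambda r: (r[0], r[1])):
        if user not in last_seen or when - last_seen[user] > SESSION_TIMEOUT:
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append(page)
        last_seen[user] = when
    return sessions


# Hypothetical users, identified here by "IP|user agent".
REQUESTS = [
    ("192.0.2.1|Mozilla", datetime(2008, 5, 6, 10, 0), "/index.html"),
    ("192.0.2.7|Opera", datetime(2008, 5, 6, 10, 5), "/about.html"),
    ("192.0.2.1|Mozilla", datetime(2008, 5, 6, 10, 10), "/products.html"),
    ("192.0.2.1|Mozilla", datetime(2008, 5, 6, 11, 30), "/index.html"),
]

sessions = split_sessions(REQUESTS)
```

The resulting page sequences are the input that pattern discovery methods such as clustering or association rules would work on.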

Pattern Analysis
  SQL
  Filter out uninteresting rules
  Visualization techniques
    Graphic
    Color

Usage Mining Example

Conclusions
  Useful for personalizing the Internet
  Good commercial use
  Simplifies searching and browsing
  Adds much needed structure to the data on the web
  Easy to manipulate
  No authority needed