COMP3420: Advanced Databases and Data Mining. Web data mining


Lecture outline
- The Web as a data source
- Challenges the Web poses to data mining
- Types of Web data mining
- Mining the Web page layout structure
- Mining the Web's link structure
- Mining multimedia data on the Web
- Automatic classification of Web documents
- Web usage mining
- Important information about paper presentation

The Web as a data source
- The biggest source of information
- Distributed, dynamic, linked, with different data types
- Web pages contain semi-structured data (HTML and XML), as well as free-format text, images, videos, sounds, etc.
- Many Web pages are dynamically created, often by accessing databases (for example online stores, information directories, search engines)
- Web pages are linked
- Some parts of the Web are only accessible to certain people (logins required)

Challenges the Web poses to data mining
- The size of the Web (billions of pages, petabytes of data)
- The complexity of Web pages (more complex than any traditional database or text document collection)
- The Web is a highly dynamic source of information
- The Web serves a broad diversity of user communities (with different backgrounds, cultures, interests and knowledge)
- Only a very small proportion of the Web is truly relevant or useful
- Trust in Web site content is often questionable

Types of Web data mining
- Mining the Web page layout structure
- Mining the Web's link structure
- Mining multimedia data on the Web
- Automatic classification of Web documents
- Web log mining

Mining the Web page layout structure
- Compared to plain text, a Web page is a two-dimensional presentation
- Rich visual effects created by different font types, formats, separators, blank areas, colors, pictures, etc.
- Different parts of a page are not equally important

Example page elements:
- Title: CNN.com International
- H1: IAEA: Iran had secret nuke agenda
- H3: EXPLOSIONS ROCK BAGHDAD
- Text body (with position and font type): The International Atomic Energy Agency has concluded that Iran has secretly produced small amounts of nuclear materials, including low-enriched uranium and plutonium, that could be used to develop nuclear weapons, according to a confidential report obtained by CNN
- Hyperlink: URL: http://www.cnn.com/... Anchor text: Al Qaeda
- Image: URL: http://www.cnn.com/image/... Alt & caption: Iran nuclear. Anchor text: CNN Homepage News

Web page block: a better information unit
[Figure: a Web page divided into blocks, each labelled with low, medium, or high importance]

Motivation for VIPS (VIsion-based Page Segmentation)
- Problems of treating a Web page as an atomic unit
  - Web pages usually contain not only pure content, but also noise (navigation, decoration, interaction, etc.)
  - Web pages often contain multiple topics
  - Different parts of a page are not equally important
- Web pages have internal structure
  - Two-dimensional logical structure and visual layout presentation
  - More structured than a free-text document, less structured than a structured document
- DOM (Document Object Model): tree structure of the elements of a page
- Layout is the third dimension of Web pages
  - First dimension: content
  - Second dimension: hyperlinks

Is DOM a good representation of page structure?
- Extract structural tags such as <P>, <TABLE>, <UL>, <TITLE>, <H1> to <H6>, etc.
- DOM is more related to content display and does not necessarily reflect semantic structure
- How about XML? It has a long way to go before it replaces HTML
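
The tag extraction mentioned above can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the method used in the lecture: the class name `StructuralTagExtractor`, the function `extract_structure`, and the sample page (built from the headings in the CNN example, with a placeholder paragraph) are all assumptions made for the example. It also assumes well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Structural tags named on the slide: <P>, <TABLE>, <UL>, <TITLE>, <H1> to <H6>.
STRUCTURAL_TAGS = {"p", "table", "ul", "title"} | {f"h{i}" for i in range(1, 7)}


class StructuralTagExtractor(HTMLParser):
    """Collects [tag, text] entries for structural tags, in document order."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of (tag, index into self.blocks)
        self.blocks = []  # one [tag, text] entry per structural element

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self.blocks.append([tag, ""])
            self._open.append((tag, len(self.blocks) - 1))

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self._open and self._open[-1][0] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text is attributed to the innermost open structural element.
        if self._open:
            self.blocks[self._open[-1][1]][1] += data


def extract_structure(html):
    """Return (tag, text) pairs with whitespace collapsed."""
    parser = StructuralTagExtractor()
    parser.feed(html)
    parser.close()
    return [(tag, " ".join(text.split())) for tag, text in parser.blocks]


# Hypothetical sample page; the paragraph text is a placeholder.
SAMPLE_HTML = """
<html><head><title>CNN.com International</title></head>
<body>
<h1>IAEA: Iran had secret nuke agenda</h1>
<p>Placeholder report text.</p>
<h3>EXPLOSIONS ROCK BAGHDAD</h3>
</body></html>
"""

for tag, text in extract_structure(SAMPLE_HTML):
    print(tag, "->", text)
```

The result is a flat list of tagged text runs. It says nothing about where each element appears on screen or how important it is, which is the limitation of DOM-based extraction that the slide points out.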

VIPS algorithm
- Motivation: in many cases, topics can be distinguished with visual clues, such as position, distance, font, color, etc.
- Goal: extract the semantic structure of a Web page based on its visual presentation
- Procedure: top-down, partition the Web page based on separators
- Result: a tree structure, where each node corresponds to a block in the page
  - Each node is assigned a value (degree of coherence) indicating how coherent the content in the block is, based on visual perception
  - Each block is assigned an importance value
- Segments obtained by VIPS are more semantically aggregated than those in the DOM tree
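
The top-down partitioning idea can be illustrated with a deliberately simplified, one-dimensional toy. This is not the VIPS algorithm itself, which works on the rendered page with both horizontal and vertical separators and computes a degree of coherence per block. Here each element is reduced to a vertical extent, and a block is split at its widest blank gap. The names `Box`, `Node`, `segment`, `leaves`, the `min_gap` threshold, and the pixel coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    label: str
    top: int     # y-coordinate of the upper edge (hypothetical pixels)
    bottom: int  # y-coordinate of the lower edge


@dataclass
class Node:
    boxes: list
    children: list = field(default_factory=list)


def segment(boxes, min_gap=10):
    """Recursively split a vertical run of boxes at its widest separator."""
    boxes = sorted(boxes, key=lambda b: b.top)
    node = Node(boxes)
    # Blank space between consecutive boxes plays the role of a separator.
    gaps = [(boxes[i + 1].top - boxes[i].bottom, i) for i in range(len(boxes) - 1)]
    if not gaps:
        return node  # a single box cannot be split further
    widest, index = max(gaps)
    if widest < min_gap:
        return node  # no separator is wide enough: treat the block as coherent
    node.children = [
        segment(boxes[: index + 1], min_gap),
        segment(boxes[index + 1 :], min_gap),
    ]
    return node


def leaves(node):
    """Return the labels of each leaf block, top to bottom."""
    if not node.children:
        return [[b.label for b in node.boxes]]
    return [leaf for child in node.children for leaf in leaves(child)]


# Hypothetical page layout: navigation bar, headline plus body, footer.
PAGE = [
    Box("nav", 0, 20),
    Box("headline", 60, 90),
    Box("body", 95, 300),
    Box("footer", 360, 380),
]

print(leaves(segment(PAGE)))
```

In this layout the headline and body end up in the same leaf block because only a 5-pixel gap separates them, while the navigation bar and footer are split off by wider gaps.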

Example of Web page segmentation
[Figure: the same Web page segmented by its DOM structure and by its VIPS structure]
- Can be applied to Web image retrieval
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright 2006 Elsevier Inc.)

Mining the Web's link structure
- The Web is a massive graph (Web pages are nodes, hyperlinks between them are edges)
- Can use graph and link mining approaches (textbook chapter, not covered)
- Example: find authoritative Web pages on a certain topic
  - An authoritative page is one that many other pages point to
  - Not as easy as it sounds: for example, www.google.com does not explicitly contain the phrase "Web search engine"
  - Commercial and competitive interests, such as advertisements, distort the picture, as do many navigational links
- To find authoritative Web pages, use hub pages (pages that provide links to many authoritative Web pages)
  - Example hub page: a personal home page with a list of recommended links
- HITS (Hyperlink-Induced Topic Search): start from a search query to get a root page set, which is then expanded; hub and authority weights are then propagated iteratively
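
The iterative weight propagation in HITS can be sketched as follows. The sketch covers only the propagation step: it assumes the root set has already been retrieved and expanded, and takes the resulting link graph as input. The function name `hits`, the fixed iteration count, and the page names in `LINKS` are assumptions made for the example.

```python
import math


def hits(graph, iterations=50):
    """Compute hub and authority weights.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub weight: sum of the authority weights of the pages it links to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalise both vectors so the weights stay bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth


# Hypothetical link graph: three home pages with lists of recommended links.
LINKS = {
    "home_a": ["site_x", "site_y"],
    "home_b": ["site_x"],
    "home_c": ["site_x", "site_y", "site_z"],
}

hub_weights, auth_weights = hits(LINKS)
print("best authority:", max(auth_weights, key=auth_weights.get))
print("best hub:", max(hub_weights, key=hub_weights.get))
```

Here `site_x` receives links from all three home pages and comes out as the strongest authority, while `home_c`, which links to every site, comes out as the strongest hub.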

Mining multimedia data on the Web
- Data includes images, video, audio, graphs, etc.
- Mostly embedded into Web pages, often via hyperlinks
- Increasing demand for effective methods to organise and retrieve multimedia data
- Web page layout mining can be used to find multimedia blocks
- Example: classify images
  - Use VIPS to identify multimedia blocks in a Web page
  - Use the textual description around images for classification/categorisation
  - Use block-level link analysis (rather than page-level link analysis such as Google's PageRank)

Example image categories
Source: Han and Kamber, DM Book, 2nd Ed. (Copyright 2006 Elsevier Inc.)

Automatic classification of Web documents
- Supervised classification approach, training data required (classified Web pages)
- Classification done using keyword-based classification methods (text mining approach)
  - A Web page is seen as a document of terms
  - Take structure into account (block-based page content analysis)
- Hyperlinks contain high-quality semantics (use them to improve classification quality)
- Hyperlinks surrounding a document (meta data, headers, footers, etc.) are noise and can actually decrease classification accuracy
- Often a very imbalanced problem, so one-class classifiers may be used
- Large amount of research into the semantic Web
  - Bring structure to the Web based on the semantic meaning of Web pages
  - Build ontologies of the Web (an ontology is a data model that represents a set of concepts)
  - Web page classification will help to extract the semantic meaning of Web pages
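
A keyword-based classifier that treats a page as a document of terms can be sketched as below. The slide does not name a specific method; multinomial naive Bayes with add-one smoothing is used here purely as one common choice. The function names, the whitespace tokeniser, and the four training pages are assumptions made for the example, and the sketch ignores page structure and hyperlinks.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Very crude tokeniser: lowercase and split on whitespace."""
    return text.lower().split()


def train(labelled_pages):
    """Count classes and per-class term frequencies from (text, label) pairs."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_pages:
        class_counts[label] += 1
        for term in tokenize(text):
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_counts, term_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    class_counts, term_counts, vocabulary = model
    total_pages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_pages)  # class prior
        total_terms = sum(term_counts[label].values())
        for term in tokenize(text):
            if term not in vocabulary:
                continue  # terms never seen in training carry no evidence
            # Add-one smoothing avoids zero probabilities.
            smoothed = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: page text already extracted, with known labels.
TRAINING_PAGES = [
    ("digital camera lens review", "shopping"),
    ("memory card price discount", "shopping"),
    ("lecture notes data mining course", "education"),
    ("exam tutorial assignment course", "education"),
]

model = train(TRAINING_PAGES)
print(classify("camera price", model))
print(classify("course tutorial", model))
```

A block-based variant would apply the same counting per page block rather than per page, so that navigation and footer text could be down-weighted or dropped.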

Web usage mining
- Mining Web log records to discover user access patterns of Web pages
  - For example: after looking at digital camera pages, 70% of users will look at memory card pages
- Web log entry: URL requested, source IP address, time stamp, browser details, cookies, etc.
  - Low-level details that need to be cleaned, condensed, and transformed
- Use data stream mining techniques
- Apply association and frequent pattern mining, and trend analysis
- Applications: e-commerce, improved Web system design (navigation and caching), Web page pre-fetching, adaptive Web sites (depending upon the user's history)
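
A pattern such as "users who view camera pages go on to view memory card pages" corresponds to the confidence of an association rule over user sessions. The sketch below assumes the log has already been cleaned into (source IP, time stamp, URL) records and, as a simplification, treats each IP address as one session. The log records, URLs, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical, already-cleaned log records: (source IP, time in seconds, URL).
LOG = [
    ("10.0.0.1", 0, "/camera"),
    ("10.0.0.1", 30, "/memory-card"),
    ("10.0.0.2", 5, "/camera"),
    ("10.0.0.2", 50, "/tripod"),
    ("10.0.0.3", 10, "/camera"),
    ("10.0.0.3", 40, "/memory-card"),
    ("10.0.0.4", 20, "/memory-card"),
]


def sessions_by_ip(log):
    """Group requested URLs by source IP, ordered by time stamp."""
    sessions = defaultdict(list)
    for ip, _timestamp, url in sorted(log, key=lambda record: (record[0], record[1])):
        sessions[ip].append(url)
    return dict(sessions)


def rule_confidence(sessions, antecedent, consequent):
    """Fraction of sessions visiting `antecedent` that later visit `consequent`."""
    with_antecedent = 0
    with_both = 0
    for urls in sessions.values():
        if antecedent in urls:
            with_antecedent += 1
            first_visit = urls.index(antecedent)
            if consequent in urls[first_visit + 1:]:
                with_both += 1
    return with_both / with_antecedent if with_antecedent else 0.0


sessions = sessions_by_ip(LOG)
confidence = rule_confidence(sessions, "/camera", "/memory-card")
print(f"/camera -> /memory-card confidence: {confidence:.2f}")
```

In this toy log, two of the three sessions that visit the camera page later visit the memory card page. Real logs would first need the cleaning steps listed above, for example splitting sessions by time gaps and filtering out robot traffic.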

What now... things to do
- Read text book sections 10.4 and 10.5
- Work on assignment 2, due this Friday 28 May, 12:00 (noon)!
- Tutorial 5 this week, tutorial 6 next week (both on using Rattle)
- Course review and exam preparation lecture next Tuesday 1 June
- Guest lecture by Graham Williams (Australian Taxation Office, and developer of the Rattle data mining tool) on Thursday 3 June, 10-11 in CSIT N101