Data Mining the Web
Oscar Djupfeldt, Avraz Hirori, Christoffer Kullman
Group 5

Introduction
  Structure mining
  Content mining
  Usage mining
Problem
  There is a huge amount of information on the web
  The information available on the web is very diverse
  Information on the web exists in almost all types of formats
  The web is semi-structured because of HTML's nested structure
  The web is linked
  Lots of redundant information

Web Crawling
  Getting data about web pages
  Web search
  Done in two ways:
    Uninformed graph search
    Guided, informed search
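
The uninformed graph search mentioned for web crawling can be sketched as a breadth-first traversal. This is a minimal sketch, not a real crawler: the `LINKS` dictionary is a hypothetical in-memory link graph standing in for pages fetched over HTTP, and the names `crawl_bfs` and `max_pages` are illustrative assumptions.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to (stands in for real fetches).
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
}


def crawl_bfs(start, links, max_pages=100):
    """Visit pages breadth-first from `start` and return them in visit order."""
    order = []
    seen = {start}            # pages already queued, so no page is visited twice
    queue = deque([start])
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(crawl_bfs("a.html", LINKS))
```

A guided, informed search would replace the plain queue with a priority queue ordered by some estimate of how promising each page is.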
Structure Mining
  Analyzes which pages a web page links to and which pages link to it
  Finds relevant data about a web page
  Used for web search and social networks

Document Structure
  Looks at the HTML or XML structure of a web page
  The structure can reveal which data in a document is relevant and provides context for the information
Document Structure
  <title>web Mining</title>
  <meta name="Author" content="John Doe">
  --
  <h2><big>web Mining</big></h2>
  --
  <a href="presentation.html">Web Mining</a>

Document Structure
  What do we get from this?
  Tagging and indexing
    Expand the main index
    Small, separate indices for faster access
  Relevance ranking
    A term's location affects its ranking
    Different tags = different weights
    Good at natural relevance ranking
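
The idea that different tags carry different weights can be illustrated with Python's standard `html.parser` module. This is a rough sketch under stated assumptions: the weights in `TAG_WEIGHTS`, the `VOID_TAGS` set, and the names `WeightedTermParser` and `weighted_terms` are all hypothetical and not taken from any particular search engine.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights: terms in titles count more than terms in body text.
TAG_WEIGHTS = {"title": 5.0, "h2": 3.0, "a": 2.0}
DEFAULT_WEIGHT = 1.0
VOID_TAGS = {"meta", "br", "img", "link", "hr", "input"}  # never closed


class WeightedTermParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = defaultdict(float)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the most recent matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # A term is scored with the heaviest tag that currently encloses it.
        weights = [TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags]
        weight = max(weights, default=DEFAULT_WEIGHT)
        for term in data.lower().split():
            self.scores[term] += weight


def weighted_terms(html):
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)


SAMPLE = (
    "<title>Web Mining</title>"
    '<meta name="Author" content="John Doe">'
    "<h2><big>Web Mining</big></h2>"
    "<p>Mining the web</p>"
    '<a href="presentation.html">Web Mining</a>'
)
print(weighted_terms(SAMPLE))
```

In this sample, "web" and "mining" score far higher than "the" because they appear in the title, heading, and anchor text.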
Document Structure Problems
  Invisible words
  Misleading text in titles, metatags, etc.
  Missing or incorrect information
  Anchor text
    Most significant in use due
    Additional information

Link Analysis - Hyperlinks
  Independent evaluation of web page popularity or authority
  Idea is from social networks
    Popularity
    Authority
    Prestige
  Ranking
  Link analysis algorithms:
    PageRank
    HITS
Link Analysis - PageRank
  Democratic: a link from a to b is a vote by a for b
  The voter is also analyzed
  Designed for the randomly clicking user: the likelihood that a user will reach a certain page
  Iterative process

Link Analysis - PageRank
  The damping factor
    Random page switch
  Sink pages
    No outbound links?
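
The iterative process, the damping factor, and the handling of sink pages can be sketched as below. This is a simplified illustration, not Google's implementation: the damping value of 0.85 is the commonly quoted default, spreading a sink page's rank evenly over all pages is one common convention, and the function name and sample graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict page -> rank for a graph given as page -> list of targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Sink pages have no outbound links; their rank is shared by all pages.
        sink_mass = sum(rank[p] for p in pages if not links.get(p))
        # (1 - damping) models the user switching to a random page.
        new_rank = {p: (1.0 - damping) / n + damping * sink_mass / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # Each link passes on an equal share of the voter's own rank.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(GRAPH).items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this small graph, page c collects votes from a, b, and d and ends up ranked highest, while d, which nobody links to, ends up lowest.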
Content Mining
  Focused on text mining
  Video
  Audio
  Images
  Structured records like tables
  Information retrieval

Information Retrieval
  Information retrieval mainly uses two measures: precision and recall.
  Precision: the proportion of returned pages that are correct.
  Recall: the proportion of all correct pages that are returned.
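
The two measures can be computed directly from the set of returned pages and the set of correct (relevant) pages. The sketch below assumes both sets are known, which in practice requires manually judged test data; the function name and page identifiers are illustrative.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned and a set of relevant pages."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # pages that are both returned and correct
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four pages returned, two of them correct; three correct pages exist in total.
precision, recall = precision_recall(["p1", "p2", "p3", "p4"], ["p1", "p2", "p5"])
print(precision, recall)
```

Returning every page on the web gives perfect recall but very low precision; returning a single correct page gives perfect precision but very low recall.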
Information Retrieval
  We typically have one of the following:
    Perfect recall and very low precision
    Perfect precision and very low recall

Content Mining
  Web search engines
  Web directories
Content Mining
  Classification
    Genetic algorithms
    Memory-based reasoning or k-nearest neighbors (k-NN) classification
    Vector space model

Vector Space Model
  Page similarity
  Weighted vectors
  Term frequency-inverse document frequency (TF-IDF) method
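
Page similarity in the vector space model can be sketched by weighting terms with TF-IDF and comparing the resulting vectors by cosine similarity. This is a minimal sketch under assumptions: whitespace tokenization, term frequency normalized by document length, and `log(N / df)` as the inverse document frequency are one common variant among several, and the function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


DOCS = [
    "web mining finds patterns",
    "web usage mining reads logs",
    "cats chase mice",
]
vectors = tfidf_vectors(DOCS)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

A k-NN classifier could reuse `cosine` as its similarity measure and assign a page the majority label of its k most similar labeled pages.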
Clustering
  Used to further help users find what they are looking for
  Refines searches on a more general keyword
Usage Mining
  Web logs
    A file (or several files) automatically created and maintained by a server, recording the activity it performs
    Typically recorded:
      Client IP address
      Request date/time
      Page requested
      HTTP code
      Bytes served
      User agent
      Referer

Usage Mining
  Web usage mining consists of three phases:
    Preprocessing
    Pattern discovery
    Pattern analysis
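
The web log fields listed above map naturally onto a regular expression. The sketch below assumes the log uses the widely seen Combined Log Format; the sample line, the `LOG_PATTERN` expression, and the function name are illustrative assumptions, and real logs may use a different layout.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method page protocol" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Servers write "-" when no bytes were sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


SAMPLE_LINE = (
    '192.0.2.1 - - [04/May/2008:10:15:32 +0200] "GET /index.html HTTP/1.1" '
    '200 5120 "http://example.com/start.html" "Mozilla/5.0"'
)
print(parse_log_line(SAMPLE_LINE))
```

Lines that do not match are returned as `None` so that the preprocessing phase can count or discard them.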
Preprocessing
  To clean up the data
  User identification
  Session identification
  Path completion

Pattern Discovery
  Clustering algorithm
  Dependency modeling
  Classification
  Association rules
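
Session identification from the preprocessing phase is often done with a timeout heuristic. The sketch below assumes that a user is identified by client IP address alone and that a gap of more than 30 minutes starts a new session; both are common simplifications rather than rules from the source, and the function name and sample data are illustrative.

```python
from datetime import datetime, timedelta


def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (user, time, page) tuples into a list of (user, [pages]) sessions."""
    sessions = []
    last_user, last_time = None, None
    # Sorting puts each user's requests together in chronological order.
    for user, time, page in sorted(requests):
        if user != last_user or time - last_time > timeout:
            sessions.append((user, [page]))   # new user or long gap: new session
        else:
            sessions[-1][1].append(page)      # same visit continues
        last_user, last_time = user, time
    return sessions


T0 = datetime(2008, 5, 4, 10, 0)
REQUESTS = [
    ("192.0.2.1", T0, "/index.html"),
    ("192.0.2.7", T0 + timedelta(minutes=1), "/about.html"),
    ("192.0.2.1", T0 + timedelta(minutes=5), "/products.html"),
    ("192.0.2.1", T0 + timedelta(hours=2), "/index.html"),
]
for user, pages in split_sessions(REQUESTS):
    print(user, pages)
```

The resulting sessions are the input that pattern discovery techniques such as clustering or association rules typically work on.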
Pattern Analysis
  SQL
  Filter out uninteresting rules
  Visualization techniques
    Graphic
    Color

Usage Mining Example
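
As one possible illustration of discovering association rules and filtering out the uninteresting ones, the sketch below derives page-to-page rules from sessions and keeps only those above support and confidence thresholds. The thresholds, the restriction to rules between two single pages, and the sample sessions are assumptions made for brevity.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return sorted (lhs, rhs, support, confidence) rules between page pairs."""
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        unique = sorted(set(pages))           # count each page once per session
        page_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                   # share of sessions with both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]  # P(rhs visited | lhs visited)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


SESSIONS = [
    ["/index", "/products", "/cart"],
    ["/index", "/products"],
    ["/index", "/about"],
    ["/products", "/cart"],
]
for lhs, rhs, support, confidence in pair_rules(SESSIONS, 0.5, 0.7):
    print(f"{lhs} -> {rhs} (support {support:.2f}, confidence {confidence:.2f})")
```

With these thresholds only the rule from "/cart" to "/products" survives: every session that reached the cart also visited the products page.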
Conclusions
  Useful for personalizing the Internet
  Good commercial use
  Simplifies searching and browsing
  Adds much-needed structure to the data on the web
  Easy to manipulate
  No authority needed