1st Semester 2007/2008

1 Departamento de Engenharia Informática, Instituto Superior Técnico, 1st Semester 2007/2008

2 Outline

4 Exploiting Text. How is text exploited? Two main directions: Information Retrieval and Information Extraction.

5 Information Extraction: entity and relationship (link) extraction; entity resolution/matching; other types of extraction (events, opinions, sentiments); IE bibliography.

6 Information Retrieval Goals: representation, organization, storage and access to information items, in order to provide the user with easy access to information. The emphasis is on information.

7 Information vs. Data Retrieval. Data retrieval: given a specified condition (e.g. {lab, ethics} ⊆ document), find all items that satisfy the condition. Information retrieval: given a user query, find all items that contain information relevant to the user's needs. However, how do you characterize the user's information need?
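
The data-retrieval condition on this slide can be read as a set-containment test. The following is a minimal Python sketch under that reading; the toy collection and the helper name `data_retrieval` are made up for illustration and do not come from the slides.

```python
# Toy collection: each document is reduced to its set of terms (illustrative data).
docs = {
    "d1": {"lab", "ethics", "animals"},
    "d2": {"lab", "safety"},
    "d3": {"ethics", "lab", "medical", "experiments"},
}

def data_retrieval(required_terms, collection):
    """Return the ids of documents whose term set contains every required term."""
    return sorted(
        doc_id
        for doc_id, terms in collection.items()
        if required_terms <= terms  # subset test: required_terms ⊆ terms
    )

matches = data_retrieval({"lab", "ethics"}, docs)
print(matches)
```

Every returned document satisfies the condition exactly, but nothing in this test says whether it is relevant to what the user actually needs, which is the gap information retrieval tries to close.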

9 Translating the user information need. An example: "Find all pages containing information on the ethical treatment of animals for medical experiments. The pages should contain references to recent related scientific articles, together with an enumeration of known existing alternatives for different medical fields." (Try this on Google.) Usually this is translated to "ethics animals medical experiments", but is this a convenient translation?

12 IR Tasks Document processing Indexing Crawling Query processing Distributed IR String processing... processing Ad-hoc retrieval Classification Clustering Filtering Question answering...

13 The Process

14 IR Models. Classic models: Boolean, Vector, Probabilistic. Alternative models: Fuzzy, Extended Boolean, ...; LSI, Neural Networks, ...; Belief Network, Language Models, ...

16 Index Terms. In the classic IR models, documents are represented by index terms: full text/selected keywords, structure/no structure. Not all terms are equally useful, so index terms can be weighted. We assume that terms are mutually independent; this is, of course, a simplification.

17 An Example. Example document: I heartily accept the motto, "That government is best which governs least"; and I should like to see it acted up to more rapidly and systematically. Carried out, it finally amounts to this, which also I believe, "That government is best which governs not at all"; and when men are prepared for it, that will be the kind of government which they will have.

18 An Example Index terms I accept acted all also amounts and are at be believe best carried finally for government governs have heartily is it kind least like men more motto not of out prepared rapidly see should systematically that the they this to up when which will

19 An Example. Index terms (with counts): I 3, accept 1, acted 2, all 3, also 1, amounts 1, and 3, are 1, at 1, be 1, believe 1, best 2, carried 1, finally 1, for 1, government 3, governs 2, have 1, heartily 1, is 2, it 3, kind 1, least 1, like 1, men 1, more 1, motto 1, not 1, of 1, out 1, prepared 1, rapidly 1, see 1, should 1, systematically 1, that 3, the 2, they 1, this 1, to 3, up 1, when 1, which 4, will 2
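
A table of index terms and counts like this one can be built mechanically. The sketch below is one simple way to do it, assuming a tokenizer that lowercases the text and keeps only runs of letters; the function name `index_terms` is illustrative, and a different tokenizer would produce slightly different terms and counts.

```python
import re
from collections import Counter

# The example document from the slides.
text = (
    "I heartily accept the motto, That government is best which governs least; "
    "and I should like to see it acted up to more rapidly and systematically. "
    "Carried out, it finally amounts to this, which also I believe That "
    "government is best which governs not at all; and when men are prepared "
    "for it, that will be the kind of government which they will have."
)

def index_terms(document):
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return Counter(tokens)

counts = index_terms(text)
# Print the terms in alphabetical order with their counts.
for term in sorted(counts):
    print(term, counts[term])
```

Because the text is lowercased, the pronoun "I" is indexed as `i` in this sketch.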

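21 An Example. Logical view of the documents: [Table: term-document matrix with one row per document d and one column per index term (accept, acted, all, ..., government, governs, ...); cell values not recoverable]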

22 Documents as Vectors. Documents are represented as vectors: $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$, where $w_{i,j}$ is the weight of term $i$ in document $j$. Queries are also vectors: $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$. Vector operations can be used to compare queries with documents (or documents with documents).

23 An example. Suppose the vocabulary has two terms, $k_1$ = men and $k_2$ = government. Two documents, $d_1$ and $d_2$, can be defined as, for instance, $d_1 = (2.2, 5.2)$ and $d_2 = (4.9, 1.0)$.

24 An example. [Figure: the vectors $d_1 = (2.2, 5.2)$ and $d_2 = (4.9, 1.0)$ plotted on axes labeled men and government]

25 Defining Document Vectors. Two questions are still unanswered: (1) How do we define term weights? (2) How do we compare documents to queries?

26 Defining Term Weights: TF (term frequency). Term frequency is a measure of term importance within a document. Definition: let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which term $k_i$ appears. The normalized frequency of a term $k_i$ in document $d_j$ is given by $f_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l \mathrm{freq}_{l,j}}$, where $\mathrm{freq}_{i,j}$ is the number of occurrences of term $k_i$ in document $d_j$.
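
The normalized term frequency defined on this slide can be sketched in a few lines of Python. The function name `normalized_tf` and the token list are illustrative assumptions, not part of the slides.

```python
from collections import Counter

def normalized_tf(tokens):
    """Compute f_{i,j} = freq_{i,j} / max_l freq_{l,j} for a single document."""
    freq = Counter(tokens)          # freq_{i,j}: raw occurrences of each term
    max_freq = max(freq.values())   # max_l freq_{l,j}: count of the most frequent term
    return {term: count / max_freq for term, count in freq.items()}

# Illustrative document: "government" occurs 3 times, "men" twice, "governs" once.
tokens = ["government", "governs", "government", "men", "government", "men"]
tf = normalized_tf(tokens)
print(tf)
```

The most frequent term of a document always gets the value 1, so documents of different lengths become comparable.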

27 Defining Term Weights: IDF ((inverse) document frequency). Document frequency is a measure of term importance within a collection. Definition: the inverse document frequency of a term $k_i$ is given by $\mathrm{idf}_i = \log \frac{N}{n_i}$.
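
A small sketch of the inverse document frequency follows. The slide does not state the base of the logarithm, so the natural logarithm is an assumption here, as are the function name `idf` and the toy collection.

```python
import math

def idf(term, documents):
    """Compute idf_i = log(N / n_i) for a term over a list of term sets.

    Assumes the term occurs in at least one document (n_i > 0);
    the natural logarithm is an assumption, the slide only says "log".
    """
    N = len(documents)                                   # total number of documents
    n_i = sum(1 for doc in documents if term in doc)     # documents containing the term
    return math.log(N / n_i)

# Illustrative collection of four documents, each given as a set of terms.
documents = [
    {"men", "government"},
    {"government"},
    {"government", "motto"},
    {"men", "government"},
]
print(idf("government", documents), idf("men", documents), idf("motto", documents))
```

A term that appears in every document, such as "government" here, gets an idf of zero: it does not help to tell documents apart.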

28 Defining Term Weights: TF-IDF. Definition: the weight of a term $k_i$ in document $d_j$ for the vector space model is given by the tf-idf formula $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$.
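
Putting the two definitions together, the tf-idf weights of a small collection could be computed as sketched below. The function name `tf_idf_vectors`, the use of the natural logarithm, and the three toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: w_{i,j}} dictionary per document.

    docs is a list of token lists; w_{i,j} = f_{i,j} * log(N / n_i),
    with f_{i,j} the normalized term frequency.
    """
    N = len(docs)
    # n_i: number of documents in which each term appears (count each doc once).
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values())
        vectors.append(
            {t: (c / max_freq) * math.log(N / df[t]) for t, c in freq.items()}
        )
    return vectors

docs = [
    ["government", "government", "men"],
    ["government", "motto"],
    ["men", "men", "motto", "kind"],
]
vectors = tf_idf_vectors(docs)
for vector in vectors:
    print(vector)
```

In this toy collection "kind" occurs in a single document, so its idf factor, log(3), is larger than the log(3/2) of terms that occur in two documents.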

29 Document Similarity. Similarity between documents and queries is a measure of the correlation between their vectors. Documents/queries that share the same terms, with similar weights, should be more similar. Thus, as a similarity measure, we use the cosine of the angle between the vectors: $sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$
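
The cosine measure can be tried on the two example documents $d_1 = (2.2, 5.2)$ and $d_2 = (4.9, 1.0)$ from the earlier slide. In the sketch below the query vector `q` is hypothetical (the slides do not give its coordinates), and returning 0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

d1 = (2.2, 5.2)   # weights for (men, government), from the slides
d2 = (4.9, 1.0)
q = (1.0, 1.0)    # hypothetical query giving equal weight to both terms

sim_d1 = cosine_similarity(d1, q)
sim_d2 = cosine_similarity(d2, q)
print(sim_d1, sim_d2)
```

With this particular query, $d_1$ scores higher than $d_2$, so it would be ranked first. Because the cosine ignores vector length, a long document is not favored merely for containing more term occurrences.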

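30 An example. [Figure: the vectors $d_1$, $d_2$ and the query $q$ plotted on axes labeled men and government; $\alpha$ and $\theta$ are the angles between $q$ and the document vectors, with $\cos(\alpha) = 0.9$ and $\cos(\theta) = 0.8$]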

32 Traditional IR vs. Web IR. Traditional IR systems: the worth of a document regarding a query is intrinsic to the document; documents are self-contained units; documents are descriptive and truthful. The World Wide Web: indefinitely growing; non-textual content; documents are not self-complete; no coherence of style, vocabulary, language, ...; most web queries are 2 words long.

33 Web IR. More information to explore: multimedia (images, video, sound); (semi-)structured content; hyperlinks.

34 Hyperlink graph analysis. Hypermedia is a social network. Social network theory: extensive research in applying graph notions; centrality and prestige; co-citation (relevance judgment). Applications: Web search (HITS, Google); classification and topic distillation.

35 Ranking Through Link Analysis. Ranking search results. Problems: keyword queries are not selective enough; documents do not have enough text. Solution: use graph notions of popularity/prestige, e.g., use the PageRank algorithm.

37 Link. Each page is a node without any textual properties. Each hyperlink is an edge connecting two nodes, with possibly only a positive edge weight property.

38 Two perspectives. (1) The prestige of a page is proportional to the sum of the prestige scores of the pages linking to it. (2) The idea of a random surfer on a strongly connected web graph.

39 Overview of PageRank. Pre-computes a rank vector: provides a-priori (offline) importance estimates for all pages on the Web, independent of the search query. In-degree prestige: not all votes are worth the same; the prestige of a page depends on the prestige of the citing pages. Pre-compute a query-independent prestige score. At query time, prestige scores are used in conjunction with query-specific IR scores.

40 The PageRank algorithm. $E$ is the adjacency matrix of the Web: $E[u, v] = 1$ iff there is a link from $u$ to $v$, and $E[u, v] = 0$ otherwise. The out-degree of node $u$ is given by $N_u = \sum_v E[u, v]$. Start with an initial prestige vector $p_0[u]$. Compute $p_{i+1}[v] = \sum_{(u,v) \in E} \frac{p_i[u]}{N_u}$
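
The iteration on this slide can be sketched directly in Python. The representation of the graph as a dictionary of out-links, the function name `pagerank_basic`, the uniform starting vector and the three-page graph are all assumptions made for illustration; the graph is chosen to be strongly connected and aperiodic so that the iteration converges.

```python
def pagerank_basic(links, iterations=100):
    """Iterate p_{i+1}[v] = sum over links (u, v) of p_i[u] / N_u.

    links maps each page u to the list of pages it links to; every page
    is assumed to appear as a key and to have at least one out-link.
    """
    nodes = list(links)
    p = {u: 1.0 / len(nodes) for u in nodes}   # p_0: uniform initial prestige
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for u, out_links in links.items():
            share = p[u] / len(out_links)      # p_i[u] / N_u
            for v in out_links:
                nxt[v] += share
        p = nxt
    return p

# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_basic(links)
print(ranks)
```

For this graph the fixed point can be checked by hand: $p[a] = p[c]$ and $p[b] = p[a]/2$, which with total prestige 1 gives 0.4, 0.2 and 0.4.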

41 Computing PageRank [Figures: worked computation shown over slides 41 to 44; not recoverable from the transcription]

45 Problems of Convergence. The Web graph is not strongly connected (only a fourth of the graph is!). The Web graph is not aperiodic. Rank sinks: pages without out-links; directed cyclic paths.
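
Two of these problems are easy to reproduce with the plain update rule on tiny graphs. The sketch below uses hypothetical graphs and an illustrative helper named `step` that performs one iteration: the first graph ends in a page without out-links, the second is a two-page cycle.

```python
def step(links, p):
    """One iteration of p_{i+1}[v] = sum over links (u, v) of p_i[u] / N_u."""
    nxt = {v: 0.0 for v in links}
    for u, out_links in links.items():
        for v in out_links:
            nxt[v] += p[u] / len(out_links)
    return nxt

# Page without out-links: c receives prestige but never passes it on.
chain = {"a": ["b"], "b": ["c"], "c": []}
p = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
for _ in range(3):
    p = step(chain, p)
total_after_chain = sum(p.values())
print(total_after_chain)

# Periodic graph: two pages that only link to each other.
cycle = {"a": ["b"], "b": ["a"]}
p0 = {"a": 1.0, "b": 0.0}
p1 = step(cycle, p0)
p2 = step(cycle, p1)
print(p0, p1, p2)
```

In the first graph all prestige drains away after three steps; in the second, the vector keeps swapping between two states and never settles.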

46 A simple fix. Two-way choice at each node: with a certain probability $d$ ($0.1 < d < 0.2$), the surfer jumps to a random page on the Web; with probability $1 - d$, the surfer decides to choose, uniformly at random, an out-neighbor. $p_{i+1}[v] = \frac{d}{N} + (1 - d) \sum_{(u,v) \in E} \frac{p_i[u]}{N_u}$, where $N$ is here the total number of pages.
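
The fixed update rule can be sketched as follows, keeping the slide's convention that $d$ is the probability of the random jump. The value d = 0.15, the function name `pagerank_with_jumps` and the three-page graph are assumptions for illustration; pages without out-links are not handled specially here, so the example graph has none.

```python
def pagerank_with_jumps(links, d=0.15, iterations=100):
    """Iterate p_{i+1}[v] = d / N + (1 - d) * sum over links (u, v) of p_i[u] / N_u.

    d is the probability of jumping to a random page (the slide's convention).
    Every page is assumed to appear as a key and to have at least one out-link.
    """
    nodes = list(links)
    n = len(nodes)                              # N: total number of pages
    p = {u: 1.0 / n for u in nodes}             # p_0: uniform initial prestige
    for _ in range(iterations):
        nxt = {v: d / n for v in nodes}         # random-jump contribution
        for u, out_links in links.items():
            share = (1 - d) * p[u] / len(out_links)
            for v in out_links:
                nxt[v] += share
        p = nxt
    return p

# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_with_jumps(links)
print(ranks)
```

Because every page receives at least $d / N$ in each step, no page ends up with zero prestige, and the random jump breaks the periodicity of cyclic paths.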

47 PageRank architecture at Google. The ranking of pages is more important than the exact values of $p$. Convergence of page ranks in 52 iterations for a crawl with 322 million links. Pre-compute and store the PageRank of each page. PageRank is independent of any query or textual content. The ranking scheme combines PageRank with textual match. Unpublished: many empirical parameters, human effort and regression testing.

48 Questions?


More information

Text and Web Mining A big challenge for Data Mining. Nguyen Hung Son Warsaw University

Text and Web Mining A big challenge for Data Mining. Nguyen Hung Son Warsaw University Text and Web Mining A big challenge for Data Mining Nguyen Hung Son Warsaw University Outline Text vs. Web mining Search Engine Inside: Why Search Engine so important Search Engine Architecture Crawling

More information

Social Search. Communities of users actively participating in the search process

Social Search. Communities of users actively participating in the search process Chapter 1 Social Search Social Search Social search Communities of users actively participating in the search process Goes beyond classical search tasks Key differences Users interact with the system Users

More information

Graph Processing and Social Networks

Graph Processing and Social Networks Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

More information

Text Analytics. Models for Information Retrieval 1. Ulf Leser

Text Analytics. Models for Information Retrieval 1. Ulf Leser Text Analytics Models for Information Retrieval 1 Ulf Leser Content of this Lecture IR Models Boolean Model Vector Space Model Relevance Feedback in the VSM Probabilistic Model Latent Semantic Indexing

More information

On-line Data De-duplication. Ιωάννης Κρομμύδας

On-line Data De-duplication. Ιωάννης Κρομμύδας On-line Data De-duplication Ιωάννης Κρομμύδας Roadmap Data Cleaning Importance Data Cleaning Tasks Challenges for Fuzzy Matching Baseline Method (fuzzy match data cleaning) Improvements: online data cleaning

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Geographical Classification of Documents Using Evidence from Wikipedia

Geographical Classification of Documents Using Evidence from Wikipedia Geographical Classification of Documents Using Evidence from Wikipedia Rafael Odon de Alencar (odon.rafael@gmail.com) Clodoveu Augusto Davis Jr. (clodoveu@dcc.ufmg.br) Marcos André Gonçalves (mgoncalv@dcc.ufmg.br)

More information

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised

More information

COMP3420: Advanced Databases and Data Mining. Web data mining

COMP3420: Advanced Databases and Data Mining. Web data mining COMP3420: Advanced Databases and Data Mining Web data mining Lecture outline The Web as a data source Challenges the Web poses to data mining Types of Web data mining Mining the Web page layout structure

More information

Tutorial, IEEE SERVICE 2014 Anchorage, Alaska

Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Tutorial, IEEE SERVICE 2014 Anchorage, Alaska Big Data Science: Fundamental, Techniques, and Challenges (Data Mining on Big Data) 2014. 6. 27. By Neil Y. Yen Presented by Incheon Paik University of Aizu

More information

Enhancing the Ranking of a Web Page in the Ocean of Data

Enhancing the Ranking of a Web Page in the Ocean of Data Database Systems Journal vol. IV, no. 3/2013 3 Enhancing the Ranking of a Web Page in the Ocean of Data Hitesh KUMAR SHARMA University of Petroleum and Energy Studies, India hkshitesh@gmail.com In today

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Efficient Identification of Starters and Followers in Social Media

Efficient Identification of Starters and Followers in Social Media Efficient Identification of Starters and Followers in Social Media Michael Mathioudakis Department of Computer Science University of Toronto mathiou@cs.toronto.edu Nick Koudas Department of Computer Science

More information

Web Page Scoring Based on Link Analysis of Web Page Sets

Web Page Scoring Based on Link Analysis of Web Page Sets Web Page Scoring Based on Link Analysis of Web Page Sets Hitoshi Nakakubo, Shinsuke Nakajima, Kenji Hatano, Jun Miyazaki, and Shunsuke Uemura Innovative Technology Development Center, U-TEC Corporation

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Incorporating Window-Based Passage-Level Evidence in Document Retrieval

Incorporating Window-Based Passage-Level Evidence in Document Retrieval Incorporating -Based Passage-Level Evidence in Document Retrieval Wensi Xi, Richard Xu-Rong, Christopher S.G. Khoo Center for Advanced Information Systems School of Applied Science Nanyang Technological

More information

Database Ph.D. Qualifying Exam Spring 2008 (April 4, 2008)

Database Ph.D. Qualifying Exam Spring 2008 (April 4, 2008) Database Ph.D. Qualifying Exam Spring 2008 (April 4, 2008) NOTES: (1) THE EXAM IS DIVIDED INTO 5 LONG QUESTIONS AND 5 SHORT QUESTIONS. THE GENERAL SCOPE OF THE QUESTION IS INDICATED FOR EACH QUESTION IN

More information

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

Data Mining in Web Search Engine Optimization and User Assisted Rank Results Data Mining in Web Search Engine Optimization and User Assisted Rank Results Minky Jindal Institute of Technology and Management Gurgaon 122017, Haryana, India Nisha kharb Institute of Technology and Management

More information

Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System

Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System Hani Abu-Salem* and Mahmoud Al-Omari Department of Computer Science, Mu tah University, P.O. Box (7), Mu tah,

More information

Recommender Systems Seminar Topic : Application Tung Do. 28. Januar 2014 TU Darmstadt Thanh Tung Do 1

Recommender Systems Seminar Topic : Application Tung Do. 28. Januar 2014 TU Darmstadt Thanh Tung Do 1 Recommender Systems Seminar Topic : Application Tung Do 28. Januar 2014 TU Darmstadt Thanh Tung Do 1 Agenda Google news personalization : Scalable Online Collaborative Filtering Algorithm, System Components

More information

Web Graph Analyzer Tool

Web Graph Analyzer Tool Web Graph Analyzer Tool Konstantin Avrachenkov INRIA Sophia Antipolis 2004, route des Lucioles, B.P.93 06902, France Email: K.Avrachenkov@sophia.inria.fr Danil Nemirovsky St.Petersburg State University

More information

Classification of Web Pages using TF-IDF and Ant Colony Optimization PAN EI SAN Ph.D Scholar, UCSY, Myanmar,

Classification of Web Pages using TF-IDF and Ant Colony Optimization PAN EI SAN Ph.D Scholar, UCSY, Myanmar, ISSN 2319-8885 Vol.03,Issue.46 December-2014, Pages:9450-9454 www.ijsetr.com Classification of Web Pages using TF-IDF and Ant Colony Optimization Ph.D Scholar, UCSY, Myanmar, E-mail: paneisan1985@gmail.com.

More information

Why is Internal Audit so Hard?

Why is Internal Audit so Hard? Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets

More information

Top Online Activities (Jupiter Communications, 2000) CS276A Text Information Retrieval, Mining, and Exploitation

Top Online Activities (Jupiter Communications, 2000) CS276A Text Information Retrieval, Mining, and Exploitation Top Online Activities (Jupiter Communications, 2000) CS276A Text Information Retrieval, Mining, and Exploitation Lecture 11 12 November, 2002 Email Web Search 88% 96% Special thanks to Andrei Broder, IBM

More information

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

More information

Recommending Web Pages using Item-based Collaborative Filtering Approaches

Recommending Web Pages using Item-based Collaborative Filtering Approaches Recommending Web Pages using Item-based Collaborative Filtering Approaches Sara Cadegnani 1, Francesco Guerra 1, Sergio Ilarri 2, María del Carmen Rodríguez-Hernández 2, Raquel Trillo-Lado 2, and Yannis

More information

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis Yusuf Yaslan and Zehra Cataltepe Istanbul Technical University, Computer Engineering Department, Maslak 34469 Istanbul, Turkey

More information

Mining Social Network Graphs

Mining Social Network Graphs Mining Social Network Graphs Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014 Social Network No introduc+on required Really? We s7ll need to understand

More information

Local Approximation of PageRank and Reverse PageRank

Local Approximation of PageRank and Reverse PageRank Local Approximation of PageRank and Reverse PageRank Ziv Bar-Yossef Department of Electrical Engineering Technion, Haifa, Israel and Google Haifa Engineering Center Haifa, Israel zivby@ee.technion.ac.il

More information