A toponym-based dual vector for topical relevance calculation in focused spatial crawling


Dongyang Hou 1,2, Hao Wu 1, Jun Chen 1

1 National Geomatics Center of China, Beijing 100830, China
2 School of Environment Science and Spatial Informatics, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China

1. Introduction

A focused crawler is a Web crawler that tries to download only pages that are relevant to a given topic of interest (Siemiński 2009, Almpanidis et al. 2007). A focused crawler therefore has to calculate the relevance between pages and the specific topic (Rungsawang and Angkawattanawit 2005). Recently, topics that involve spatial information, especially toponyms, have become much more common; one example is the Diaoyu Islands conflict between China and Japan. This is because about 18 percent of webpages worldwide describe location information and about 70 percent contain geographic information (Zhou et al. 2005, Hill 2009). For a topic involving toponyms, the focused crawler should ensure that the downloaded pages are relevant not only to the common topic but also to the toponyms, which requires a more accurate topical relevance calculation.

At present, most researchers adopt the Vector Space Model (VSM) (Fox 1984, Batsakis et al. 2009) or an improved VSM (Yang et al. 2010, Li 2008) to calculate topical relevance. In these methods the topic and the document are each represented as a single keyword vector, which means that toponyms are treated as common keywords. Treating toponyms as common keywords may reduce crawling accuracy, because one toponym can be involved in different common topics and one common topic can relate to many toponyms. To solve this problem, we distinguish toponyms from common keywords, following the idea of Geographic Information Retrieval (Purves et al. 2007), and develop a toponym-based dual vector for topical relevance calculation.

2. The toponym-based dual vector

In the traditional VSM, a topic T and a document D are each represented as a single vector, $V_T$ and $V_D$, as shown in equations (1) and (2), in which toponyms are regarded as common keywords.

$$V_T = (w_{T1}, w_{T2}, \ldots, w_{TN}) \tag{1}$$

$$V_D = (w_{D1}, w_{D2}, \ldots, w_{DN}) \tag{2}$$

where $w_{TN}$ and $w_{DN}$ represent the weight of the Nth keyword in topic T and document D respectively.

In this paper, toponyms are regarded as an independent vector and the remaining common keywords as another vector. We call this method the toponym-based dual vector. In the toponym-based dual vector, a topic T is represented as a toponym vector $TV_T$ and a common keyword vector $CKV_T$, as shown in equations (3) and (4). Correspondingly, a document D is expressed as $TV_D$ and $CKV_D$, as shown in equations (5) and (6).

$$TV_T = (tw_{T1}, tw_{T2}, \ldots, tw_{Tm}) \tag{3}$$

$$CKV_T = (cw_{T1}, cw_{T2}, \ldots, cw_{Tn}) \tag{4}$$

$$TV_D = (tw_{D1}, tw_{D2}, \ldots, tw_{Dm}) \tag{5}$$

$$CKV_D = (cw_{D1}, cw_{D2}, \ldots, cw_{Dn}) \tag{6}$$

where $tw_{Ti}$ and $tw_{Di}$ represent the weight of the ith toponym in topic T and document D respectively, and $cw_{Tj}$ and $cw_{Dj}$ indicate the weight of the jth common keyword in topic T and document D respectively. In this paper the topic weights $tw_{Ti}$ and $cw_{Tj}$ can be assigned manually or calculated from a corpus. The document weights $tw_{Di}$ and $cw_{Dj}$ are often calculated with the term frequency (tf) algorithm or the term frequency-inverse document frequency (tf-idf) algorithm. Because computing inverse document frequency (idf) weights during crawling may be problematic (Batsakis et al. 2009), the tf algorithm is employed in this paper. Therefore $tw_{Di} = tf_{ti}$ and $cw_{Dj} = tf_{cj}$, where $tf_{ti}$ and $tf_{cj}$ represent the number of occurrences of the ith toponym and the jth common keyword in the document.

3. Relevance calculation utilizing toponym-based dual vector

In this paper the topic T and the document D are both represented by two vectors, so the relevance calculation is carried out in two steps, as shown in figure 1.

The first step calculates the relevance between the common keyword vectors of the topic and the document. For the common keyword vectors $CKV_T$ and $CKV_D$, the relevance is calculated with the cosine similarity function (Hao et al. 2011), as equation (7) shows.

$$Sim_{CKV}(T, D) = \frac{\sum_{j=1}^{n} cw_{Tj}\, cw_{Dj}}{\sqrt{\sum_{j=1}^{n} cw_{Tj}^{2}}\; \sqrt{\sum_{j=1}^{n} cw_{Dj}^{2}}} \tag{7}$$

If $Sim_{CKV}(T, D)$ is greater than the specified threshold, the next step is carried out; otherwise the document is abandoned.

The second step computes the relevance between the toponym vectors. Similarly, the relevance between $TV_T$ and $TV_D$ is calculated as shown in equation (8).

$$Sim_{TV}(T, D) = \frac{\sum_{i=1}^{m} tw_{Ti}\, tw_{Di}}{\sqrt{\sum_{i=1}^{m} tw_{Ti}^{2}}\; \sqrt{\sum_{i=1}^{m} tw_{Di}^{2}}} \tag{8}$$
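As a minimal sketch of the dual-vector representation and the two cosine similarities described above, the Python below builds the two tf vectors of a document and compares each with the corresponding topic vector. The function names, the small vocabulary, the token list, and the uniform topic weights are illustrative assumptions and are not taken from the paper. The document is assumed to be already segmented into terms, so a multi-word toponym arrives as a single token.

```python
import math
from collections import Counter


def tf_vector(tokens, vocabulary):
    """Return raw term-frequency weights, one per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def dual_vector(tokens, toponyms, common_keywords):
    """Split a document into a toponym vector and a common keyword vector."""
    return tf_vector(tokens, toponyms), tf_vector(tokens, common_keywords)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and topic weights (assumptions for illustration).
toponyms = ["Diaoyu Islands", "Japan", "China"]
common_keywords = ["cruise", "conflict"]
tv_t = [1.0, 1.0, 1.0]   # topic toponym weights
ckv_t = [1.0, 1.0]       # topic common keyword weights

# Hypothetical pre-segmented document.
tokens = ["cruise", "Japan", "China", "cruise", "Japan", "Diaoyu Islands"]

tv_d, ckv_d = dual_vector(tokens, toponyms, common_keywords)
sim_ckv = cosine(ckv_t, ckv_d)   # common keyword relevance
sim_tv = cosine(tv_t, tv_d)      # toponym relevance
print(tv_d, ckv_d, round(sim_ckv, 2), round(sim_tv, 2))
```

Returning 0.0 for an all-zero vector is a design choice made here so that a document containing none of the topic terms simply fails the threshold test instead of raising a division error.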

If the toponym relevance $Sim_{TV}(T, D)$ is greater than the specified threshold, the document is judged to be relevant to the given topic. Finally, the final relevance, which is used to predict URL queue priority, is the weighted average of the common keyword relevance $Sim_{CKV}(T, D)$ and the toponym relevance $Sim_{TV}(T, D)$:

$$Sim(T, D) = a \cdot Sim_{CKV}(T, D) + b \cdot Sim_{TV}(T, D)$$

where a and b are weighting factors that satisfy $a + b = 1$. In this paper a is set to 0.4 and b to 0.6. Note that if the topic does not contain any toponym, the method is the same as the traditional method; that is, only the first step is carried out.

[Figure 1: flowchart. The document is segmented into common keywords and toponyms. Relevance calculation 1 compares the common keywords of the topic and the document; if the result exceeds the threshold, relevance calculation 2 compares their toponyms; if that result also exceeds the threshold, the final relevance is computed. In either "No" case the document is abandoned.]

Figure 1. Frame diagram of relevance calculation utilizing toponym-based dual vector

4. Experiment

To validate the toponym-based dual vector for topical relevance calculation, a simple experiment is carried out. The given topic is the Diaoyu Islands conflict between China and Japan. Five common keywords and five toponyms, extracted from sample Chinese news articles, represent the topic.
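The two-step decision of Section 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, a is applied to the common keyword relevance and b to the toponym relevance, and a topic without toponyms simply returns the step-1 relevance. The input values are the second and third value columns of Table 2, read here as the step-1 and step-2 similarities; that reading is an assumption, chosen because it matches the filtering outcome the experiment reports.

```python
from typing import Optional


def two_step_relevance(sim_ckv: float,
                       sim_tv: Optional[float],
                       ckv_threshold: float = 0.5,
                       tv_threshold: float = 0.5,
                       a: float = 0.4,
                       b: float = 0.6) -> Optional[float]:
    """Return the final relevance, or None if the document is abandoned."""
    # Step 1: common keyword relevance must exceed its threshold.
    if sim_ckv <= ckv_threshold:
        return None
    # Topic without toponyms: only step 1 applies (assumed return value).
    if sim_tv is None:
        return sim_ckv
    # Step 2: toponym relevance must exceed its threshold.
    if sim_tv <= tv_threshold:
        return None
    # Weighted average used for URL queue priority (a + b = 1).
    return a * sim_ckv + b * sim_tv


# (step-1 similarity, step-2 similarity) per test document, as read from Table 2.
table2 = {
    "D1": (0.71, 0.87),
    "D2": (0.79, 0.42),
    "D3": (0.64, 0.97),
    "D4": (0.36, 0.70),
}

results = {doc: two_step_relevance(ckv, tv) for doc, (ckv, tv) in table2.items()}
for doc, score in results.items():
    print(doc, "abandoned" if score is None else round(score, 3))
```

With all thresholds at 0.5, D1 and D3 are kept and D2 and D4 are abandoned, which agrees with the outcome described for the proposed method.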

In table 1, four test documents D1, D2, D3 and D4 are represented by the ten keywords. D1 and D3 are relevant to the given topic, D2 is about the Huangyan Island conflict between China and the Philippines, and D4 is about an economic comparison between China and Japan. GSV is the abbreviation of government surveillance vessel. In table 2, the relevance between the four documents and the topic is calculated with the traditional method and with the proposed method.

      Cruise  On island  Conflict  Protest  GSV  Diaoyu Islands  Japan  China  Taiwan  Hong Kong
D1    3       0          0         0        3    1               3      2      0       0
D2    7       0          5         0        7    0               0      8      0       0
D3    0       2          4         2        0    37              23     10     5       1
D4    0       0          1         0        0    0               37     31     0       0

Table 1. Document representation for four test documents

      Traditional method  Proposed method, step 1 (common keywords)  Proposed method, step 2 (toponyms)
D1    0.62                0.71                                       0.87
D2    0.31                0.79                                       0.42
D3    0.97                0.64                                       0.97
D4    0.69                0.36                                       0.70

Table 2. Results of the relevance calculation utilising two methods

As shown in table 2, if all threshold values are 0.5, the traditional method filters out only D2, so the irrelevant document D4 is kept. With the new strategy, D4 is filtered out in the first step and D2 is filtered out in the second step. Therefore, the new method can improve the accuracy of relevance judgments.

5. Conclusions

For topics involving toponyms, we develop a toponym-based dual vector for topical relevance calculation. In this method the toponyms are regarded as an independent vector and the relevance calculation is carried out in two steps. From a simple experiment on four test documents, we conclude that the proposed method can improve the accuracy of relevance judgments. In the future, we will carry out more complex experiments to further validate the method's performance.

6. References

Almpanidis G, Kotropoulos C and Pitas I, 2007, Combining text and link analysis for focused crawler-an application for vertical search engines. Information Systems, 32:886-908.

Batsakis S, Petrakis E G M and Milios E, 2009, Improving the performance of focused web crawlers. Data & Knowledge Engineering, 68:1001-1013.

Fox E A, 1984, Extending the boolean and vector space models of Information Retrieval with p-norm queries and multiple concept types, Dissertation Abstracts International Part B: Science and Engineering, NYC: Cornell University, 44(9):386.

Hao H W, Mu C X, Yin X C and Li S et al., 2011, An Improved Topic Relevance Algorithm for Focused Crawling. IEEE International Conference on Systems, Man, and Cybernetics (SMC), Anchorage, AK, 850-855.

Hill L L, 2009, Georeferencing: The Geographic Associations of Information. The MIT Press.

Lu C S, 2011, Thematic VSM based on ontology semantic tree. Computer Systems & Applications, 20(10):44-48. (in Chinese)

Purves R S, Clough P, Jones C B and Arampatzis A et al., 2007, The design and implementation of SPIRIT: a spatially aware search engine for information retrieval on the Internet. International Journal of Geographical Information Science, 21(7):717-745.

Rungsawang A, Angkawattanawit N, 2005, Learnable topic-specific web crawler, Journal of Network and Computer Applications, 28(2):97-114.

Siemiński A, 2009, Using WordNet to Measure the Similarity of Link Texts, In: Nguyen N T, Kowalczyk R and Chen S M (eds), Proceedings of First International Conference, ICCCI 2009, Wrocław, Poland, October, 720-731.

Yang X, Sui A N and Tang Z K, 2010, Topical Crawler Based On Multi-Level Vector Space Model and Optimized Hyperlink Chosen Strategy, In: Sun F, Wang Y, Lu J, Zhang B, Kinsner W and Zadeh L A (eds.), Proceedings of the 9th IEEE International Conference on Cognitive Informatics (ICCI 2010), 430-435.

Zhou Y H, Xie X, Wang C and Gong Y C et al., 2005, Hybrid index structures for location-based web search. Proceedings of the 14th ACM international conference on Information and knowledge management. Bremen, Germany, 155-162.