Search Engine Architecture I

Software Architecture
- The high-level structure of a software system
  - Software components
  - The interfaces provided by those components
  - The relationships between those components

J. Pei: Information Retrieval and Web Search -- Search Engine Architecture

UIMA
- An architecture that provides a standard for integrating search and related language technology components
  - Unstructured Information Management Architecture (www.research.ibm.com/uima)
- Defines interfaces for components to simplify the addition of new technologies into systems that handle text and other unstructured data

UIMA
[Figure: UIMA architecture block diagram, http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.architecturehighlights.html/$file/blockdiagram.gif]

A Good Reference

Primary Goals of Search Engines
- Effectiveness (quality): retrieve the most relevant set of documents for a query
  - Process text and store text statistics to improve relevance
- Efficiency (speed): process queries from users as fast as possible
  - Use specialized data structures
- More specific goals usually fall under these two primary goals
  - Example: handling changing document collections is both an effectiveness issue and an efficiency issue

Two Major Functions
- Search engine components support two major functions
  - The indexing process: building data structures that enable searching
  - The query process: using those data structures to produce a ranked list of documents for a user's query
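
The two functions can be illustrated with a minimal sketch. The names `build_index` and `run_query`, the toy documents, and the term-overlap scoring are assumptions made for illustration; a real engine would use a proper retrieval model for ranking.

```python
from collections import defaultdict

def build_index(docs):
    """Indexing process: map each term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def run_query(index, query):
    """Query process: rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy collection (hypothetical data).
docs = {
    "d1": "tropical fish tank",
    "d2": "fish and chips",
    "d3": "search engine architecture",
}
index = build_index(docs)
ranking = run_query(index, "tropical fish")
```

The index is built once and reused for every query, which is why the two processes are treated as separate parts of the architecture.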

The Indexing Process

Text Acquisition
- Identifying and making available the documents that will be searched
- How?
  - Crawling or scanning the web, a corporate intranet, or other sources of information
  - Building a document data store containing the text and metadata for all the documents
    - Metadata: document type, document structure, document length, ...

Crawlers
- Identifying and acquiring documents for the search engine
- Web crawler: following links on web pages to discover new pages
  - Efficiency: how to handle the huge volume of new pages and updated pages
  - A web crawler restricted to a single site supports site search
- Topic-based/focused crawlers: using classification techniques to restrict the crawl to pages that are likely relevant to a specific topic
  - Used in vertical or topical search
- Enterprise document crawler: following links to discover both internal and external pages
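
A link-following crawler can be sketched as a breadth-first traversal. This is an illustration under stated assumptions: `TOY_WEB` is a hypothetical in-memory link graph that stands in for real HTTP fetching and link parsing, and `same_site_only` models the single-site restriction used for site search.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real page fetches.
TOY_WEB = {
    "http://a.example/": ["http://a.example/x", "http://b.example/"],
    "http://a.example/x": ["http://a.example/", "http://a.example/y"],
    "http://a.example/y": [],
    "http://b.example/": ["http://a.example/y"],
}

def toy_fetch_links(url):
    """Return the outgoing links of a page (here: a dictionary lookup)."""
    return TOY_WEB.get(url, [])

def crawl(seed, fetch_links, same_site_only=False):
    """Breadth-first crawl from a seed URL; returns pages in visit order."""
    seed_host = urlparse(seed).netloc
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            # Site search: ignore links that leave the seed's host.
            if same_site_only and urlparse(link).netloc != seed_host:
                continue
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

site_pages = crawl("http://a.example/", toy_fetch_links, same_site_only=True)
all_pages = crawl("http://a.example/", toy_fetch_links)
```

A focused crawler would add one more filter at the same place as the host check, for example a classifier that decides whether a fetched page is on topic.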

Document Feeds
- A mechanism for accessing a real-time stream of documents
- RSS: a common standard used for web feeds for content such as news, blogs, or video
  - An RSS reader subscribes to RSS feeds and provides new content when it arrives
  - RSS feeds are formatted in XML
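
Because RSS feeds are XML, a feed reader can be sketched with Python's standard `xml.etree.ElementTree` module. The feed content in `SAMPLE_RSS` and the helper `read_feed` are made up for illustration; a real reader would download the feed periodically instead of using a string constant.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real reader would fetch this over HTTP.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://news.example/1</link></item>
    <item><title>Second story</title><link>http://news.example/2</link></item>
  </channel>
</rss>"""

def read_feed(xml_text, seen_links):
    """Return (title, link) pairs for items not seen on earlier polls."""
    root = ET.fromstring(xml_text)
    new_items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link not in seen_links:
            seen_links.add(link)
            new_items.append((item.findtext("title"), link))
    return new_items

seen = set()
first_poll = read_feed(SAMPLE_RSS, seen)   # both items are new
second_poll = read_feed(SAMPLE_RSS, seen)  # nothing new since the last poll
```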

Conversion
- Converting a variety of formats (e.g., HTML, XML, PDF, ...) into a consistent text and metadata format
- Resolving encoding problems
  - Using ASCII (7 bits) or extended ASCII (8 bits) for English
  - Using Unicode (typically encoded as UTF-8 or UTF-16) for international languages
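
As a rough sketch of conversion, the example below decodes raw bytes into text and then reduces an HTML page to a title (metadata) plus body text using the standard `html.parser` module. The class `TextExtractor`, the function `convert_html`, and the output dictionary layout are assumptions made for illustration; other formats such as PDF need their own converters.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the page title and the visible text, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def convert_html(raw_bytes, encoding="utf-8"):
    """Decode bytes with the given encoding, then extract text and metadata."""
    html = raw_bytes.decode(encoding, errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "text": " ".join(parser.chunks)}

raw = ("<html><head><title>Café menu</title><style>p{}</style></head>"
       "<body><p>Crème brûlée</p></body></html>").encode("utf-8")
doc = convert_html(raw)
```

Decoding the same bytes with the wrong encoding would silently corrupt the accented characters, which is the kind of encoding problem the conversion step has to resolve.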

Document Data Stores
- A simple database to manage large numbers of documents and structured data
  - Document components are typically stored in compressed form for efficiency
  - Structured data consists of document metadata and other information extracted from the documents, such as links and anchor text
- [Discussion] Why do we need document data stores local to search engines?
  - The original documents are available on the web
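
A document data store of this kind might look like the following sketch, which keeps compressed text and a metadata record per document. The class `DocumentStore` and its methods are hypothetical, and `zlib` is only one possible compression choice.

```python
import zlib

class DocumentStore:
    """In-memory store: compressed document text plus a metadata record."""

    def __init__(self):
        self._text = {}
        self._meta = {}

    def put(self, doc_id, text, **metadata):
        # Store the text compressed; record its length in tokens as metadata.
        self._text[doc_id] = zlib.compress(text.encode("utf-8"))
        self._meta[doc_id] = dict(metadata, length=len(text.split()))

    def get_text(self, doc_id):
        return zlib.decompress(self._text[doc_id]).decode("utf-8")

    def get_metadata(self, doc_id):
        return self._meta[doc_id]

    def stored_bytes(self, doc_id):
        return len(self._text[doc_id])

store = DocumentStore()
page_text = "fish " * 200  # repetitive toy text, so it compresses well
store.put("d1", page_text, url="http://a.example/", doc_type="html")
```

Keeping a local copy lets the engine fetch text quickly, for example to generate result snippets, without going back to the original source.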

The Indexing Process

Text Transformation / Index Creation
- Text transformation: transforming documents into index terms or features
  - Index terms: the parts of a document that are stored in the index and used in searching
  - Features: parts of a text document that are used to represent its content
    - Examples: phrases, names, dates, links, ...
  - Index vocabulary: the set of all the terms that are indexed for a document collection
- Index creation: creating the indexes or data structures
  - Example: building inverted list indexes

Parsers, Stopping, and Stemming
- Parsing: processing the sequence of text tokens in the document to recognize structural elements such as titles, figures, links, and headings
  - Tokenization: identifying units to be indexed
  - Using syntax of markup languages to identify structures
- Stopping: removing common words from the stream of tokens, e.g., the, of, to, ...
  - Reduces index size considerably
- Stemming: grouping words that are derived from a common stem
  - Example: fish, fishes, fishing
  - Increases the likelihood that words used in queries and documents will match
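
The three steps can be chained as in the sketch below. The stopword list is deliberately tiny and `crude_stem` is a simplistic suffix stripper written for this example; it is not the Porter stemmer or any other standard algorithm.

```python
import re

STOPWORDS = {"the", "of", "to", "a", "and", "in"}  # tiny illustrative list

def tokenize(text):
    """Tokenization: lowercase the text and split it into alphanumeric units."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    """Stopping: drop very common words from the token stream."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Stemming (crude): strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def to_index_terms(text):
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]

terms = to_index_terms("The fishes of the lake: fishing to fish")
# -> ['fish', 'lake', 'fish', 'fish']
```

After stemming, a query for "fishing" and a document containing "fishes" both map to the term "fish", so they can match.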

Other Text Transformation Tasks
- Link extraction and analysis
  - Links can be indexed separately from the general text content
  - Using link analysis algorithms, e.g., PageRank, to quantify page popularity and find authority pages
  - Using anchor text to enhance the text content of the page that the link points to
- Information extraction: identifying index terms that are more complex than single words
  - Entity identification, e.g., finding names
- Classifiers: identifying class-related metadata for documents or parts of documents
  - Example: finding spam documents and non-content parts of documents (e.g., ads)
  - Alternatively, clustering related documents
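
As an illustration of link analysis, the following is a minimal power-iteration sketch of PageRank on a toy link graph. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outgoing links are common choices assumed here, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along links; ranks always sum to 1."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph (hypothetical): a, b, and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
most_popular = max(ranks, key=ranks.get)
```

In this graph `c` ends up with the highest rank because three pages link to it, and `a` ranks above `b` and `d` because it is linked from the popular page `c`.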

The Indexing Process

Collecting Document Statistics
- Gathering and recording statistical information about words, features, and documents
  - Statistics will be used to compute scores of documents
  - Stored in lookup tables
- Examples
  - Counts of index term occurrences
  - Positions in the documents where the index terms occurred
  - Counts of occurrences over groups of documents
  - Lengths of documents in terms of the number of tokens
- The actual data collected depends on the retrieval model and the associated ranking method
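
The statistics listed above could be gathered in one pass over tokenized documents, as in this sketch. The function name `collect_statistics` and the table layouts are assumptions; plain dictionaries stand in for the lookup tables.

```python
from collections import Counter, defaultdict

def collect_statistics(docs):
    """Gather per-document and collection-wide statistics from token lists."""
    term_counts = {}               # doc_id -> Counter of term occurrences
    positions = defaultdict(dict)  # term -> {doc_id: [token positions]}
    doc_lengths = {}               # doc_id -> number of tokens
    doc_freq = Counter()           # term -> number of documents containing it
    for doc_id, tokens in docs.items():
        term_counts[doc_id] = Counter(tokens)
        doc_lengths[doc_id] = len(tokens)
        for pos, term in enumerate(tokens):
            positions[term].setdefault(doc_id, []).append(pos)
        doc_freq.update(set(tokens))
    return term_counts, positions, doc_lengths, doc_freq

# Toy tokenized collection (hypothetical data).
docs = {"d1": ["fish", "tank", "fish"], "d2": ["fish", "chips"]}
term_counts, positions, doc_lengths, doc_freq = collect_statistics(docs)
```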

Weighting
- Calculating index term weights using the document statistics and storing them in lookup tables
  - Pre-computation can improve query answering efficiency
- TF/IDF weighting
  - TF (term frequency): the frequency of index term occurrences in a document
  - IDF (inverse document frequency): the inverse of the fraction of indexed documents that contain the term
    - N/n (N: number of documents indexed, n: number of documents containing a particular term); in practice the logarithm, log(N/n), is commonly used
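
A small worked example of TF/IDF weighting follows. It assumes the common variant weight = tf * log(N/n) with raw term counts as tf; the slide itself only gives the ratio N/n, so the logarithm and the function name `tf_idf_weights` are assumptions of this sketch.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Precompute a weight for every (document, term) pair."""
    n_docs = len(docs)                      # N: number of documents indexed
    doc_freq = Counter()                    # n: documents containing each term
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return weights

# Toy tokenized collection (hypothetical data).
docs = {
    "d1": ["fish", "tank", "fish"],
    "d2": ["fish", "chips"],
    "d3": ["search", "engine"],
}
weights = tf_idf_weights(docs)
```

Here "fish" occurs twice in `d1` but appears in two of the three documents, so its weight, 2 * log(3/2), is lower than that of "tank", log(3), which occurs once but only in `d1`.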

Inversion and Index Distribution
- Inversion: changing the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes
  - Core of the indexing process
  - The number of documents is large
  - The indexes are updated with new documents from feeds and crawls, and are often compressed for high efficiency
- Index distribution: distributing indexes across multiple computers/sites on a network
  - Document distribution, term distribution, and replication
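
Inversion and document distribution can be sketched as follows. The stream format of (document id, term) pairs, the function names, and the modulo-based shard assignment are assumptions made for illustration; compression and replication are left out.

```python
from collections import defaultdict

def invert(doc_term_stream):
    """Turn (doc_id, term) pairs into term -> [(doc_id, count), ...] postings."""
    postings = defaultdict(dict)
    for doc_id, term in doc_term_stream:
        postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # Keep each posting list sorted by document id.
    return {term: sorted(by_doc.items()) for term, by_doc in postings.items()}

def distribute_by_document(doc_term_stream, n_shards):
    """Document distribution: each shard indexes its own subset of documents."""
    streams = [[] for _ in range(n_shards)]
    for doc_id, term in doc_term_stream:
        streams[doc_id % n_shards].append((doc_id, term))
    return [invert(stream) for stream in streams]

# Toy document-term stream (hypothetical data, integer document ids).
stream = [(1, "fish"), (1, "tank"), (1, "fish"), (2, "fish"), (2, "chips")]
inverted = invert(stream)
shards = distribute_by_document(stream, n_shards=2)
```

With document distribution, a query is sent to every shard and the partial results are merged. With term distribution, the posting lists themselves would be split across machines by term instead.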

Summary
- A high-level description of search engine software architecture
- The indexing process
  - Building blocks and their functionalities

To-Do List
- Expand the figures of the indexing process to include the detailed functionalities
- Read Chapter 2.1-2.3.3