DEPARTMENT OF COMPUTER SCIENCE CORNELL UNIVERSITY

Similar documents

I. The SMART Project - Status Report and Plans. G. Salton. The SMART document retrieval system has been operating on a 709^

Schneps, Leila; Colmez, Coralie. Math on Trial : How Numbers Get Used and Abused in the Courtroom. New York, NY, USA: Basic Books, p i.

Hi iv. Declaration Certificate Acknowledgement Preface. List o f Table. List o f Figures. viii xvi xvii. 1.1 Introduction 1

Application of Accounting Standards to Financial Year Accounts and Consolidated Accounts of Disclosing Entities other than Companies

Learn AX: A Beginner s Guide to Microsoft Dynamics AX. Managing Users and Role Based Security in Microsoft Dynamics AX Dynamics101 ACADEMY

Implementation Plan: Development of an asset and financial planning management. Australian Capital Territory

Contents RELATIONAL DATABASES

THE COMPLETE PROJECT MANAGEMENT METHODOLOGY AND TOOLKIT

Agenda item number: 5 FINANCE AND PERFORMANCE MANAGEMENT OVERVIEW AND SCRUTINY COMMITTEE FUTURE WORK PROGRAMME

Introduction to Windchill Projectlink 10.2

Switching and Finite Automata Theory

Business Administration of Windchill PDMLink 10.0

3 BUSINESS ACCOUNTING STANDARD,,INCOME STATEMENT I. GENERAL PROVISIONS

TABLE OF CONTENTS CHAPTER NO. TITLE PAGE NO. ABSTRACT iii LIST OF TABLES LIST OF FIGURES LIST OF ABBREVIATIONS

CHAPTER 42A. Case management of certain personal injuries actions. 42A.1. (1) Subject to paragraph (3), this Chapter applies to actions

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

TABLE OF CONTENTS. Dedication. Table of Contents. Preface. Overview of Wireless Networks. vii xvii

Windchill PDMLink Curriculum Guide

An Information Retrieval using weighted Index Terms in Natural Language document collections

TABLE OF CONTENTS CHAPTER TITLE PAGE

Data Security at the KOKU

CITY UNIVERSITY OF HONG KONG. A Study of Electromagnetic Radiation and Specific Absorption Rate of Mobile Phones with Fractional Human Head Models

TITLE 9. HEALTH SERVICES CHAPTER 1. DEPARTMENT OF HEALTH SERVICES ADMINISTRATION ARTICLE 4. CODES AND STANDARDS REFERENCED

Master of Library & Information Science (MLIS)

FINAL JOINT PRETRIAL ORDER. This matter is before the Court on a Final Pretrial Conference pursuant to R. 4:25-1.

DIGITAL MARKETING PROPOSAL. Stage 1: SEO Audit/Correction.

NATIONAL UNIVERSITY OF SCIENCE AND TECHNOLOGY

THE MANAGEMENT OF SICKNESS ABSENCE BY NHS TRUSTS IN WALES

To define and explain different learning styles and learning strategies.

SOUTH DAKOTA STATE UNIVERSITY Policy and Procedure Manual

San$Diego$Imperial$Counties$Region$of$Narcotics$Anonymous$ Western$Service$Learning$Days$$ XXX$Host$Committee!Guidelines$ $$

COVERAGE: All Employees

1. Who can use Agent Portal? 2. What is the definition of an active agent? 3. How to access Agent portal? 4. How to login?

b) Discussion of Bid c) Voting (1) Results: Coastal Carolina wins B. State Communications Coordinator of the Year 1. Winthrop University

Department of International Trade at Feng Chia University Master s Program Requirements Policy

Regulatory Guide Configuration Management Plans for Digital Computer Software Used in Safety Systems of Nuclear Power Plants

Introduction. Part I: Finding Bottlenecks when Something s Wrong. Chapter 1: Performance Tuning 3

FLORIDA STATE COLLEGE AT JACKSONVILLE NON-COLLEGE CREDIT COURSE OUTLINE. Insurance Customer Service Representative

Workflow Administration of Windchill 10.2

FHE DEFINITIVE GUIDE. ^phihri^^lv JEFFREY GARBUS. Joe Celko. Alvin Chang. PLAMEN ratchev JONES & BARTLETT LEARN IN G. y ti rvrrtuttnrr i t i r

Introduction. Acknowledgments Support & Feedback Preparing for the Exam. Chapter 1 Plan and deploy a server infrastructure 1

vii TABLE OF CONTENTS CHAPTER TITLE PAGE DECLARATION DEDICATION ACKNOWLEDGEMENT ABSTRACT ABSTRAK

College Governance Statement of Principles, Scheme of Delegation and Terms of Reference

IBM DB2 Data Archive Expert for z/os:

Consolidated Annual Report of the AB Capital Group for the financial year 2008/2009. covering the period from July 1, 2008 to June 30, 2009

APPENDIX 3B Financial Criteria for Retention on the Specialist List and Requirements for Acceptance of a Tender

APPENDIX 7-B SUGGESTED OUTLINE OF A QUALITY ASSURANCE PROJECT PLAN

Introduction to Windchill PDMLink 10.0 for Heavy Users

Integrity 10. Curriculum Guide

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

1 INTRODUCTION TO SYSTEM ANALYSIS AND DESIGN

Screen Design : Navigation, Windows, Controls, Text,

HUMAN RESOURCES. Resourcing and Appointment. HR 1.1 Recruitment, Selection and Appointment

Beginning C# 5.0. Databases. Vidya Vrat Agarwal. Second Edition

EDWARDSTOWN SOLDIERS MEMORIAL RECREATION GROUND BACKGROUND REPORT

Contents. 1 Introduction. 2 Feature List. 3 Feature Interaction Matrix. 4 Feature Interactions

PROJECT MANAGEMENT PROFESSIONAL CERTIFIED ASSOCIATE IN PROJECT MANAGEMENT (PMP & CAPM) EXAM PREPARATION WORKSHOP

SECTION 106 CLAWBACK MAJOR 1. means the actual additional pre-tax profit of the Development Proposal to be calculated in accordance with paragraph B10

FIRST QUESTIONNAIRE ON OTHER CRA PRODUCTS

Enrolled Copy S.B. 40

STATE UNIVERSITY OF NEW YORK COLLEGE OF TECHNOLOGY CANTON, NEW YORK COURSE OUTLINE JUST 201 CRITICAL ISSUES IN CRIMINAL JUSTICE

Computer-Aided Multivariate Analysis

Professional Review Guide for the CCS-P Examination 2009 Edition

STATE UNIVERSITY OF NEW YORK COLLEGE OF TECHNOLOGY CANTON, NEW YORK COURSE OUTLINE EADM 400 INCIDENT COMMAND: SYSTEM COORDINATION AND ASSESSMENT

TABLE OF CONTENTS ABSTRACT ACKNOWLEDGEMENT LIST OF FIGURES LIST OF TABLES

Children s Council of the International Technology and Engineering Educators Association

IEEE Computer Society Certified Software Development Associate Beta Exam Application

1 of 7 31/10/ :34

Oracle Database 11g: Administer a Data Warehouse

SOFTWARE TESTING AS A SERVICE

THE FIRST SCHEDULE (See rule 7) Table I - FEES PAYABLE

Oracle Property Manager User Guide

ATTORNEYS MAKING OUT LIKE BANDITS: IT IS LEGAL, BUT IS IT ETHICAL? By Elizabeth Ann Escobar

A Health and Social Care Research and Development Strategic Framework for Wales

Practical Applications of DATA MINING. Sang C Suh Texas A&M University Commerce JONES & BARTLETT LEARNING

Implementing and Administering an Enterprise SharePoint Environment

Regulatory Story. RNS Number : 8343I. DCD Media PLC. 08 July TR-1: NOTIFICATION OF MAJOR INTEREST IN SHARES i

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

To define and explain different learning styles and learning strategies.

Contents. vii. Preface. P ART I THE HONEYNET 1 Chapter 1 The Beginning 3. Chapter 2 Honeypots 17. xix

Declaration to be submitted by directors in the Applicant Company 1

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol

15 Organisation/ICT/02/01/15 Back- up

Perceived Workplace Discrimination as a Mediator of the Relationship between Work Environment and Employee Outcomes: Does Minority Status Matter?

The Essential Guide to User Interface Design An Introduction to GUI Design Principles and Techniques

SITE PHOTOGRAPHS (GOOGLE EARTH): ROAD REHABILITATION Refer to Appendix B for a map of the viewpoint locations.

Assessment of Personal Injury Damages

West Virginia Legal Research

Transcription:

DEPARTMENT OF COMPUTER SCIENCE CORNELL UNIVERSITY INFORMATION STORAGE AND RETRIEVAL Scientific Report No. ISR-11 to The National Science Foundation Ithaca, New York June 1966 Gerard Salton Project Director

Department of Computer Science Cornell University Ithaca, New York IW50 Scientific Report No. ISR-11 INFORMATION STORAGE AND RETRIEVAL to The National Science Foundation Ithaca, New York June 1966 Gerard Salton Project Director

Copyright, 1966 by Cornell University Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the Uhited States Government.

Staff of the Department of Computer Science Cornell Uhiversity Richard W. Conway Margaret Dodd Patrick C. Fischer Juris Hartmanis Michael Keen Joann Newman Christopher Pottle Gerard Salton Sidney Saltzman Robert J. Walker Project Staff at the Aiken Computation Laboratory Harvard University Jeffrey Bean Claudine Harris Guy Hochgesang Michael Lesk iii

TABLE OF CONTENTS SUMMARY Page xiii SECTION I SALTON, G.: "The SMART System ~ Retrieval Results and Future Plans" 1. Introduction 1-1 2. Experimental Results......... 1-3 3. Discussion and Future Plans 1-5 SECTION II LESK, M. E.: "Operating Instructions for the SMART Text Processing and Document Retrieval System" 1. Introduction. II-1 1.1. Processing Summary II-2 1.2. Operating Programs II-4 2. Basic Operating Procedures H-4 2.1. Run Outline H-4 2.2. Tape Setup........ H-6 2.3. Input Deck Setup H-T 3. Specifications for the SMART Retrieval System.. II-7 3.1. Specifications Affecting Lookup... H-9 3*2. Specifications Affecting Hirase Searching 11-10 iv

TABLE OF CONTENTS (continued) Page SECTION II (continued) 3.3. Vector Expansions by Means of Concept- Concept Correlation.. 11-12 3A. Vector Expansion by Means of Concept Hierarchies 11-15 3-5- Vector Formation II-16 3-6. Request-Document Correlation. U-17 3-7* Document-Document Expansion 11-19 3*8. Other Specifications 11-19 h. Data Input. 11-20 fc.l. Natural Language Documents 11-20 b.2. Binary Documents 11-28 fc-3* DOCTAFS. 11-29 h.k. Relevance Judgment Data 11-29 k.5. Other Instruction Cards H-30 5* Tape Preparation Programs 11-31 5-1. Writing a New library Tape.... n-31 5* 1.1* Thesaurus and Suffix List Formation.. * 11*33 5* 1*2. Statistical Phrase Dictionary.... H-35 5.1.3* Syntactic Suffix List...... H-37 5*1^» The Condensed Oramnar Pile II-38 5*1*5* The Criterion Tree Pile n-fco 5*1.5.1. Criterion Tree Input Format 11-41 5*1*6. Hierarchy 11-50 5*2. The Document Tape 11-51 5»3«The Program Tape U-52 v

TABLE OF CONTENTS (continued) Page SECTION n (continued) 6. Auxiliary Programs for Use with the SMART System 11-52 6.1. THES. 11-54 6.2. M)RVAL 11-54 6.3. SOCCER n-55 7* A Sample Input Deck 11-56 8. Miscellaneous........... 11-59 8.1. Size Limits. H-59 8.2. Timing 11-59 9* Acknowledgments... 11-60 SECTION III HOCHGESAN^r, G. T.: "SOCCER - A Concordance Program 11 1. Introduction HI-1 2. The Concordance HI-2 A. Definitions ni-2 B. The Input Text HI-2 C. Processing the Text HI-2 D. The Output Format HI-6 3. Tape Usage III-6 A. Control Cards HI-6 B. The INPUT, 0UTPUT, and SMRTAP Tapes.. III-6 vi

o J TABLE OP CONTENTS (continued) Page SECTION IH (continued) C. Scratch Tapes m-7 k. Control Cards HI-8 5. Examples of S0CCER Usage..... HI-11 6. Subroutines used Toy S0CCER III-13 A. IN0T.. in-13 B. SPECTR ni-15 C. CL0CK 111-17 7. Some Details about the S0CCER Program HI-17 A. Source Deck Changes 111-17 B. Timing. III-18 Appendix 111-19 1 SECTION IV SALTON, G., and LESK, M«E.: "Information Analysis and Dictionary Construction" 1. Introduction IV-1 2. Language Analysis IV-2 3. Dictionary Construction IV-7 A) The Synonym Dictionary (Thesaurus) IV-7 B) The Ntill Thesaurus and Suffix List. 17-15 C) The Phrase Dictionaries IV-21 vii

TABtM # CdMEENTg (&6fftihued) Page SECTION IV (continued) D) The Concept Hierarchy IV-27 k. Dictionary Performance IV-32 A) The Null Thesaurus IV-33 B) The Regular Thesaurus IV-38 C) The Phrase Dictionary IV-U2 5* Automatic Thesaurus Construction iv-kk A) Fully-Automatic Mtethods IV-48 B) Semi-Automatic Methods IV-50 C) Sample Thesaurus Generation IV-56 6. Semi-Automatic Hierarbhy Formation IV-59 «SECTION V LESK, M. E., and SALTON, G.: "Design Criteria for Automatic Information Systems" 1. Introduction V-l 2. The SMART Experiments v-3 3. Evaluation Results and Design Criteria.... 7-11 A) Indexing Depth and Document Length V-ll B) Synonym Recognition V-16 C) Phrase Processing V-19 D) Statistical Association Methods.. V-22 viii

TABLE OP CONTENTS (oontiitued) Page SECTION V (continued) E) Hierarchical Subject Expansion...... v-28 P) Manual Indexing.. V-30 G) Iterative Searching V-32 H) Summary V-33 SECTION VI» RIDDLE, W., HORWITZ, T., and DIETZ, R.: "Relevance Feedback in an Information Retrieval System" 1«Introduction VI-1 2. Principal Methods VI-4 A) Determination of the Number of Documents Retrieved VI-5 B) The Effect of the Correlation Function. VT-5 C) Determination of the Relevance Weighting Factors VI-6 D) Determination of the Value of & VI-8 E) Termination of the Modification Process.. VI-10 3# Experimental Results. VI-lJO k. Conclusians......... VI-12 Appendix A Appendix B (by. E. M. Keen) VI-16 VI-19 ix

TABLE OP CONTENTS (continued) Page SECTION vn LESSER, V* R.: "A Modified Two-Level Search Algorithm Using Request Clustering" 1. Introduction VH-1 2. A Modified Clustering Algorithm and a Corresponding Two-Level Search Strategy VII-3 3. Advantages of the Query Clustering System.. VII-5 k. Design of an Experiment to Compare the Modified with the Normal Two-Level Search Scheme. VTI-7 A) Problem Areas. VII-7 B) Tests to Compare the Effectiveness of Each Search Procedure VII-7 C) Implementation of the Noimal and Modified Two-Level Search Schemes.... VH-9 D) Test Data Base. VII-12 5. Actual Comparisons of the Modified versus the Normal Two-Level Searches VII-13 A) Data Generated for Two-Level Search Algorithm B) Data Generated for Modified Two-Level Search Algorithm VII-13 VII-1^ C) Experimental Evaluation VH-16 D) Evaluation Results VH-25 6. A New Criterion for Search Effectiveness.... VII-27 7. Conclusions VII-28 Appendix A VII-30 x

TABLE OF CONTENTS (continued) Page SECTION VUI BLOMSREN, G., GOODMAN, A., and KELLY, L.: "An Experimental Investigation of Automatic Hierarchy Generation 1 * 1. Introduction VIII-1 2. Automatic Construction of Hierarchies. VHI-2 3. Outline of the Investigation VTII-12 Appendix A VHI-15 SECTION DC BROFFIT, J. D., MDRGAN, H. L., and SODEN, J. V.: " On Some Clustering Techniques for Information Retrieval" 1. Introduction ix-i 2. Similarity Measures IX-k 3. Rocchio's Procedure DC-5 h. Bonner's Procedure 2X-7 5* The Experiment IX-10 6. Evaluation ix-ll 7. Results and Conclusions IX-12 xi

TABLE OF CONTENTS (continued) Page SECTION X LESK, M. E.: "Design Considerations for Time Shared Automatic Documentation Centers" 1. Introduction X-l 2. Principles X-2 3. Methods X-5 k. Practicalities X-9 5* Conclusions x-17 xii

Summary The present report is the eleventh in a series covering research in automatic storage and retrieval conducted initially at the Computation Laboratory of Harvard University, and more recently jointly undertaken by Harvard and by the Department of Computer Science of Cornell tfaiversity. From the outset, the design of automatic information systems was of principal concern, and the research dealt specifically with the evaluation of a variety of fully automatic methods for information analysis and search. This work resulted in the design of an experimental, fully automatic document retrieval system, called SMART, operating on an IBM 709^ computer, and described in detail in two previous reports in this series, numbered ISR-7 dated June 1 9 ^, and ISR-9 dated August 1965«The SMART system is characterized by the fact that documents and search requests are handled in the natural language without any prior manual analysis, and are processed by one of many different content analysis procedures incorporated into the system. Among these are various statistical and syntactic language analysis methods, and table look-up routines based on a variety of dictionaries and thesauruses. The dictionaries are normally constructed not by committees of subject experts, but send-automatically starting with representative document collections for each subject area. Since it is unreasonable to expect that the documents retrieved by a single search of the collection should provide adequate answers to all users in all circumstances, iterative search procedures have been used in conjunction xiii

with the SMART system which make it possible to obtain improvements in subsequent searches, using feedback information supplied by the users as a result of earlier searches. Evaluation results comparing the effectiveness of some of the automatic analysis and search procedures incorporated into the SMART system were first published in report ISR-8 in this series, dated December 196^. More extensive evaluation output is included in the present report, summarizing the work performed during the fall of 1965 and the first half of 1966. The present report contains work in three main subject areas : automatic and semi-automatic dictionary construction, evaluation output based on results obtained by processing four document collections in three subject areas, and iterative search experiments based on user feedback. Section I by G. Salton contains a short report on the present state of the SMART project, including also a summary of the research proposed for the immediate future. A complete set of operating instructions for the present version of the SMART system is presented in section II by M. Lesk. A study of this section should make it possible to other interested parties to run portions of the SMART system on different 709*f installations. Various aspects of the automatic dictionary construction problem are described in sections III, IV and V H I of the present report. Section III by G. Hochgesang contains a description of a very fast concordance generating program which produces keyword-in-context (KWIC) type output from ordinary text input. This program is used to generate the concordances which are later incorporated in the dictionary construction system. xiv

The concordance program described in section H I is presently being distributed through the SHAEE organization. In section IV by M. Lesk and G. Salton the complete information dissemination process is examined with emphasis on the use and construction by automatic or semi-automatic techniques of synonym dictionaries and hierarchical subject arrangements. One specific proposal for the fully automatic construction of subject hierarchies is presented in section V I H by G. Blcmgren, A. Goodman, and L. Kelly. It is shown in particular how the structure of the hierarchical arrangement changes as various parameters are changed. Section V of this report by M. Lesk and G. Salton contains in summary form the systems evaluation output produced by the SMART system, based on extensive operations with four document collections in three subject fields (documentation, computer science, and aerodynamics). One document collection used in the experiments consists of document abstracts manually indexed by trained indexers, thus permitting a comparison between the effectiveness of the standard keyword matching techniques and the automatic analysis procedures incorporated into the SMART system. Another collection was available in the form of abstracts as well as longer summaries, thus permitting an evaluation of the effects of document length. Three sections are devoted to a study of iterative search techniques and user feedback techniques, including sections VI, VII, and IX. Section VI by W. Riddle, T. Horwitz, and R# Dietz examines the effectiveness of a variety of relevance feedback procedures in which the users supply to the system relevance judgments about documents previously retrieved. xv These

judgments are then used automatically to generate a new search request more indicative of user need and preference. This relevance feedback process, first described in detail in section III of report ISR-1X), is shown to be extremely effective, and to provide continued improvements in search effectiveness througjh at least three feedback iterations. Section VII by V. R. Lesser, and section IX by J. D. Broffitt, H. L. Morgan, and J. V. Soden describe variations of the multi-level search techniques originally introduced in section IV of report ISR-10. These techniques drastically reduce the number of comparisons needed between incoming search requests and stored documents by grouping the stored items, and comparing search requests at first only with a typical item for each group. documents included in Individual comparisons are then made only for those highly scoring groups* The effectiveness of a variety of document grouping procedures used as part of a multi-level search process is evaluated in section IK. Section VII, on the other hand, describes an experiment in which requests previously processed by the system are grouped, rather than documents, and new incoming requests are first compared with these request clusters. The search procedure for a new request is then made to depend on the results obtained with similar requests processed by the system at some previous time. The results of section VII indicate that this heuristic process is useful in reducing search time when the requests to be processed fall into definite patterns, as they may be expected to do in an operational situation. The last section, number X by M. Lesk contains system design specifications for a SMART type system operating in a time-sharing environment xvi

where many users have access to a central document file, and users originate search requests asynchronously, and independently of each other. Equipment specifications and timing considerations are included. xvii