Searching JACoW PDF files on the web. Pascal Le Roux, JACoW Team Meeting, Thoiry, France, 18-19 February 2002




Status of the CERN JACoW Site

The CERN Joint Accelerator Conference Web site is hosted on the CERN central web servers: a pool of 10 machines running Windows 2000 Server + SP2 + (x thousand) patches, with the Microsoft Internet Information Services 5.0 web server.

10 conferences are published on this site:
- 4 PAC (1995, 1997, 1999, 2001)
- 3 EPAC (1996, 1998, 2000)
- 1 APAC (1998)
- 1 ICALEPCS (1999)
- 1 LINAC (1998)

About 8000 PDF files in total.

We recently received the CDs from Cyclotrons 2001 and Linac 96, but the PDF files are not yet JACoW compliant (files not cropped, no keywords...).

A tool is required to search papers!

The CERN JACoW web site provides a search form that serves as a custom interface to the search engine: http://accelconf.web.cern.ch/accelconf/top-page.html

Once you click on the Go! button, the form is sent to an ASP script that parses the fields and formats the query string, which is then redirected to the CERN global search engine. The query looks like this:

http://search.cern.ch/query.html?col=cern&qp=&qt=%2burl%3aaccelconf+-url%3aabstract+site%3aaps.anl.gov+%2bdoctype%3apdf+%2btitle%3amagnet&qs=&qc=cern&pw=600&ws=0&qm=0&st=1&nh=10&lk=1&rf=0&rq=0

This customized query string restricts the search to PDF files published on the JACoW site and specifies where (in which hidden field) to search for the words entered by the user.
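The way such a script might assemble the query string can be sketched as follows. This is only an illustration in Python, not the actual ASP code: the function name `build_query` and the selection of parameters are assumptions, while the endpoint and the field operators (`url:`, `doctype:`, `title:`) are taken from the example query above. The sketch leaves out the `site:` term and most of the other parameters, whose meaning is not explained in the talk.

```python
from urllib.parse import urlencode

# Endpoint as shown in the example query.
SEARCH_URL = "http://search.cern.ch/query.html"


def build_query(words, field="title"):
    """Build a search URL restricted to PDF files under 'accelconf'.

    `words` are the terms typed by the user; `field` is the hidden PDF
    field to search in (for example "title"). Hypothetical helper.
    """
    terms = [
        "+url:accelconf",   # URL must contain "accelconf"
        "-url:abstract",    # URL must not contain "abstract"
        "+doctype:pdf",     # PDF documents only
    ]
    terms += [f"+{field}:{word}" for word in words]
    # 'col' and 'qc' are copied verbatim from the example query.
    params = {"col": "cern", "qt": " ".join(terms), "qc": "cern"}
    # urlencode percent-escapes '+' and ':' and turns spaces into '+'.
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query(["magnet"]))
```

The escaping produced by `urlencode` uses upper-case hexadecimal digits (`%2B`, `%3A`), which is equivalent to the lower-case form in the original query.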

Once the engine has found the matches, it sorts them by relevance ranking or by date before sending back the customized result page.

CERN Global Search Engine

Since 1997, CERN has used the Infoseek Ultraseek search engine, running on a Sun Ultra 1 with SunOS 5.6.

In 2000, Inktomi acquired Infoseek Corporation. Inktomi is a leader in the web-wide search market, providing results for major sites such as MSN Search, Yahoo, Oracle, IBM and Fermi National Accelerator Laboratory.

In November 2001, CERN upgraded its search engine from Ultraseek 4.08 to Inktomi Enterprise Search 4.2.

Product changes

The main product changes are bug and security fixes, cosmetic changes for the users, support for direct indexing of Oracle and other ODBC-compliant databases, indexing of NTFS file sources, and improvements in international support.

Platform and performance

The search engine now runs on a PC with dual 500 MHz CPUs, 1 GB of RAM and a 70 GB SCSI drive, under Windows 2000 Server + SP2 (it is also available for Sun and Linux).

This platform:
- indexes the CERN Intranet (approximately 1 million documents) every 3 to 7 days
- answers about 1000 queries per day, with peaks of up to 200 queries per hour
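To put these figures in perspective, the short calculation below derives the implied rates. It is a back-of-the-envelope sketch that assumes indexing runs continuously over the whole cycle and that queries are spread evenly over 24 hours; neither assumption comes from the talk.

```python
DOCS = 1_000_000              # approximate size of the CERN Intranet index
QUERIES_PER_DAY = 1000        # typical daily load
PEAK_QUERIES_PER_HOUR = 200   # observed peak


def docs_per_hour(docs, cycle_days):
    """Sustained indexing rate needed to cover `docs` in `cycle_days`."""
    return docs / (cycle_days * 24)


avg_queries_per_hour = QUERIES_PER_DAY / 24
peak_to_average = PEAK_QUERIES_PER_HOUR / avg_queries_per_hour

if __name__ == "__main__":
    print("7-day cycle:", round(docs_per_hour(DOCS, 7)), "docs/hour")
    print("3-day cycle:", round(docs_per_hour(DOCS, 3)), "docs/hour")
    print("average load:", round(avg_queries_per_hour, 1), "queries/hour")
    print("peak/average:", round(peak_to_average, 1))
```

Under these assumptions the machine has to sustain roughly 6,000 to 14,000 documents per hour, and the peak query load is about five times the daily average.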

Specifications

Inktomi Enterprise Search supports:
- HTML, XML, Text, RTF, MS Office, PDF (search in hidden fields and full-text search), PostScript, FrameMaker, Lotus, WordPerfect
- English, French, German, Spanish, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Finnish, Chinese and Japanese

In addition to indexing the full PDF text, the engine can also index PDF metadata (our hidden fields: title, subject, author, keywords). The search results are therefore more accurate than those of a simple full-text search.

The search result page provides:
- result titles linked to the PDF document
- smart summaries
- the path and size of the PDF file

The results can be sorted by date or by relevance ranking.

Comments from the staff who installed the search engine

CERN has not done any evaluation since 1997, except for Microsoft SharePoint (2001), which was not suited to CERN's needs, but we can recommend Inktomi as it requires little work and gives reasonable results.
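The benefit of searching the hidden fields rather than the full text can be illustrated with a toy example. The sketch below is plain Python and has nothing to do with how Inktomi works internally; the `PdfRecord` class, the `search` function and the two sample records are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class PdfRecord:
    """A PDF file with its hidden fields and extracted text (toy model)."""
    path: str
    title: str = ""
    author: str = ""
    keywords: str = ""
    fulltext: str = ""


def search(records, word, where="fulltext"):
    """Return the paths of records whose `where` field contains `word`."""
    word = word.lower()
    return [r.path for r in records
            if word in getattr(r, where).lower().split()]


# Two made-up papers: only the first one is really about magnets.
RECORDS = [
    PdfRecord("paper1.pdf", title="Superconducting magnet design",
              fulltext="design of a superconducting magnet for a collider"),
    PdfRecord("paper2.pdf", title="RF cavity tests",
              fulltext="rf cavity tests performed near a bending magnet"),
]

if __name__ == "__main__":
    print(search(RECORDS, "magnet"))            # full text: both papers
    print(search(RECORDS, "magnet", "title"))   # title field: paper1 only
```

A full-text search returns every paper that mentions the word in passing, while the title search returns only the paper whose subject is the word itself.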

Price

$2,995 for 1-3,000 pages, $7,495 for up to 10,000 pages.

But CERN IT people told me: "We had a nice price from Inktomi. I cannot tell you how much. This was our main reason to purchase this product, as the IT budget is small."

Is there an alternative to running Inktomi Enterprise Search locally?

Hundreds of other search services and products are available on the market, but they do not always suit PDF searches. Some tools cannot index the text contained in the PDF hidden fields.

Local search tool or remote search service?

Local search tool

This is the solution described previously. You need:
- the search engine software
- a powerful machine dedicated to the indexing and search service
- an administrator who takes care of the system 24 hours a day

CERN selected Inktomi mainly because they got a really good price for such a product. But of course, many other products are available on the market. Since I haven't evaluated any of them, I can't rate them without serious testing. I can only give you a list of leading products according to articles found on the web.

Local search products (product, price, platforms supported, specifications)

AltaVista Enterprise Search
  Price: $15,000 for smaller companies to millions for large corporations!!
  Platforms: Windows NT, Windows 2000, Tru64 UNIX, HP/UX, Solaris, Linux

Google Search Appliance
  Price: $20,000 for 1x rack-mountable box (150,000 documents)
  Platforms: Google-specific Linux on supplied hardware

Inktomi Enterprise Search
  Price: $2,995 for 1-3,000 pages; $7,495 for up to 10,000 pages
  Platforms: Windows NT/2000; Unix: Solaris 2.5 and above, Linux, HP-UX 11.0

  Specifications (AltaVista Enterprise Search, Google Search Appliance, Inktomi Enterprise Search): Handle over 200 file formats, including XML, PDF, PostScript, MS Office. Support about 30 languages. Can index 10,000 files per hour.

Microsoft Index Server + Adobe PDF IFilter 5.0
  Price: Free: integrated with Microsoft Internet Information Server and Windows NT Server 4.0. Adobe PDF IFilter 5.0 is also free.
  Platforms: Windows NT (Server only, not Workstation), Windows 2000
  Specifications: Adobe PDF IFilter 5.0 extends the search capabilities of MIS by indexing all the hidden fields.

PDF WebSearch (based on dtSearch)
  Price: $7,500
  Platforms: Windows 95/98/NT4/2000
  Specifications: A search engine specially designed for PDF.

Elan Web Search
  Price: ?
  Platforms: Windows NT 4 / 2000 + Microsoft Internet Information Server
  Specifications: Optimized to support the searching of PDF hidden fields + 16 more custom fields.

A more exhaustive list is available at: http://www.searchtools.com/info/pdf.html

Remote search services

In this case, you just have to sign up for one of the various search services available online. Some of them are free, completely supported by advertising.

Advantages
- You don't have to worry about the work involved in setting up a search engine.
- No expensive software to buy.
- No machine to maintain.
- No technician to pay for taking care of the service.
- Remote search engines work just as well as local ones.

Drawbacks
- You don't have as much control over:
  - the indexing process: you do not know how often your site is indexed (sometimes it can take many weeks for free services...)
  - the search engine accessibility and response time
  - the design of the search result pages (advertising...)
- If you pay for the service and have a lot of pages to index, a local search solution can be much cheaper.

Remote search services (product, price, indexing frequency, comments)

Atomz Enterprise
  Price: $10,000 per year and up, depending on the number of domains and pages
  Indexing frequency: Weekly and on demand
  Comments: No advertising, just an Atomz logo. 15 languages supported. Indexes and searches hidden fields in PDF.

Google
  Price: Free with Google logo and limited customization. Paid version offers many more options...
  Indexing frequency: Google controls scheduling (1 month for free version)

FreeFind Enterprise
  Price: Free with advertising; $79 per month for 5,000 pages
  Indexing frequency: Daily for paid version
  Comments: PDF indexing available only for paid version.

For a more exhaustive list, have a look at: http://www.searchtools.com/info/pdf.html

Example of a remote search service using the Google web-wide search engine

Since our CERN web servers are indexed by the Google web-wide search engine, I've duplicated the JACoW search form to test Google.

In the free version of Google:
- You can't create a precise query using the title and keywords fields. You can only perform full-text searches or author field searches.
- But you can restrict the search to a given domain (http://accelconf.web.cern.ch/accelconf/) and a given file type (PDF), to search only the PDF files located on our JACoW site.
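A query restricted to one site and one file type can be sketched with Google's `site:` and `filetype:` search operators. This is a minimal illustration, not the code behind the duplicated form: the function name `google_jacow_url` is hypothetical, the restriction is applied to the host name only (not to the `/accelconf/` path), and the form shown in the talk may well have used different parameters.

```python
from urllib.parse import urlencode


def google_jacow_url(words, site="accelconf.web.cern.ch", filetype="pdf"):
    """Build a Google search URL limited to one site and one file type.

    `words` are the user's search terms; they are matched against the
    full text, since title and keyword fields cannot be targeted.
    """
    query = " ".join(list(words) + [f"site:{site}", f"filetype:{filetype}"])
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(google_jacow_url(["magnet"]))
```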

The result page is quite similar to the Inktomi one, with an interesting feature: the possibility to get an HTML version of the PDF.

The PAC 2001 papers, which were added to the site in mid-January, are not yet indexed! (Like a few EPAC 2000 papers...) (It took 3 days to be indexed by Inktomi.)

My Conclusions

- We (the JACoW team at CERN) don't have to worry about the search engine tool. An administrator has installed and upgraded the system for us, and keeps the machine and the software up 24 hours a day.
- Indexing is done quite often (every 7 days at most).
- The only things we had to do were to create the HTML form and the ASP script and, of course, upload all the files to a web server.
- Since November 2001 (when the search engine was upgraded), we have received about 3600 hits on the JACoW search form.
- We have never received any complaints from the users of the CERN instance. (Yes, this doesn't mean that the service is fine...)
- I don't think that the CERN JACoW site needs another search engine. This service is sufficient.
- It could be used at FNAL since they already have the same search engine ;-)