Searching JACoW PDF files on the web
Pascal Le Roux
JACoW Team Meeting, Thoiry, France, 18-19 February 2002

Status of the CERN JACoW Site

The CERN Joint Accelerator Conference Web site is hosted on the CERN central web servers: a pool of 10 machines running Windows 2000 Server + SP2 + (x thousand) patches, with the Microsoft Internet Information Services 5.0 web server.

10 conferences are published on this site:
- 4 PAC (1995, 1997, 1999, 2001)
- 3 EPAC (1996, 1998, 2000)
- 1 APAC (1998)
- 1 ICALEPCS (1999)
- 1 LINAC (1998)

About 8000 PDF files in total.

We recently received the CDs from Cyclotrons 2001 and Linac 96, but their PDF files are not yet JACoW compliant (files not cropped, no keywords...).

A tool is required to search papers!

The CERN JACoW web site provides a search form that serves as a custom interface to the search engine: http://accelconf.web.cern.ch/accelconf/top-page.html

Once you click the Go! button, the form is sent to an ASP script that parses the fields and formats a query string, which is then redirected to the CERN global search engine. The query looks like this:

http://search.cern.ch/query.html?col=cern&qp=&qt=%2burl%3aaccelconf+-url%3aabstract+site%3aaps.anl.gov+%2bdoctype%3apdf+%2btitle%3amagnet&qs=&qc=cern&pw=600&ws=0&qm=0&st=1&nh=10&lk=1&rf=0&rq=0

This customized query string restricts the search to PDF files published on the JACoW site and specifies where (in which hidden field) to search for the words entered by the user.
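
To make the structure of that query string easier to see, here is a minimal Python sketch of how such a URL could be assembled. It is not the ASP script used at CERN; the function name `build_query_url` is hypothetical, and only the parameters `col`, `qt`, `qc` and `nh` from the example above are reproduced (the slides do not explain the others, so they are left out). The search terms, including `site:aps.anl.gov`, are copied verbatim from the example query.

```python
from urllib.parse import urlencode

# Search engine address taken from the example query above.
SEARCH_URL = "http://search.cern.ch/query.html"


def build_query_url(field, words):
    """Build a search URL restricted to JACoW PDF files.

    field: the hidden PDF field to search (e.g. "title", "author").
    words: the words typed by the user, separated by spaces.
    Hypothetical helper, written only to illustrate the query format.
    """
    # Fixed terms copied from the example query on the slide:
    # "+" marks a required term, "-" an excluded one.
    terms = [
        "+url:accelconf",
        "-url:abstract",
        "site:aps.anl.gov",
        "+doctype:pdf",
    ]
    # One required "field:word" term per word entered by the user.
    terms += [f"+{field}:{word}" for word in words.split()]

    params = {
        "col": "cern",
        "qt": " ".join(terms),
        "qc": "cern",
        "nh": "10",
    }
    # urlencode percent-encodes "+" and ":" and turns spaces into "+",
    # which gives the same shape as the query shown on the slide.
    return SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url("title", "magnet"))
```

Searching another hidden field only changes the last term, for example `build_query_url("author", "smith")`.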

Once the engine has collected the matches, it sorts them by relevance ranking or by date before sending back the customized result page.

CERN Global Search Engine

Since 1997, CERN has used the Infoseek Ultraseek search engine, running on a Sun Ultra 1 with SunOS 5.6.

In 2000, Inktomi acquired Infoseek Corporation. Inktomi is a leader in the web-wide search market, providing results for major sites such as MSN Search, Yahoo, Oracle, IBM and Fermi National Accelerator Laboratory.

In November 2001, CERN upgraded its search engine from Ultraseek 4.08 to Inktomi Enterprise Search 4.2.

Product changes

The main product changes are bug and security fixes, cosmetic changes for users, support for direct indexing of Oracle and other ODBC-compliant databases, indexing of NTFS file sources, and improved international support.

Platform and performance

The search engine now runs on a PC with dual 500 MHz CPUs, 1 GB of RAM, a 70 GB SCSI drive and Windows 2000 Server + SP2 (it is also available for Sun and Linux).

This platform:
- indexes the CERN intranet (approximately 1 million documents) every 3 to 7 days
- answers about 1000 queries per day, with peaks of up to 200 queries per hour

Specifications

Inktomi Enterprise Search supports:
- Formats: HTML, XML, text, RTF, MS Office, PDF (search in hidden fields and full-text search), PostScript, FrameMaker, Lotus, WordPerfect
- Languages: English, French, German, Spanish, Portuguese, Italian, Dutch, Swedish, Norwegian, Danish, Finnish, Chinese and Japanese

In addition to indexing the full PDF text, the engine can also index PDF metadata (our hidden fields: title, subject, author, keywords). The search results are therefore more accurate than those of a simple full-text search.

The search result page provides:
- result titles linked to the PDF documents
- smart summaries
- path and size of each PDF file

The results can be sorted by date or by relevance ranking.

Comments from the staff who installed the search engine

CERN has not evaluated any other product since 1997, except Microsoft SharePoint (2001), which was not suited to CERN's needs. We can nevertheless recommend Inktomi, as it requires little work and gives reasonable results.

Price

$2,995 for 1-3,000 pages, $7,495 for up to 10,000 pages.

But CERN IT people told me: "We had a nice price from Inktomi. I cannot tell you how much. This was our main reason to purchase this product, as the IT budget is small."

Is there an alternative to the local Inktomi Enterprise Search?

Hundreds of other search services and products are available on the market, but they do not always suit PDF searches: some tools cannot index the text contained in the PDF hidden fields.

Local search tool or remote search service?

Local search tool

This is the solution described previously. You have to provide:
- the search engine software
- a powerful machine dedicated to the indexing and search service
- an administrator who takes care of the system 24 hours a day

CERN selected Inktomi mainly because it got a really interesting price for such a product, but many other products are available on the market. Since I have not evaluated any of them, I cannot rate them without serious testing. I can only give you a list of leading products according to articles found on the web.

| Product | Price | Platforms supported | Specifications |
|---|---|---|---|
| AltaVista Enterprise Search | $15,000 for smaller companies, up to millions for large corporations!! | Windows NT, Windows 2000, Tru64 UNIX, HP-UX, Solaris, Linux | Handles over 200 file formats, including XML, PDF, PostScript, MS Office |
| Google Search Appliance | $20,000 for one rack-mountable box (150,000 documents) | Google-specific Linux on supplied hardware | Supports about 30 languages |
| Inktomi Enterprise Search | $2,995 for 1-3,000 pages; $7,495 for up to 10,000 pages | Windows NT/2000; Unix: Solaris 2.5 and above, Linux, HP-UX 11.0 | Can index 10,000 files per hour |
| Microsoft Index Server + Adobe PDF IFilter 5.0 | Free: integrated with Microsoft Internet Information Server and Windows NT Server 4.0; Adobe PDF IFilter 5.0 is free | Windows NT (Server only, not Workstation), Windows 2000 | Adobe PDF IFilter 5.0 extends the search capabilities of MIS by indexing all the hidden fields |
| PDF WebSearch (based on dtSearch) | $7,500 | Windows 95/98/NT4/2000 | A search engine specially designed for PDF |
| Elan Web Search | ? | Windows NT 4 / 2000 + Microsoft Internet Information Server | Optimized to support searching of PDF hidden fields + 16 more custom fields |

A more exhaustive list is available at: http://www.searchtools.com/info/pdf.html

Remote search services

In this case, you just sign up for one of the various search services available online. Some of them are free, entirely supported by advertising.

Advantages
- You don't have to worry about the work involved in setting up a search engine.
- No expensive software to buy.
- No machine to maintain.
- No technician to pay for taking care of the service.
- Remote search engines work just as well as local ones.

Drawbacks
- You have less control over:
  - the indexing process: you do not know how often your site is indexed (for free services it can sometimes take many weeks)
  - the accessibility and response time of the search engine
  - the design of the search result pages (advertising...)
- If you pay for the service and have a lot of pages to index, a local search solution can be much cheaper.

| Product | Price | Indexing frequency | Comments |
|---|---|---|---|
| Atomz Enterprise | $10,000 per year and up, depending on the number of domains and pages | Weekly and on demand | No advertising, just an Atomz logo. 15 languages supported. Indexes and searches hidden fields in PDF. |
| Google | Free with Google logo and limited customization; the paid version offers many more options | Google controls scheduling (1 month for the free version) | |
| FreeFind Enterprise | Free with advertising; $79 per month for 5,000 pages | Daily for the paid version | PDF indexing available only in the paid version. |

For a more exhaustive list, have a look at: http://www.searchtools.com/info/pdf.html
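
As a rough illustration of the remark that a local solution can be cheaper for a paid service with many pages, the following Python sketch compares the prices quoted in these slides for a site of about 8000 PDF files. It rests on assumptions that the slides do not state: one PDF file counts as one "page", the Inktomi price is a one-off licence, and the cost of the machine and of the administrator is ignored. The function names are hypothetical.

```python
# Figures quoted in the slides.
JACOW_PDF_COUNT = 8000          # "About 8000 PDF files"
ATOMZ_PRICE_PER_YEAR = 10_000   # "$10,000 per year and up"

# Inktomi prices as (maximum number of pages, price in USD).
INKTOMI_TIERS = [(3_000, 2_995), (10_000, 7_495)]


def inktomi_price(pages):
    """Return the quoted Inktomi price for a given number of pages."""
    for max_pages, price in INKTOMI_TIERS:
        if pages <= max_pages:
            return price
    # The slides quote no price above 10,000 pages.
    raise ValueError("no price quoted above 10,000 pages")


def cumulative_costs(pages, years):
    """Return (local, remote) software costs in USD after `years` years.

    Assumption: the local licence is paid once, the remote service
    every year. Hardware and staff costs are NOT included.
    """
    local = inktomi_price(pages)
    remote = ATOMZ_PRICE_PER_YEAR * years
    return local, remote


if __name__ == "__main__":
    for years in (1, 2, 3):
        local, remote = cumulative_costs(JACOW_PDF_COUNT, years)
        print(f"after {years} year(s): local ${local}, remote ${remote}")
```

Under these assumptions the local licence is already cheaper in the first year, but a real comparison would have to add the dedicated machine and the administrator's time.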

Example of a remote search service using the Google web-wide search engine

Since our CERN web servers are indexed by the Google web-wide search engine, I have duplicated the JACoW search form to test Google.

In the free version of Google, you cannot create precise queries using the title and keywords fields. You can only perform full-text searches or author-field searches. But you can restrict the search to a given domain (http://accelconf.web.cern.ch/accelconf/) and a given file type (PDF), so as to search only the PDF files located on our JACoW site.
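
The domain and file-type restriction can be sketched in Python as follows. This is only an illustration, not the duplicated form itself: it assumes Google's public `site:` and `filetype:` query operators and the `q` URL parameter, and the function name `google_jacow_url` is hypothetical.

```python
from urllib.parse import urlencode

# Host of the JACoW site at CERN, taken from the URL quoted above.
JACOW_HOST = "accelconf.web.cern.ch"


def google_jacow_url(words):
    """Build a Google full-text search URL limited to JACoW PDF files.

    words: the words typed by the user.
    Hypothetical helper; the restriction uses the site: and filetype:
    operators, so only PDF files served from the JACoW host match.
    """
    query = f"{words} site:{JACOW_HOST} filetype:pdf"
    return "http://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(google_jacow_url("superconducting magnet"))
```

Unlike the Inktomi query, nothing here targets the title or keywords hidden fields, which matches the limitation of the free version described above.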

The result page is quite similar to the Inktomi one, with one interesting feature: the possibility of getting an HTML version of the PDF.

The PAC 2001 papers, which were added to the site in mid-January, are not yet indexed by Google (like a few EPAC 2000 papers). Inktomi took 3 days to index them.

My Conclusions

- We (the JACoW team at CERN) don't have to worry about the search engine tool. An administrator has installed and upgraded the system for us, and keeps the machine and the software up 24 hours a day.
- Indexing is done quite often (every 7 days at most).
- The only things we had to do were to create the HTML form and the ASP script and, of course, upload all the files to a web server.
- Since November 2001 (when the search engine was upgraded), we have received about 3600 hits on the JACoW search form.
- We have never received any complaints from users of the CERN instance. (Yes, this doesn't mean that the service is fine...)

I don't think that the CERN JACoW site needs another search engine. This service is sufficient. It could be used at FNAL, since they already have the same search engine ;-)