Catalyst CR Document Indexing Policy

While Catalyst CR can accept a wide variety of files for viewing, many formats are not appropriate for full-text indexing. This document sets forth our policy and procedures for indexing files in Catalyst CR.

We index a wide variety of document formats, including most Microsoft Office formats, email formats, WordPerfect and Lotus formats, and general text files. We do not index non-document formats such as image files (which can be loaded and displayed), container files (zip, tar, gzip, etc.), database files, mail archives or system files. These formats are either inappropriate for indexing or should be processed first to extract their contents for indexing.

Section 1 provides a list of standard document formats that generally can be indexed on the site. Section 2 provides examples of file types that are not appropriate for indexing. Section 3 provides special rules for very large files, which require special arrangements for loading and indexing.

If you have questions about this policy, please contact your Catalyst Client Services representative.

Caveat

Just because a file type is appropriate for indexing does not mean that it will be properly indexed. Native files can be corrupt, contain illegal characters, or have other issues we cannot describe in advance that may cause our search indexer to fail or to index a file only partially.

We do not inspect individual files to determine whether they are indexed properly or whether all of the text in the file was indexed. It would be all but impossible to do so.

We do provide a utility that allows you to view files that had indexing issues. This utility is designed to report when a file cannot be indexed (e.g. an image file, no document found, file can't be indexed). However, the utility does not report instances where some text was indexed but other text could not be.
We are not aware of any method to provide this information other than comparing indexed and actual text by hand.

In loading files into Catalyst CR, we do not (and cannot) guarantee that all of the text in any particular document is properly indexed. We use the indexing software provided through our FAST search engine license and index on a best-efforts basis, limited to what that software can extract.

File Formats Accepted for Indexing

Our FAST engine uses the industry-accepted Stellent filters to extract text from documents for indexing and search. While formats are commonly identified by their three-letter file extension, our system examines each file to determine its actual format regardless of extension.

If you have a question about a specific format, please ask and we will determine whether that file type is indexable. If not, we can assist you in converting it to a format that is indexable, e.g. PDF, text or HTML.

www.catalystsecure.com 877.557.4273 info@catalystsecure.com

Catalyst CR will accept most versions of the following file formats for indexing.

Word Processing Formats:
  Lotus WordPro and related versions
  MacWrite II
  Microsoft Rich Text Format
  Microsoft Word for DOS
  Microsoft Word for Macintosh
  Microsoft Word for Windows
  Microsoft WordPad
  Microsoft Works for DOS
  Microsoft Works for Macintosh
  Microsoft Works for Windows
  Microsoft Write
  Novell/Corel Perfect Works
  Novell/Corel WordPerfect for DOS
  Novell/Corel WordPerfect for Mac
  Novell/Corel WordPerfect for Windows
  Open Office Writer (Text Only)

Spreadsheet Formats:
  Lotus 1-2-3 (DOS & Windows)
  Lotus Symphony
  Microsoft Excel for Macintosh
  Microsoft Excel for Windows
  Microsoft Windows Works
  Microsoft Works (DOS)
  Microsoft Works (Macintosh)
  Open Office Calc (Text Only)
  QuattroPro for DOS & Windows
  Star Office Calc (Text Only)

Presentation Formats:
  Microsoft PowerPoint for Windows
  Microsoft PowerPoint for Macintosh
  Novell/Corel Presentations
  Open Office Impress (Text Only)
  Star Office Impress (Text Only)

Email Formats:
  MIME (text mail)
  MSG Outlook Mail Message (Windows text only)
  EML (standards-based email formats)

Other Formats:
  PDF (Adobe Acrobat)
  HTML
  Text files (subject to size limitations; see below)
  Microsoft Project (text only)
  vCard Electronic Business Card

As of this writing, Catalyst cannot index Office 2007 formats or Microsoft OneNote files. If you need to index these files, they have to be processed separately and converted to an indexable format.

Formats Not Accepted for Indexing

Certain file formats are not appropriate for indexing in Catalyst CR without additional processing or other special attention. Here are examples of representative formats that will be excluded from indexing [1]:

[1] This list is meant for illustrative purposes and is not meant to be comprehensive.

Container files: ZIP, GZIP, LZA, Microsoft Binder, UNIX TAR
  Compressed container formats are usually decompressed during processing. We will index the contents of container files but will not index the container files themselves except to present a list of their contents.

Email Container Formats: PST, OST, NSF and all other mail archives
  These formats contain MSG and EML files that should be processed and extracted before they are uploaded into Catalyst CR. We will index the extracted contents of these files but not the containers themselves.

Database Formats: Access, dBASE, FoxBase, SQL, Paradox, etc.
  Database formats are not appropriate for full-text indexing in Catalyst CR. If you need to review database information, we can assist by creating reports and records that may be appropriate for indexing in CR. Please consult with a Client Services representative for special treatment of these files.

Graphics Formats: BMP, CGM, GIF, JPEG, PCX, PSP, PNG, TIFF, WMF, WPG, SWF
  While these formats can be displayed on the site, they do not typically contain text that can be indexed by the system. In some cases, they will be accompanied by text files that can be indexed. TIFF is an example of a file format that is regularly accompanied by a matching text file.

Executables: EXE, DLL and all system or program files
  These formats are often excluded during the processing phase and are rarely needed or desired in a document review system.

Unknown files without identifiable extensions or content.

Files that are not indexed in Catalyst CR can still be uploaded to the site if desired. The system will index any related metadata for the file contained in the record associated with it, e.g. Date, Author, Description, Control Numbers.

Let us know if you have special file formats you need indexed for a particular matter. In most cases we can accommodate them, so long as the format has text to index and can be accessed using our filters.

Special Indexing Rules for Very Large Files

Large text or document files often contain program code, SQL dumps or other content that is not suitable for full-text searching and that can damage our document indexes.
To protect against this, and because such files are seldom needed for search, we apply the following rules for very large files:

  Any file that contains over 8 million unique words (any combination of letters or numbers separated by spaces or punctuation) will be truncated at 8 million words.

  Any file that is over 85 megabytes in size will not be indexed. Instead, the file will be flagged as oversized, allowing for special review and treatment.

  Any file where the extracted text is greater than 15 megabytes in size will not be indexed. Instead, the file will be flagged as oversized, allowing for special review and treatment.

To put these rules in perspective, a five-hundred-page novel typically contains about a megabyte of text. Files containing 15+ megabytes of text are usually system log or core dump files containing substantial amounts of numbers and text that are not suitable for indexing or search.
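
For readers who want to pre-screen files before loading, the three large-file rules above can be sketched in Python as follows. This is only an illustration, not Catalyst's implementation. The function and class names are hypothetical, and several details the policy does not state are assumed: a megabyte is taken as 1,048,576 bytes, extracted text is measured as UTF-8, words are compared case-insensitively, the size checks run before the word count, and truncation cuts the text just before the first word that would exceed the unique-word limit.

```python
import re
from dataclasses import dataclass

MB = 1024 * 1024                 # assumption: binary megabytes
MAX_FILE_BYTES = 85 * MB         # policy: files over 85 MB are not indexed
MAX_TEXT_BYTES = 15 * MB         # policy: extracted text over 15 MB is not indexed
MAX_UNIQUE_WORDS = 8_000_000     # policy: truncate at 8 million unique words

# A "word" here is any run of letters or digits, per the policy's definition.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


@dataclass
class IndexDecision:
    """Hypothetical result type: what to do with a file and which text to index."""
    action: str  # "index", "truncate", or "oversized"
    text: str    # text to hand to the indexer (empty when oversized)


def decide_indexing(file_bytes, extracted_text,
                    max_file_bytes=MAX_FILE_BYTES,
                    max_text_bytes=MAX_TEXT_BYTES,
                    max_unique_words=MAX_UNIQUE_WORDS):
    """Apply the large-file rules to one file (order of checks is an assumption)."""
    if file_bytes > max_file_bytes:
        return IndexDecision("oversized", "")
    if len(extracted_text.encode("utf-8")) > max_text_bytes:
        return IndexDecision("oversized", "")

    seen = set()
    for match in WORD_RE.finditer(extracted_text):
        word = match.group().lower()  # assumption: case-insensitive uniqueness
        if word not in seen:
            if len(seen) == max_unique_words:
                # Cut just before the first word that would exceed the limit.
                return IndexDecision("truncate", extracted_text[:match.start()])
            seen.add(word)
    return IndexDecision("index", extracted_text)
```

The limits are parameters so the logic can be tried on small inputs; for example, decide_indexing(10, "one two one three four", max_unique_words=3) returns a "truncate" decision whose text stops before "four".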