Semaphore Overview. A Smartlogic White Paper. Executive Summary

Enterprises no longer face an acute information access challenge. This is mainly because the information search market has matured immensely in the last few years, so access to information via search is very common. However, enterprise search, by itself, is of limited value. Today, enterprises want to search content with the end goals of findability* and governance, such as content disposition, compliance, and records management.

"Enterprise information access technologies are maturing and now offer better indexing, querying, presentation and drill-down of results. However, the real value of information access technologies is in the upfront and ongoing efforts needed to establish effective taxonomies, to index, and to classify content of all kinds that must be accessed. By itself, the search function has limited value."
- Tom Eid, VP Research, Gartner

Smartlogic's Semaphore is the leading Enterprise Semantic Platform that augments traditional information management systems like enterprise search, content management and business workflow engines. It captures important topics, resources and people in a model (list, taxonomy or ontology) and then uses this model to classify content and enrich it with metadata, delivering a more complete enterprise information management experience. In collaboration with existing enterprise applications, Semaphore provides a step change in search and content navigation for intranets and websites and ensures findability and appropriate content disposition, governance, data loss prevention, compliance and records management.

Semaphore consists of four core modules:

- Ontology Server & Manager allows multiple users to collaborate on the development and management of ontologies which capture the essential topics, resources and vocabulary for the business. The Ontology Manager ensures the integrity and relevance of the models while dramatically reducing the effort to build them.
- Advanced Linguistics Pack provides text mining and entity extraction based on part-of-speech tagging. It identifies over 30 different entity types (e.g. people, places, products, organizations, dates, facilities, URLs, etc.), which reduces the time to build ontologies and helps support accurate classification.
- Classification Server is a rules-based semantic classification engine providing accurate metadata tagging of content in 26 languages. It can also be used to identify known entities using a range of vocabulary, or to determine the "about-ness" of content through the use of over 20 rule types. Rulebases can be generated automatically from the ontologies managed in the Ontology Server, dramatically reducing set-up time. The Classification Server provides statistical output to identify relationships between topics, entities or nodes and supports over 280 file formats.
- Semantic Enhancement Server / Search Application Framework enhances search engines (e.g. Microsoft SharePoint Search, Microsoft FAST, Lucene/Solr, Google Search Appliance, etc.) and other platforms with semantics to deliver the most compelling search and dynamic content navigation capabilities available today. It also provides published APIs and tag libraries for rapid user interface development.

*Findability is defined as the quality of a particular object being locatable or the quality of a whole system being navigable.

The purpose of this document is to help organizations better understand what Semaphore is and how it works in the enterprise to provide the benefits of findability through improved search and navigation, together with content governance, data loss prevention, compliance, and records management.

What is Metadata?

According to Gartner, metadata is information that describes various facets of an information asset to improve its usability throughout its lifecycle. Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called "data about data" or "information about information".

Tagging an information asset with metadata makes it more usable and controllable. The more valuable the information asset, the more metadata it usually needs, and the more accurate and definitive that metadata has to be. Metadata is what defines and unlocks the value of data; therefore it requires attention.

Tagging with metadata is just the first step. Turning metadata into business metadata using semantics is what makes it more valuable. Metadata that describes contextually relevant or domain-specific information about content, specific to an industry or an enterprise, is known as business metadata. For example, if the content is from the business domain, the relevant semantic metadata could be company name, ticker symbol, industry, sector, executives, etc., whereas if the content is from the intelligence domain, the relevant semantic metadata could be terrorist name, event, location, organization, etc.

Metadata for Facilitating Findability and Reusability of Content: Content is an asset for enterprises. While search engines have made it possible to access information locked away in various repositories, search is still, by and large, very basic. Tagging content with metadata significantly enhances its findability. Metadata also improves the consistency and quality of the output, so content can be repurposed and reused, slashing the time it takes to create new content.

Metadata for Governance of Content: Metadata supports content governance, which is the set of policies, procedures, guidelines, rules and compliance metrics that helps maintain content within a system. According to IDC, unstructured data and metadata have an average annual growth rate of 62 percent. More importantly, high-value information (data and content that are governed by security, compliance and preservation obligations) is skyrocketing, and IDC forecasts that high-value information will comprise close to 50 percent of the digital universe by the end of 2020.

To ensure enterprises deal with metadata appropriately, Smartlogic has developed Semaphore, an Enterprise Semantic Platform, which links the user interface serving up information with the enterprise systems implemented to manage this high-value information. The goal of Semaphore is to provide findability, so that information can be surfaced without delay, and to boost the quality of the information asset so it can be put to work more effectively for governance, e-discovery, content disposition, and records management.

What is an Enterprise Semantic Platform?

A semantic platform provides meaning and context to unstructured content. Many types of application can benefit from semantic enhancement, from enterprise search to content management to workflow and business process engines.

In order to apply metadata consistently, according to the enterprise standards, and in a way that can be demonstrated to be trustworthy and accurate, it stands to reason that the determination and application of metadata should be an automated process:

- This automated process needs a model of these standards with which to drive the classification process.
- The classification process should use the models to determine the metadata values for any particular piece of information.
- The mechanism that applies the metadata should be auditable, repeatable and controllable for compliance purposes.
- The process should be accurate, predictable and quick to implement.
- The engine that determines the metadata values should be a published resource for use by any other third-party application.

These models are called semantic models, as they refer to the science of meaning in language and contain the vocabulary associated with the domain being classified and searched. Semantic models contain all types of controlled vocabulary, from simple lists of terms to thesauri to taxonomies to full ontologies.

Semantic models offer a way to organize knowledge and information. They make it possible to define a subject domain using a hierarchy of terms and the relationships between those terms. The model makes the subject clear to a user. These models help illustrate to a person what they know, what they don't know and what they need to know. Whether it's researching a company, finding the right product, or identifying with whom to talk, semantics is a powerful tool by which people quickly find answers, gain understanding or stimulate problem solving.
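To make the idea of a semantic model concrete, the following is a minimal Python sketch of a model as a set of terms with hierarchical, associative and equivalence relationships. It is an illustration only and not Semaphore's data model: the class names, field names and the sample terms are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One concept in a semantic model (all field names are illustrative)."""
    label: str                                   # preferred label
    synonyms: set = field(default_factory=set)   # equivalent terms
    broader: set = field(default_factory=set)    # parent labels (several allowed)
    related: set = field(default_factory=set)    # associative relationships


class SemanticModel:
    """A tiny in-memory model: preferred label -> Term."""

    def __init__(self):
        self.terms = {}

    def add(self, label, synonyms=(), broader=(), related=()):
        self.terms[label] = Term(label, set(synonyms), set(broader), set(related))

    def narrower(self, label):
        """Labels that list `label` as a broader term, sorted for stable output."""
        return sorted(t.label for t in self.terms.values() if label in t.broader)

    def resolve(self, text):
        """Map a preferred label or synonym to its preferred label, else None."""
        needle = text.strip().lower()
        for term in self.terms.values():
            names = {term.label} | term.synonyms
            if needle in {name.lower() for name in names}:
                return term.label
        return None


# Invented sample content, for illustration only.
model = SemanticModel()
model.add("Diseases")
model.add("Insulin")
model.add("Diabetes", synonyms={"diabetes mellitus"},
          broader={"Diseases"}, related={"Insulin"})

print(model.narrower("Diseases"))          # child topics of "Diseases"
print(model.resolve("Diabetes Mellitus"))  # synonym resolved to its preferred label
```

A simple list of terms uses only the labels, a thesaurus adds the synonyms, a taxonomy adds the broader/narrower hierarchy, and an ontology adds typed associative relationships on top.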
A model approach to classification and navigation is often applied when:

- Metadata standards need to be applied to information.
- Regulations are mandated that involve scrutiny of unstructured content.
- Legal discovery (sometimes referred to as ediscovery) is a requirement for information stored in repositories within very precise guidelines.
- An improvement in search experience is required to give the user more accurate results and a unique navigation experience that intuitively supports user journeys.
- Content, products, services, resources or advertisements need to be promoted based on the context and interest of the user.
- Statistical methods for clustering information are inadequate, either because the subjects are too close together or because the suggested facets do not make logical sense.

Semaphore, the Smartlogic Enterprise Semantic Platform

Semaphore is an enterprise semantic platform which uniquely captures an organization's subjects and topics into a taxonomy or ontology (the model) and enhances traditional information management systems like search, content management and business workflow engines by adding advanced content classification, metadata enrichment, and navigation capabilities to deliver a more complete enterprise information management experience.

The software comprises core functional modules which provide:

- Complete lifecycle management of complex ontologies, taxonomies or thesauri (Semaphore models).
- Natural Language Processing of text to classify content against the Semaphore models.
- A highly scalable index of the Semaphore model to drive any user interface components (facet filters, taxonomy selection trees, "did you mean" widgets, etc.).

Semaphore Modules

Semaphore consists of four core modules [1] and five integration packs [2]:

Semaphore Module: Features

Ontology Manager
    Semantic model management, including the security, control, edit and reporting environment for taxonomies, ontologies and thesauri.

Content Classification and Text Mining Server
    Automatically analyzes text, in numerous formats and languages, and returns metadata tags that classify the item. The tags should be sourced from the semantic model, or provide useful mark-up evidence by algorithmically extracting dates, people, companies, place names, etc. A part-of-speech algorithm provides feedback on the number and groups of phrases within a set of content and shows how they relate to the semantic model.

Semantic Enhancement Server
    Integration with enterprise search, content and document management systems, offering enhanced navigation, management and discoverability.

Search Application Framework
    A best-practice search interface that exposes the ontology and metadata to deliver an exceptional user search, find and discover experience over Google Search Appliance, Microsoft FAST or Apache Solr search engines.

Semaphore Solutions

Semaphore Solution: Features

Semaphore for GSA
    Adds taxonomy-derived metadata to the Google Search Appliance (GSA) indexing pipeline and provides sample interface code for the query pipeline.

Semaphore for MS FAST
    Adds taxonomy-derived metadata to the Microsoft FAST indexing pipeline and provides sample interface code for the query pipeline.

Semaphore for Apache/Solr
    Adds taxonomy-derived metadata to the Apache Solr index via the Nutch indexing pipeline.

Semaphore for MS SharePoint
    Tightly integrated; adds taxonomy-derived metadata against any SharePoint item (Document Library, Page Library, Blog, and Wiki Page) using native site columns. Provides search web parts that put the metadata in the context of the taxonomy. For SharePoint 2010, Semaphore is tightly integrated with the Term Store and MMS.

Semaphore for Opentext
    Adds taxonomy-derived metadata onto a page in RedDot under editorial control (suggests terms, but allows the editor to add or remove terms).

[1] The modules are designed to work together, but can run standalone as well.
[2] Integration Packs are solution code packs developed, provided and supported by Smartlogic.

Semaphore modules have easy-to-use XML over HTTP and REST-based web service interfaces and can connect to any system that has suitable connection points.

Semaphore Modules and Solutions

Ontology Manager

The Ontology Manager Desktop lies at the heart of the solution and provides the workbench for Information Scientists not only to manage the taxonomy or ontology, but also to drive and influence the classification and search experience. The screenshot below shows the Ontology Manager software with some of the elements that drive the Semaphore experience highlighted:

1. A poly-hierarchical tree view of the taxonomy (in this case a NASA ontology) showing all the topics.
2. Hierarchical relationships can be viewed and edited, allowing end users to navigate up and down the taxonomy or content.
3. Associative relationships allow end users to browse to related topics, people and resources. Smartlogic supports the definition of unlimited relationship classes and behaviors (ontological capabilities).
4. Equivalent terms, i.e. a thesaurus. These support the "Did you mean" capability and are also used in automated classification of content.
5. Additional metadata and attributes used to control search behavior and to offer information about topics: Best Bets, A-Zs, Scope Notes and much more.

Illustration of Semaphore's Text Miner

Semaphore's Text Miner is a noun phrase and entity extraction tool that improves user productivity by focusing on commonly occurring phrases within a sample document set and showing those phrases in context, so that the Information Scientist can choose which to include in the taxonomy or ontology.

Ontology Manager provides a full set of XML exports (in zthes standard formats) and other report formats to support both the taxonomy development process and simple integration with other applications.

The sample reports shown above include a hierarchical report output as Excel and an XML output transformed via XSLT (both supported directly in the tool).

Capturing user feedback is one of the best ways of improving the performance and usability of a taxonomy or ontology. Semaphore includes an interactive, graphical interface for subject matter experts (SMEs) to navigate these models and to provide feedback and suggestions. This feedback is collated into a management view in Ontology Manager so that suggestions can be implemented. Feedback can also be collected from end users as they search and navigate for content or as content is classified.

Classification and Text Mining Server

Content classification is managed by the rule-based classification server. This server performs the following tasks:

The core Classification Server is supported by tools that allow a user to build, publish, test and monitor the classification rules. The Rulebase Generator generates classification rules for each topic in the taxonomy, almost eliminating the laborious task of training the classification process associated with other products, while still delivering control, accuracy and auditability.

The Rulebase Generator takes a logic template and adds the term information from the taxonomy to create a classification rulebase. The template encapsulates classification logic, as shown below in a logical representation of a rulebase template; the real version is in an XML format developed within the Semaphore toolkit.

Preferred Terms
    If in the title
        Return the max of the following
            As an exact phrase: weight = 40
            Words in close proximity: weight = 30
            Any words: weight = 10
    If in the body
Non-Preferred Terms
    If in the title
        As an exact phrase: weight = 25
        Words in close proximity: weight = 15
Child Terms
    Add from any child terms: weight = 10
Related Terms
    Add from any related terms: weight = 15
etc.

Once configured, the process is run as a simple Publish > Generate Rulebases command from Ontology Manager Desktop.

Classification Testing and Modification

The Bulk test utility is designed to give a broad overview of the classification results. Its output, analyzed in an Excel pivot table, helps the user identify anomalies.
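The weighted logic in the rulebase template can be approximated in a few lines of Python. This is a hedged sketch, not Semaphore's rule engine or its XML rulebase format: only the weights are taken from the template. The proximity window, the tokenizer, the dictionary layout of a term and the choice to sum the four contributions are assumptions made for illustration, and the body branch is omitted because the template does not show its weights.

```python
import re

# Weights copied from the logical template; everything else is an assumption.
PREFERRED = {"exact": 40, "proximity": 30, "any": 10}
NON_PREFERRED = {"exact": 25, "proximity": 15}
CHILD_WEIGHT = 10
RELATED_WEIGHT = 15
WINDOW = 5  # hypothetical: number of consecutive tokens treated as "close proximity"


def tokens(text):
    """Lower-case alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_score(phrase, text, weights):
    """Return the max weight among the match types that fire for `phrase`."""
    words, toks = tokens(phrase), tokens(text)
    if not words or not toks:
        return 0
    scores = [0]
    n, wanted = len(words), set(words)
    # Exact phrase: the words appear consecutively and in order.
    if any(toks[i:i + n] == words for i in range(len(toks) - n + 1)):
        scores.append(weights.get("exact", 0))
    # Close proximity: all words fall inside one window of WINDOW tokens.
    if any(wanted <= set(toks[i:i + WINDOW]) for i in range(len(toks))):
        scores.append(weights.get("proximity", 0))
    # Any words: at least one word of the phrase is present.
    if wanted & set(toks):
        scores.append(weights.get("any", 0))
    return max(scores)


def mentions(phrases, text):
    """True if any phrase occurs in the text as an exact phrase."""
    return any(match_score(p, text, {"exact": 1}) for p in phrases)


def score_title(term, title):
    """Score a title against one taxonomy term (summing is an assumption)."""
    score = match_score(term["preferred"], title, PREFERRED)
    score += max((match_score(s, title, NON_PREFERRED)
                  for s in term["non_preferred"]), default=0)
    if mentions(term["children"], title):
        score += CHILD_WEIGHT
    if mentions(term["related"], title):
        score += RELATED_WEIGHT
    return score


# Invented term for illustration.
term = {
    "preferred": "ecological networks",
    "non_preferred": ["habitat corridors"],
    "children": ["food webs"],
    "related": ["biodiversity"],
}
print(score_title(term, "Ecological Networks and Biodiversity"))  # 40 + 15 = 55
print(score_title(term, "Networks of ecological corridors"))      # proximity only: 30
```

Because every contribution is tied to a named rule and weight, a score can always be traced back to the evidence that produced it, which is the property the template approach relies on for auditability.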

The Rule and Template Editor provides a detailed answer to the question "what was it in the rulebase that caused this document to be tagged?" This gives the process transparency, auditability and control, allowing issues to be resolved much more quickly and easily than with statistical approaches. The screenshot shows a research note on Ecological Networks and the Rule and Template Editor interface, with the extracted text from this article and the evidence terms highlighted in the left-hand window, and the rule logic structure (again with the matched elements highlighted) in the right-hand window.

Semaphore employs 20 different types of rule, with many different control attributes, ways to describe expressions and different types of wildcard. Hence, Semaphore can create sophisticated classification rules that deliver very precise results. Different classification strategies are possible:

- A geography or company listing will use an Entity rulebase, which attempts to normalize terms, so if "USA" is mentioned, results will be returned for "United States of America" as the agreed label.
- Subject taxonomies tend to use About-ness rulebases, in which several evidence terms from the taxonomy have to combine in order for the agreed tag to be returned.
- A third type of rule is a Candidate rule, where a text zone is identified and returned, e.g. selecting the first thing that looks like a date field after the phrase "Report written on".
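The Entity and Candidate strategies described above can be illustrated with a short Python sketch. It is not Semaphore's rule syntax: the variant table, the date patterns and the function names are assumptions chosen for the example, and real entity rulebases would be generated from the taxonomy rather than hard-coded.

```python
import re

# Hypothetical entity vocabulary: surface variant -> agreed label.
ENTITY_VARIANTS = {
    "USA": "United States of America",
    "United States": "United States of America",
    "United States of America": "United States of America",
    "UK": "United Kingdom",
}

# Hypothetical date shapes: "12 March 2011", "2011-03-12", "12/03/2011".
DATE = re.compile(
    r"\b(\d{1,2} [A-Z][a-z]+ \d{4}|\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b"
)


def extract_entities(text):
    """Entity rule: return the agreed label for every variant found in the text."""
    found = set()
    for variant, label in ENTITY_VARIANTS.items():
        if re.search(r"\b" + re.escape(variant) + r"\b", text, flags=re.IGNORECASE):
            found.add(label)
    return sorted(found)


def candidate_date(text, phrase="Report written on"):
    """Candidate rule: first date-like string that follows the trigger phrase."""
    start = text.find(phrase)
    if start == -1:
        return None
    match = DATE.search(text, start + len(phrase))
    return match.group(1) if match else None


sample = "Report written on: 12 March 2011 by the USA office."
print(extract_entities(sample))  # variants normalized to the agreed label
print(candidate_date(sample))    # the date following the trigger phrase
```

The entity rule returns one agreed label no matter which variant appears, while the candidate rule returns text that is not in the model at all, which is why the two strategies suit different kinds of metadata.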

Semantic Enhancement Server

The Semantic Enhancement Server delivers taxonomy information to users when they interact with content. This works with the output of the content classification server to deliver an enhanced findability experience. A summary of the functions and features is listed below:

Function: Features

Concept Mapping
    In language there are often many names for the same thing. This creates a problem for search engines because content is usually written by subject matter experts who use technical language, whereas content is often searched for by people trying to learn about topics, who use layman's or common-use terms. Concept Mapping allows users to search using their own words. The taxonomy is used to identify the possible topics the user is interested in. By clicking on one of these topics users will get:
    - a more complete set of results, because content that uses different language will still be found
    - a more precise set of results, because the ambiguity of a user's query is removed
    - a more precise set of results, because the user chooses a narrower topic
    Concept Mapping also eliminates the need to use complex advanced search interfaces or for users to be familiar with large or complex taxonomies or navigational structures.

Topic Maps
    Searches often yield thousands of results which are hard to work through. Topic Maps show users how the results break down by listing the topics covered in the results and indicating how often each topic is represented. Users can easily and intuitively understand and then refine the results by selecting the topics they are interested in. Topic Maps are typically displayed as a hierarchy tree under each facet, but can be shown as flat lists. They are also often referred to as Faceted Search.

Best Bets
    Best Bets and Visual Best Bets provide links to a web page, document, application or other especially useful resource or contact for a selected topic.

Related Topics
    Related Topic Boxes identify for the user other topics that are likely to be interesting or informative given the topic the user is looking at. The user can click on any of these topics to search for content about that topic. Related topics can be used to identify experts that the user can contact for help, show resources of use, list product accessories and so forth. The related topics shown in the box are determined by and controlled through the taxonomy (using Associative Relationships), and so can be easily changed by a business user without software development effort.

Taxonomy Navigator
    The Taxonomy Navigator allows a user to view the taxonomy hierarchy and drill down through it to select a topic from the taxonomy. Once the topic is selected, all content for that topic is displayed.

A-Z Topic Lists
    The A-Z Topic List makes it easy for users to find the topic they are interested in. Users can click on a letter of the alphabet to see topics starting with that letter and then simply click on the desired topic to see results. The A-Z listing is generated automatically from the taxonomy. This makes it easier to manage which terms are included and also provides a search capability for when users look under the letter they believe will hold the right topic, but don't find it there.

Dynamic Summaries
    Dynamic Summaries enhance search results by highlighting the evidence found during the classification process; that is, the preferred term, synonyms and related topics are highlighted in the search results. In SharePoint Search only the preferred term is highlighted. Smartlogic is investigating ways to highlight synonyms as well.

Topic Path Browser
    The Topic Path Browser provides an excellent way for users to gain context for the topic they are looking at by viewing more abstract results or related topics. A Topic Path Box shows the user where the topic they are looking at fits within a taxonomy. Each parent topic is shown. Clicking on one of these topics allows a user to see results for that topic, which will be broader, more contextual and less detailed. Users can also view narrower topics for any of the topics shown in the browser, allowing them to go into more detail at any point.
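Three of the functions described above (Concept Mapping, Topic Maps and the Topic Path Browser) reduce to simple operations over a taxonomy and a set of tagged results. The Python sketch below illustrates those operations only; it is not the Semantic Enhancement Server API, and the taxonomy, synonyms, result records and function names are all invented for the example.

```python
from collections import Counter

# Hypothetical taxonomy: topic -> parent topic (None marks the root).
PARENT = {
    "Health": None,
    "Diseases": "Health",
    "Diabetes": "Diseases",
    "Type 2 Diabetes": "Diabetes",
}
# Hypothetical layman's terms mapped to taxonomy topics.
SYNONYMS = {
    "sugar disease": "Diabetes",
    "adult-onset diabetes": "Type 2 Diabetes",
}


def map_concepts(query):
    """Concept Mapping: suggest topics whose label or synonym occurs in the query."""
    q = query.lower()
    hits = {topic for topic in PARENT if topic.lower() in q}
    hits |= {topic for synonym, topic in SYNONYMS.items() if synonym in q}
    return sorted(hits)


def topic_map(results):
    """Topic Map: count how often each topic tag appears in the result set."""
    counts = Counter(tag for doc in results for tag in doc["tags"])
    # Most frequent first; ties broken alphabetically for stable display.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def topic_path(topic):
    """Topic Path: the chain of parents from the root down to `topic`."""
    path = []
    while topic is not None:
        path.append(topic)
        topic = PARENT[topic]
    return list(reversed(path))


# Invented, already-classified search results.
results = [
    {"title": "Managing blood sugar", "tags": ["Diabetes", "Type 2 Diabetes"]},
    {"title": "Insulin basics", "tags": ["Diabetes"]},
    {"title": "Chronic illness overview", "tags": ["Diseases"]},
]

print(map_concepts("adult-onset diabetes diet"))  # candidate topics for the user to pick
print(topic_map(results))                         # topic counts for faceted refinement
print(topic_path("Type 2 Diabetes"))              # breadcrumb from root to topic
```

In this sketch the ambiguous query maps to two candidate topics, and the user's choice between them is what removes the ambiguity before results are filtered.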

Search Application Framework

The Semaphore Search Application Framework, an out-of-the-box interface that encapsulates search best practices, is a collection of web services and tag libraries that combines core functionality like search results, relevance/date ranking, etc. with the advanced Semaphore capabilities of taxonomy or entity-based facet navigators, topic maps and filters, and more. The code is provided as a fully functioning application that can use one or more of the application processors that hook the interface into the native search workflow of several search engines.

As search interfaces have to overcome increasing volumes of content and more varied user journeys, visualization provides a compelling way to engage users with the power of the ontology to navigate and explore content. Various forms of visualization enable users to gain context, follow different paths through the content, zoom in, zoom out and pan around content. In doing so they quickly find resources and people in order to get answers, learn about topics and solve problems.

Diabetes Example Demonstrating the Search Interface

Semaphore Solutions

Semaphore has been integrated with a number of enterprise search applications and content management systems including, but not limited to, the Google Search Appliance (GSA), Apache Solr, Microsoft FAST ESP, Open Text Web Site Management Server (formerly RedDot CMS) and SharePoint (2007 and 2010), to add semantic processing capabilities to these systems.

Semaphore Solutions provide the following benefits:

- The highest quality of automatic subject classification available. Semaphore's unique rules-based classification is accurate, fast, easy to understand and controllable, and yet requires very little setup or training time.
- Comprehensive text analytics, including entity extraction and natural language processing tools, to deliver better performing taxonomies in a fraction of the time.
- Full lifecycle "develop, enhance and maintain" taxonomy management and governance, preventing the chaos caused when anyone has rights to modify shared taxonomies.
- Components that semantically enhance the user experience in any consuming application, for example to deliver the most powerful and flexible navigation along with the most accurate search results.

The reader is referred to the various product sheets for the individual solutions. At a glance, the following solutions are detailed here: Semaphore for FAST ESP, Semaphore for the Google Search Appliance and Semaphore for SharePoint.

a. Semaphore for FAST ESP

Semaphore Enterprise Semantic Platform is integrated with Microsoft's FAST Enterprise Search Platform (ESP). The diagram below illustrates how Semaphore is integrated with Microsoft FAST ESP: essentially, the Semaphore search components work alongside FAST's facet navigation.

Semaphore Integration with Microsoft FAST

Semaphore for FAST provides the following benefits:

- Improved precision and recall on search results by applying sophisticated classification routines to add metadata to the FAST index.
- Enhanced user search and navigation experience: when a user moves from free-text search to model-based navigation, the experience can not only be prompted and controlled but contextualized as well. Intelligent concept suggestions are made in navigation aids, and the user can drill up, down, or across the model and review the content that matches that concept.

b. Semaphore for Google Search Appliance

Semaphore Enterprise Semantic Platform is integrated with the Google Search Appliance's indexing pipeline, ensuring that all content indexed by the Google Search Appliance is passed to the Semaphore Classification Server for tagging. Semaphore for Google Search Appliance delivers better quality search results through taxonomies, ontologies and automated content classification. The addition of Semaphore provides the following benefits to GSA:

- More accurate and complete search results, with the best content at the top, by applying complex classification routines to add metadata to the GSA index.
- Improved findability by enhancing the Google enterprise search experience: users can filter and navigate topics, break down results, identify areas of interest, explore related topics and locate expertise and resources.

Semaphore Integration with Google Search Appliance

c. Semaphore for SharePoint

Semaphore is tightly integrated with SharePoint 2010, and Semaphore for SharePoint is a solution for inclusion in the Microsoft SharePoint farm. Semaphore for SharePoint addresses the various gaps in functionality of SharePoint. Semaphore is integrated with the SharePoint 2010 Term Store, as well as SharePoint Search and FAST for SharePoint. Semaphore for SharePoint provides the following benefits:

- Automatic, Assisted or Manual classification modes (often referred to as "tagging").
- Classification in multiple Western European and Arabic languages.
- Classification against many taxonomies or facets in a single pass.
- Classification of content in any list (e.g. document libraries, blogs, publishing sites, wikis and discussion threads).
- Support for ontologies, providing compelling and intuitive content navigation and visualization.
- Bulk classification support from the interface or the command line (STSADM or PowerShell).
- Semaphore Search Web Parts: taxonomy browser, concept expansion, related topic navigation, best bets (maintained in the taxonomy), and Faceted Search Filters (topic maps).
- Central Admin menu to deploy and configure Semaphore over required web applications, site collections, content types and libraries.
- Support for deployment via Templates and Feature Stapling.
- Developed, tested and in production in large, complex, organization-wide SharePoint deployments (tens of thousands of users, libraries with hundreds of thousands of documents).
- Use of native SharePoint coding and capabilities, such as Term Store integration.
- Enables effective migration, records retention, data loss prevention, and ediscovery.

Semaphore Integration with Microsoft SharePoint 2010

Summary: What Sets Semaphore Apart?

Full life-cycle Ontology Management

The Ontology Manager tool is not just for simple term hierarchies and thesauri. Powerful ontologies can be modeled and used to drive the classification and search navigation experiences. An intuitive interface and capabilities such as the ability to drag branches of existing taxonomies to form a new structure, to text mine content sets to find evidence terms, and to publish review copies for user feedback make its users highly productive and drive down the cost and effort of building and managing high-performance ontologies.

Support for principles and standards

Delivering semantic enhancement requires a powerful model: synonym management, linguistic attributes modeling the ambiguity and importance of terms, etc. Because of those requirements, Semaphore Ontology Manager implements ISO 2788 and ANSI/NISO Z39.19. Models exported in the zthes format, which is based on these standards, can be readily converted to SKOS (www.w3.org/2004/02/skos/) and vice versa.

Scalable, accurate, comprehensive rule-based classification

Rule-based classification is acknowledged to provide the highest level of precision, but this normally comes at a cost: the cost of creating a rulebase for each taxonomy term. Semaphore solves this issue by applying rulebase logic to all the evidence in the taxonomy (relationships, synonyms, the term names, etc.) and automatically publishing rulebases. The system must also be able to rapidly process large volumes of documents. Semaphore Classification Server applies many natural language processing stages to open, analyze and tag a document and has been tuned to do this efficiently, quickly and accurately against one or many large taxonomies. In addition, Semaphore Classification Server provides auditability and control while reducing the costs of implementing classification.

Scalable, responsive intelligent search enhancement

Adding ontology-based search facets, drill-up, drill-down and drill-across navigation, topic maps and concept searches has a huge impact on the end user's search and find experience over the core free-text search capabilities provided by enterprise search engines. However, enterprise search engines are developed and tuned to deliver results quickly. The ontology search enhancements cannot slow this process down, so Semaphore Semantic Enhancement Server is designed to scale with the search engine and to provide the performance to match the maximum queries per second.

Total governance of content

Semaphore classifies data based on context and automatically adds metadata tags that allow organizations to apply policies that determine how content is to be controlled for disposition, governance, compliance, data loss prevention, ediscovery, etc.

About Smartlogic

Smartlogic is a software company that specializes in semantics. Smartlogic's Semaphore is an Enterprise Semantic Platform that augments traditional information management systems like search, content management and business workflow engines by adding advanced content classification, metadata and navigation capabilities to deliver a more complete enterprise information management experience. More than 250 organizations, including NASA, Bank of America, AutoDesk, Intel, Oxy, UBS, Ford Foundation, Pitney Bowes, The National Health Service, RBS, The Office of Public Sector Information, Yell.com and others, use Smartlogic.

Smartlogic
560 S. Winchester Blvd, Suite 500, San Jose CA 95128
14 Greville St, London EC1N 8SB, UK
info@smartlogic.com; www.smartlogic.com

© 2011 Smartlogic. All rights reserved. Smartlogic and Semaphore are registered trademarks. All other products and service names mentioned herein are or may be registered trademarks or trademarks of their respective companies or organizations.