Text Analytics with Ambiverse Text to Knowledge
Version 1.0, February 2016
Contents

1. Ambiverse: Text to Knowledge
   1.1 Text is all Around
   1.2 Ambiverse: Leading research to industry
   1.3 Text to Knowledge
2. Named Entity Disambiguation
   2.1 What is it?
   2.2 Why is it Important?
   2.3 Why is it Challenging?
   2.4 Ambiverse Gives Meaning to Text
   2.5 Ambiverse & YAGO, a Powerful Combination
   2.6 Integrating Domain-specific Knowledge
   2.7 Ambiverse Text Analytics in Facts
3. Applications
   3.1 Ambiverse Search
   3.2 Ambiverse Analyze
   3.3 Ambiverse Write
   3.4 Personalized Text Analytics
1. Ambiverse: Text to Knowledge

1.1 Text is all Around

Most of the information produced by people, organizations, and public institutions is in the form of text. In 2014, 300 million new websites were created. Every year, 2 million blog posts are written, thousands of news sites around the globe publish articles, and millions of new updates are posted on social networks. In fact, most human interaction takes place via unstructured data (e.g., articles, reports, social network posts, ads, comments, and reviews). Companies and public institutions also regularly produce large quantities of internal documents. This vast amount of text goes beyond what is commonly understood as big data.

Textual information is not easy to interpret because it lacks a well-defined structure. To make use of it, machines need text understanding capabilities so that these huge collections of documents can be computationally analyzed and transformed into useful data.

It is increasingly understood that text analytics gives considerable leverage to companies, individuals, and public institutions. The text analytics market is expected to grow at an average rate of 25% per year. By 2013 only 1% of companies were processing their textual information, and that share is expected to increase dramatically in the coming years (Figure 1.1). In domains such as news, advertising, finance, and insurance, companies are starting to make sense of their textual data as a means of adding value to their businesses.

newsletter_turning_dark_data_into_smart_data.pdf
Figure 1.1: The use of text analytics will increase dramatically in the coming years. (Axis: % of companies using text analytics.)

1.2 Ambiverse: Leading research to industry

Ambiverse, a spin-off of the Max Planck Institute for Informatics, joins the new world of text analytics. Ambiverse develops technology to automatically understand, analyze, and manage large collections of textual data. Ambiverse is built on years of state-of-the-art research in text analytics. In 2015, Ambiverse received an EXIST Transfer of Research grant from the German Federal Ministry for Economic Affairs and the European Union.

1.3 Text to Knowledge

Our technology focuses on the recognition and disambiguation of named entities in text. It relies on years of scientific work at the Max Planck Institute for Informatics, a world-leading institution in automatic text understanding. Our named entity disambiguation technology was named the best system of its kind by IBM [5], and our corresponding scientific publications are among the most cited in the international automatic text understanding community [6, 7]. This cutting-edge technology gives Ambiverse an advantage in the text analytics world, allowing the development of a new generation of text analytics tools that transform textual information into machine-understandable knowledge.

[5] D. A. Ferrucci (2012). Introduction to This is Watson. IBM Journal of Research and Development.
[6] J. Hoffart et al. (2011). Robust Disambiguation of Named Entities in Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
[7] J. Hoffart et al. (2013). YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. Artificial Intelligence.
2. Named Entity Disambiguation

2.1 What is it?

A named entity, or simply entity, is a real-world object such as a person, an organization, a location, or a product. Named entity disambiguation is the task of automatically recognizing the names of these objects in text and identifying their real-world reference. For instance, in the sentence "Page played the hit Kashmir on his uniquely tuned Les Paul", our disambiguation system recognizes that the mention Page refers to the famous rock guitarist Jimmy Page and not to Larry Page, founder of Google, and that Les Paul refers to the guitar and not its designer (see Figure 2.1).

Figure 2.1: Selecting the correct entity for each mention: Jimmy Page, the song Kashmir, and a Les Paul guitar.
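As a minimal sketch of the task described above, the following Python toy example looks up candidate entities for each mention and picks the candidate whose descriptive keywords overlap most with the words of the sentence. The candidate dictionary, the keywords, and the overlap score are illustrative assumptions; this is not the method used by Ambiverse Text Analytics.

```python
# Toy named entity disambiguation by context-word overlap.
# CANDIDATES and its keywords are made up for illustration only.
CANDIDATES = {
    "Page": {
        "Jimmy Page": {"guitarist", "rock", "played", "hit", "tuned"},
        "Larry Page": {"google", "founder", "search", "company"},
    },
    "Les Paul": {
        "Gibson Les Paul (guitar)": {"guitar", "tuned", "played", "hit"},
        "Les Paul (person)": {"designer", "inventor", "born"},
    },
}


def disambiguate(sentence: str) -> dict:
    """Map each known mention in the sentence to its best-scoring candidate."""
    # Lowercased words of the sentence, stripped of punctuation.
    context = {word.strip(".,!?").lower() for word in sentence.split()}
    result = {}
    for mention, candidates in CANDIDATES.items():
        if mention in sentence:
            # Score = number of candidate keywords that occur in the sentence.
            result[mention] = max(
                candidates, key=lambda name: len(candidates[name] & context)
            )
    return result


sentence = "Page played the hit Kashmir on his uniquely tuned Les Paul"
print(disambiguate(sentence))
```

A real system would need far richer evidence than keyword overlap, but the sketch shows the two steps involved: generating candidates for a mention and choosing among them using context.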
2.2 Why is it Important?

Ambiguous entities are all around us. The variety of names is much smaller than one may think; there are more entities than names. Places are named after people, and people after other people. Places also tend to share names, as do people and products. In this context, knowing the real-world object behind a reference produces significant gains in text understanding capabilities. If we want to select or analyze documents mentioning the city of Paris in France, we first have to make sure that the mentions of Paris refer to the entity we are interested in and not, for instance, to the city of Paris in Texas. If we want to search efficiently for information about Larry Page, we have to make sure to exclude documents about Jimmy Page, another famous Page. Moreover, if companies want to analyze customer opinions about cars, they need to understand whether a tweet refers to the Jeep Wrangler or to Wrangler jeans ("I bought a Wrangler, and it is very comfortable", "I sell my brand new Wranglers"; Figure 2.2). Knowing the correct meaning of a name makes it possible to analyze and search large text collections more efficiently. Ambiverse developed a state-of-the-art technology to disambiguate entities and a set of applications around it for smart text analytics.

Figure 2.2: Ambiverse Text Analytics helps to identify the real enthusiastic fans. (Image from flickr (zombieite), CC-BY 2.0.)

2.3 Why is it Challenging?

Named entity mentions can be very ambiguous. The name Page alone can refer to hundreds of entities; for more ambiguous names like John, the potential candidates likely number in the thousands. A machine needs to resolve the meanings of all names in a single text while ensuring coherence among the entities (e.g., it is reasonable that Paris and France are simultaneously assigned to the French capital and the European country).
Naive approaches that simply enumerate all possible combinations quickly run into a brick wall. Even for a single sentence with three or four moderately ambiguous names, the number of combinations exceeds 100,000. For full documents, this becomes infeasible even for the fastest machines. Solving such a problem requires smart technologies such as the one we provide in Ambiverse Text Analytics.
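The arithmetic behind this explosion can be sketched in a few lines of Python. The candidate counts (500, 50, and 5) are the ones given for the example sentence in Figure 2.3; the function name is a hypothetical helper introduced for illustration.

```python
from math import prod

# Candidate counts per mention, taken from the example in Figure 2.3.
candidates_per_mention = {"Page": 500, "Kashmir": 50, "Les Paul": 5}


def joint_assignments(counts: dict) -> int:
    """Number of ways to pick one candidate entity for every mention."""
    return prod(counts.values())


total = joint_assignments(candidates_per_mention)
print(total)
```

Because the counts multiply, every additional ambiguous mention scales the search space by its number of candidates, which is why exhaustive enumeration does not carry over from single sentences to full documents.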
Figure 2.3: In the sentence "Page played the hit Kashmir on his uniquely tuned Les Paul." there are 500 possible Pages, 50 possible Kashmirs, and 5 possible Les Pauls, leading to 500 x 50 x 5 = 125,000 possible entity combinations.

2.4 Ambiverse Gives Meaning to Text

Ambiverse Text Analytics opens up a wide range of possibilities to manage and understand large text collections. Its main characteristic is the capability to understand the meaning of objects, detaching them from their textual representations. For instance, in the sentences "Page played Kashmir.", "Jimmy rocked the show at Knebworth!", and "James Patrick Page is one of the greatest guitarists of all time.", Ambiverse Text Analytics understands that Jimmy, Page, and James Patrick Page all refer to the same person (Figure 2.4). It understands real-world concepts in text regardless of how they are actually mentioned. This allows Ambiverse to develop a set of applications around the named entity disambiguation technology, changing the way in which text is stored, searched, analyzed, and produced.

Figure 2.4: Ambiverse Text Analytics understands that all sentences refer to the same Jimmy Page.

2.5 Ambiverse & YAGO, a Powerful Combination

All entities like Jimmy Page, Larry Page, Les Paul (person), and his self-named guitar are present in our YAGO knowledge base [Hof+13]. YAGO, which is derived from Wikipedia, can be thought of as a very large collection of entities. YAGO also contains accurate characterizations of all entities. It knows that Larry Page is a computer scientist, a corporate director, and a billionaire, that Google is a U.S. company, and that Jimmy Page is a guitarist and a musician. These characteristics of the entities are called categories or classes and are the key to developing useful applications around named entity disambiguation technology. An example of YAGO is shown in Figure 2.5.

Figure 2.5: Example of the knowledge stored in YAGO: the entities, their classes, and the relations between them. (Diagram labels: classes artifact, song, musician, guitar; relations subclass, type, created, plays, played at, happened in.)

2.6 Integrating Domain-specific Knowledge

The flexible architecture of Ambiverse Text Analytics allows the use of additional domain-specific entities. Other knowledge bases (e.g., a company-specific knowledge base or a product catalog) can be easily integrated into our system, or a user can concentrate on a specific slice of YAGO. This enables companies to focus on the entities of importance to them, like their products or customers. As a result, Ambiverse Text Analytics can be fully customized to the specific needs of our customers.

2.7 Ambiverse Text Analytics in Facts

Performance

The following numbers correspond to average-length news articles processed on a compute instance with 16 CPU cores and 32 GB of memory.

Documents per hour with high accuracy:
Documents per hour with highest accuracy:

The exact accuracy depends on the nature of the documents. An experimental evaluation on a large set of newswire documents [Hof+11] showed 80% accuracy for the high accuracy setting and 83% accuracy for the highest accuracy setting.
Languages

We currently support English and German, while the prototype research languages include Arabic, Chinese, Italian, and Spanish.

Knowledge Base

A brief comparison of the size of YAGO and other prominent openly available knowledge bases shows that YAGO is among the most comprehensive and precise. YAGO's distinct advantages are the clear semantic modelling of entities and especially the specific class hierarchy, ranging from very general categories like person to highly specific ones like British rhythm and blues boom musicians. Also, YAGO is the only knowledge base that has been evaluated in terms of accuracy [Hof+13].

                                 Entities       Classes        Accuracy
English YAGO3                    3.5 million    550 thousand   > 95%
Combined YAGO3 (10 languages)    4.6 million    570 thousand   > 95%
English DBpedia                  4.8 million    735            not evaluated
Combined DBpedia                 38.3 million   735            not evaluated

Table 2.1: Facts about the YAGO knowledge base

More details about YAGO are available at
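To illustrate how a class hierarchy of this kind can be used, here is a minimal Python sketch that stores entity types and subclass links and derives all classes of an entity by walking up the hierarchy. The tiny hierarchy is hand-made for illustration (for example, the link from guitarist to musician is an assumption); it is not an extract of YAGO, and the function names are hypothetical.

```python
# Hypothetical miniature of a YAGO-style class hierarchy (illustration only).
SUBCLASS_OF = {
    "guitarist": "musician",
    "musician": "person",
    "computer scientist": "person",
    "guitar": "artifact",
    "song": "artifact",
}

# Direct classes ("type" links) of a few example entities.
ENTITY_TYPES = {
    "Jimmy Page": {"guitarist"},
    "Larry Page": {"computer scientist"},
    "Kashmir (song)": {"song"},
}


def all_classes(entity: str) -> set:
    """Return the direct classes of an entity plus all their superclasses."""
    seen = set()
    stack = list(ENTITY_TYPES.get(entity, ()))
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            parent = SUBCLASS_OF.get(cls)
            if parent is not None:
                stack.append(parent)
    return seen


def entities_of(cls: str) -> list:
    """Return all entities that belong to a class, directly or via subclasses."""
    return sorted(e for e in ENTITY_TYPES if cls in all_classes(e))


print(all_classes("Jimmy Page"))
print(entities_of("person"))
```

Following the subclass links is what lets a query for a general class such as person also reach entities that are only typed with a more specific class.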
3. Applications

Ambiverse's cutting-edge text analysis technology allows the development of a whole range of next-generation applications to manage, search, analyze, and produce text.

3.1 Ambiverse Search

Searching for Entities

Traditional search engines take words or phrases as input and return a set of documents in which these words or phrases are likely to be relevant. They have a limited understanding of the user's intent in the sense that they do not give meaning to the input words; they only understand their form. For instance, they cannot tell whether the input word Paris refers to the city in France, to Paris Hilton, or to the mythological Greek character. Searching for Paris in a regular search engine will return documents where the word Paris appears, without distinguishing which Paris is meant. Documents referring to the city of Paris in France will probably be ranked at the top since it is the most popular entity. Users searching for less common Paris references must refine their input (e.g., "Paris Greece Troy"), which forces them to express their intention by incorporating extra knowledge (sometimes unavailable) into the input.

However, if the documents are first processed via Ambiverse Text Analytics (meaning that all entities in all documents have been previously identified), the user can search for the entities themselves, independently of how they are mentioned in the text and without any additional background knowledge. The user's intent is fully described by the input entity itself. For instance, the user can directly search for Paris Hilton and, no matter how she is referred to (e.g., Paris, Paris Hilton, Hilton's granddaughter), all documents in which she is mentioned will be retrieved (and properly ranked). All other documents in which other Paris occurrences appear (Paris, France; the Greek character; Paris, Texas) will be excluded. This type of ambiguity is more common than one may think, and it results in highly imprecise search results.
Ambiverse Search gives the user the capability to search for meaning or concepts in huge text collections, reaching more precise results by better interpreting the user's intent and abstracting meaning from textual forms. Out of the box, we provide search for 4.6 million entities, to which customer-specific entities can easily be added (see Section 2.6). Figures 3.1 and 3.2 provide an example of regular and smart search.

Figure 3.1: Searching for the word Prada is imprecise due to its ambiguity.

Figure 3.2: Searching for the company Prada gives precise results: ambiguities have been resolved.

Try your own examples in the prototype of Ambiverse Search at https://stics.mpi-inf.mpg.de.
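The difference between searching for a word and searching for an entity can be sketched with a small inverted index in Python. The documents, the entity identifiers, and the hand-written annotations are assumptions made for illustration; in a real pipeline the annotations would come from a disambiguation step, and this sketch does not reflect how Ambiverse Search is implemented.

```python
from collections import defaultdict

# Documents whose mentions are already resolved to (hypothetical) entity IDs.
DOCS = {
    "d1": {
        "text": "Paris wore Prada at the gala.",
        "entities": {"Paris_Hilton", "Prada_(company)"},
    },
    "d2": {
        "text": "Paris is the capital of France.",
        "entities": {"Paris_(France)", "France"},
    },
    "d3": {
        "text": "Hilton's granddaughter launched a perfume.",
        "entities": {"Paris_Hilton"},
    },
}


def build_entity_index(docs: dict) -> dict:
    """Build an inverted index from entity ID to the documents mentioning it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for entity in doc["entities"]:
            index[entity].add(doc_id)
    return index


def keyword_search(docs: dict, word: str) -> list:
    """Plain substring search: matches the surface form only."""
    return sorted(d for d, doc in docs.items() if word.lower() in doc["text"].lower())


index = build_entity_index(DOCS)
print(keyword_search(DOCS, "Paris"))
print(sorted(index["Paris_Hilton"]))
```

In this toy data, the keyword query returns the document about the French capital and misses the one that calls Paris Hilton "Hilton's granddaughter", while the entity query retrieves exactly the documents annotated with her.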
Searching for Categories: the Power of the YAGO Knowledge Base

As mentioned before, YAGO contains information about the categories of each entity. This allows us to add a new abstraction layer to our search, something impossible in traditional search engines. Instead of searching for a given entity, we can search directly for a category, so that a whole set of entities is covered by one query. For instance, we can search for fashion labels, and all the documents mentioning a fashion label (e.g., Prada, Gucci, Chanel) will be retrieved. We can also search for documents containing German soccer players (e.g., Schweinsteiger, Thomas Müller, Mesut Özil), Harvard alumni (e.g., Barack Obama, Ban Ki-Moon, Natalie Portman, Robert Solow), or any other category available in our knowledge base. The secret here is that Ambiverse Text Analytics is capable of identifying the entities in the text and our knowledge base knows the categories of those entities. Our knowledge base contains more than 570k categories.

Figure 3.3: Searching for the category high fashion brands finds documents on all fashion labels.

3.2 Ambiverse Analyze

Understanding entities in text enables a whole new range of text analytics tools. For instance, one can visualize the correlation over time between two companies, or even the correlation between a company and its sector. Ambiverse Analyze helps you understand how mentions of the fashion label Prada correlate with mentions of all fashion labels (Figure 3.4).

Try the prototype of Ambiverse Analyze at https://stics.mpi-inf.mpg.de/stats.
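One simple way to quantify such a relationship, once mentions have been counted per time period, is the Pearson correlation coefficient. The Python sketch below uses it as a stand-in; the daily counts are invented for illustration, and the source does not state which measure Ambiverse Analyze actually computes.

```python
from math import sqrt

# Hypothetical daily mention counts derived from entity-annotated documents.
prada_mentions = [3, 5, 2, 8, 6]
fashion_label_mentions = [30, 44, 25, 70, 52]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


print(round(pearson(prada_mentions, fashion_label_mentions), 3))
```

A value close to 1 would indicate that mentions of the single label rise and fall together with mentions of the whole category; a value near 0 would indicate no linear relationship.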
Figure 3.4: Ambiverse Analyze plots the trends of Prada against all other fashion labels.

3.3 Ambiverse Write

Understanding entities is also a key element in the production of intelligent texts. We developed Ambiverse Write, a smart authoring platform for intelligent text production: while the author is typing, entities are automatically recognized, relevant entities are suggested, and background information is provided on the fly. An author writing about fashion topics will get suggestions about fashion brands or designers, and background information about them, directly while typing.

Figure 3.5: Ambiverse Write allows authors to write texts and link entities at the same time.

Once the writing process has been completed, the text is ready for smart publishing: it is annotated with the correct entities and can be immediately integrated into Ambiverse Search and Analyze. This integration also enables Ambiverse to continuously improve the quality of its technology by incorporating user-specific annotations.
In the example shown in Figure 3.5, authors can gain a deeper understanding of the entities they are writing about without ever leaving the editor. Additionally, the links improve the reading experience, adding value to the article, encouraging readers to stay longer, and making the article a prominent reference.

Contact us for a demonstration of the prototype.

3.4 Personalized Text Analytics

Companies or even individual users usually have their own knowledge base or want to customize YAGO (e.g., they may be interested in only a part of it, or want to modify some entities or categories). We developed a framework that allows users to add their own entities to their specific knowledge base, making our disambiguation technology fully customizable for each particular user or organization. Ambiverse Text Analytics will then focus on the entities of interest to the user or adapt to the setting that the user considers most appropriate. The tool for augmenting an existing knowledge base is intuitive and simple to use. Users have several ways to easily generate their customized knowledge base without specific knowledge of our technology.

Contact us for a demonstration of the prototype.
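The idea of layering user-specific entities over a general knowledge base can be sketched in Python with a lookup that consults the custom entries first and falls back to the general ones. The dictionaries, the product name, and the function are hypothetical; the source does not describe how the Ambiverse framework stores or merges knowledge bases.

```python
from collections import ChainMap

# Hypothetical slice of a general knowledge base: name -> candidate entities.
general_kb = {
    "Wrangler": ["Jeep Wrangler", "Wrangler (jeans)"],
    "Page": ["Jimmy Page", "Larry Page"],
}

# Hypothetical customer-specific entries: a car analyst narrows "Wrangler"
# to the vehicle and adds a product that the general knowledge base lacks.
custom_kb = {
    "Wrangler": ["Jeep Wrangler"],
    "X200": ["Acme X200 (product)"],
}

# ChainMap searches custom_kb first and falls back to general_kb.
kb = ChainMap(custom_kb, general_kb)


def candidates(name: str) -> list:
    """Return the candidate entities for a name, preferring custom entries."""
    return kb.get(name, [])


print(candidates("Wrangler"))
print(candidates("Page"))
print(candidates("X200"))
```

Keeping the custom layer separate from the general one means a user's additions and overrides can be changed or removed without touching the shared knowledge base.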
References

[Hof+11] Johannes Hoffart et al. Robust Disambiguation of Named Entities in Text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2011.
[Hof+13] Johannes Hoffart et al. YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia. In: Artificial Intelligence 194 (2013).

Demos & Further Readings

Ambiverse Search: The prototype is available at https://stics.mpi-inf.mpg.de
Ambiverse Analyze: Explore the prototype at https://stics.mpi-inf.mpg.de/stats
YAGO: More details are available at

Ambiverse
Max Planck Institute for Informatics
Campus E Saarbrücken
Germany
Phone:
Fax: