AIDA-light: High-Throughput Named-Entity Disambiguation

Similar documents
Statistical Analyses of Named Entity Disambiguation Benchmarks

Text Analytics with Ambiverse. Text to Knowledge.

ONTOLOGIES A short tutorial with references to YAGO Cosmina CROITORU

Exploiting Linked Open Data as Background Knowledge in Data Mining

Big Text: from Names and Phrases to Entities and Relations

Mining the Web of Linked Data with RapidMiner

MEN'S FASHION UK Items are ranked in order of popularity.

Predicting stocks returns correlations based on unstructured data sources

Real Madrid brings the stadium closer to 450 million fans around the globe, with the Microsoft Cloud

Media and Photography

MLg. Big Data and Its Implication to Research Methodologies and Funding. Cornelia Caragea TARDIS November 7, Machine Learning Group

Tech Presentation 2016

SCAN: A Structural Clustering Algorithm for Networks

SpiderStore: A Native Main Memory Approach for Graph Storage

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Relational model. Relational model - practice. Relational Database Definitions 9/27/11. Relational model. Relational Database: Terminology

A Test Collection for Entity Linking

CIRGIRDISCO at RepLab2014 Reputation Dimension Task: Using Wikipedia Graph Structure for Classifying the Reputation Dimension of a Tweet

Social Media Usage In European Clubs Football Industry. Is Digital Reach Better Correlated With Sports Or Financial Performane?

Uncovering Plagiarism, Authorship, and Social Software Misuse

Keyphrase Extraction for Scholarly Big Data

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Big Data Methods for Computational Linguistics

Model of a flow in intersecting microchannels. Denis Semyonov

Gain Contextual Awareness for a Smarter Digital Enterprise with SAP HANA Vora

Automatic Knowledge Base Construction Systems. Dr. Daisy Zhe Wang CISE Department University of Florida September 3th 2014

Analyze This! Get Better Insight with Power BI for Office 365

High-Performance Scorecards. Best practices to build a winning formula every time

parent ROADMAP SUPPORTING YOUR CHILD IN GRADE FIVE ENGLISH LANGUAGE ARTS

Greetings from the Iron Bear Club

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Plato: A Selective Context Model for Entity Resolution

Taxonomic Data Integration from Multilingual Wikipedia Editions

Measuring and Monitoring the Quality of Master Data By Thomas Ravn and Martin Høedholt, November 2008

The Manuscript as Cultural Heritage: Digitisation ++

SMART InTeRneT OF ThIngS

Energy Efficient Systems

BIG DATA AND ANALYTICS

harmon.ie Delivers the Business Value of Office 365 Migrations

Finding Negative Key Phrases for Internet Advertising Campaigns using Wikipedia

Model-Based Requirements Engineering with AutoRAID

A Statistical Text Mining Method for Patent Analysis

Enhanced Information Access to Social Streams. Enhanced Word Clouds with Entity Grouping

IT based Knowledge Management. Abstract

WIKITOLOGY: A NOVEL HYBRID KNOWLEDGE BASE DERIVED FROM WIKIPEDIA. by Zareen Saba Syed

UNDERGRADUATE PROGRAMME SPECIFICATION

Personal Browsing History and B-hist Data Storage

Qlik s Associative Model

ONTOLOGY FOR MOBILE PHONE OPERATING SYSTEMS

Essential Graphics/Design Concepts for Non-Designers

GLOBAL TRANSFER MARKET REPORT 2015

FOOTBALL INVESTOR. Member s Guide 2014/15

Two new DB2 Web Query options expand Microsoft integration As printed in the September 2009 edition of the IBM Systems Magazine

Specops Command. Installation Guide

SAP S/4HANA Embedded Analytics

CORE Assessment Module Module Overview

the sporting index group

Cross-Language Information Retrieval by Domain Restriction using Web Directory Structure

Towards a Sales Assistant using a Product Knowledge Graph

Advertising Keywords Recommendation for Short-Text Web Pages Using Wikipedia

LEADS 360 Assessment Price List

Protein Protein Interaction Networks

The Alignment of Common Core and ACT s College and Career Readiness System. June 2010

Yet Another Triple Store Benchmark? Practical Experiences with Real-World Data

Extracting Publications and Citations from Scopus

SAP BW 7.4 Real-Time Replication using Operational Data Provisioning (ODP)

Transcription:

AIDA-light: High-Throughput Named-Entity Disambiguation Ba Dat Nguyen Johannes Hoffart Martin Theobald Gerhard Weikum Max-Planck-Institut für Informatik Saarbrücken, Germany 1

2 / 25 Overview Named Entity Disambiguation High-performance Accurate Entity Disambiguation Simplifying Expensive Features Categories and Domains Multi-phase Computation Experiments 2

Named Entity Disambiguation ` (NED) NED aims to map mentions of ambiguous names in natural language onto a set of known entities (e.g. YAGO or DBpedia). Text & Mentions correct entities Under Fergie, United won the Premier League title 13 times. Fergie_(singer), an American singer, songwriter, fashion designer, television host and actress. Alex_Ferguson, a former Scottish football manager of Manchester United F.C. Sarah, Duchess_of_York, the former wife of Prince Andrew, Duke of York.... United_Airlines, an American major airline. United_Airways, a Bangladeshi airline. Manchester_United_F.C., an English professional football club.... Premier League, the English professional football league.... 3

4 / 25 State-of-the-art NED Systems Accurate Systems: AIDA and Illinois Wikifier: use rich contextual features (and joint inference) emphasis on quality. High-performance Systems: DBpedia Spotlight and TagMe: mention-by-mention inference with more lightweight features emphasis on speed. 4

5 / 25 AIDA-light Goal: reconcile efficiency and accuracy. Approach: simplify expensive features. add novel features with low footprint. multi-phase computation. 5

Joint Inference over Disambiguation ` Graph Construct an undirected weighted graph between mentions and entities. Compute the best joint mapping sub-graph. Mentions Entities 6

7 / 25 Simplify Expensive Features Key-phrases (AIDA): link anchor texts including categories, citation titles, and external references. Key-tokens: extracted from all key-phrases except stop words. Example: AIDA key-phrases: U.S. President, President of the U.S. AIDA-light key-tokens: President, U.S. 7

8 / 25 Categories and Domains Entity, Categories and Domains For example: Domain Hierarchy Entity:Premier_League Category:Football_Leagues Domain:Football Domain-Entity Coherence A entity belongs to a domain if it belongs to at least one category of the domain recompute the mention-entity edge s weight under the domain. Entity-Entity Coherence connect entities from the same domain give higher weight to same-domain entity-entity coherence edges. 8

9 / 25 e Multi-phase Computation Easy mentions: mentions with very few candidates or with skewed distributions. Update the context by chosen entities (with domains). Better understanding of the context. Reduce the complexity of the later stages. Mentions: Fergie, United, Premier League Mention:Premier League Entity:Premier_League Domain:Football Domain:Football Fergie United Alex Ferguson Fergie Singer Sarah Manchester Duchess of York United F.C. 9

10 / 25 Experimental Setup Systems under comparison: AIDA-light AIDA DBpedia Spotlight Performance measures: All systems take the same mentions as the input. Each mention is mapped to one entity in DBpedia YAGO. Mapping a mention of in-kb entity to null is a failure. We apply per-mention precision. 10

11 / 25 Experimental Corpora CoNLL-YAGO testb: news articles with long-tail entities. WP: short contexts with highly ambiguous mentions and long-tail entities. Wikipedia articles: Wikipedia articles with internal links as mentions. Wiki-links: long documents with a few mentions. 11

12 / 25 Results on NED Quality Precision on different corpora, statistically significant improvements over Spotlight are marked with an asterisk. 12

13 / 25 Results on Run-time Average per-document run-time results. AIDA uses a SQL database, not considered here. 13

14 / 25 Conclusion A high-performance accurate NED system First method to consider domain coherence. Judicious choice of high benefit/cost features. Experiments: AIDA-light as good as rich-feature systems. as efficient as fastest systems. 14

AIDA-light source code is available to download at https://www.mpi-inf.mpg.de/yago-naga/aida/ Thanks! 15