Real-Time Identification of MWE Candidates in Databases from the BNC and the Web



Similar documents
Interactive Dynamic Information Extraction

Corpus and Discourse. The Web As Corpus. Theory and Practice MARISTELLA GATTO LONDON NEW DELHI NEW YORK SYDNEY

Search and Information Retrieval

Simple maths for keywords

CONCEPTCLASSIFIER FOR SHAREPOINT

On the Fly Query Segmentation Using Snippets

Collecting Polish German Parallel Corpora in the Internet

Unifying Search for the Desktop, the Enterprise and the Web

Term extraction for user profiling: evaluation by the user

Vendor briefing Business Intelligence and Analytics Platforms Gartner 15 capabilities

Query term suggestion in academic search

ifinder ENTERPRISE SEARCH

QUT Digital Repository:

From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files

Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project

MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts

BUSINESS INTELLIGENCE

Flattening Enterprise Knowledge

Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia

Micro blogs Oriented Word Segmentation System

MarkLogic Enterprise Data Layer

Special Topics in Computer Science

Voice Driven Animation System

Spam Filtering using Naïve Bayesian Classification

Performance Management Platform

Text Analytics. A business guide

In Memory Accelerator for MongoDB

BIRT Document Transform

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013

CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.

3 Paraphrase Acquisition. 3.1 Overview. 2 Prior Work

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告

Learn to Personalized Image Search from the Photo Sharing Websites

Word Completion and Prediction in Hebrew

Optimizing Multilingual Search With Solr

INSIGHT NAV. White Paper

Terminology Extraction from Log Files

Comprendium Translator System Overview

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统

Terminology Extraction from Log Files

Machine Translation. Agenda

CLOUD ANALYTICS: Empowering the Army Intelligence Core Analytic Enterprise

AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom

Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1

Electronic Document Management Using Inverted Files System

Complex, true real-time analytics on massive, changing datasets.

A Java proxy for MS SQL Server Reporting Services

Using Microsoft Business Intelligence Dashboards and Reports in the Federal Government

Delivering the power of the world s most successful genomics platform

About This Document 3. About the Migration Process 4. Requirements and Prerequisites 5. Requirements... 5 Prerequisites... 5

KnowledgeSTUDIO HIGH-PERFORMANCE PREDICTIVE ANALYTICS USING ADVANCED MODELING TECHNIQUES

Get the most value from your surveys with text analysis

opencrx Enterprise Class Open Source CRM

A Monitored Student Testing Application Using Cloud Computing

Document Management Server - Overview

One Approach of e-learning Platform Customization for Primary Education

A Corpus-Based Tool for Exploring Domain-Specific Collocations in English

Business Value Reporting and Analytics

Domain Adaptive Relation Extraction for Big Text Data Analytics. Feiyu Xu

In-Database Analytics

4D and SQL Server: Powerful Flexibility

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

TRENDS IN THE DEVELOPMENT OF BUSINESS INTELLIGENCE SYSTEMS

COURSE SYLLABUS EDG 6931: Designing Integrated Media Environments 2 Educational Technology Program University of Florida

Annotated Corpora in the Cloud: Free Storage and Free Delivery

Microsoft Dynamics NAV

Web Archiving and Scholarly Use of Web Archives

Data Deduplication in Slovak Corpora

ElegantJ BI. White Paper. The Enterprise Option Reporting Tools vs. Business Intelligence

Search Engine Submission

Learning Translations of Named-Entity Phrases from Parallel Corpora

Microsoft Dynamics NAV

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

Introduction to Hadoop

Sketch Engine. Sketch Engine. SRDANOVIĆ ERJAVEC Irena, Web 1 Word Sketch Thesaurus Sketch Difference Sketch Engine

Pre-processing of Bilingual Corpora for Mandarin-English EBMT

The Online Grade Book A Case Study in Learning about Object-Oriented Database Technology

Tagetik Extends Customer Value with SQL Server 2012

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Seamless Web Data Entry for SAS Applications D.J. Penix, Pinnacle Solutions, Indianapolis, IN

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Giuseppe Riccardi, Marco Ronchetti. University of Trento

Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov [Folie 1]

Microsoft Dynamics NAV

Customer Insight Appliance. Enabling retailers to understand and serve their customer

Banking Industry Performance Management

Customizing an English-Korean Machine Translation System for Patent Translation *

Analyzing survey text: a brief overview

RingStor User Manual. Version 2.1 Last Update on September 17th, RingStor, Inc. 197 Route 18 South, Ste 3000 East Brunswick, NJ

T14 RUMatricula Phase II. Section 1 Metaphor and requirements

Transcription:

Real-Time Identification of MWE Candidates in Databases from the BNC and the Web Identifying and Researching Multi-Word Units British Association for Applied Linguistics Corpus Linguistics SIG Oxford Text Archive Oxford 21 April 2005 William H. Fletcher United States Naval Academy (2004-05 Radboud University of Nijmegen) http://pie.usna.edu http://kwicfinder.com

Objectives of Presentation Describe background and biases Define key terms elastically Outline my software applications Sketch range of uses and target audiences envisioned Show and compare MRR and MI Encourage feedback and suggestions for further development 2

Background and Biases 1 of 4 Multimedia in CALL user (interface) KWiCFinder to Identify useful texts Find examples of actual use for teaching and writing Clarify linguistic questions Explore emerging semantic fields build web-based ad-hoc corpora download free from KWiCFinder.com 3

Background and Biases 2 of 4 kfngram n-grams, phrase-frames free, flexible, GUI; fast even on large datasets (20MW) Phrases in English website n-grams (n = 1 8) phrase-frames: set of n-gram variants identical in all but one word PoS-grams: set of n-gram variants with the same sequence of PoS tags chargrams now BNC; sub-corpora, MICASE and ANS to follow 4

Background and Biases 3 of 4 Web as Corpus Search Engine Consortium Initiative of Silvia Bernardini and Marco Baroni, University of Bologna, Forlì Other WAC enthusiasts: Mr. Collocations Stefan Evert, Sebastian Hoffmann, Adam Kilgarriff and myself Initial goal: gigaword Web corpus (800M English, 100M each German and Italian) 5

Background and Biases 4 of 4 Emphasis on the practical: reasonable speed, acceptable precision and recall Motivations on-the-fly subcorpora for PIE kfngramdb overcome kfngram limitations: static lists, straight frequency managing KWiCFinder ad-hoc Web corpora better integration all three tools 6

Objective Evaluate and compare statistical techniques to identify MWE candidates for corpus database queries for MWEs with specific lexical items subsequent screening, either manually or with processing-intensive metrics deemed more effective than those used here 7

Terms MWE cover-term for multi-word units, salient collocations, formulaic expressions Real-time / on-the-fly with tolerable delay Scalable from kilo- to mega- and gigacorpora In practice real time for (sub)corpora 25 MW 8

Target Audience kfngramdb (Corpus) linguists compare subcorpora in large linguistic databases identify content domain and text-type for Web as corpus learn database principles by example on PC Language professionals teachers, advanced language learners: readings, instructional materials, examples; identify (MW)Es writers (L2 / L1): organize / maintain exemplars for imitatio, personal corpus and reference materials translators: domain-specific parallel / comparable corpora, possibly compiled ad-hoc from Web sources 9

Relational databases (RDMS) Why? 1 of 3 organize linguistic data, rapid retrieval sophisticated queries relating the content of one field or table to others filter / focus results by relevant criteria dynamic interactive datasets, not static list standard query language SQL: skills transfer to other RDMS several powerful RDMS systems are free multi-platform (develop on PC, deploy on *nix) 10

Relational databases Which? 2 of 3 Microsoft Access + wizards easy to learn + produces SQL queries adaptable to other RDMS + excellent front-end to other RDMSs (e.g. MySQL) Windows only (MS Office Pro Suite) 11

Relational databases Which? 3 of 3 MySQL +free, fast, scalable +tight integration with PHP for Web interface ±powerful non-standard SQL extensions +active development, large, helpful user base +user-defined C functions callable in queries (e.g. to calculate lexical association metrics) +embeddable in other applications +multiple platform 12

Which Lexical Association Metric? Gravity Counts for the boundaries of collocations * Compares Mutual Information, T-score, Dice, Gravity Counts Gravity Counts take larger context into account most useful for identifying collocation boundaries but data processing intensive * Daudaravičius, Vidas and Rūta Marcinkevičienė, International Journal of Corpus Linguistics, 9:2 (2004), 321-348. 13

Mutual Rank Ratio 1 of 4 Paul Deane, Educational Testing Service, A Nonparametric Method for Extraction of Candidate Phrasal terms, Association for Computational Linguistics 2005. lexical association metric for knowledge-free extraction of phrasal terms, identification of MWUs in untagged text Based on ratio of global to local shared ranks Performance similar or superior to other metrics identifying 2- and 3-grams in WordNet when n-grams including the top 160 ranked types are excluded 14

Mutual Rank Ratio 2 of 4 shared rank: tied items assigned same rank e.g. items 10-15 all have frequency 512 shared rank is (10 + 15) / 2 = 12.5 next item ranked 16 (higher if shared) global and local rank United Kingdom local rank of a specific n-gram (LR) united kingdom global rank of phrase-frames of which n-gram is a variant (GR) * kingdom (the k., animal k., his k. united k.) united * (u. kingdom, u. states, u. nations, u. distillers ) 15

Mutual Rank Ratio 3 of 4 Formula MRR = (GR united * GR * kingdom ) 1/2 LR united kingdom nth root of product of all Global (phrase-frame) Ranks divided by Local (n-gram) Rank 16

Mutual Rank Ratio Pros & Cons 4 of 4 + Easy to calculate, especially if n-grams and phrase-frames are already known (PIE, kfngram) + Finds MWUs in untagged text >= others* + Weighting reflects Zipfian distribution - Excludes MWUs - with top types (state of the art, matter of principle) - not in phrase-frames ( singletons ) * if most frequent types excluded 17

Mutual Information 1 of 2 Popular metric for finding rare word pairs Formula (after D & M) MI(x,y) = log 2 (N f(x,y) / f(x) f(y)) N corpus size f(x,y) frequency of co-occurrence f(x), f(y) total frequency in corpus Calculated for pairs of words with frequency rank > 150, span 2-4 words; n-grams with these pairs retrieved (could include state of the art, matter of principle) 18

Mutual Information Pros & Cons 2 of 2 +Straightforward calculation with parameters needed for some other metrics +Finds elusive items, including singletons +Complements MRR - Strong bias toward the infrequent: Two co-occurring rare words will show a high score, but two co-occurring frequent words will show a low score. (D & M 325) 19

1 of 3 MRR and MI Compared Minimal overlap in MWEs (top 500 items < 20% shared; ranking very different) Complementary: both identify different sets of interesting MWE candidates Both calculation on-the-fly in series of SQL queries alone impractical / intractable on PC for corpora > 5MW hybrid approach with programmatic math faster, more scalable 20

2 of 3 MRR and MI Compared Top-ranked 500 n-grams by MRR but not by MI by MI but not by MRR by both (<20% of total) in MICASE (MIchigan Corpus of Academic Spoken English, 1.8 MW) EuroParl (European Parliament transcripts, 500 KW) Click for word lists 21

3 of 3 MRR and Singletons In a large tagged corpus (BNC), Mutual Rank Ratio strands many MWE singletons, n-grams lacking a phrase-frame for at least one wildword position Frequent singletons should be reviewed for potential MWEs Singletons less problematic for smaller untagged corpora Click for word lists 22

Toward Gigacorpora Today s RDMSs excel at locating and relating millions of records, but do not scale well into the billions Search engine technology points the way Doug Cutting s Lucene open source text indexer (Java) handles large plain-text collections Hybrid approach Lucene to locate documents / passages RDMS to manage text metadata, markup 23

Real-Time Identification of MWE Candidates Reactions and suggestions encouraged. http://www.kwicfinder.com/ http://pie.usna.edu fletcher@usna.edu 24