Multilingual Term Extraction as a Service from Acrolinx. Ben Gottesman Michael Klemme Acrolinx CHAT2013


Definitions
term extraction: automatically identifying potential terms in a document (corpus)
multilingual term extraction: automatically identifying potential terms and their translations in a document and its translation (parallel corpus / translation memory)
Example:
  The wizard begins creating the bootable image.
  Der Assistent beginnt mit der Erstellung des bootfähigen Image.
(or, if the source-language terminology already exists, just identify translations)

Synonyms
Identify same-language synonyms via translations in common
German:
  Die Spannungsversorgung für die Elektronik wird vom Speisegerät G526 sichergestellt.
  Spannungsversorgung für interne Speisung (X3e)
  Unterspannung in der Stromversorgung
English:
  The voltage supply for the electronics is maintained by the power supply unit G526.
  Power supply for internal supply (X3e)
  Undervoltage in the power supply
Terms linked in these segments:
  Spannungsversorgung: voltage supply, power supply
  Stromversorgung: power supply
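The idea of finding synonyms through shared translations can be sketched in a few lines of Python. This is only an illustration, not the Acrolinx implementation: the function name `synonym_candidates` is hypothetical, and the term pairs are read off the example segments on this slide.

```python
from collections import defaultdict
from itertools import combinations

# (source term, translation) pairs read off the example segments above.
term_pairs = [
    ("Spannungsversorgung", "voltage supply"),
    ("Spannungsversorgung", "power supply"),
    ("Stromversorgung", "power supply"),
]


def synonym_candidates(pairs):
    """Map each pair of source terms to the translations they share.

    Hypothetical helper: two source terms become synonym candidates
    as soon as they have at least one translation in common.
    """
    by_translation = defaultdict(set)
    for source, target in pairs:
        by_translation[target].add(source)

    candidates = {}
    for target, sources in by_translation.items():
        for a, b in combinations(sorted(sources), 2):
            candidates.setdefault((a, b), set()).add(target)
    return candidates


print(synonym_candidates(term_pairs))
```

Here the only candidate pair is Spannungsversorgung / Stromversorgung, linked through "power supply". A human still has to confirm the link.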

Outline
What is multilingual term extraction?
What is the workflow from customer perspective?
  customer use case examples
  show extraction results, demonstrate human validation
How does the extraction work?
  how we identify candidates
    source-language candidates
    translation candidates
  how we filter translation candidates
  how we identify source-language synonyms
What is Acrolinx and how does MTE fit in?

Workflow: Customer perspective
1. Customer provides translated documents
2. Acrolinx provides extracted multilingual term candidates to customer
3. Customer validates candidates
4. Validated results become (or are added to) customer's term bank

Customer use cases, past examples
Use case 1
  de-<en,fr,es,it,pt> (mostly de-en)
  ~142,000 bilingual segments; ~2,685,000 tokens (total)
Use case 2
  de-<en,fr> (all data trilingual)
  ~132,000 bilingual segments; ~1,259,000 tokens
  data document-aligned, not segment-aligned, so extra step required
Use case 3
  en-de
  ~942,000 bilingual segments; ~25,000,000 tokens
  extract translations of a given list of keywords
  determine which keywords don't occur in data

Results: human validation in Excel
Baugruppe has been translated inconsistently into English in the past.
Mark respective translations as preferred/deprecated to guide translators in the future.

Results
Stromversorgung and Einspeisung have translations in common.
  automatically identified as possible synonyms, so same Cluster ID
To validate synonym link, edit Subcluster IDs to be the same.
Mark respective variants as preferred/deprecated to guide authors.

How does the extraction work?
Extract source-language term candidates from source-language text (unless source-language terminology exists)
  Example: The wizard begins creating the bootable image.
  linguistics-based, especially part-of-speech patterns
  same functionality built into the core Acrolinx product
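Part-of-speech patterns for term candidates can be illustrated with a minimal sketch. The pattern used here (any number of adjectives followed by one or more nouns) and the hand-assigned tags are assumptions for illustration. The slides do not say which patterns or which tagger Acrolinx uses.

```python
# Hand-tagged example sentence; in practice the tags would come from a POS tagger.
tagged = [
    ("The", "DET"), ("wizard", "NOUN"), ("begins", "VERB"),
    ("creating", "VERB"), ("the", "DET"), ("bootable", "ADJ"),
    ("image", "NOUN"), (".", "PUNCT"),
]


def term_candidates(tagged_tokens):
    """Collect token runs matching the assumed pattern ADJ* NOUN+."""
    candidates = []
    i, n = 0, len(tagged_tokens)
    while i < n:
        j = i
        while j < n and tagged_tokens[j][1] == "ADJ":
            j += 1  # optional leading adjectives
        k = j
        while k < n and tagged_tokens[k][1] == "NOUN":
            k += 1  # one or more nouns
        if k > j:
            candidates.append(" ".join(word for word, _ in tagged_tokens[i:k]))
            i = k
        else:
            i = max(j, i + 1)  # no noun here: skip ahead
    return candidates


print(term_candidates(tagged))
```

On the example sentence this yields "wizard" and "bootable image" as candidates.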

How does the extraction work?
Extract translation candidates of each source-language term candidate from target-language text
  The wizard begins creating the bootable image.
  Der Assistent beginnt mit der Erstellung des bootfähigen Image.
  use statistical phrase-alignment technology
  same used in statistical machine translation
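As a rough illustration of phrase alignment, the sketch below looks up the target phrase for a source term span, given word-alignment links. The links are written by hand for the example sentence; a statistical aligner would normally learn them from the parallel corpus. The function `aligned_phrase` and the consistency check (the rule commonly used for phrase extraction in statistical MT) are assumptions, not a description of the Acrolinx service.

```python
src = "The wizard begins creating the bootable image".split()
tgt = "Der Assistent beginnt mit der Erstellung des bootfähigen Image".split()

# Hand-made (source index, target index) links; "mit" and "der" stay unaligned.
alignment = [(0, 0), (1, 1), (2, 2), (3, 5), (4, 6), (5, 7), (6, 8)]


def aligned_phrase(src_span, links, tgt_tokens):
    """Return the target phrase for an inclusive source span, or None.

    The phrase is rejected if any target word inside it is linked
    to a source word outside the span (consistency check).
    """
    start, end = src_span
    tgt_idx = [t for s, t in links if start <= s <= end]
    if not tgt_idx:
        return None
    lo, hi = min(tgt_idx), max(tgt_idx)
    for s, t in links:
        if lo <= t <= hi and not (start <= s <= end):
            return None
    return " ".join(tgt_tokens[lo:hi + 1])


print(aligned_phrase((5, 6), alignment, tgt))  # span covering "bootable image"
```

For the span "bootable image" the sketch returns "bootfähigen Image".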

How does the extraction work?
Filter translation candidates
[Figure: translation candidates for Eingangsspannung (pink = filtered out)]
based on:
  confidence score calculated from translation probabilities
    can adjust threshold to favour precision or recall
  surface characteristics (closed-class words, punctuation)
  term-candidacy of translation (if possible for language)
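The filtering criteria can be sketched as a simple predicate. Everything concrete here is assumed for illustration: the scores, the threshold value, the closed-class word list, and the rule that a candidate is rejected if it starts or ends with a closed-class word. The slides name the criteria but not their exact form.

```python
import string

# Illustrative closed-class list and punctuation set (hyphens are allowed).
CLOSED_CLASS = {"the", "a", "an", "of", "for", "and", "in", "to"}
PUNCTUATION = set(string.punctuation) - {"-"}

# Invented candidates and confidence scores for Eingangsspannung.
candidates = [
    ("input voltage", 0.62),
    ("the input voltage", 0.21),
    ("voltage", 0.04),
    ("input voltage ,", 0.15),
]


def keep_candidate(text, score, threshold):
    """Apply the confidence threshold, then the surface checks."""
    tokens = text.split()
    if not tokens or score < threshold:
        return False
    if tokens[0].lower() in CLOSED_CLASS or tokens[-1].lower() in CLOSED_CLASS:
        return False
    if any(ch in PUNCTUATION for ch in text):
        return False
    return True


def filter_candidates(items, threshold=0.1):
    return [text for text, score in items if keep_candidate(text, score, threshold)]


print(filter_candidates(candidates, threshold=0.1))
print(filter_candidates(candidates, threshold=0.03))
```

Lowering the threshold from 0.1 to 0.03 lets the low-scoring "voltage" through as well, which trades precision for recall.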

How does the extraction work?
Identify synonyms (cluster candidates)
[Figures: cluster around Stromwandler, shown at minimum link confidence thresholds of 0.01 and 0.03]
  link confidence based on the degree to which translations are shared
  can adjust threshold to favour precision or recall of links
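The effect of the link confidence threshold on clustering can be shown with a small sketch. The terms linked to Stromwandler and their confidence values are invented, and treating clusters as connected components over the surviving links is an assumption about how such clustering might work.

```python
# Invented links: (term A, term B, link confidence).
links = [
    ("Stromwandler", "Wandler", 0.05),
    ("Stromwandler", "Transformator", 0.02),
    ("Transformator", "Trafo", 0.04),
]


def cluster_terms(term_links, threshold):
    """Group terms into connected components over links >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, confidence in term_links:
        root_a, root_b = find(a), find(b)
        if confidence >= threshold:
            parent[root_a] = root_b

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return sorted(sorted(group) for group in groups.values())


print(cluster_terms(links, 0.01))  # low threshold: one large cluster
print(cluster_terms(links, 0.03))  # higher threshold: the weak link is dropped
```

At 0.01 all four terms fall into one cluster. At 0.03 the weak link between Stromwandler and Transformator is dropped and the cluster splits in two, which favours precision of links over recall.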

What is Acrolinx?
Acrolinx is Content Optimization Software. It helps authors make their text more correct, more consistent, and more readable.
Consistent use of terminology is an important factor in the readability of text.
Acrolinx provides:
  term extraction (monolingual, aka term harvesting)
  terminology management
  term checking
Multilingual Term Extraction as a Service is a natural complement to these terminology functions.

Acrolinx @ tekom Visit Acrolinx at tekom! Hall 3, Stand 310

Questions?