Multilingual Term Extraction as a Service from Acrolinx
Ben Gottesman, Michael Klemme (Acrolinx)
CHAT2013
Definitions
- term extraction: automatically identifying potential terms in a document (corpus)
- multilingual term extraction: automatically identifying potential terms and their translations in a document and its translation (parallel corpus / translation memory); or, if the source-language terminology already exists, just identifying the translations
Example segment pair:
  The wizard begins creating the bootable image.
  Der Assistent beginnt mit der Erstellung des bootfähigen Image.
Synonyms
Identify same-language synonyms via translations in common.
German:
- Die Spannungsversorgung für die Elektronik wird vom Speisegerät G526 sichergestellt.
- Spannungsversorgung für interne Speisung (X3e)
- Unterspannung in der Stromversorgung
English:
- The voltage supply for the electronics is maintained by the power supply unit G526.
- Power supply for internal supply (X3e)
- Undervoltage in the power supply
Spannungsversorgung is translated as both "voltage supply" and "power supply"; Stromversorgung is translated as "power supply". The shared translation marks Spannungsversorgung and Stromversorgung as possible synonyms.
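The idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Acrolinx implementation: the term pairs are read by hand off the three example segments above, and the function name `synonym_candidates` is made up for this sketch.

```python
from collections import defaultdict
from itertools import combinations

# (German term, English translation) pairs read off the example segments.
term_pairs = [
    ("Spannungsversorgung", "voltage supply"),
    ("Spannungsversorgung", "power supply"),
    ("Stromversorgung", "power supply"),
]

def synonym_candidates(pairs):
    """Map each pair of source terms to the translations they have in common."""
    # Invert the pairs: translation -> set of source terms translated that way.
    sources_by_translation = defaultdict(set)
    for source, target in pairs:
        sources_by_translation[target].add(source)
    # Any two source terms listed under the same translation are candidates.
    shared = defaultdict(set)
    for target, sources in sources_by_translation.items():
        for a, b in combinations(sorted(sources), 2):
            shared[(a, b)].add(target)
    return dict(shared)

result = synonym_candidates(term_pairs)
print(result)
```

Here the only pair found is Spannungsversorgung / Stromversorgung, linked through "power supply".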
Outline
- What is multilingual term extraction?
- What is the workflow from customer perspective?
  - customer use case examples
  - show extraction results, demonstrate human validation
- How does the extraction work?
  - how we identify candidates
    - source-language candidates
    - translation candidates
  - how we filter translation candidates
  - how we identify source-language synonyms
- What is Acrolinx and how does MTE fit in?
Workflow: Customer perspective
1. Customer provides translated documents.
2. Acrolinx provides extracted multilingual term candidates to customer.
3. Customer validates candidates.
4. Validated results become (or are added to) customer's term bank.
Customer use cases, past examples
Use case 1
- de-<en,fr,es,it,pt> (mostly de-en)
- ~142,000 bilingual segments; ~2,685,000 tokens (total)
Use case 2
- de-<en,fr> (all data trilingual)
- ~132,000 bilingual segments; ~1,259,000 tokens
- data document-aligned, not segment-aligned, so extra step required
Use case 3
- en-de
- ~942,000 bilingual segments; ~25,000,000 tokens
- extract translations of a given list of keywords
- determine which keywords don't occur in data
Results: human validation in Excel
- "Baugruppe" has been translated inconsistently into English in the past.
- Mark respective translations as preferred/deprecated to guide translators in the future.
Results
- "Stromversorgung" and "Einspeisung" have translations in common.
- They are automatically identified as possible synonyms, so they get the same Cluster ID.
- To validate the synonym link, edit the Subcluster IDs to be the same.
- Mark respective variants as preferred/deprecated to guide authors.
How does the extraction work?
Extract source-language term candidates from source-language text (unless source-language terminology exists)
Example: The wizard begins creating the bootable image.
- linguistics-based, especially part-of-speech patterns
- same functionality built into the core Acrolinx product
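As a rough illustration of what part-of-speech patterns can look like, the sketch below matches one pattern, "zero or more adjectives or nouns followed by a noun", against the example sentence. Everything here is an assumption for illustration: the tags are assigned by hand rather than by a tagger, and the pattern and the name `extract_candidates` are not taken from the Acrolinx product.

```python
import re

# Hand-tagged example sentence (a real system would use a POS tagger).
tagged = [
    ("The", "DET"), ("wizard", "NOUN"), ("begins", "VERB"),
    ("creating", "VERB"), ("the", "DET"), ("bootable", "ADJ"),
    ("image", "NOUN"), (".", "PUNCT"),
]

# One character per token so the pattern can be a plain regex:
# A = adjective, N = noun, x = anything else.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}
PATTERN = re.compile(r"[AN]*N")  # (adjective | noun)* noun

def extract_candidates(tagged_tokens):
    """Return the maximal token spans whose tag sequence matches PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    return [
        " ".join(token for token, _ in tagged_tokens[m.start():m.end()])
        for m in PATTERN.finditer(codes)
    ]

candidates = extract_candidates(tagged)
print(candidates)  # ['wizard', 'bootable image']
```

Only maximal spans are returned, so "image" is not listed separately from "bootable image".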
How does the extraction work?
Extract translation candidates of each source-language term candidate from target-language text
Example:
  The wizard begins creating the bootable image.
  Der Assistent beginnt mit der Erstellung des bootfähigen Image.
- uses statistical phrase-alignment technology
- the same technology used in statistical machine translation
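The slide does not say which alignment model is used. To give a feel for how translation probabilities can be learned from aligned segments, the sketch below swaps in a much simpler, well-known technique: word-level IBM Model 1 trained with EM (without the NULL word), on a tiny invented bitext built around the slide's vocabulary. Phrase alignment in statistical machine translation typically builds on word alignments of this kind. The corpus and the name `train_ibm1` are assumptions for illustration.

```python
from collections import defaultdict

# Tiny invented English-German bitext, for illustration only.
bitext = [
    (["the", "wizard"], ["der", "Assistent"]),
    (["the", "image"], ["das", "Image"]),
    (["the", "wizard", "begins"], ["der", "Assistent", "beginnt"]),
    (["a", "wizard"], ["ein", "Assistent"]),
]

def train_ibm1(pairs, iterations=10):
    """Estimate t[e][f], the probability of target word f given source word e."""
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: defaultdict(lambda: uniform))  # uniform start
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: spread each target word over the source words of its segment.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[e][f] for e in src)
                for e in src:
                    frac = t[e][f] / norm
                    count[e][f] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts per source word.
        t = defaultdict(lambda: defaultdict(float))
        for e, row in count.items():
            for f, c in row.items():
                t[e][f] = c / total[e]
    return t

t = train_ibm1(bitext)
best = max(t["wizard"], key=t["wizard"].get)
print(best)  # Assistent
```

The resulting probabilities are the kind of score a confidence filter could later be based on.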
How does the extraction work?
Filter translation candidates
[Figure: translation candidates for Eingangsspannung (pink = filtered out)]
Filtering is based on:
- confidence score calculated from translation probabilities
  - can adjust threshold to favour precision or recall
- surface characteristics (closed-class words, punctuation)
- term-candidacy of translation (if possible for language)
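A minimal sketch of the first two filters, assuming each translation candidate already carries a confidence score. The candidate strings and scores below are invented for illustration and are not the values behind the figure; the closed-class word list, the edge-only check, and the names `Candidate` and `filter_candidates` are likewise assumptions. The term-candidacy check is not modelled.

```python
import string
from dataclasses import dataclass

# Small illustrative stop list; a real one would be language-specific.
CLOSED_CLASS = {"the", "a", "an", "of", "for", "in", "to", "and", "is"}
# Hyphens are allowed so that hyphenated terms survive.
PUNCTUATION = set(string.punctuation) - {"-"}

@dataclass
class Candidate:
    text: str
    confidence: float

def keep(candidate, threshold):
    """Apply the confidence filter, then the surface filters."""
    tokens = candidate.text.lower().split()
    if not tokens or candidate.confidence < threshold:
        return False
    # Reject candidates that start or end with a closed-class word.
    if tokens[0] in CLOSED_CLASS or tokens[-1] in CLOSED_CLASS:
        return False
    # Reject candidates containing punctuation.
    return not any(ch in PUNCTUATION for ch in candidate.text)

def filter_candidates(candidates, threshold):
    return [c.text for c in candidates if keep(c, threshold)]

# Hypothetical candidates for "Eingangsspannung" with made-up scores.
candidates = [
    Candidate("input voltage", 0.62),
    Candidate("the input voltage", 0.21),
    Candidate("input voltage ,", 0.09),
    Candidate("voltage", 0.04),
    Candidate("supply", 0.01),
]

strict = filter_candidates(candidates, threshold=0.05)   # favours precision
lenient = filter_candidates(candidates, threshold=0.02)  # favours recall
print(strict, lenient)
```

Lowering the threshold lets the weaker candidate "voltage" through, which is the precision/recall trade-off the slide refers to.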
How does the extraction work?
Identify synonyms (cluster candidates)
[Figure: cluster around Stromwandler (minimum link confidence threshold = 0.01)]
- link confidence based on the degree to which translations are shared
- can adjust threshold to favour precision or recall of links
How does the extraction work?
Identify synonyms (cluster candidates)
[Figure: cluster around Stromwandler (minimum link confidence threshold = 0.03)]
- link confidence based on the degree to which translations are shared
- can adjust threshold to favour precision or recall of links
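The effect of moving the threshold from 0.01 to 0.03 can be sketched as follows. The slides do not give the link-confidence formula, so this sketch assumes one plausible definition: the overlap between two terms' translation distributions (the sum, over shared translations, of the smaller probability). The terms other than Stromwandler, all probabilities, and the names `link_confidence` and `cluster` are invented for illustration.

```python
from itertools import combinations

# Hypothetical translation distributions (probabilities are made up).
translations = {
    "Stromwandler": {"current transformer": 0.90, "transformer": 0.10},
    "Wandler": {"transformer": 0.60, "converter": 0.40},
    "Messumformer": {"transducer": 0.98, "current transformer": 0.02},
}

def link_confidence(dist_a, dist_b):
    """Assumed measure: overlap of the two translation distributions."""
    return sum(min(p, dist_b[t]) for t, p in dist_a.items() if t in dist_b)

def cluster(dists, threshold):
    """Group terms connected by links at or above the threshold."""
    parent = {term: term for term in dists}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(dists, 2):
        if link_confidence(dists[a], dists[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for term in dists:
        groups.setdefault(find(term), set()).add(term)
    return sorted(sorted(group) for group in groups.values())

loose = cluster(translations, threshold=0.01)  # favours recall of links
tight = cluster(translations, threshold=0.03)  # favours precision of links
print(loose)
print(tight)
```

With these made-up numbers, the weak link between Stromwandler and Messumformer (0.02) survives at a threshold of 0.01 but is cut at 0.03, so the single cluster splits in two.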
What is Acrolinx?
Acrolinx is Content Optimization Software. It helps authors make their text more correct, more consistent, and more readable.
Consistent use of terminology is an important factor in the readability of text.
Acrolinx provides:
- term extraction (monolingual, aka term harvesting)
- terminology management
- term checking
Multilingual Term Extraction as a Service is a natural complement to the prior terminology functions.
Acrolinx @ tekom Visit Acrolinx at tekom! Hall 3, Stand 310
Questions?