SakkamMT White Paper


Sakkam K.K., Wakamatsu Building 7F, 3-3-6 Nihonbashi Honcho, Chuo-ku, Tokyo

Introduction

Ever since the emergence of machine translation there has been a debate about whether and when machines will replace humans for everyday translation tasks. With Sakkam Machine Translation (SakkamMT) we do not attempt to replace human translators; instead we focus on translation activities that are impractical or impossible with a traditional approach to translation. These activities are characterized by one or more of the following elements.

Extreme time sensitivity. The value of some categories of information decreases rapidly from the moment the information is made available. The announcement of US employment data has the potential to move currency markets in the seconds following its release but is old news one minute later. To have value, the translation of such releases needs to be available in under a second.

Massive volume. The volume of user-generated content is exploding, whether through social networking sites, online auctions or game networks. There are simply not enough human translators available to translate, monitor or extract meaning from tens of millions of messages per day.

Limited availability of domain expertise. Many translation tasks require a high degree of domain expertise in addition to fluency in the source and target languages. It is often difficult to identify translators with the requisite skill set, and this problem is exacerbated if the translation task is not a discrete project but an ongoing, around-the-clock activity.

We have also taken a fundamentally different approach in developing the technology that underpins SakkamMT. From the outset, SakkamMT was designed to provide native-speaker-quality translations within limited domains. As such, both its targeted applications and its usefulness are very different from those of existing attempts at machine translation, which seek to provide general capabilities, but at low quality. As anyone who has used existing Internet and PC-based translation systems will know, their output is unusable within a business context.
This paper provides an overview of the SakkamMT architecture and of how SakkamMT is being used where human translation is either impractical or impossible. Readers wishing to learn more about the technical foundations of SakkamMT are invited to review the bibliography.

Architecture

The Interim Representation Model

SakkamMT works by parsing the source language text using domain-specific rules to create a language-independent Interim Representation Model (IRM). The IRM can then be used to drive output to another language. This approach affords a number of benefits:

1. As more output languages are added, the system scales linearly with the number of languages (O(N)) rather than with the number of language pairs (O(N^2)).

2. The IRM may be used not just to output other human languages, but also to provide a computer-readable API for entity extraction. For example, dates could be extracted from an email to populate a calendar.

3. The same content may be translated in different styles. For example, a news headline could be output in a highly abbreviated news style, in a more grammatically correct prose style, or in both, depending upon the requirement.

4. The IRM facilitates a common-sense check of understanding and prevents some of the more egregious errors MT systems often make. Once the IRM has been populated, it is possible to judge whether the source text has been understood or not. At this stage a potentially wrong translation can be suppressed. This is in marked contrast to generalized translation systems, which will always output something regardless of quality.

Cognitive Categories

The design of the IRM draws heavily on research in Cognitive Psychology into semantic categorization. Cognitive categories differ significantly from mathematically formal categories. For example, cognitive categories do not in general support transitive closure, which is a characteristic of mathematically formal categories. Hence, the categorization system used in the IRM understands that although a car seat is a chair and a chair is an item of furniture, a car seat is not an item of furniture. Cognitive categories also display a degree of typicality: a goldfish is not as good an example of a pet as a dog. Sometimes, category membership can be so ill-defined as to need legal action to clarify.
The question of whether a tomato is a vegetable or a fruit ended up being decided by the US Supreme Court in response to a tax that applied to vegetables but not to fruit. Perhaps not surprisingly, the Court chose to classify it as a vegetable, bringing it within the scope of the tax. Botanists would class it as a fruit. Cognitive categories are also, crucially, ad hoc: they can be determined dynamically, set by context. For example, a tomato can be a missile. Traditional hierarchically structured schemas and taxonomies are unsuited to this, but Sakkam's IRM has been designed from the outset with this flexibility in mind.

Populating the IRM

Populating the IRM requires parsing the source language to extract meaning. This is done using a series of linguistic rules which operate atomically upon the text until the structure of the IRM has been built. These rules differ from those used by many MT systems in that they are based primarily upon the linguistic field of pragmatics, rather than on the more typical syntax-based grammatical rules. SakkamMT uses little syntactic grammar in its parsing for the simple reason that in many cases (such as headlines and emails) the grammar of the source text is quite often wrong. Instead, SakkamMT takes what is closer to a construction grammar approach, using pragmatic criteria to determine the logical units. Note that here, as in the rest of this paper, we take pragmatics to refer to the sub-division of linguistics, not the everyday English meaning of the word.

Figure 1 shows how alternate expressions using different phrasings result in the same pragmatics-based model, despite having very different syntactic structures. While in some cases syntax is crucial (for example, in "A killed B" only syntax can tell you which is the subject and which is the object), in many cases it has little to add. In this particular example, the word "approximate" can appear as a noun, verb, adjective or adverb, with little effect on the meaning.

Source                                            Syntactic Structure
Approximately, the distance to London is 200 km   ADV NP PP VP(V NP)
The approximate distance to London is 200 km      NP(ADJ N) PP VP(V NP)
The distance approximately to London is 200 km    NP ADV PP VP(V NP)
The distance to London approximation is 200 km    NP(NP PP N) VP(V NP)
The distance to London approximately is 200 km    NP PP VP(ADV V NP)
The distance to London approximates to 200 km     NP PP VP(V NP)
The distance to London is approximately 200 km    NP PP VP(V ADV NP)
The distance to London is 200 km approximately    NP PP VP(V NP ADV)
It is approximately 200 km distance to London     NP VP(ADV NP(NP NP) PP)

Interim Representation Model (IRM)
SENTENCE [
ITEM[ TYPE[London(LOCATION)
ATTRIBUTE[ MEASURE[ TYPE[distance(LENGTH)
RELATION[to
ATTRIBUTE[ VALUE[
NUMBER[200 UNITS[Km
SIGN[DEFAULT-POSITIVE PRECISION[approximate

Figure 1. Syntactic Structure and Pragmatic Structure

Once the IRM has been populated, a set of output rules can be brought to bear to express the IRM in the target language. Variations of style can be introduced at this stage.
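The idea of many phrasings collapsing into one model, which is then rendered in more than one style, can be pictured with a small Python sketch. This is an illustration only and not SakkamMT's implementation: the field names follow Figure 1, but the nesting of the dictionary, the cue-word matching in `parse_distance`, and the two output styles in `render` are all assumptions made for the example.

```python
import re

# Cue words signalling imprecision, whatever part of speech they appear as.
# (Assumption for illustration: a real rule set would be far richer.)
APPROX_CUE = re.compile(r"\bapproximat\w*", re.IGNORECASE)
VALUE = re.compile(r"(\d+(?:\.\d+)?)\s*(km)\b", re.IGNORECASE)
PLACE = re.compile(r"\bto\s+([A-Z][a-z]+)")

PHRASINGS = [
    "Approximately, the distance to London is 200 km",
    "The approximate distance to London is 200 km",
    "The distance to London approximates to 200 km",
    "The distance to London is 200 km approximately",
    "It is approximately 200 km distance to London",
]


def parse_distance(text):
    """Build an IRM-like dict from a distance statement, or None if not understood."""
    value = VALUE.search(text)
    place = PLACE.search(text)
    if "distance" not in text.lower() or not value or not place:
        return None  # not understood: the caller can suppress output
    return {
        "item": {"type": "LOCATION", "name": place.group(1)},
        "measure": {"type": "LENGTH", "name": "distance", "relation": "to"},
        "value": {
            "number": float(value.group(1)),
            "units": "km",
            "sign": "DEFAULT-POSITIVE",
            "precision": "approximate" if APPROX_CUE.search(text) else "exact",
        },
    }


def render(irm, style="prose"):
    """Express the model in English in one of two hypothetical styles."""
    number = f"{irm['value']['number']:g}"
    approx = irm["value"]["precision"] == "approximate"
    place = irm["item"]["name"]
    units = irm["value"]["units"]
    if style == "headline":
        return f"{place}: {'~' if approx else ''}{number} {units}"
    qualifier = "approximately " if approx else ""
    return f"The distance to {place} is {qualifier}{number} {units}."


models = [parse_distance(p) for p in PHRASINGS]
print(all(m == models[0] for m in models))    # True
print(render(models[0]))                      # The distance to London is approximately 200 km.
print(render(models[0], style="headline"))    # London: ~200 km
print(parse_distance("Tomatoes are fruit"))   # None
```

Returning `None` for text that does not fit the model mirrors the point made earlier that a translation can be suppressed when the source has not been understood, rather than always emitting something.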
It should be noted that for some very terse styles, such as are common in news headlines, the generated text may be grammatically incorrect and yet more appropriate for the audience. Looking at the example in Figure 1 again, we also note that, for any particular representation, both the variety and the acceptability of grammatical phrasings will be highly dependent upon the target language.

Existing machine translation systems will often instead try to translate the gross syntactic structure of the sentence, and then fill in the slots with translations of the individual noun and verb phrases. This, however, can lead to combinations that at best seem unnatural and at worst appear nonsensical. As an example, one Internet-based machine translation system translates:

Approximately, the distance to London is 200 km

as 200. Literally this would be: "Approximation, it is 200km to London." The English is deliberately stilted to reflect how it would sound in Japanese. The sentence pattern corresponds exactly to the English source, but the result is an incorrect Japanese sentence. Translating the other nine English phrasings results in nine different translations of varying degrees of accuracy. By contrast, SakkamMT uses the same IRM for all ten phrasings and uses the more acceptable and correct Japanese form, 200, again the same for all ten variations of the input.

Another example illustrates how pragmatics can enable correct disambiguation between the different meanings of a word. The following Japanese text is an extract from an item that was listed on Yahoo! Auctions.

/

The meaning of the first part of the text (shown in blue) is "Transformers Bumblebee Replica Mask 1:1 scale, genuine goods". Most translation engines will translate the second part (shown in red) as "I have a rash!". Although this is linguistically correct, it is pragmatically wrong. In the context of an auction for a mask, the correct translation, and the one that is made by SakkamMT, is "You can wear it!"
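The auction example can be pictured as choosing among candidate senses according to the category of the item under discussion. The sketch below is a deliberately small Python illustration of that idea. The sense table, the category labels and the `AMBIGUOUS_FORM` key are hypothetical stand-ins (the key takes the place of the Japanese surface form), and nothing here should be read as a description of SakkamMT's actual rules.

```python
# Candidate senses for one ambiguous surface form. Each sense lists the
# context categories in which it is plausible (hypothetical labels).
SENSES = {
    "AMBIGUOUS_FORM": [
        {"gloss": "You can wear it!", "contexts": {"WEARABLE"}},
        {"gloss": "I have a rash!", "contexts": {"MEDICAL", "SKIN"}},
    ]
}

# Hypothetical mapping from the item being described to its categories.
ITEM_CATEGORIES = {
    "mask": {"WEARABLE", "COLLECTABLE"},
    "ointment": {"MEDICAL", "SKIN"},
}


def choose_sense(form, item):
    """Return the gloss whose contexts overlap most with the item's categories.

    Returns None when no sense fits, so the caller can suppress the output
    instead of guessing.
    """
    categories = ITEM_CATEGORIES.get(item, set())
    best, best_overlap = None, 0
    for sense in SENSES.get(form, []):
        overlap = len(sense["contexts"] & categories)
        if overlap > best_overlap:
            best, best_overlap = sense["gloss"], overlap
    return best


print(choose_sense("AMBIGUOUS_FORM", "mask"))      # You can wear it!
print(choose_sense("AMBIGUOUS_FORM", "ointment"))  # I have a rash!
print(choose_sense("AMBIGUOUS_FORM", "teapot"))    # None
```

Both glosses remain available; it is the context of the listing, not the word itself, that decides between them.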

Business Implementation

A SakkamMT implementation consists of the following activities:

Project definition. Agree the scope, scale and timeframes for the project. The project may include a pilot phase, which provides an opportunity to demonstrate the effectiveness of the SakkamMT approach in the client's own environment.

Enhancement of the Interim Representation Model (IRM). Most deployments require modifications to our standard IRM to reflect the precise nature of the communication in the client's application.

Compilation of project-specific dictionaries and named entity databases. Existing client dictionaries and translation memories are used where available and, where necessary, additional material is developed through a combination of manual compilation and automated text analysis. All new content is then categorized to ensure consistency with the SakkamMT categorization model.

Lifecycle planning. Typical SakkamMT deployments involve the continual translation of a feed of information over a period of months or years. Over the life of the project, the source content may change as new terminology, and even new concepts, are introduced. In some cases SakkamMT will be able to adapt automatically to these changes, but there may also be an ongoing requirement to ensure that new terminology is being used correctly and that named entity databases are up to date.

Technical integration. Sakkam provides a simple, secure Web API for integration with client systems. The SakkamMT infrastructure is hosted on Amazon EC2, which enables us to scale our servers smoothly to many millions of translations per day and, by deploying production servers on different continents, to offer continuity of service even if an entire datacenter becomes unavailable.

Pilot phase. End-to-end deployment of SakkamMT within the client environment for a limited but representative subset of the overall project scope.

Live deployment. Full deployment of SakkamMT within the client environment.

Case Study: Financial News Feed

The foreign currency exchange markets dwarf those of equities: on a typical day something of the order of $1.5 trillion is traded, and Yen/Dollar is one of the major currency pairs. The Yen/Dollar rate is highly sensitive to US economic news, with billions of dollars changing hands within seconds of the release of a key economic indicator. Often, by the time Japanese translations are available, the market has already moved to factor in the news and the opportunity to profit has been lost.

SakkamMT is being used by Intisar Technology to provide business-critical translations of US financial news headlines from a leading financial news vendor. Translations are made with a sub-second delay, giving Intisar's customers valuable time to profit. Clearly, accuracy of translation, as well as immediacy of output, is essential in this environment. Intisar has integrated this feed of translated stories into its real-time market data platform, which supports traders' workstations such as Tradesignal.

Figure 2. SakkamMT translated news stories in Tradesignal

Existing general-purpose machine translation systems are incapable of providing even the gist of these news releases, and human translators who have no domain expertise are liable to make mistakes. Competing Japanese news vendors that rely on human translation are tens of seconds to several minutes behind, and also carry a high cost base (skilled bilingual financial domain experts available 24x7) that must ultimately be passed on to the consumer.

For details of this news feed and other Intisar services, please see www.intisartechnology.com.

Case Study: Internet Auctions

Increasing globalization has driven cross-border interest in collectables, and this now acts as a major revenue driver for Internet auctions and marketplaces. As a specific example, there is significant worldwide interest in manga and anime items, many of which have limited availability outside Japan. Figure 3 illustrates the price difference between a given item's English-language listing on eBay and an identical item's Japanese-language listing on Yahoo! Auctions. The difference in price represents the language premium that US collectors need to pay in order to search for items and bid in English.

Figure 3: US-Japan Price Differentials

Some collectors do attempt to use web-based translators, but these general services are too inaccurate to allow a collector to bid with any confidence. Examples proliferate of inaccurate translations and even of reversals of meaning, as in the example below.

Original listing:
Internet translation: Home for longer storage is a new unused item, in a box outside the gall, and a GE. For a completely new, please bid.

Unfortunately this is wrong. In fact the translation should be "do not bid", as is correctly shown in the SakkamMT translation below:

Original listing:
SakkamMT translation: This is a new, unused product. However, since it has been stored at home for a long time there are a number of marks and scratches on the outer box. Please do not bid if you are looking for a completely new item.

SakkamMT is being used to provide immediate translations of listings for anime-related memorabilia auctioned on Internet sites. For a live demonstration of SakkamMT translating auction content from Japan's Yahoo! Auctions, please follow @animeauctions on Twitter (http://twitter.com/animeauctions).

Bibliography: Technical Foundations

Conceptual Modelling

Margolis, Eric and Laurence, Stephen (eds.), Concepts: Core Readings, MIT Press, 1999
Talmy, Leonard, Toward a Cognitive Semantics, Vol. 1, MIT Press, 2000
Vosniadou, Stella and Ortony, Andrew (eds.), Similarity and Analogical Reasoning, Cambridge University Press, 1989

Linguistics

Croft, William, Radical Construction Grammar: Syntactic Theory in Typological Perspective, Oxford University Press, 2001
Levin, Beth, English Verb Classes and Alternations: A Preliminary Investigation, University of Chicago Press, 1993
Sperber, Dan and Wilson, Deirdre, Relevance: Communication and Cognition, Blackwell, 1986/1995
Matsui, Tomoko, Bridging and Relevance, John Benjamins Publishing Co., 2000
Wilson, Deirdre and Carston, Robyn, A unitary approach to lexical pragmatics: Relevance, inference and ad hoc concepts. In N. Burton-Roberts (ed.), Pragmatics, Palgrave, 2007, pp. 230-259

Computational Architecture

Mitchell, Melanie, Analogy-Making as Perception: A Computer Model, MIT Press, 1993
Mitchell, Melanie, Analogy-making as a complex adaptive system. In L. Segel and I. Cohen (eds.), Design Principles for the Immune System and Other Distributed Autonomous Systems, Oxford University Press, 2001