Big Data Adaptation DatAptor Dr Khalil Sima'an

Big Data Adaptation: DatAptor. Dr Khalil Sima'an, Institute for Logic, Language and Computation, University of Amsterdam, The Netherlands

MT at ILLC-UvA: Statistical Language Processing and Learning Lab. Language Technology: Machine Translation, Linguistic Structure, Statistical Learning.

MT at ILLC-UvA: Statistical Language Processing and Learning (SLPL) Lab. Machine Translation, main topics within SLPL: syntax-driven SMT (learning, decoding); learning latent reordering for translation; hierarchical models with morphology and syntax; data-powered adaptation. Foundations: linguistics and statistical learning. SLPL Lab (growing): 8 PhD students, 4 postdocs, a programmer. Five projects on statistical MT (2012-2019). This talk: Big Data and DatAptor (Feb 2013 - Feb 2017).

BIG DATA What Comes to Mind?

BIG DATA: Data, Data, Data, Data... (repeated many, many times)

BIG DATA Everyone wants big data (Does anyone know what for?)

BIG DATA: What comes to mind? Efficient computing: big storage + fast search. Diversity: quality differences; noise; difficult statistics... Saturated statistics: just count and divide; simple models are enough [cf. The Unreasonable Effectiveness of Data. Halevy, Norvig and Pereira 2009].

BIG DATA: What comes to mind? Efficient computers: storage + search. Diversity: quality differences. Saturated statistics: just count and divide; simple models are enough. Noise, difficult statistics... [cf. The Unreasonable Effectiveness of Data. Halevy, Norvig and Pereira 2009]. Diversity also offers advantages: language use (different domains); translators' practice and guidelines. This is the starting point of the DatAptor Project.

DatAptor Project (2013-2017): Data-Powered Domain-Specific Translation Services on Demand. Partners (User Board): TAUS, EC DGT, Intel Inc., Symantec. Researchers (SLPL, ILLC, UvA): Dr Khalil Sima'an (principal investigator); Dr Christof Monz (senior researcher); Dr Bart Mellebeek (postdoc: Jan 2013); Milos Stanojevic (PhD student: March 2013). Vacancies: postdoc + programmer.

DatAptor Motivation: versatile adaptation needed. Potential demand vs. current demand: continuously increasing text volumes; large variability in kinds of texts (domains). Changing translation market: changing domains, e.g. shifting international trade/cultural/... exchange; changing acceptance of automatic translation. Versatile adaptation of MT engines? How?

So many domains... so little time... Versatile: build an MT system for every domain of language use (sports; news; politics; financial; banking; automotive; drugs; food; appliance manuals; hardware/software manuals; scientific articles; ...). Automatically and rapidly: minimal human intervention. On demand, from user specs: user-supplied example texts to be translated. A population of MT systems!

Current Practice: Tiny Data Adaptation

[Diagram, labels only: DATAout (News), DATAin, Din, Adapt, MTIn, A_MTIn, MTOut]

Tiny Data Adaptation (Current Practice). Task: build an MT system from tiny in-domain data using whatever out-of-domain data exists. Theoretically: challenging, interesting, very difficult. In practice: the assumptions may be too strong. Alternative scenario: BIG DATA Adaptation.

Big Data == Diversity: Drugs, Food, Sports, Banking, Appliance, News, Financial, Hardware, Politics, Software

DatAptor Hypothesis: a Big Data metaphor. Imagine a world of translators. A translator: background + experience. A translator for every situation. For every new translation order, find the most suitable translator.

DatAptor Hypothesis: Big Data == Diversity. Metaphor: imagine a world of translators; a translator for every situation; each translator with their own background and experience. For every new translation order, find the most suitable translator. If Big Data == "a world of domains", then diversity enables rapid adaptation to a new domain: "find the most suitable MT system in the data".

DatAptor Challenges I. INPUT: user documents from some domain + BIG DATA. OUTPUT: an SMT system adapted to that domain. Distill suitable training data from BIG DATA: weigh some documents as more relevant than others, then train the SMT system on the distilled data. Map of BIG DATA: needed for efficiency when distilling suitable training data; the more related two items are, the closer they lie to each other on the map. How to measure domain similarity? Statistical (hierarchical) topic similarity; translation equivalence and instance weighting.
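
The "weigh some documents as more relevant than others" step can be illustrated with a minimal sketch. It assumes a plain bag-of-words cosine similarity as the domain-similarity measure, which is a stand-in chosen for brevity and not the (hierarchical) topic model or translation-equivalence weighting named on the slide. The function names `bag_of_words`, `cosine`, `weigh_documents` and `distill`, and the `threshold` parameter, are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weigh_documents(user_docs, big_data):
    """Return (weight, document) pairs, most relevant first.

    Assumption: relevance is the cosine similarity between a candidate
    document and the pooled word counts of the user's example documents.
    """
    profile = Counter()
    for doc in user_docs:
        profile.update(bag_of_words(doc))
    scored = [(cosine(profile, bag_of_words(doc)), doc) for doc in big_data]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def distill(user_docs, big_data, threshold=0.3):
    """Keep only documents whose weight reaches the (hypothetical) threshold."""
    return [doc for weight, doc in weigh_documents(user_docs, big_data)
            if weight >= threshold]
```

In a real setting the weights would more plausibly be kept and passed to training as instance weights, rather than reduced to a hard keep/discard decision as `distill` does here.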

BIG DATA vs. BIG Trans. DATA

BIG T-DATA (Translation Data). Translation equations: VII = 7. Meaning equations: V+II = VII; 5+2 = 7. Semantic equation: VII = (V+II) = (5+2) = 7.

BIG T-DATA + BIG DATA: domain diversity; topic similarity(?); semantic similarity(?). More data gives many more, and more accurate, topic/semantic equations and similarity relations.
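
The Roman-numeral analogy on this slide can be checked mechanically. The sketch below is only an illustration of the analogy, not a component of DatAptor: `roman_to_int` is a hypothetical helper that plays the role of a "translation" from Roman to Arabic numerals, and the assertions restate the slide's translation, meaning and semantic equations.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Translate a Roman numeral into an integer (hypothetical helper)."""
    values = [ROMAN[symbol] for symbol in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(values) and values[i + 1] > value:
            total -= value
        else:
            total += value
    return total


# Translation equation: VII = 7
assert roman_to_int("VII") == 7

# Meaning equations: V+II = VII (Roman side) and 5+2 = 7 (Arabic side)
assert roman_to_int("V") + roman_to_int("II") == roman_to_int("VII")
assert 5 + 2 == 7

# Semantic equation: VII = (V+II) = (5+2) = 7
assert roman_to_int("VII") == roman_to_int("V") + roman_to_int("II") == 5 + 2 == 7
```

The point of the analogy is that a translation pair (VII, 7) also certifies that the two sides mean the same thing, so a large collection of such pairs carries information about meaning within each language as well as across languages.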

DatAptor Challenges II: user-driven, data-powered adaptation. Recursive, hierarchical meaning equations: structured translation equations; better reordering; better fit with morpho-syntax; "deeper" meaning equations.

To conclude: Big Trans. Data enables Data-Powered Adaptation (DatAptation). Statistics over "meaning equations": more than MT? Language understanding! Thank you! k.simaan@uva.nl