LANGUAGE CODING IN INFORMATION TECHNOLOGIES

TKE 2014: Language Codes at the Crossroads
Peter Constable, Microsoft Corporation / Unicode Consortium

Does industry need or care about ISO 639-3?

"The main question that I have is whether language identification should be a task for ISO, the International Organization for Standardization ... ISO is basically an organization [f]or industry, not for science ... The reason why ISO got involved in language name issues in the first place is of course the economic significance of translation and localization, which is far greater than the relevance of distant stars for businesses. But does this mean that someone needs ISO's industry standard to identify little-known languages of small communities that are never or hardly used in writing and that are often in danger of extinction?"
Martin Haspelmath, Diversity Linguistics Comment, December 2013

"Business has an interest in the stable identification of economically significant languages, for example for translation and computer localization, and this is why the ISO 639-1 and 639-2 standards were established in the first place. However, those standards are adequate for the needs of industry; business has no significant interest in the many small, unwritten and often endangered languages with no measurable economic impact."
Kwamikagami (Wikipedia contributor), April 2014

Information technologies rely heavily on ISO 639, including ISO 639-3.
IETF BCP 47 is a key industry technology using ISO 639-3.
Many of linguists' needs for language identification can be accommodated by the same BCP 47 mechanisms used in the IT industry.

AGENDA
Use of language identifiers in information technologies
History: development of ISO 639-3 and industry adoption via BCP 47
Which languages are significant?
Overview of IETF BCP 47 (language tags)
Utility of industry mechanisms for linguists

USE OF LANGUAGE IDENTIFIERS IN INFORMATION TECHNOLOGIES

USE OF LANGUAGE IDENTIFIERS
Tagging content to declare the language of content
  Text, audio, video
Tagging of software resources for language-specific processing
Matching user language preferences with content
Matching content with language-specific processes

USE OF LANGUAGE IDENTIFIERS
Examples
  Display of content in my preferred language
    Web pages, videos, captions, etc.
  Display of application user interfaces in my preferred language
  Activating input methods for different languages
  Spell checking
  Text-to-speech
  (many others)
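Matching a user's language preferences with available content can be sketched as progressive truncation of the preferred tag, in the spirit of the lookup scheme described in RFC 4647. This is a simplified illustration, not a complete implementation; the function name `lookup`, the `available` set, and the default value are assumptions made for the example.

```python
def lookup(preferred, available, default="en"):
    """Return the best available tag for an ordered list of preferred tags.

    Simplified truncation-based lookup: each preferred tag is shortened
    subtag by subtag from the right until it matches an available tag.
    """
    # Language tags are compared case-insensitively.
    by_lower = {tag.lower(): tag for tag in available}
    for tag in preferred:
        subtags = tag.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # Do not leave a dangling single-character subtag
            # (for example the "x" that introduces private use).
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


available = {"en", "pt", "pt-BR", "az-Cyrl"}
print(lookup(["pt-PT", "en"], available))   # pt
print(lookup(["az-Cyrl-AZ"], available))    # az-Cyrl
print(lookup(["haw"], available))           # en (the default)
```

Falling back from `pt-PT` to `pt` rather than to `pt-BR` is a deliberate property of truncation: the lookup only ever moves to a less specific tag, never to a sibling.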

DEVELOPMENT OF ISO 639-3 AND INDUSTRY ADOPTION VIA BCP 47

LANDSCAPE CIRCA 2000
Language documentation / applied linguistics
  Research, literature development in 1000s of languages
  Large language corpora
  Simons and Bird (2000), forming of OLAC

LANDSCAPE CIRCA 2000
Industry
  Limited locale identifier mechanisms
    Windows: numbers, 512 maximum
    Mac: numbers, 150 defined
    Internet: RFC 1766, based on ISO 639-1 and ISO 3166-1, e.g., en-us
    XML: using RFC 1766
  ISO 639-1: 180 languages
  ISO 639-2: 350 languages

LANDSCAPE CIRCA 2000
Changing industry landscape
  Unicode Consortium mission: "This Corporation's specific purpose shall be to enable people around the world to use computers in any language"
  Unicode 3.0, finally becoming mainstream in software
    Office 97, Windows 2000, Mac OS X, XeTeX, XML, .NET, Java, C++, ECMAScript, Pango
  Rapidly-growing interest in expanding language support
  Major vendors: "We don't want to be a bottleneck for language communities!"

LANDSCAPE CIRCA 2000
"We need a comprehensive language coding standard!"
Industry support for development of ISO 639-3

DEVELOPMENT SINCE 2000
ISO 639-3
  2002: start of work
  2007: published
BCP 47
  2001: IETF RFC 3066, incorporation of ISO 639-2
  2006: IETF RFC 4646, enhancements to compatibility, stability, structure
  2009: IETF RFC 5646, incorporation of ISO 639-3
Widespread adoption of BCP 47, ISO 639-3 across technologies, Web, and OS platforms

CURRENT INDUSTRY USE OF ISO 639-3
Unicode CLDR 25:
  Used in Android, Mac OS, iOS, Windows, Debian Linux, Apache, etc.
  Data for 369 languages in ISO 639-3 not in ISO 639-1 / ISO 639-2
  Exemplar data for 600+ more
  General support for all of ISO 639-3
Windows 8:
  Explicit use of 115 IDs in ISO 639-3 not in ISO 639-1 / ISO 639-2
  General support for all of ISO 639-3

WHICH LANGUAGES ARE SIGNIFICANT?

WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?
Extended Graded Intergenerational Disruption Scale (EGIDS)
See: https://www.ethnologue.com/about/languagestatus

[Chart over EGIDS levels 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10; vertical axis from 0 to 3000. The chart appears four times: once plain, then with the annotations below.]

Institutional
  Mass media / publishing
  Libraries
  Education
  Commerce, marketing
  Product localization, translation

Developing to waning
  Limited-to-no institutional support, mass media, etc.
  Use of ICTs:
    End-user content: Web, SMS, email
    Some product localization
    Significant enhancement to language stabilization, vitality

Dying to extinct
  Use of ICTs:
    Language documentation
    XML

WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?
Some will be more used and better supported by industry than others, but all need and get some level of industry support.
Industry is creating technologies relevant to all languages!

BCP 47 OVERVIEW

BCP 47
IETF Best Current Practice specification
Reference: http://tools.ietf.org/html/bcp47
History:
  1995: RFC 1766
  2001: RFC 3066
  2006: RFC 4646 + RFC 4647
  2009: RFC 5646 + RFC 4647
Designed to accommodate language variations
  Language, writing system, orthography, dialect, etc.
  Extended concepts that have language as a core component

HANDLING VARIATIONS
Start with IDs for discrete languages from ISO 639-1, ISO 639-3
Using BCP 47, add qualifiers to language tags as needed
Examples:
  pt-br                 = Portuguese as used in Brazil
  az-cyrl               = Azerbaijani written in Cyrillic script
  ca-valencia           = Valencian
  de-1996               = German using 1996 orthographic conventions
  en-latn-fonipa-scouse = Scouse dialect of English in IPA transcription

KEY COMPONENTS OF BCP 47
Tag syntax
Subtag registry (maintained by IANA)
Mechanism to register variant subtags
Mechanism to register extensions

BCP 47 SYNTAX
  Language-Tag = langtag / privateuse
  langtag      = language            ; ISO 639
                 ("-" script)?       ; ISO 15924
                 ("-" region)?       ; ISO 3166-1 or UN M.49
                 ("-" variant)*      ; registered
                 ("-" extension)*    ; registered RFC
                 ("-" privateuse)?
  extension    = singleton ("-" alphanum{2,8})+
  privateuse   = "x" ("-" alphanum{1,8})+

BCP 47 SYNTAX
Examples:
  haw                                  language
  pt-br                                language + ISO 3166-1 region
  es-419                               language + UN M.49 region
  az-cyrl                              language + script
  ca-valencia                          language + variant
  pww-latn-fonipa                      language + script + variant
  x-foobar                             private use
  fil-x-foobar                         language + private use
  und-hebr-t-und-latn-m0-ungegn-1977   language + script + t extension
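The simplified syntax on the slides can be turned into a regular expression and checked against the example tags. The sketch below is an approximation of that simplified grammar only: it assumes a 2- or 3-letter language subtag and ignores parts of RFC 5646 the slides leave out (such as extended language subtags and grandfathered tags). The names `LANGTAG` and `parse_tag` are made up for this example.

```python
import re

# One alternative for "langtag", one for a tag that is private use only.
LANGTAG = re.compile(r"""
    ^(?:
        (?P<language>[a-z]{2,3})
        (?:-(?P<script>[a-z]{4}))?
        (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
        (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
        (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
        (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?
      |
        (?P<privateonly>x(?:-[a-z0-9]{1,8})+)
    )$
""", re.VERBOSE | re.IGNORECASE)


def parse_tag(tag):
    """Split a tag into named components, or return None if it does not match."""
    match = LANGTAG.match(tag)
    if match is None:
        return None
    if match.group("privateonly"):
        return {"privateuse": match.group("privateonly").lower()}
    parts = {"language": match.group("language").lower()}
    for name in ("script", "region", "privateuse"):
        if match.group(name):
            parts[name] = match.group(name).lower()
    variants = [v.lower() for v in match.group("variants").split("-") if v]
    if variants:
        parts["variants"] = variants
    if match.group("extensions"):
        # Kept as one raw string; extensions are not split further here.
        parts["extensions"] = match.group("extensions").lstrip("-").lower()
    return parts


for example in ("haw", "pt-br", "es-419", "az-cyrl", "ca-valencia",
                "pww-latn-fonipa", "x-foobar", "fil-x-foobar",
                "und-hebr-t-und-latn-m0-ungegn-1977"):
    print(example, "->", parse_tag(example))
```

A pattern like this only checks well-formedness. Whether a given subtag is actually registered is a separate question that requires the IANA Language Subtag Registry.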

VARIANT SUBTAGS
Registration requests can be submitted by anyone
Reviewed for best practice
Added to IANA Language Subtag Registry
Process: see http://tools.ietf.org/html/bcp47#section-3.5
64 variant subtags registered to date

VARIANT SUBTAGS
Examples:
  Variant subtag   Meaning
  aluku            Aluku dialect of the "Busi Nenge Tongo" English-based Creole continuum in Eastern Suriname and Western French Guiana
  balanka          The Balanka dialect of Anii
  itihasa          Epic Sanskrit
  vallader         Vallader idiom of Romansh
  1959acad         "Academic" ("governmental") variant of Belarusian as codified in 1959
  1606nict         Late Middle French (to 1606 as in Jean Nicot, "Thresor de la langue francoyse", 1606)
  baku1926         Unified Turkic Latin Alphabet (principles codified at the 1926 Turkological Conference in Baku)

VARIANT SUBTAGS
Registration form example:

LANGUAGE SUBTAG REGISTRATION FORM
1. Name of requester: Tomaž Erjavec
2. E-mail address of requester: tomaz.erjavec&ijs.si
3. Record Requested:
   Type: variant
   Subtag: metelko
   Description: Slovene in Metelko alphabet
   Prefix: sl
   Comments: The subtag represents the alphabet codified by Franc Serafin Metelko and used from 1825 to 1833.
4. Intended meaning of the subtag:
   The subtag marks texts written in Slovene using the historical Metelko alphabet, which is distinguished from the contemporary norm by borrowing (and modifying) letters from Cyrillic.
5. Reference to published description of the language (book or article):
   http://en.wikipedia.org/wiki/metelko_alphabet
   Stabej, Marko. Franc Serafin Metelko in Metelčica. In (Janez Cvirn, ed.) Slovenska Kronika XIX. stoletja. (2001). Print.
6. Any other relevant information:
   The tag "sl-metelko" is relevant as a possible value of the @xml:lang attribute to be used by language technology applications for transcribing and modernising such texts, e.g. for text search in cultural heritage digital libraries. E.g. the National and University Library of Slovenia has plans to digitise about 5,000 pages of books written in the Metelko alphabet.
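The Prefix field in the registration record states which tags a variant subtag is intended to follow. As a rough illustration of how software might use it, the sketch below checks variants against a small hand-written table. The table is an assumption assembled from examples in these slides (sl-metelko, ca-valencia, de-1996), not a copy of the IANA registry, and `check_variant_prefix` is a hypothetical helper name.

```python
# Hand-written excerpt modelled on examples in the slides; the real IANA
# Language Subtag Registry holds many more records and more fields.
VARIANT_PREFIXES = {
    "metelko": ["sl"],
    "valencia": ["ca"],
    "1996": ["de"],
}


def check_variant_prefix(tag):
    """Return (variant, ok) pairs for each known variant subtag in the tag.

    ok is True when the part of the tag before the variant equals one of
    the recorded prefixes or starts with it followed by a hyphen.
    """
    subtags = tag.lower().split("-")
    results = []
    for index, subtag in enumerate(subtags[1:], start=1):
        prefixes = VARIANT_PREFIXES.get(subtag)
        if prefixes is None:
            continue  # not a variant known to this small table
        before = "-".join(subtags[:index])
        ok = any(before == p or before.startswith(p + "-") for p in prefixes)
        results.append((subtag, ok))
    return results


print(check_variant_prefix("sl-metelko"))   # [('metelko', True)]
print(check_variant_prefix("de-metelko"))   # [('metelko', False)]
print(check_variant_prefix("de-AT-1996"))   # [('1996', True)]
```

A failed check is better treated as a warning than as an error, since a prefix expresses intended use of the subtag.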

EXTENSIONS
For concepts that go beyond language but have language as a core component
Created by IETF process
Details of an extension can be owned by other authorities
Existing extensions
  t: Transformed content (RFC 6497); maintaining authority: Unicode Consortium
  u: Unicode locale (RFC 6067); maintaining authority: Unicode Consortium

BCP 47 EXTENSIONS
Example: t extension
  Transformed content
  Extension defined in RFC 6497 + Unicode specification UTS #35
  Example tag: und-hebr-t-und-latn-m0-ungegn-1977
  Syntax: language + script + t extension
  Meaning: content in Hebrew script transformed from Latin script according to a UNGEGN 1977 transliteration specification
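The structure of the example tag can be made explicit with a small parser. This is a minimal sketch that assumes the layout visible in the example: the subtags before the singleton "t" describe the content, the subtags after it describe the source up to the first field key (a letter followed by a digit, such as m0), and the remaining subtags are the values of that key. The function name and the shape of the returned dictionary are illustrative choices, not part of any specification.

```python
import re

FIELD_KEY = re.compile(r"^[a-z][0-9]$")  # for example "m0"


def parse_t_extension(tag):
    """Split a tag carrying a t extension into target, source and fields."""
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return None
    t_index = subtags.index("t")
    target = "-".join(subtags[:t_index])
    source, fields, key = [], {}, None
    for subtag in subtags[t_index + 1:]:
        if len(subtag) == 1:
            break  # another singleton: a different extension or private use
        if FIELD_KEY.match(subtag):
            key = subtag
            fields[key] = []
        elif key is None:
            source.append(subtag)      # still inside the source language tag
        else:
            fields[key].append(subtag)  # value of the current field key
    return {
        "target": target,
        "source": "-".join(source),
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }


print(parse_t_extension("und-hebr-t-und-latn-m0-ungegn-1977"))
# {'target': 'und-hebr', 'source': 'und-latn', 'fields': {'m0': 'ungegn-1977'}}
```

The result reads the same way as the slide: content tagged und-hebr, transformed from und-latn, with the mechanism given by the m0 field.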

UTILITY OF INDUSTRY MECHANISMS FOR LINGUISTS

LINGUISTICS / LANGUAGE DOCUMENTATION
Goals:
  Language development (literacy, lexicography, content development)
  Scholarly documentation
Objects of scholarly investigation
  Languages, dialects
  Linguistic varieties / variety networks, languoids

HANDLING LINGUISTS' NEEDS
Use existing BCP 47 mechanisms
  Used in xml:lang
  Used in Dublin Core Metadata Element Set, Version 1.1
Existing process to register variant subtags
Existing subtags registered for different kinds of variants
  Dialect variants
  Pronunciation variants
  Historic variants
  Orthographic / spelling variants
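As an illustration of BCP 47 tags in xml:lang, the sketch below reads the attribute with Python's standard xml.etree.ElementTree module and applies the inheritance rule that a child element without its own xml:lang takes the value of its nearest ancestor. The sample document, its element names, and the function name `effective_languages` are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<texts>
  <text xml:lang="sl-metelko">first text</text>
  <text xml:lang="pww-Latn-fonipa"><line>second text</line></text>
  <text>untagged text</text>
</texts>"""


def effective_languages(xml_source):
    """Return (element name, language tag) pairs in document order."""
    root = ET.fromstring(xml_source)

    def walk(element, inherited):
        # An element without xml:lang inherits the value of its parent.
        lang = element.get(XML_LANG, inherited)
        yield element.tag, lang
        for child in element:
            yield from walk(child, lang)

    return list(walk(root, None))


for name, lang in effective_languages(SAMPLE):
    print(name, lang)
```

Elements with no language declared anywhere above them come back as None, which a corpus tool could flag as missing metadata.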

HANDLING LINGUISTS' NEEDS
Linguists could create a new BCP 47 extension
  Extension could use glottocode or other domain-specific vocabulary
Example (hypothetical): pww-l-gc-maes1238-sc-l2fsi2-sd-disarthr
  ISO 639-3: Northern Pwo Karen
  Extension key-value pairs:
    glottocode: Mae Sarieng variant
    speaker competence: L2 speaker, estimated FSI level 2
    speech defect: symptoms of dysarthria
Process: see http://tools.ietf.org/html/bcp47#section-3.7
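To show how such an extension might be consumed, here is a sketch that reads the hypothetical tag above. Everything about the "l" extension is taken from the slide's hypothetical example: the singleton, the keys gc, sc and sd, and their meanings are not registered anywhere. The rule that a 2-character subtag is a key and longer subtags are its values is an additional assumption made for this sketch.

```python
# Key names follow the slide's hypothetical example; none are registered.
KEY_NAMES = {
    "gc": "glottocode",
    "sc": "speaker competence",
    "sd": "speech defect",
}


def parse_l_extension(tag):
    """Read the hypothetical "l" extension into a dictionary of fields."""
    subtags = tag.lower().split("-")
    if "l" not in subtags:
        return None
    start = subtags.index("l")
    fields, key = {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a different extension
        if len(subtag) == 2:
            # Assumed convention: 2-character subtags are keys.
            key = KEY_NAMES.get(subtag, subtag)
            fields[key] = []
        elif key is not None:
            fields[key].append(subtag)
    return {
        "language": "-".join(subtags[:start]),
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }


print(parse_l_extension("pww-l-gc-maes1238-sc-l2fsi2-sd-disarthr"))
```

Software that does not know the extension can still process the tag as pww, which is the practical advantage of an extension over a separate metadata vocabulary.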

HANDLING LINGUISTS' NEEDS
What if ISO 639, BCP 47 aren't enough?
May need properties that don't belong in language tags (e.g., speaker's social network)
  Create appropriate data schemas
May need properties pertaining directly to language variation, but BCP 47 variants / extensions deemed not a good fit (e.g., tentative analysis, doesn't align to existing ISO 639-3 categories)
  Create other metadata vocabularies and data schemas
  Trade-off: not supported in general industry specifications (e.g., xml:lang)

SUMMATION

SUMMATION
Information technologies rely heavily on ISO 639, including ISO 639-3, most often via BCP 47
Many needs of linguists can be accommodated by existing BCP 47 mechanisms
Additional needs of linguists might be accommodated by creation of a new BCP 47 extension
Linguists should use ISO 639-3 and BCP 47 whenever appropriate!