LANGUAGE CODING IN INFORMATION TECHNOLOGIES
TKE 2014: Language Codes at the Crossroads
Peter Constable
Microsoft Corporation / Unicode Consortium
Does industry need or care about ISO 639-3?
"The main question that I have is whether language identification should be a task for ISO, the International Organization for Standardization. ISO is basically an organization [f]or industry, not for science. The reason why ISO got involved in language name issues in the first place is of course the economic significance of translation and localization, which is far greater than the relevance of distant stars for businesses. But does this mean that someone needs ISO's industry standard to identify little-known languages of small communities that are never or hardly used in writing and that are often in danger of extinction?"
Martin Haspelmath, Diversity Linguistics Comment, December 2013
"Business has an interest in the stable identification of economically significant languages, for example for translation and computer localization, and this is why the ISO 639-1 and 639-2 standards were established in the first place. However, those standards are adequate for the needs of industry; business has no significant interest in the many small, unwritten and often endangered languages with no measurable economic impact."
Kwamikagami (Wikipedia contributor), April 2014
Information technologies rely heavily on ISO 639, including ISO 639-3
IETF BCP 47 is a key industry technology using ISO 639-3
Many of linguists' needs for language identification can be accommodated by the same BCP 47 mechanisms used in the IT industry
AGENDA
Use of language identifiers in information technologies
History: development of ISO 639-3 and industry adoption via BCP 47
Which languages are significant?
Overview of IETF BCP 47 (language tags)
Utility of industry mechanisms for linguists
USE OF LANGUAGE IDENTIFIERS IN INFORMATION TECHNOLOGIES
USE OF LANGUAGE IDENTIFIERS
Tagging content to declare the language of content
  Text, audio, video
Tagging of software resources for language-specific processing
Matching user language preferences with content
Matching content with language-specific processes
USE OF LANGUAGE IDENTIFIERS
Examples
  Display of content in my preferred language
    Web pages, videos, captions, etc.
  Display of application user interfaces in my preferred language
  Activating input methods for different languages
  Spell checking
  Text-to-speech
  (many others)
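Matching a user's language preferences against the languages in which content is available can be sketched as below. This is a minimal illustration that loosely follows the "lookup" scheme of RFC 4647 (progressively truncating each preferred tag from the right); the function name `lookup` and the sample tags in the usage note are assumptions made for illustration, not part of the talk.

```python
def lookup(preferences, available, default=None):
    """Return the best available tag for an ordered list of preferred tags.

    Simplified lookup: each preferred tag is truncated from the right,
    one subtag at a time, until it equals one of the available tags.
    Wildcard ranges ("*") are not handled in this sketch.
    """
    # Language tags are case-insensitive, so compare in lowercase.
    by_lower = {tag.lower(): tag for tag in available}
    for preferred in preferences:
        subtags = preferred.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A single-character subtag (extension singleton or "x")
            # cannot end a tag, so it is removed along with its content.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default
```

For example, `lookup(["pt-BR", "en"], ["en", "pt", "haw"])` falls back from `pt-BR` to `pt` before considering `en`.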
DEVELOPMENT OF ISO 639-3 AND INDUSTRY ADOPTION VIA BCP 47
LANDSCAPE CIRCA 2000
Language documentation / applied linguistics
  Research, literature development in 1000s of languages
  Large language corpora
  Simons and Bird (2000), forming of OLAC
LANDSCAPE CIRCA 2000
Industry
  Limited locale identifier mechanisms
    Windows: numbers, 512 maximum
    Mac: numbers, 150 defined
    Internet: RFC 1766, based on ISO 639-1 and ISO 3166-1, e.g., en-us
    XML: using RFC 1766
  ISO 639-1: 180 languages
  ISO 639-2: 350 languages
LANDSCAPE CIRCA 2000
Changing industry landscape
  Unicode Consortium mission: "This Corporation's specific purpose shall be to enable people around the world to use computers in any language"
  Unicode 3.0, finally becoming mainstream in software
    Office 97, Windows 2000, Mac OS X, XeTeX, XML, .NET, Java, C++, ECMAScript, Pango
  Rapidly-growing interest in expanding language support
  Major vendors: "We don't want to be a bottleneck for language communities!"
LANDSCAPE CIRCA 2000
We need a comprehensive language coding standard!
Industry support for development of ISO 639-3
DEVELOPMENT SINCE 2000
ISO 639-3
  2002: start of work
  2007: published
BCP 47
  2001: IETF RFC 3066, incorporation of ISO 639-2
  2006: IETF RFC 4646, enhancements to compatibility, stability, structure
  2009: IETF RFC 5646, incorporation of ISO 639-3
Widespread adoption of BCP 47, ISO 639-3 across technologies, Web, and OS platforms
CURRENT INDUSTRY USE OF ISO 639-3
Unicode CLDR 25:
  Used in Android, Mac OS, iOS, Windows, Debian Linux, Apache, among others
  Data for 369 languages in ISO 639-3 not in ISO 639-1 / ISO 639-2
  Exemplar data for 600+ more
  General support for all of ISO 639-3
Windows 8:
  Explicit use of 115 IDs in ISO 639-3 not in ISO 639-1 / ISO 639-2
  General support for all of ISO 639-3
WHICH LANGUAGES ARE SIGNIFICANT?
WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?
Expanded Graded Intergenerational Disruption Scale (EGIDS)
See: https://www.ethnologue.com/about/languagestatus
[Chart: EGIDS levels 0, 1, 2, 3, 4, 5, 6a, 6b, 7, 8a, 8b, 9, 10 on the horizontal axis; vertical axis 0 to 3000]
Institutional
  Mass media / publishing
  Libraries
  Education
  Commerce, marketing
  Product localization, translation
Developing to waning
  Limited-to-no institutional support, mass media, etc.
  Use of ICTs:
    End-user content: Web, SMS, email
    Some product localization
  Significant enhancement to language stabilization, vitality
Dying to extinct
  Use of ICTs:
    Language documentation
    XML
WHICH LANGUAGES ARE SIGNIFICANT FOR INDUSTRY?
Some will be more used and better supported by industry than others, but all need and get some level of industry support
Industry is creating technologies relevant to all languages!
BCP 47 OVERVIEW
BCP 47
IETF Best Current Practice specification
Reference: http://tools.ietf.org/html/bcp47
History:
  1995: RFC 1766
  2001: RFC 3066
  2006: RFC 4646 + RFC 4647
  2009: RFC 5646 + RFC 4647
Designed to accommodate
  Language variations
    Language, writing system, orthography, dialect, etc.
  Extended concepts that have language as a core component
HANDLING VARIATIONS
Start with IDs for discrete languages from ISO 639-1, ISO 639-3
Using BCP 47, add qualifiers to language tags as needed
Examples:
  pt-br = Portuguese as used in Brazil
  az-cyrl = Azerbaijani written in Cyrillic script
  ca-valencia = Valencian
  de-1996 = German using 1996 orthographic conventions
  en-latn-fonipa-scouse = Scouse dialect of English in IPA transcription
KEY COMPONENTS OF BCP 47
Tag syntax
Subtag registry (maintained by IANA)
Mechanism to register variant subtags
Mechanism to register extensions
BCP 47 SYNTAX
  Language-Tag = langtag / privateuse
  langtag      = language          ; ISO 639
                 ("-" script)?     ; ISO 15924
                 ("-" region)?     ; ISO 3166-1 or UN M.49
                 ("-" variant)*    ; registered
                 ("-" extension)*  ; registered RFC
                 ("-" privateuse)?
  extension    = singleton ("-" alphanum{2,8})+
  privateuse   = "x" ("-" alphanum{1,8})+
BCP 47 SYNTAX
Examples:
  haw                                 language
  pt-br                               language + ISO 3166-1 region
  es-419                              language + UN M.49 region
  az-cyrl                             language + script
  ca-valencia                         language + variant
  pww-latn-fonipa                     language + script + variant
  x-foobar                            private use
  fil-x-foobar                        language + private use
  und-hebr-t-und-latn-m0-ungegn-1977  language + script + t extension
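The simplified grammar and the example tags above can be exercised with a small parser. The sketch below is an illustration only: the subtag shapes (2 to 3 letters for a language, 4 letters for a script, 2 letters or 3 digits for a region, and so on) are assumptions drawn from RFC 5646, several productions of the full specification (extended language subtags, grandfathered tags) are deliberately omitted, and the names `parse_tag`, `_TAG` and `_EXTENSION` are hypothetical.

```python
import re

# Simplified langtag / privateuse grammar, matched against a lowercased tag.
_TAG = re.compile(
    r"""^(?:
        (?P<language>[a-z]{2,3})
        (?:-(?P<script>[a-z]{4}))?
        (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
        (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
        (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
        (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?
      |
        (?P<privateonly>x(?:-[a-z0-9]{1,8})+)
    )$""",
    re.VERBOSE,
)

# One extension: a singleton (any letter or digit except "x") plus its subtags.
_EXTENSION = re.compile(r"-([0-9a-wy-z](?:-[a-z0-9]{2,8})+)")


def parse_tag(tag):
    """Split a tag into its components; results are returned in lowercase."""
    match = _TAG.match(tag.lower())
    if match is None:
        raise ValueError(f"not well-formed under this simplified grammar: {tag!r}")
    if match.group("privateonly"):
        return {"privateuse": match.group("privateonly")}
    parts = {
        "language": match.group("language"),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": [v for v in match.group("variants").split("-") if v],
        "extensions": _EXTENSION.findall(match.group("extensions")),
        "privateuse": match.group("privateuse"),
    }
    # Drop components that are absent from the tag.
    return {name: value for name, value in parts.items() if value}
```

For instance, `parse_tag("pww-latn-fonipa")` separates the language `pww`, the script `latn` and the variant `fonipa`. The sketch checks only well-formedness; it does not consult the IANA registry to confirm that each subtag is actually registered.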
VARIANT SUBTAGS
Registration requests can be submitted by anyone
Reviewed for best practice
Added to IANA Language Subtag Registry
Process: see http://tools.ietf.org/html/bcp47#section-3.5
64 variant subtags registered to date
VARIANT SUBTAGS
Examples:
  Variant subtag   Meaning
  aluku            Aluku dialect of the "Busi Nenge Tongo" English-based Creole continuum in Eastern Suriname and Western French Guiana
  balanka          The Balanka dialect of Anii
  itihasa          Epic Sanskrit
  vallader         Vallader idiom of Romansh
  1959acad         "Academic" ("governmental") variant of Belarusian as codified in 1959
  1606nict         Late Middle French (to 1606 as in Jean Nicot, "Thresor de la langue francoyse", 1606)
  baku1926         Unified Turkic Latin Alphabet (principles codified at the 1926 Turkological Conference in Baku)
VARIANT SUBTAGS
Registration form example:

LANGUAGE SUBTAG REGISTRATION FORM
1. Name of requester: Tomaž Erjavec
2. E-mail address of requester: tomaz.erjavec&ijs.si
3. Record Requested:
   Type: variant
   Subtag: metelko
   Description: Slovene in Metelko alphabet
   Prefix: sl
   Comments: The subtag represents the alphabet codified by Franc Serafin Metelko and used from 1825 to 1833.
4. Intended meaning of the subtag:
   The subtag marks texts written in Slovene using the historical Metelko alphabet, which is distinguished from the contemporary norm by borrowing (and modifying) letters from Cyrillic.
5. Reference to published description of the language (book or article):
   http://en.wikipedia.org/wiki/metelko_alphabet
   Stabej, Marko. Franc Serafin Metelko in Metelčica. In (Janez Cvirn, ed.) Slovenska Kronika XIX. stoletja. (2001). Print.
6. Any other relevant information:
   The tag "sl-metelko" is relevant as a possible value of the @xml:lang attribute to be used by language technology applications for transcribing and modernising such texts, e.g. for text search in cultural heritage digital libraries. E.g. the National and University Library of Slovenia has plans to digitise about 5,000 pages of books written in the Metelko alphabet.
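The Prefix field of a registration record states which tags a variant is intended to follow. A minimal sketch of checking a tag against that field is given below. The dictionary is a tiny hand-copied sample (the `metelko` record from the form above, plus the prefixes that `valencia` and `1996` are assumed to carry in the IANA registry), not the registry itself, and the function name is hypothetical.

```python
# Hand-copied sample of Prefix fields; a real application would read the
# IANA Language Subtag Registry instead of hard-coding these values.
SAMPLE_VARIANT_PREFIXES = {
    "metelko": ["sl"],
    "valencia": ["ca"],
    "1996": ["de"],
}


def variant_matches_prefix(tag, variant, registry=SAMPLE_VARIANT_PREFIXES):
    """Return True if `variant` occurs in `tag` after one of its prefixes.

    Raises KeyError if the variant is not in the sample registry.
    """
    prefixes = registry[variant]
    subtags = tag.lower().split("-")
    if variant not in subtags:
        return False
    # Everything before the variant must equal a prefix or extend it.
    preceding = "-".join(subtags[: subtags.index(variant)])
    return any(
        preceding == prefix or preceding.startswith(prefix + "-")
        for prefix in prefixes
    )
```

With this sample, `variant_matches_prefix("sl-metelko", "metelko")` is true and `variant_matches_prefix("en-metelko", "metelko")` is false. A tag that fails the check may still be well-formed; the Prefix field records recommended usage.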
EXTENSIONS
For concepts that go beyond language but have language as a core component
Created by IETF process
Details of an extension can be owned by other authorities
Existing extensions
  t: Transformed content
    RFC 6497
    Maintaining authority: Unicode Consortium
  u: Unicode locale
    RFC 6067
    Maintaining authority: Unicode Consortium
BCP 47 EXTENSIONS
Example: t extension
  Transformed content
  Extension defined in RFC 6497 + Unicode specification UTS #35
Example tag: und-hebr-t-und-latn-m0-ungegn-1977
  Syntax: language + script + t extension
  Meaning: content in Hebrew script transformed from Latin script according to a UNGEGN 1977 transliteration specification
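The structure of the example tag can be made explicit by splitting it at the "t" singleton. The sketch below assumes, following RFC 6497, that the subtags after "t" consist of an optional source language tag followed by fields whose keys are one letter plus one digit (such as `m0`); the function name and the shape of the returned dictionary are hypothetical.

```python
import re

# Field keys in the t extension are one letter followed by one digit.
_TKEY = re.compile(r"^[a-z][0-9]$")


def parse_t_extension(tag):
    """Split a tag with a t extension into target, source and fields."""
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        raise ValueError(f"no t extension in {tag!r}")
    t_index = subtags.index("t")
    source, fields, key = [], {}, None
    for subtag in subtags[t_index + 1:]:
        if len(subtag) == 1:
            break  # another extension or private use begins here
        if _TKEY.match(subtag):
            key = subtag
            fields[key] = []
        elif key is None:
            source.append(subtag)  # still inside the source language tag
        else:
            fields[key].append(subtag)
    return {
        "target": "-".join(subtags[:t_index]),
        "source": "-".join(source),
        "fields": {k: "-".join(v) for k, v in fields.items()},
    }
```

Applied to `und-hebr-t-und-latn-m0-ungegn-1977`, this yields the target `und-hebr`, the source `und-latn`, and the field `m0` with value `ungegn-1977`, which mirrors the meaning given on the slide.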
UTILITY OF INDUSTRY MECHANISMS FOR LINGUISTS
LINGUISTICS / LANGUAGE DOCUMENTATION
Goals:
  Language development (literacy, lexicography, content development)
  Scholarly documentation
Objects of scholarly investigation
  Languages, dialects
  Linguistic varieties / variety networks, languoids
HANDLING LINGUISTS' NEEDS
Use existing BCP 47 mechanisms
  Used in xml:lang
  Used in Dublin Core Metadata Element Set, Version 1.1
Existing process to register variant subtags
Existing subtags registered for different kinds of variants
  Dialect variants
  Pronunciation variants
  Historic variants
  Orthographic / spelling variants
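As an illustration of BCP 47 tags carried in xml:lang, the sketch below reads the attribute from a small XML fragment using Python's standard library, applying the XML rule that xml:lang is inherited by descendant elements. The sample document and the function name are invented for this example; the tag `sl-metelko` is the one from the registration form shown earlier.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document, for illustration only.
SAMPLE = """<text xml:lang="sl">
  <p>...</p>
  <p xml:lang="sl-metelko">...</p>
</text>"""


def effective_languages(xml_source):
    """List (element name, effective language tag) in document order."""
    result = []

    def walk(element, inherited):
        # An element without its own xml:lang inherits its parent's value.
        lang = element.get(XML_LANG, inherited)
        result.append((element.tag, lang))
        for child in element:
            walk(child, lang)

    walk(ET.fromstring(xml_source), None)
    return result
```

Calling `effective_languages(SAMPLE)` reports `sl` for the root and the first paragraph and `sl-metelko` for the second, so language-specific processing can be selected per element.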
HANDLING LINGUISTS' NEEDS
Linguists could create a new BCP 47 extension
Extension could use glottocode or other domain-specific vocabulary
Example (hypothetical): pww-l-gc-maes1238-sc-l2fsi2-sd-disarthr
  ISO 639-3: Northern Pwo Karen
  Extension key-value pairs:
    glottocode: Mae Sarieng variant
    speaker competence: L2 speaker, estimated FSI level 2
    speech defect: symptoms of dysarthria
Process: see http://tools.ietf.org/html/bcp47#section-3.7
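To show how software might read the hypothetical tag above, the sketch below splits it into the base language tag and its key-value pairs. No "l" extension exists; the assumption that keys are two characters long and values are longer is made here only so that the slide's example can be decoded, and the function name is likewise hypothetical.

```python
def parse_hypothetical_l_extension(tag):
    """Decode the slide's hypothetical "l" extension into key-value pairs.

    Assumes two-character keys, each followed by one or more longer
    value subtags. This is not a registered BCP 47 extension.
    """
    subtags = tag.lower().split("-")
    if "l" not in subtags:
        raise ValueError(f"no l extension in {tag!r}")
    l_index = subtags.index("l")
    pairs, key = {}, None
    for subtag in subtags[l_index + 1:]:
        if len(subtag) == 1:
            break  # another extension or private use begins here
        if len(subtag) == 2:
            key = subtag
            pairs[key] = []
        elif key is None:
            raise ValueError(f"value {subtag!r} appears before any key")
        else:
            pairs[key].append(subtag)
    return {
        "base": "-".join(subtags[:l_index]),
        "pairs": {k: "-".join(v) for k, v in pairs.items()},
    }
```

For `pww-l-gc-maes1238-sc-l2fsi2-sd-disarthr` this gives the base tag `pww` and the pairs `gc` = `maes1238`, `sc` = `l2fsi2`, `sd` = `disarthr`. Software unaware of the extension could still fall back to the base tag.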
HANDLING LINGUISTS' NEEDS
What if ISO 639, BCP 47 aren't enough?
May need properties that don't belong in language tags (e.g., speaker's social network)
  Create appropriate data schemas
May need properties pertaining directly to language variation, but BCP 47 variants / extensions deemed not a good fit (e.g., tentative analysis, doesn't align to existing ISO 639-3 categories)
  Create other metadata vocabularies and data schemas
  Trade-off: not supported in general industry specifications (e.g., xml:lang)
SUMMATION
SUMMATION
Information technologies rely heavily on ISO 639, including ISO 639-3, most often via BCP 47
Many needs of linguists can be accommodated by existing BCP 47 mechanisms
Additional needs of linguists might be accommodated by creation of a new BCP 47 extension
Linguists should use ISO 639-3 and BCP 47 whenever appropriate!