Internationalizing the Domain Name System
Šimon Hochla, Anisa Azis, Fara Nabilla
Internationalizing the Internet
Master in Innovation and Research in Informatics
Problem: using non-ASCII characters on the Internet. Users want their local language to be usable on the Internet with the same ease as English.
One solution is localization: adapting the local computing environment to suit local linguistic needs.
Why is localization not a compelling solution for the DNS in a multilingual environment?
The DNS spans the entire network and binds all users' language symbols together.
The DNS therefore has to allow all of those language symbols to be used within the same system (internationalization).
Autumn 2015, Computer Networks (MIRI)
Situation of the current DNS
The DNS is the most common means of initiating a network transaction, whether it is a BitTorrent session, the Web, e-mail, or any other form of network activity.
Assumption: a DNS name is usually a sequence of English words or abbreviations written in the ASCII character set.
Can the DNS support non-Western scripts and diacritics?
Some DNS implementations do not support any characters other than ASCII.
Acute and grave accents, umlauts, and similar marks can produce unwanted results.
DNS resolvers do not recognize non-ASCII characters, so a Unicode name in a URL has to be encoded into the DNS LDH (letters, digits, hyphen) form.
Multilingual Characters
Non-Latin scripts are already supported in many applications, and non-Latin characters can be entered on computer keyboards.
What does internationalizing the DNS mean?
It means going beyond a DNS where only Latin characters and English are used: a communication initiated in one locale keeps its language and presentation wherever it is received.
Terminology
Language: A language uses characters drawn from a collection of scripts.
Script: A script is a collection of characters that are related in their use by a language.
Character: A character is a unit of a script.
Glyph: The presentation of a character within the style of a font is called a glyph.
Font: A font is a collection of glyphs encompassing a script character set that share a consistent presentation style.
Unicode
What is the objective of internationalizing the DNS?
The DNS should support the union of all character sets while avoiding ambiguity and uncertainty in the resolution of any individual DNS name.
Solution: Unicode, a "universal character set" that provides a universal encoding of characters across all scripts and all languages.
Unicode representations
Unicode code points can be serialized in several ways, using the character encoding schemes known as Unicode Transformation Formats (UTF).
The most common are UTF-8 and UTF-16.
UTF-8 and UTF-16 are variable-length; UTF-32 is fixed-length.
In UTF-16, characters that do not belong to the Basic Multilingual Plane are mapped into a pair of 16-bit words.
A common criticism is that certain scripts are penalised, because their code points need more bytes to represent.
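The length differences between the three formats can be checked directly. The following is a minimal Python sketch; the sample characters and the helper name `encoded_lengths` are illustrative choices, not part of the slides.

```python
# Sample characters: ASCII, Latin-1 range, a CJK character in the
# Basic Multilingual Plane, and an emoji outside the BMP.
samples = ["a", "ü", "日", "😀"]

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in samples:
    print(repr(ch), encoded_lengths(ch))

# A character outside the BMP is stored in UTF-16 as a pair of
# 16-bit words (a high and a low surrogate).
units = "😀".encode("utf-16-be")
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")
print(hex(high), hex(low))
```

UTF-32 always uses four bytes, while the UTF-8 length grows from one byte for ASCII to four bytes for characters outside the BMP.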
Context of a script and a language
Unicode is weak at identifying the script and language context of a given character sequence.
Solution: tag the content with its script and encoding scheme.
Tagging is useful for e-mail and web page content, but it breaks down in the context of the DNS. Why?
The DNS needs a "universal" character set and a "universal" language context.
There is no natural space in DNS names to contain tags.
The DNS must therefore carry implicit tags of all characters and all languages.
DNS 7-bit ASCII vs Unicode 8-bit clean
8-bit clean: a computer system that correctly handles 8-bit character encodings, such as the ISO 8859 series and the UTF-8 encoding of Unicode.
The Unicode UTF-8, UTF-16, and UTF-32 encodings all require an 8-bit clean storage and transmission medium.
Traditional DNS domain names are representable with 7-bit ASCII characters.
The IETF's IDN Working Group decided to move towards application assistance instead of having the DNS support non-ASCII characters.
Why are DNS domain names represented in 7-bit ASCII rather than being 8-bit clean? Because of the LDH restriction applied to DNS domain names.
LDH convention
RFC 1035:
1. Each DNS label must begin with a letter (restricted to the Latin subset A through Z and a through z), followed by a sequence of letters, digits, or hyphens, and must end with a letter or digit, never a hyphen.
2. The case of a letter is not significant to the DNS, so within the DNS a is equivalent to A, and so on (monocase characters).
3. The DNS uses a left-to-right ordering of these labels, with the ASCII period as the label delimiter.
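The three rules above can be sketched as a small validator. This is an illustrative Python sketch that follows the strict rule as stated on the slide; the names `LDH_LABEL`, `is_ldh_name`, and `same_name` are hypothetical, and the 63-octet label limit is the one given in RFC 1035.

```python
import re

# One label: starts with a letter, ends with a letter or digit, with
# letters, digits, or hyphens in between, and at most 63 characters.
LDH_LABEL = re.compile(r"[A-Za-z]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_ldh_name(name):
    """Check every dot-separated label of name against the LDH rule."""
    if name.endswith("."):
        name = name[:-1]  # allow a single trailing root dot
    return all(LDH_LABEL.fullmatch(label) is not None for label in name.split("."))

def same_name(a, b):
    """Rule 2: the DNS compares names without regard to letter case."""
    return a.lower() == b.lower()

print(is_ldh_name("www.example.com"))   # plain LDH name
print(is_ldh_name("bücher.example"))    # contains a non-ASCII letter
print(same_name("Example.COM", "example.com"))
```

A name such as "bücher.example" fails the check, which is why an encoding step is needed before it can be handed to the DNS.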
Internationalisation
Goal: allow DNS names to be written in the user's own language, while the DNS continues to operate in a consistent and deterministic manner within its restricted language.
Two options:
1. Make the DNS 8-bit clean.
2. Have applications do the work and present to the DNS an encoded form of the Unicode sequences that conforms to the restricted DNS character repertoire.
IDN framework
The IDN Working Group of the IETF was formed in 2000 with the goal of developing standards to internationalize domain names.
Its outcome is the IDNA framework.
ASCII Compatible Encoding (ACE): an encoding of the Unicode strings of IDNs into ASCII characters.
The IETF adopted Punycode as its standard IDN ACE.
IDN
An internationalized domain name (IDN) is an Internet domain name that contains at least one label that is displayed in software applications in a language-specific script or alphabet.
These writing systems are encoded by computers in multi-byte Unicode.
Punycode
Punycode is a way to represent Unicode with the limited ASCII character subset supported by the Domain Name System.
For example, "München" (the German name for the city of Munich) is encoded as "Mnchen-3ya".
RFC 3490 defines a presentation layer in IDN-aware applications that is responsible for the Punycode ACE encoding and decoding.
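The München example can be reproduced with the `punycode` codec that ships with Python's standard library. The wrapper names `to_punycode` and `from_punycode` below are illustrative only.

```python
def to_punycode(label):
    """Encode a Unicode label with Python's built-in punycode codec."""
    return label.encode("punycode").decode("ascii")

def from_punycode(encoded):
    """Decode a Punycode string back into the Unicode label."""
    return encoded.encode("ascii").decode("punycode")

encoded = to_punycode("München")
print(encoded)
print(from_punycode(encoded))
```

Note that this codec produces the bare Punycode form; the "xn--" prefix used in the DNS is added by the IDNA layer, not by Punycode itself.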
IDN in Applications
Role of the application in IDN: transform a domain name expressed in a particular language and script into an ASCII-compatible, LDH-encoded string, and back.
It is critical that the encoding and decoding functions work correctly, deterministically, and uniformly.
The DNS stores the encoded version of the canonical name.
The DNS is deterministic: it does not return a set of possible answers to a query, so approximate matching cannot be used.
The Presentation Layer Transform for IDNs
The algorithm groups "equivalent" Unicode strings and selects a single "canonical" string from each group of possible IDN strings.
The reverse transform maps the DNS LDH string back into the Unicode string.
Stringprep pipeline: original Unicode string -> (numerous transformations applied) -> regular, or canonical, form of the IDN string -> (transformation using the Punycode ACE) -> encoded DNS string.
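The pipeline above can be sketched for a single label using pieces of Python's standard library: `encodings.idna.nameprep` for the canonical form and the `punycode` codec for the ACE step. The function names `label_to_ascii` and `label_to_unicode` are hypothetical, and the sketch skips the length and error checks that a complete implementation would need.

```python
from encodings.idna import nameprep

ACE_PREFIX = "xn--"

def label_to_ascii(label):
    """Unicode label -> canonical form -> Punycode ACE with prefix."""
    canonical = nameprep(label)          # mapping, normalisation, checks
    if canonical.isascii():
        return canonical                 # already LDH-compatible text
    return ACE_PREFIX + canonical.encode("punycode").decode("ascii")

def label_to_unicode(label):
    """Reverse direction: strip the prefix and decode the ACE form."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

# Different spellings of the same name collapse to one DNS string.
print(label_to_ascii("Bücher"), label_to_ascii("BÜCHER"))
print(label_to_unicode("xn--bcher-kva"))
```

Because both spellings map to the same canonical string before encoding, the DNS only ever sees one name for the whole group of equivalent inputs.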
Transformation
Mapping: converting a string to a normal, or canonical, form. It transforms to lower case and removes characters that carry no semantic meaning and do not affect equivalence.
Normalisation: many scripts allow different character sequences with the same meaning, e.g. the letter Ä can be LATIN CAPITAL LETTER A WITH DIAERESIS, or LATIN CAPITAL LETTER A followed by COMBINING DIAERESIS.
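The Ä example can be verified with Python's `unicodedata` module. In this sketch the helper `canonical` is a simplified stand-in for the mapping step: it uses `str.lower()` rather than the full case-folding tables, which is an assumption made for brevity.

```python
import unicodedata

precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

def canonical(text):
    """Simplified canonical form: NFKC normalisation, then lower case."""
    return unicodedata.normalize("NFKC", text).lower()

# The two spellings look the same but are different code point sequences.
print(precomposed == decomposed)
# After normalisation and case mapping they are identical.
print(canonical(precomposed) == canonical(decomposed))
```

Without this step the two visually identical inputs would encode to different DNS strings.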
Nameprep: A Stringprep Profile for the DNS
Nameprep specifies stringprep for internationalized domain names: a character repertoire and a profile of mappings, normalization (form KC), prohibited characters, and bidirectional character handling.
The Punycode ASCII-Compatible Encoding
Punycode transforms the canonical form of the Unicode name string into an LDH-equivalent string using an ACE.
Algorithm:
1. The code points are divided into basic and extended code points.
2. A literal reproduction of the basic code points goes first.
3. A delimiter is added (a basic code point that does not occur in the remainder of the string).
4. The extended code points are appended as a series of integers, expressed through an encoding into the basic (LDH) code set.
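The steps above can be sketched as a compact encoder. This is an illustrative Python sketch that follows the encoding procedure and parameter values of RFC 3492; it omits overflow handling and the decoder, and the function names are hypothetical.

```python
# Parameter values from RFC 3492.
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

def adapt(delta, numpoints, first):
    """Bias adaptation after each encoded delta."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + (BASE * delta) // (delta + SKEW)

def encode_integer(q, bias):
    """Write q as a variable-length integer, least significant digit first."""
    out = []
    k = BASE
    while True:
        t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
        if q < t:
            break
        out.append(DIGITS[t + (q - t) % (BASE - t)])
        q = (q - t) // (BASE - t)
        k += BASE
    out.append(DIGITS[q])
    return "".join(out)

def punycode_encode(text):
    basic = [c for c in text if ord(c) < INITIAL_N]
    out = "".join(basic)              # steps 1 and 2: basic code points first
    if basic:
        out += "-"                    # step 3: delimiter
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    h = b = len(basic)
    while h < len(text):              # step 4: extended code points as integers
        m = min(ord(c) for c in text if ord(c) >= n)
        delta += (m - n) * (h + 1)
        n = m
        for c in text:
            if ord(c) < n:
                delta += 1
            elif ord(c) == n:
                out += encode_integer(delta, bias)
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return out

print(punycode_encode("bücher"))
```

The result can be cross-checked against Python's built-in codec with `"bücher".encode("punycode")`.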
Example of Punycode
bücher
The ü (code point 252) is encoded as a delta of 745.
745 = (21 x 35) + 10, which gives the digit values (10, 21, 0), least significant first, written as the letters kva.
Result: xn--bcher-kva
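The arithmetic of this example can be replayed in a few lines of Python. The sketch assumes the digit alphabet of RFC 3492 (a to z for 0 to 25, then 0 to 9 for 26 to 35) and a constant digit weight step of 35, which holds for the first digits of this particular label; the helper name `decode_digits` is hypothetical.

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"

# Delta for inserting ü (252) into "bcher" after the first letter:
# the code point counter advances from 128 to 252, and each step spans
# the 6 possible insertion positions; one more position is skipped for 'b'.
delta = (252 - 128) * (len("bcher") + 1) + 1

def decode_digits(letters, weight_step=35):
    """Sum digit values with weights 1, 35, 35*35, ... (little-endian)."""
    value, weight = 0, 1
    for ch in letters:
        value += DIGITS.index(ch) * weight
        weight *= weight_step
    return value

print(delta)
print([DIGITS.index(ch) for ch in "kva"])
print(decode_digits("kva"))
```

Both routes arrive at the same integer, which is why "kva" after the delimiter is enough to restore the ü in the right position.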
Homoglyphs
Distinct characters are not necessarily displayed in distinct ways.
www.paypal.com vs www.pаypal.com (the second name uses the Cyrillic letter а in place of the first Latin a)
In the first case the domain name www.paypal.com is resolved in the DNS as www.paypal.com; in the second case www.pаypal.com is translated to www.xn--pypal-4ve.com.
There is no clear one-to-one relationship between characters and glyphs: multiple characters can map to a single glyph, e.g. the pair f l displayed as the single glyph ﬂ, and a single character can have multiple glyphs.
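The paypal example can be reproduced in Python with the built-in `idna` codec. In this sketch the helper `mixes_scripts` is a crude, hypothetical heuristic that reads the script from the first word of each character's Unicode name; it is not a real script-detection algorithm.

```python
import unicodedata

latin = "paypal"
spoof = "p\u0430ypal"   # second character is CYRILLIC SMALL LETTER A

def to_ace(label):
    """Encode a label with Python's built-in IDNA codec."""
    return label.encode("idna").decode("ascii")

def mixes_scripts(label):
    """Crude check: do the Unicode character names start with different words?"""
    return len({unicodedata.name(ch).split()[0] for ch in label}) > 1

print(latin == spoof)                 # the strings differ although they look alike
print(to_ace(latin), to_ace(spoof))   # the DNS sees two unrelated names
print(mixes_scripts(latin), mixes_scripts(spoof))
```

Exposing the ACE form, as some browsers do, makes the difference visible to the user even when the glyphs are identical.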
Homoglyphs cont.
Two unequal strings can be indistinguishable from the user's point of view.
Browser responses:
First response: disable IDN support.
Second response: expose the Punycode version of the URL.
Most popular browsers display the glyphs rather than the ASCII Punycode.
Ambiguity
The intention of the IDN effort was to preserve the deterministic property of DNS resolution, but it did not quite manage to reach that goal.
Languages are human-use systems, resistant to automated processing.
Language and script context are needed to resolve homoglyphs.
Response: a refined definition of IDN labels that lists which Unicode code points can be used in the context of IDNs, excluding all others.
What about putting IDN codes into the root of the DNS as alternative top-level domains (TLDs)?
This is a natural extension of adding Punycode-encoded name entries into the lower levels of the DNS.
It would allow any DNS name to be wholly expressible in the user's language, implying that all parts of the DNS should be able to carry native-language-encoded DNS names.
Multilingual equivalents of protocol identifier codes
The multilingual presentation of these elements is left to the application, rather than attempting to alter the protocol identifiers in the relevant standards.
Internationalisation of TLD
Equivalence of a TLD (top-level domain) in its ASCII form and in its Punycode form: .jp vs .xn--wgv71a
Should they be in the same DNS zone or in different ones?
If they are equivalent, they hold precisely the same set of subdomain names: a registration in one of these equivalent names is in effect a name registration across the entire equivalence set.
Multilingual should not mean only multiscript: .com in German might be represented as .kom.
DNAME
DNAME records can make IDN TLD names aliases for their ASCII equivalents.
DNAME places load back on the name servers and is still in early development.
It locks up IDNs in the hands of the TLD name-registry operators.
Alternative: a single registrar for each IDN variant of the same TLD, with competition between the various registrars; this may do more harm than good for Internet users.
The DNS top-level name space is very conservatively managed, and new entries into this space are not made lightly.
TLD and the presentation layer
The presentation layer could also map the Punycode ACE equivalents of the TLDs to the actual ASCII TLDs, for example:
日本 into xn--wgv71a
xn--wgv71a into jp
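The two-step mapping above could look roughly like the following Python sketch. The alias table `TLD_ALIASES` and the function `map_tld` are hypothetical; they only illustrate the idea from the slide and do not describe any deployed mechanism.

```python
# Hypothetical alias table: ACE form of an IDN TLD -> existing ASCII TLD.
TLD_ALIASES = {"xn--wgv71a": "jp"}

def map_tld(name):
    """Encode a name to its ACE form, then swap an aliased TLD for its ASCII TLD."""
    labels = name.encode("idna").decode("ascii").lower().split(".")
    labels[-1] = TLD_ALIASES.get(labels[-1], labels[-1])
    return ".".join(labels)

print(map_tld("example.日本"))
print(map_tld("example.com"))
```

With such a table in the application, the DNS root would not need any new entries, at the cost of every application having to agree on the same table.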
Conclusion
ICANN is in a challenging situation: many people claim that it ignores non-English languages because of a political bias.
The overwhelming majority of Internet users and of commercial activity on the Internet is in languages other than English, so an ASCII-only DNS is unnatural.
Changes to the DNS need long-term thinking rather than ad hoc decisions, which could eventually cause a fragmented Internet.
Internationalisation of the DNS is necessary; we need both internationalisation and localisation.
Guiding question: what causes a user the least amount of surprise?
Questions to Discuss
Why is the DNS such a heavily restricted language?
Is it better to have IDNs for TLDs (IDNs in the DNS root)?