Introduction to Unicode. By: Atif Gulzar Center for Research in Urdu Language Processing

1 Introduction to Unicode By: Atif Gulzar Center for Research in Urdu Language Processing

2 Introduction to Unicode: Why Unicode? What is Unicode? Unicode Architecture

3 Why Unicode?

4 Pre-Unicode Standards and their Limitations
  ASCII, published by ASA (later ANSI) in 1963 (7-bit code)
  ISO adopted ASCII in 1967 as ISO 646
  ISO 2022 (8-bit code)
  ISO 8859, a family of 8-bit standards with 15 published parts, numbered 1 to 16 (part 12 was never published)
  Code pages
  And plenty of standards for East Asian languages

5 ISO 8859 cont.
  ISO 8859-1, Latin-1, Western European
  ISO 8859-2, Latin-2, Eastern European
  ISO 8859-3, Latin-3, Southern European
  ISO 8859-4, Latin-4, Northern European
  ISO 8859-5, Cyrillic: Russian, Bulgarian, ...
  ISO 8859-6, Arabic
  ISO 8859-7, Greek
  ISO 8859-8, Hebrew

6 ISO 8859
  ISO 8859-9, Latin-5, Turkish
  ISO 8859-10, Latin-6, Northern European
  ISO 8859-11, Thai
  ISO 8859-13, Latin-7, Baltic
  ISO 8859-14, Latin-8, Celtic
  ISO 8859-15, Latin-9, Western European
  ISO 8859-16, Latin-10, Eastern European

7 ASCII
  Upper case (A-Z)                     26
  Lower case (a-z)                     26
  Digits (0-9)                         10
  Space                                 1
  Punctuation marks (.,+{)%)           32
  Control characters (tab, cr, lf)     33
  =======================================
  Total                               128
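The tally on the ASCII slide can be checked with a few lines of Python. This is only a sketch: it assumes the slide's "punctuation marks" are the same set as Python's `string.punctuation`, and that the control characters are code points 0 to 31 plus DEL (127).

```python
import string

# Tally the 128 ASCII code points using the categories from the slide.
upper = len(string.ascii_uppercase)    # A-Z
lower = len(string.ascii_lowercase)    # a-z
digits = len(string.digits)            # 0-9
space = 1                              # the space character, code 32
punctuation = len(string.punctuation)  # printable symbols such as . , + { ) %
# Control characters: codes 0-31 plus DEL (127).
controls = sum(1 for code in range(128) if code < 32 or code == 127)

total = upper + lower + digits + space + punctuation + controls
print(upper, lower, digits, space, punctuation, controls, total)
```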

8 ASCII code page

9 ANSI 1252 code page

10 ANSI 1256 code page Arabic

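11 ANSI 1252 code page (figure): the same byte value shown under several code pages. ANSI 1252: È (00C8); Central Europe: Č; Arabic: ب; Greek: Θ; Cyrillic: И; Thai: ศ; Hebrew: (character lost in extraction)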

12 Code page for Windows Characters not common to other codepages

13 The Code Page Problem cont.
  Each character set: ASCII (common) + extended
  ANSI (Western Europe), Eastern Europe, Baltic, Greek, Cyrillic, Thai, Turkish, Arabic, Hebrew, etc.
  Extensions for many countries
  Byte values from 128 upward change meaning from one code page to the next
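A short Python sketch makes the problem concrete by decoding one byte under several Windows code pages. The byte 0xC8 and the particular code pages are illustrative choices, not something the slides prescribe; the codec names are the ones shipped with Python's standard library.

```python
# One byte value, many meanings: decode 0xC8 under several Windows code pages.
raw = bytes([0xC8])

code_pages = {
    "cp1252": "Western Europe",
    "cp1250": "Central Europe",
    "cp1251": "Cyrillic",
    "cp1253": "Greek",
    "cp1256": "Arabic",
    "cp874": "Thai",
}

decoded = {name: raw.decode(name) for name in code_pages}
for name, region in code_pages.items():
    ch = decoded[name]
    print(f"{name:7} {region:15} U+{ord(ch):04X} {ch}")

# Bytes below 0x80 are plain ASCII and decode the same way in all of them.
ascii_is_stable = all(b"A".decode(name) == "A" for name in code_pages)
print(ascii_is_stable)
```

Without knowing which code page a file was written in, a reader has no way to tell which of these characters was meant.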

14 The Code Page Problem cont.
  Characters in most languages are traditionally represented by single-byte values
  Allows for 256 characters max
  Real limit for most encodings is 192 characters
  This includes letters, digits, punctuation, symbols
  When a system is used for a new language, the encoding has to be adapted to use that language's characters

15 The Code Page Problem Each language or group of languages gets its own encoding Different vendors or standards committees devise different encodings, so generally each language has several, often incompatible, encodings

16 Interoperability Problems
  Can't easily mix languages in a document or system
  Data is not tagged with its encoding, so loss can occur when transferring between systems
  Most encodings are ASCII-based, so problems are often not seen with English-only data
  Two possible solutions:
    Systematic tagging of textual data with an encoding ID
    A universal encoding standard with all languages' characters

17 What is Unicode?

18 Unicode or Universal Code
  One universal code for every character, no matter what the platform, no matter what the program, no matter what the language.
  Unicode is not just a bunch of code points
  Initially it was a 2-byte code that could support over 65,000 characters
  The Unicode Standard, Version 4.0, provides codes for 96,447 characters
  Adopted by ISO as ISO/IEC 10646

19 Principles of the Unicode Standard Universality Efficiency Characters, not glyphs Semantics Plain text Logical order Unification Convertibility Accurate

20 Universality (Unicode Coverage)
  European scripts: Latin, Greek, Cyrillic, Armenian, Georgian, IPA
  Bidirectional (Middle Eastern) scripts: Hebrew, Arabic, Syriac, Thaana
  Indic (Indian and Southeast Asian) scripts: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Khmer, Myanmar, Tibetan, Philippine
  East Asian scripts: Chinese (Han) characters, Japanese (Hiragana and Katakana), Korean (Hangul), Yi
  Other modern scripts: Mongolian, Ethiopic, Cherokee, Canadian Aboriginal
  Historical scripts: Runic, Ogham, Old Italic, Gothic, Deseret
  Punctuation and symbols: numerals, math symbols, scientific symbols, arrows, blocks, geometric shapes, Braille, musical notation, etc.

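21 Characters and Glyphs (figure): the characters f and i may be displayed as the single ligature glyph ﬁ; the Arabic characters ل and ا are displayed as the ligature لا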

22 Plain Text (figure): two samples of the same sentence, "It is a plain text" and "It is a formatted text"

23 Logical Data Ordering and Bidi Algorithm

24 Unification
  The Unicode Standard avoids duplicate encoding of characters within scripts across languages.
  In Chinese, Japanese and Korean, many ideographs are common.
  The character code U+0059 (Y) is the same in English, German and French

25 Character Semantics cont.
  The Unicode standard includes an extensive database that specifies a large number of character properties, including:
    Name
    Type (e.g., letter, digit, punctuation mark)
    Decomposition
    Case and case mappings (for cased letters)
    Numeric value (for digits and numerals)
    Combining class (for combining characters)
    Directionality
    Line-breaking behavior
    Cursive joining behavior
    For Chinese characters, mappings to various other standards
    and many other properties

26 Character semantics
  1781;KHMER LETTER KHA;Lo;0;L;;;;;N;;;;;
  17BE;KHMER VOWEL SIGN OE;Mc;0;L;;;;;N;;;;;
  17E5;KHMER DIGIT FIVE;Nd;0;L;;5;5;5;N;;;;;
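Several of these fields can be read back programmatically. The sketch below uses Python's standard `unicodedata` module, which bundles a newer version of the Unicode Character Database than the one the slides describe; the helper name `describe` is made up for this example.

```python
import unicodedata

def describe(ch: str) -> tuple:
    """Return (code, name, category, combining class, bidi class, digit value)."""
    return (
        f"{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # general category, e.g. Lo, Mc, Nd
        unicodedata.combining(ch),      # canonical combining class
        unicodedata.bidirectional(ch),  # directionality, e.g. L
        unicodedata.digit(ch, None),    # digit value, or None for non-digits
    )

# The three Khmer characters from the database records above.
for ch in ("\u1781", "\u17BE", "\u17E5"):
    print(describe(ch))
```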

27 Unicode Architecture

28 Unicode Architecture
  Initially Unicode was designed for a 16-bit encoding space, consisting of 256 rows of 256 characters each
  ISO 10646 was designed for a 32-bit encoding space of which 31 bits are used, so it has room for 2,147,483,648 characters
  After Unicode came out, it became clear that a 16-bit encoding space is too small
  In Unicode 3.0 the length is increased to 21 bits, with a code space of 1,114,112 code points (U+0000 to U+10FFFF)

29 Encoding Space
  Early versions of Unicode used 16 bits
  Unicode now uses 21 bits: plane number, row number, character number
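As a rough illustration of that three-part layout, a code point can be split with shifts and masks. The function name `split_code_point` and the two sample code points are choices made for this sketch only.

```python
def split_code_point(cp: int) -> tuple:
    """Return (plane, row, character number) for a code point in U+0000..U+10FFFF."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    plane = cp >> 16         # top 5 bits: planes 0x00 to 0x10
    row = (cp >> 8) & 0xFF   # middle 8 bits: 256 rows per plane
    cell = cp & 0xFF         # low 8 bits: 256 characters per row
    return plane, row, cell

print(split_code_point(0x0628))   # (0, 6, 40)
print(split_code_point(0x1D11E))  # (1, 209, 30)

# 17 planes of 256 rows of 256 characters give the size of the code space.
print(17 * 256 * 256)             # 1114112
```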

30 The Unicode Encoding Space (figure: the 17 planes, numbered 0 to 10 in hexadecimal). Plane 0: Basic Multilingual Plane

31 The Unicode Encoding Space (figure). Planes 1 to 10: Supplementary Planes

32 The Unicode Encoding Space (figure). Plane 1: Supplementary Multilingual Plane; Plane 2: Supplementary Ideographic Plane; Plane E: Supplementary Special-Purpose Plane; Planes F and 10: Private Use Planes

33 The Basic Multilingual Plane (figure: layout by leading hexadecimal digit, 0 to F). Areas shown: General Scripts Area, Symbols Area, CJK Punct., Han, Yi, Hangul, Surrogates Area, Private Use Area, Compatibility Area

34 The General Scripts Area (figure: rows 00 to 1F of the Basic Multilingual Plane). Scripts shown: Latin, IPA, Diacriticals, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Georgian, Hangul, Ethiopic, Cherokee, Canadian Aboriginal Syllabics, Ogham, Runic, Philippine, Khmer, Mongolian, Latin, Greek

35 Unicode Storage Formats or Unicode Encodings
  UTF-32
  UTF-16
  UTF-8
  UTF-7
  CESU-8
  UTF-EBCDIC
  BOCU

36 Storage formats cont.
  UTF-32: The 21-bit abstract Unicode value is simply zero-padded to 32 bits.
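A minimal sketch of the zero-padding, assuming big-endian byte order and no byte order mark; `utf32_be` is a hypothetical helper name. Python's built-in `utf-32-be` codec produces the same bytes.

```python
import struct

def utf32_be(cp: int) -> bytes:
    """Zero-pad a code point to one big-endian 32-bit code unit."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return struct.pack(">I", cp)

print(utf32_be(0x0628).hex())   # 00000628
print(utf32_be(0x1D11E).hex())  # 0001d11e
```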

37 Storage formats: UTF-16
  For characters in the BMP, the 21-bit value is simply truncated to 16 bits.
  For other characters, the 21-bit value is turned into a sequence of two 16-bit values called a surrogate pair.
  A particular numeric value is either a BMP character, a high surrogate, or a low surrogate.
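The surrogate-pair arithmetic can be sketched as follows. The function names `utf16_units` and `classify_unit` are invented for this example; the constants 0xD800 and 0xDC00 are the starts of the high and low surrogate ranges.

```python
def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                  # BMP: one 16-bit unit
    v = cp - 0x10000                 # 20 significant bits remain
    high = 0xD800 | (v >> 10)        # upper 10 bits
    low = 0xDC00 | (v & 0x3FF)       # lower 10 bits
    return [high, low]

def classify_unit(unit: int) -> str:
    """Say which of the three disjoint ranges a 16-bit value falls in."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"

units = utf16_units(0x1D11E)
print([f"{u:04X}" for u in units])        # ['D834', 'DD1E']
print([classify_unit(u) for u in units])
```

Because the three ranges never overlap, a decoder that lands in the middle of a stream can tell at once whether it is looking at a whole character or at one half of a pair.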

38 Storage formats: UTF-8
  For ASCII characters, the 21-bit value is truncated to 8 bits.
  For other characters, the 21-bit value is turned into a sequence of two, three, or four 8-bit values.
  Different numeric ranges are used for ASCII characters and leading and trailing bytes.
  Different ranges are used for leading bytes of different-length sequences.
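A hand-written encoder shows where those byte ranges come from. This is a sketch for illustration, with `utf8_bytes` as a hypothetical name; in practice Python's `str.encode("utf-8")` does the same job.

```python
def utf8_bytes(cp: int) -> bytes:
    """Encode one code point as UTF-8."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                      # ASCII: single byte 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # lead byte 110xxxxx + 1 trailing byte
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # lead byte 1110xxxx + 2 trailing bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # lead byte 11110xxx + 3 trailing bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0x0628, 0x1781, 0x1D11E):
    print(f"U+{cp:04X}", utf8_bytes(cp).hex(" "))
```

Trailing bytes always have the form 10xxxxxx, so they can never be mistaken for an ASCII byte or for the first byte of a sequence.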

39 Detecting Unicode Storage Format
  If the file starts with      The file contains
  0xFE 0xFF                    UTF-16
  0xFF 0xFE                    Byte-swapped UTF-16
  0x00 0x00 0xFE 0xFF          UTF-32
  0xFF 0xFE 0x00 0x00          Byte-swapped UTF-32
  0xEF 0xBB 0xBF               UTF-8
  0xDD 0x73 0x73 0x73          UTF-EBCDIC
  0x0E 0xFE 0xFF               SCSU
  Anything else                Non-Unicode or untagged Unicode
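The lookup above can be sketched as a prefix match. The labels follow the slide's wording, where "byte-swapped" corresponds to little-endian; the function name `detect_format` is an assumption, and only the UTF-8, UTF-16, UTF-32 and UTF-EBCDIC signatures are covered. The UTF-32 signatures are tested first because byte-swapped UTF-32 begins with the same two bytes as byte-swapped UTF-16.

```python
import codecs

# Longest signatures first, so 0xFF 0xFE 0x00 0x00 is not mistaken for UTF-16.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32"),               # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "Byte-swapped UTF-32"),  # FF FE 00 00
    (b"\xDD\x73\x73\x73", "UTF-EBCDIC"),
    (codecs.BOM_UTF8, "UTF-8"),                    # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16"),               # FE FF
    (codecs.BOM_UTF16_LE, "Byte-swapped UTF-16"),  # FF FE
]

def detect_format(head: bytes) -> str:
    """Guess the storage format from the first few bytes of a file."""
    for signature, label in SIGNATURES:
        if head.startswith(signature):
            return label
    return "Non-Unicode or untagged Unicode"

print(detect_format("text".encode("utf-8-sig")))  # UTF-8
print(detect_format("text".encode("utf-16")))     # depends on platform byte order
print(detect_format(b"plain ASCII"))              # Non-Unicode or untagged Unicode
```

A byte-swapped UTF-16 file whose first character is U+0000 would be reported as byte-swapped UTF-32; signature matching alone cannot resolve that case.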

40 The Unicode standard
  The Unicode standard consists of:
    The standard text, published in book form (this includes a complete set of printed code charts)
    The Unicode Character Database, a set of data files providing complete property information on every character
    Various Web-published supplemental materials:
      Unicode Standard Annexes (UAX): Amendments to the standard since the last book was published
      Unicode Technical Standards (UTS): Allied standards maintained separately from Unicode itself
      Unicode Technical Reports (UTR): Non-normative documents providing background info, implementation hints, or other useful information
      Unicode Technical Notes (UTN): Other articles of

41 References
  The Unicode Standard, Version 4.0, by the Unicode Consortium
  Unicode Demystified, by Richard Gillam
  Unicode Character Database
  Unicode Charts

42 Questions?


More information

Pemrograman Dasar. Basic Elements Of Java

Pemrograman Dasar. Basic Elements Of Java Pemrograman Dasar Basic Elements Of Java Compiling and Running a Java Application 2 Portable Java Application 3 Java Platform Platform: hardware or software environment in which a program runs. Oracle

More information

Survey of University of Michigan Graduate-level Area Studies Alumni/ae & FLAS Recipients from 1996-2006: Selected Findings

Survey of University of Michigan Graduate-level Area Studies Alumni/ae & FLAS Recipients from 1996-2006: Selected Findings Survey of University of Michigan Graduate-level Area Studies Alumni/ae & FLAS Recipients from 1996-2006: Selected Findings Azumi Ann Takata, Center for Japanese Studies, International Institute Donna Parmelee,

More information

Lao Character Set. Abstract. 1. Definition

Lao Character Set. Abstract. 1. Definition Lao Character Set Phonpasit Phissamay and Nadir Durrani Science Technology and Environment Agency, Center for Research in Urdu Language Processing phonpasit@stea.gov.la, nadirdurrani@yahoo.com Abstract

More information

Localization & Internationalization Testing

Localization & Internationalization Testing Localization & Internationalization Testing Shanthi.AL:Shanthi.Alagappan@cognizant.com ------------------------------------------------------------------------------- Cognizant Technology Solutions India

More information

Adobe FrameMaker 7.0

Adobe FrameMaker 7.0 Adobe FrameMaker 7.0 ii Contents Character sets and encoding methods.......................... 1 Inline input.....................................................1 Typesetting rules...............................................

More information

HP ProtectTools password guidelines

HP ProtectTools password guidelines HP ProtectTools password guidelines Table of contents Introduction... 2 Overview of HP ProtectTools Security Manager... 2 Supported keyboard layouts in Preboot Security and Drive Encryption... 3 HP ProtectTools

More information

Using International Languages in Microsoft Windows and Microsoft Word

Using International Languages in Microsoft Windows and Microsoft Word Using International Languages in Microsoft Windows and Microsoft Word In order to use the international language options, you must first add the respective language to your computer. (See the Adding International

More information

I. FOR STUDENTS WHO WANT TO CONTINUE A FOREIGN LANGUAGE:

I. FOR STUDENTS WHO WANT TO CONTINUE A FOREIGN LANGUAGE: R e c o m m e n d e d C o u r s e s f o r T H H S B r i d g e Y e a r S t u d e n t s The following is a list of Fall 2016 Queens College courses which are recommended for Townsend Harris seniors. For

More information

Activity 1: Bits and Bytes

Activity 1: Bits and Bytes ICS3U (Java): Introduction to Computer Science, Grade 11, University Preparation Activity 1: Bits and Bytes The Binary Number System Computers use electrical circuits that include many transistors and

More information

Advanced Topics: Unicode and XSL

Advanced Topics: Unicode and XSL Advanced Topics: Unicode and XSL SET09103 Advanced Web Technologies School of Computing Napier University, Edinburgh, UK Module Leader: Uta Priss 2008 Copyright Napier University Advanced Topics: Unicode

More information

Internationalization & Localization

Internationalization & Localization Internationalization & Localization Of OpenOffice.org - The Indian Perspective Comprehensive Office Suite for Multilingual Indic Computing Bhupesh Koli, Shikha G Pillai

More information

San José, February 16, 2001

San José, February 16, 2001 San José, February 16, 2001 Feel free to distribute this text (version 1.4) including the author s e-mail address (mailto:dmeyer@adobe.com) and to contact him for corrections and additions. Please do not

More information

Bachelor of Science Degree Requirements for Students Fulfilling the REVISED General Education Curriculum

Bachelor of Science Degree Requirements for Students Fulfilling the REVISED General Education Curriculum General College Requirements Bachelor of Science Degree Requirements for Students Fulfilling the REVISED General Education Curriculum Note: see the Middle Childhood Education (MCE) curriculum sheet for

More information

Who s Up Next: Succession Planning and Implementation

Who s Up Next: Succession Planning and Implementation Who s Up Next: Succession Planning and Implementation Ken Jeffers, Manager, Access and Diversity Rosa Jones Imhotep, Operations Support Officer, Access and Diversity Unit Toronto Parks, Forestry and Recreation

More information

Binary Representation. Number Systems. Base 10, Base 2, Base 16. Positional Notation. Conversion of Any Base to Decimal.

Binary Representation. Number Systems. Base 10, Base 2, Base 16. Positional Notation. Conversion of Any Base to Decimal. Binary Representation The basis of all digital data is binary representation. Binary - means two 1, 0 True, False Hot, Cold On, Off We must be able to handle more than just values for real world problems

More information

Designing Global Applications: Requirements and Challenges

Designing Global Applications: Requirements and Challenges Designing Global Applications: Requirements and Challenges Sourav Mazumder Abstract This paper explores various business drivers for globalization and examines the nature of globalization requirements

More information

1.3 Data Representation

1.3 Data Representation 8628-28 r4 vs.fm Page 9 Thursday, January 2, 2 2:4 PM.3 Data Representation 9 appears at Level 3, uses short mnemonics such as ADD, SUB, and MOV, which are easily translated to the ISA level. Assembly

More information

Encoding Systems: Combining Bits to form Bytes

Encoding Systems: Combining Bits to form Bytes Encoding Systems: Combining Bits to form Bytes Alphanumeric characters are represented in computer storage by combining strings of bits to form unique bit configuration for each character, also called

More information

.ASIA CJK (Chinese Japanese Korean) IDN Policies

.ASIA CJK (Chinese Japanese Korean) IDN Policies Date: Status: Version: 1.1.ASIA IDN Policies 04-May-2011 COMPLETE Archive URL: References: http://dot.asia/policies/dotasia-cjk-idn-policies-complete--2011-05-04.pdf.asia ZH / JA / KO IDN Language Tables

More information

IDN: Challenges and Opportunities A registry s view of the multilingual web. Rome, March 2013!

IDN: Challenges and Opportunities A registry s view of the multilingual web. Rome, March 2013! IDN: Challenges and Opportunities A registry s view of the multilingual web " Rome, March 2013! Everything is about the end user! 2! Name! Deng Fu Xiang"! Occupation! Freelance photographer" " Age! 35

More information

WORKING DRAFT. ISO/IEC International Standard International Standard 10646. ISO/IEC 10646 1 st Edition + Amd1

WORKING DRAFT. ISO/IEC International Standard International Standard 10646. ISO/IEC 10646 1 st Edition + Amd1 ISO/IEC JC1/SC2/WG2 N2937 ISO/IEC International Standard International Standard 10646 ISO/IEC 10646 1 st Edition + Amd1 Information technology Universal Multiple-Octet Coded Character Set (UCS) Architecture

More information

Translation/interpreting Services in Nottingham

Translation/interpreting Services in Nottingham Translation/interpreting Services in Nottingham (This is not a conclusive list, check telephone directories and the internet for more. Nottingham CDP cannot be help responsible for the quality of work

More information

Number Systems and Data Representation CS221

Number Systems and Data Representation CS221 Number Systems and Data Representation CS221 Inside today s computers, data is represented as 1 s and 0 s. These 1 s and 0 s might be stored magnetically on a disk, or as a state in a transistor, core,

More information

INTERNATIONALIZATION FEATURES IN THE MICROSOFT.NET DEVELOPMENT PLATFORM AND WINDOWS 2000/XP

INTERNATIONALIZATION FEATURES IN THE MICROSOFT.NET DEVELOPMENT PLATFORM AND WINDOWS 2000/XP INTERNATIONALIZATION FEATURES IN THE MICROSOFT.NET DEVELOPMENT PLATFORM AND WINDOWS 2000/XP Dr. William A. Newman, Texas A&M International University, wnewman@tamiu.edu Mr. Syed S. Ghaznavi, Texas A&M

More information

Multilingual Ediscovery: Options, Obstacles and Opportunities Report

Multilingual Ediscovery: Options, Obstacles and Opportunities Report Multilingual Ediscovery: Options, Obstacles and Opportunities Report A guide to collecting, filtering, reviewing and producing multilingual documents in discovery. An Altegrity Company Copyright 2014 Kroll

More information