Perl & Multiple-byte Characters




Ken Lunde
CJKV Type Development
Adobe Systems Incorporated
ftp://ftp.ora.com/pub/examples/nutshell/ujip/perl/perl97.pdf

Multiple-byte Issues: Why Worry?
- Affects all CJKV (Chinese, Japanese, Korean, and Vietnamese) encodings, which use up to four bytes per character
  - EUC-CN, ISO-2022-CN, ISO-2022-CN-EXT, HZ, and GBK for Simplified Chinese (China)
  - EUC-TW, ISO-2022-CN, ISO-2022-CN-EXT, and Big Five for Traditional Chinese (Taiwan)
  - EUC-JP, ISO-2022-JP, and Shift-JIS for Japanese
  - EUC-KR, ISO-2022-KR, Johab, and UHC for Korean
  - EUC-VN and ISO-2022-VN for Vietnamese (not yet defined)
- Required in the context of Unicode
  - Unicode encoding is 16-bit fixed-length (UCS-2; UTF-16 adds surrogate pairs)
  - UTF-8 encoding is variable-length: one, two, or three bytes for 16-bit code points
- Remember: One byte does not always equal one character
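
The closing reminder can be checked directly. The following is a minimal sketch, assuming a Perl recent enough (5.8 or later, which postdates these slides) to provide the built-in utf8::decode function. The sample bytes are the UTF-8 form of U+4E2D.

```perl
use strict;
use warnings;

# Three bytes that encode a single character (U+4E2D) in UTF-8.
my $bytes = "\xE4\xB8\xAD";

# length() on the raw string counts bytes.
my $byte_count = length($bytes);

# utf8::decode() turns the byte string into a character string in place
# and returns false if the bytes are not valid UTF-8.
my $chars = $bytes;
utf8::decode($chars) or die "not valid UTF-8\n";
my $char_count = length($chars);

print "bytes: $byte_count, characters: $char_count\n";
```

Any code that steps through such data with substr() or length() on the raw bytes is counting bytes, not characters.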

Multiple-byte Concerns
- Code conversion
  - For converting data into a common or different encoding
  - Required for cross-platform development
- Data manipulation
  - Simple text processing, such as search/replace, required for product localization
  - Related to code conversion, but not a global operation
- Searching
  - Simple matching, also required for product localization
- Extensive use of regular expressions
  - The basis for performing multiple-byte tricks through proven and related techniques such as anchoring and trapping

Code Conversion Techniques
- Algorithmic
  - Mathematical, and applied equally to every character
- Table-driven
  - Requires a lookup table
  - Round-trip may be an issue with undefined code points
- Selective
  - Convert only certain characters or certain character classes
  - Can be algorithmic or table-driven (usually table-driven)
  - Example: Half- to full-width katakana (table-driven)
    ftp://ftp.ora.com/pub/examples/nutshell/ujip/perl/unkana.pl
    Three half-width characters can become two full-width characters
- Combination of above techniques

Algorithmic Techniques
- Pure algorithm
  - Mathematical transformations are applied equally to every character
  - Example: Unicode (UTF-16) <-> UTF-8 <-> UTF-7
  - Example: EUC-JP <-> ISO-2022-JP <-> Shift-JIS
- Normalization (identical sequence, incompatible encoding)
  - Character codes are normalized to become a continuous sequence beginning at zero
  - Example: Johab hangul <-> Unicode hangul
    김치 (0x8BB1 0xC3A1) <-> 김치 (0xAE40 0xCE58)
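
As an illustration of a pure algorithm, the sketch below encodes one 16-bit code point as UTF-8 using only shifts and masks. The function name ucs2_to_utf8 is hypothetical, and surrogate handling is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one 16-bit code point as UTF-8 bytes.
# Surrogates (0xD800-0xDFFF) are not handled in this sketch.
sub ucs2_to_utf8 {
    my ($cp) = @_;
    # One byte: 0xxxxxxx
    return pack("C", $cp) if $cp < 0x80;
    # Two bytes: 110xxxxx 10xxxxxx
    return pack("C2", 0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp < 0x800;
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return pack("C3", 0xE0 | ($cp >> 12),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

print uc(unpack("H*", ucs2_to_utf8(0x4E2D))), "\n";   # prints E4B8AD
```

No lookup table is involved: the same arithmetic applies to every code point, which is what distinguishes this class of conversion from the table-driven ones.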

Table-driven Techniques
- Used for conversion between incompatible encodings
  - Example: Unicode <-> other CJK encodings
  - Example: UHC hangul <-> Unicode hangul
  - Example: Big Five <-> EUC-TW (CNS 11643-1992)
  - Example: Half- to full-width katakana
- Hash
  - Simple hash-based lookup using the original (unmodified) character codes
- Zero-based table
  - Character codes are normalized to become a continuous sequence beginning at zero
  - Consider EUC-JP code set 1 (JIS X 0208:1997)

#!/usr/local/bin/perl -w
# Converting EUC-JP code set 1 to zero-based values

$ch = "\xb7\xf5"; # The kanji 剣 of JIS X 0208:1997

# Subtract 0xA1 (161) from the first byte, then multiply by 94
# Subtract 0xA1 (161) from the second byte
# Add the two values to obtain the zero-based value
$zeroch = ((ord(substr($ch,0,1)) - 0xA1) * 94) + (ord(substr($ch,1,1)) - 0xA1);
print "Zero-based value of $ch is $zeroch\n";
# $zeroch equals 2152 -- the 2,153rd character

# The following converts the zero-based value back to the original
# by reversing the effects of zero-based conversion
# (int() makes the integer division explicit)
$ch = chr(int($zeroch / 94) + 0xA1) . chr(($zeroch % 94) + 0xA1);
print "$ch\n"; # $ch again equals 0xB7F5 (剣)

Selective Code Conversion
- Acts as a filter applied to specific characters or character classes
  - Example: Simplified to traditional Chinese characters
  - Example: Half- to full-width katakana
- Usually table-driven, but can be algorithmic
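
A selective, table-driven filter can be sketched with a hash keyed on the original character codes. The three-entry table and the name han2zen below are illustrative assumptions, not the contents of unkana.pl; a real table would cover every half-width katakana.

```perl
use strict;
use warnings;

# Illustrative excerpt only. Keys and values are EUC-JP byte strings.
my %half2full = (
    "\x8e\xb1"         => "\xa5\xa2",   # half-width A  -> full-width A
    "\x8e\xb6"         => "\xa5\xab",   # half-width KA -> full-width KA
    "\x8e\xb6\x8e\xde" => "\xa5\xac",   # KA + voicing mark -> full-width GA
);

# Hypothetical helper: convert half-width katakana in an EUC-JP string.
# In EUC-JP the byte 0x8E only ever appears as the SS2 prefix, so no
# multiple-byte anchoring is needed for this particular pattern.
sub han2zen {
    my ($text) = @_;
    $text =~ s{
        ( \x8e[\xa1-\xdf] )        # one half-width character
        ( \x8e[\xde\xdf] )?        # optional voiced or semi-voiced mark
    }{
        my ($base, $mark) = ($1, defined $2 ? $2 : "");
        exists $half2full{$base . $mark} ? $half2full{$base . $mark}
      : exists $half2full{$base}         ? $half2full{$base} . $mark
      :                                    $base . $mark;
    }gex;
    return $text;
}

# Three half-width characters (KA, voicing mark, A) become two full-width ones.
print uc(unpack("H*", han2zen("\x8e\xb6\x8e\xde\x8e\xb1"))), "\n";   # A5ACA5A2
```

Characters outside the table pass through unchanged, which is what makes the conversion selective rather than global.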

Combination Code Conversion
- EUC-JP <-> ISO-2022-JP <-> Shift-JIS conversion may also involve half- to full-width katakana conversion as a method of filtering
  - JConv (ANSI C) forces half- to full-width katakana conversion when converting EUC-JP or Shift-JIS to ISO-2022-JP
    http://www.ora.com/people/authors/lunde/j_tools.html
- The following pages provide a complete set of working Japanese code converters
  - JIS X 0208:1997, ASCII/JIS-Roman, and half-width katakana support
  - Emphasis on readability rather than efficiency (there is more than one way to do it)

#!/usr/local/bin/perl -w
# ISO-2022-JP to EUC-JP

while (defined($line = <STDIN>)) {
  $line =~ s{                              # JIS X 0208:1997
    \e\$[\@B]                              # ESC $ plus @ or B
    ((?:[\x21-\x7e][\x21-\x7e])+)          # Two-byte characters
    \e\([BHJ]                              # ESC ( plus B, H, or J
  }{($x = $1) =~ tr/\x21-\x7e/\xa1-\xfe/,  # From 7- to 8-bit
    $x
  }egx;
  $line =~ s{                              # JIS X 0201-1997 half-width katakana
    \e\(I                                  # ESC ( I
    ([\x21-\x7e]+)                         # Half-width katakana
    \e\([BHJ]                              # ESC ( plus B, H, or J
  }{($x = $1) =~ tr/\x21-\x7e/\xa1-\xfe/,  # From 7- to 8-bit
    ($y = $x) =~ s/([\xa1-\xfe])/\x8e$1/g, # Prefix with SS2
    $y
  }egx;
  print STDOUT $line;
}

#!/usr/local/bin/perl -w
# EUC-JP to ISO-2022-JP

while (defined($line = <STDIN>)) {
  $line =~ s{                        # JIS X 0208:1997
    ((?:[\xa1-\xfe][\xa1-\xfe])+)
  }{\e\$B$1\e\(J}gx;
  $line =~ s{                        # JIS X 0201-1997 half-width katakana
    ((?:\x8e[\xa0-\xdf])+)           # Half-width katakana
  }{\e\(I$1\e\(J}gx;
  $line =~ s/\x8e//g;                # Remove SS2s
  $line =~ tr/\xa1-\xfe/\x21-\x7e/;  # From 8- to 7-bit
  print STDOUT $line;
}

# Some functions for Shift-JIS conversions
# ("C*" unpacks unsigned bytes, which the comparisons below rely on)

sub convert2sjis { # For EUC-JP and ISO-2022-JP to Shift-JIS
  my @euc = unpack("C*", $_[0]);
  my @out = ();
  while (($hi, $lo) = splice(@euc, 0, 2)) {
    $hi &= 0x7f; $lo &= 0x7f;
    push(@out, (($hi + 1) >> 1) + ($hi < 95 ? 112 : 176),
      $lo + (($hi & 1) ? ($lo > 95 ? 32 : 31) : 126));
  }
  return pack("C*", @out);
}

sub sjis2jis { # For Shift-JIS to ISO-2022-JP and EUC-JP
  my @ord = unpack("C*", $_[0]);
  for ($i = 0; $i < @ord; $i += 2) {
    $ord[$i] = (($ord[$i] - ($ord[$i] < 160 ? 112 : 176)) << 1) -
      ($ord[$i+1] < 159 ? 1 : 0);
    $ord[$i+1] -= ($ord[$i+1] < 159 ? ($ord[$i+1] > 127 ? 32 : 31) : 126);
  }
  return pack("C*", @ord);
}
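
To see the arithmetic at work on single characters, the sketch below applies the same formulas as the functions above to one byte pair at a time and checks that the conversion round-trips. The helper names jis2sjis_pair and sjis2jis_pair are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical single-character helpers; each takes and returns
# a pair of byte values (first byte, second byte).
sub jis2sjis_pair {
    my ($hi, $lo) = @_;
    my $s_hi = (($hi + 1) >> 1) + ($hi < 95 ? 112 : 176);
    my $s_lo = $lo + (($hi & 1) ? ($lo > 95 ? 32 : 31) : 126);
    return ($s_hi, $s_lo);
}

sub sjis2jis_pair {
    my ($hi, $lo) = @_;
    my $j_hi = (($hi - ($hi < 160 ? 112 : 176)) << 1) - ($lo < 159 ? 1 : 0);
    my $j_lo = $lo - ($lo < 159 ? ($lo > 127 ? 32 : 31) : 126);
    return ($j_hi, $j_lo);
}

# One odd-row and one even-row JIS code, converted and converted back.
for my $jis ([0x37, 0x75], [0x30, 0x21]) {
    my @sjis = jis2sjis_pair(@$jis);
    my @back = sjis2jis_pair(@sjis);
    printf "JIS %02X%02X -> Shift-JIS %02X%02X -> JIS %02X%02X\n",
        @$jis, @sjis, @back;
}
```

Two JIS rows share one Shift-JIS first byte, which is why the second byte's offset depends on whether the row number is odd or even.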

#!/usr/local/bin/perl -w
# ISO-2022-JP or EUC-JP to Shift-JIS
# (uses convert2sjis from the Shift-JIS functions above)

while (defined($line = <STDIN>)) {
  $line =~ s{(                       # EUC-JP
    (?:[\xa1-\xfe][\xa1-\xfe])+ |    # JIS X 0208:1997
    (?:\x8e[\xa0-\xdf])+             # Half-width katakana
  )}{substr($1,0,1) eq "\x8e" ? (($x = $1) =~ s/\x8e//g, $x) :
    &convert2sjis($1)}egx;
  $line =~ s{                        # Handle ISO-2022-JP
    \e\$[\@B]
    ((?:[\x21-\x7e][\x21-\x7e])+)
    \e\([BHJ]
  }{&convert2sjis($1)}egx;
  $line =~ s{                        # Handle ISO-2022-JP half-width katakana
    \e\(I
    ([\x20-\x5f]+)
    \e\([BHJ]
  }{($x = $1) =~ tr/\x20-\x5f/\xa0-\xdf/, $x}egx;
  print STDOUT $line;
}

#!/usr/local/bin/perl -w
# Shift-JIS to ISO-2022-JP
# (uses sjis2jis from the Shift-JIS functions above)

while (defined($line = <STDIN>)) {
  $line =~ s{(    # JIS X 0208:1997 and half-width katakana
    (?:[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc])+ |
    [\xa0-\xdf]+
  )}{
    ($x = $1) !~ /^[\xa0-\xdf]/ ?
      "\e\$B" . &sjis2jis($x) . "\e\(J" :
      "\e\(I" . (($y = $x) =~ tr/\xa0-\xdf/\x20-\x5f/, $y) . "\e\(J"
  }egx;
  print STDOUT $line;
}

#!/usr/local/bin/perl -w
# Shift-JIS to EUC-JP
# (uses sjis2jis from the Shift-JIS functions above)

while (defined($line = <STDIN>)) {
  $line =~ s{(    # JIS X 0208:1997 and half-width katakana
    (?:[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc])+ |
    [\xa0-\xdf]+
  )}{
    ($x = $1) !~ /^[\xa0-\xdf]/ ?
      (($y = &sjis2jis($x)) =~ tr/\x21-\x7e/\xa1-\xfe/, $y) :
      (($y = $x) =~ s/([\xa0-\xdf])/\x8e$1/g, $y)
  }egx;
  print STDOUT $line;
}

Code Conversion Pitfalls
- No round-trip mapping for most table-driven conversions
  - Consider Shift-JIS/EUC-JP to Unicode (UTF-16) conversion
  - Of the 8,836 code points available in Shift-JIS and EUC-JP code set 1 (JIS X 0208:1997), only 6,879 are assigned and have mappings to Unicode; the rest are converted into the same Unicode code point
- Some encodings have regions that are not compatible with other related encodings
  - Example: The Shift-JIS user-defined range (0xF040-0xFCFC) is not encoded in EUC-JP
  - Example: EUC-JP code set 3 (JIS X 0212-1990) is not encoded in Shift-JIS
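
The round-trip problem can be made concrete with a toy lookup table. In the sketch below, the two-entry table, the function name euc_to_unicode, and the choice of U+FFFD as the fallback are all assumptions for illustration; real converters differ in which fallback value they use.

```perl
use strict;
use warnings;

# Hypothetical two-entry excerpt of an EUC-JP -> Unicode table; a real
# table has one entry per assigned code point.
my %euc2uni = (
    "\xb7\xf5" => 0x5263,
    "\xc3\xe6" => 0x4E2D,
);

# Every unassigned code point falls back to the same value.
sub euc_to_unicode {
    my ($char) = @_;
    return exists $euc2uni{$char} ? $euc2uni{$char} : 0xFFFD;
}

# 0xA9A1 and 0xA9A2 lie in an unassigned row of JIS X 0208:1997.
printf "%s -> U+%04X\n", uc(unpack("H*", $_)), euc_to_unicode($_)
    for "\xb7\xf5", "\xa9\xa1", "\xa9\xa2";
```

Once two distinct inputs share one output, no reverse table can restore the original bytes.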

Regex Techniques
- Multiple-byte anchoring
  - Necessary for successful and correct matching of multiple-byte characters
- Trapping all characters
  - Otherwise, matching may occur across character boundaries
  - Necessary for selective conversion or data manipulation
- You must specify the complete encoding range in order to successfully trap or anchor multiple-byte text
- Convenient to store the complete encoding specification in a variable for trapping and anchoring, such as $encoding
- Don't forget to apply the /o and /x regex modifiers when using the free-formatted encoding specifications that follow!

$encoding = qq<[\x00-\xff][\x00-\xff]>; # UCS-2

$encoding = qq< # EUC-JP
  [\x00-\x8d\x90-\xa0\xff]                # Code set 0 & one-byte
  | \x8e[\xa0-\xdf]                       # Code set 2
  | \x8f[\xa1-\xfe][\xa1-\xfe]            # Code set 3
  | [\xa1-\xfe][\xa1-\xfe]                # Code set 1
>;

$encoding = qq< # EUC-TW
  [\x00-\x8d\x8f-\xa0\xff]                # Code set 0 & one-byte
  | \x8e[\xa1-\xb0][\xa1-\xfe][\xa1-\xfe] # Code set 2
  | [\xa1-\xfe][\xa1-\xfe]                # Code set 1
>;

$encoding = qq< # EUC-KR and EUC-CN
  [\x00-\xa0\xff]                         # Code set 0 & one-byte
  | [\xa1-\xfe][\xa1-\xfe]                # Code set 1
>;

$encoding = qq< # GBK
  [\x00-\x80\xff]                         # One-byte
  | [\x81-\xfe][\x40-\x7e\x80-\xfe]       # GBK
>;

#!/usr/local/bin/perl -w
# Multiple-byte anchoring when matching Shift-JIS encoded text

$search = "\x8c\x95";                            # 剣
$text1 = "Text 1 \x90\x56\x8c\x95\x93\xb9";      # 新剣道
$text2 = "Text 2 \x94\x92\x8c\x8c\x95\x61";      # 白血病

$encoding = qq< # Shift-JIS encoding
  [\x00-\x80\xfd-\xff]                           # ASCII and other one-byte
  | [\xa0-\xdf]                                  # Half-width katakana
  | [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]     # Two-byte range
>;

# Without anchoring, $search also matches the second byte of one
# character plus the first byte of the next in $text2
print "First attempt -- no anchoring\n";
print " Matched Text1\n" if $text1 =~ /$search/o;
print " Matched Text2\n" if $text2 =~ /$search/o;

# With anchoring, only $text1 matches
print "Second attempt -- anchoring\n";
print " Matched Text1\n" if $text1 =~ /^(?:$encoding)*?$search/ox;
print " Matched Text2\n" if $text2 =~ /^(?:$encoding)*?$search/ox;

Regex Techniques (Cont'd)
- Create an array that has elements with varying numbers of bytes (this implements trapping)
  - Most useful for variable-length encodings
- Process text by iterating over the resulting array using operators such as foreach
- Using length() then tells you how long each character is

#!/usr/local/bin/perl -w
# This program shows how to break up multiple-byte text into
# separate array elements; it prints every character, one per
# line; two-byte characters as hexadecimal with "0x" prefix

$encoding = qq< # Shift-JIS encoding
  [\x00-\x80\xfd-\xff]                           # ASCII and other one-byte
  | [\xa0-\xdf]                                  # Half-width katakana
  | [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]     # Two-byte range
>;

while (defined($line = <STDIN>)) {
  @enc = $line =~ /($encoding)/gox;    # One character per element
  foreach $element (@enc) {
    if (length($element) == 2) {       # If two-byte character
      # "H*" yields the hex digits high nybble first
      print STDOUT "0x" . uc(unpack("H*", $element)) . "\n";
    } else {                           # All others are one-byte characters
      print STDOUT "$element\n";
    }
  }
}
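
Trapping also gives a safe way to do the search/replace mentioned under data manipulation. The sketch below reuses the Shift-JIS pattern from the program above; the helper name sjis_replace_char is hypothetical, and the code assumes well-formed input (bytes that do not form a complete character are silently dropped).

```perl
use strict;
use warnings;

# Shift-JIS character pattern, precompiled with qr//.
my $sjis_char = qr{
    [\x00-\x80\xfd-\xff]                        # ASCII and other one-byte
  | [\xa0-\xdf]                                 # Half-width katakana
  | [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]    # Two-byte range
}x;

# Hypothetical helper: replace every occurrence of the character $from
# with $to, comparing whole characters rather than raw bytes.
sub sjis_replace_char {
    my ($text, $from, $to) = @_;
    my @chars = $text =~ /($sjis_char)/g;       # One character per element
    return join "", map { $_ eq $from ? $to : $_ } @chars;
}

# 0x8C95 occurs as a real character in the first string only; in the
# second, those two bytes straddle a character boundary and are left alone.
for my $text ("\x90\x56\x8c\x95\x93\xb9", "\x94\x92\x8c\x8c\x95\x61") {
    print uc(unpack("H*", sjis_replace_char($text, "\x8c\x95", "XX"))), "\n";
}
```

Because the comparison happens per array element, a match can never begin in the middle of a two-byte character.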

Regular Expression Pitfalls
- \W ("not a word") versus \w ("a word") in cross-platform environments: you may get more than you bargained for
  - The standard definition of \w: [0-9A-Z_a-z] (or [\x30-\x39\x41-\x5a\x5f\x61-\x7a])
  - But MacPerl's definition of \w adds the following: [\x80-\x9f\xae\xaf\xbe\xbf\xcb-\xcf\xd8\xd9\xe5-\xef\xf1-\xf5]
- When encoding ranges are needed, use explicit ones
- Specify the entire encoding range, including unused code points
  - All code points from 0x00 through 0xFF must either represent themselves (that is, be one-byte characters), or else must be a valid first byte of a multiple-byte character
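
One way to follow the advice about explicit ranges is to spell out the word class once and reuse it. This is a small sketch; the variable name $ascii_word is an assumption.

```perl
use strict;
use warnings;

# The platform-independent "word" class, spelled out explicitly.
my $ascii_word = qr/[0-9A-Z_a-z]/;

# Two ASCII word characters and two bytes from the 8-bit range.
for my $byte ("a", "_", "\x8c", "\xe5") {
    printf "0x%02X %s\n", ord($byte),
        $byte =~ /^$ascii_word$/ ? "word" : "not a word";
}
```

With the class written out, the bytes 0x8C and 0xE5 are never treated as word characters, whatever the platform's own definition of \w happens to include.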

Other Useful Techniques
- Multiple-byte data is often obscured (damaged?) by Base64, URL, or quoted-printable encoding
  - HTML forms
  - E-mail
- Use the MIME::Base64 module for Base64 decoding
- To decode a URL-encoded string:
    $string =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
- To decode a quoted-printable string (remove the soft line break first, so that a decoded "=" at the end of a line is not mistaken for one):
    $string =~ s/=[\x0a\x0d]+$//;
    $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/pack("C",hex($1))/ge;
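
A short sketch of both decodings applied to the same three bytes (the UTF-8 form of U+4E2D), assuming the MIME::Base64 module and its decode_base64 function are installed, as they are with standard Perl distributions:

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Base64: "5Lit" is the Base64 form of the bytes E4 B8 AD.
my $from_base64 = decode_base64("5Lit");

# URL encoding: the same three bytes written as %XX escapes.
(my $from_url = "%E4%B8%AD") =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;

print uc(unpack("H*", $from_base64)), "\n";   # prints E4B8AD
print uc(unpack("H*", $from_url)), "\n";      # prints E4B8AD
```

Only after this step do the bytes form valid multiple-byte characters again, so decoding has to happen before any anchoring or trapping.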

Advantages of Unicode
- A single encoding: consider the possible encodings for the Chinese character 中 0x4E2D ("center" or "middle")
  - Simplified Chinese = 0x5650 or 0xD6D0
  - Traditional Chinese = 0x4463, 0xC4E3, 0x8EA1C4E3, or 0xA4A4
  - Japanese = 0x4366, 0xC3E6, or 0x9286
  - Korean = 0x7169, 0xF1E9, or 0xF3E9
  - Vietnamese = 0x4A36 or 0xCAB6
- All Unicode characters with the exception of the surrogates are 16-bit (two bytes)
  - All characters are given equal treatment
  - No more need to deal with multiple-byte data
- Consider Java's Unicode-based model

Advantages of Unicode (Cont'd)
- Java Version 1.1's code conversion model
  - Table-driven conversion for CJKV characters
  - Non-Unicode encodings treated as raw data (also called "byte sequences" or "byte arrays") that must be imported
  - Unicode data are considered type char
  - Shift-JIS <-> EUC-JP becomes Shift-JIS <-> Unicode <-> EUC-JP
- String objects can be explicitly converted to/from Unicode
- Input and output streams can be converted on-the-fly to/from Unicode
- Table-driven code conversion can lose data, but only undefined code points, mind you
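
The same pivot model can be sketched in Perl. This example assumes the Encode module, which ships with Perl 5.8 and later and therefore postdates these slides; it is not the approach used by the converters shown earlier.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Shift-JIS bytes for the character U+4E2D (0x9286, as listed earlier).
my $sjis = "\x92\x86";

# Step 1: import the raw bytes into Perl's internal Unicode string form.
my $unicode = decode("shiftjis", $sjis);

# Step 2: export the Unicode string as EUC-JP bytes.
my $euc = encode("euc-jp", $unicode);

printf "U+%04X -> EUC-JP %s\n", ord($unicode), uc(unpack("H*", $euc));
```

Each encoding then needs only one table to and from Unicode, rather than one table for every pair of encodings.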

Other Useful Information
- jcode.pl library by Kazumasa Utashiro (utashiro@iij.ad.jp)
  ftp://ftp.iij.ad.jp/pub/iij/dist/utashiro/perl/
- Unicode module by Gisle Aas (aas@sn.no)
  http://www.perl.com/cpan/authors/gisle_aas/
  - Supports UTF-16 (Unicode) <-> UTF-8 algorithmic conversion
- Useful multiple-byte capable Perl programs
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/perl/
- JPerl by Hirofumi Watanabe (watanabe@ase.ptg.sony.co.jp)
  http://www.perl.com/cpan/authors/hirofumi_watanabe/
  - Japanese version of Perl with Japanese-enhanced regular expressions and other Japanese-specific enhancements
  - Supports EUC-JP and Shift-JIS encodings
  - Beware of portability issues