UTF-16 and C/C++ language. TANAKA Keishiro EBLE Markus



Similar documents
The Inevitable Unicode Project

Product Internationalization of a Document Management System

Unicode Security. Software Vulnerability Testing Guide. July 2009 Casaba Security, LLC

The Unicode Standard Version 8.0 Core Specification

1/20/2016 INTRODUCTION

Unicode Support in Enterprise COBOL. Nick Tindall Stephen Miller Sam Horiguchi August 13, 2003

The programming language C. sws1 1

FileMaker 13. ODBC and JDBC Guide

Unicode Enabling Java Web Applications

FileMaker 14. ODBC and JDBC Guide

Chapter 4: Computer Codes

How to represent characters?

Internationalizing the Domain Name System. Šimon Hochla, Anisa Azis, Fara Nabilla

The use of binary codes to represent characters

Performance Tuning Guidelines for PowerExchange for Microsoft Dynamics CRM

HP Service Manager Compatibility Matrix

Number Representation

BC450 ABAP Performance: Analysis and Optimization

Preservation Handbook

Hypertable Architecture Overview

Compiler I: Syntax Analysis Human Thought

Greenplum Database 4.0 Connectivity Tools for Windows

How To Write Portable Programs In C

Bypassing Web Application Firewalls (WAFs) Ing. Pavol Lupták, CISSP, CEH Lead Security Consultant


Semantic Stored Procedures Programming Environment and performance analysis

Instruction Set Architecture (ISA)

Full and Para Virtualization

A Case Study of ebay UTF-8 Database Migration

Session ID: SPC251 Unicode Interfaces Data Exchange Between Unicode and non-unicode Systems

Active Directory Integration with Blue Coat

AQA GCSE in Computer Science Computer Science Microsoft IT Academy Mapping

Migrating Non-Oracle Databases and their Applications to Oracle Database 12c O R A C L E W H I T E P A P E R D E C E M B E R

HTML Web Page That Shows Its Own Source Code

Client-server Sockets

The Advantages of Block-Based Protocol Analysis for Security Testing

Object Oriented Database Management System for Decision Support System.

Intershop 7 System Requirements Sheet

Memory Systems. Static Random Access Memory (SRAM) Cell

Unraveling Unicode: A Bag of Tricks for Bug Hunting

4D Plugin SDK v11. Another minor change, real values on 10 bytes is no longer supported.

Microsoft Networks. SMB File Sharing Protocol Extensions. Document Version 3.4

JavaCard. Java Card - old vs new

This guide specifies the required and supported system elements for the application.

NiceLabel Automation Version 1.5 Release Notes. Rev-1602

encoding compression encryption

Backup & Restore with SAP BPC (MS SQL 2005)

Limi Kalita / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (3), 2014, Socket Programming

CPS221 Lecture: Operating System Structure; Virtual Machines

Internationalization of Domain Names

Mobile Operating Systems. Week I

How To Manage An Sap Solution

Java Interview Questions and Answers

Programming Against Hybrid Databases with Java Handling SQL and NoSQL. Brian Hughes IBM

Jet Data Manager 2012 User Guide

The New IoT Standard: Any App for Any Device Using Any Data Format. Mike Weiner Product Manager, Omega DevCloud KORE Telematics

Section 1.4 Place Value Systems of Numeration in Other Bases

HR Data Retrieval in a LDAP- Enabled Directory Service

Batch Loading Bibliographic Records for Electronic Resources revised 5/30/2013. Summary of Steps to Load a Batch of E-Resources Bib Records

SAP Banking Technology. Technical Overview Roland Keller Solution Architect SAP NetWeaver Technology. Layer. SAP Application. (e.g.

[MS-EVEN]: EventLog Remoting Protocol. Intellectual Property Rights Notice for Open Specifications Documentation

Multi-lingual Label Printing with Unicode

NASSI-SCHNEIDERMAN DIAGRAM IN HTML BASED ON AML

Raima Database Manager Version 14.0 In-memory Database Engine

XML. CIS-3152, Spring 2013 Peter C. Chapin

Compilers. Introduction to Compilers. Lecture 1. Spring term. Mick O Donnell: michael.odonnell@uam.es Alfonso Ortega: alfonso.ortega@uam.

Byte Ordering of Multibyte Data Items

How To Design A Layered Network In A Computer Network

D. Best Practices D.1. Assurance The 5 th A

[MS-RDPESC]: Remote Desktop Protocol: Smart Card Virtual Channel Extension

X.500 and LDAP Page 1 of 8

Software Engineering

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World

FTP Client Engine Library for Visual dbase. Programmer's Manual

ODBC Driver User s Guide. Objectivity/SQL++ ODBC Driver User s Guide. Release 10.2

Pemrograman Dasar. Basic Elements Of Java

Distributed File Systems

C Programming. for Embedded Microcontrollers. Warwick A. Smith. Postbus 11. Elektor International Media BV. 6114ZG Susteren The Netherlands

SQL Server An Overview

Best of breed Multi Architecture Cloud

IBM Tivoli Composite Application Manager for Microsoft Applications: Microsoft Internet Information Services Agent Version Fix Pack 2.

Optional custom API wrapper. C/C++ program. M program

Chapter 1. Dr. Chris Irwin Davis Phone: (972) Office: ECSS CS-4337 Organization of Programming Languages

Remote System Monitor for Linux [RSML]: A Perspective

DNA Data and Program Representation. Alexandre David

Today s topics. Digital Computers. More on binary. Binary Digits (Bits)

ZABBIX. An Enterprise-Class Open Source Distributed Monitoring Solution. Takanori Suzuki MIRACLE LINUX CORPORATION October 22, 2009

Mobility Information Series

How To Write A Web Server In Javascript

Chapter 6, The Operating System Machine Level

Language Evaluation Criteria. Evaluation Criteria: Readability. Evaluation Criteria: Writability. ICOM 4036 Programming Languages

Designing Global Applications: Requirements and Challenges

CHAPTER 1: OPERATING SYSTEM FUNDAMENTALS

Java 7 Recipes. Freddy Guime. vk» (,\['«** g!p#« Carl Dea. Josh Juneau. John O'Conner

DataFlex Connectivity Kit For ODBC User's Guide. Version 2.2

Base Conversion written by Cathy Saxton

Server Virtualization and Consolidation

1 File Processing Systems

Transcription:

UTF-16 and C/C++ language TANAKA Keishiro EBLE Markus

Contents Why do we need UTF-16 in C/C++? How to support UTF-16 in C/C++ Practical experiences Question period

Why do we need UTF-16 in C/C++? Classic Client-Server Architecture Client AppServer Client Client AppServer Database Each client wants to use his native language and script

Why do we need UTF-16 in C/C++? Typical B2B Collaborative Systems Finance B2B Database Human Ressource Business Partner How to communicate between different companies?

Why do we need UTF-16 in C/C++? Ok, let s use Unicode. But which encoding shall we use? Considering: Integration with the existing Unicode products Migration of existing non-unicode products Performance and memory consumption

Why do we need UTF-16 in C/C++? Inter-process communication Communication across the border of an address space Maybe within one machine or cross machines Data representation may differ In-process communication Communication within one address space e.g. Function call into a shared library Data represention shareable

Why do we need UTF-16 in C/C++? Inter-process communication AppServer No endian problems UTF-8 Minimum average data size Limited communication with non-unicode systems possible AppServer

Why do we need UTF-16 in C/C++? Which encoding is good for in-process communication in C/C++ programs? In-process communication typically has much higher frequency than inter-process communication In-process communication has high performance requirements Time consuming data conversion should be avoided The same data representation should be shareable between several programming languages Encoding of text data should be defined exactly

Why do we need UTF-16 in C/C++? Structure of SAP application server User Interface ABAP (UTF-?) JavaScript (UTF-16) Business Logic ABAP (UTF-?) Technical Base SAP Kernel C/C++ (?) DB-Interface(UTF-8/UTF-16) Why didsap choose UTF-16?

Why do we need UTF-16 in C/C++? UTF-16 combines the worst of UTF-8 and UTF-32? Use UTF-32 to get rid of multibyte handling Use UTF-8 if memory consumption is important

Why do we need UTF-16 in C/C++? UTF-32 vs UTF-16 In UTF-32 one 32-bit unit encodes one character No multibyte handling required But handling for combining characters needed UTF-32 requires about twice the memory size for text data UTF-32 would cause about 70% more memory consumption for the whole SAP system

Why do we need UTF-16 in C/C++? UTF-8 vs UTF-16 UTF-8 has the minimum average character size Size is language dependent For fixed-size buffer 50% larger size is required to cope with worst case UTF-8 causes higher complexity of algorithms because up to 3 bytes look ahead is required Error handling more tricky

Why do we need UTF-16 in C/C++? Important Unicode products using UTF-16 Java: The main programming language for the Internet Microsoft Windows: The most important frontend platform Databases: UTF-16 API available from all vendors IBM ICU: Platform and programming language independent internationalisation library

How to support UTF-16 in C/C++ Text processing support in C/C++ Character and string literals Character data type C standard library functions C++ standard library classes

How to support UTF-16 in C/C++ String literal definition in C/C++ single byte string implementation-defined encoding 8 bits width on any systems character can consist of multibyte character seq As encoding, ASCII is used broadly L wide character string implementation-defined value and size character consist of fixed length value There is no de facto standard encoding From the pragmatic point of view, fixed bit width on any systems (16bits preferred) and well-known and broadly used encoding character and string literals are required

How to support UTF-16 in C/C++ u Unicode string literal Suggestion: u Unicode string literal e.g u"hello, Unicode" Multibyte characters are converted to 16 bits width code value sequence by C/C++ compiler \u escape sequence supports all planes e.g u A\u3230B is encoded as 0x0041 0x3230 0x0042

How to support UTF-16 in C/C++ Merits of u Unicode string literal Well defined encoding value On different systems, we can get same result. Less storage consumption than wchar_t By comparison 2 bytes u"literal" and 4bytes wchar_t, the half storage consumption. Locale independent behavior of application Application works without system dependent locale mechanism if you need. Similiar to other languages, especially to Java. Easy to modify the existing source program.

How to support UTF-16 in C/C++ C standard library for UTF-16 example strchr wchar_t* wcschr( const wchar_t *wcs, wint_t c) { } wchar_t wc = c; do { if (*wcs == wc) return (wchar_t *) wcs; } while (*wcs++!= L'\0') ); return NULL; utf16_t* strchru16( const utf16_t *ucs, uint_t c) { } utf16_t uc = c; do { if (*ucs == uc) return (utf16_t *) ucs; } while (*ucs++!= u'\0' ); return NULL;

How to support UTF-16 in C/C++ C++ standard library for UTF-16 char_traits define basic properties for character type ctype defines character properties numpunct, moneypunct, time_get, time_put define how to parse and print numerical, monetary and time values codecvt defines conversion between internal and external character format

Practical experiences Modification Fujitsu C/C++ compiler supporting u Unicode string literal About 1 month by 2 persons Extension of standard C library 2 students, 6 months Extension of standard C++ library About 3 months by 1 person

Practical experiences SAP kernel (Unicode and non-unicode) is developed with one single source tree. SAP supports the following systems - Unicode without wchar_t (UTF-16) - Unicode system with 16bits wchar_t (UTF-16) - non-unicode system (locale dependent)

Summary Introducing UTF-16 in C/C++ improves - memory consumption and performance - platform independence - integration with other programming languages Workload for implementation is moderate We suggest to add u Unicode string literal to the C/C++ standard or accept it as defacto standard spec