Gaiji: Characters, Glyphs, Both, or Neither?




A Graphics and Publishing Industry View

Jim DeLaHunt
Type Development Group, Adobe Systems Incorporated

Abstract

Unicode encodes Han characters by the tens of thousands, but fonts typically have only thousands of glyphs. Some fonts may have more glyphs, some may have fewer. And since the Han character repertoire is fundamentally open-ended, there will always be characters which are not encoded. The characters legal for the script, but not in your font, are known as gaiji. Writers and publishers insist on being able to use gaiji, so the Japanese publishing and computer industries have come up with a number of gaiji mechanisms. Looking from the viewpoint of a publishing software and font developer, we describe and evaluate a few of the most important gaiji mechanisms. Finally, we look at gaiji in terms of the Unicode character-glyph model. Are they glyph variants, or characters, or both, or neither?

Author

Jim DeLaHunt is an engineering manager at Adobe Systems, responsible for software related to Japanese font handling and to gaiji. He was introduced to the gaiji requirement when he first joined Adobe thirteen years ago, and still isn't satisfied with any gaiji mechanism he has found in the market.

Jim DeLaHunt, +1-408-536-2690, <delahunt@adobe.com>
Type Development Group, Adobe Systems Incorporated, M/S W-12, 345 Park Avenue, San Jose, CA 95110, USA
<http://partners.adobe.com/asn/developer/type/gaiji.html>

San Jose, California, September 2002

Overview
- What are gaiji? In publishing and graphics industry context
- Prepress workflow
- Existing mechanisms to support gaiji
- Evaluation of existing mechanisms
- Gaiji in Unicode character-glyph model
- Caveat: use of terms "character" and "glyph"

In this paper, we hope to cover five topics:

1. What are gaiji? Speaking from the point of view of a publishing and graphics tool supplier, we will explain what gaiji means in this context, and review the market requirement for gaiji support.

2. Prepress workflow. For the benefit of readers who aren't familiar with the steps required to go from characters to glyphs to printed pages, we review the prepress workflow. This is particularly important, because a goal of the publishing industry is to move work as far back up the workflow as possible. We will evaluate gaiji mechanisms in terms of this workflow.

3. Existing Mechanisms to Support Gaiji. We review six mechanisms presently used in the publishing industry for handling gaiji.

4. Evaluation of Existing Mechanisms. We evaluate each of the six mechanisms in terms of its strengths and weaknesses relative to the publishing industry's requirements.

5. Gaiji in the Unicode Character-Glyph Model. Unicode has an explicit character-glyph model. We look at what gaiji represent in terms of this model. In particular, we answer the question, "Are gaiji characters, glyphs, both, or neither?" We also look at a useful notion of abstract glyphs, and suggest a useful application of Unicode Variation Selectors in this context.

We will frequently refer to "printers" in this paper. This refers to the staff of a printing company, the people usually performing text layout and publishing activities involving gaiji.

A caveat about terminology: in the first four sections we attempt to describe the situation of the publishing industry in Japan today. This industry uses the terms "character" and "glyph" almost interchangeably. Indeed, the concepts aren't nearly as distinct in this industry as they are among experts familiar with the Unicode character-glyph model. So the alert reader will notice some muddling of these terms during the first four sections. This is inevitable if we are to accurately describe the industry. In the last section we attempt to be more precise with the terms "character" and "glyph" in Unicode terms. Where we do not want to take a position on whether a Japanese element is a character or a glyph, we often use the term "ideograph".

What Are Gaiji?
- Characters and/or glyphs that are legal, but not in font
- Historical variants, personal names
- 50,000 chars in dictionary, only 8,000-20,000 in fonts
- Kanji writing system fundamentally open-ended
[Slide figures: the personal-name character "oka" in its 1913 and 2002 forms; the Euro symbol, 1999; two rows of variants used in the name Watanabe]

So what are gaiji? The term gaiji is a Japanese word meaning "outside character". For the purpose of this paper, it is: any character or glyph which is valid in your written language, but is not in the font you are using. Gaiji are particularly prominent in the CJKV ideograph script, i.e. the script shared by the Chinese, Japanese, Korean, and Vietnamese languages.

In the slide above, there are two variants of a character "oka", used in Japanese personal names like Maruoka. The variant on the left is taken from a 1913 book, written by a person named Maruoka. The variant on the right is how that same character "oka" is written today. We have not discovered the 1913 variant in any existing font or character collection standard (though we haven't searched Unihan Extension B closely). It is a gaiji. Any publication that wanted to talk about the 1913 author Maruoka using the characters he used would have to find some way to reproduce the archaic character variant.

In fact, historical forms are a rich source of gaiji. Countries using the CJKV ideographic script periodically attempt to reform and simplify the writing system. Japan had a reform in the 1940s, and China in the 1950s. But the characters made obsolete by the reform are still important in a historical context, and so authors may wish to use them when writing about events or people or places of the pre-reform era.

Variants of a standard character are called itaiji in Japanese. There can be a lot of variants. At the top of the slide, you can also see two rows of characters that mostly look similar. These are 21 different characters used in the name Watanabe. While the first three variants appear in standards, individuals may use any of the other variants, depending on the tradition in their family, and there is demand in publishing to be able to print the correct variant. All these variants are present in the OpenType Pro font glyph complement recommended by Adobe to font vendors, and they are present because there is a demand for them.

Stepping back, the reference dictionaries for the Japanese language list about 50,000 characters. Standard personal computer fonts have only about 8,000 characters in them. The OpenType Pro font glyph complement numbers 15,000; there are other fonts that cover about 20,000 characters. Even a font as comprehensive as the latter still leaves 30,000 characters unrepresented. Any author wanting to use these characters would consider them gaiji.

But it's worse than that, because the CJKV ideographic script is fundamentally open-ended. This is due to the wide variations in ways to write characters. Adding or removing a stroke, or changing one set of strokes for another, can be significant either because it matches historical practice, or because it changes the meaning of the character.
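The coverage arithmetic above can be made explicit with a small sketch. The counts are the approximate figures quoted in the text; the function name `uncovered` and the assumption that every glyph in a font covers a distinct dictionary character are illustrative simplifications, not part of the original analysis.

```python
# Approximate figures quoted in the text.
DICTIONARY_KANJI = 50_000
FONT_SIZES = {
    "standard PC font": 8_000,
    "OpenType Pro complement": 15_000,
    "largest fonts": 20_000,
}


def uncovered(font_size: int, repertoire: int = DICTIONARY_KANJI) -> int:
    """Dictionary characters the font cannot supply.

    Optimistic assumption: each glyph in the font covers a distinct
    dictionary character, so this is a lower bound on potential gaiji.
    """
    return max(repertoire - font_size, 0)


for name, size in FONT_SIZES.items():
    share = size / DICTIONARY_KANJI
    print(f"{name}: {uncovered(size):,} potential gaiji ({share:.0%} coverage)")
```

Because the script is open-ended, the real gap is larger than any figure computed against a fixed dictionary.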

Which Jobs Require Gaiji?
- Most frequently encountered in:
- Literary, historical, academic writing
- Government records (person and place names). Think e-government!
- Technical specialties (e.g. chemistry)
- Dictionaries, works on CJK language
- Gaiji glyph designs match font
- Every printer gets jobs requiring gaiji, from time to time.

What print jobs require gaiji? Or to put it another way, what is the business case for gaiji? This discussion will focus on the needs of the Japanese print publishing market. The same issues apply to Chinese, Korean, and Vietnamese publishing, but the Japanese market tends to lead the technology because it is large, technologically advanced, and extremely demanding of quality results.

As mentioned before, archaic variants of kanji are a frequent source of gaiji. Traditional literature, literary criticism, and histories of the time before the last writing reform are all commercially viable publications that demand gaiji.

Government records in Japan are required to represent the exact characters for the personal names and birthplaces of people. As these records are computerised, the computer systems must be able to handle gaiji. Recently, several Asian governments have launched e-government initiatives that make this need more acute.

Technical specialties may have special characters for technical terms.

Finally, works like dictionaries and textbooks that describe the CJKV languages have a particular need to be able to represent the specific character variants they discuss. The reference dictionary of Japanese, which contains 50,000 characters, requires a publishing system able to print those 50,000 characters. A particular class of gaiji used in language commentary is intentionally incorrect characters, intended to illustrate a point. These are amusingly known as usoji, or "characters that are lies".

An interesting aside: in October 2001, the anthrax attacks in the US became a big news story in Japan. The Japanese term for anthrax spores, 炭疽菌 (tanso-kin), suddenly went from an esoteric biological term to a household word. The character 疽 (so) is a rare one. It is in standard computer fonts, but not in some television studio equipment, so studios had to use a hiragana alternative or insert it as a gaiji. Also, input methods didn't have the term tanso in the dictionary, forcing writers to use a roundabout method to enter the characters. Characters that seem esoteric can become critical in the right circumstances.

Quality demands increase the gaiji requirement in two ways. First, a demand for quality leads authors to insist on using the specific character variant they want, rather than settling for a standard character that is easier to enter and print. Second, authors insist that the character variant not only have the correct shape, but also match the design of the font used for the standard characters. This means that a printer may potentially need several different instances of a specific character variant, to match the design of several different fonts.

In a typical corpus of Japanese documents, gaiji are used very rarely. For almost any commercial publisher, the totality of all gaiji used, compared to the number of non-gaiji characters in a year's publications, will be negligible. But the commercially significant measure is that any printing business can expect that from time to time, some customer will come in with a job requiring gaiji. Therefore, as a practical matter, every printing company needs the capability of handling some amount of gaiji. Some firms will specialise in being especially capable with gaiji.

A Complete Publishing Solution in CJKV Requires a Gaiji Solution
- Can't escape need to support many glyphs
- Words and language are the core of what is published
- In practice, no font could include all glyphs in a CJKV language
- Dai Kanwa Jiten: approx. 50,000 kanji
- JIS X0208 + X0213: 11,223 characters
- People invent gaiji every day
- For font vendors, OpenType Pro fonts (15,000 glyphs) are the upper limit of sustainable size

Combine the linguistic reality of the open-ended ideograph script with the business requirement for printers to be able to print the exact character variants chosen by authors, and the result is that a complete, high-quality publishing system for Japanese, Chinese, Korean, and even Vietnamese needs to include gaiji support.

It is a stretch to say everything published is made up of words and text. Comic books, photo essays, and catalogues with many pictures form some portion of the printing market. But it is fair to say that the vast majority of the information published is published in the form of characters, text, and language. And to do justice to high-quality ideograph script text requires gaiji support.

One response to this reality is to have fonts cover more and more characters. This certainly addresses part of the gaiji requirement. Taking gaiji from 1% of all jobs to 0.1% of all jobs does mean that a printer needs to face the expense of handling gaiji less often. But this cannot be a complete solution.

First, fonts are far from covering the characters known today. A standard computer font has 8,000 characters, which is a fraction of the 50,000 kanji in the reference dictionary for the Japanese language. Even the extended glyph coverage of the new JIS X0213 character set standard, at 11,223 characters when combined with the existing JIS X0208 complement, is a fraction of the language. And the ideographic script remains open-ended; people can and do invent new kanji characters.

Furthermore, there is another problem. As character sets get larger, fonts get more and more expensive to produce, and more and more likely to contain errors. Font vendors are increasingly unable to recover their costs. Our sense is that OpenType Pro fonts, at 15,000 glyphs, are about as large a glyph complement (or character set) as font vendors are likely to be able to sustain in the Japanese market. Larger glyph complements will only be addressed by a few fonts from a few vendors. And even the OpenType Pro font complement is so large that it is hard to justify making a wide variety of unusual display typefaces.

So the market demands a way to handle gaiji, but ever-larger fonts are not the answer.

Prepress workflow
[Slide diagram: the workflow from character to glyph to plate to paper. Stages shown: Characters; Text with formatting markup; Character to glyph mapping; Glyph to outlines; Outlines, graphics, images to raster; Raster to paper/film (or to plate); Paste-up paper/film; Colour seps/imposition; Make printing plates; Print. Example forms at successive levels: plain text, Unicode, database content; word processing doc, HTML+CSS; PDF file, SVG; pre-DTP prepress house; plates ready to print; book or magazine. Other label: Fonts.]

For the benefit of readers who aren't familiar with the steps required to go from characters to glyphs to printed pages, we review the prepress workflow. This is particularly important, because a goal of the publishing industry is to move work as far back up the workflow as possible. We will evaluate gaiji mechanisms in terms of this workflow. This is a simplified and conceptual explanation.

The workflow starts with a file of plain text and needs to end up with text printed on the page. (The interesting topic of text entry is beyond our scope here.) Plain text is used in many contexts, including the fields of a database. It is the realm that Unicode is attempting to address.

As part of a layout process, the printer applies formatting to the text: selecting which font to use for the text, applying ligatures, flowing text into areas of the page. Text with formatting markup is roughly at the abstraction level of a word processing document, or of HTML formatted with Cascading Style Sheets (CSS).

Next, and crucially for Unicode's Character-Glyph Model, is character-to-glyph mapping. The font helps the text layout engine map character codes to glyph codes. Variant glyphs, ligatures, and so on are invoked at this stage. The layout engine makes line break and glyph positioning decisions. The result is a file of fully laid-out text, with all glyphs chosen and all glyph positions fixed. Font data may be embedded with the file to support the next stage. A Portable Document Format (PDF) or Scalable Vector Graphics (SVG) file is at roughly this level of abstraction.

The page is then rendered. The rendering system takes glyph codes, uses them to index the corresponding font data, finds the glyph outlines, and generates a raster bitmap form of the glyph appropriate for the printing device. At the same time, line graphics and images are also rendered into device-appropriate rasters. The printing device marks the output medium with a pattern of dots corresponding to the rasters. The resulting paper or film galleys represent the input of the prepress process, an important boundary. (The complexities of leading-edge processes which write directly to offset printing plates aren't relevant here.)

The prepress house combines paper or film in various ways. First is paste-up, literally pasting together pieces of paper containing typeset text and processed photographs. Multiple pages may be imposed together, or colour content may be separated for printing. The details are complex, and not important here. The important point to bear in mind is that it's possible to physically paste individual characters onto the paper containing the rest of the text. The end result is a set of offset plates, ready for a printing press.

Finally, the printer mounts the plates onto a printing machine and prints to paper, and then folds, cuts, and binds the paper into a finished printed piece.

In the context of gaiji, the important observation is how far down this workflow gaiji are handled. For a printer, getting the right glyph mark on the paper is the overriding concern. They will move as far down this workflow as necessary to image the gaiji glyphs. However, the higher the level they can stay at, the better. It makes for a less expensive workflow, and more options for repurposing content.
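The character-to-glyph mapping stage is where a gaiji first shows up as a problem: the character code is present in the text, but the font has no glyph for it. The sketch below illustrates this with a toy mapping table. The names `map_to_glyphs` and `toy_cmap` and the glyph IDs are hypothetical; the use of glyph ID 0 as the "missing glyph" follows the convention of several font formats and is an assumption here, not something the workflow description specifies.

```python
NOTDEF = 0  # assumed "missing glyph" ID, as in several font formats


def map_to_glyphs(text: str, cmap: dict[int, int]) -> tuple[list[int], list[str]]:
    """Map character codes to glyph IDs and report characters with no glyph.

    `cmap` is a toy character-code-to-glyph-ID table standing in for the
    mapping data a real font would supply to the layout engine.
    """
    glyph_ids: list[int] = []
    missing: list[str] = []
    for ch in text:
        gid = cmap.get(ord(ch), NOTDEF)
        glyph_ids.append(gid)
        if gid == NOTDEF:
            missing.append(ch)  # from this font's point of view, a gaiji
    return glyph_ids, missing


# Hypothetical font that lacks the rare character 疽 from tanso-kin.
toy_cmap = {ord("炭"): 101, ord("菌"): 102}
glyph_ids, gaiji = map_to_glyphs("炭疽菌", toy_cmap)
print(glyph_ids, gaiji)
```

Every mechanism reviewed below is, in effect, a different answer to what should happen for the entries in `gaiji`, and at which stage of the workflow.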

Existing Mechanisms for Gaiji
- Paste up hand-drawn glyphs
- Proprietary systems
- Desktop Publishing systems: gaiji one-byte fonts, Adobe Type Composer, Megafonts
- The Unicode standard

We will now review five mechanisms presently used in the publishing industry to support gaiji, plus the way in which the Unicode Standard contributes to gaiji support. We will evaluate what their strengths and weaknesses are relative to the publishing industry requirements for gaiji.

1. Paste-up of hand-drawn glyphs

2. The gaiji support of Proprietary Systems, special-purpose publishing systems that perform the bulk of gaiji-intensive print publishing in Japan today.

Within the realm of DTP, there are three mechanisms to study:

3. Gaiji one-byte fonts

4. Adobe Type Composer, a mechanism for combining fonts and using them together

5. Megafonts, which refers to a variety of similar approaches involving large fonts

Finally, we examine:

6. The Unicode Standard, and how it contributes to gaiji support

Note: outside of the publishing industry, there are other gaiji handling mechanisms. For instance, a Japanese consortium has developed the XKP architecture, which is aimed at supporting gaiji in databases based on Windows. In the interests of space, we limit ourselves here to the main publishing industry systems only.

Existing Mechanisms for Gaiji: Paste up hand-drawn glyphs
- Lay out text, leaving gaiji blank
- Hand-draw gaiji glyph, matching font
- Paste in glyph on text galleys
- Alternative, higher level: scan gaiji glyph, drop in text flow as illustration

In the Paste-up of Hand-Drawn Glyphs method, the idea is to lay out text with blank spaces where the gaiji glyphs should be, then combine that typeset text with an image of the necessary gaiji glyph at the paste-up stage. The gaiji glyph is hand-drawn by a skilled calligrapher, in a style that matches the design of the main font. The glyph is scanned in, scaled down to size, and pasted into the blank spaces in the text. In the mid-1990s I heard that a major Japanese newspaper was using exactly this method daily to print gaiji in its newspaper.

An alternative workflow is to convert the scanned image of the gaiji into a bitmap, and place it in the text flow as a graphic. This requires that the layout software be able to support graphics in the text flow. I also used a slightly more sophisticated variant on this technique to prepare this paper. I used US edition software with no specific gaiji support available. To include the gaiji, I converted them into illustrations and pasted in the illustrations.

Evaluating Existing Mechanisms: Paste up hand-drawn glyphs
Advantages
- Works with non-computer workflows
- No need for specialised gaiji equipment
Disadvantages
- Drawing glyphs requires special skills
- Labour-intensive, so expensive
- Cannot repurpose text

Pasting up hand-drawn glyphs is the baseline gaiji approach. It succeeds in all cases at getting an appropriate glyph onto the printed page. It is compatible with non-computerised workflows, so it has been used for years and is the traditional method. There is no need for specialised equipment for gaiji; it can be implemented using scanner and paste-up techniques that are already standard in the workflow.

However, this is an expensive, labour-intensive process. Drawing the glyphs, especially to match the font well, takes graphic design skills. The workflow needs careful manual attention to be sure that no gaiji is overlooked.

Looking forward, perhaps the biggest drawback is that the marked-up text is not repurposable using this approach. The modern publishing industry seeks to repurpose content between print, web, cell phones, PDAs, and other media. It is important that gaiji be included in that repurposing. Achieving that repurposing requires being able to drive content back up the publishing workflow to the character level, then reformat the document for the new target device.

Existing Mechanisms for Gaiji: Proprietary systems
- Buy gaiji glyphs from system vendor
- Install into cumulative gaiji font on RIP (thousands to tens of thousands of characters; site-specific character codes)
- Lay out text, using site-specific codes
- Gaiji displayed as geta placeholder
- Layout software generates RIP commands that use gaiji font in RIP

Proprietary System is the term for a dedicated hardware and software system produced by a single vendor for the specific purpose of doing text layout and other publishing and graphics tasks. This kind of system performs the bulk of gaiji-intensive work for the Japanese printing market. There are many different proprietary systems from various makers. Here we will give a general description, glossing over system-specific details.

The system maker is the only supplier of fonts that work on the system, so the system maker supplies the gaiji glyphs too. The customer could buy thousands of gaiji glyphs when purchasing the system, or they could buy individual glyphs month by month as needed for specific jobs. (In some cases, there is a tool that lets the user draw their own glyphs, by combining shapes from existing glyphs in the standard fonts.)

The customer installs the glyphs into a cumulative gaiji font, which is generally located on the final output device (Raster Image Processor, or RIP). Some metrics information about the glyph is also installed on the host layout system. There are typically thousands or tens of thousands of glyphs in this gaiji font. Gaiji glyphs for all font designs are mixed together. The character codes used to refer to the gaiji are a function of the sequence in which the gaiji were installed at that site. That means that text with formatting markup, which includes gaiji codes, is site-specific data. It generally cannot be moved to a different site, even if that site uses exactly the same equipment.

The first step in laying out a document on a proprietary system is to enter the text data. When entering gaiji, operators must enter the site-specific gaiji codes that correspond to the site's particular cumulative gaiji font contents.

The layout applications included in proprietary systems are typically not WYSIWYG ("What You See Is What You Get"). This means that the operator doing the layout of the text will see a placeholder such as geta (〓) rather than the gaiji glyph.

The layout software generates commands to the RIP that cause it to image glyphs on the film or paper. The commands include codes to invoke gaiji from the site's cumulative gaiji font on the RIP.

The text can now be sent to the prepress stages, such as paste-up. Paper-based paste-up was typical of proprietary system workflows in the early 1990s, but now electronic paste-up has become more common.

Evaluating Existing Mechanisms: Proprietary systems
Advantages
- Proven track record of success
- Font vendor provides high-quality glyphs
- Compatible with layout system
Disadvantages
- Expensive
- Cannot repurpose text, even between sites
- Job failures due to wrong version of gaiji fonts
- Closed: to DTP workflows, to technology advances
- No choice or limited choice of suppliers

Proprietary systems have a proven track record of success supporting Japanese high-quality publishing with tens of thousands of gaiji glyphs. They are the benchmark for any other mechanism to beat.

Since the system vendor supplies both the standard font collection and the gaiji glyphs, the vendor can ensure that the gaiji glyphs are faithful matches to the design of the main font. In fact, over time vendors build up a large database of raw glyphs, which they can package as either fonts or aftermarket gaiji glyphs as needed. And since one vendor supplies font, gaiji glyphs, and the layout system, all parts work together seamlessly.

Unfortunately, these systems are expensive to purchase and to operate. The site-specific nature of the gaiji codes means that documents cannot be repurposed, even between different sites of the same printing company.

There is a significant amount of management required for the cumulative gaiji fonts in the RIP and the host-side layout information. Job failures due to a mismatch in these pieces are a real danger unless the operator is meticulous about font management.

The single-vendor approach has its disadvantages as well as advantages. The systems are generally more or less closed to DTP workflows. Or they are open in some respects (for instance PostScript language output, PDF file output, or EPS illustration input) but not as generally open as a DTP system is. And the printer is limited in choice of supplier for key components of the system, especially font provider and layout application provider.

Existing Mechanisms for Gaiji: Gaiji one-byte fonts
- Draw gaiji glyphs in Fontographer
- Save as a Type 1 font ("one-byte")
- Site-specific character codes
- Install font on layout workstation
- When formatting text: change font of gaiji character to one-byte font; enter special character code for that gaiji

We now turn to gaiji mechanisms used by Desktop Publishing (DTP) systems. In contrast to proprietary systems, DTP systems are characterised by an open architecture: customers acquire computer hardware from one vendor, layout software from another, font software from a third, output devices from yet another vendor, and so on.

In the Gaiji one-byte font approach, printers create their own gaiji glyphs in a font design tool like Fontographer. They save the glyphs out in a Type 1 font, which takes one-byte codes and can effectively hold up to 200 glyphs. The character codes for this gaiji font are of course site-specific. There may be many such gaiji fonts, several per job and dozens or hundreds per site.

Printers install the gaiji font on the layout workstation. When formatting gaiji text, the operator must first change the font of the gaiji glyph to the correct one-byte font, then change the character code to the correct value for that particular font.

Layout then proceeds normally. The layout software can generally dynamically download the one-byte font to the RIP for rasterisation.
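A rough way to picture the formatting step is as a sequence of font runs in which each gaiji becomes a run in a one-byte font with an arbitrary code. The sketch below is only a model of that idea: the font names, the gaiji name, the code 0x41, and the helper functions are hypothetical, and real layout applications store runs in their own formats.

```python
BASE_FONT = "HypotheticalMincho"  # placeholder name for the main Japanese font

# Site-specific table: gaiji name -> (one-byte font name, character code).
GAIJI_TABLE = {"oka-1913": ("SiteGaiji-A", 0x41)}


def format_runs(tokens, base_font=BASE_FONT, gaiji_table=GAIJI_TABLE):
    """Turn tokens into (font, content) runs.

    Ordinary tokens stay as text in the base font; a token naming a gaiji
    becomes a one-byte code in that gaiji's font, mirroring the operator's
    two manual steps (switch font, then enter the code).
    """
    runs = []
    for token in tokens:
        if token in gaiji_table:
            font, code = gaiji_table[token]
            runs.append((font, code))
        else:
            runs.append((base_font, token))
    return runs


def strip_formatting(runs):
    """Recover 'plain text' by discarding fonts, as repurposing would."""
    return "".join(chr(c) if isinstance(c, int) else c for _, c in runs)


runs = format_runs(["丸", "oka-1913", "氏"])
print(runs)
print(strip_formatting(runs))
```

Once the font information is dropped, the gaiji collapses to whatever ordinary character happens to share its one-byte code (here the letter "A"), which is why text prepared this way resists repurposing.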

Evaluating Existing Mechanisms: Gaiji one-byte fonts

Advantages
- Compatible with present DTP layout systems
- No special support in layout app needed

Disadvantages
- Font management is very difficult
- Job failures due to missing or wrong version of fonts
- Cannot repurpose text
- Drawing glyphs requires special skills
- Labour-intensive, so expensive

One-byte fonts represent the simplest way to implement high-quality gaiji within a DTP workflow. Since Type 1 fonts are a universally supported format, the gaiji fonts can be used with any application.

However, font management is very difficult. A printer may have dozens or hundreds of fonts to keep track of. The fonts need to be available on the layout system, and often on related systems for graphics editing. If there are different versions of the fonts, or if the right font isn't in every location, jobs can fail due to missing glyphs. The failure is particularly insidious because it is nearly silent, and so easy to overlook.

Since the gaiji fonts have completely arbitrary encodings and font names, the plain text and the text with formatting markup are very site-specific, and repurposing content with gaiji is not possible.

In addition, the burden is on the printer to create glyphs that match the main fonts. As with hand-drawn glyphs, this requires special graphic arts skills. The glyph drawing, font management, and proofing are labour-intensive and expensive.
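Because the failure described above is nearly silent, a site could make it loud with a preflight check before a job is released. The following is only a sketch under assumed data structures: fonts are identified by hypothetical (name, version) pairs, and each location reports the set of pairs it has installed.

```python
def missing_fonts(fonts_used, installed_by_location):
    """Return {location: sorted list of (name, version) pairs the job needs
    but that location lacks}.

    A font present in the wrong version counts as missing, because the
    glyphs at a given code may differ between versions.
    """
    problems = {}
    for location, installed in installed_by_location.items():
        absent = sorted(set(fonts_used) - set(installed))
        if absent:
            problems[location] = absent
    return problems
```

An empty result means every location has every font in the expected version; anything else names the location and the fonts to fix before output.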

Existing Mechanisms for Gaiji: Adobe Type Composer

- Get gaiji glyphs in Type 1 font ("one-byte")
  - Site-specific or commercial options
- Set up rearranged font
  - Refers to base font + gaiji fonts (+ alternate kana font)
  - Site-specific character codes (Shift-JIS user area)
- Install rearranged font in system
- Enter text, using site-specific codes in user area
- Format text with rearranged font

Perhaps the DTP industry's most sophisticated gaiji mechanism to date has been Adobe Type Composer, by Adobe Systems. It provides a way of combining standard and gaiji fonts together into what looks like a new composite font.

As before, the printer draws the gaiji and saves them as Type 1 fonts. There are also commercial providers of gaiji fonts that work with Adobe Type Composer.

The printer uses the Adobe Type Composer tool to set up a rearranged font that maps both a standard Japanese font and one or more gaiji fonts into a shared Shift-JIS encoding range. (The rearranged font may also invoke a kana font to override the kana of the standard Japanese font.) The gaiji fonts are mapped to the user area of the Shift-JIS encoding. This means that the gaiji glyphs are invoked by Shift-JIS codes with site-specific meanings, based on the encoding of the glyphs in the Type 1 font and how they are mapped into Shift-JIS user area rows.

The printer then installs this rearranged font into the system of the layout workstation. Thanks to special support from the font rasteriser in Adobe Type Manager or the layout application, the rearranged font appears to the system and applications as just another Japanese font. Text formatted with the rearranged font is WYSIWYG.

The printer then enters text, using the site-specific codes to refer to gaiji. In the layout application, the printer formats the text with the rearranged font. The font rasteriser turns text formatted with the rearranged font into runs of text formatted with the underlying standard Japanese font and gaiji fonts as needed.

Rasterising proceeds as normal with the underlying fonts. The gaiji fonts and rearranged fonts can be downloaded to a PostScript RIP, or they can be dynamically downloaded by the printer driver as needed.
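The run-splitting step described above can be sketched as follows. This is an illustration of the idea only, not Adobe's implementation or the rearranged font format: a rearranged font is modelled as a hypothetical table from Shift-JIS lead byte (one user-area row) to a gaiji font name, and every other code falls through to the base font. The sketch also leaves the codes unchanged, whereas a real rearranged font would translate each user-area code to the gaiji font's own one-byte code.

```python
from itertools import groupby


def split_into_runs(codes, base_font, gaiji_rows):
    """Split a sequence of Shift-JIS codes into (font name, codes) runs.

    gaiji_rows maps a lead byte in the user area to the gaiji font that
    supplies that row (a hypothetical, simplified model).
    """
    def font_for(code):
        # The lead byte is the high byte of a two-byte code; one-byte codes
        # have a lead byte of 0 and so always resolve to the base font.
        return gaiji_rows.get(code >> 8, base_font)

    return [(font, list(group)) for font, group in groupby(codes, key=font_for)]
```

For example, with row 0xF0 mapped to one gaiji font, the codes 0x88EA 0x88EB 0xF040 0xF041 0x88EC split into three runs: base font, gaiji font, base font.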

Evaluating Existing Mechanisms: Adobe Type Composer

Advantages
- Compatible with DTP layout systems
- Base of 3rd-party font suppliers exists
- Can repurpose text to some extent

Disadvantages
- MacOS 9 only (plus Adobe InDesign on Windows, MacOS X)
- No path to Unicode, OpenType, new OSs
- Low glyph capacity of Shift-JIS user area
- Font management is difficult
- Job failures due to missing or wrong version of fonts
- Cannot repurpose text very easily

The distinguishing feature of the Adobe Type Composer approach is that it does not require applications to support or even be aware of the rearranged font format. It can do this because of cooperation from the system font rasteriser, which does support the rearranged font format and can present rearranged fonts to the rest of the system as completely conventional fonts. Because of this, it is compatible with all DTP layout apps and systems. At least one DTP layout application, Adobe InDesign, took the extra step of building support for the rearranged font format directly into the application.

Thanks to its many years on the market, the Adobe Type Composer architecture has built up a pool of 3rd-party font suppliers. This in turn provides printers with a supply of high-quality gaiji glyphs that match popular font designs, and with roughly standard character codes. Printers can exchange text to the extent that they use the same gaiji fonts.

But the requirement for rasteriser support limits the Adobe Type Composer approach in the long term. Adobe never provided rearranged font support at the system level on Windows. On MacOS through MacOS 9, the vehicle for supplying rearranged font support is ATM Light. However, thanks to improvements in the font rasterisers for MacOS X native, Adobe need not supply ATM Light for that platform. As a side effect, Adobe can no longer guarantee support for rearranged fonts. While Adobe InDesign has its own rasteriser with support for rearranged fonts, the fonts are usable only within InDesign.

There are architectural limits to the Adobe Type Composer (ATC) approach. One is the tight coupling to Shift-JIS encoding. The Shift-JIS user area has only 2,444 code points, which severely limits the number of gaiji glyphs that may be combined with one main font. Professional publishing requires a capacity of 10,000 to 100,000. Another limit is the lack of a clean path to bring the existing ATC technology forward to support Unicode or the OpenType font format.

There are also operational challenges with the ATC mechanism. Font management of both the rearranged fonts and the base fonts from which they are built is one challenge. If a rearranged font is missing, the application will usually flag an error that a font is not present. If the rearranged font is present but some of the component fonts are missing, the application will not flag an error, but any text that uses that component font will not display or print correctly. The lack of an error message is insidious; it can lead to an error not being discovered until late in the process, requiring expensive rework or reprinting.

Finally, text repurposing is limited because the character codes used in documents are only as standard as the choice of gaiji font.
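The 2,444 figure can be checked with a little arithmetic. A Shift-JIS two-byte code has 188 possible trail bytes (0x40 to 0x7E and 0x80 to 0xFC). The sketch below assumes the user area spans lead bytes 0xF0 through 0xFC; that range is an assumption chosen because it reproduces the figure in the text (13 rows of 188), and other definitions of the user area exist.

```python
import math


def shift_jis_trail_bytes():
    """Valid trail bytes of a two-byte Shift-JIS code: 0x40-0x7E and 0x80-0xFC."""
    return [b for b in range(0x40, 0xFD) if b != 0x7F]


def user_area_capacity(first_lead=0xF0, last_lead=0xFC):
    """Code points in the user area, assuming one row per lead byte
    (the lead-byte range is an assumption, see the lead-in)."""
    rows = last_lead - first_lead + 1
    return rows * len(shift_jis_trail_bytes())


def rearranged_fonts_needed(gaiji_count):
    """How many rearranged fonts a job would need to reach gaiji_count glyphs."""
    return math.ceil(gaiji_count / user_area_capacity())
```

Under that assumption, the lower bound of 10,000 gaiji already needs five separate rearranged fonts, and 100,000 needs forty-one, each with its own site-specific meaning for the same user-area codes.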

Existing Mechanisms for Gaiji: Megafonts

- Define a larger and larger standard character and glyph set
  - E.g. Adobe-Japan1-4, Apple Publishing Glyph Set, Unicode 3.2
  - Each increment to the glyph set reduces but does not eliminate the need for gaiji
- Font vendors build fonts that cover the full glyph set
- Enter text, using standard character or glyph codes

One response to the need for gaiji is to have fonts cover more and more characters. This has been a trend in DTP font standards for the last five to ten years. Consider these font standards, and their corresponding character or glyph counts:

Font standard                 Character or glyph count (rounded)
Japanese CID-keyed            8,700~ glyphs
OpenType standard             9,400~ glyphs
OpenType pro                  15,000~ glyphs
JIS X0208 + X0213             11,000~ characters
Apple APGS                    20,000~ glyphs
Unicode 3.2 Unihan            70,000~ Unihan characters
Dai Kanwa Jiten dictionary    50,000~ characters

Over time, font makers develop fonts that match these standards. Let us refer loosely to fonts with a large number of glyphs as megafonts.

Because the fonts cover more characters and glyphs, fewer of the glyphs required by a given publication need to be treated as gaiji. This reduces the need for gaiji mechanisms. However, it does not eliminate that need.

It is relatively simple to enter and lay out text with a megafont. Text entry uses standard character codes, and sometimes glyph selection codes via formatting controls such as OpenType features. Font handling and printing with large fonts proceeds much the same as for smaller fonts.

References:
The Adobe-Japan1 series of glyph complements are documented in Adobe-Japan1-4 Character Collection for CID-Keyed Fonts, technote #5078, 6/21/2002. <http://partners.adobe.com/asn/developer/pdfs/tn/5078.adobe-japan1-4.pdf>
Number of Han characters in Unicode 3.2 taken from The number of characters in Unicode, <http://www.i18nguy.com/unicode/char-count.html>, and from personal communication with others.

Evaluating Existing Mechanisms: Megafonts

Advantages
- Compatible with DTP layout systems
- Can repurpose text

Disadvantages
- Increasing size of glyph set is unsustainable for font makers
- Large fonts are expensive, error-prone
- Insidious glyph failures when changing fonts
- Not a complete solution (there are always more gaiji)

Enlarged character set standards, and megafonts, are effective in reducing the gaiji requirement. After all, gaiji are defined as characters or glyphs valid in the language but not in your font. Also, to the extent that the OS and applications maintain compatibility with the enlarged character sets, this mechanism is compatible with a wide variety of DTP apps. Finally, since the character encodings and glyph complements are standardised, it permits reliable text interchange.

However, this approach cannot be a complete solution. First, fonts are far from covering the characters known today. OpenType Pro fonts cover a fraction of the 50,000 kanji in the reference Dai Kanwa Jiten dictionary. Even the extended glyph coverage of the new JIS X0213 character set standard, at 11,223 characters when combined with the existing JIS X0208 complement, is a fraction of the language. And the repertoire of CJKV ideographs remains open-ended; people can and do invent new kanji characters.

Furthermore, as character sets get larger, fonts get more and more expensive to produce, and more and more likely to contain errors. Font vendors are increasingly unable to recover their costs. Our sense is that OpenType Pro fonts, at 15,000 glyphs, are about as large a glyph complement (or character set) as font vendors are likely to be able to sustain across their typeface library in the Japanese publishing market. Larger glyph complements will only be addressed by a few fonts from a few vendors. And even the OpenType Pro font complement is so large that it is hard to justify making a wide variety of unusual display typefaces.

There are proposals for composite font architectures, where the text and font system would use multiple fonts together to cover a large character set like Unicode. This does not resolve the megafont challenge for high-quality publishing. It is a requirement that all CJKV ideographs be in a consistent design, so all ideograph components of the composite font would have to be designed for each other. And each component is likely to have thousands of glyphs, which still leaves the font developer with the burden of building megafonts.

So the market demands a way to handle gaiji, but ever-larger fonts are not the answer.
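The working definition used above, that gaiji are characters valid in the language but not in your font, translates directly into a coverage check. The sketch below models a font's coverage as a plain set of characters; that model and the function name are assumptions for illustration, since a real tool would read coverage from the font itself.

```python
def find_gaiji(text, font_coverage):
    """Return the distinct characters of text that the font does not cover,
    in order of first appearance. Whitespace is ignored.
    """
    gaiji = []
    for ch in text:
        if ch.isspace():
            continue
        if ch not in font_coverage and ch not in gaiji:
            gaiji.append(ch)
    return gaiji
```

Running the same text against a larger coverage set returns a shorter list, which is the sense in which megafonts reduce the gaiji requirement; the list only becomes empty for every possible text if the font covers an open-ended script, which it cannot.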

Existing Mechanisms for Gaiji: The Unicode Standard

- Unicode encodes more and more historical and variant characters
- Fonts covering Unicode become megafonts
- Variation Selector mechanism adopted in Unicode 3.2
  - Usage for Han glyph variants is not yet specified
- Not all gaiji are eligible for encoding by Unicode
  - Corporate logos, symbols, glyph variants
  - Unicode doesn't commit to encode all glyph variants as characters

Now let us consider how the Unicode Standard addresses the publishing market's requirement for gaiji support. The Japanese publishing industry is beginning to encounter Unicode deeply, since it is the main character encoding used in OpenType fonts, OpenType layout services, MacOS X, and Windows.

With each version, Unicode encodes more and more of the CJKV ideograph repertoire. The most frequently used characters and character set standards were covered years ago, so recent additions tend to cover exactly the historical and variant glyphs that have been handled as gaiji up to now. Projecting forward, we can imagine that Unicode will eventually encode all of the characters of Chinese origin that are both known about and valid to encode.

Of course, fonts that aim to cover the Unicode character set become megafonts, with the strengths and weaknesses of that approach.

The Variation Selector mechanism, defined in Unicode 3.2, provides a way of specifying variants, such as for CJK ideographs, that have essentially the same semantics but substantially different ranges of glyphs. This sounds like it might be useful for representing the kind of glyph variants that make up part of the gaiji requirement. However, it has not yet been defined precisely how variation selectors apply to Han ideographs. The responsibility for that definition lies with the Ideographic Rapporteur Group (IRG).

In any case, the Unicode Standard will by definition not expand to cover all gaiji. Some gaiji represent the kind of symbols which are out of scope for Unicode, such as corporate logos, unusual symbols, and arbitrary kinds of glyph variants. The Unicode Standard is also careful to draw a distinction between encoding characters, which is in scope for Unicode, and encoding glyphs, which is not. But the hard kernel of the gaiji business requirement in publishing is that the right glyph appear on the page. All the characters and glyphs which Unicode declares beyond its scope are reasons for a separate gaiji system.

Evaluating Existing Mechanisms: The Unicode Standard

Advantages
- Widely-used and well-designed standard
- Extensive repertoire of Han characters (99+%)
- Ideal for repurposing text
- Works well with OpenType

Disadvantages
- Fonts covering Unicode become megafonts
- Not a complete solution: there are always more gaiji
- Not all gaiji are eligible for encoding by Unicode

Open
- Will Variation Selectors turn out to be useful for gaiji?

The Unicode Standard has some strengths that make it a very good solution for the part of the gaiji problem which is in its scope. It is a widely-used and well-designed standard. It escapes one of the great limitations of the Shift-JIS encoding scheme, lack of encoding space. Unicode has plenty of room to encode characters and character variations. In fact, Unicode 3.2 has almost certainly encoded enough Japanese characters to cover well over 99% of the characters occurring in published documents.

Text encoded as Unicode will be repurposable. Text portability is one of the core strengths of Unicode as a text data format. Also, the OpenType font format, which is assuming primacy in the publishing industry, is smoothly compatible with character data in Unicode.

However, there are drawbacks. As mentioned before, fonts that attempt to cover large portions of the Unicode encoding space directly will be megafonts, with the problems of megafonts: they are unsustainable for font makers, difficult to make error-free, and vulnerable to causing missing glyphs in documents. And in any case, Unicode by its own terms declares some gaiji to be glyphs or characters or symbols that are out of its scope. Nor can it ever completely cover an open-ended writing system.

An interesting open question is the extent to which Variation Selectors will turn out to be useful in handling gaiji. The Variation Selector mechanism is defined, but no Variation Selectors for CJKV ideographs have been registered yet (as of Summer 2002). Thinking about what meaning to assign to Variation Selectors here raises interesting questions about the relation between the concepts of character, character variation, glyph variation, and glyph. There will be more on this topic later in this paper.

Evaluation of Existing Solutions

- Proprietary systems have quality and capacity
  - But: expensive, closed
- Desktop publishing systems are open, cheap
  - But: lack capacity, original font vendors
- Megafonts are unsustainable and incomplete
- Unicode helps, but is not a complete solution
  - How much will Variation Selectors help?
- There is an opportunity for a better way

So, where does this leave us? Proprietary systems and desktop publishing systems each have strengths the other lacks, and weaknesses the other remedies.

Proprietary systems can handle gaiji with sufficient quality, in sufficient numbers, with sufficient productivity. However, they are expensive and closed systems.

Desktop publishing systems are far more open, and due to competition between providers they are cheaper to buy and operate. However, they aren't able to handle gaiji in sufficient quantities with sufficient productivity, and they lack comprehensive support by font vendors.

Megafonts have allowed the industry to make some progress, but have gone about as far as they can go, at least for graphic arts publishing using a wide library of type designs.

Unicode brings significant advantages to bear, but it does not claim to be a complete solution to the publishing market's requirements.

Pasting up hand-drawn glyphs is labour-intensive, and therefore expensive. It also does not allow repurposing.

There is a need for a gaiji system which is smoothly compatible with Unicode (and other publishing standards like OpenType), but which covers the areas beyond Unicode's scope. It should be able to give the best of both worlds: the quality, capacity and productivity of the proprietary systems, with the openness and interchange of desktop publishing.

The Unicode Character-Glyph Model

[Diagram: the workflow from characters (plain text, Unicode, database content) to text with formatting markup (word processing doc, HTML+CSS) to glyphs and outlines (PDF file, SVG), with fonts supplying the character to glyph mapping and the glyph to outlines step.]

- Gaiji: characters, glyphs, both, neither?
  - A: All of the above!
- Note: "character" and "glyph" now used rigorously.

Now, let us turn from concrete market requirements for gaiji to the more abstract theoretical questions which the gaiji requirement raises. In this section of the paper, we will try to be rigorous about the terms "character" and "glyph", as used by the Unicode character-glyph model.

Consider a portion of the overall publishing workflow presented here: the portion that goes from characters to formatted text to glyphs to glyph outlines. (Downstream of this portion, gaiji are handled the same as conventional text.)

This model is well understood as it applies to character and layout architectures like Unicode and OpenType. But what about the several kinds of elements treated as gaiji in present publishing systems? In this model, are gaiji characters, or glyphs, or both, or neither? (The alert reader will perhaps guess the answer: depending on the case, they can be any of the above.)

Gaiji in the Character-Glyph Model: A Gaiji That is a Character (and a Glyph)

- Consider this newly invented Japanese character dera, or in 1997 the new euro

[Figure: the name dera-han-to, labelled U+???? U+5E06 U+4EBA, and the Euro symbol (pre 1997), labelled U+????]

- In a pure character realm, e.g. a database entry, these gaiji are characters only
- But, in a page description context, they are both characters and glyphs.

Here we illustrate two cases. The first is a gaiji used in a personal name, which happens to be a Japanese rendering of the author's name, dera-han-to. Since the Japanese language didn't have an ideograph to represent the appropriate sound, a new ideograph was invented. It is derived from a standard character tera, with the addition of marks to indicate a pronunciation of dera. (This is not conventional Japanese orthography, but it is comprehensible to native Japanese readers.) The remaining two ideographs are standard.

As a novel character, dera is of course not encoded. If this paper is widely republished, perhaps it will establish enough evidence of usage that the author can eventually register it as a new character in Unicode. Until then it is a gaiji.

The second example is from the Latin script and Western Europe: the Euro currency symbol. Before 1997, this symbol was also a novel character, and it also was not encoded. After its invention, there was quite a lot of work to make it usable in publishing systems: registering it in Unicode and other coded character sets, adding a Euro glyph to fonts, and modifying countless operating systems, keyboards, text systems, and print systems to make it work everywhere. This process is still not complete. So gaiji are also relevant in the Western publishing context. They are just rarer.

Both of these could occur in plain text, such as an email message or the field of a database. So in that sense, they have an existence as characters only. That they are not encoded does not lessen their character nature.

But like most characters, they are eventually presented in some form of output. The marks on the output that correspond to these gaiji are clearly glyphs. So in a presentation context, these gaiji have both character and glyph nature.

So: in some contexts gaiji are characters only. In others, they are both characters and glyphs. And the story continues.

Gaiji in the Character-Glyph Model: A Gaiji That is Neither Character nor Glyph

- Logos, symbols, etc. are graphics, not text.
- They are frequently implemented using fonts for convenience, or due to graphics limitations.
- This leaves artifacts in the text stream
  - e.g. select above logo, copy, paste; you get "a".

In this example, we have an element of a sort that publishers often handle with a gaiji system: a logo. In Unicode's conceptual model, logos and symbols are not characters. They are graphics or illustrations. The fact that they are implemented with a font mechanism does not change their graphical, non-character nature.

Why are such symbols implemented as gaiji, or as in this case as a logo font? There are often practical reasons. In older proprietary systems, there was sometimes no concept of scalable vector graphics outside of the font system. Representing a logo as a gaiji was the most convenient way to be able to control its size freely. Also, a font represents a convenient package for distributing graphics, especially if several logos need to be distributed together. Finally, text rendering systems can sometimes do a better job of rendering the glyphs present in a logo, since their outline-to-bitmap rendering algorithms are adapted for this purpose.

However, most present systems don't distinguish between fonts or gaiji that contain text and those that contain graphics. In other words, they force users to treat graphics as characters, simply because the graphics are implemented as fonts or as gaiji. This leads to unfortunate artifacts. Copying the above logo, and pasting it into the text stream, yields an "a". But the logo is in no sense a glyph for the character "a".

So, some elements which publishers handle with existing gaiji systems are neither glyphs nor characters. But any gaiji system must expect to encounter them.

Gaiji in the Character-Glyph Model: Gaiji That are Glyphs

[Diagram rows: Encoded character; Abstract glyphs (1) (direction independent); Abstract glyphs (2) (with writing direction); Font design; Concrete glyphs (Kozuka Mincho Pro M, Kozuka Gothic H). Columns: oka as used in a name (2002 and 1913 forms); kabushiki gaisha (ligature, horizontal and vertical forms).]

Finally, let us examine two examples of gaiji that are glyphs. In the left example, you see the character oka, as used in the personal name Maruoka. The first row shows the conventional modern glyph for this character. In the second row, to the right of the modern (2002) glyph, we see a 1913-era glyph variant. As far as we are aware, this variant is not encoded, though we haven't combed all of Unihan Extension B for it. It certainly is not within the glyph complement of commonly-used publishing fonts. Let us suppose that it is in fact not encoded as a separate character in Unicode. Then the 1913-era glyph is a gaiji that is purely a variant glyph: it has no novel character properties, just glyph properties.

We have now demonstrated examples of gaiji that are characters only, glyphs only, both, and neither. QED. But read on, there is more.

In the right example, you see the Japanese phrase kabushiki gaisha, meaning stock company. This stock phrase is so commonly used with company names that it is sometimes set as a ligature, a single glyph that represents all four characters together. Japanese text is commonly written horizontally (left-to-right, top-to-bottom) and vertically (top-to-bottom, right-to-left). So this ligature also appears in two forms, horizontal and vertical. If you look carefully, you will see that the parts of the ligature are arranged in different orders in the two forms. By the way, the oka glyph alternatives are the same in both horizontal and vertical layout.

It is an interesting question whether this ligature is a character in the Unicode sense, eligible for encoding. In any case, it is certainly beyond Unicode's scope to encode the horizontal and vertical glyph variants as full characters. The horizontal-vertical text distinction is not required for plain-text legibility; it is a property of formatted text layout.

In the last line of the diagram, you see a font design applied to each of the glyphs, yielding different concrete glyphs. Three of the four glyphs are formatted with two fonts each: Kozuka Mincho Pro M and Kozuka Gothic Pro H. The 1913-era oka character is formatted only in Kozuka Mincho.

Looking at this diagram, it is clear that each succeeding line represents a step from character to glyph, from abstract to concrete. The first line is pure character data. The line Abstract glyphs (1) specifies ligature formation and historical glyphs, but not writing direction or font design. The line Abstract glyphs (2) adds a specification of writing direction. But only in the last line do we specify font design and unambiguously arrive at what the Unicode character-glyph model terms a glyph.

Let us consider the implications of this.

Abstract Glyph is a Useful Concept

- It is meaningful to identify glyphs as the same except for font variations
- Glyph identification schemes are widely used
  - CID-Keyed Character Collections, e.g. Adobe-Japan1-4
  - Adobe Glyph List naming conventions
- Practical uses of abstract glyph codes
  - Glyph to character mapping, font change of text

While the Unicode character-glyph model clearly calls for characters and glyphs as two layers of abstraction, it does not speak to what other layers of abstraction may be meaningful. The diagram on the previous page illustrates two intermediate levels of abstraction: before and after the specification of writing direction, but not yet specifying font design.

It is meaningful to speak of the glyphs on the bottom row as being the same as each other except for font design variation. It is meaningful to speak of the glyphs on the Abstract glyphs (2) row as being the same except for writing direction variation. There are other layers that can meaningfully be defined.

The Japanese desktop publishing and type industry has extensive practical experience with a layer of abstract glyphs. These are the glyphs specified by the CID-Keyed Character Collections defined by Adobe Systems Incorporated and others. For instance, the Adobe-Japan1-4 collection describes 15,444 abstract glyphs that differentiate glyph variants, horizontal and vertical forms, and other variants besides, but are independent of font design. (The name Character Collection is an anachronism; today we might call it an Abstract Glyph Repertoire.)

Adobe-Japan1-4, and the CID level of abstraction, have proven a useful foundation upon which multiple font vendors could design commercially useful, high-quality typefaces. A CID may be thought of as an abstract shape that can be styled, e.g. by bold versus light weight, or gothic versus mincho font treatment.

In the Western publishing and type industry, the Adobe Glyph List naming conventions play a similar role. While they don't prescribe an abstract glyph repertoire, the conventions do provide a way to name abstract glyphs largely independent of font design but specifying glyph variants like swash capitals, small capitals, and lowercase figures.

An objection that is sometimes raised to encoding glyphs is that the universe of font designs is infinitely large, and so the set of glyphs is innumerable. The type industry's experience is that it is possible to define abstract glyphs at various levels, and these abstract glyphs are both useful and tractable to name and enumerate.

One practical use for abstract glyph names in layout software is glyph to character mapping. In OpenType Japanese fonts that follow the Pro conventions, there are 15,444 glyphs, each corresponding to a numbered abstract glyph. Tables known as cmap tables map from Unicode character codes to these abstract glyph codes, and this mapping can be kept consistent from font to font.

Another practical use is font changing when formatting text. Layout software can use one font to perform a character-to-glyph mapping, then store the abstract glyph code. If the formatting changes to use a different font which uses the same abstract glyph codes, it is simple and reliable to apply the stored abstract glyph codes to the new font, and display correct glyphs. Without shared abstract glyph codes, this operation is much more difficult.
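The font-change use case can be sketched with a toy model. Everything here is hypothetical: the `Font` class, the font names, and the glyph codes 1001 and 9001 are invented for illustration and are not real Adobe-Japan1 CIDs or a real font API. The point is only that the layout stores abstract glyph codes rather than characters, so a variant glyph chosen in one font survives a switch to another font that shares the same codes.

```python
from dataclasses import dataclass


@dataclass
class Font:
    name: str
    cmap: dict        # Unicode code point -> default abstract glyph code
    glyph_codes: set  # every abstract glyph code this font can draw


def map_text(font, text):
    """Character-to-glyph mapping: look up each character's default glyph code."""
    return [font.cmap[ord(ch)] for ch in text]


def change_font(stored_codes, new_font):
    """Re-apply stored abstract glyph codes to a different font.

    Raises LookupError rather than failing silently when the new font
    lacks one of the stored glyphs.
    """
    missing = [code for code in stored_codes if code not in new_font.glyph_codes]
    if missing:
        raise LookupError(f"{new_font.name} lacks glyph codes {missing}")
    return [(new_font.name, code) for code in stored_codes]
```

If the operator replaces the default code 1001 with a variant code 9001, `change_font` carries 9001 to the new font unchanged, whereas remapping from the character would have fallen back to 1001. Raising an error on a missing glyph is a design choice in this sketch, made to avoid the silent failures described earlier.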

Unicode Variation Selectors and Gaiji

- What is the boundary between characters and glyphs for the CJKV ideographic script?
  - Are the abstract glyphs on the previous slide glyph variants or character variants?
  - Single Han grapheme vs. ligature
- Helpful for Unicode Variation Selectors to designate the same level of abstraction as CIDs
  - Allows many gaiji to be processed at character level without resorting to a gaiji mechanism

Consider Japanese orthography, particularly the preceding examples, in the light of this discussion. What is the boundary between characters and glyphs in this orthography? The 1913 oka is clearly related to the 2002 oka. Are they different characters in the Unicode sense, or glyph variants of a single character? If they are different characters, what about the kabushiki gaisha ligature? Could that be a character in its own right, or is it more specific than a character, and so some level of abstract glyph?

The Unicode Standard distinguishes between abstract characters and encoded characters. If you regard ideograph variations like the 1913 oka and the 2002 oka as the same abstract character, but they are both encoded, then it appears that, for the CJKV ideographic script, the number of abstract characters is much smaller than the number of encoded characters.

And Unicode Variation Selectors are clearly intended to be more specific than unadorned encoded characters, but not as specific as encoding font design variations. This puts them somewhere among the levels of abstraction that we call abstract glyphs. At what level are character variations? What is the distinction between a character variation and an abstract glyph of some level?

Our experience is that the CID level of abstraction used in OpenType, i.e. an abstract shape that can be styled, is a practical and useful level for CJKV ideographic script publishing. This leads us to believe that defining Unicode Variation Selectors for the CJKV ideographic script characters to operate at roughly this level of abstraction would prove very useful for the publishing industry, and for representing gaiji in general. It would allow many gaiji to be processed at a character level, using standard Unicode mechanisms, saving many layers of software from having to add specific gaiji support.

For instance, you can imagine a database of personal and place names which contains Unicode character data. Variant ideographs could be represented as plain characters where possible, and characters plus variation selectors where necessary. The text would remain standard Unicode character data for processing. Only data entry and formatting systems would need to take special notice of the variation selectors and what they mean.

The question of how Variation Selectors will be put to use is, in our opinion, the most interesting open question concerning Unicode support for gaiji.
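The database scenario can be sketched in a few lines. The code below assumes only the sixteen variation selectors U+FE00 to U+FE0F; the pairing of any particular ideograph with any particular selector in the example is hypothetical, since (as noted above) no such sequences had been registered for CJKV ideographs. The function names are likewise invented for illustration.

```python
# Variation selectors VS1..VS16 occupy U+FE00..U+FE0F.
VARIATION_SELECTORS = range(0xFE00, 0xFE10)


def is_variation_selector(ch):
    return ord(ch) in VARIATION_SELECTORS


def split_variation_sequences(text):
    """Return (base character, selector or None) pairs.

    This is the view a formatting system would need. A selector with no
    preceding base, or one following another selector, is kept as its
    own item rather than dropped.
    """
    pairs = []
    for ch in text:
        if is_variation_selector(ch) and pairs and pairs[-1][1] is None \
                and not is_variation_selector(pairs[-1][0]):
            pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, None))
    return pairs


def strip_variation_selectors(text):
    """The view most other software can use: plain characters only,
    e.g. for searching or sorting names regardless of glyph variant."""
    return "".join(ch for ch in text if not is_variation_selector(ch))
```

A name stored as two ideographs plus one selector thus searches and sorts as two ordinary characters, while the formatting system still sees which glyph variant was requested.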

Gaiji in Unicode Character-Glyph Model
Gaiji can represent characters, or glyphs, or both, or neither
Unicode coverage of gaiji will never be complete
  Will take years to encode all known characters
  New characters will be invented or discovered
  Will never encode some items: logos, wrong glyphs
Glyphs and glyph variations matter, and there must be a way to process and display them.
So, let us summarise what we have learned about gaiji in the terms of the Unicode character-glyph model. We have seen that gaiji, as used in Japanese professional publishing, can (in Unicode terms) be characters, or glyphs, or both, or neither.
We have seen that Unicode by itself can address part of the gaiji requirement, but it is not a complete solution: not in the long term, and certainly not at present. For now, there are thousands of Han ideographs which have yet to be encoded. It will take years to cover this backlog. Then, once the Unicode encoding process has caught up to known Han ideographs, there will always be a trickle of novel or recently discovered ideographs. Finally, there are elements publishers will want to use, which the Unicode Standard declares to be beyond its scope.
We have seen that glyphs matter, and glyph variations matter. Regardless of the necessary and appropriate boundaries which Unicode sets for itself, a complete publishing system must be able to handle these glyphs and glyph variations in a powerful and practical way, including processing and display. The solution will have to involve Unicode, but reach beyond Unicode.

Conclusions
Gaiji are a real need, especially in Han texts, especially for high-quality publishing
No existing gaiji mechanism has both enough capacity and enough openness
Unicode can reduce, but never eliminate, gaiji
Variation Selectors marking abstract glyphs would help
Gaiji represent a real need for texts worldwide. The need is particularly acute in Chinese, Japanese, Korean and Vietnamese (CJKV) texts, since the CJKV ideographic script is fundamentally open-ended, while at the same time the publishing market demands high-quality glyphs in a wide variety of font designs.
While there are a variety of successful gaiji systems on the Japanese market today, the publishing industry there is roughly divided between proprietary systems and desktop publishing systems (DTP). The proprietary systems offer capacity and glyph quality, while DTP offers openness and economy. No system today captures the best of both worlds. There is a need for a gaiji system which is smoothly compatible with Unicode (and other publishing standards like OpenType), but which covers the areas beyond Unicode's scope. It should be able to give the best of both worlds: the quality, capacity and productivity of the proprietary systems, with the openness and interoperability of desktop publishing.
Unicode's broad coverage of the CJKV ideographic script represents a significant step forward, but Unicode by its own terms will never be a complete gaiji solution. Some further mechanism is necessary, and it should interface smoothly with Unicode.
The question of how the Variation Selectors introduced in Unicode 3.2 will be put to use for the CJKV ideographic script is, in our opinion, the most interesting open question concerning Unicode support for gaiji.

Q & A
This paper is available at http://partners.adobe.com/asn/developer/type/gaiji.html
