Towards a Unicode Encoding of Teuthonista

Similar documents
When older typesetting methods gave

The BU Keyboarding System

1. Manuscript Formatting

HTML Codes - Characters and symbols

Adjunct Assistant Professor Gayle Rembold Furbert. Typography II. County College of Morris Graphic Design Degree Program

The Unicode Standard Version 8.0 Core Specification

ASCII control characters (character code 0-31)

Introduction to Unicode. By: Atif Gulzar Center for Research in Urdu Language Processing

Tibetan For Windows - Software Development and Future Speculations. Marvin Moser, Tibetan for Windows & Lucent Technologies, USA

Easy Bangla Typing for MS-Word!

The Unicode Standard Version 8.0 Core Specification

7-Bit coded Character Set

MANUSCRIPT TITLE (CENTERED, 14 POINT BOLD, SANS SERIF FONT, MAJUSCULE)

PURPOSE OF THIS GUIDE

D'Nealian manuscript handwriting practice worksheets.

dtsearch Frequently Asked Questions Version 3

FUNCTIONAL SKILLS ENGLISH - WRITING LEVEL 2

Preservation Handbook

Chart of ASCII Codes for SEVIS Name Fields

ISO/IEC JTC1 SC2/WG2 N4399

New in Quark 7 is the Glyphs palette, a feature added

Word Processing programs and their uses

BCCC Library. 2. Spacing-. Click the Home tab and then click the little arrow in the Paragraph group.

AccordIt User s Guide

What s New in QuarkXPress 8

CyI DOCTORAL THESIS TEMPLATE 1

Mac OS X 10.4 (Tiger): Enabling polytonic Greek

Romanian Online Dialect Atlas: Crişana

Rendering/Layout Engine for Complex script. Pema Geyleg

Q&As: Microsoft Excel 2013: Chapter 2

Creating a table of contents quickly in Word

Frequently Asked Questions on character sets and languages in MT and MX free format fields

3. Add and delete a cover page...7 Add a cover page... 7 Delete a cover page... 7

SAMPLE TURABIAN STYLE PAPER

Pepperdine University Graduate School of Education and Psychology Dissertation Format Requirements Manual. APA 6 th edition format

Keyboards for inputting Japanese language -A study based on US patents

EURESCOM - P923 (Babelweb) PIR.3.1

Graphic Design Basics. Shannon B. Neely. Pacific Northwest National Laboratory Graphics and Multimedia Design Group

How to Cite Information From This System

Internationalizing the Domain Name System. Šimon Hochla, Anisa Azis, Fara Nabilla

L2/ Abstract Introduction

Things to remember when transcribing speech

Formatting Text in Microsoft Word

Excel 2007: Basics Learning Guide

HIT THE GROUND RUNNING MS WORD INTRODUCTION

Writing Journal Articles

MS Word 2007 practical notes

The Center for Teaching, Learning, & Technology

Chapter 2 Text Processing with the Command Line Interface

Thesis Format Guide. Denise Robertson Graduate School Office 138 Woodland Street Room

Impress Guide Chapter 3 Adding and Formatting Text

ASCII CODES WITH GREEK CHARACTERS

Corporate Identity Standards and Trademarks Manual

Divide your material into sections, for example: Abstract, Introduction, Methods, Results, Conclusions

1.4 Compound Inequalities

SAPScript. A Standard Text is a like our normal documents. In Standard Text, you can create standard documents like letters, articles etc

Word is the Microsoft Office word processor application. This chapter

ITU-T E.123. Notation for national and international telephone numbers, addresses and Web addresses

So far we have investigated combinational logic for which the output of the logic devices/circuits depends only on the present state of the inputs.

HP Business Notebook Password Localization Guidelines V1.0

Office of Research and Graduate Studies

Manage. Help Documentation. This document was auto-created from web content and is subject to change at any time. Copyright (c) 2016 SmarterTools Inc.

Chapter 4: Computer Codes

Microsoft Excel 2010 Part 3: Advanced Excel

This project is an opportunity to create a unique logo for a fictional

Lesson Review Answers

VISUAL GUIDE to. RX Scripting. for Roulette Xtreme - System Designer 2.0

Microsoft Excel Understanding the Basics

A Guide to Completing the Music Education Thesis

Unicode in Mobile Phones

The future of International SEO. The future of Search Engine Optimization (SEO) for International Business

Producing accessible materials for print and online

Chapter 2 Review Questions and Answers

Latin Alphabet special characters in Microsoft Word Article by: Stélios C. Alvarez 08

Word 2007: Basics Learning Guide

CHECKLIST FOR THE DEGREE PROJECT REPORT

Creating Interactive PDF Forms

Basic numerical skills: FRACTIONS, DECIMALS, PROPORTIONS, RATIOS AND PERCENTAGES

Excel Level Two. Introduction. Contents. Exploring Formulas. Entering Formulas

Microsoft Excel 2010 Charts and Graphs

Toothpick Squares: An Introduction to Formulas

A THESIS/DISSERTATION FORMATTING MANUAL FOR THE PURDUE UNIVERISTY CHEMISTRY DEPARTMENT. A Manual. Submitted to the Faculty and Graduate Students

Guide for writing assignment reports

ABOUT THIS DOCUMENT ABOUT CHARTS/COMMON TERMINOLOGY

Basic Use of the TI-84 Plus

BLACKBOARD 9.1: Text Editor

JavaScript: Introduction to Scripting Pearson Education, Inc. All rights reserved.

Thesis and Dissertation Preparation Manual

What's New in QuarkXPress 10

Version of Barcode Toolbox adds support for Adobe Illustrator CS

How to make more than one math OpenType font

Section 1.4 Place Value Systems of Numeration in Other Bases

Tips for Presenting Online to People Who Are Blind or Visually Impaired

Transcription:

Towards a Unicode Encoding of Teuthonista Presentation held at the Workshop TranscriptionSystems andunicode Special Characters in a Standardized World Vienna (Austria), 2008-11-05 Karl Pentzlin http://pentzlin.com Abstract (added for the printed version) Teuthonista is a German dialect writing system employing numerous combinations of letters and diacritical marks stacked vertically or even side-/ downwards to its symbols. Although this is not uncommon for some South Asian scripts already encoded in Unicode, this is uncommon for the Latin script in its extent it is used for Teuthonista. Thus, a proposal to encode Teuthonista must convince the Unicode Consortium that Teuthonista is in fact a set of characters (i.e. a valid addition to the Latin script), rather than being a notation system like musical notes. To accomplish this, there an exhaustive list of the building blocks of Teuthonista must be supplied, together with the rules how these building blocks are used to combine to the complex Teuthonista symbols. Also, this must be done in a way that these building blocks and rules are compatible to the mechanisms defined in the Unicode standards, as well as to the unwritten practices of encoding. Some special Teuthonista symbols are presented with analyses how such symbols can be decomposed into building blocks which can be encoded. A preliminary character set based on the "Einführungin den Sprachatlas der Deutschen Schweiz" (R. Hotzenköcherle, Bern 1962) is presented informally, which can be used as base for further work on the Unicode encoding of Teuthonista. 1

Unicode encodes characters A text, consisting of characters does not encode notational systems Use of a notational system (However, Unicode encodes building blocks for notational systems.) Is Teuthonistaa character set? (rather than being a notational system) (Of course, it is. But the Unicode Committee has to be convinced.) Unicode encodes characters, not glyphs. Accented forms can be dynamically composed: Here two characters are used: A and, to generate the composed character Ä. Characters have well-defined semantics: In the previous example, is a single character, with the semantic diaeresis, trema, or umlaut. There is no recurrence to the optical building elements (the single dots in this case). The latter also implies: Teuthonistaitself must be defined completely and exhaustively, in terms of characters, before getting encoded. However, the phonetics which Teuthonistacharacters denote are not part of their definition when getting encoded in Unicode. 2

Thus, a typical composed Teuthonistacharacter is... = LATIN SMALL LETTER E (the base letter) + COMBINING MACRON (the dash above of the e) + either (Solution A): -*COMBINING LEFT PARENTHESIS BELOW + COMBINING DIAERESIS BELOW + *COMBINING RIGHT PARENTHESIS BELOW + or (Solution B): - COMBINING DIAERESIS BELOW + *COMBINING PAIR OF PARENTHESES BELOW + or (Solution C): -*COMBINING PARENTHESIZED DIAERESIS BELOW = LATIN SMALL LETTER E (the base letter) + COMBINING MACRON (the dash above of the e) + either (Solution A): -*COMBINING LEFT PARENTHESIS BELOW + COMBINING DIAERESIS BELOW + *COMBINING RIGHT PARENTHESIS BELOW + or (Solution B): - COMBINING DIAERESIS BELOW + *COMBINING PAIR OF PARENTHESES BELOW + or (Solution C): -*COMBINING PARENTHESIZED DIAERESIS BELOW Drawbacks of Solution A: - Parentheses occur in pairs only. Such pairs are a semantic unit in Teuthonista. Drawbacks of Solution B: - Font related: Fonts containing such characters must employ mechanisms for relative positioning of the parentheses and the enclosed base diacritics. - Unicode related: There is no combining class to accommodate a diacritic like pair of parentheses. Advantages of Solution C: - Parenthesized diacritics are, as single entities, easily to handle on font design - The members of the Teuthonista fourfold diacritic sequences (single in parens, single, double in parens, double) are treated equally as semantic units Thus, Solution C is preferred when encoding Teuthonista. 3

Unicode Unicode characters represent plain text. Plain text e.g. is suited for storing into text databases, without the need to include any style, layout, or positioning information. Not in the scope of Unicode are: - any presentation and formatting issues - any font and character style issues - any layout issues. This implies: - All issues of vertical positioning must be handled by the definition of the characters themselves. Unicode contains - modifier letters (raised or lowered letters), like ᵃᵉˣᵊ ₐₑₓ - combining letters which act like accents placed above of a letter, like e e e e But what about these: Unicode does not provide a mechanism to apply its combining characters (like the dieaeresis here) to a superscript letter. Thus, the superscript ü must be encoded as a separate entity. What is this? It denotes a vowel pronounced between e and i, marked as a long vowel (by the macron) as a whole. Thus, it is an e + superscript i + macron. The macron is stacked upon the combination of e + superscript i, which is the mechanism provided by Unicode. The macron retains its full length. However, to get a superscript i without a dot, either the font must provide a special display for this combination, or a special superscript dotless i must be encoded. 4

from[a] (see Bibliography), p.61 What about the diaeresisbelow the raised alpha? - The raised alpha is a modifier letter (as any raised or lowered letter in Unicode is called). - The character part where the diacritic applies is unambiguous. This differs from combining superscript letters, as for a dieaeresis on o + superscript u + diaeresis above (where Unicode rules apply the diaeresisto the o, not to the superscript u). - Therefore, the standard Unicode combining marks can be applied in such cases. It is the responsibility of the font or the rendering system to select the correct place and size for the diacritic. - Thus, the last letter in the example above is: MODIFIER LETTER SMALL ALPHA + COMBINING DIAERESIS BELOW. In this case, nothing new is to be encoded. These characters are taken from [C] (see bibliography). They show a special diacritical mark, the Teuthonista ray, in the typical Teuthonista row of four: single in parens, single, double in parens, double. Although this diacritical mark is left-attaching, it has to be entered after the base letter, as it is common for all diacritical marks. (Such logical order being different from the visual order is common for South Asian scripts.) This special beast (also taken from [C]) shows dynamic rendering of character sequences, a feature also common for South Asian scripts. It is: LATIN SMALL LETTER O + *DOUBLE PARENTHESIZED TEUTHONISTA RAY + *DOUBLE PARENTHESIZED TEUTHONISTA HOOK. It requires to be done by the font or rendering system: When a *DOUBLE PARENTHETIZED TEUTHONISTA RAY is followed by another double parenthesized diacritical mark to be placed below the base letter, they are displayed parenthesized as a whole. (As this does not affect the semantics, this special rendering is optional.) 5

Bibliography [A] Hotzenköcherle, Rudolf: Einführung in den Sprachatlas der Deutschen Schweiz. Bern 1962 Band A: Zur Methodologie der Kleinraumatlanten [B] Hotzenköcherle, Rudolf: Einführung in den Sprachatlas der Deutschen Schweiz. Bern 1962 Band B: Fragebuch -Transkriptionsschlüssel -Aufnahmeprotokolle [C] Reichel, Sibylle: Handbuch zum Zeichensatz SMFTeuthonista. Erlangen 2003 http://www.sprachatlas.phil.uni-erlangen.de/materialien/teuthonista_handbuch.pdf [M] Wiesinger, Peter: Das phonetische Transkriptionssystem der Zeitschrift Teuthonista. Zeitschrift für Mundartforschung, Wiesbaden 1964, pp.1-40 [S] Steger, Hugo; Schupp, Volker: Einleitung zum Südwestdeutschen Sprachatlas. Marburg 1993 Appendix Preliminary list of characters suitable for encoding found in the works listed in the Bibliography For each character, an example is referenced in parentheses: letter referring to the Biography + page number Diacritical Marks PARENTHESIZED COMBINING DOT BELOW (A52) PARENTHESIZED COMBINING DIAERESIS BELOW (A53) PARENTHESIZED TEUTHONISTA HOOK (A63) TEUTHONISTA HOOK (A52) PARENTHESIZED DOUBLE TEUTHONISTA HOOK (A53) DOUBLE TEUTHONISTA HOOK (A53) PARENTHESIZED TEUTHONISTA RAY (C) TEUTHONISTA RAY (C) PARENTHESIZED DOUBLE TEUTHONISTA RAY (C) DOUBLE TEUTHONISTA RAY (C) PARENTHESIZED COMBINING DIAERESIS (B80) PARENTHESIZED COMBINING TILDE (S70) PARENTHESIZED COMBINING VERTICAL LINE BELOW (A68) COMBINING TWOFOLD INVERTED BREVE ABOVE (B81) COMBINING TWOFOLD INVERTED BREVE BELOW (A72) COMBINING TWOFOLD BREVE BELOW (B92) COMBINING DOUBLE INVERTED BREVE BELOW (B85) COMBINING CIRCUMFLEX ACCENT PAIR ABOVE (B86) COMBINING LEMNISCATE ABOVE (B85) COMBINING GREATER THAN ABOVE (B88) COMBINING DOWNWARDS ARROW ABOVE (M37) COMBINING MULTIPLICATION SIGN BELOW (A72) COMBINING WAVY LINE BELOW (A/XV) connects to left and right Latin letters LATIN SMALL LETTER U WITH SHORT RIGHT STEM (A69) LATIN SMALL LETTER U WITH LEFT CURL (B81) LATIN SMALL LETTER E WITH FLOURISH (B83) LATIN SMALL LETTER SHORT ESH (B89) (reduced to caps height) LATIN CAPITAL LETTER R WITH RIGHT HOOK (B92) LATIN SMALL LETTER R WITH RIGHT HOOK (due to stability policies) LATIN SMALL LETTER L WITH MIDDLE LEMNISCATE (B93) LATIN SMALL LETTER L WITH MIDDLE DOUBLE TILDE (B93) LATIN SMALL LETTER X WITH TWO LONG LEGS (S73) (no chi! -straight when not italics) LATIN SMALL LETTER X WITH TWO LONG LEGS AND RIGHT CURL (S73) MODIFIER LETTER SMALL X WITH TWO LONG LEGS AND RIGHT CURL (S74) LATIN SMALL LETTER X WITH LEFT LONG LEG (S73) LATIN SMALL LETTER X WITH LEFT LONG LEG AND RIGHT CURL (S73) LATIN SMALL LETTER X WITH AND RIGHT CURL (S73) LATIN LETTER SMALL CAPITAL SCRIPT R (S76) Punctuation marks LEFT SMALL DOUBLE PARENTHESIS (A72, A/XV) marking standard language in dialectology RIGHT SMALL DOUBLE PARENTHESIS (A72, A/XV) Miscellaneous Technical LARGE PLUS SIGN (A/XV) marking missing inquiry about a term in dialectology CombiningLetters above COMBINING LATIN SMALL LETTER O UMLAUT (A66) COMBINING LATIN SMALL LETTER U UMLAUT (B80) COMBINING LATIN SMALL LETTER SCHWA COMBINING LATIN SMALL LETTER B (A72) COMBINING LATIN SMALL LETTER F (B89) COMBINING LATIN SMALL LETTER SHARP S (B89) COMBINING LATIN SMALL LETTER L WITH DOUBLE MIDDLE TILDE (B93) COMBINING LATIN SMALL LETTER ESH (S73) COMBINING LATIN SMALL LETTER O WITH TEUTHONISTA RAY (S68) COMBINING LATIN SMALL LETTER U WITH TEUTHONISTA RAY (S68) 6

Appendix 2 (added for the printed version) Some notes from the discussion held after the presentation at the Vienna workshop 1. As Unicode encodes characters, not glyphs, there cannot be assumed that the characters are displayed using a specific font style, as italics. As it is a common practice to typeset transcripted phrases in italics, there are character designs which base on their appearance in an italic font. An example was the "Bavarian f" presented at the workshop, which in italics look as an ordinary "f" having its descender cut off. However, it was not explained how this character has to look if a normal (i.e. not slanted or italicized) sans serif font is used: To get such a character to be encoded, such peculiarities must be solved. 2. It is not necessary to assign code points to combinations of a base letter with one or several diacritical marks. As it was shown, Unicode is capable to stack an unlimited number of diacritical marks above or below a base letter. The "rendering system" (i.e. the combination of the operating system, the text processing system, and the selected font) is responsible to display the result correctly, including issues like adjusting the line spacing correctly. However, it can be practical to design such combinations in a font, to ensure that a specific glyph design will be used when a special combination of base letter + diacritics is encountered, which is aesthetically superior to the display the rendering system would produce. Such fonts do not need to assign a code point to such a combined glyph. They can assign an internal handle like a "PostScript glyph name" to it which in turn is declared in the font to be used when a specific sequence of code points is encountered. Thus, such fonts do not even need to be "complete". If a diacritic combination is encountered for which there is no glyph in the font, the automatic rendering system will be active, producing a display which may aesthetically less pleasant, while staying semantically correct. It has to be admitted, however, that rendering systems which do diacritic stacking for Latin are hard to find today. Thus, having fonts containing glyphs for all diacritic combinations used within a project is an interim solution until the common rendering systems (text processing software as well as Internet access software) do as they are expected to do. There is at least progress in this direction, which can be seen on comparing the most recent versions of Internet Explorer or Firefox with older ones. 7