Multilingual Support in GSDL

Similar documents
National Language (Tamil) Support in Oracle An Oracle White paper / November 2004

SETTING UP A MULTILINGUAL INFORMATION REPOSITORY : A CASE STUDY WITH EPRINTS.ORG SOFTWARE

Red Hat Enterprise Linux International Language Support Guide

Introduction to Unicode. By: Atif Gulzar Center for Research in Urdu Language Processing

Preservation Handbook

Right-to-Left Language Support in EMu

Internationalization & Localization

Product Internationalization of a Document Management System

DRH specification framework

Setting up and Automating a MS Dynamics AX Job in JAMS

32-Bit Workload Automation 5 for Windows on 64-Bit Windows Systems

Data Integrator. Encoding Reference. Pervasive Software, Inc B Riata Trace Parkway Austin, Texas USA

LuitPad: A fully Unicode compatible Assamese writing software

Your single-source partner for corporate product communication. Transit NXT Evolution. from Service Pack 0 to Service Pack 8

The Unicode Standard Version 8.0 Core Specification

TYPING IN ARABIC (WINDOWS XP)

Table Of Contents. iii

FileMaker 14. ODBC and JDBC Guide

Unicode Enabling Java Web Applications

Using Impatica for Power Point

Using CertAgent to Obtain Domain Controller and Smart Card Logon Certificates for Active Directory Authentication

DiskPulse DISK CHANGE MONITOR

Translating QueueMetrics into a new language

Internationalizing the Domain Name System. Šimon Hochla, Anisa Azis, Fara Nabilla

The Virtual Tibetan Classroom

Chapter 4: Computer Codes

Jet Data Manager 2012 User Guide

Grandstream Networks, Inc.

Multi-lingual Label Printing with Unicode

PRICE LIST. ALPHA TRANSLATION AGENCY

Data Tool Platform SQL Development Tools

StreamServe Persuasion SP4 StreamServe Connect for SAP - Business Processes

EMCO Network Inventory 5.x

Reading and Writing in Chinese

MULTI-FIND/CHANGE. Automatication VERSION 1.02

HKSCS-2004 Support for Windows Platform

Japanese Character Printers EPL2 Programming Manual Addendum

Sterling Web. Localization Guide. Release 9.0. March 2010

Administrator Manual Across Personal Edition v6 (Revision: February 4, 2015)

FileMaker Server 12. Custom Web Publishing with PHP

MS Outlook AddIn version 3.6

Internationalization of the Domain Name System: The Next Big Step in a Multilingual Internet

Tibetan For Windows - Software Development and Future Speculations. Marvin Moser, Tibetan for Windows & Lucent Technologies, USA

FileMaker Server 9. Custom Web Publishing with PHP

Community Builder Language Package Guide Updated for CB 1.2.3

Oracle Universal Content Management

BarTender Web Print Server

How to test and debug an ASP.NET application

FileMaker 11. ODBC and JDBC Guide

Bitrix Site Manager 4.0. Quick Start Guide to Newsletters and Subscriptions

Easy Bangla Typing for MS-Word!

Administrator Manual Across Translator Edition v6.3 (Revision: 10. December 2015)

PRECISION v16.0 MSSQL Database. Installation Guide. Page 1 of 45

Administrator Manual Across Personal Edition v5.7 (Revision: January 2, 2014)

Connectivity Pack for Microsoft Guide

How To Write A Domain Name In Unix (Unicode) On A Pc Or Mac (Windows) On An Ipo (Windows 7) On Pc Or Ipo 8.5 (Windows 8) On Your Pc Or Pc (Windows

Keywords : complexity, dictionary, compression, frequency, retrieval, occurrence, coded file. GJCST-C Classification : E.3

1. Installing the client module on the phone

Using SQL Server Management Studio

QAD Business Intelligence Release Notes

FileMaker 13. ODBC and JDBC Guide

SuperOffice CRM for Windows. and. Eastern European characters

Chapter 2 Text Processing with the Command Line Interface

PHOTOSTORE 3 SERIES MANUAL TABLE OF CONTENTS

HP Business Notebook Password Localization Guidelines V1.0

OxyClassifieds Installation Handbook

The Indian National Bibliography: Today and tomorrow

Using Logon Agent for Transparent User Identification

FileMaker Server 13. Custom Web Publishing with PHP

Personal Token Software Installation Guide

User Manual V1.3. NCB File /alahlincb

TUTORIAL 4 Building a Navigation Bar with Fireworks

SQL Server An Overview

Internationalization of Domain Names

Speaking your language...

The following features have been added to FTK with this release:

Eventia Log Parsing Editor 1.0 Administration Guide

FmPro Migrator - FileMaker to SQL Server

HOMEWORK # 2 SOLUTIO

Polar Help Desk Installation Guide

Banana is a native application for Windows, Linux and Mac and includes functions that allow the user to manage different types of accounting files:

14.1. bs^ir^qfkd=obcib`qflk= Ñçê=emI=rkfuI=~åÇ=léÉåsjp=eçëíë

Perfect PDF 8 Premium

OpenIMS 4.2. Document Management Server. User manual

Introducing the Greenstone

InfiniteInsight 6.5 sp4

Secret Communication through Web Pages Using Special Space Codes in HTML Files

2Creating Reports: Basic Techniques. Chapter

StreamServe Persuasion SP4 Connectors

Oracle Database 11g Express Edition PL/SQL and Database Administration Concepts -II

Bitrix Framework. Localization Guide

Adobe Acrobat 6.0 Professional

alternative solutions, including: STRONG SECURITY for managing these security concerns. PLATFORM CHOICE LOW TOTAL COST OF OWNERSHIP

How are tags and messages archived in WinCC flexible? WinCC flexible. FAQ May Service & Support. Answers for industry.

Nesstar Server Nesstar WebView Version 3.5

Fiery E100 Color Server. Welcome

Xtreeme Search Engine Studio Help Xtreeme

Customizing Display Views

Microsoft SQL Server Connector for Apache Hadoop Version 1.0. User Guide

Stored Documents and the FileCabinet

Transcription:

Multilingual Support in GSDL K.T.Anuradha NCSI, Indian Institute of Science, Bangalore 560 012, Karnataka, India Email: anu@ncsi.iisc.ernet.in OR anu_ncsi@yahoo.co.in http://vidya-mapak.ncsi.iisc.ernet.in

Multilingual Support in GSDL Topics covered: What are char set, encoding & fonts Creating interface in your language for GSDL How to set default interface language How to handle multilingual content Unicode text Non-Unicode text Limitations

Charsets, encoding and fonts Charset:- is a bunch of characters, in the way a human would understand them. Ex: A, B, C so on are charset of Latin English Encoding:- is a way of storing characters on a computer as bits. Ex: ASCII, EBCIDIC, Unicode etc. Font:- is an Image or glyph for a particular character.

Charsets & Encoding Widely used charsets ISO 8859 series, windows series, gbk, ISO 10646 (utf-8, utf-16, utf-32), ISCII, user-defined, etc.. Unicode Emerging encoding standard Assigns unique hexadecimal numbers for more than 65,000 characters Ex: U+0041 is hexadecimal number of A

UNICODE.. Encoding chosen by operating systems for supporting various non-english languages on computers. Though it supports all the languages of the world, the operating systems such as Windows, Linux may not have implemented all the languages yet. On Windows, the support for Indian languages was not available until Windows 2000 was released.

Unicode.. Covers almost all international standards First 125 characters are same as ASCII Synchronized with the corresponding versions of ISO-10646 (utf-8, utf-16, utf-32) Groups the characters together by scripts in code blocks

Enabling Indian Languages In Windows XP and Windows 2000, you should first enable the Indian languages before you can use Indian language UNICODE. You can enable the Indian language support using Control Panel --> Regional Options applet. If you are using an operating system that does not support the Indian language UNICODE, then you will not see UNICODE text properly. Instead you will see 'square box' or 'question mark' characters.

Language Fonts.. Kannada, Telugu, Gujarati, Punjabi UNICODE for Kannada, Telugu, Gujarati and Gurumukhi scripts is supported only in Windows XP and later operating systems. Following are different language fonts: Open Type font "Tunga" to display Kannada UNICODE text. Open Type font "Gautami" to display Telugu UNICODE text.

Language Fonts.. Kannada, Telugu, Gujarati, Punjabi Open Type font "Shruti" to display Gujarati UNICODE text. Open Type font "Raavi" to display Gurumukhi UNICODE text.

Language Fonts.. Oriya Oriya script is not yet supported by Windows operating systems. Windows XP shows Oriya UNICODE, provided a Open Type font that has Oriya characters (such as Arial Unicode MS) is installed in the system.

Language Fonts.. Malayalam, Bengali UNICODE for Malayalam, Bengali scripts is supported only in Windows XP SP2 and later operating systems. Open Type font "Kartika" to display Malayalam UNICODE text. Open Type font "Vrinda" to display Bengali UNICODE text.

Language Fonts.. Devanagari, Tamil UNICODE for Devanagari, Tamil scripts is supported only in Windows 2000 and and later operating systems. Windows 2000/Windows XP uses an Open Type font "Mangal" to display Devanagari UNICODE text. Windows 2000/Windows XP uses an Open Type font "Latha" to display Tamil UNICODE text.

Unicode support in GSDL Unicode is used throughout the process All major charsets are internally converted into utf-8 through mappings Different charsets mapped from and to Unicode are as follows ISO 8859 series, windows series, gbk, big5, ShiftJis, uhc, ISCII, kio8_r, kio8_u, euc_jp, dos866 GSDL mapping files are stored in gsdl/mappings folder

Interface Customization Requirements Operating system and browser should support your language Install IME (Input Method Editor) for your language (Unicode supported) Try out!!!!

Interface Customization Create a macro file for your language > Make a copy of english.dm & rename > Translate each macro value > If it is Unicode encoding, select Unicode format for saving Ex _textimagehome_{ज.एस.ड.एल } > If it is ASCII encoding, select text document format for saving Ex.. _textimagehome_ {Home page}

Interface Customization If your language is in Non-ASCII or Non-Unicode format copy the encoding Ex _textimagehome_{å åš é µ} Pass language argument in front of each macro Follow ISO 639 two letter language standard Ex.. _textimagehome_ [l=zh] {} Zh=Chinese

Configuration in main.cfg Include newly created macro file in macro files list of main.cfg Configure your language by passing short name, long name, & default encoding arguments Language shortname=hi longname=hindi default_encoding=utf-8 Lets try!!!!

Default interface language For entire server: Add the following argument at the end of main.cfg cgiarg shortname=l argdefault=xx (this is already added in GSDL 2.70 onwards) For particular collection: Use Languages" format option in collect.cfg demo

Multilingual Content: Unicode content Operating system and browser should support your language Input filenames and its extensions should be in ASCII (English) Can build the collection using GLI or in command line mode

Settings to view collection Select respective language interface from preference page Select utf-8 encoding from preference page Enable auto select or select utf-8 encoding in browser i.e, View -> encoding -> utf-8

Search Need to install respective IME (Input Method Editor) for your language or online keyboard Browser should support for passing query string Can search for a particular word or phrase Advance searching is bit complicated i.e, Boolean and proximity operators should be typed in Latin language only Demo.

Non-Unicode mapped content Collection building is normal Content should follow any one of the mapped native encoding format Respective encoding should be enabled in etc/main.cfg file GSDL will internally map the encoding into utf-8 and build the index Converts back to native encoding while displaying

Settings to view collection Select your language interface from preference page Select native encoding from preference page Enable Auto-select or select native encoding in the browser Ex: Your collection is in windows-1256, select Arabic (windows-1256) in preference page, Arabic(windows) encoding in browser

Searching Need to install IME (Unicode supported) of your language or use Online keyboard Browser should support for passing query string Search features are same as Unicode content Demo.

Non-Unicode unmapped content Collection which uses encoding like user-defined X-user-defined etc.. GSDL will index as it is (mapping won t take place)

Settings Need to have particular fonts in fonts folder of operating system Enable auto-select option in browser Searching: query string should be entered in basic encoding format

Limitations Alphabetical sorting? Stemming? Boolean and proximity search operators should be entered in ASCII form only Appears to support html, word files

For better results You need to Understand > Operating systems multilingual support > Browsers multilingual support > Charsets, encodings, fonts > Html properties > Unicode

BARAHA Baraha allows you to convert the Indian language text into UNICODE format in various manners as follows. You can copy the text in UNICODE format using Edit Copy Special command. You can type UNICODE text directly into any applications that support UNICODE using Baraha Direct. You can save the Baraha documents as UNICODE text files using the File Export command. You can save the Baraha documents as HTML files which use UTF- 8 encoding using File Export command.

Resources GSDL developers guide Greenstone archives collection Greenstone wiki page http://www.unicode.org

Summarizing Creating multilingual interface Handling multilingual collection Unicode content Non-Unicode mapped content Non-Unicode unmapped content Limitations

Thank You! To view sample collection visit: http://vidya-mapak.ncsi.iisc.ernet.in/bahubhashi Send your feedback to anu@ncsi.iisc.ernet.in anu_ncsi@yahoo.co.in