Challenges of Multilingual Web in India : Technology Development & Standardization Perspective




Swaran Lata, Country Manager, W3C India & Head, TDIL Programme, Dept. of Information Technology, Govt. of India
E-mail: swaran@w3.org, slata@mit.gov.in

Organization of my talk:
India:
- Internet scenario in India
- Multilingual web-based services in India
- Major web-based services in India: E-Governance Common Service Centers & citizen-centric services; Unique ID Project (Aadhaar)
Multilingual Complexity in India:
- Origin of Indic scripts & script families
- Complex characteristics
Technology Challenges:
- Ab-initio development of language computing solutions
- Non-conformance with the English-centric model
- Complex nature of key technologies (Machine Translation, Cross-lingual Information Access, Character Recognition etc.)
- Some key initiatives
Standardization Challenges:
- Core: Encoding, Input, Display
- Middleware: HTML, CSS, Mobile Web, Web Accessibility
- Application level: presentation issues & web design implementation issues

Internet Usage Statistics in India

Wireless Subscriber Statistics in India
- Total telephone subscriber base reaches 706.37 million
- Wireless subscription reaches 670.60 million
- Wireline subscription reaches 35.77 million
- Overall tele-density reaches 59.63
- Broadband subscription is 10.08 million

E-Governance Common Service Centers & Citizen-centric Services
The Government of India has launched the National e-Governance Plan (NeGP) for G2G, G2B, G2E and G2C services.
Modalities for Government on the Web:
- Provide: public services on the web, either transactional or information services or both.
- Engage: with citizens and businesses, on government terms or on the citizens' terms.
- Enable: public sector information re-use.

Web based services in India: Panchayati Raj; Department of Agriculture & Cooperation; Bengal State Portal; Kerala IT Mission

Web based services in India: Tamil Nadu Government Portal; Gujarat State Portal; E-Choupal; Jan Mitra

Online Services under National e-Governance Plan
Central Mission Mode Programmes (MMPs): Income Tax, Passport/VISA, Company Affairs, Central Excise, Pensions, Land Records, Road Transport, Property Registration, Agriculture, Municipalities, Police, Employment Exchange, E-Courts
State Mission Mode Programmes (MMPs): Agriculture, Commercial Taxes, e-District, Employment Exchange, Land Records, Municipalities, Gram Panchayats, Police, Road Transport, Treasuries

Nation-wide Unique ID Project
- The Unique Identification Authority of India (UIDAI) was established in 2009 by the Government of India, with the developmental mandate of setting up the infrastructure to provide a universal way of uniquely identifying Indian residents.
- AADHAAR is a 12-digit unique identification number (UID) that will be issued after collecting the demographic and biometric information of an individual.
- AADHAAR's guarantee of uniqueness and its centralized online identity form a basis for building multiple services and applications.
- AADHAAR gives the ability to access these web-based services and resources anytime, anywhere in the country.
- Aadhaar: towards a foundation for nation-wide multilingual web-based services accessible to citizens irrespective of devices and platforms.

Host of On-line Nation-wide Web Based Services for a population of over 1 billion

Multilingual Complexity in India

Languages of India
- According to Census 2001, India has 122 major languages and 2371 dialects.
- Of the 122 languages, 22 are constitutionally recognized.
- Linguistic diversity in India is rich and wide:
  - One language, many scripts
  - Many languages, one script
  - Cultural differences by region, even where different languages share the same script
  - Wide differences within the same language across different parts of the country

States / Union Territories and their Languages
Chhattisgarh: Hindi
Himachal Pradesh: Hindi
Arunachal Pradesh: Assamese
Karnataka: Kannada
Kerala: Malayalam
Madhya Pradesh: Hindi
Maharashtra: Marathi
Manipur: Manipuri (Meitei)
Mizoram: Mizo, English
Nagaland: English
Orissa: Oriya
Punjab: Punjabi
Rajasthan: Hindi
Sikkim: Nepali, English
Tamil Nadu: Tamil
Tripura: Bengali, English, Kokborok
Andhra Pradesh: Telugu, Urdu
Chandigarh: Punjabi, Hindi
Goa: Konkani, Marathi
Gujarat: Gujarati, Hindi
Haryana: Hindi, Punjabi
Jharkhand: Hindi, Santhali
Lakshadweep: Malayalam, English
Meghalaya: English, Khasi, Garo
Uttar Pradesh: Hindi, Urdu
West Bengal: Bengali, Nepali
Assam: Assamese, Bengali, Bodo
Bihar: Maithili, Hindi, Urdu
Dadra and Nagar Haveli: Gujarati, Marathi, Hindi
Daman and Diu: Gujarati, English, Marathi
Delhi: Hindi, Punjabi, Urdu
Jammu and Kashmir: Urdu, Kashmiri, Dogri
Puducherry: Tamil, Malayalam, Telugu
Uttarakhand: Hindi, Sanskrit, Urdu
Andaman and Nicobar Islands: Hindi, Bengali, Tamil, Telugu

Languages of India: Major Scripts and Corresponding Languages in India
[Figure: family tree of Indic scripts. Nodes named in the diagram include the Indus script (proto-Brahmi?), Brahmi (Ashokan) and Kharoshthi; the Northern (Gupta) scripts and the Southern scripts; Sharda, Landa, Gurmukhi, Nagari, Devanagari, Kaithi, Gujarati, Kutil, Jain Nagari, Gauri, Oriya, Bangla, Assamese, Maithili, Nepali (Newari), Tibetan and Ol-Chiki; Pallava, Grantha, Tamil, Malayalam, Telugu, Kannada and Sinhalese; and the South-east Asian scripts (Burmese, Thai, Cambodian, Indonesian, Malaysian, Vietnamese, Philippine etc.). The dates attached to individual nodes are not recoverable from the transcript.]

Technology Challenges

Technologies Developed under Consortium-mode Projects
- English to Indian Languages Machine Translation System [6 language pairs: English to Hindi, Marathi, Bengali, Oriya, Tamil, Urdu]
- Indian Languages to Indian Languages Machine Translation System [9 language pairs: Telugu-Hindi, Hindi-Tamil, Urdu-Hindi, Kannada-Hindi, Punjabi-Hindi, Marathi-Hindi, Bengali-Hindi, Tamil-Telugu, Malayalam-Tamil]
- Cross-Lingual Information Access [6 languages: Hindi, Bengali, Tamil, Marathi, Telugu and Punjabi]
- Optical Character Recognition Systems [10 scripts: Bangla, Devanagari, Malayalam, Gujarati, Tamil, Telugu, Kannada, Oriya, Gurmukhi, Tibetan]
- On-line Handwriting Recognition System [6 scripts: Hindi, Bengali, Tamil, Telugu, Kannada and Malayalam]

Complex nature of key technologies: Machine Translation
Sample Outputs for English-Hindi

Sample Outputs for English-Urdu

Sample Outputs for English-Bangla

Sample Outputs for English-Tamil

Indian Language to Indian Languages Machine Translation System

Sample Outputs for Punjabi-Hindi

Cross-Lingual Information Access integrated with Machine Translation
[Diagram. Labelled components: query in English or in Indian languages; input processing (query translation / transliteration); crawling, searching & indexing over an English database; machine translation service (English to Indian languages); search result output in English or in the Indian language.]

Cross-Lingual Information Access (CLIA)
In CLIA, the input query is in one language and information is retrieved in another. The query language is one of Bangla, Hindi, Marathi, Punjabi, Tamil and Telugu. The retrieved documents are in English, Hindi or the language of the query.
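The query-translation step described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the consortium's CLIA system: the two-entry Hindi-to-English lexicon, the three English documents, and the function names `translate_query`, `search` and `cross_lingual_search` are all hypothetical, and ranking is a plain count of matching terms.

```python
# Toy Hindi -> English lexicon (hypothetical, for illustration only).
HI_EN = {"किसान": "farmer", "पानी": "water"}

# Toy English "database" (hypothetical documents).
DOCS = {
    "d1": "water supply schemes for rural areas",
    "d2": "credit card scheme for every farmer",
    "d3": "farmer access to water for irrigation",
}


def translate_query(query, lexicon):
    """Translate a query term by term; unknown terms pass through unchanged.

    A real system would transliterate or otherwise handle unknown terms.
    """
    return [lexicon.get(term, term) for term in query.split()]


def search(terms, docs):
    """Rank documents by how many query terms they contain (ties by id)."""
    scores = {}
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        score = sum(1 for term in terms if term.lower() in words)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def cross_lingual_search(query):
    """Hindi query in, ranked English document ids out."""
    return search(translate_query(query, HI_EN), DOCS)


if __name__ == "__main__":
    print(cross_lingual_search("किसान पानी"))
```

The design point is that only the query crosses the language boundary; the index stays monolingual, and translating the retrieved documents back would be a separate machine translation step.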

Bengali Monolingual Retrieval

Bengali-English Cross-lingual Retrieval

Bengali-Hindi Cross-lingual Retrieval

OCR: original image loaded in the input window of the OCR; output of Devanagari OCR

OCR: original image loaded in the input window of the OCR

OCR: output of Tamil OCR

Sample OHWR (On-line Handwriting Recognition) form: Tamil

Speech Technology
- Speech corpus: annotated speech corpora of approximately 50 hours developed for Hindi, Marathi, Punjabi, Bengali, Assamese, Manipuri, Tamil, Malayalam, Telugu and Kannada.
- Text-to-speech for 6 Indian languages (Hindi, Bengali, Marathi, Tamil, Telugu and Malayalam) is being developed for the visually challenged section of society.
- Speech recognition: a consortium-mode project has been initiated for development of Automatic Speech Recognition systems for agricultural commodity prices in six Indian languages: Hindi, Tamil, Telugu, Bengali, Assamese and Marathi.
- A phonetic engine for speech recognition systems is being developed for Hindi and Telugu.

Sample Waveform for Tamil Speech: Phonetic Engine

Standardization Aspects for Supporting Multilingual Web in Indian Languages

Multilingual Requirement

Character Encoding: UNICODE
- Basis of the multilingual web: all data exchange would be possible seamlessly across devices and platforms.
- Unicode encoding for all 22 constitutionally recognized Indian languages is complete.
- Unicode has been declared the standard for data storage for web-based e-governance services in India.
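As a small illustration of why a single encoding helps data exchange, the sketch below prints the code point, script and UTF-8 bytes of the letter "ka" in seven Indian scripts, using only Python's standard `unicodedata` module. The helper names `script_of` and `describe` are hypothetical, and taking the first word of the Unicode character name as the script label is a shortcut for this example, not the Unicode Script property.

```python
import unicodedata

# The letter "ka" in seven scripts, written as escapes to avoid font issues:
# Devanagari, Bengali, Gurmukhi, Gujarati, Tamil, Telugu, Kannada.
KA = ["\u0915", "\u0995", "\u0a15", "\u0a95", "\u0b95", "\u0c15", "\u0c95"]


def script_of(ch):
    """Rough script label: first word of the Unicode character name."""
    return unicodedata.name(ch).split()[0]


def describe(ch):
    """Code point, rough script label and UTF-8 bytes (hex) of one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "script": script_of(ch),
        "utf8": ch.encode("utf-8").hex(),
    }


if __name__ == "__main__":
    for ch in KA:
        print(describe(ch))
```

Every one of these letters has its own code point and a three-byte UTF-8 form, so text in any of the scripts can be stored and exchanged in the same byte stream without switching fonts or code pages.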

Styling Issues in Indic Languages

Drop letters in Indian languages: issues with respect to the first character, as used in Hindi, Malayalam, Bengali, Telugu, Gujarati etc.

Underlining of characters: there are examples in Indian languages where matras are not readable because the characters are underlined.

Vertical arrangements; bullets and numbering; indentation of characters

Underlining of characters

Major Identified Problems in Styling:
- Grapheme cluster problems for:
  - Vertical writing style
  - Drop-initial views of the first-letter element
- Bullets & numbering issues
- Justification problems
- Horizontal letter spacing problems
Most browsers are unaware of syllable boundaries for Indic scripts. The current workaround is to enclose the first syllable in a separate span governed by a dedicated CSS class. Mozilla Firefox seems to recognize it well.
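The span workaround can be sketched as follows. This is a minimal approximation under stated assumptions, not the UAX #29 algorithm and not any browser's logic: `rough_syllables` simply attaches combining marks and joiners to the preceding character and keeps a consonant that follows a virama in the same cluster, and the class name `drop-initial` is a hypothetical choice.

```python
import html
import unicodedata

VIRAMA_CCC = 9  # canonical combining class of virama (halant) signs
JOINERS = {"\u200c", "\u200d"}  # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER


def rough_syllables(text):
    """Split text into rough orthographic syllables (simplified heuristic)."""
    clusters = []
    for ch in text:
        attach = False
        if clusters:
            prev = clusters[-1][-1]
            category = unicodedata.category(ch)
            is_mark = category.startswith("M")  # vowel signs, virama, nukta...
            after_virama = (
                unicodedata.combining(prev) == VIRAMA_CCC
                and category.startswith("L")  # consonant completing a conjunct
            )
            attach = is_mark or ch in JOINERS or after_virama
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def wrap_first_syllable(text, css_class="drop-initial"):
    """Wrap the first rough syllable in a span so CSS can style it as a unit."""
    clusters = rough_syllables(text)
    if not clusters:
        return ""
    first, rest = clusters[0], "".join(clusters[1:])
    return (
        f'<span class="{html.escape(css_class)}">{html.escape(first)}</span>'
        f"{html.escape(rest)}"
    )


if __name__ == "__main__":
    # The Devanagari word "kshatriya", written with escapes.
    word = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"
    print(rough_syllables(word))
    print(wrap_first_syllable(word))
```

For this word the conjunct "ksha" stays together in the span instead of only the first code point being styled, which is the failure the slide describes for drop initials.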

Approach to be taken for Possible Solution
1. Grapheme cluster problems: adoption of UAX #29 (the Unicode Text Segmentation algorithm, addressing the complexities of Indian grapheme clusters)
2. Bullets & numbering issues
3. Justification problems
4. Horizontal letter spacing problems
Development of a complete mapping table covering detailed requirements for document layout, typography, typesetting and calligraphic conventions, i.e. a Styling Manual. The rules developed in the Styling Manual need to be converted for inclusion in HTML and CSS.

- An activity involving in-depth analysis of browser rendering engines needs to be taken up on an urgent basis.
- A reference web-page layout engine needs to be developed which could act as a benchmark for other browser vendors to emulate.
- Mozilla Firefox could be taken up for modification to act as a reference implementation.
- A Core Working Group is being formed under the aegis of the W3C Internationalization Group.

Enabling speech interface for web

Indian language requirement in Speech Interface

W3C Pronunciation Lexicon Specification (PLS 1.0)
The W3C Voice Browser Activity has published the Pronunciation Lexicon Specification (PLS) Version 1.0. PLS is designed to enable interoperable specification of pronunciation information for both speech recognition and speech synthesis engines within voice browsing applications.
PLS attributes:
- Multiple pronunciations
  - For ASR: to accommodate speaker/regional variability and non-native speakers
  - For TTS: a single preferred pronunciation will be selected
- Multiple orthographies (with the same pronunciation): useful for both ASR & TTS
- Homophones (same pronunciation, different meanings)
- Homographs (same spelling, different pronunciations)

Proposed Approach for incorporating Multilingual Requirements in PLS:
Usage of Part of Speech (POS) information for resolving multiple pronunciations
Part of speech plays an important role in pronunciation in Indian languages (such as Bangla and Hindi). Depending on the POS, the same orthography can produce different pronunciations. POS as an attribute:

Usage of morphological information for resolving multiple pronunciations
In Indian languages, not only POS information but also morphological information is crucial in determining the pronunciation of a homograph. This information can be defined in the same attribute or element used for POS, using a proper POS tag set for that language. Example: Bengali
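The idea of selecting a pronunciation by POS and morphological tag can be sketched as a simple lookup. Since the slide's Bangla and Hindi examples did not survive in the transcript, the sketch uses two English homographs as stand-ins; the tag names (`noun`, `verb`, `verb.present`, `verb.past`), the `LEXICON` table and the `pronounce` function are hypothetical and are not part of PLS 1.0.

```python
# (orthography, tag) -> IPA pronunciation. English stand-ins, for illustration:
# "record" differs by part of speech, "read" by tense (a morphological feature).
LEXICON = {
    ("record", "noun"): "ˈrɛkɔːd",
    ("record", "verb"): "rɪˈkɔːd",
    ("read", "verb.present"): "riːd",
    ("read", "verb.past"): "rɛd",
}


def pronounce(word, tag=None):
    """Return candidate pronunciations for a word.

    Without a tag every pronunciation of the homograph is returned (useful
    for recognition); with a tag only the matching one is returned (useful
    for synthesis, where a single pronunciation must be chosen).
    """
    return [
        ipa
        for (orthography, entry_tag), ipa in LEXICON.items()
        if orthography == word and (tag is None or entry_tag == tag)
    ]


if __name__ == "__main__":
    print(pronounce("record"))             # both candidates
    print(pronounce("record", "verb"))     # resolved by part of speech
    print(pronounce("read", "verb.past"))  # resolved by morphology
```

Carrying the morphological feature inside the same tag string mirrors the slide's suggestion that both kinds of information can share one attribute, provided the tag set for the language is rich enough.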

Web Accessibility & Implementation

Challenges:
- Make web content accessible to people with disabilities w.r.t. Indian languages
- WCAG 2.0 guidelines for success criteria vis-a-vis selected recommendations relevant to the Indian context
Initiatives in India:
- Guidelines for Indian Government websites by NIC, Govt. of India
- STQC: implementing WCAG 2.0 accessibility through Website Quality Certification
- Centre for Internet and Society: developing an authorized translation of the WCAG 2.0 guidelines
Roadmap:
- Meet WCAG 2.0 guidelines & techniques w.r.t. Indian languages
- Initiation with Hindi, Bangla, Marathi, Telugu, Tamil

Some of the WCAG 2.0 Compliant Web Sites
Name of the Portal: Level of Compliance
www.socialjustice.nic.in (Ministry of Social Justice & Empowerment): AA
www.mit.gov.in (Department of Information Technology): AA

Issues for enabling Mobile Web in Indian languages
- Character encoding
- Bandwidth and cost
- Backward compatibility with legacy devices
- Lack of standardization
- Fonts
  - Bitmap fonts (used by low-cost handsets)
  - TrueType fonts (used by high-end handsets)
  - OpenType fonts (currently in wider use)
- Common storage format
- Mobile messaging in Indic languages
- Presentation issues

Mobile keypad issues
- Multi-tap issues
  - Too many taps per key for each character
  - No way to know which character is on which key
  - A keypad cannot support more than one language, because there is not enough space on the key face to print more characters
- Dictionary-based transliteration
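A rough worked calculation shows why multi-tap scales badly as the character inventory grows. The sketch assumes characters are spread evenly over the keys in order and that every character is equally likely; the inventory size of 48 is a hypothetical figure chosen for illustration, not a count for any particular script, and `multitap_cost` is a hypothetical helper.

```python
import math


def multitap_cost(n_chars, n_keys):
    """Taps needed when n_chars are packed onto n_keys, filling key by key."""
    per_key = math.ceil(n_chars / n_keys)
    # The i-th character sits at position (i % per_key) on its key and
    # needs that many presses plus one.
    taps = [(i % per_key) + 1 for i in range(n_chars)]
    return {
        "chars_per_key": per_key,
        "max_taps": max(taps),
        "avg_taps": sum(taps) / n_chars,
    }


if __name__ == "__main__":
    print(multitap_cost(26, 8))  # a 26-letter alphabet on 8 keys
    print(multitap_cost(48, 8))  # a hypothetical 48-character inventory
```

Under these assumptions the larger inventory needs up to 6 presses for one character and 3.5 on average, and each key face would have to show 6 glyphs, which is the space problem noted above.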

Level of adherence to W3C Mobile OK while testing 10 Hindi websites

Proposed approach: Gap Analysis
W3C MWI: Indian requirements
- MWI standardization activity: MWI standardization requirements in India
- W3C Mobile OK: implementation of Mobile OK for Indian languages
- Mobile OK test suites: implementation of Mobile OK test suites for Indian languages (ILs)
- Default Delivery Context for mobile applications: need to update the Default Delivery Context for mobile applications in ILs

Development of Recommendations for Mobile OK in Indian Languages
[Diagram. The W3C current scope (MWBP, Mobile OK Scheme, Mobile OK Checker, Mobile OK Tests) and the requirements of the Indian scenario feed a gap analysis, which leads to observations and recommendations for future activity.]

Future Initiatives

Web Based Services on Cloud Computing Environment
[Diagram. Hosted in the cloud: query in English or in Indian languages; input processing (query translation / transliteration); crawling, searching & indexing over an English database; machine translation service (English to Indian languages); search result output in English or in the Indian language.]

Semantic Web
1. Development of WordNet for Indian languages.
2. Development of the Multimedia Web Ontology Language (MOWL).
MOWL is designed as an extension of OWL, the W3C-recommended ontology language for the web. MOWL supports creation of and reasoning with perceptual modeling of concepts, and probabilistic evidential reasoning.
Example (figure): formation of the concept "Medieval Indian Monument" and the abstracted visual patterns that are expected on an embodiment of the concept in a multimedia artifact.

Some Future Initiatives:
- Multilingual requirements for Indian languages for HTML 5.0
- Multilingual requirements for Voice Browser in Indian languages
- Gap analysis for implementation of Mobile Web Best Practices in Indian languages
- Multilingual Linked Data for e-governance data
The Internationalization Best Practices would be the foundation of all the above initiatives.

Thanks & Questions