FreeDict: an Open Source repository of TEI-encoded bilingual dictionaries



Similar documents
The presentation explains how to create and access the web services using the user interface. WebServices.ppt. Page 1 of 14

Advice Document: Bilingual Drafting, Translation and Interpretation

XML Processing and Web Services. Chapter 17

PHP Tutorial From beginner to master

- a Humanities Asset Management System. Georg Vogeler & Martina Semlak

Novell Identity Manager

Problems with the current speling.org system

Declarative Parsing and Annotation of Electronic Dictionaries

Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia

An XML Based Data Exchange Model for Power System Studies

The FLOSSWALD Information System on Free and Open Source Software

Eventia Log Parsing Editor 1.0 Administration Guide

How To Create A Charter Corpus On The Web (For Historians)

Project 4: IP over DNS Due: 11:59 PM, Dec 14, 2015

IBM Tivoli Federated Identity Manager V6.2.2 Implementation. Version: Demo. Page <<1/10>>

XML. CIS-3152, Spring 2013 Peter C. Chapin

Proxima 2.0: WYSIWYG generic editing for Web 2.0

EFFECTIVE STORAGE OF XBRL DOCUMENTS

Search Engine Optimization (SEO) & Positioning

System Requirement Specification for A Distributed Desktop Search and Document Sharing Tool for Local Area Networks

Introduction to Ingeniux Forms Builder. 90 minute Course CMSFB-V6 P

Creating a TEI-Based Website with the exist XML Database

By Glenn Fleishman. WebSpy. Form and function

Introduction to OpenTM2 An Open Source Solution for Translators

Internationalizing the Domain Name System. Šimon Hochla, Anisa Azis, Fara Nabilla

Certificates and Application Resigning

Creating an EAD Finding Aid. Nicole Wilkins. SJSU School of Library and Information Science. Libr 281. Professor Mary Bolin.

7 Why Use Perl for CGI?

Overview Document Framework Version 1.0 December 12, 2005

But have you ever wondered how to create your own website?

JAVASCRIPT AND COOKIES

Web Programming. Robert M. Dondero, Ph.D. Princeton University

Integrating with BarTender Integration Builder

XML: extensible Markup Language. Anabel Fraga

Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web

Firewall Builder Architecture Overview

Webapps Vulnerability Report

Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013

Natural Language to Relational Query by Using Parsing Compiler

Get Success in Passing Your Certification Exam at first attempt!

U.S. Department of Health and Human Services (HHS) The Office of the National Coordinator for Health Information Technology (ONC)

A Framework for Data Management for the Online Volunteer Translators' Aid System QRLex

Introducing TshwaneLex A New Computer Program for the Compilation of Dictionaries

About us Our Services

Citrix StoreFront. Customizing the Receiver for Web User Interface Citrix. All rights reserved.

DeltaPAY v2 Merchant Integration Specifications (HTML - v1.9)

Translution Price List GBP

VERSION 3.0. REMARKS Documentation rev 01. November CREATOR Douglas J. Robar, Percipient Studios. LICENSE MIT License

Recommended readings. Lecture 11 - Securing Web. Applications. Security. Declarative Security

XBMC Architecture Overview

Secure Web Appliance. Reverse Proxy

Modern Databases. Database Systems Lecture 18 Natasha Alechina

Why major in linguistics (and what does a linguist do)?

Application development in XML

Language and Computation

How to translate your website. An overview of the steps to take if you are about to embark on a website localization project.

ABSTRACT. Keywords: Learning Management Systems, Moodle, remote laboratory systems, scorm

Network Working Group

VoiceXML and VoIP. Architectural Elements of Next-Generation Telephone Services. RJ Auburn

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Acano solution. Acano Solution Installation Guide. Acano. January B

Web Services for Management Perl Library VMware ESX Server 3.5, VMware ESX Server 3i version 3.5, and VMware VirtualCenter 2.5

Jenesis Software - Podcast Episode 3

Simplifying e Business Collaboration by providing a Semantic Mapping Platform

White Paper January 2009

Web Programming with PHP 5. The right tool for the right job.

Chapter 2: Designing XML DTDs

Moving Enterprise Applications into VoiceXML. May 2002

Standard Registry Development and Publication Process

DTD Tutorial. About the tutorial. Tutorial

Lightweight Data Integration using the WebComposition Data Grid Service

Secure Authentication and Session. State Management for Web Services

Best Practices for Java Projects Horst Rechner

Contents. 2 Alfresco API Version 1.0

The OpenFOAM-extend project on SourceForge: current status. Bernhard Gschaider, ICE Strömungforschung GmbH

Invitation to Ezhil : A Tamil Programming Language for Early Computer-Science Education 07/10/13

Agents and Web Services

Cache Database: Introduction to a New Generation Database

Tvärslå defining an XML exchange format and then building an on-line Nordic dictionary

Chapter ML:XI (continued)

TextGrid as Virtual Research Environment

A common interface for multi-rule-engine distributed systems

Chapter 3: XML Namespaces

An Insight into Cookie Security

SELF SERVICE RESET PASSWORD MANAGEMENT CITRIX AND MICROSOFT TERMINAL SERVICES

Setup Guide Access Manager 3.2 SP3

1. To start Installation: To install the reporting tool, copy the entire contents of the zip file to a directory of your choice. Run the exe.

Compiler I: Syntax Analysis Human Thought

XML for Manufacturing Systems Integration

Internet Technologies_1. Doc. Ing. František Huňka, CSc.

Digital media glossary

Data Integration through XML/XSLT. Presenter: Xin Gu

Information extraction from online XML-encoded documents

CREATING AND EDITING CONTENT AND BLOG POSTS WITH THE DRUPAL CKEDITOR

Identity Theft. Medical Asst. EKG / Cardio. Fire Rescue. Paramedic. Date(s) Used April 12 and 14, 2011

Honors Research Project Database

XTM for Language Service Providers Explained

Chapter 2 Text Processing with the Command Line Interface

technische universiteit eindhoven WIS & Engineering Geert-Jan Houben

Transcription:

FreeDict: an Open Source repository of TEI-encoded bilingual dictionaries Piotr Bański Institute of English Studies, University of Warsaw Beata Wójtowicz Dept of African Languages and Cultures, University of Warsaw; Institute of Computer Science, Polish Academy of Sciences Support from the Standing Committee for the Humanities (SCH) of the European Science Foundation made this paper possible (http://www.esf.org/human). TEI-MM, Ann Arbor, November 2009

How it all began Beata's Ph.D. diss on Swahili lexicography, outlining the structure of a modern electronic Swahili-Polish dictionary, before that, Beata edits or compiles two little dictionaries (swa-pol, swa-eng), one of them for the Freedict project (swa-eng v. 0.2), in early 2007 (?) Piotr talks Beata into trying to apply for a grant to build her dream swa-pol-swa dictionary, we prepare a proposal, and thus get some material to present and discuss at conferences, and so it begins, we talk and read and listen, and re-evaluate and modify, fail to get the grant, reapply... twice..., in the meantime, we work on the swa-eng Freedict dictionary in P5, because we like the project and want to popularize it; Piotr joins the Freedict admin team, sets up the wiki and just can't stop meddling with the tools (actually, Adam Przepiórkowski stops him for 10 months of work on the National Corpus of Polish), finally, the reviewers get tired and possibly scared of having to re-read our proposals, and we get the grant, officially starting now, in November, that forces us to define our stance wrt Freedict, license-wise; in the worst case, we will keep donating pieces of our closed-source dictionary to Freedict. P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 2 of 20

FreeDict(.org) Started in 2000 by Horst Eyermann, as a sister project to DICT. Many databases were adapted from Ergane (travelang.com) by concatenating (= crossing) their dictionaries,with Esperanto as the middle language (Dictionary crossing: L1 L2 crossed with L2 L3 or L3 L2 becomes L1 L3 ) advantage: mass production of glossaries disadvantage: translation errors bound to crop up High hopes, lots of work by Michael Bunk, conversion from a plain text format into TEI P4, a TEI P4 dictionary editor for Linux, then... licensing problems (Ergane changed their licenses, many databases had to be withdrawn), Real Life and eventually stagnation. Revival in late 2007, things started to move again. New features at Sourceforge, newer documentation, a gradual move towards P5 was initiated: the conversion tools had to be adapted, now the XSLT path from P5 to DICT is open. P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 3 of 20

DICT(.org) Dictionary Server Protocol (Rickard Faith and Bret Martin, NWG RFC #2229, 1997) TCP-based query/response protocol that allows access to dictionary databases, data in textual form, an option to provide MIME-encoded content, currently maintained and developed at SourceForge.net; more than one way to query a DICT database: you can search the definitions and the headwords, using regex-based criteria, clients can be free-standing desktop applications or they can be integrated into editors or web browsers; DICT web gateways also exist. DICT is one of the strengths of FreeDict, as it allows for easy publication of dictionaries. FreeDict, in turn, groups bilingual dictionaries that DICT can use. P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 4 of 20

Controlled tag abuse; conformance levels We want: mass production of Bad-but-Big dictionaries/glossaries with minimal markup, semi-automated production of better dictionaries with richer markup, lexicographic works with 'perfect' markup; hand-crafted. This means: 3 ODDs (eventually), each of them progressively restrictive and with tightening semantics, allowing for tag abuse at the lower levels (e.g. use <def> for translation equivalents, use <note> for practically everything else, including grammatical information); the tag abuse is controlled by the appropriate ODD, allowing for low machine-readability at the BbB level (equivalents separated by commas and semicolons; notes in brackets); the three ODDs should not be independent (Laurent, help!); we thought of externally expressed constraints, but ODD inheritance would be much better! P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 5 of 20

Basic restrictions No mess an external demand, by the tools (no entryfree or dictscrap, <sense> and gramgrp/pos obligatory, even if <pos/> ends up empty support for empty <pos/> implemented right before TEI-MM, see eng-ell, ca. 20.000 entries) Three levels of conformance for dictionary encoding (an old concept, possibly due to the CES): level 1 for mass encoding, <def> with a single string inside level 2 for some more nifty microstructure (gramgrp, iterated <form> for pronunciation, iterated <def>s, <note> instead of brackets), level 3: the new ultracool cit/quote structure where e.g. <def> means a real definition, should one appear in the entry, <note>s are specialized (<usg>, etc.) The ODDs should ideally be in a subsumption relation ODD inheritance would be nice to have... otherwise it's a maintenance nightmare. Let us illustrate the above now. P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 6 of 20

Level 1 <sense> <def>equivalent.1a, (sense restriction) equivalent.1b; equivalent.2 (definitional comment)</def> </sense> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 7 of 20

Level 2 <sense> <def>equivalent.1a, <note type="hint">sense restriction</note> equivalent.1b</def> <def>equivalent.2</def> <note type="def">definitional comment</note> </sense> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 8 of 20

Level 3 <sense n="1" xml:lang="l2"> <cit> <quote>equivalent.1a</quote> </cit> <cit> <note type="restr" xml:lang="l1">sense restriction</note> <quote>equivalent.1b</quote> </cit> </sense> <sense n="2" xml:lang="l2"> <cit> <quote>equivalent.2</quote> <def>definitional comment</def> </cit> </sense> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 9 of 20

Local vs. project-wide markup language-specific, where various linguistic features pertaining to the given language pair are encoded in detail <form type="n"> <orth>alasiri</orth> </form> alasiri [sg = pl] uniform pre-conversion format, which is translated into the DICT format by a single project-wide set of stylesheets. The mapping between the language-specific format and the pre-conversion format is performed by language-specific XSLT stylesheets. XSLT is also helpful in producing new entries, language-specifically (we automatically produce entries for plural nouns from the singulars in Swahili, plurals are marked by prefixes; illustration in the appendix) P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 10 of 20

What we have 70 databases of varied quality (most need maintenance) and size (the largest have 140-150 thousand entries), regular format (eventually exposing the translation equivalents in cit/quote markup), a way to publish almost immediately, via Sourceforge, Freedict's own Debian package series, and via DICT, easy access (web, desktop; small local servers easy to install), plans for the future... P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 11 of 20

Plans for the future Make all databases P5-conformant, In doing so, establish what exactly is needed for our internal three-level conformance and freeze the ODDs, Find a DWS for the developers (lossless transformation into/from the data format of e.g. FLEx or Glossword), Establish a reasonably reliable concatenation method to produce more glossaries or dictionaries, Perform dictionary reversal for those databases where the equivalents are suitably exposed in cit/quote. P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 12 of 20

What we need A way to express the relatedness of ODDs, preferably by inheritance (cf. Laurent's talk!) we used to think of an outside set of pointers to express this; Dictionary developers anyone here? got any clever M.A.-level students interested in lexicography? Data available under free licenses; Tools (or reliable methods of lossless translation of data formats); how likely is it to have a Dictionary Writing System using the TEI as its format? It would have to understand the ODDs. P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 13 of 20

Possible extensions Wójtowicz and Bański: Swahili-Polish-Swahili dictionary project (previously, it informed the Freedict swa-eng dictionary; it may be based on it now, pending licensing questions) Bański and Moszczyński (LREC 2008), Enhancing an English-Polish Electronic Dictionary for Multiword Expression Research an RNG grammar for describing MWEs as regular expressions, applied to dictionary entries; dormant, will be woken up around January. Conversion between TEI and other dictionary-oriented XML formats (OLIF, LIFT) Bański, Habilitazionsschrift on applying the GOLD ontology to dictionary markup and content Piotr has around a year for that, not longer (or he gets a kick out of the Uni). + whatever you can think about with regard to FreeDict in your own research we'll be happy to assist on the administrative/sf side. P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 14 of 20

Thanks for your attention! http://freedict.org/ P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 15 of 20

<entry> <form><orth>alasiri</orth></form> <def>afternoon</def> </entry> Appendix: our story (I) <entry xml:id="alasiri"> <form><orth>alasiri</orth></form> <gramgrp><pos>n</pos></gramgrp> <sense> <def>afternoon (period between 3 p.m. and 5 p.m.)</def> </sense> </entry> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 16 of 20

Appendix: our story (II) <entry xml:id="alasiri"> <form type="n"> <orth>alasiri</orth> </form> <gramgrp><pos>n</pos></gramgrp> <sense> <def>afternoon</def> <note type="def">period between 3 p.m. and 5 p.m.</note> </sense> </entry> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 17 of 20

Appendix: our story (III) <entry xml:id="alasiri"> <form type="n" xml:lang="sw"> <orth>alasiri</orth> </form> <gramgrp><pos>n</pos></gramgrp> <sense> <cit type="trans"> <quote>afternoon</quote> </cit> <def>period between 3 p.m. and 5 p.m.</def> </sense> </entry> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 18 of 20

Appendix: automatic enrichment of the dictionary (I) The singular entry: <entry> <form> <orth>adui</orth> <ref target="#maadui"/> </form> <gramgrp><pos>n</pos></gramgrp> <sense> <def>enemy</def> </sense> <sense> <def>opponent</def> <note type="hint">in games or sports</note> </sense> </entry> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 19 of 20

Appendix: automatic enrichment of the dictionary (II) The plural entry: <entry xml:id="maadui"> <form><orth>maadui</orth></form> <gramgrp><pos>n</pos></gramgrp> <sense> <xr type="plural-sense">plural of <ref target="#adui">adui</ref> </xr> <def>enemy</def> <def>opponent <note type="hint">in games or sports</note></def> </sense> </entry> P. Bański and B. Wójtowicz, TEI-MM, Ann Arbor, 14 November 2009 20 of 20