LGPLLR : an open source license for NLP (Natural Language Processing) Sébastien Paumier. Université Paris-Est Marne-la-Vallée




Contact: paumier@univ-mlv.fr
Penguin image from http://tux.crystalxp.net/

Linguistic data
- data meant to be used by programs: different from electronic versions of things like dictionaries written for human readers
- a continuum between raw data like word lists and highly structured data like treebanks and wordnets:
  - electronic dictionaries
  - syntactic grammars
  - annotated corpora
  - etc.

Linguistic data
- anyone has some native expertise about language: only programmers can fix bugs, but anyone can see that "ptesident" is a typing error
- a bad conclusion: anyone can deal with linguistic data, so it is not serious work
- and a good one: many people could contribute to make data better

No serious work?
- anyone who has built linguistic resources knows how difficult it can be, even if the result looks obvious
- example: country names
  - how do you write them in your alphabet?
  - what is the official definition of "country"?
  - how do you deal with names that take a determiner, like "The United States of America"?
  - what about variants like "U.S.A."?
  - etc.

No serious work?
- the amount of work should not be underestimated
- a good question to ask: is the work important enough to be protected?

        computer science        linguistic data
  no    integer linked lists    list of English 1-letter words
  yes   automata library        syntactic lexicon

- if the answer is yes, you need a license for your data

Cooperative work
- linguistic data may be easier to fix and extend than programs, but available manpower is useless if data are not modifiable
- example: EuroWordNet
  - people cannot commit their modifications
  - same waste of time as for non-free software
- solution: allow people to modify the data, with an appropriate license

LGPLLR
- derived from the LGPL, adapted for linguistic data
- linguistic resource: collection of data about language prepared so as to be used with application programs

LGPLLR
- a work based on the linguistic resource: either the linguistic resource or any derivative work containing it or a portion of it, either verbatim or with modifications (including translation)
- legible form: the preferred form of the resource for making modifications to it

Your rights
- anyone can redistribute the data, modified or as is, with a copy of the license
- the modified work must be a linguistic resource:
  - a subset of adjectives is a linguistic resource derived from a lexicon
  - a letter frequency table is not: such a resource is not covered by the license
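The distinction between the two examples above can be pictured with a small sketch. The toy lexicon, its `(form, category)` layout, and the variable names are assumptions made for illustration only; they do not come from any actual LGPLLR-licensed resource.

```python
from collections import Counter

# Hypothetical toy lexicon: (form, category) pairs.
lexicon = [("happy", "A"), ("glad", "A"), ("house", "N"), ("run", "V")]

# A subset of the adjectives: still entries of the lexicon, so in the
# terms used above it is a linguistic resource derived from the lexicon.
adjectives = [(form, cat) for form, cat in lexicon if cat == "A"]

# A letter frequency table computed from the forms: it no longer
# contains lexicon entries, only counts of characters.
letter_frequencies = Counter(ch for form, _ in lexicon for ch in form)

print(adjectives)
print(letter_frequencies.most_common(3))
```

The first result keeps lexicon entries verbatim, while the second only aggregates statistics over them, which is the contrast the two examples draw.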

Your obligations
- modified files must carry prominent notices stating that you changed them and the date of any change
- moreover, it is strongly suggested that you explain your modifications (one word more or less can change the meaning considerably)
- you must redistribute the machine-readable legible form of the linguistic resources
  - a printed appendix in a book is not sufficient
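As a rough illustration of the change-notice obligation, the sketch below prepends a dated notice and a short explanation to a modified resource. The `#` comment convention, the entry format, the author name, and the helper `add_change_notice` are all hypothetical; a real resource would use whatever comment or metadata mechanism its own format provides.

```python
from datetime import date

def add_change_notice(resource_text, author, summary, when=None):
    """Return resource_text with a dated change notice prepended.

    The '#' comment prefix is an assumption for this sketch.
    """
    when = when or date.today()
    notice = (
        f"# Modified by {author} on {when.isoformat()}\n"
        f"# Change: {summary}\n"
    )
    return notice + resource_text

# Hypothetical resource containing the typing error mentioned earlier.
original = "ptesident,N\nminister,N\n"
fixed = original.replace("ptesident", "president")
modified = add_change_notice(
    fixed,
    author="A. Contributor",
    summary="corrected typing error 'ptesident' -> 'president'",
    when=date(2024, 1, 15),
)
print(modified)
```

Keeping the explanation next to the notice follows the suggestion above that modifications be explained, since a one-word change can matter.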

If you write programs...
- a program that contains no derivative of any portion of the data, but is designed to work with it, is called a "work that uses the linguistic resource", and falls outside the scope of the LGPLLR
- however, the combination of such a program with the data is considered a derivative of the linguistic resource

If you write programs...
- you must use a suitable mechanism for combining with the linguistic resource
- a suitable mechanism is one that will operate properly with a modified version of the linguistic resource
- if your package includes an encrypted version of the linguistic resources, it must contain any data and utility programs needed to rebuild this encrypted version from the original data
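One way to picture a mechanism that keeps operating with a modified resource is a program that reads the resource from an external file at run time instead of embedding it. This is only a sketch under stated assumptions: the tab-separated `form<TAB>category` format, the file name, and the functions `load_lexicon` and `tag` are hypothetical, and the license text itself remains the reference for what counts as suitable.

```python
import tempfile
from pathlib import Path

def load_lexicon(path):
    """Read a tab-separated 'form<TAB>category' file into a dict."""
    lexicon = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        form, category = line.split("\t", 1)
        lexicon[form] = category
    return lexicon

def tag(words, lexicon):
    """Look each word up in the lexicon; unknown words get 'UNKNOWN'."""
    return [(w, lexicon.get(w, "UNKNOWN")) for w in words]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lexicon.tsv"

    # Original resource.
    path.write_text("president\tN\n", encoding="utf-8")
    original_tags = tag(["president", "elects"], load_lexicon(path))

    # Someone extends the resource; the program itself is unchanged.
    path.write_text("president\tN\nelects\tV\n", encoding="utf-8")
    modified_tags = tag(["president", "elects"], load_lexicon(path))

print(original_tags)
print(modified_tags)
```

Because the data live in a separate file, a user can replace or extend the lexicon without touching or rebuilding the program.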

Some works using LGPLLR
- Tables du lexique-grammaire du français (lexicon-grammar tables of French)
- Lefff: lexicon of French inflected forms
- Prolex: lexicon of proper nouns
- DicoValence: valence dictionary of French verbs
- FRoG (French Resource Grammar): an HPSG grammar of French developed with the LKB platform (parsing and generation)

Some works using LGPLLR
- SynLex: syntactic lexicons derived from lexicon-grammar tables
- Sanskrit linguistic resources
- Spanish Resource Grammar
- GerTT: grammar fragment of German in TT-MCTAG
- LexSchem: lexicon of subcategorization frames for French verbs, automatically acquired from corpora

Some works using LGPLLR
- GG: an HPSG grammar for German
- Les Verbes français, by J. Dubois and F. Dubois-Charlier
- Locutions en Français, by J. Dubois and F. Dubois-Charlier
- all dictionaries and grammars distributed with Unitex

Conclusion
- linguistic resources with restricted distribution policies suffer from the same problems as non-free software
- freely accessible data are good
- freely modifiable data are better
- share your work and make others' work better: use LGPLLR

http://igm.univ-mlv.fr/~unitex/