Software = hard for national termbanks?
Henrik Nilsson & Sandra Cuadrado í Camps
Terminologicentrum TNC & Termcat
IITF Colloquium, Vienna, Austria, 9 July 2015
Outline
- National termbank
  - The concept and some examples: Rikstermbanken, Cercaterm
  - State of the art (TERMINTRA)
- Aspects and related technical challenges
  - getting (and presenting) content
  - harmonizing content
  - users
  - digital age
  - reuse
  - getting funding
"National" could imply
- a government responsibility and financing
- a link to a national terminology centre
- a basis in the national conceptual world
- a certain language choice (monolingual, only national languages)
- a certain quality
- a certain accessibility (free of charge, adapted)
- a certain scope (e.g. cover all terminology in the nation, nothing foreign etc.)
- a certain status (affecting usage)
- a marketing gimmick
"National" should imply
- a certain coverage (as to contents)
- a certain status (acknowledged by professionals and a language or terminology institution)
- accessibility (open and freed of ownership claims)
[Termintra, Oslo, 2012]
national terminology database: "database containing mono- or multilingual terminological data [...] established at country level" [Guidelines for Terminology Policies, UNESCO]
"the national termbank, which attempts to serve a general purpose role in coordinating the creation and use of terminologies within a country, and hence is theoretically multifunctional, multilingual and exploited by widely differing kinds of users" [McNaught, 1987]
Why a national term bank?
"I have been a manager [...] within the U.S. Federal Government for over 30 years. In that time, I have observed that the dominant case of ineffectiveness, inefficiency, and unresponsiveness in operations is the inconsistent terms used across the various boundaries of government, their contractors, industry, non-profits, and citizens. There are terminology boundaries between locations, organizations, offices within the organizations, work functions, processes, resources (e.g. people, intelligence, funds, skills, materiel, facilities, services), and capability requirements (e.g. missions, information systems)." [Roebuck, 2009]
"Next, the vocabulary of these functions would be automatically collected, organized, and placed into a National Terminology database to enable integration, interoperability, unification, and federation of operations" [Roebuck, 2009]
Technical challenges!?
European national termbanks
- Stofnun Árna Magnússonar í íslenskum fræðum, Iceland: Orðabanki
- Foras na Gaeilge, Ireland: Téarma.ie
- Norway: Termportalen, Snorre
- NL-Term, Netherlands: Nedterm
- TNC, Sweden: Rikstermbanken
- TSK, Finland: Vetenskapstermbanken, TEPA, Valter
- Eter, Estonia: ESTERM
- Latvia: EuroTermBank
- LKI, Lithuania: Terminų bankas
- Wales: National Terminology Portal
- Société française de terminologie, France: FranceTerme
- Confédération suisse: Termdat
- Slovenia: Evroterm
- Croatia: National Terminology Portal (incl. Struna)
- UZEI, Basque Country: Euskalterm
- Termcat: Cercaterm
- Denmark: (DTB)
- Türk Dil Kurumu, Turkey: Bilim ve Sanat Terimleri
Struna (CR)
FranceTerme (FR)
Terminų Bankas (LT)
BFT (FI)
National Terminology Portal (Wales)
Risten (Sápmi)
Orðabanki (ISL)
Téarma.ie (IRL)
Slovenská terminologická databáza (SK)
AkadTerm (LV)
Euskalterm (Basque Country)
Türk Dil Kurumu (TR)
Terminoloģijas portāls (LV)
Nedterm (NL)
Other termbanks
- EuroTermBank
- National Termbank (RSA)
- IATE
- ISO Online Browsing Platform
- UNTERM
- EAA Glossary
- Electropedia
- METEOTERM
- ILOTERM
- FAOTERM
EuroTermBank
IATE
www.rikstermbanken.se
Background
"The fast development of society requires constant work on creating and making accessible agreed-upon terminologies, within more and more subject fields. An easy access to terms via the Internet in a national termbank [rikstermbank] endorses such a development."
"the establishment of a national central term bank, a rikstermbank, is a prerequisite for an easy access to, and quality assurance of, Swedish terms in all domains."
- TISS, 2002-2004
- Nordterm-Net, 1999; Brussels Declaration, 2002 et al.
- IT-propositionen (Prop. 2004/05:175), 2005
- Bästa språket (Prop. 2005/06:2), 2005
- Grant from Ministry of Industry, Employment and Communications: 2005: 1 500 000 SEK; 2007: 750 000 SEK; 2009: 0; 2011: discussion about semantic resource!
- IATE, EU; evaluation 2004
- Terminų Bankas, Lithuania & EuroTermBank
Rikstermbanken as a tool
- for storage
- for search and retrieval
- for terminology work, research
"Rikstermbanken should mainly reflect concepts of the Swedish society; however, this does not mean that the termbank would comprise only Swedish terms. In order to make it function in the way it is planned, the termbank should also contain term equivalents in foreign languages, and not only in English but also in various immigrant languages and in the official minority languages of Sweden." [IT-propositionen, Prop. 2004/05:175]
Current contents
- no limitations as to domains!
- Swedish conceptual world = starting point
- complete glossaries, but also parts of documents and excerpts
- some digitized material
- quality control by terminologists (and at times the supplier)
- presentation phase; consolidation phase
- overview; harmonisation
Rikstermbanken in numbers
- 106 000 term records
- 300 000 terms (incl. look-up terms, synonyms, equivalents)
- 28 languages
- 71 % definitions (in Swedish)
- ca. 1 500 unique sources
- ca. 500 suppliers
Contents priorities
- selection, types
- preparation (enhancing, record making & breaking)
- harmonization (doublettes)
- updating
- addition of new material
- quality vs. quantity?
Preparation of the material
- termbank adaptation (reformatting according to NTRF-RTB, exclusion of remaining book-related aspects)
- selection
- changes for consistency
- linguistic and content-related adjustments (incl. removal of target group adaptations)
- discussion with suppliers
- illustrations
- semi-automatic three-step import
- control tool
Technology
- experience from Termdok development and Nordterm-Net (MLIS project)
- comparisons to existing TMS software and standards (ISO, LISA et al.)
- IATE evaluation
- co-operation with IATE, EuroTermBank
- proper software
- open source: Lucene, MySQL, Tomcat, Java
Technical development Rikstermbanken
- Oracle replaced by open source:
  - MySQL (database management)
  - Tomcat (web server)
  - Lucene (indexing)
  - Java applications
- Iterative process
- Documentation via internal wiki
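To illustrate what the indexing layer contributes to search and retrieval, here is a toy inverted index in Python. It is only a sketch of the general idea behind an index such as Lucene, not Lucene's API and not Rikstermbanken's implementation; the record structure, field names and sample data are assumptions made for the example.

```python
from collections import defaultdict


def build_index(records):
    """Map each lower-cased token to the ids of the records containing it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        # Hypothetical record layout: only "term" and "definition" are indexed.
        for field in ("term", "definition"):
            for token in record.get(field, "").lower().split():
                token = token.strip(".,;:()")
                if token:
                    index[token].add(record_id)
    return index


def search(index, query):
    """Return the ids of records that contain every token of the query."""
    hits = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*hits) if hits else set()


# Hypothetical sample records, for illustration only.
records = {
    1: {"term": "offset", "definition": "lithographic printing method"},
    2: {"term": "lithography", "definition": "printing method using stone"},
}
index = build_index(records)
print(sorted(search(index, "printing method")))
```

A real index adds tokenisation per language, stemming, ranking and field weighting, which is where a mature library earns its place.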
Cercaterm (CAT)
Cercaterm
- online platform designed, supported and updated by Termcat (since 2000)
- development of terminological products, terminology standardisation and the terminology consulting service all lead to updates to Cercaterm
- contents: Termcat's terminology production, standardized terminology, queries resolved + other material
- 230 000 term records (more than 925 000 designations)
- new functions in 2010 (based on a user survey): search, sources
- 3 million visits in 2014
- also other information
TERMINTRA
- Forum for discussion on national termbanks
- The concept of national termbank
- Aspects: General, Contents, Users, Funding, Organization, Technology
- First seminar in Oslo 2012, second in Zagreb 2013
- Participants from Catalonia, Croatia, Denmark, Finland, France, Ireland, Iceland, Latvia, Norway, Sápmi, Sweden, Switzerland, Wales
TERMINTRA: Technology
- What technical solutions are in use today, and are some more appropriate than others?
- Should a national term bank be based on a distributed solution or not? Or, rather, constitute a kind of portal? Pros and cons?
- What standards should be the basis for national terminology databases (storage and exchange formats, etc.)?
- Are the current terminology management systems suitable for the demands which could be made on a national term bank?
- To what extent are today's national terminology banks based on proprietary software (use of open source or not)?
The current situation is that most of the bigger existing term banks use purpose-built software, although there are cases where general purpose information retrieval software is used. Although computerized term banks have been in existence for a number of years, there seems to be little agreement as to how they should operate, and if the present situation persists, their use will continue to be low. If term banks are to become widely used certain changes in practice will be necessary; changes which in turn have implications for the software that must be used for term bank operation. [Negus, 1979]
"the longer established term banks tend to use purpose built software, partly because nothing generally available at the time was found to be suitable, and partly because each is aimed at providing a range of services not found elsewhere, using terminological records and searching methods which are more or less unique. [...] all systems should attempt to maintain the greatest flexibility in their approach. However, this is difficult to achieve where specially created software is concerned; there is an inevitable tendency to provide what is definitely required at the time of program specification, perhaps giving little thought to what services might be required, or facilities demanded, at some indeterminate time in the future." [Negus, 1979]
As to the technological aspects of national termbanks, it became clear during the presentations and discussions that most of the represented termbanks had developed their own technical solution (which, however, in many cases relied on international standards). The exception was the Finnish termbank using Wiki-technology and open source software. [Proceedings, TERMINTRA I, 2013]
Perspective \ Aspect   Contents   Technology   Organisation
Manager                X          X            X
Users                  X          X            (X)
Suppliers              X          X            (X)
Financing bodies       (X)        X            (X)
Challenge: getting content
- term extraction as part of software (or separate)?
- automatic record breaking into data categories (definition indicators etc.)? And record making?
- automatically fill in the gaps? (automatic classification)
Various sources [Heid (1991) in Martin & van der Vliet, 2003]
Import process (of glossaries)
1. inventory (weekly) & preliminary assessment
2. formal inquiry
3. collection
4. formatting
5. review
6. (feedback)
7. first import
8. adjustments
9. second import
10. updating
Term bank contents: challenges
- Selection: all or nothing or a little?
- Interpretation of contents, decontextualisation
- Term choice (variants, synonyms etc.)
- Definition vs. explanation
- Updating vs. archiving
- Consistency changes?
- Decustomization (= depersonalisation)
- Record breaking & record making
- Document types: legal documents
Record breaking (1)
Before / After
svte offset
svdf litografisk plantryckmetod där tryckplåten är preparerad så att färggivande ytor gjorts färgmottagliga och vattenbortstötande och icke färggivande partier gjorts vattenmottagliga och färgbortstötande
svrete litografi, djuptryck, direktlito
svan Överföringen av tryckbilden från offsetplåten sker indirekt via en gummiduk till papperet.
Record breaking (2)
Before / After
svte incidens HONR 1
svfk Antalet fall av en viss sjukdom som uppträder i en befolkning under viss tid; anges t.ex. som antalet diagnoser per 1 000 invånare per år.
svte incidens HONR 2
svupte incidenskvot
svfk Antalet av en viss studerad händelse i en klinisk prövning eller kohortundersökning, dividerat med antalet deltagare i gruppen. Graden av skillnad mellan två gruppers incidenstal kan uttryckas genom att det ena divideras med det andra till en incidenskvot.
svrete händelse
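The record breaking shown in the two examples above can be sketched in Python as follows. This is a minimal illustration, not the actual import tool. The English glosses for the field codes, the rule that every svte field opens a new record, and the shortened sample text are all assumptions made for the example.

```python
import re

# Field codes as seen in the examples: language prefix "sv" plus a
# data-category code. The English glosses are assumptions.
FIELD_NAMES = {
    "te": "term",
    "df": "definition",
    "fk": "explanation",
    "an": "note",
    "rete": "related_terms",
    "upte": "lookup_terms",
}

_CODE = re.compile(r"\bsv(rete|upte|te|df|fk|an)\s+")
_HONR = re.compile(r"\s+HONR\s+(\d+)$")  # homonym number after a term


def break_records(flat):
    """Split a flat run of coded fields into a list of record dicts."""
    records = []
    matches = list(_CODE.finditer(flat))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(flat)
        value = flat[match.end():end].strip()
        name = FIELD_NAMES[match.group(1)]
        if name == "term":
            # Assumption: every "svte" field opens a new record.
            record = {"term": value}
            honr = _HONR.search(value)
            if honr:
                record["term"] = value[:honr.start()]
                record["homonym_number"] = int(honr.group(1))
            records.append(record)
        elif records:
            records[-1][name] = value
    return records


# Shortened version of the second example, for illustration only.
flat = (
    "svte incidens HONR 1 svfk Antalet fall av en viss sjukdom. "
    "svte incidens HONR 2 svupte incidenskvot "
    "svfk Antalet av en viss studerad händelse. svrete händelse"
)
for record in break_records(flat):
    print(record)
```

The sketch treats any word that looks like a field code as a field boundary, so running text containing the Swedish word "svan" would be split wrongly. This is one reason the real process is described as semi-automatic, with a control tool.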
Challenge: getting content
- term extraction as part of software (or separate)?
- automatic record breaking into data categories (definition indicators etc.)? And record making?
- automatically fill in the gaps? (automatic classification)
- mirroring (QA?) or double storage (updating)?
Distributed or not?
All terms in one place
+ consistency
+ control
+ not many other termbanks around
+ pragmatic: simpler at the time, traditional
- double storage
- updating needs
- administration of contributors
- higher technology demands on contributors
Challenge: presenting content
- automatic compounding of term records
- visualization (ontologies etc.)
[Figure: concept diagram in Danish for bagværk (baked goods), with nodes including brød, tærte, kage, lagkage, småkage and various cake bases and fillings]
Challenge: harmonizing content
- signalize various statuses ("primaries")
- automatic handling of doublettes
- automatic calculation of definition similarity?
- version management
- automatic updating of content
- automatic notification of updating (to users, of existing links etc.)
From presentation to consolidation
[Figure: amount of content plotted over time; annotation: need one accepted definition of a concept]
User survey
16. If your search for a particular term generated several hits, what do you think about that?
- Good: 84.3 % (172)
- Bad: 2.0 % (4)
- No opinion: 13.7 % (28)
27 respondents skipped the question; 17 comments
Resource harmonisation on a national level: Rikstermbanken
- background & perspectives & user survey
- content revision
- harmonisation within a source: definition vs. explanation
- harmonisation between sources (i.e. within the termbank as a whole): doublettes
- problems and solutions
- content presentation
- content updating
Harmonisation: problems
- Within and between sources
- Definition vs. explanation: choice?
- Certitude of domain?
- Breaking of conceptual whole, break in macro and micro structures
- Role of publication date
- Homonyms, synonyms
- Degree (%) of similarity between definitions?
- Handling of diverging interests (be shown, disappear etc.)
- Different sources for different data categories
- Indication of doublettes or problem?
Harmonisation: within a source
- often semasiological presentation
- redundancy (e.g. synonyms in separate records)
- choice of definition or explanation
- with respect to macrostructure (cross-references etc.)
- homonyms
Harmonisation: between sources
- (automatic) removal of absolute doublettes (but other information, other languages etc.?)
- limit (%) of definition similarity calculation?
- combination of several sources in one record instead?
- several organizations using the same definition is in itself an interesting piece of information
- special marking in hit list?
- source respect? issues?
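A minimal Python sketch of how absolute doublettes and a similarity limit could be handled. It uses the standard library's difflib as a stand-in similarity measure; the 0.8 threshold, the record layout and the sample records are assumptions for illustration, not the method used in Rikstermbanken.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lower-case and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Character-based similarity ratio between two definitions (0.0 to 1.0)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_doublettes(records, threshold=0.8):
    """Return (i, j, kind, score) for record pairs sharing a term.

    kind is "absolute" when the normalized definitions are identical and
    "near" when the score reaches the (hypothetical) threshold.
    """
    found = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if normalize(a["term"]) != normalize(b["term"]):
            continue
        score = similarity(a["definition"], b["definition"])
        if normalize(a["definition"]) == normalize(b["definition"]):
            found.append((i, j, "absolute", score))
        elif score >= threshold:
            found.append((i, j, "near", score))
    return found


# Hypothetical records from three different sources.
records = [
    {"term": "offset", "definition": "A method of printing using plates",
     "source": "A"},
    {"term": "Offset", "definition": "a method of printing  using plates",
     "source": "B"},
    {"term": "offset", "definition": "a method of printing using a plate",
     "source": "C"},
]
for i, j, kind, score in find_doublettes(records):
    print(records[i]["source"], records[j]["source"], kind, round(score, 2))
```

The sketch only reports pairs and deletes nothing. Whether an absolute doublette should be removed, merged into one record with several sources, or marked in the hit list remains the editorial question raised above.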
"National term bank [...] a large, general term bank to serve an entire nation. Such a bank would satisfy the needs of users with a variety of tasks, of prior knowledge, of organisational adherence, or of requirements for a specific product." [Åström, 1987]
Challenge: users
- satisfy all user groups?
- measures of usability?
"for a successful operation of a term bank, today's imperative is reaching out for the user and delivering the required content, wherever it may reside, with the method and in the format required by the user. The area of user participation and interaction is identified [...] as yet to be successfully integrated in the design of terminology portals." [Vasiljevs, Rirdance and Gornostay, 2010]
User adaptation!?
- Important for terminology products!
- But: sometimes over-estimated, esp. concerning human users and layout of term banks?
- Demand, frequency of usage vs. development costs?
Challenge: digital age
- crowdsourcing
- nichesourcing
- wiki technology
- voting procedures
- moderating functionalities
- access rights, roles and responsibilities etc.
- new administrator interfaces etc.
- usage on new devices (tablets, phones etc.)
- app
Critics
"Crowdsourcing killed indie rock 'cause crowds have terrible taste." [Weingarten in Keats, 2011]
"government needs smart-sourcing, not crowdsourcing." [Peterson in Keats, 2011]
"Collectively based lexicography is often regarded with scepticism by professional lexicographers since anyone can contribute anything and there's no possibility to keep the quality level of the contributions under control. This way of working has even been described as a potential danger to all serious lexicography since these dictionaries risk disturbing the trust in the two qualities that users generally associate with professionally produced dictionaries: quality and reliability." [Doherty in Svensén, 2004]
Challenge: reuse
- linked open data etc.
- APIs, URIs
- web tracking
- version management?
- thematic portals
- integration, plug-ins (CAT, Word etc.)
- federations ( issues)
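As one possible reading of "linked open data" and "URIs" for a termbank, the Python sketch below serialises a term record as a SKOS concept in JSON-LD. The SKOS namespace and property names are standard; the base URI, the record id, the function name and the English equivalent are hypothetical and only serve the example.

```python
import json

SKOS = "http://www.w3.org/2004/02/skos/core#"
# Hypothetical base URI; a real termbank would mint its own stable URIs.
BASE_URI = "https://termbank.example.org/concept/"


def to_jsonld(record_id, labels, definition=None, alt_labels=None):
    """Serialise one term record as a SKOS concept in JSON-LD.

    labels:      {language: preferred term}
    definition:  (language, text) or None
    alt_labels:  {language: [synonyms]} or None
    """
    doc = {
        "@context": {"skos": SKOS},
        "@id": f"{BASE_URI}{record_id}",
        "@type": "skos:Concept",
        "skos:prefLabel": [
            {"@value": term, "@language": lang}
            for lang, term in sorted(labels.items())
        ],
    }
    if definition:
        lang, text = definition
        doc["skos:definition"] = {"@value": text, "@language": lang}
    if alt_labels:
        doc["skos:altLabel"] = [
            {"@value": term, "@language": lang}
            for lang, terms in sorted(alt_labels.items())
            for term in terms
        ]
    return json.dumps(doc, ensure_ascii=False, indent=2)


# Record id 42 and the English equivalent are made up for the example.
print(to_jsonld(
    42,
    {"sv": "offset", "en": "offset printing"},
    definition=("sv", "litografisk plantryckmetod"),
))
```

Giving each record a stable URI is also what would make notification of updates and external links manageable, since other resources can point at the concept rather than at a search result.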
Semantic resource
"Semantic Resource [...] refers to all ontology-similar entities, such as taxonomies, dictionaries, thesauri, etc." (Lima et al, 2010?)
Fackverket 3.0
- linked open data banisters
- TNC, Wikimedia, Bobitek
- funded by Swedish Agency for Innovation Systems
- aims: enhance the use of linked open terminologies by co-ordinating and further developing existing resources and tools
Challenge: getting funding
- few existing national termbanks use off-the-shelf (OTS) software
- not good enough? (evaluation criteria? new demands?)
- easier to obtain funding if you develop your own software?
What will be the needs of linguistic data bank users in the future? These can of course vary to a large extent, but I believe that the ones we should pay attention to are the simple, down-to-earth requests, which can be summed up under the following keywords: simplicity, quality and service. [Åström, 1982]
Links
henrik.nilsson@tnc.se
TNC: www.tnc.se
Rikstermbanken: www.rikstermbanken.se
scuadrado@termcat.cat
Termcat: http://www.termcat.cat/
Cercaterm: http://www.termcat.cat/ca/cercaterm/fitxes/