Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) resources / services

Speakers: Kai Zimmer and Jörg Didakowski
Clarin Workshop WP2, February 2009

BBAW/DWDS
The BBAW and its 40 long-term projects offer many resources:
- Digital Dictionary of the German Language (DWDS): corpora, dictionaries, and a "language information platform"; the project also develops natural language processing tools and a search engine
- German Text Archive (DTA): texts from the 14th to the 19th century in an "active archive"
- TELOTA: technical service section for projects at the academy

BBAW/DWDS
DWDS publicly available corpora (via web interface):
- German reference corpus (balanced over categories and decades)
- newspapers:
  - Die Zeit (updated daily, 1946 to present)
  - Berliner Zeitung
  - Tagesspiegel
  - Potsdamer Neueste Nachrichten (PNN)
- spoken language corpus
- historical corpora
- Jewish periodicals (by Compact Memory)

DDC
DWDS uses ddc-concordance (open source, LGPL) as its online corpus search engine. Features:
- statistical queries (exact, not approximations)
- regular expression, phrase, distance, and truncation (left/right) search
- sentence-based or document-based search
- search for word forms (English, German, and Russian)
- indexing of metadata and annotations
- document relevance ranking
- fast
- scalable to huge corpora and heavy load, thanks to its clustering architecture
- clients for Python, Perl, PHP, and C++ (the network protocol is easy to implement in other programming languages)

DDC
DDC's query language is fully available through the XML-RPC service.

DDC/C4
The clustering architecture of the ddc search engine is used primarily for performance and scaling. It also makes it possible to connect separate corpora hosted at different sites, as in the C4 project (similar to Dieter's DAM LR EU project).

DDC/C4
The C4 project consists of four participants:
- Austrian Academy Corpus (AAC, Vienna)
- Swiss Text Corpus (Basel)
- Corpus South Tyrol (Italy)
- Berlin corpus (DWDS/BBAW, Germany)
Each participating country adds a balanced subcorpus of about 20 million tokens to a "shared" corpus. ddc sorts and merges the results of a search query. Authentication is handled by simple MySQL databases.
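The merge step can be pictured as a k-way merge of per-corpus hit lists, each of which is already sorted. The following Python sketch works under that assumption. The `Hit` fields, the corpus labels, and the sort key (descending relevance score) are illustrative and are not taken from ddc itself.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Hit(NamedTuple):
    score: float  # hypothetical relevance score
    corpus: str   # label of the subcorpus that produced the hit
    text: str     # matched sentence or document snippet


def merge_hits(*sorted_lists: Iterable[Hit]) -> Iterator[Hit]:
    """Merge hit lists that are each sorted by descending score."""
    # heapq.merge is lazy: it never loads more than one pending hit per list.
    return heapq.merge(*sorted_lists, key=lambda h: h.score, reverse=True)


# Made-up hits from two of the participating corpora.
aac = [Hit(0.9, "AAC", "..."), Hit(0.4, "AAC", "...")]
dwds = [Hit(0.7, "DWDS", "..."), Hit(0.2, "DWDS", "...")]

merged = list(merge_hits(aac, dwds))
for hit in merged:
    print(hit.corpus, hit.score)
```

Because each node only has to deliver its own sorted list, the merging side needs no knowledge of how the individual subcorpora are indexed.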


On to Jörg's presentation about our XML-RPC services...

Web Services
- The web services are currently for internal use in our project network.
- They allow efficient and easy access to textual resources and language processing tools.
- The web services for language processing tools are based on XML-RPC.
- The web services for textual resources are based on DDC.
- An XML-RPC based service repository manages the services.

XML-RPC
XML-RPC is a remote procedure call protocol that works over the Internet. An XML-RPC message is an HTTP POST request whose body is XML. A procedure is executed on the server, and the value it returns is also formatted in XML.
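To make the message format concrete, here is a small sketch that uses Python's standard `xmlrpc.client` module to build the XML body of a request and to decode a response. The method name `example.echo` is a placeholder invented for illustration and is not one of the services described here. No network connection is made.

```python
import xmlrpc.client

# Build the XML body that would be sent in the HTTP POST request.
# "example.echo" is a hypothetical method name used only for illustration.
request_xml = xmlrpc.client.dumps(("Hallo Welt",), methodname="example.echo")
print(request_xml)

# A response is marshalled the same way; loads() turns it back into
# Python values. For a response, the returned method name is None.
response_xml = xmlrpc.client.dumps(("Hallo Welt",), methodresponse=True)
params, method = xmlrpc.client.loads(response_xml)
print(params)
```

The printed request contains a `methodCall` element with the method name and one `param` per argument, which is all a client in another language has to produce.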

Service Repository
User administration (based on a MySQL database):
- Authorization management
- Granular configuration
- Individual unlocking of services
- Time-sensitive authorization
Service administration (based on a MySQL database):
- Integration of services
- IP address and port
- Service name, version, description, ID, maintainer, etc.
- Logging information
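One way to picture per-service, time-sensitive authorization is a lookup keyed by user and service that also checks a validity window. The sketch below is an assumption about how such a check could look, not a description of the repository's actual schema. The `Grant` type, the user name, and the service name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Tuple


@dataclass(frozen=True)
class Grant:
    valid_from: datetime   # start of the period in which the service is unlocked
    valid_until: datetime  # end of that period


# Hypothetical in-memory stand-in for the user administration database:
# one grant per (user, service) pair.
Grants = Dict[Tuple[str, str], Grant]


def is_authorized(grants: Grants, user: str, service: str, now: datetime) -> bool:
    """Return True if the user holds a grant for the service that is valid now."""
    grant = grants.get((user, service))
    return grant is not None and grant.valid_from <= now <= grant.valid_until


grants: Grants = {
    ("alice", "example.tagger"): Grant(datetime(2009, 1, 1), datetime(2009, 3, 1)),
}
print(is_authorized(grants, "alice", "example.tagger", datetime(2009, 2, 15)))
```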

Language Processing Tools
- They are for German.
- Most of them are based on finite-state techniques.
- Most of them are rule-based.
- They are implemented in C and C++.

ToMaSoTaTh
Combines different tasks:
- Tokenizing
- Morphological analysis (TAGH Morphology)
- TextToSound (using SAMPA)
- Tagging (moot)
- Thesaurus (Lexikonet)
The components can be applied individually.
Input: plain text
Output: one token per line with tab-separated information
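Output of this shape (one token per line, fields separated by tabs) is easy to consume on the client side. The sketch below splits such output into rows of fields. The sample string and the meaning of its columns are made up for illustration, since the actual column layout depends on which components are applied.

```python
from typing import List


def parse_token_lines(output: str) -> List[List[str]]:
    """Split one-token-per-line, tab-separated output into rows of fields."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip empty lines, e.g. a trailing newline
        rows.append(line.split("\t"))
    return rows


# Made-up sample: token in the first column, a hypothetical tag in the second.
sample = "Das\tDET\nHaus\tNOUN\n"
rows = parse_token_lines(sample)
for fields in rows:
    print(fields)
```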

Meinten Sie (Did you mean?)
This tool computes corrections for typos, based on edit distance precompiled over a word list.
Input: token
Output: token list (proposals)
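The idea behind edit-distance proposals can be sketched in a few lines of Python. Note that this sketch computes the Levenshtein distance at query time against every word, whereas the tool described above precompiles the distance over its word list. The word list and the distance threshold are illustrative assumptions.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def proposals(token: str, word_list: Iterable[str], max_distance: int = 1) -> List[str]:
    """Return words within max_distance of token, closest first."""
    scored = sorted((edit_distance(token, w), w) for w in word_list)
    return [w for d, w in scored if d <= max_distance]


# Made-up word list for illustration.
words = ["Haus", "Maus", "Hans", "Baum"]
suggestions = proposals("Hauz", words)
print(suggestions)
```

A naive scan like this is linear in the size of the word list per query, which is the cost that precompilation avoids.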

SynCoP
Grammar/specification-driven parser
Implemented systems:
- (partial) dependency parsing
- named entity recognition and classification (person names, location names, organization names)
Input: plain text or tokenized text in XML
Output: TIGER-XML oriented format

Thank you for your attention!

Architecture

Admin Panel

Using Services (as a client)
First, connect to the server:
    import xmlrpclib  # Python 2 module; in Python 3 use xmlrpc.client
    server = xmlrpclib.ServerProxy("http://194.95.188.36:8050")
The server gives the client a session_id:
    session_id = server.dwds.login("jantenner", "test123")
The session_id expires after 15 minutes.
A service can then be used via a function call (further arguments omitted):
    print(server.dwds.processor.lts.tomata.analyse(session_id, ...))
    print(server.dwds.resource.kerncorpus.query(session_id, ...))