Berlin-Brandenburg Academy of Sciences and Humanities (BBAW): resources / services
Speakers: Kai Zimmer and Jörg Didakowski
Clarin Workshop WP2 February 2009
BBAW/DWDS
The BBAW and its 40 long-term projects offer many resources:
- Digital Dictionary of the German Language (DWDS): corpora, dictionaries and a `language information platform`; the project also develops natural language processing tools and a search engine
- German Text Archive (DTA): texts from the 14th to the 19th century in an `active archive`
- TELOTA: technical service section for projects at the academy
BBAW/DWDS
DWDS publicly available corpora (via web interface):
- German reference corpus (balanced over categories and decades)
- newspapers: Die Zeit (updated daily, 1946 - current), Berliner Zeitung, Tagesspiegel, Potsdamer Neueste Nachrichten (PNN)
- spoken language corpus
- historical corpora
- Jewish periodicals (by Compact Memory)
DDC
DWDS uses ddc-concordance (OSS, LGPL) as its online corpus search engine. Features:
- statistical queries (exact, not approximations)
- regular expression, phrase, distance and truncation (left/right) search
- sentence-based or document-based search
- search for word forms (for English, German and Russian)
- indexing of metadata and annotations
- document relevance ranking
- it is fast
- scalable to huge corpora and heavy load, thanks to its clustering architecture
- clients for Python, Perl, PHP and C++ (the network protocol is easy to implement in other programming languages)
DDC
DDC's query language is fully available through the XML-RPC service.
DDC/C4
The clustering architecture of the ddc search engine is primarily used for performance and scaling. It also makes it possible to connect separate corpora hosted at different sites, as in the C4 project (similar to Dieter's DAM LR EU project):
DDC/C4
The C4 project has four participants:
- Austrian Academy Corpus (AAC, Vienna)
- Swiss Text Corpus (Basel)
- Corpus Southtirol (Italy)
- Berlin corpus (DWDS/BBAW, Germany)
Each participating country contributes a balanced subcorpus of ~20 million tokens to a `shared` corpus. The results of a search query are sorted and merged by ddc. Authentication is handled by simple MySQL databases.
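The sort-and-merge step can be illustrated with a minimal sketch. It is not ddc's actual implementation: it only assumes that every site returns a hit list that is already sorted, and it uses a standard k-way merge from the Python standard library. The `Hit` fields, the sort key (`date`) and the sample data, including the corpus label `CH`, are made up for illustration.

```python
import heapq
import itertools
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    date: str    # hypothetical sort key, e.g. year of publication
    corpus: str  # label of the subcorpus that produced the hit
    text: str    # the matching sentence


def merge_hits(per_corpus: Iterable[List[Hit]], limit: int) -> List[Hit]:
    """Merge already-sorted hit lists into one sorted list of at most `limit` hits."""
    merged = heapq.merge(*per_corpus, key=lambda hit: hit.date)
    return list(itertools.islice(merged, limit))


# Made-up result lists, one per participating site, each sorted by date.
berlin = [Hit("1905", "DWDS", "..."), Hit("1950", "DWDS", "...")]
vienna = [Hit("1920", "AAC", "..."), Hit("1990", "AAC", "...")]
basel = [Hit("1910", "CH", "...")]

top = merge_hits([berlin, vienna, basel], limit=4)
print([(hit.date, hit.corpus) for hit in top])
```

Because `heapq.merge` is lazy, only as many hits as requested are pulled from the per-site lists, which matters when each site could return a very large result set.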
On with Jörg's presentation about our XML-RPC services...
Web Services
- The web services are currently for internal use in our project network.
- They allow efficient and easy access to textual resources and language processing tools.
- The web services for language processing tools are based on XML-RPC.
- The web services for textual resources are based on DDC.
- An XML-RPC based service repository manages the services.
XML-RPC
- XML-RPC is a Remote Procedure Calling protocol that works over the Internet.
- An XML-RPC message is an HTTP POST request.
- The body of the request is in XML.
- A procedure is executed on the server, and the value it returns is formatted in XML, too.
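To make the message format concrete, here is a small sketch using Python's standard `xmlrpc.client` module (the Python 3 name of `xmlrpclib`). It builds the XML body of a request and of a response and decodes both again, without contacting any server. The method name `example.echo` and its argument are made up for illustration and are not one of the BBAW services.

```python
import xmlrpc.client

# Build the XML body that a client would send in the HTTP POST request.
# "example.echo" is a hypothetical method name.
request_xml = xmlrpc.client.dumps(("Haus",), methodname="example.echo")
print(request_xml)

# Decode the request body again, as a server would on receipt.
params, method = xmlrpc.client.loads(request_xml)

# Build the XML body of a response carrying a single return value,
# then decode it the way a client would.
response_xml = xmlrpc.client.dumps(("Haus",), methodresponse=True)
result, _ = xmlrpc.client.loads(response_xml)
print(response_xml)
```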
Service Repository
User administration (based on a MySQL database):
- Authorization management
- Granular configuration
- Individual unlocking of services
- Time-sensitive authorization
Service administration (based on a MySQL database):
- Integration of services
- IP address and port
- Service name, version, description, ID, maintainer, etc.
- Logging information
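As a rough illustration of how per-service unlocking and time-sensitive authorization could fit together, the sketch below models a service record and a user record in memory. It is only a sketch under assumptions: the real repository stores this data in MySQL, and all class names, field names and sample values here are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict


@dataclass
class Service:
    # Hypothetical record mirroring the listed metadata.
    service_id: str
    name: str
    version: str
    host: str
    port: int
    maintainer: str


@dataclass
class User:
    login: str
    # service_id -> expiry time of the grant; a missing key means the
    # service has not been unlocked for this user.
    grants: Dict[str, datetime] = field(default_factory=dict)


def is_authorized(user: User, service: Service, now: datetime) -> bool:
    """True if the service is unlocked for the user and the grant has not expired."""
    expiry = user.grants.get(service.service_id)
    return expiry is not None and now < expiry


# Made-up sample data.
tagger = Service("svc-1", "tagger", "1.0", "192.0.2.10", 8050, "maintainer@example.org")
alice = User("alice", {"svc-1": datetime(2009, 3, 1)})

print(is_authorized(alice, tagger, datetime(2009, 2, 15)))  # grant still valid
print(is_authorized(alice, tagger, datetime(2009, 3, 2)))   # grant expired
```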
Language Processing Tools
- They are for German.
- Most of them are based on finite-state techniques.
- Most of them are rule-based.
- They are implemented in C and C++.
ToMaSoTaTh
ToMaSoTaTh combines different tasks:
- Tokenizing
- Morphological analysis (TAGH Morphology)
- TextToSound (using SAMPA)
- Tagging (moot)
- Thesaurus (Lexikonet)
The components can be applied individually.
Input: plain text
Output: one token per line with tab-separated information
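A client can consume this output format with a few lines of code. The sketch below only relies on what the slide states (one token per line, fields separated by tabs). The sample text, the placeholder field values and the assumption that the token itself is the first column are made up for illustration.

```python
from typing import List


def parse_token_lines(output: str) -> List[List[str]]:
    """Split tool output into one list of tab-separated fields per token."""
    rows = []
    for line in output.splitlines():
        if line.strip():  # skip blank lines
            rows.append(line.split("\t"))
    return rows


# Made-up sample; the real column layout is not specified here.
sample = "Das\tfield-a\tfield-b\nHaus\tfield-a\tfield-b\n"

rows = parse_token_lines(sample)
tokens = [row[0] for row in rows]  # assumes the token is the first column
print(tokens)
```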
Meinten Sie (Did you mean?)
This tool calculates corrections of typos (based on edit distance, which is precompiled over a word list).
Input: token
Output: token list (proposals)
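The idea of edit-distance-based proposals can be sketched as follows. Note the difference to the tool described above: this sketch computes the Levenshtein distance against every entry of the word list at query time, whereas the tool precompiles the edit distance over its word list. The function names, the threshold and the tiny word list are made up for illustration.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def proposals(token: str, word_list: List[str], max_distance: int = 1) -> List[str]:
    """Return the words within `max_distance` edits of `token`, closest first."""
    scored = [(edit_distance(token, word), word) for word in word_list]
    return [word for distance, word in sorted(scored) if distance <= max_distance]


# Made-up word list.
words = ["Haus", "Maus", "Hase", "Baum"]
print(proposals("Hauz", words))
```

A brute-force scan like this costs one distance computation per word and query, which is why a precompiled structure is the better choice for a large lexicon.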
SynCoP
Grammar/specification-driven parser
Implemented systems:
- (partial) dependency parsing
- named entity recognition and classification (person names, location names, organization names)
Input: plain text / tokenized text in XML
Output: TIGER-XML oriented format
Thank you for your attention!
Architecture
Admin Panel
Using Services (as a client)
First, connect to the server:
    import xmlrpclib
    server = xmlrpclib.ServerProxy("http://194.95.188.36:8050")
The server gives the client a session_id:
    session_id = server.dwds.login("jantenner", "test123")
The session_id expires after 15 minutes.
Then a service can be used via a function call:
    print server.dwds.processor.lts.tomata.analyse(session_id, ...)
    print server.dwds.resource.kerncorpus.query(session_id, ...)
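The login-then-call pattern can be tried without access to the project network. The following sketch starts a toy XML-RPC server on localhost and talks to it with Python 3's `xmlrpc.client` (the successor of `xmlrpclib`). Everything on the server side is a stand-in: the method names `demo.login` and `demo.analyse`, the credentials and the whitespace "analysis" are hypothetical and do not reproduce the BBAW services.

```python
import threading
import uuid
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

sessions = set()  # session ids handed out by the toy server


def login(user, password):
    # Toy credential check; a real repository would consult its user database.
    if (user, password) != ("demo", "secret"):
        raise ValueError("login failed")
    session_id = uuid.uuid4().hex
    sessions.add(session_id)
    return session_id


def analyse(session_id, text):
    # Every service call must present a valid session id.
    if session_id not in sessions:
        raise ValueError("invalid session")
    return text.split()  # stand-in for a real analysis


# Port 0 lets the operating system pick a free port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(login, "demo.login")
server.register_function(analyse, "demo.analyse")
threading.Thread(target=server.serve_forever, daemon=True).start()

host, port = server.server_address
client = xmlrpc.client.ServerProxy("http://%s:%d" % (host, port))

session_id = client.demo.login("demo", "secret")
result = client.demo.analyse(session_id, "Das ist ein Test")
print(result)

# A call with an unknown session id comes back as an XML-RPC fault.
try:
    client.demo.analyse("not-a-session", "Test")
    rejected = False
except xmlrpc.client.Fault:
    rejected = True

server.shutdown()
server.server_close()
```

The toy server never expires sessions; a client of a service with a 15-minute session lifetime would additionally need to log in again once a call is rejected.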