Berlin-Brandenburg Academy of Sciences and Humanities (BBAW): resources / services

Berlin-Brandenburg Academy of Sciences and Humanities (BBAW): resources / services
Speakers: Kai Zimmer and Jörg Didakowski
Clarin Workshop WP2, February 2009

BBAW/DWDS
The BBAW and its 40 long-term projects offer many resources:
- Digital Dictionary of the German Language (DWDS): corpora, dictionaries, and a "language information platform"; the project also develops natural language processing tools and a search engine
- German Text Archive (DTA): texts from the 14th to the 19th century in an "active archive"
- TELOTA: technical service section for projects at the academy

BBAW/DWDS
DWDS publicly available corpora (via web interface):
- German reference corpus (balanced over categories and decades)
- newspapers: Die Zeit (updated daily, 1946 to present), Berliner Zeitung, Tagesspiegel, Potsdamer Neueste Nachrichten (PNN)
- spoken language corpus
- historical corpora
- Jewish periodicals (by Compact Memory)

DDC
DWDS uses ddc-concordance (OSS, LGPL) as an online corpus search engine. Features:
- statistical queries (exact, not approximations)
- regular expression, phrase, distance, and truncation (left/right) search
- sentence- or document-based search
- search for word forms (for English, German, and Russian)
- indexing of metadata and annotations
- document relevance ranking
- fast
- scalable to huge corpora and high load, thanks to its clustering architecture
- clients for Python, Perl, PHP, and C++ (the network protocol is easy to implement in other programming languages)

DDC
DDC's query language is fully available through the XML-RPC service.

DDC/C4
The clustering architecture of the DDC search engine is primarily used for performance and scaling. It also makes it possible to connect separate corpora hosted in different places, as in the C4 project (similar to Dieter's DAM-LR EU project).

DDC/C4
The C4 project consists of four participants:
- Austrian Academy Corpus (AAC, Vienna)
- Swiss Text Corpus (Basel)
- Corpus South Tyrol (Italy)
- Berlin corpus (DWDS/BBAW, Germany)
Each participating country contributes a balanced subcorpus of about 20 million tokens to a "shared" corpus. Results of a search query are sorted and merged by DDC. Authentication is handled by simple MySQL databases.
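The merging step can be pictured with a small sketch. It is not DDC's implementation: the hit tuples, the numeric sort key, and the corpus labels below are hypothetical, and the sketch only assumes that each participating node returns its hits already sorted by the same key.

```python
import heapq

# Hypothetical hits: (sort_key, corpus, text). Each node's list is already sorted.
aac_hits = [(1, "AAC", "hit a"), (4, "AAC", "hit d")]
dwds_hits = [(2, "DWDS", "hit b"), (3, "DWDS", "hit c")]

# heapq.merge combines sorted inputs into one sorted stream without re-sorting everything.
merged = list(heapq.merge(aac_hits, dwds_hits, key=lambda hit: hit[0]))

for sort_key, corpus, text in merged:
    print(sort_key, corpus, text)
```

Because every input is pre-sorted, the merge only compares the current head of each list, which keeps the combining step cheap even with many nodes.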

On to Jörg's presentation about our XML-RPC services...

Web Services
- The web services are currently for internal use in our project network.
- They allow efficient and easy access to textual resources and language processing tools.
- The web services for language processing tools are based on XML-RPC.
- The web services for textual resources are based on DDC.
- An XML-RPC-based service repository manages the services.

XML-RPC
XML-RPC is a remote procedure call protocol that works over the Internet. An XML-RPC message is an HTTP POST request whose body is XML. The procedure is executed on the server, and the value it returns is also formatted in XML.
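To make the message format concrete, the following sketch builds an XML-RPC request body and decodes a response body with Python's standard `xmlrpc.client` module, without contacting any server. The method name `example.analyse` and its arguments are placeholders chosen for illustration, not one of the services described here.

```python
import xmlrpc.client

# Build the XML body that would be sent in the HTTP POST request.
# Method name and arguments are illustrative placeholders.
request_body = xmlrpc.client.dumps(
    ("example-session-id", "Ein Beispielsatz."),
    methodname="example.analyse",
)
print(request_body)

# A server's reply is XML as well; here one is built locally and decoded again.
response_body = xmlrpc.client.dumps(
    (["Ein", "Beispielsatz", "."],),
    methodresponse=True,
)
params, method = xmlrpc.client.loads(response_body)
print(params[0])
```

In normal use a client library hides both steps behind a proxy object, so the caller only sees an ordinary function call.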

Service Repository
User administration (based on a MySQL database):
- authorization management
- granular configuration
- individual unlocking of services
- time-sensitive authorization
Service administration (based on a MySQL database):
- integration of services
- IP address and port
- service name, version, description, ID, maintainer, etc.
- logging information

Language Processing Tools
- They are for German.
- Most of them are based on finite-state techniques.
- Most of them are rule-based.
- They are implemented in C and C++.

ToMaSoTaTh
Combines different tasks:
- Tokenizing
- Morphological analysis (TAGH Morphology)
- TextToSound (using SAMPA)
- Tagging (moot)
- Thesaurus (Lexikonet)
The components can be applied individually.
Input: plain text
Output: one token per line with tab-separated information
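A client might read this one-token-per-line output as follows. This is a minimal sketch: the slide does not specify which columns are returned or in what order, so the sample string and the field names in it are made up for illustration.

```python
def parse_token_lines(output):
    """Split one-token-per-line output into (token, [fields...]) pairs."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token, *fields = line.split("\t")
        rows.append((token, fields))
    return rows

# Hypothetical sample; real column contents depend on the components applied.
sample = "Das\tfield1\tfield2\nHaus\tfield1\tfield2\n"
rows = parse_token_lines(sample)
for token, fields in rows:
    print(token, fields)
```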

Meinten Sie (Did you mean?)
This tool computes corrections for typos, based on an edit distance that is precompiled over a word list.
Input: token
Output: token list (proposals)
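The idea of ranking proposals by edit distance can be sketched as below. Note the assumptions: this uses plain Levenshtein distance computed at query time, whereas the tool described above precompiles the distance over its word list; the word list, the `max_distance` threshold, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def suggest(token, word_list, max_distance=1):
    """Return words within max_distance of token, closest first."""
    scored = [(edit_distance(token, word), word) for word in word_list]
    return [word for distance, word in sorted(scored) if distance <= max_distance]

word_list = ["Haus", "Maus", "Hans", "Baum"]  # hypothetical word list
proposals = suggest("Hais", word_list)
print(proposals)
```

Scanning the whole list per query is fine for a demonstration but slow for a large lexicon, which is presumably why the actual tool precompiles the distance.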

SynCoP
Grammar/specification-driven parser
Implemented systems:
- (partial) dependency parsing
- named entity recognition and classification (person names, location names, organization names)
Input: plain text or tokenized text in XML
Output: TIGER-XML-oriented format

Thank you for your attention!

Architecture

Admin Panel

Using Services (as a client)
First, connect to the server:
    import xmlrpclib
    server = xmlrpclib.ServerProxy("http://194.95.188.36:8050")
The server gives the client a session_id:
    session_id = server.dwds.login("jantenner", "test123")
The session_id expires after 15 minutes.
A service can then be used via a function call:
    print server.dwds.processor.lts.tomata.analyse(session_id, ...)
    print server.dwds.resource.kerncorpus.query(session_id, ...)