Insights into Six Decades of Scientific Practice


DTA-/CLARIN-D-Konferenz: Historische Textkorpora für die Geistes- und Sozialwissenschaften

Title: Insights into Six Decades of Scientific Practice
Speaker / Coauthors:
- Gerhard Heyer, NLP chair (heyer@informatik.uni-leipzig.de)
- Thomas Efer, researcher at the NLP group (efer@informatik.uni-leipzig.de)
- Jens Blecher, head of the university archive (blecher@uni-leipzig.de)
Date: 18 Feb. 2013

Overview
1. Leipzig University Archive
2. Rektoratsreden Corpus
3. NLP Processing
4. What's the point?

History
- tightly intertwined with the origin of the university
- archive founded in 1409 (within the first statutes)
- under the responsibility of the head of the university (the Rector)
- tasks now defined by state law
- frequented by about 800 researchers every year

Inventory Quantity
- 140 million sheets of paper
- 7 km of shelf space
- 1500 new (physical) files each month
- 800 GB, 50 000 digital files

Inventory Quality
- matriculation lists
- personnel files
- bursary files
- administrative publications
- rare items and curiosities

Digital archive
- inventory described by 1.2 million database entries
- only 5% of all documents digitized and available online
- online research portal improved efficiency and usage
- further means of accessing data (e.g. given name statistics)
- infrastructural cooperation across several archives

Digitized Corpora
- university newspaper corpora from the GDR era, scanned and OCRed
  - official university newspaper (1957-1991)
  - science-related newspaper "Wissenschaftliche Zeitschrift" (1951-1991)
- Rektoratsreden (rectors' speeches)

The Speeches
- yearly transfer of administrative power from the rector to an elected successor
- Jahresbericht: annual report (important news, events, faculty changes, ...)
- Antrittsrede: inaugural speech (introduction, science communication, ...)

The Prints (original written source, no transcription)

The edition process
- started in 2004 (in preparation for the 600th anniversary)
- scanning 2300 images, OCRing and error-correcting
- 123 speeches, 1871-1933
- no language normalization, no commentary
- 2-volume edition (de Gruyter, preview at Google Books)
- 6500 glossary entries (people, places, ...), which took more than 2/3 of the edition time!

Corpus Characteristics
- 720 000 running words in about 5.1 MB of plain text
- 6 decades, 2 text types (in part with many scientific terms)
- no contemporary language

Setting and Goals
- exploratory approach (towards visual analytics)
- low-budget (free-time project)
- small time budget, leading to the use of standard tools and only simple NLP methods (so that non-computer-scientists can understand the functionality)
- omit real evaluation

Setup
- text extraction from PDF
- Named Entity Extraction (NER) using the ANNIE processing chain from [...]
- complex operations (such as co-reference resolution) are skipped (mainly because of sub-optimal POS tagging)
- lists of names (gazetteer) originate from archive sources!

Setup
- digitized lists match time and place of the documents!
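
The gazetteer idea from the setup can be sketched in a few lines of plain Python. This is not the ANNIE chain itself, only a minimal illustration of looking up full names from a list. The function names, the three gazetteer entries, and the sample string are assumptions made up for this example.

```python
import re


def build_gazetteer_pattern(names):
    """Compile one regex that matches any full name from the gazetteer."""
    # Longest names first, so a longer entry wins over a shorter one
    # that starts at the same position.
    ordered = sorted(set(names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Lookarounds keep matches on word boundaries (umlauts count as word characters).
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")


def extract_names(text, pattern):
    """Return (start offset, matched name) for every gazetteer hit."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


# Hypothetical gazetteer entries and sample text, for illustration only.
GAZETTEER = ["Wilhelm Wundt", "Ernst Moritz Arndt", "Moritz Arndt"]
SAMPLE = "College Geheimer Rath Wundt; Wilhelm Wundt; Ernst Moritz Arndt."

if __name__ == "__main__":
    pattern = build_gazetteer_pattern(GAZETTEER)
    for offset, name in extract_names(SAMPLE, pattern):
        print(offset, name)
```

In this sketch a bare surname such as "Wundt" after a title is not found, because only full gazetteer entries are matched.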

Results
- 2400 people's names extracted (quality can be improved by fiddling with the transducer rules)
- only full names extracted ("College Geheimer Rath Wundt": no "Wilhelm", no person)
- imperfect but usable: who is "Ann Arbor"?
- (imperfect = really dirty tricks: "Moritz Arndt" because of "Ernst des Lebens")
- name variations also occur in print, so post-processing is necessary anyway
- graph structure emerges from co-occurrence of people mentioned in speeches: a social network?
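
How a graph can emerge from co-occurrence is easy to sketch: two people are linked whenever they are mentioned in the same speech, and the link weight is the number of speeches they share. The code below is a minimal illustration under that assumption, not the project's actual implementation. The function name, the speech identifiers, and the placeholder names are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(names_per_speech):
    """Count, for each pair of names, the number of speeches mentioning both."""
    edges = Counter()
    for names in names_per_speech.values():
        # Each unordered pair counts once per speech; sorting gives a stable key.
        for pair in combinations(sorted(set(names)), 2):
            edges[pair] += 1
    return edges


# Hypothetical extraction output: speech id -> names found in that speech.
SPEECHES = {
    "speech_a": ["Person A", "Person B", "Person C"],
    "speech_b": ["Person A", "Person B"],
    "speech_c": ["Person C"],
}

if __name__ == "__main__":
    for (left, right), weight in cooccurrence_edges(SPEECHES).most_common():
        print(f"{left} -- {right}: {weight}")
```

Name variants would have to be mapped to one canonical form before counting; otherwise the same person appears as several nodes.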

Efficient indexing
- work towards novel navigational means for digital editions
- less time for manual index creation means lower hurdles for editions of digitized documents
- make more documents accessible within limited budgets

Leveraging studies based on historical corpora
- interdisciplinary training scenario, involve students
- as texts get older, more linguistic knowledge is needed in automatic analyses

Outlook
- improve and extend the NLP system
- include an evaluation
- use the corpus as a test scenario for a concept-based corpus browser (project exchange)
- interconnect with other knowledge sources

Call for Collaboration
- archives can benefit from digital corpora expertise
- humanities can benefit from archival resources (corpora, lists)
- more interdisciplinary work benefits everyone
- small projects interwoven into curricular courses can spark interest in students
- research infrastructure and archive infrastructure should be interoperable (communication needed)

Thanks for your attention.
Questions, please!