DTA-/CLARIN-D-Konferenz "Historische Textkorpora für die Geistes- und Sozialwissenschaften". Title: Insights into Six Decades of Scientific Practice. Speaker / Coauthors: Gerhard Heyer, NLP chair (heyer@informatik.uni-leipzig.de); Thomas Efer, researcher at the NLP group (efer@informatik.uni-leipzig.de); Jens Blecher, head of the university archive (blecher@uni-leipzig.de). Date: 18 Feb. 2013
Overview: 1. Leipzig University Archive; 2. Rektoratsreden Corpus; 3. NLP Processing; 4. What's the point?
History: tightly intertwined with the university's origins; archive founded in 1409 (within the first statutes); under the responsibility of the head of the university (the Rector); tasks now defined by state law; frequented by about 800 researchers every year
Inventory Quantity: 140 million sheets of paper; 7 km of shelf space; 1500 new (physical) files each month; 800 GB, 50 000 digital files
Inventory Quality: matriculation lists; personnel files; bursary files; administrative publications; rare items and curiosities
Digital archive: inventory described by 1.2 million database entries; only 5% of all documents digitized and available online; online research portal improved efficiency and usage; further means of accessing data (e.g. given-name statistics); infrastructural cooperation across several archives
Digitized Corpora: university newspaper corpora from the GDR era, scanned and OCRed: the official university newspaper (1957-1991) and the science-related newspaper "Wissenschaftliche Zeitschrift" (1951-1991); Rektoratsreden (rectorate speeches)
The Speeches: yearly transfer of administrative power from the rector to an elected successor; Jahresbericht: annual report (important news, events, faculty changes, ...); Antrittsrede: inaugural speech (introduction, science communication, ...)
The Prints (original written source, no transcription)
The edition process: started in 2004 (in preparation for the 600th anniversary); scanning 2300 images, OCRing and error-correcting; 123 speeches, 1871-1933; no language normalization, no commentary; two-volume edition (de Gruyter, preview at Google Books); 6500 glossary entries (people, places, ...), which took more than 2/3 of the edition time!
Corpus Characteristics: 720 000 running words in about 5.1 MB of plain text; 6 decades, 2 text types (in part rich in scientific terms); historical rather than contemporary language
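Figures like the running-word count and the plain-text size above can be reproduced with a few lines of code. The following is only a minimal sketch: the tokenization rule (runs of letters, optionally joined by hyphens) and the function name `corpus_stats` are assumptions for illustration, not the counting method used for the edition.

```python
import re

# Assumed token definition: runs of letters, optionally joined by hyphens.
WORD = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def corpus_stats(text):
    """Return simple size figures for a plain-text corpus."""
    words = WORD.findall(text)
    return {
        "running_words": len(words),                        # tokens
        "distinct_words": len({w.lower() for w in words}),  # types, case-folded
        "megabytes": len(text.encode("utf-8")) / 1_000_000,
    }

if __name__ == "__main__":
    # Invented sample sentence, not taken from the corpus.
    print(corpus_stats("Der Rector dankt dem Rector."))
```

A different tokenization rule (for example, counting numbers or abbreviations as words) would shift the totals slightly, so counts from different tools are only roughly comparable.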
Setting and Goals: explorative approach (towards visual analytics); low-budget (free-time project); small time budget, leading to the use of standard tools and only simple NLP methods (so that non-computer-scientists can understand the functionality); no real evaluation
Setup: text extraction from PDF; Named Entity Extraction (NER) using the ANNIE processing chain from GATE; complex operations (such as co-reference resolution) are skipped (mainly because of sub-optimal POS tagging); the lists of names (gazetteer) originate from archive sources!
Setup: the digitized lists match the time and place of the documents!
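The idea of gazetteer-driven extraction of full names can be illustrated with a small stand-alone sketch. This is plain Python, not ANNIE and not its transducer rules; the two tiny name lists, the function name `extract_full_names`, and the sample sentence are hypothetical stand-ins for the archive-derived lists.

```python
import re

# Hypothetical miniature gazetteers; the real lists come from archive sources.
GIVEN_NAMES = {"Wilhelm", "Moritz"}
SURNAMES = {"Wundt", "Arndt"}

# Tokens are runs of letters; punctuation is ignored in this sketch,
# so matches can in principle span a sentence boundary.
TOKEN = re.compile(r"[^\W\d_]+")

def extract_full_names(text):
    """Return names made of one or more given names followed by a surname."""
    tokens = TOKEN.findall(text)
    names = []
    i = 0
    while i < len(tokens):
        j = i
        # Consume a run of given names starting at position i.
        while j < len(tokens) and tokens[j] in GIVEN_NAMES:
            j += 1
        if j > i and j < len(tokens) and tokens[j] in SURNAMES:
            names.append(" ".join(tokens[i:j + 1]))
            i = j + 1
        else:
            i += 1
    return names

if __name__ == "__main__":
    # Invented sentence for illustration only.
    sample = "College Geheimer Rath Wundt sprach über Wilhelm Wundt und Moritz Arndt."
    print(extract_full_names(sample))
```

Because a surname only counts when a given name precedes it, a title-only mention such as "Geheimer Rath Wundt" yields no person in this sketch.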
Results: 2400 people's names extracted (quality can be improved by fiddling with the transducer rules); only full names extracted ("College Geheimer Rath Wundt": no "Wilhelm", no person); imperfect but usable: who is "Ann Arbor"? (imperfect = really dirty tricks: "Moritz Arndt" because of "Ernst des Lebens"); name variations also occur in print, so post-processing is necessary anyway; a graph structure emerges from the co-occurrence of people mentioned in speeches: a social network?
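How a graph can emerge from co-occurrence is sketched below, under the assumption that two people are linked whenever they are mentioned in the same speech and that the edge weight counts the speeches they share. The function name `cooccurrence_edges`, the speech identifiers, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(names_per_speech):
    """Count, for each pair of names, the speeches that mention both."""
    edges = Counter()
    for names in names_per_speech.values():
        # Deduplicate within a speech and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges

if __name__ == "__main__":
    # Invented example data: speech id -> extracted names.
    data = {
        "speech_1": ["Wilhelm Wundt", "Moritz Arndt"],
        "speech_2": ["Moritz Arndt", "Wilhelm Wundt", "Wilhelm Wundt"],
        "speech_3": ["Wilhelm Wundt"],
    }
    for (a, b), weight in cooccurrence_edges(data).items():
        print(a, "--", b, "weight", weight)
```

Name variants would need to be merged before this step, since every spelling otherwise becomes its own node.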
Efficient indexing: work towards novel navigational means for digital editions; less time for manual index creation means lower hurdles for editions of digitized documents; make more documents accessible within limited budgets
Leveraging studies based on historical corpora: interdisciplinary training scenario, involve students; as texts get older, more linguistic knowledge is needed in automatic analyses
Outlook: improve and extend the NLP system; include an evaluation; use the corpus as a test scenario for a concept-based corpus browser (project exchange); interconnect with other knowledge sources
Call for Collaboration: archives can benefit from digital corpora expertise; the humanities can benefit from archival resources (corpora, lists); more interdisciplinary work benefits everyone; small projects interwoven into curricular courses can spark interest in students; research infrastructure and archive infrastructure should be interoperable (communication needed)
Thanks for your attention. Questions, please!