An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis

Clemens Neudecker, Mustafa Dogan, Sven Schlarb (IMPACT)
Paolo Missier, Shoaib Sufi, Alan Williams, Katy Wolstencroft (myGrid)
International Workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011

Background
IMPACT: Improving Access to Text (2008-2011)
- Innovate OCR technology
IMPACT Centre of Competence (from 2011)
- Capacity building in mass digitisation
From a technical perspective:
- More than 20 software toolkits for solving specific issues
- Prototyping new algorithms
- Various technologies
One ring to rule them all: IMPACT Interoperability Framework (IIF)

Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability
Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent

Architecture
IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Maven2
- Apache Tomcat
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Interoperability Framework: Dataset
- More than 500,000 images from digital libraries
- More than 25,000 ground truth transcriptions

Tool integration
- Easy-to-use generic command line wrapper
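
The slides do not show how the generic wrapper is built, so the following is only a minimal sketch of the idea: a tool is described by a command template, and a single function fills in the parameters and runs it. The names `run_tool`, `ToolResult` and `UPPERCASE_TOOL` are hypothetical, and the Python interpreter stands in for a real OCR command line tool.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    returncode: int
    stdout: str
    stderr: str


def run_tool(command_template, **params):
    """Fill a command template with parameters and run it as a subprocess."""
    # Each template element is formatted on its own, so a parameter value
    # containing spaces stays a single argument and no shell is involved.
    argv = [part.format(**params) for part in command_template]
    completed = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    return ToolResult(completed.returncode, completed.stdout, completed.stderr)


# Hypothetical tool description: the Python interpreter stands in for a
# command line tool that reads one argument and prints a result.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; print(sys.argv[1].upper())",
    "{text}",
]

result = run_tool(UPPERCASE_TOOL, text="historical document")
print(result.returncode, result.stdout.strip())
```

Under this assumption, integrating another tool means writing a new template rather than new wrapper code, which is one way to read the "minimize integration effort" requirement.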

Workflow development
- OCR workflow = data pipeline
- Building blocks = processing steps (nodes)
- Integration = interaction between nodes (mashup)
Collaboration with
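
The view of a workflow as a data pipeline of nodes can be sketched in a few lines. This is an illustration only, not Taverna code: the node functions `binarise`, `recognise` and `count_words` are hypothetical placeholders, and the recognised text is a fixed dummy value.

```python
from functools import reduce


def binarise(page):
    # Stand-in for an image binarisation step.
    return {**page, "binarised": True}


def recognise(page):
    # Stand-in for an OCR engine; the text is a fixed placeholder.
    return {**page, "text": "placeholder text"}


def count_words(page):
    # A simple analysis step that consumes the OCR output.
    return {**page, "words": len(page["text"].split())}


def run_workflow(nodes, data):
    """Pass data through each node in order; each output is the next input."""
    return reduce(lambda value, node: node(value), nodes, data)


workflow = [binarise, recognise, count_words]
result = run_workflow(workflow, {"image": "page-0001.tif"})
print(result)
```

Because every node takes and returns the same kind of record, nodes can be reordered or swapped without changing the others, which is the sense in which integration reduces to the interaction between nodes.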

Workflow management
- Web 2.0 style registry: myExperiment
- Local client: Taverna Workbench
- Web client: project website

Compute cluster
Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
- Process parallelization
- Load distribution
- Fail over
Processing times improve by 0.56 per additional endpoint
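
Load distribution with fail over can be illustrated with a small round-robin dispatcher. This sketch is not Apache Synapse configuration and makes no claim about how the IIF cluster is set up; `Dispatcher`, `EndpointDown` and the three worker functions are hypothetical, and requests are handled sequentially rather than in parallel to keep the example short.

```python
from itertools import cycle


class EndpointDown(Exception):
    """Raised by a worker endpoint that cannot handle a request."""


class Dispatcher:
    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        # Rotating index so that consecutive requests start at different workers.
        self._order = cycle(range(len(self.endpoints)))

    def dispatch(self, request):
        # Try each endpoint at most once, starting at the next one in rotation.
        for _ in range(len(self.endpoints)):
            endpoint = self.endpoints[next(self._order)]
            try:
                return endpoint(request)
            except EndpointDown:
                continue  # fail over to the next endpoint
        raise RuntimeError("no endpoint available")


def worker_a(request):
    return f"A:{request}"


def worker_b(request):
    # Simulates a worker node that is offline.
    raise EndpointDown("worker B is offline")


def worker_c(request):
    return f"C:{request}"


dispatcher = Dispatcher([worker_a, worker_b, worker_c])
results = [dispatcher.dispatch(f"page-{i}") for i in range(4)]
print(results)
```

Requests that would have gone to the offline worker are picked up by the next one in the rotation, so callers never see the failure unless every endpoint is down.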

Dataset
Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities

Evaluation features
- Text-based comparison of result with ground truth, using the Levenshtein distance method
- Layout-based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework
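
The text-based comparison can be sketched with the standard dynamic-programming form of the Levenshtein distance. The `character_accuracy` score below is an assumption made for illustration, not a metric defined in the slides, and the sample strings are invented.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Hypothetical score: 1 - distance / len(ground_truth), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))


ground_truth = "The quick brown fox"
ocr_output = "Tbe quick hrown fox"  # two misrecognised characters

distance = levenshtein(ocr_output, ground_truth)
accuracy = character_accuracy(ocr_output, ground_truth)
print(distance, round(accuracy, 4))
```

Here the two substituted characters give a distance of 2 against a 19-character ground truth line.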

Community
Web 2.0 style workflow registry: share, rate, comment, tag, ...
- Community of experts
- Sharing of resources and results
- Knowledge exchange
- Online environment for users and researchers

Summary
Benefits:
- Availability of resources (images, ground truth and services) to the international research community
- A common framework for transparent evaluation
- Sharing of results and know-how
- Enable new research through scalable computing
- Cross-domain collaboration
Thank you! Questions?