An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis
Clemens Neudecker, Mustafa Dogan, Sven Schlarb (IMPACT)
Paolo Missier, Shoaib Sufi, Alan Williams, Katy Wolstencroft (myGrid)
International Workshop on Historical Document Imaging and Processing, Beijing, 17 September 2011

Background
IMPACT: Improving Access to Text (2008-2011)
- Innovate OCR technology
IMPACT Centre of Competence (2011 onwards)
- Capacity building in mass digitisation
From a technical perspective:
- More than 20 software toolkits for solving specific issues
- Prototyping new algorithms
- Various technologies
One ring to rule them all: the IMPACT Interoperability Framework (IIF)

Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability
Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent

Architecture
IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Maven2
- Apache Tomcat
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Interoperability Framework: Dataset
- More than 500,000 images from digital libraries
- More than 25,000 ground truth transcriptions

Tool integration
Easy-to-use generic command-line wrapper

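The idea of a generic command-line wrapper can be illustrated with a minimal Python sketch. It is not the IIF wrapper itself (the framework is Java based and its tool description format is not shown on this slide). The names `build_command`, `run_tool` and `UPPERCASE_TOOL` are hypothetical, and the "tool" is a stand-in that calls the Python interpreter so that the example runs anywhere.

```python
import subprocess
import sys


def build_command(template, params):
    """Fill the {placeholders} of a command template with concrete values."""
    return [part.format(**params) for part in template]


def run_tool(template, params, timeout=60):
    """Run a wrapped command-line tool and return its exit code and output."""
    completed = subprocess.run(
        build_command(template, params),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Hypothetical tool description: a stand-in "tool" that upper-cases its
# argument. A real description would name an OCR or image processing binary.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; print(sys.argv[1].upper())",
    "{text}",
]

if __name__ == "__main__":
    print(run_tool(UPPERCASE_TOOL, {"text": "ocr"})["stdout"])
```

Keeping the command as a list of arguments, rather than one shell string, avoids quoting problems when file names contain spaces.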
Workflow development
- OCR workflow = data pipeline
- Building blocks = processing steps (nodes)
- Integration = interaction between nodes (mashup)
Collaboration with
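
The view of a workflow as a data pipeline built from processing nodes can be sketched in a few lines of Python. This is an illustration only, not a Taverna workflow: `make_pipeline` is a hypothetical helper, and the two text clean-up functions are stand-ins for real steps such as binarisation, segmentation or recognition.

```python
def make_pipeline(*nodes):
    """Chain processing nodes so the output of one is the input of the next."""
    def run(data):
        for node in nodes:
            data = node(data)
        return data
    return run


# Stand-in processing steps (assumptions for illustration only).
def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def dehyphenate(text):
    """Rejoin words that were split with a hyphen at a line end."""
    return text.replace("- ", "")


clean_text = make_pipeline(normalise_whitespace, dehyphenate)

if __name__ == "__main__":
    print(clean_text("  histo-  rical   docu- ments "))
```

Because each node only depends on the data it receives, nodes can be swapped or reordered without touching the others, which is the property the building-block model relies on.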
Workflow management
- Web 2.0 style registry: myExperiment
- Local client: Taverna Workbench
- Web client: project website

Compute cluster
An Enterprise Service Bus receives requests from users and distributes the load across the available worker nodes
- Process parallelization
- Load distribution
- Failover
Processing times improve by 0.56 per additional endpoint

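Load distribution with failover can be sketched as a round-robin dispatcher that skips unreachable workers. This is a simplified illustration and not the Apache Synapse configuration used in the framework. The `Dispatcher` class is hypothetical, workers are modelled as plain callables, and a `ConnectionError` stands in for an unavailable endpoint.

```python
class Dispatcher:
    """Round-robin load distribution with failover across worker endpoints."""

    def __init__(self, workers):
        self._workers = list(workers)
        self._next = 0  # index of the worker that gets the next request

    def submit(self, request):
        """Send the request to the next worker, skipping workers that fail."""
        count = len(self._workers)
        for attempt in range(count):
            index = (self._next + attempt) % count
            try:
                result = self._workers[index](request)
            except ConnectionError:
                continue  # failover: try the following worker
            self._next = (index + 1) % count
            return result
        raise RuntimeError(f"all {count} workers failed")


# Stand-in workers (assumptions for illustration only).
def worker_a(request):
    return f"a:{request}"


def worker_b(request):
    return f"b:{request}"


def broken_worker(request):
    raise ConnectionError("endpoint unreachable")


if __name__ == "__main__":
    bus = Dispatcher([broken_worker, worker_a, worker_b])
    print([bus.submit(page) for page in ("p1", "p2", "p3")])
```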
Dataset
Access to a representative, annotated dataset of significant size, with metadata, ground truth and search facilities

Evaluation features
- Text-based comparison of results with ground truth, using the Levenshtein distance
- Layout-based comparison of results with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework

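The text-based comparison can be illustrated with a minimal Python sketch of the Levenshtein distance, computed with the standard dynamic programming recurrence. The `character_error_rate` helper, which divides the distance by the length of the ground truth, is an assumption added for illustration. The slides do not say how the framework normalises or reports the distance.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance relative to the ground truth length (illustrative metric)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    print(levenshtein("hist0rical", "historical"))
    print(character_error_rate("hist0rical", "historical"))
```

Only two rows of the distance table are kept in memory at a time, so the memory use grows with the shorter of the two strings rather than with their product.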
Community
Web 2.0 style workflow registry: share, rate, comment, tag, ...
- Community of experts
- Sharing of resources and results
- Knowledge exchange
- Online environment for users and researchers

Summary
Benefits:
- Availability of resources (images, ground truth and services) to the international research community
- A common framework for transparent evaluation
- Sharing of results and know-how
- Enabling new research through scalable computing
- Cross-domain collaboration
Thank you! Questions?