European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project




European Archival Records and Knowledge Preservation
Database Archiving in the E-ARK Project
Janet Delve, University of Portsmouth
Kuldar Aas, National Archives of Estonia
Rainer Schmidt, Austrian Institute of Technology
DLM Forum, 13 November 2014

THE E-ARK PROJECT IS CO-FUNDED BY THE EUROPEAN COMMISSION UNDER THE ICT-PSP PROGRAMME www.eark-project.eu

Outline
- E-ARK objectives
- Current practices and needs
- Transactional (OLTP) vs analytical (OLAP / data warehousing) techniques
- Database archiving in E-ARK

E-ARK Facts and Figures
- EU CIP PCP ICT Programme, Objective 2.5: eArchiving services, Pilot B
- "The pilot should share information on integration, operation and interoperability issues throughout the EU in order to facilitate the creation and maintenance of a European archiving infrastructure for government and public services thus promoting the re-use of archival data."
- 36 months: February 2014 to January 2017
- EUR 6 M budget, EUR 3 M funded by the EC
- 16 partners
- http://www.eark-project.eu for all deliverables
- https://github.com/eark-project for all software

E-ARK Objectives
Reduce the cost of transfer, preservation and access to digital information by:
- Standardising how agencies export and send information to digital archives
- Providing open formats for the long-term preservation of various content
- Exploring needs for accessing archives and providing novel interfaces for these
Long-term vision: improve semantic and technical interoperability to a level which allows any system developer to deliver out-of-the-box archiving functionality

[Figure: E-ARK workflow in three stages]
- Pre-Ingest: archival records, content and records management systems, SIP creation tools, E-ARK SIP
- Ingest and Preservation: SIP to AIP conversion, digital preservation systems, scalable computation, E-ARK AIP, AIP to DIP conversion
- Access: E-ARK DIP, CMIS interface, data mining interface, archival search, access and display tools, content and records management systems, data mining showcase

Current database archiving practice
- Snapshot policy
- Ingest: transform the original relational structure into open formats
- Formats: SIARD, ADDML, DBML
- Access: users need to find the appropriate snapshot(s), load these into a current DBMS and use predefined queries or build their own
[Figure: a DB snapshot (Tables 1 to 3, each with a primary key and rows, plus a code list) passing through Ingest to Access]

Problems
- Finding the appropriate snapshot
  - Most users search for data about something: "Which car had the plate number 111YYY on 15 January 2000?"
  - Current practice only allows users to search for the database snapshot that includes data about something: "Which database includes information about cars on 15 January 2000?"
- Scope of the snapshot
  - The scope of data and the time period covered in a single snapshot usually do not meet the needs of the user
- Required technical knowledge
  - Relational structures are often highly optimised and hard to grasp
  - Most users do not have the knowledge to build accurate queries for specific access needs
  - The only alternative is to use predefined queries which have been archived along with the data
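To illustrate the technical knowledge the plate-number question demands, here is a minimal sketch of the query a user would have to write after loading a snapshot into a DBMS. It uses Python's built-in sqlite3 module; the table names, column names and sample rows are invented for illustration and do not come from any real archived database.

```python
import sqlite3

# Hypothetical snapshot schema and sample rows (assumptions, not real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicle (vehicle_id INTEGER PRIMARY KEY, make TEXT, model TEXT);
CREATE TABLE registration (
    reg_id     INTEGER PRIMARY KEY,
    vehicle_id INTEGER REFERENCES vehicle(vehicle_id),
    plate      TEXT,
    valid_from TEXT,  -- ISO 8601 dates compare correctly as strings
    valid_to   TEXT   -- NULL means still valid when the snapshot was taken
);
INSERT INTO vehicle VALUES (1, 'Volvo', '240'), (2, 'Lada', '2105');
INSERT INTO registration VALUES
    (10, 1, '111YYY', '1998-03-01', '2001-06-30'),
    (11, 2, '111YYY', '2001-07-01', NULL);
""")

def car_with_plate(conn, plate, on_date):
    """Return (make, model) rows for the vehicle carrying `plate` on `on_date`."""
    sql = """
        SELECT v.make, v.model
        FROM registration AS r
        JOIN vehicle AS v ON v.vehicle_id = r.vehicle_id
        WHERE r.plate = ?
          AND r.valid_from <= ?
          AND (r.valid_to IS NULL OR r.valid_to >= ?)
    """
    return conn.execute(sql, (plate, on_date, on_date)).fetchall()

result = car_with_plate(conn, "111YYY", "2000-01-15")
print(result)
```

Even in this two-table toy case the user must know the join key and how validity periods are encoded, which is the knowledge gap the slide describes.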

In an ideal world, users search for data, not for databases!
- Semantic reuse
- Topic-based reuse
- Big data analysis

HOW TO DO IT?

Transactional Processing (OLTP)

Online Analytical Processing (OLAP)

Data warehousing
- Updates come only from DB snapshots (useful for DB archiving)
- Can be denormalised
- Star schema dimensional model

Star Schema
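As a rough illustration of the star schema idea, the sketch below builds one fact table surrounded by two dimension tables in an in-memory SQLite database and answers a question by joining facts to dimensions. All table names, column names and numbers are hypothetical.

```python
import sqlite3

# Hypothetical star schema: one fact table, two dimension tables.
star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, country TEXT, city TEXT);
CREATE TABLE fact_registration (
    date_key      INTEGER REFERENCES dim_date(date_key),
    region_key    INTEGER REFERENCES dim_region(region_key),
    registrations INTEGER  -- the measure being analysed
);
INSERT INTO dim_date   VALUES (1, 2000, 1), (2, 2000, 2), (3, 2001, 1);
INSERT INTO dim_region VALUES (1, 'EE', 'Tallinn'), (2, 'EE', 'Tartu');
INSERT INTO fact_registration VALUES (1, 1, 120), (1, 2, 40), (2, 1, 100), (3, 2, 55);
""")

# Every analytical query follows the same shape: join facts to the
# dimensions of interest, then group by dimension attributes.
rows = star.execute("""
    SELECT d.year, r.city, SUM(f.registrations)
    FROM fact_registration AS f
    JOIN dim_date   AS d ON d.date_key   = f.date_key
    JOIN dim_region AS r ON r.region_key = f.region_key
    GROUP BY d.year, r.city
    ORDER BY d.year, r.city
""").fetchall()
print(rows)
```

Because the dimensions hang directly off the fact table, queries stay uniform and shallow, which is easier for non-expert users than navigating a highly normalised operational schema.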

Database archiving in E-ARK
Overview of E-ARK concepts
- Archiving of databases in different layers: primary format, semantic representation, representation for analytical processing
- Data-intensive technology for AIP storage and processing: Hadoop, HDFS, HBase, Lily, Solr
- Support for database transformation and analysis such as denormalisation, aggregation and indexing
- Levels of DIP format and display: access to archived records through OLAP queries and reports, or as dynamically reconstructed RDBs

Extract, Transform, Load (ETL)
- Goal: integrate data from multiple applications into a database / warehouse
  - Extract data from source systems such as RDBMSs and flat files
  - Transform: derive, extract, aggregate data
  - Load data into the target: overwrite cumulative information or add new data
- Important pre-processing step for data mining / analytics
  - Involves data cleaning and data integration
  - Result: structured data with index-based random access (e.g. RDBMS, NoSQL)
- E-ARK approach: automated transformation of archived databases into snowflake schema representation(s)
  - Denormalised, connected fact and dimension tables
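The three ETL steps can be sketched in a few lines. This is only an illustration under stated assumptions: the source and target are in-memory SQLite databases, and the table names, the cleaning rule and the sample rows are invented. It is not the E-ARK transformation itself.

```python
import sqlite3
from collections import Counter

def extract(source):
    """Extract: read joined rows from the (hypothetical) source tables."""
    return source.execute(
        "SELECT r.plate, o.city, r.reg_date "
        "FROM registration AS r JOIN office AS o ON o.office_id = r.office_id"
    ).fetchall()

def transform(rows):
    """Transform: clean city names, derive the year, aggregate counts."""
    counts = Counter()
    for _plate, city, reg_date in rows:
        year = int(reg_date[:4])              # derive a year from an ISO date
        counts[(year, city.strip().title())] += 1
    return sorted(counts.items())

def load(target, aggregated):
    """Load: append the aggregated rows to the target fact table."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS fact_registrations "
        "(year INTEGER, city TEXT, n INTEGER)"
    )
    target.executemany(
        "INSERT INTO fact_registrations VALUES (?, ?, ?)",
        [(year, city, n) for (year, city), n in aggregated],
    )
    target.commit()

# Invented sample source data, including one dirty city value.
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE office (office_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE registration (plate TEXT, office_id INTEGER, reg_date TEXT);
INSERT INTO office VALUES (1, 'tallinn '), (2, 'Tartu');
INSERT INTO registration VALUES
    ('111YYY', 1, '2000-01-15'),
    ('222ZZZ', 1, '2000-05-02'),
    ('333AAA', 2, '2001-09-30');
""")

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))
loaded = target.execute(
    "SELECT year, city, n FROM fact_registrations ORDER BY year, city"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the slide's split and lets the transform be tested without a database.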

Indexing - Full Text Search
- Searching documents based on full text, as distinguished from searches based on metadata
- Returns a (ranked) list of document IDs
- Information retrieval methods involved:
  - Building an inverted index
  - Scoring and weighting results
  - Text classification
  - Evaluation
- Approach in E-ARK
  - Denormalisation / star schema transformation and ingestion into Apache HBase
  - Repository and faceted search on records based on NGDATA's Lily repository and Apache Solr
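A minimal sketch of the first two information retrieval steps, building an inverted index and scoring results, is given below in plain Python. The tokenizer, the TF-IDF variant and the sample records are assumptions chosen for brevity; a production system such as Apache Solr uses considerably more refined analysis and ranking.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(index, n_docs, query):
    """Return doc IDs ranked by a simple TF-IDF score (highest first)."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Rare terms weigh more; +1 keeps terms found in every document.
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=lambda d: (-scores[d], d))

# Invented sample records.
docs = {
    "rec1": "Car registration for plate 111YYY",
    "rec2": "Tax record of a car owner",
    "rec3": "Land register entry",
}
index = build_index(docs)
hits = search(index, len(docs), "car plate")
print(hits)
```

The query touches only the posting lists of its own terms, which is why full-text search scales better than scanning every record.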

Online analytical processing (OLAP)
- OLAP database / data warehouse
  - Aggregated, historical data, low transaction rate
  - Resource-intensive and complex queries
- Analyse multi-dimensional data in a read-efficient manner (web analytics, sales)
  - View metrics by combination of dimensions
  - Time vs. space: pre-aggregates data to build a cube
  - Dynamically analyse data from multiple perspectives: roll-up, drill-down, slicing and dicing
- Approach in E-ARK
  - Data analytics based on a pre-processed database representation arranged along dimensions
  - Data loaded and queried through Apache HBase
  - Access (DIPs) supported by additional use of OLAP tools
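The time vs. space trade-off can be made concrete with a toy cube: every combination of dimensions is aggregated once up front, after which roll-up, drill-down and slicing are plain lookups. The dimension names and fact rows below are hypothetical, and real OLAP engines materialise cubes far more selectively.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("year", "city", "make")

# Invented fact rows; "count" is the measure.
facts = [
    {"year": 2000, "city": "Tallinn", "make": "Volvo", "count": 3},
    {"year": 2000, "city": "Tartu",   "make": "Volvo", "count": 1},
    {"year": 2000, "city": "Tallinn", "make": "Lada",  "count": 5},
    {"year": 2001, "city": "Tallinn", "make": "Volvo", "count": 2},
]

def build_cube(facts, dimensions, measure):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cuboid = defaultdict(int)
            for fact in facts:
                cuboid[tuple(fact[d] for d in dims)] += fact[measure]
            cube[dims] = dict(cuboid)
    return cube

cube = build_cube(facts, DIMENSIONS, "count")

by_year_city = cube[("year", "city")]  # drill-down: year broken out by city
by_year = cube[("year",)]              # roll-up: city and make aggregated away
grand_total = cube[()][()]             # fully rolled up
# Slice: fix year = 2000 and keep the remaining city dimension.
slice_2000 = {key: n for key, n in by_year_city.items() if key[0] == 2000}

print(by_year, grand_total, slice_2000)
```

With three dimensions the cube holds 2^3 = 8 cuboids, so storage grows quickly with the number of dimensions while each query becomes a dictionary lookup.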

Data Mining
- Identify correlations and patterns in existing data
  - Used by statisticians and by the database and business communities
- Data analysis techniques
  - Regression: predict continuous-valued output (e.g. price)
  - Classification: discrete-valued output (e.g. character recognition)
  - Segmentation: separates data into interesting groups
- Based on mathematical methods: pattern matching, machine learning, numerical analysis
- E-ARK data mining is mainly based on text mining
  - Uses the data structure of the repository or the search index
  - Scalable through MapReduce / Apache Mahout
  - Goal: clustering, labelling, anomaly detection
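Of the techniques listed, regression is the simplest to show end to end. The sketch below fits a straight line by ordinary least squares in plain Python; the function names and the sample numbers are made up, and this is not the text-mining pipeline E-ARK describes.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict a continuous value for x from a fitted (slope, intercept)."""
    slope, intercept = model
    return slope * x + intercept

# Invented sample: these points lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
model = fit_line(xs, ys)
print(model, predict(model, 5))
```

Classification and segmentation follow the same pattern of fitting a model to existing data and applying it to new records, only with discrete labels or groups as output.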

Proposed Architecture
[Figure: architecture diagram with the following labelled components]
- Inputs and applications: EARK-AIP, Data Management Application, ESS Arch Preservation Platform, Data Mining Showcase (T6.4, D6.3)
- APIs: Data Connector API, CRUD API, Query API, Data Mining API
- AIP Storage (T6.2, MS10)
- Data Management Integration (T6.1, MS06, D6.2)
- Query and Indexing (T6.3, MS04, D6.1)
- Re-use and Data Mining (T6.4)
- Scalable Computation Staging Area: Lily, Hadoop, HBase, HDFS
- Archive Storage (WORM)

Tasks and Components
- Archival Storage
  - Store AIPs on HDFS using the ESS Preservation Platform
  - Bulk-load, permanent and replicated storage
- Data Integration
  - Extract data from the archival information package
  - ETL data into Lily/HBase, keep the AIP in HDFS (don't touch)
- Query and Indexing
  - Metadata on AIP level stored in HBase for basic retrieval
  - Faceted search based on Apache Solr
- Data Mining and Analytics
  - Load OLAP structure from the package
  - Data sets stored on record level in HBase
  - Query for facts based on different dimensions and levels