Software Developments for LSDF Data Management




Rainer Stotzka, V. Hartmann, T. Jejkal, P. Neuberger, S. Ochsenreither, F. Rindone, T. Schmidt, H. Pasic, J. van Wezel, A. Garcia, R. Kupsch, S. Bourov, M. Hardt
Steinbuch Centre for Computing
KIT, University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu

Access to Data Infrastructures
Virtual research communities share resources across borders (computing centers, countries):
- Computing
- Storage
- Networking
- Facilities
- Services, etc.
KIT LSDF, new: Data as a Service

Requirements
Storage
- Availability and reliability (24/7)
- Scalability (5-10 PB/a)
- Sustainability (>> 10 a)
- Performance and throughput (> 1 TB/h per application)
Collaborative data networks
- Distribution (worldwide)
- Accessibility (multiple protocols: Grid, Cloud, Web, ...)
- Security (X.509 certificates)
Tools and applications (software)
- Flexibility (multiple communities with a huge variety of requirements)
- Programming interfaces (API)
- User interfaces (easy-to-use)
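To put the storage figures above on a common scale, the following sketch converts them to sustained MB/s. It assumes decimal units (1 TB = 10^12 bytes) and a 365-day year; both are assumptions made for illustration, not statements from the slides.

```python
# Unit conversion sketch: decimal prefixes and a non-leap year are assumptions.
SECONDS_PER_HOUR = 3600
SECONDS_PER_YEAR = 365 * 24 * 3600

MB = 10**6
TB = 10**12
PB = 10**15


def rate_mb_per_s(num_bytes, seconds):
    """Average transfer rate in MB/s for num_bytes moved within seconds."""
    return num_bytes / seconds / MB


# "> 1 TB/h per application"
per_application = rate_mb_per_s(1 * TB, SECONDS_PER_HOUR)

# "5-10 PB/a" expressed as a continuous average ingest rate
growth_low = rate_mb_per_s(5 * PB, SECONDS_PER_YEAR)
growth_high = rate_mb_per_s(10 * PB, SECONDS_PER_YEAR)

print(f"1 TB/h  = {per_application:.0f} MB/s")
print(f"5 PB/a  = {growth_low:.0f} MB/s")
print(f"10 PB/a = {growth_high:.0f} MB/s")
```

Under these assumptions, 1 TB/h corresponds to roughly 278 MB/s sustained, and 5-10 PB/a to an average ingest of roughly 159 to 317 MB/s.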

LSDF objectives
- Dedicated to science data
- Exabyte-scale data
- To archive data, long-term sustainability (10 yrs.?)
- To enable scientists to gain better scientific results by providing
  - Data-intensive analysis
  - Added-value services for data-intensive processing
- To provide high-performance access, high throughput
- Barrier-free access (easy-to-use)
- Sustainability and interoperability
Guidelines:
- PARADE White Paper (2009): Strategy for a European Data Infrastructure
- ESFRI Data Management Task Force (2009): e-IRG Report on Data Management
- OAIS (2002): Reference Model for an Open Archival Information System
- High Level Experts Group (2010): Riding the Wave, European Commission Report on Scientific Data
- HLEG-SD (2010?): Note on Data Services infrastructure
- Microsoft Research (2009): The Fourth Paradigm: Data-Intensive Scientific Discovery

Software Development Infrastructure
LSDF Software and Service Development
- Development of software, technologies and algorithms
- Components: ADALAPI, DataBrowser, meta data, workflow, data-intensive applications
- Development of services to support scientific communities (scientific experiments, applications, communities)

ADALAPI
Abstract Data Access Layer Application Programming Interface
- Java class library
- Seamless application access to LSDF
- Independent of transfer protocol and location
- Protocols and filesystems: local files, gsiftp, sftp, http(s), hdfs
- Authentication: X.509 certificates, user/password
- Performance: up to 85 MB/s (1 GE, gsiftp)
[Diagram labels: client software (applications, tools, scientific experiments, DataBrowser, DAQ, visualization); LSDF storage infrastructure; Grid, Cloud, workstations]
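The idea of access that is independent of transfer protocol and location can be sketched as a small dispatch layer keyed on the URL scheme. This is not the ADALAPI interface (which is a Java class library whose API is not shown in these slides); the class names `DataAccess`, `LocalFileAccess` and `AccessLayer` are hypothetical, only a local-file backend is implemented, and POSIX-style paths are assumed.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class DataAccess(ABC):
    """Hypothetical protocol-independent interface (not the ADALAPI API)."""

    @abstractmethod
    def read(self, url: str) -> bytes: ...

    @abstractmethod
    def write(self, url: str, data: bytes) -> None: ...


class LocalFileAccess(DataAccess):
    """Backend for file:// URLs; other protocols would get their own backend."""

    def read(self, url):
        with open(urlparse(url).path, "rb") as handle:
            return handle.read()

    def write(self, url, data):
        with open(urlparse(url).path, "wb") as handle:
            handle.write(data)


class AccessLayer:
    """Chooses a backend from the URL scheme, so callers never name a protocol."""

    def __init__(self):
        self._backends = {}

    def register(self, scheme, backend):
        self._backends[scheme] = backend

    def _backend_for(self, url):
        scheme = urlparse(url).scheme or "file"
        if scheme not in self._backends:
            raise ValueError(f"no backend registered for scheme {scheme!r}")
        return self._backends[scheme]

    def read(self, url):
        return self._backend_for(url).read(url)

    def write(self, url, data):
        self._backend_for(url).write(url, data)


layer = AccessLayer()
layer.register("file", LocalFileAccess())

with tempfile.TemporaryDirectory() as tmp:
    url = "file://" + os.path.join(tmp, "sample.raw")
    layer.write(url, b"detector frame")
    payload = layer.read(url)
```

Application code only ever sees `layer.read(url)` and `layer.write(url, data)`; supporting another protocol would mean registering one more backend, without touching the callers.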

DataBrowser
- API: data and meta data organization
- GUI: file, data and project explorer
- Easy-to-use
- Extensible
- World-wide access
- Stable
Functions:
- Data management
- Queries in meta data catalogs
- Up-/Download
- Control of data analysis and visualization workflows

Example: Adapted DataBrowser for Toxicology

Why is meta data necessary?
Meta data describe the contents of data.
Everybody uses meta data:
- File name and extension (e.g. rainer.jpg, budget.xls, Readme.doc)
- Location (e.g. / /EU-projects/2010/Fishy/budget.xls)
- Personal know-how
This is sufficient for small file systems.
Have you ever tried to locate a file, or some information somewhere in a file system,
- that is 15 years old?
- in the file system of a colleague?
- in a 100 petabyte file system?

Model of the LSDF meta data management
Idea: clear separation between
- Data (files)
- Meta data
[Diagram: logical file catalogs (DB) on top of storage]

Model of the LSDF meta data management
Idea: clear separation between
- Data (files)
- Data organization (directory structure)
- Meta data
[Diagram: "My project" directory tree; logical directory catalog (DB); logical file catalogs (DB)]

Model of the LSDF meta data management
Idea: clear separation between
- Data (files)
- Data organization (directory structure)
- Associated meta data
[Diagram: logical project catalog (DB); logical directory catalog (DB); logical file catalogs (DB)]
Meta data:
- name
- owners
- access rights
- date
- community
- (sub)subcommunity
- measurement type
- device, instrument
Meta data structure depends on project, instruments, time, ...
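A minimal sketch of why separate meta data help: files are found by what they are rather than by where they live. The record type `MetaDataRecord`, the `find` helper and all sample values are invented for illustration; the `extra` dictionary stands in for the project-dependent part of the meta data structure.

```python
from dataclasses import dataclass, field


@dataclass
class MetaDataRecord:
    """Hypothetical record using a few of the attributes listed above."""
    name: str
    owners: list
    community: str
    measurement_type: str
    instrument: str
    extra: dict = field(default_factory=dict)  # project-specific fields


def find(records, **criteria):
    """Return records whose fixed or project-specific fields match all criteria."""
    def value(record, key):
        return getattr(record, key, record.extra.get(key))
    return [r for r in records
            if all(value(r, k) == v for k, v in criteria.items())]


# Invented sample data, not taken from any LSDF catalog.
records = [
    MetaDataRecord("stack_0001.raw", ["alice"], "systems biology",
                   "time series", "microscope A", {"plate": 1}),
    MetaDataRecord("stack_0002.raw", ["alice"], "systems biology",
                   "time series", "microscope A", {"plate": 2}),
    MetaDataRecord("scan_17.raw", ["bob"], "material research",
                   "tomography", "beamline X"),
]

biology = [r.name for r in find(records, community="systems biology")]
plate_two = [r.name for r in find(records, plate=2)]
```

The query never mentions a directory path, so the same lookup keeps working if the files are moved or replicated.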

Hierarchical Catalog System (Repository)
Meta data, APIs and tools:
- Sustainable
- Easily extensible
- Independent of data formats
- Enhanced performance: distribution of access
- Safety by redundancy
- Use of open standards
Meta data scheme repository (catalogs):
- Zebrafish I
- Zebrafish II
- ANKA BL1
- Material research
- Digital objects in Arts and Humanities
- Generic file tree
Catalog hierarchy:
- Logical Project Catalog: LPN -> LDN, meta data
- Logical Directory Catalog: LDN -> LDN, LFN
- Logical File Catalogs: LFN -> Physical File Name
[Diagram: catalogs backed by databases (DB) on LSDF systems (computing, storage)]
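The catalog hierarchy above can be sketched as three lookup tables that are walked from project to physical file. The slides do not expand the abbreviations; the sketch reads LPN, LDN and LFN as logical project, directory and file names, which is an assumption. All identifiers, host names and the replica lists are invented, and plain dictionaries stand in for the databases.

```python
# Logical Project Catalog: LPN -> root LDN plus project meta data
project_catalog = {
    "zebrafish-1": {"root": "ldn:/zebrafish-1",
                    "meta": {"community": "systems biology"}},
}

# Logical Directory Catalog: LDN -> child LDNs and LFNs
directory_catalog = {
    "ldn:/zebrafish-1": {"dirs": ["ldn:/zebrafish-1/plate-01"], "files": []},
    "ldn:/zebrafish-1/plate-01": {"dirs": [],
                                  "files": ["lfn:0001", "lfn:0002"]},
}

# Logical File Catalog: LFN -> physical file names (several entries = replicas)
file_catalog = {
    "lfn:0001": ["gsiftp://storage-a.example.org/vol1/0001.raw",
                 "gsiftp://storage-b.example.org/vol7/0001.raw"],
    "lfn:0002": ["gsiftp://storage-a.example.org/vol1/0002.raw"],
}


def physical_files(lpn):
    """Walk project -> directories -> files; map each LFN to its physical names."""
    pending = [project_catalog[lpn]["root"]]
    resolved = {}
    while pending:
        entry = directory_catalog[pending.pop()]
        pending.extend(entry["dirs"])
        for lfn in entry["files"]:
            resolved[lfn] = file_catalog[lfn]
    return resolved


locations = physical_files("zebrafish-1")
```

Because applications hold only logical names, a file can be moved or replicated by updating one entry in the file catalog, while the project and directory views stay unchanged.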

Additional Data Services
How do I insert a new scientific project?
- Data and meta data organization experts for projects with specific needs
- Generic meta data format for simple file trees
How do I transfer my data to a different location? Do I lose my meta data?
- Import-export to standard data and meta data formats
- Archive-in-a-box (Web installer or DVD, zip archive, etc.)

Results: Community Services
Complex image analysis chain (DataBrowser, meta data, workflow):
- Input: 3D image stack, time series, Leica Image Format; data set size: 100 GB
- Transfer to LSDF
- Automatic data conversion to RAW
- LSDF storage and online processing (storage)
- Image processing and analysis, offline processing (computing)

Data Intensive Computing Workflow
Visualization of huge data sets: maximum projection, arbitrary viewpoint
- HeadNode: job preparation and distribution
- Computing nodes: load data, compute rotation and projection, write results
[Figure labels: 1 projection; 36 projections]
2.8 TB read, 1.7 TB write, rotations and projections, 2 h
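The head node / compute node split can be sketched on a toy volume. This is a simplified illustration, not the LSDF workflow: threads stand in for compute nodes, the volume is a nested list indexed as `volume[z][y][x]`, and only axis-aligned viewpoints are handled (an arbitrary viewpoint would first require rotating and resampling the volume). The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def max_projection(volume, axis):
    """Maximum intensity projection of volume[z][y][x] along one axis."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == "z":
        return [[max(volume[z][y][x] for z in range(nz)) for x in range(nx)]
                for y in range(ny)]
    if axis == "y":
        return [[max(volume[z][y][x] for y in range(ny)) for x in range(nx)]
                for z in range(nz)]
    if axis == "x":
        return [[max(volume[z][y][x] for x in range(nx)) for y in range(ny)]
                for z in range(nz)]
    raise ValueError(f"unknown axis {axis!r}")


def head_node(volume, axes, workers=3):
    """Prepare one job per viewpoint and distribute the jobs to workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = {axis: pool.submit(max_projection, volume, axis) for axis in axes}
        return {axis: job.result() for axis, job in jobs.items()}


# Toy 2 x 2 x 2 volume with invented intensities.
volume = [[[1, 5], [2, 0]],
          [[3, 4], [9, 1]]]

projections = head_node(volume, ["z", "y", "x"])
```

Each viewpoint is an independent job, which is why the projections can be spread across compute nodes without coordination beyond the initial distribution.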

Scientific communities
- Systems biology (ITG, BioQuant, Immunogenetics)
  - Vertebrate development studies
  - Deconvolution (5000 data sets < 180 min.)
- Synchrotron facilities and beamlines
  - ANKA data storage
  - HGF Programme Photon-Neutron-Ion
  - High Data Rate Initiative
- Climate research
- Material research
- Arts and humanities
  - »Il Cenacolo« by Da Vinci (1494-98)
  - »L'ultima cena« by Julius Romanus (1754)

Conclusions
LSDF is a powerful structure
- More than data storage and cluster computing
- Design for future requirements
- R&D in progress: exabyte storage + interactivity
LSDF offers
- Sustainability and safety
- Flexibility for future requirements
- Support
- Interactivity
- Software and tools
- Community-specific services
To gain faster and better scientific results