Software Developments for LSDF Data Management
Rainer Stotzka, V. Hartmann, T. Jejkal, P. Neuberger, S. Ochsenreither, F. Rindone, T. Schmidt, H. Pasic, J. van Wezel, A. Garcia, R. Kupsch, S. Bourov, M. Hardt
Steinbuch Centre for Computing
KIT, University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association
www.kit.edu
Access to Data Infrastructures
Virtual Research Communities share resources across borders (computing centers, countries):
- Computing
- Storage
- Networking
- Facilities
- Services, etc.
KIT LSDF
NEW: Data as a Service
Requirements
Storage
- Availability and reliability (24/7)
- Scalability (5-10 PB/a)
- Sustainability (>> 10 a)
- Performance and throughput (> 1 TB/h per application)
Collaborative data networks
- Distribution (worldwide)
- Accessibility (multiple protocols: Grid, Cloud, Web, ...)
- Security (X.509 certificates)
Tools and applications (software)
- Flexibility (multiple communities with a huge variety of requirements)
- Programming interfaces (API)
- User interfaces (easy-to-use)
LSDF objectives
- Dedicated to science data
- ExaByte scale data
- To archive data, long term sustainability (10 yrs.?)
- To enable scientists to gain better scientific results by providing
  - Data intensive analysis
  - Added value services for data intensive processing
- To provide high performance access, high throughput
- Barrier free access (easy-to-use)
- Sustainability and interoperability
Guidelines:
- PARADE White Paper (2009): Strategy for a European Data Infrastructure
- ESFRI Data Management Task Force (2009): e-IRG Report on Data Management
- OAIS (2002): Reference Model for an Open Archival Information System
- High Level Experts Group (2010): Riding the Wave, European Commission Report on Scientific Data
- HLEG-SD (2010?): Note on Data Services infrastructure
- Microsoft Research (2009): The Fourth Paradigm: Data-Intensive Scientific Discovery
Software Development Infrastructure
LSDF Software and Service Development
- Development of software, technologies and algorithms
- Development of services to support scientific communities
[Diagram labels: LSDF; ADALAPI; DataBrowser; Meta data; Data intensive applications; Workflow; Scientific experiments, applications, communities]
ADALAPI
ADALAPI: Abstract Data Access Layer Application Programming Interface
- Java class library
- Seamless application access to LSDF
- Independent of transfer protocol and location
- Protocols and filesystems: local files, gsiftp, sftp, http(s), hdfs
- Authentication: X.509 certificates, user/passwd
- Performance: up to 85 MB/s (1 GE, gsiftp)
[Diagram labels: Client software (Applications, Tools, Scientific exp., DataBrowser, DAQ, Visualization); LSDF Storage Infrastructure; Grid; Cloud; Workstations]
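The idea of an access layer that is independent of transfer protocol and location can be illustrated with a minimal sketch. It assumes that the protocol is selected from the URL scheme and that each protocol is served by its own handler. All class, method and scheme names below (AccessLayerSketch, ProtocolHandler, "mem") are hypothetical and are not taken from ADALAPI itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a protocol-independent access layer.
// None of these names are taken from the real ADALAPI library.
class AccessLayerSketch {

    /** One handler per transfer protocol, identified by its URI scheme. */
    interface ProtocolHandler {
        InputStream open(URI location) throws IOException;
    }

    private final Map<String, ProtocolHandler> handlers = new HashMap<>();

    void register(String scheme, ProtocolHandler handler) {
        handlers.put(scheme.toLowerCase(Locale.ROOT), handler);
    }

    /** Applications call this single method, whatever the protocol is. */
    InputStream open(String url) throws IOException {
        URI location = URI.create(url);
        String scheme = location.getScheme();
        ProtocolHandler handler =
                scheme == null ? null : handlers.get(scheme.toLowerCase(Locale.ROOT));
        if (handler == null) {
            throw new IOException("No handler registered for: " + url);
        }
        return handler.open(location);
    }

    /** Builds a layer with a local-file handler and an in-memory stand-in. */
    static AccessLayerSketch demo() {
        AccessLayerSketch layer = new AccessLayerSketch();
        layer.register("file", uri -> Files.newInputStream(Paths.get(uri)));
        // "mem" is a made-up scheme standing in for a remote protocol such as
        // gsiftp or sftp, which would need a real client library.
        layer.register("mem", uri -> new ByteArrayInputStream(
                uri.getSchemeSpecificPart().getBytes(StandardCharsets.UTF_8)));
        return layer;
    }
}
```

With this layout, client code such as `AccessLayerSketch.demo().open("mem:hello")` does not change when a data set moves to another protocol. Only the URL and the registered handler differ.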
DataBrowser
API and GUI:
- Data and meta data organization
- File, data and project explorer
- Easy-to-use
- Extensible
- World-wide access
- Stable
Functions:
- Data management
- Queries in meta data catalogs
- Up-/Download
- Control of data analysis and visualization workflows
Example: Adapted DataBrowser for Toxicology
Why is meta data necessary?
- Meta data describe the contents of data
- Everybody uses meta data:
  - File name and extension (e.g. rainer.jpg, budget.xls, Readme.doc)
  - Location (e.g. / /EU-projects/2010/Fishy/budget.xls)
  - Personal know-how
- Sufficient for small file systems
- Have you ever tried to locate a file or a piece of information somewhere in a file system
  - 15 years old?
  - in the file system of a colleague?
  - in a 100 PetaByte file system?
Model of the LSDF meta data management
Idea: Clear separation between data (files) and meta data
[Diagram labels: File; Logical File Catalogs (DB); Storage]
Model of the LSDF meta data management
Idea: Clear separation between data (files), data organization (directory structure) and meta data
[Diagram labels: My project (directory tree); Logical Directory Catalog (DB); File; Logical File Catalogs (DB)]
Model of the LSDF meta data management
Idea: Clear separation between data (files), data organization (directory structure) and associated meta data
[Diagram labels: Logical Project Catalog (DB); Logical Directory Catalog (DB); File; Logical File Catalogs (DB)]
Meta data:
- name
- owners
- access rights
- date
- community
- (sub)subcommunity
- measurement type
- device, instrument
Meta data structure depends on project, instruments, time, ...
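Because the meta data structure depends on project, instrument and time, one simple way to picture a meta data record is as a set of free-form key/value pairs that can be queried by attribute. The following is only a sketch under that assumption. The class MetaDataSketch and the sample values are made up for illustration and do not describe the actual LSDF catalog schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: meta data as schema-free key/value pairs.
class MetaDataSketch {

    /** Builds one meta data record from alternating keys and values. */
    static Map<String, String> entry(String... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("Expected key/value pairs");
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            record.put(keyValues[i], keyValues[i + 1]);
        }
        return record;
    }

    /** Returns all records whose attribute 'key' equals 'value'. */
    static List<Map<String, String>> find(
            List<Map<String, String>> catalog, String key, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> record : catalog) {
            if (value.equals(record.get(key))) {
                hits.add(record);
            }
        }
        return hits;
    }
}
```

A query such as `MetaDataSketch.find(catalog, "community", "Systems biology")` then locates data sets by what they contain rather than by file name or directory, and records from different projects may carry different keys.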
Hierarchical Catalog System (Repository)
APIs and Tools
Meta data:
- Sustainable
- Easily extensible
- Independent of data formats
- Enhanced performance: distribution of access
- Safety by redundancy
- Use of open standards
Catalogs:
- Meta data scheme repository: Zebrafish I, Zebrafish II, ANKA BL1, Material research, Digital objects in Arts and Humanities, Generic file tree
- Logical Project Catalog: LPN to LDN, meta data
- Logical Directory Catalog: LDN to LDN, LFN
- Logical File Catalogs: LFN to Physical File Name
LSDF Systems: Computing, Storage
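The catalog hierarchy can be read as a chain of lookups from a project name down to physical files. The sketch below assumes that LPN, LDN and LFN stand for logical project, directory and file names, that directories form a tree without cycles, and that one LFN may map to several physical copies (safety by redundancy). The class CatalogSketch and the plain in-memory maps are hypothetical stand-ins for the databases shown on the slide.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the three catalog levels as in-memory maps.
class CatalogSketch {
    /** Logical Project Catalog: LPN -> LDNs of the project's directories. */
    final Map<String, List<String>> projectCatalog = new HashMap<>();
    /** Logical Directory Catalog: LDN -> child LDNs and LFNs. */
    final Map<String, List<String>> directoryCatalog = new HashMap<>();
    /** Logical File Catalog: LFN -> physical file names (replicas). */
    final Map<String, List<String>> fileCatalog = new HashMap<>();

    /** Resolves a project name to all physical file names below it. */
    List<String> physicalFiles(String lpn) {
        List<String> result = new ArrayList<>();
        for (String ldn : projectCatalog.getOrDefault(lpn, Collections.emptyList())) {
            collect(ldn, result);
        }
        return result;
    }

    // A name found in the file catalog is treated as an LFN; anything else
    // is looked up as a directory. Assumes the directory tree has no cycles.
    private void collect(String name, List<String> result) {
        List<String> replicas = fileCatalog.get(name);
        if (replicas != null) {
            result.addAll(replicas);
            return;
        }
        for (String child : directoryCatalog.getOrDefault(name, Collections.emptyList())) {
            collect(child, result);
        }
    }
}
```

Keeping the three levels in separate maps mirrors the separation of data, data organization and meta data: a file can be moved or replicated by changing only the file catalog entry, while project and directory names stay stable.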
Additional Data Services
How do I insert a new scientific project?
- Data and meta data organization experts for projects with specific needs
- Generic meta data format for simple file trees
How do I transfer my data to a different location? Do I lose my meta data?
- Import-export to standard data and meta data formats
- Archive-in-a-box (Web installer or DVD, zip-archive, etc.)
Results: Community Services
Complex image analysis chain:
- 3D image stack, time series, Leica Image Format; data set size: 100 GB
- Transfer to LSDF
- Automatic data conversion to RAW
- LSDF storage and online processing
- Image processing and analysis
- Offline processing
[Diagram labels: DataBrowser; Meta data; Workflow; Storage; Computing]
Data Intensive Computing Workflow
Visualization of huge data sets: maximum projection, arbitrary viewpoint
- HeadNode: job preparation and distribution
- Computing Nodes: load data, compute rotation and projection, write results
[Figure panels: 1 projection; 36 projections]
2.8 TB read, 1.7 TB write, rotations and projections, 2 h
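The projection step that each computing node performs can be sketched as a maximum intensity projection of a 3D image stack. The sketch assumes the stack is held in memory as a rectangular array indexed stack[z][y][x] and projects along the z axis only. The rotation needed for an arbitrary viewpoint, the file I/O and the job distribution are left out, and the class name MaxProjectionSketch is hypothetical.

```java
// Hypothetical sketch: maximum intensity projection along the z axis.
class MaxProjectionSketch {

    /** For every (x, y) position, keeps the largest value over all slices. */
    static int[][] projectAlongZ(int[][][] stack) {
        if (stack.length == 0 || stack[0].length == 0 || stack[0][0].length == 0) {
            throw new IllegalArgumentException("Stack must not be empty");
        }
        int depth = stack.length;
        int height = stack[0].length;
        int width = stack[0][0].length;
        int[][] projection = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int max = stack[0][y][x];
                for (int z = 1; z < depth; z++) {
                    max = Math.max(max, stack[z][y][x]);
                }
                projection[y][x] = max;
            }
        }
        return projection;
    }
}
```

Each output pixel depends only on one column of voxels, so different rotations of the same stack can be computed independently on different nodes.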
Scientific communities
- Systems biology (ITG, BioQuant, Immunogenetics)
  - Vertebrate development studies and deconvolution (5000 data sets < 180 min.)
- Synchrotron facilities and beamlines
  - ANKA data storage
  - HGF Programme Photon-Neutron-Ion, High Data Rate Initiative
- Climate research
- Material research
- Arts and humanities
  - »Il Cenacolo« by Da Vinci (1494-98)
  - »L'ultima cena« by Julius Romanus (1754)
Conclusions
LSDF is a powerful structure
- More than data storage and cluster computing
- Design for future requirements
- R&D in progress
- ExaByte storage + interactivity
LSDF offers
- Sustainability and safety
- Flexibility for future requirements
- Support
- Interactivity
- Software and tools
- Community-specific services
To gain faster and better scientific results