Data Lab System Architecture
Data Lab Context
Data Lab Architecture
[Figure: layered architecture diagram. Components shown per layer:]
  Presentation Layer: Astronomer's Desktop (Web Page, Cmdline Tools, Legacy Apps, User Code); Data Lab Ops (User Mgmt, Monitoring)
  Services Layer: Public Services (Authentication, Query Manager, Job Manager, Storage Mgr, Resource Resolver, Public Repo); Private Services (Ops Monitor, Private Repo)
  Data Access Layer: Data Access Services (SIA, SSA, SCS, UWS, VOSpace, TAP, SQL Service)
  Resources Layer: Databases (MyDB, Large Cats, Data Pub, Ops DBs); Storage Resource (User Space, Virtual Space); Compute Resource (UWS Compute Jobs); External Resources (VO Data, VO Svcs, NSA)
Presentation Layer
This layer contains the primary user interfaces.
  Astronomer's Desktop
    - Web clients -- data query forms, content browsers, monitors, etc.
    - Command-line tools -- for local desktop access
    - Legacy apps -- including scripting environments such as Python
    - User-written code -- custom science clients
    - Login shells
  Operators' Tools
    - System monitoring / administration
    - User and resource management
Services Layer
This layer provides interfaces used mostly by software.
  Public Services
    - Authentication / Authorization -- controlled access to D/L
    - Job Manager -- manage compute jobs
    - Query Manager -- manage large data queries
    - Storage Manager -- manage virtual storage resource
    - Resource Resolver -- locate services / resources within D/L
  Private Services
    - Operations monitoring service -- automated resource checking
Data Access Layer
This layer provides interfaces to data services.
  Simple VO data services
    - Catalog / images / spectra
    - Positional (+constraint) based query
    - Anonymous access allowed
  Advanced Catalog Services
    - Full SQL query capability
    - VO standard interface (public access)
    - Custom SQL interface (authorized access)
  Virtual storage
    - Authorized access, user-controlled sharing
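To make the "positional (+constraint) based query" idea concrete, the following is a minimal in-memory sketch of a cone search with an optional extra constraint. It is not the Data Lab implementation and does not speak any VO protocol; the function names, the row layout (`ra`, `dec`, `mag` keys) and the sample catalog are assumptions made for illustration.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula; numerically stable for small separations.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(rows, ra, dec, radius_deg, constraint=None):
    """Return rows within radius_deg of (ra, dec) that also pass the optional constraint."""
    hits = []
    for row in rows:
        if angular_separation_deg(ra, dec, row["ra"], row["dec"]) > radius_deg:
            continue
        if constraint is not None and not constraint(row):
            continue
        hits.append(row)
    return hits


# Hypothetical sample catalog, for illustration only.
catalog = [
    {"id": 1, "ra": 10.00, "dec": -5.00, "mag": 18.2},
    {"id": 2, "ra": 10.05, "dec": -5.02, "mag": 21.7},
    {"id": 3, "ra": 50.00, "dec": 20.00, "mag": 17.0},
]
bright_nearby = cone_search(catalog, 10.0, -5.0, 0.1, lambda r: r["mag"] < 20)
```

A real service would push the positional filter into the database rather than scan rows in Python; the sketch only shows the shape of the query.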
Service vs. Access Layers
Why the need for different layers?

  Columns: Service Layer, Access Layer

  Astronomer Friendly                 X
  Authorized Access                   X
  Anonymous Access                    X    X
  Direct VO Protocols                 X
  Job Control                         X    Depends
  Data Lab API                        X    X
  Virtual Observatory API             X
  Web Interface                       X*   X
  Programmatic (Desktop) Interface    X*   X*
  Legacy App Support                  X*
Resources Layer
This layer describes physical / logical resources in the D/L.
  Databases
    - Large (distributed) Catalog DB
    - Personal DB (similar to SDSS MyDB)
    - User-published datasets
    - Operational DB
  Physical Storage
    - Persistent user storage
    - Virtual storage
  Compute Resources
    - Servers for processing workflows
  External Services
    - Data and processing
    - VO tools (e.g. cross-match)
Large Catalogs
  - Require a low-cost, scalable and reliable solution
  - No viable turnkey system available
  - The LSST QServ project will give us valuable experience
    - Presents a normal DB interface to the client
    - Can put a TAP/SQL service in front of it
    - Can optimize data partitioning through experimentation
  - QServ requires dedicated hardware for each catalog instance
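As an illustration of what "data partitioning" means for a sky catalog, the sketch below assigns each row to a chunk on a fixed declination/RA grid. This is a deliberately simplified scheme and not QServ's actual partitioning; the function names and the grid sizes (`n_stripes`, `n_chunks_per_stripe`) are hypothetical tuning parameters of the kind one might vary during experimentation.

```python
def chunk_id(ra_deg, dec_deg, n_stripes=18, n_chunks_per_stripe=36):
    """Map a sky position (degrees) to an integer chunk on a fixed Dec/RA grid."""
    ra = ra_deg % 360.0
    dec = max(-90.0, min(90.0, dec_deg))
    stripe_height = 180.0 / n_stripes
    # Clamp so that dec = +90 lands in the last stripe instead of off the grid.
    stripe = min(n_stripes - 1, int((dec + 90.0) / stripe_height))
    chunk_width = 360.0 / n_chunks_per_stripe
    chunk = min(n_chunks_per_stripe - 1, int(ra / chunk_width))
    return stripe * n_chunks_per_stripe + chunk


def partition(rows, **grid):
    """Group rows (dicts with 'ra' and 'dec' keys) by chunk id."""
    parts = {}
    for row in rows:
        parts.setdefault(chunk_id(row["ra"], row["dec"], **grid), []).append(row)
    return parts


# Hypothetical rows, for illustration only.
parts = partition([
    {"id": 1, "ra": 10.0, "dec": 0.0},
    {"id": 2, "ra": 12.5, "dec": 3.0},
    {"id": 3, "ra": 200.0, "dec": -45.0},
])
```

Equal-angle RA chunks become very narrow near the poles, which is one reason a production system would use a more careful scheme than this one.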
Virtual Storage
  - Implemented using disk filesystem as back-end
    - Simplifies exported service for use on local user file systems
    - Provides options for D/L operations:
      - User-based partition scheme
      - Legacy code can bypass VOSpace protocols (via FUSE-mounted filesystem)
    - Cons: potential synchronization issues
  - Containers used to package service
    - Bundle dependencies
    - FUSE mounts for other containers
  - Exploit protocol's support of:
    - Capabilities
    - Views
[Figure: Virtual Storage Service Container. Labels: Python VOSpace; Database; Data Lab Interfaces; Base Docker OS Image/Table Support Apps; Local Disk Container]
Example - Bringing It All Together
[Figure: example workflow diagram. Elements shown: NOAO Data Lab (Virtual Storage Svcs, MyDB, Large Catalog Svcs, Data Publication Svcs, DL Tasks); PI/Survey; NSA; User 1 Desktop (Virtual Storage Svc, MyDB, DL Tasks, Data Publication Svc); User 2 Laptop (Virtual Storage Svc, Legacy Tools). Steps are labeled 1(a), 1(b), 1(c), 2(a), 2(b).]
Compute Services / Virtualization
Task Containers
  Why are they interesting?
    - Provide task-level virtualization
    - Much smaller in size, faster to start up
    - Bundle / isolate dependencies
    - Container images can be layered (e.g. a base Python 2.7 environment)
    - Containers have their own IP address
    - Users can log in to a container
    - Can be deployed to other clouds easily
    - Growing user / developer community
    - Repository of public containers available
[Figure: Task Container. Labels: Tasking Interface (Params, Results); Data Lab Support Code; Base OS Image; <<Task>>; Disk Cache Mount; FUSE; Virtual Storage]
Compute Services / Virtualization
Task Containers
  What can you contain?
    - Web applications
    - Desktop tools
    - Almost anything
  Tasking Interface
    - Handles UWS communications with the Job Manager
    - Allows for setting of parameters, results collection, timeouts
    - Redirects stdio streams back to the calling client
  Container Storage
    - Persistent cache container shared in a workflow
    - Virtual storage can be mounted as part of the environment
[Figure: Task Container. Labels: Tasking Interface (Params, Results); Data Lab Support Code; Base OS Image; <<Task>>; Disk Cache Mount; FUSE; Virtual Storage]
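The Tasking Interface responsibilities listed above (parameters, result collection, timeouts, stdio capture) can be sketched with the Python standard library. This is only an illustration under stated assumptions: `run_task`, the `--key=value` parameter convention, the returned dictionary layout and the stand-in echo task are all hypothetical, and no UWS communication or container runtime is involved.

```python
import subprocess
import sys


def run_task(command, params=None, timeout_s=30.0):
    """Run a task command with params as --key=value arguments; capture its stdio."""
    argv = list(command) + [f"--{k}={v}" for k, v in (params or {}).items()]
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return {"status": "TIMEOUT", "returncode": None, "stdout": "", "stderr": ""}
    status = "COMPLETED" if done.returncode == 0 else "ERROR"
    return {"status": status, "returncode": done.returncode,
            "stdout": done.stdout, "stderr": done.stderr}


# Stand-in task (hypothetical): echoes the arguments it was given.
echo_task = [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]
result = run_task(echo_task, {"band": "g", "nside": 64})
```

The captured `stdout` and `stderr` strings are what a caller would relay back to the client; streaming them live would need `subprocess.Popen` instead.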
Compute Services / Job Manager
Job Manager
  - Parallelizes a request based on user parameters
    - User-defined independent input list to parallelize
  - Initializes a job on the remote compute server
    - Executes as sync or async job
    - UWS for job control
  - Polls for completion
  - Gets result objects
  - Returns results to client
    - Or, creates new transfer job
  - Manages hundreds of jobs
[Figure: job execution paths. Sync Job labels: Job Manager; fork(); ssh; Tasking Interface; stdio streams; <<Task>>. Async Job labels: Job Manager; fork(); ssh; UWS Client; Tasking Interface; stdio streams; <<Task>>]
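The parallelize / poll / collect cycle described for the Job Manager can be sketched locally with a thread pool. This is an assumption-laden illustration, not the Data Lab code: `run_parallel` and the `square` task are hypothetical, jobs run in local threads rather than on a remote compute server, and the phase names are borrowed from the UWS job phases (PENDING, QUEUED, COMPLETED, ERROR) purely as labels.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_parallel(task, inputs, max_workers=4, poll_interval_s=0.05):
    """Run task(item) for each independent input; return per-input phases and results."""
    results = [None] * len(inputs)
    phases = ["PENDING"] * len(inputs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {}
        for i, item in enumerate(inputs):
            pending[pool.submit(task, item)] = i
            phases[i] = "QUEUED"
        # Poll until every job has finished, collecting results as they arrive.
        while pending:
            done, _ = wait(list(pending), timeout=poll_interval_s,
                           return_when=FIRST_COMPLETED)
            for fut in done:
                i = pending.pop(fut)
                if fut.exception() is not None:
                    phases[i] = "ERROR"
                    results[i] = repr(fut.exception())
                else:
                    phases[i] = "COMPLETED"
                    results[i] = fut.result()
    return phases, results


def square(x):
    """Stand-in task (hypothetical): fails on negative input."""
    if x < 0:
        raise ValueError("negative input")
    return x * x


phases, results = run_parallel(square, [1, 2, -3, 4])
```

Results come back in input order regardless of completion order, and one failing input does not abort the others, which matches the "independent input list" assumption.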
Query Manager / SQL Service
  Query Manager
    - Provides a high-level, uniform interface for clients to query data services
    - Hides the sync/async job handling and VO protocols from clients
    - Orchestrates result handling (download, save to virtual storage, etc.)
  SQL Service
    - Provides job control for queries by implementing UWS
    - Offers options for query-result handling
      - Store to personal database, virtual storage, direct download, etc.
      - Download format options (FITS, etc.)
    - Offers an alternative to VO TAP
      - Greater re-use of existing DB client software
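The query-result handling options can be illustrated with a small sketch that runs a query and either returns the rows as CSV text (standing in for a direct download) or stores them as a new table (standing in for a personal database). Everything here is an assumption for illustration: `run_query`, its `output` values, the table and column names, and the use of an in-memory SQLite database in place of the real catalog back-end. UWS job control and FITS output are not shown.

```python
import csv
import io
import sqlite3


def run_query(conn, sql, output="csv", table=None):
    """Run a SELECT and either return CSV text or store the result as a table."""
    if output == "csv":
        cur = conn.execute(sql)
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)
        return buf.getvalue()
    if output == "mydb":
        # Only plain identifiers are accepted, since the name is spliced into SQL.
        if not (table and table.isidentifier()):
            raise ValueError("a valid table name is required for output='mydb'")
        conn.execute(f"CREATE TABLE {table} AS {sql}")
        return table
    raise ValueError(f"unknown output option: {output}")


# Hypothetical catalog contents, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (id INTEGER, ra_deg REAL, dec_deg REAL, mag REAL)")
conn.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?)",
                 [(1, 10.0, -5.0, 18.2), (2, 10.05, -5.02, 21.7)])
csv_text = run_query(conn, "SELECT id, mag FROM catalog WHERE mag < 20")
run_query(conn, "SELECT * FROM catalog WHERE mag >= 20", output="mydb", table="faint")
```

Keeping the output choice as a parameter of one call is what lets a Query Manager hide the result-handling details from clients.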
Data Publication
  - Capability is used in multiple contexts
    - Public access to high-level data products (static)
    - Private access used in workflows (transient)
    - Semi-private access within a collaboration (shared)
  - Shared responsibility between D/L and users
    - D/L provides tools, resources and a publishing framework
    - Users provide the content and the scientific curation
  - Low-cost, simple services for all datasets
  - Higher-cost, advanced services to support collaborations
Storage Manager
  - Provides a simple interface for user applications
    - Hides details of the Virtual Storage implementation (VOSpace)
    - Can map to idiomatic filesystem interfaces easily (e.g. get, put, list)
    - Abstracts easily to web, desktop and programmatic APIs
  - Provides authenticated access to data holdings
  - Manages the details for other Data Lab services
    - Endpoint resolution, authentication, etc. when used to save results
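A get/put/list interface of the kind described can be sketched over a local directory, with one sub-directory per user as a simple user-based partition. This is a sketch under stated assumptions, not the Data Lab API: the class name `LocalStorageManager`, its method signatures and the sample file names are hypothetical, and neither VOSpace nor authentication is involved.

```python
import tempfile
from pathlib import Path


class LocalStorageManager:
    """Hypothetical get/put/list facade over a per-user directory."""

    def __init__(self, root, user):
        # User-based partition: each user gets a sub-directory under root.
        self._base = (Path(root) / user).resolve()
        self._base.mkdir(parents=True, exist_ok=True)

    def _resolve(self, name):
        path = (self._base / name).resolve()
        # Refuse names that would escape the user's partition (e.g. "../x").
        if path != self._base and self._base not in path.parents:
            raise PermissionError(f"{name!r} is outside the user's storage")
        return path

    def put(self, name, data):
        path = self._resolve(name)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, name):
        return self._resolve(name).read_bytes()

    def list(self, prefix=""):
        base = self._resolve(prefix) if prefix else self._base
        return sorted(p.relative_to(self._base).as_posix()
                      for p in base.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as root:
    store = LocalStorageManager(root, "user1")
    store.put("results/query1.csv", b"id,mag\n1,18.2\n")
    listing = store.list()
    payload = store.get("results/query1.csv")
```

Because callers only see names relative to their own partition, the back-end could later be swapped for a VOSpace client without changing client code.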
Authentication / Authorization
  - Implementation deferred in Year 1 due to potential landmines in a changing landscape
    - General user support not needed; trusted users only
    - Year-1 services will use a null interface to mark where the service is needed in the code, without requiring a working service
    - Various authentication methods under discussion
  - Requests to public services are passed through automatically
    - Implies the service knows public vs. private services
  - Manages user- and group-level access to resources
  - Manages multiple authentication methods as needed
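The "null interface" idea can be sketched as a placeholder object that services call at the points where authentication will eventually be enforced. The names below (`NullAuthenticator`, `handle_request`, `PUBLIC_SERVICES` and the service names in it) are hypothetical and chosen only for illustration; the real interface and the list of public services are not specified in these slides.

```python
class NullAuthenticator:
    """Placeholder: every caller is treated as a trusted user."""

    def authenticate(self, credentials=None):
        return "trusted-user"

    def is_authorized(self, user, resource, action="read"):
        return True


# Hypothetical set of services that allow anonymous access.
PUBLIC_SERVICES = {"sia", "ssa", "scs"}


def handle_request(service, credentials=None, auth=None):
    """Pass public requests through; route private ones via the auth interface."""
    auth = auth if auth is not None else NullAuthenticator()
    if service in PUBLIC_SERVICES:
        return {"service": service, "user": None}
    user = auth.authenticate(credentials)
    if user is None or not auth.is_authorized(user, service):
        raise PermissionError(f"access to {service!r} denied")
    return {"service": service, "user": user}


public_reply = handle_request("sia")
private_reply = handle_request("mydb")
```

Because the call sites already exist, a working authenticator with the same two methods can later be dropped in without touching the services.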