Introduction to LSST Data Management
Jeffrey Kantor, Data Management Project Manager




LSST Data Management Principal Responsibilities

Archive Raw Data: Receive and archive the incoming stream of raw images generated by the Camera system.

Process to Data Products: Detect and alert on transient events within one minute of visit acquisition. Approximately once per year, create and archive a Data Release: a static, self-consistent collection of data products generated from all survey data taken from the date of survey initiation to the cutoff date for that Data Release.

Publish: Make all LSST data available through an interface that uses community-accepted standards, and facilitate user data analysis and production of user-defined data products at Data Access Centers (DACs) and external sites.

LSST From the User's Perspective

Level 1:
- A stream of ~10 million time-domain events per night, detected and transmitted to event distribution networks within 60 seconds of observation.
- A catalog of orbits for ~6 million bodies in the Solar System.

Level 2:
- A catalog of ~37 billion objects (20B galaxies, 17B stars), ~7 trillion observations ("sources"), and ~30 trillion measurements ("forced sources"), produced annually and accessible through online databases.
- Deep co-added images.

Level 3:
- Services and computing resources at the Data Access Centers to enable user-specified custom processing and analysis.
- Software and APIs enabling development of analysis codes.

Data Management System Architecture

Application Layer (LDM-151), the scientific layer (custom software):
- Pipelines constructed from reusable, standard parts, i.e. the Application Framework
- Standardized representations of Data Products; metadata extendable without schema change
- Object-oriented, Python, C++

Middleware Layer (LDM-152), custom software on top of open-source, off-the-shelf software:
- Portability to clusters, grid, and other platforms
- Provides standard services so applications behave consistently (e.g. provenance)
- Preserves performance (<1% overhead)

WBS elements in the architecture: 02C.05 Science User Interface and Analysis Tools; 02C.06.01 Science Data Archive (Images, Alerts, Catalogs); 02C.06.02 Data Access Services; 02C.03.05, 02C.04.07 Application Framework; 02C.01.02.02-03 SDQA and Science Pipeline Toolkits; 02C.01.02.01, 02C.02.01.04, 02C.03, 02C.04 Alert, SDQA, Calibration, Data Release Productions/Pipelines; 02C.07.01, 02C.06.03 Processing Middleware; 02C.07.02 Infrastructure Services (System Administration, Operations, Security)

Infrastructure Layer (LDM-129), a distributed platform (off-the-shelf, commercial hardware and software; custom integration):
- Different sites specialized for real-time alerting, data release production, and petascale data access
- 02C.07.04.01 Archive Site; 02C.07.04.02 Base Site; Physical Plant (included in above); 02C.08.03 Long-Haul Communications

Data Management System Design (LDM-148)
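For illustration, the Application Layer pattern of pipelines assembled from reusable, configurable components can be sketched in plain Python. This is a minimal sketch of the idea, not the actual LSST Application Framework API; all class and parameter names below are illustrative.

    # Pipelines built from reusable, configurable algorithmic components
    # (illustrative names only; not the LSST stack API).

    class Config:
        """Holds algorithm parameters separately from code, so a pipeline can be
        reconfigured without code or schema changes."""
        def __init__(self, **params):
            self.__dict__.update(params)

    class Task:
        """A reusable algorithmic component; concrete tasks override run()."""
        def __init__(self, config):
            self.config = config
        def run(self, data):
            raise NotImplementedError

    class SourceDetectionTask(Task):
        def run(self, image):
            # Keep pixels above a configurable threshold (placeholder logic).
            return [pixel for pixel in image if pixel > self.config.threshold]

    class Pipeline:
        """A pipeline is an ordered composition of tasks."""
        def __init__(self, tasks):
            self.tasks = tasks
        def run(self, data):
            for task in self.tasks:
                data = task.run(data)
            return data

    # The same detection component reused with different configurations:
    nightly = Pipeline([SourceDetectionTask(Config(threshold=5.0))])
    deep = Pipeline([SourceDetectionTask(Config(threshold=1.5))])
    print(nightly.run([0.2, 7.3, 4.9]))   # -> [7.3]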

Mapping Data Products into Pipelines
(Pipelines are grouped by the Level 1, Level 2, and Level 3 data products they produce.)

- 02C.01.02.01/02. Data Quality Assessment Pipelines
- 02C.01.02.04. Calibration Products Production Pipelines
- 02C.03.01. Instrumental Signature Removal Pipeline
- 02C.03.01. Single-Frame Processing Pipeline
- 02C.03.04. Image Differencing Pipeline
- 02C.03.03. Alert Generation Pipeline
- 02C.03.06. Moving Object Pipeline
- 02C.04.04. Coaddition Pipeline
- 02C.04.04/.05 Association and Detection Pipelines
- 02C.04.06. Object Characterization Pipeline
- 02C.04.03. PSF Estimation
- 02C.01.02.03. Science Pipeline Toolkit
- 02C.03.05/04.07 Common Application Framework

Data Management Applications Design (LDM-151)
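As a sketch of how the nightly (Level 1) pipelines named above chain together, the following plain-Python outline strings instrumental signature removal, single-frame processing, image differencing, and alert generation into one flow. Function names mirror the pipeline names on this slide; the bodies are placeholders, not LSST code.

    # Illustrative chaining of the Level 1 pipelines (placeholder logic only).

    def instrumental_signature_removal(raw_image):
        return {"calexp": raw_image}              # remove bias/dark/flat signatures

    def single_frame_processing(visit):
        visit["sources"] = ["src1", "src2"]       # detect/measure sources on the visit
        return visit

    def image_differencing(visit, template):
        visit["diasources"] = ["dia1"]            # subtract template, find transients
        return visit

    def alert_generation(visit):
        return [{"alert": s} for s in visit["diasources"]]  # package alerts for distribution

    def nightly_chain(raw_image, template):
        visit = single_frame_processing(instrumental_signature_removal(raw_image))
        return alert_generation(image_differencing(visit, template))

    print(nightly_chain(raw_image="exp0001", template="coadd_patch_12"))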

Infrastructure: Petascale Computing, Gbps Networks

Archive Site and U.S. Data Access Center (NCSA, Champaign, IL):
- The computing cluster at the LSST Archive at NCSA will run the processing pipelines
- Single-user, single-application data center
- Commodity computing clusters
- Distributed file system for scaling and hierarchical storage
- Locally attached, shared-nothing storage where high bandwidth is needed

Long-haul networks to transport data from Chile to the U.S.:
- 2x100 Gbps from the Summit to La Serena (new fiber)
- 2x40 Gbps from La Serena to Champaign, IL (path-diverse, existing fiber)

Base Site and Chilean Data Access Center: La Serena, Chile
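As a rough feasibility check on these link speeds, the sketch below estimates per-exposure transfer times. The ~6.4 GB figure (3.2 gigapixels at 2 bytes per pixel) is an illustrative assumption that ignores overscan, compression, and protocol overhead.

    # Back-of-the-envelope transfer times over the long-haul links listed above.
    image_bytes = 3.2e9 * 2          # ~6.4 GB of raw pixel data per exposure (assumed)

    links_gbps = {
        "Summit -> La Serena (per 100 Gbps fiber)": 100,
        "La Serena -> Champaign (per 40 Gbps path)": 40,
    }

    for name, gbps in links_gbps.items():
        seconds = image_bytes * 8 / (gbps * 1e9)
        print(f"{name}: {seconds:.2f} s per exposure")

    # ~0.5 s and ~1.3 s respectively: well inside the 60-second alert latency budget,
    # leaving most of the minute for image differencing and alert generation.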

Middleware Layer: Isolating Hardware, Orchestrating Software

Enabling execution of science pipelines on hundreds of thousands of cores:
- Frameworks to construct pipelines out of basic algorithmic components
- Orchestration of execution on thousands of cores
- Control and monitoring of the whole DM System

Isolating the science pipelines from details of the underlying hardware:
- Services used by applications to access/produce data and communicate
- "Common denominator" interfaces handle changing underlying technologies

Data Management Middleware Design (LDM-152)
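A minimal sketch of such a "common denominator" data access interface is shown below: pipeline code asks for a dataset by type and data ID, and a back end chosen by configuration does the I/O, so the storage technology can change without touching pipeline code. This is in the spirit of the data access services described above but is not the actual LSST middleware API; names and back ends are illustrative.

    import os

    class PosixStore:
        """One possible back end; an object store or distributed FS could replace it."""
        def __init__(self, root):
            self.root = root
        def read(self, path):
            with open(os.path.join(self.root, path), "rb") as f:
                return f.read()
        def write(self, path, payload):
            full = os.path.join(self.root, path)
            os.makedirs(os.path.dirname(full), exist_ok=True)
            with open(full, "wb") as f:
                f.write(payload)

    class DataAccess:
        """Applications call get/put with a dataset type and data ID and never
        handle file paths or storage technology directly."""
        def __init__(self, store):
            self.store = store
        def _path(self, dataset_type, data_id):
            keys = "_".join(f"{k}{v}" for k, v in sorted(data_id.items()))
            return os.path.join(dataset_type, f"{keys}.fits")
        def get(self, dataset_type, **data_id):
            return self.store.read(self._path(dataset_type, data_id))
        def put(self, payload, dataset_type, **data_id):
            self.store.write(self._path(dataset_type, data_id), payload)

    # Usage (illustrative): a pipeline is handed a DataAccess instance at startup.
    # dax = DataAccess(PosixStore("/data/repo"))
    # raw = dax.get("raw", visit=123, ccd=42)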

Database and Science UI: Delivering to Users

Massively parallel, distributed, fault-tolerant relational database:
- To be built on existing, robust, well-understood technologies (MySQL and xrootd)
- Commodity hardware, open source
- Advanced prototype in existence (qserv)

Science User Interface to enable access to and analysis of LSST data:
- Web and machine interfaces to LSST databases
- Visualization and analysis capabilities

More: talks by Becla, Van Dyk
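Because the baseline builds on MySQL, a machine interface to the catalogs can look like an ordinary MySQL client session. The sketch below illustrates that; the endpoint, credentials, table name, and column names are placeholders, not the actual LSST schema or service address.

    # Hypothetical client-side query against the LSST catalog database through a
    # MySQL-compatible interface (placeholder endpoint and schema).
    import pymysql

    conn = pymysql.connect(host="lsst-db.example.org", user="reader",
                           password="********", database="catalog", port=3306)
    try:
        with conn.cursor() as cur:
            # Simple box search on an Object-style table (placeholder columns).
            cur.execute(
                "SELECT objectId, ra, decl, rMag "
                "FROM Object "
                "WHERE ra BETWEEN %s AND %s AND decl BETWEEN %s AND %s "
                "AND rMag < %s",
                (150.0, 150.5, 2.0, 2.5, 24.0),
            )
            for object_id, ra, decl, r_mag in cur.fetchall():
                print(object_id, ra, decl, r_mag)
    finally:
        conn.close()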

Critical Prototypes: Algorithms and Technologies

Algorithm Design:
- Approximately 60% of the software functional capability has been prototyped
- Over 350,000 lines of C++ and Python coded, unit tested, integrated, and run in production mode
- Three terabyte-scale datasets released, including single-frame measurements and point-source and galaxy photometry
- Precursors leveraged: Pan-STARRS, SDSS, HSC

Petascale Computing Design:
- Executed in parallel on up to 10k cores (TeraGrid/XSEDE and NCSA Blue Waters hardware) with scalable results

Petascale Database Design:
- Conducted parallel database tests on up to 300 nodes and 100 TB of data, 100% of scale for operations year 1

Gigascale Network Design:
- Currently testing at up to 1 Gbps
- Agreements in principle are in hand with key infrastructure providers (NCSA, FIU/AmPath, REUNA, IN2P3)

Data Management Scope is Defined and Requirements are Established

- Data Product requirements have been vetted with the Science Collaborations multiple times and successfully passed review (Jul 2013)
- Data quality and algorithmic assessments are far advanced, we understand the risks, and they successfully passed review (Sep 2013)
- Hardware sizing has been refreshed based on the latest scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
- Interfaces are defined to Phase 2 level
- Requirements and Final Design have been baselined (Data Management Technical Control Team)
- Traceability from the OSS to the DMSR has been verified
- All WBS elements have been estimated and scheduled in PMCS, with scope and basis of estimate documented

Data Management ICDs needed for Construction start are at Phase 2 level, with formal change control in progress (Phase 1).
ICDs on Confluence: http://ls.st/mmm
Docushare: http://ls.st/col-1033

Going Where the Talent Is: Distributed Team

Functional areas distributed across the team's institutions: Management, I&T, and Science QA; User Interfaces; Database; Science Pipelines; Middleware; Infrastructure.

Data Management Organization

LSST DM Leadership: Project Manager J. Kantor; Project Scientist M. Juric
DM lead institutions are integrated into one project and are performing in their construction roles/responsibilities.

- Survey Science Group (SSG): Lead Scientist TBD; F. Economou (LSST)
- System Architecture: K-T. Lim, G. Dubois-Felsmann (SLAC)
- International Comms/Base Site: R. Lambert (NOAO)
- Processing Services & Site Infrastructure: D. Petravick (NCSA)
- Science Database & Data Access Services: J. Becla (SLAC)
- Alert Production: A. Connolly (UW/OPEN)
- Data Release Production: R. Lupton, J. Swinbank (Princeton)
- Science User Interface & Tools: X. Wu, D. Ciardi (IPAC)

Data Management Organization (document-139)

Leveraging National and International Investments

NSF/OCI funded:
- Formal relationships continue with the IRNC-funded AmLight project, the lead entity in securing Chile-US network capacity for LSST
- We have leveraged significant XSEDE and Blue Waters service unit and storage allocations for critical R&D-phase prototypes and productions
- Our LSST Archive Center and US Data Access Center will be hosted in the National Petascale Computing Facility at NCSA
- A strong relationship has been established with the Condor Group at the University of Wisconsin, and HTCondor is now in our processing middleware baseline
- We have reused a wide range of open-source software libraries and tools, many of which received seed funding from the NSF

Other national/international funded:
- We have participated in joint development of astronomical software with Pan-STARRS and HSC
- We have fostered collaborative development of scientific database technology via the Extremely Large Databases (XLDB) conferences and collaborations with database developers (e.g. SciDB, MySQL, MonetDB)
- We have a deep process of community engagement to deliver products that are needed, and an architecture that allows the community to deliver their own tools

Data Management is Construction Ready

The Data Management System is scoped and credibly estimated:
- Requirements have been baselined and are achievable (LSE-61)
- Final Design baselined (LDM-148, -151, -152, -129, -135)
- Approximately 60% of the software functional capability has been prototyped
- Data and algorithmic assessments are far advanced and we understand the risks
- Hardware sizing has been done based on scientific and engineering requirements, system design, technology trends, software performance profiles, and acquisition strategy
- All lowest-level WBS elements have been estimated and scheduled in PMCS, with scope and basis of estimate documented

The team is in place:
- All lead institutions are demonstrably integrated into one project and are performing in their construction roles/responsibilities
- Core lead technical personnel are on board at all institutions
- Agreements in principle are in hand with key technology and center providers (NCSA, NOAO, FIU/AmPath, REUNA)

The software development process has been exercised fully:
- Eight software and data releases have been successfully executed
- Standard, formal processes, tools, and environments have been exercised repeatedly and refined
- The automated build and test environment is configured and exercised nightly/weekly
- Data Management PMCS plans are current and complete