Data analysis of L2-L3 products




Transcription:

Data analysis of L2-L3 products Emmanuel Gangler UBP Clermont-Ferrand (France) Emmanuel Gangler BIDS 14 1/13

Data management is a pillar of the project
Diagram labels: Telescope, Camera, Data Management, Outreach; L1 & L2; L3
- "The data volumes [...] of LSST are so large that the limitation on our ability to do science isn't the ability to collect the data, it's the ability to understand [...] the data." Andrew Connolly (U. Washington)
- "How do you turn petabytes of data into scientific knowledge?" Kirk Borne (George Mason U.)

Data products
- L1: Nightly
- L2: Annual
- Image, Catalog

From Image to Catalog
- Raw image (in 1 band)
- Calibration images: "Flat", ...
- Clean image (+ weights image + flag image)
- Standard container: .fts format
- SNLS images, from P. Astier
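As a rough illustration of the raw-to-clean step, the sketch below divides a raw image by a normalized flat field and produces a weights image and a flag image alongside it. The function name `calibrate`, the flag convention (1 = unusable pixel), the `min_flat` threshold and the simple weight model (weight proportional to the squared flat response) are assumptions made for this example; the slide does not describe the actual SNLS or LSST calibration procedure.

```python
def calibrate(raw, flat, min_flat=0.1):
    """Flat-field a raw image; return (clean, weights, flags) as nested lists."""
    # Normalize the flat so that its mean response is 1.
    n_pix = sum(len(row) for row in flat)
    mean_flat = sum(sum(row) for row in flat) / n_pix

    clean, weights, flags = [], [], []
    for raw_row, flat_row in zip(raw, flat):
        clean_row, weight_row, flag_row = [], [], []
        for value, flat_value in zip(raw_row, flat_row):
            response = flat_value / mean_flat
            if response < min_flat:
                # Dead or heavily vignetted pixel: flag it, give it zero weight.
                clean_row.append(0.0)
                weight_row.append(0.0)
                flag_row.append(1)
            else:
                clean_row.append(value / response)
                # Assumed noise model: variance scales as 1 / response**2.
                weight_row.append(response ** 2)
                flag_row.append(0)
        clean.append(clean_row)
        weights.append(weight_row)
        flags.append(flag_row)
    return clean, weights, flags
```

Carrying the weights and flags with the clean image lets later steps (stacking, source extraction) down-weight or skip unreliable pixels.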

From Image to Catalog
- Clean images + astrometry
- Stacked image (here: 600 images)
- N sources, 1 object
- SNLS images, from P. Astier

From image to catalog
For each object/source, extract data:
- Metadata: time of observation, band, exposure, ...
- Sky coordinates (almost an index): RA/Dec, pixel, ...
- Flux measurement: aperture, PSF, extended source, ...
- Shape measurements: second-order moments, ...
- Quality flags
- And associated covariance
Scale: ~100 attributes to describe a source, ~1000 sources per object, ~40 B objects
Remarks:
- LSST paradigm: characterize first (L2), analyze later (L3)
- Image processing: I/O driven, highly parallel
- Scalability: e.g. using map/reduce for coaddition, http://arxiv.org/abs/1010.1015
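The map/reduce remark can be illustrated with a small, single-process sketch of coaddition. It assumes the exposures have already been resampled onto a common sky grid, and that each pixel is a `(cell, value, weight, flag)` tuple; the function names and this record layout are hypothetical and are not taken from the cited paper or from the LSST stack.

```python
from collections import defaultdict


def map_exposure(exposure):
    """Map step: emit (sky_cell, (weighted_flux, weight)) for each usable pixel."""
    for cell, value, weight, flag in exposure:
        if flag == 0 and weight > 0:
            yield cell, (weight * value, weight)


def reduce_cell(pairs):
    """Reduce step: combine all contributions to one sky cell into a weighted mean."""
    total_flux = sum(weighted_flux for weighted_flux, _ in pairs)
    total_weight = sum(weight for _, weight in pairs)
    return total_flux / total_weight, total_weight


def coadd(exposures):
    """Group mapped pixels by sky cell (the "shuffle"), then reduce each group."""
    grouped = defaultdict(list)
    for exposure in exposures:
        for cell, pair in map_exposure(exposure):
            grouped[cell].append(pair)
    return {cell: reduce_cell(pairs) for cell, pairs in grouped.items()}
```

Because each exposure is mapped independently and each sky cell is reduced independently, both phases can be spread over many workers, which is what makes the step "highly parallel".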

Data mining
Astroinformatics point of view (figure: Borne 2009; label: VO domain)

Data mining
Astroinformatics point of view (Borne 2009):
- Which knowledge to extract?
- How to reuse knowledge?
- How to integrate information and learning algorithms?
- Which new algorithms to develop?
- How to test the new ideas?
VO domain

Distributing LSST data
The baseline:
- Orchestration tool
- SQL parser
- Metadata DB
- User-defined function (geometry)
- Communication with xrootd
- MySQL backend
- Returns aggregate results
Partitioning:
- Geometry (cone searches)
- Sources and Objects in the same node
Limitations:
- SQL-based
- Some queries can't be handled
- Ad hoc optimization

Distributing LSST data
Same baseline, partitioning and limitations as on the previous slide, with the added label: WG1
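To make the geometric partitioning idea concrete, here is a minimal sketch of a sky split into RA/Dec chunks, with each object's sources stored alongside it and a cone search dispatched only to the chunks it can touch before the partial counts are aggregated. The chunk sizes, the function names and the `(object_id, ra, dec, sources)` row layout are assumptions for illustration; this is not the actual LSST baseline implementation.

```python
import math
from collections import defaultdict

N_DEC_STRIPES = 18                      # hypothetical: 10-degree declination stripes
N_RA_BINS = 36                          # hypothetical: 10-degree right-ascension bins
STRIPE_HEIGHT = 180.0 / N_DEC_STRIPES
RA_BIN_WIDTH = 360.0 / N_RA_BINS


def _stripe(dec_deg):
    return min(int((dec_deg + 90.0) / STRIPE_HEIGHT), N_DEC_STRIPES - 1)


def chunk_id(ra_deg, dec_deg):
    """Map a sky position to a chunk number (stripe-major ordering)."""
    ra_bin = min(int((ra_deg % 360.0) / RA_BIN_WIDTH), N_RA_BINS - 1)
    return _stripe(dec_deg) * N_RA_BINS + ra_bin


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    d_ra = math.radians(ra2 - ra1)
    d1, d2 = math.radians(dec1), math.radians(dec2)
    a = math.sin((d2 - d1) / 2) ** 2 + math.cos(d1) * math.cos(d2) * math.sin(d_ra / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def build_chunks(objects):
    """Place each (object_id, ra, dec, sources) row in the chunk of its position,
    so that an object's sources live in the same chunk as the object."""
    chunks = defaultdict(list)
    for row in objects:
        chunks[chunk_id(row[1], row[2])].append(row)
    return chunks


def chunks_for_cone(ra0, dec0, radius):
    """Return the ids of all chunks that a cone search could overlap."""
    stripes = range(_stripe(max(dec0 - radius, -90.0)), _stripe(min(dec0 + radius, 90.0)) + 1)
    if abs(dec0) + radius >= 90.0:
        ra_bins = set(range(N_RA_BINS))          # cone touches a pole: every RA bin
    else:
        half_width = math.degrees(math.asin(
            math.sin(math.radians(radius)) / math.cos(math.radians(dec0))))
        first = math.floor((ra0 - half_width) / RA_BIN_WIDTH)
        last = math.floor((ra0 + half_width) / RA_BIN_WIDTH)
        ra_bins = {b % N_RA_BINS for b in range(first, last + 1)}   # handles RA wrap-around
    return {s * N_RA_BINS + b for s in stripes for b in ra_bins}


def cone_count(chunks, ra0, dec0, radius):
    """Scatter the query to the candidate chunks, then aggregate the partial counts."""
    partial_counts = []
    for cid in chunks_for_cone(ra0, dec0, radius):   # each chunk stands in for one worker
        rows = chunks.get(cid, [])
        partial_counts.append(sum(
            1 for _, ra, dec, _ in rows
            if angular_separation(ra, dec, ra0, dec0) <= radius))
    return sum(partial_counts)
```

A query such as a count inside a cone aggregates cleanly from per-chunk partial results; queries that need to compare rows living in different chunks do not, which is one way to read the "some queries can't be handled" limitation.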

Which knowledge to extract?
Classical problems in astronomy:
- Object classification
- Highly dimensional problems (> 1000 dimensions, > 10^10 entries)
- 2-point (or N-point) correlations
- Rarity metric, efficient algorithms
Discoveries?
- Anomalies (detector, software)
- Dimensional reduction
- Rarity detection
- Cluster significance? (statistical/scientific)
- Confusion problems
- Efficient algorithms for compact data representation
Measurement errors, statistical approach:
- Impact usually underestimated in machine learning
S. G. Djorgovski,
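The need for efficient algorithms is easy to see with a naive 2-point correlation estimate. The sketch below counts pairs by brute force and applies the simple "natural" estimator DD/RR - 1 against a random comparison catalog. It assumes flat-sky (x, y) coordinates, and the function names are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import combinations


def pair_counts(points, bin_edges):
    """Histogram of pair separations; bins are half-open [edge_i, edge_i+1)."""
    counts = [0] * (len(bin_edges) - 1)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        separation = math.hypot(x1 - x2, y1 - y2)
        index = bisect_right(bin_edges, separation) - 1
        if 0 <= index < len(counts):
            counts[index] += 1
    return counts


def natural_estimator(data, randoms, bin_edges):
    """Estimate the 2-point correlation as normalized DD / RR - 1 in each bin."""
    dd = pair_counts(data, bin_edges)
    rr = pair_counts(randoms, bin_edges)
    # Normalize for the different number of pairs in the two catalogs.
    norm = (len(randoms) * (len(randoms) - 1)) / (len(data) * (len(data) - 1))
    return [norm * d / r - 1.0 if r > 0 else float("nan") for d, r in zip(dd, rr)]
```

The loop visits every pair, so the cost grows as N^2. With more than 10^10 entries that is of order 10^20 pair evaluations, which is why tree-based or approximate pair counting is needed at this scale.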

Some astrophysical challenges for machine learning
Galaxy classification, transient classification:
- Humans are better than computers at this task: citizen science (e.g. Galaxy Zoo); however, there are 20 B galaxies in LSST
- See Darko's talk
Photometric redshifts:
- How to invert the (galaxy type + "distance") to u g r i z y (& morphology) relation to retrieve distance and galaxy properties?
- Spectroscopic training sample smaller by ~10^3
- Recovering hidden parameters...
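One simple way to picture the photometric redshift problem is nearest-neighbour regression in colour space: a galaxy with only u g r i z y magnitudes is assigned the mean spectroscopic redshift of the training galaxies with the most similar colours. This is only a sketch of one possible approach, not a method named on the slide; the function names and the choice of k are assumptions.

```python
import math


def colors(mags):
    """Adjacent-band colours (u-g, g-r, r-i, i-z, z-y) from six magnitudes."""
    return [a - b for a, b in zip(mags, mags[1:])]


def knn_photoz(training, target_mags, k=3):
    """Estimate a redshift from the k nearest training galaxies in colour space.

    training: list of (ugrizy magnitudes, spectroscopic redshift) pairs.
    """
    target = colors(target_mags)
    by_distance = sorted(
        training, key=lambda item: math.dist(colors(item[0]), target))
    nearest = by_distance[:k]
    return sum(redshift for _, redshift in nearest) / len(nearest)
```

The estimate can only be as good as the coverage of the training set, so a spectroscopic sample that is ~10^3 times smaller than the photometric one leaves large regions of colour space poorly constrained.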

Thoughts about bridging expertise
- Big Data research needs data! Informatics research needs reference (and documented) data sets to experiment with.
- Solving specific issues: machine-learning-aware Geo- and Astro- researchers (WG3)
- Not all problems are impacted the same way by scalability
- "Classical" learning can still lead to good results
- Bottleneck in integrating learning methods and data
- Disentangle machine learning and Big Data mining
- LSST has handy precursor data (SDSS, CFHTLS, DES, HSC...)
- Simulation is mandatory to assess performance / detect biases
- Some algorithmic approaches are specific to Big Data (1-pass algorithms, sublinear methods...)
- Need to select/apply existing methods to Astro- and Geo- data
- Need to find the questions where learning will provide answers
- Matching algorithms, data and issues is the key!
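As an example of what a 1-pass algorithm looks like, the sketch below uses Welford's online update to compute a mean and sample variance while reading each value exactly once and keeping only three numbers in memory. The class name `RunningStats` is chosen for this example.

```python
class RunningStats:
    """One-pass (streaming) mean and sample variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN until two values are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for flux in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(flux)
```

Because the state is constant in size, the same update can run over a catalog that is far too large to hold in memory.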