Data analysis of L2-L3 products
  Emmanuel Gangler, UBP Clermont-Ferrand (France)
  BIDS 14
Data management is a pillar of the project
  Project components: Telescope, Camera, Data Management, Outreach
  Data product levels: L1 & L2, L3
  «The data volumes [...] of LSST are so large that the limitation on our ability to do science isn't the ability to collect the data, it's the ability to understand [...] the data» Andrew Connolly (U. Washington)
  How do you turn petabytes of data into scientific knowledge? Kirk Borne (George Mason U.)
Data products
  L1: Nightly
  L2: Annual
  Image, Catalog
From Image to Catalog
  Raw image (in 1 band)
  Calibration images: «Flat», ...
  Clean image (+ weight image + flag image)
  Standard container: .fits format
  SNLS images, from P. Astier
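As a rough illustration of the raw-to-clean step, here is a minimal flat-fielding sketch. It assumes the only calibration applied is division by a normalized «Flat», and that a pixel is flagged when its relative response falls below a threshold. The function name `flat_field`, the `min_response` threshold and the toy pixel values are all hypothetical; this is not the SNLS or LSST pipeline.

```python
# Toy flat-fielding sketch (hypothetical helper, not SNLS or LSST code).
def flat_field(raw, flat, min_response=0.1):
    """Return (clean, flags) for 2D lists `raw` and `flat` of equal shape."""
    n_pix = sum(len(row) for row in flat)
    mean_flat = sum(sum(row) for row in flat) / n_pix
    clean, flags = [], []
    for raw_row, flat_row in zip(raw, flat):
        clean_row, flag_row = [], []
        for counts, response in zip(raw_row, flat_row):
            rel = response / mean_flat      # relative pixel response
            if rel < min_response:          # assumed threshold for a dead pixel
                clean_row.append(0.0)
                flag_row.append(1)          # 1 = do not use this pixel
            else:
                clean_row.append(counts / rel)
                flag_row.append(0)
        clean.append(clean_row)
        flags.append(flag_row)
    return clean, flags

# Invented 2x2 example: the last pixel has no response in the flat.
raw = [[100.0, 110.0], [90.0, 0.5]]
flat = [[1.0, 1.1], [0.9, 0.0]]
clean, flags = flat_field(raw, flat)
```

The flag image returned here plays the role of the flag image that accompanies the clean image; a weight image would be built in a similar per-pixel way.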
From Image to Catalog
  N sources, 1 object
  Clean images + astrometry
  Stacked image (here: 600 images)
  SNLS images, from P. Astier
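Stacking can be sketched as a per-pixel weighted mean. The example below assumes the clean images are already registered on a common pixel grid (the role of the astrometry) and that each comes with a weight image in which flagged pixels have weight 0. It is written as a map step followed by an associative reduce step, so partial stacks can be merged in any order. The names `to_partial`, `merge` and `coadd`, and the pixel values, are illustrative only.

```python
from functools import reduce

def to_partial(exposure):
    """Map step: turn one exposure into per-pixel (weighted sum, weight sum)."""
    pixels, weights = exposure
    return ([p * w for p, w in zip(pixels, weights)], list(weights))

def merge(a, b):
    """Reduce step: add two partial stacks pixel by pixel (associative)."""
    return ([x + y for x, y in zip(a[0], b[0])],
            [x + y for x, y in zip(a[1], b[1])])

def coadd(exposures):
    """Weighted-mean stack of exposures given as (pixels, weights) pairs."""
    wsum, w = reduce(merge, map(to_partial, exposures))
    # Pixels never observed (zero total weight) are left at 0.0.
    return [s / t if t > 0 else 0.0 for s, t in zip(wsum, w)]

# Three toy exposures of a 3-pixel image (flattened to 1D for brevity).
exposures = [
    ([10.0, 20.0, 5.0], [1.0, 1.0, 0.0]),   # third pixel flagged (weight 0)
    ([14.0, 20.0, 7.0], [1.0, 3.0, 0.0]),
    ([12.0, 24.0, 9.0], [2.0, 0.0, 0.0]),
]
stacked = coadd(exposures)   # [12.0, 20.0, 0.0]
```

Because `merge` is associative, the same two functions would work unchanged if the 600 exposures were split across many workers and the partial results combined afterwards.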
From image to catalog
  For each object/source, extract data:
    Metadata: time of observation, band, exposure, ...
    Sky coordinates (almost an index): Ra/dec, pixel, ...
    Flux measurement: aperture, PSF, extended source, ...
    Shape measurements: 2nd-order moments, ...
    Quality flags
    And associated covariance
  ~100 attributes to describe a source
  ~1000 sources per object
  ~40 B objects
  Remarks:
    LSST paradigm: characterize first (L2), analyze later (L3)
    Image processing: I/O driven, highly parallel
    Scalability: e.g. using map/reduce for coaddition. http://arxiv.org/abs/1010.1015
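The three orders of magnitude above can be multiplied out to get a feel for the catalog size. The counts come from the slide; the 4 bytes per attribute is an assumption made here purely for illustration (real attributes have mixed types and the stored size depends on the schema).

```python
# Back-of-envelope catalog size from the slide's orders of magnitude.
N_OBJECTS = 40 * 10**9          # ~40 B objects (from the slide)
SOURCES_PER_OBJECT = 1000       # ~1000 sources per object (from the slide)
ATTRIBUTES_PER_SOURCE = 100     # ~100 attributes per source (from the slide)
BYTES_PER_ATTRIBUTE = 4         # ASSUMPTION: single-precision value, not from the slide

n_sources = N_OBJECTS * SOURCES_PER_OBJECT
n_values = n_sources * ATTRIBUTES_PER_SOURCE
total_pb = n_values * BYTES_PER_ATTRIBUTE / 1e15   # decimal petabytes

print(f"{n_sources:.1e} sources, {n_values:.1e} attribute values, ~{total_pb:.0f} PB")
```

Under that assumption the source table alone holds about 4 x 10^15 values, which is why the characterization is done once (L2) and the analysis afterwards (L3).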
Data mining
  Astroinformatics point of view (Borne 2009):
    Which knowledge to extract?
    How to reuse knowledge?
    How to integrate information and learning algorithms?
    Which new algorithms to develop?
    How to test the new ideas?
  VO domain
Distributing LSST data
  The baseline
    Orchestration tool
    SQL parser
    Metadata DB
    User-defined functions (geometry)
    Communication with xrootd
    MySQL backend
    Returns aggregate results
  Partitioning:
    Geometry (cone searches)
    Sources and Objects in the same node
  Limitations
    SQL-based
    Some queries can't be treated
    Ad hoc optimization
  WG1
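To make the idea of geometric partitioning concrete, here is a toy cone search over a catalog split into declination stripes. It is a sketch under simplifying assumptions, not the baseline system: the 10-degree stripe height, the in-memory dictionary standing in for nodes, and the names `stripe_of`, `build_partitions` and `cone_search` are all hypothetical.

```python
import math

STRIPE_HEIGHT_DEG = 10.0  # ASSUMPTION: partition size chosen for illustration

def stripe_of(dec_deg):
    """Partition key: index of the declination stripe containing dec_deg."""
    last = int(180.0 // STRIPE_HEIGHT_DEG) - 1
    return min(int((dec_deg + 90.0) // STRIPE_HEIGHT_DEG), last)

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def build_partitions(objects):
    """Group objects by stripe; each group stands in for one node."""
    parts = {}
    for obj in objects:
        parts.setdefault(stripe_of(obj["dec"]), []).append(obj)
    return parts

def cone_search(parts, ra, dec, radius_deg):
    """Scan only the stripes the cone can touch, then filter by separation."""
    lo = stripe_of(max(-90.0, dec - radius_deg))
    hi = stripe_of(min(90.0, dec + radius_deg))
    hits = []
    for stripe in range(lo, hi + 1):
        for obj in parts.get(stripe, []):
            if angular_sep_deg(ra, dec, obj["ra"], obj["dec"]) <= radius_deg:
                hits.append(obj["id"])
    return sorted(hits)

# Invented coordinates, in degrees.
catalog = [
    {"id": 1, "ra": 10.0, "dec": 0.5},
    {"id": 2, "ra": 10.2, "dec": -0.3},
    {"id": 3, "ra": 200.0, "dec": 45.0},
    {"id": 4, "ra": 359.9, "dec": 0.0},
]
parts = build_partitions(catalog)
near = cone_search(parts, ra=10.0, dec=0.0, radius_deg=1.0)   # [1, 2]
```

Keeping each object's sources in the same partition as the object means that a query joining the two never has to leave the node that holds the stripe.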
Which knowledge to extract?
  Classical problems in astronomy
    Object classification
    2-point (or N-point) correlations
    Discoveries?
    Anomalies (detector, software)
    Rarity detection
    Confusion problems
  Highly dimensional problems (> 1000 dimensions, > 10^10 entries)
    Rarity metric, efficient algorithms
    Dimensional reduction
    Cluster significance? (statistical/scientific)
    Efficient algorithms for compact data representation
  Measurement errors, statistical approach
    Impact usually underestimated in machine learning
  S. G. Djorgovski,
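As an example of why 2-point correlations call for efficient algorithms, the sketch below counts pairs by brute force and forms the simple "natural" estimator xi = (DD/RR) x normalisation - 1. It assumes a small flat patch of sky with Cartesian coordinates, and the function names and bin edges are illustrative. The cost grows as N^2, which is not tractable at > 10^10 entries.

```python
import math
from itertools import combinations

def pair_counts(points, bin_edges):
    """Histogram of pair separations; O(N^2) brute force."""
    counts = [0] * (len(bin_edges) - 1)
    for (x1, y1), (x2, y2) in combinations(points, 2):
        d = math.hypot(x2 - x1, y2 - y1)
        for i in range(len(counts)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts

def natural_estimator(data, randoms, bin_edges):
    """xi per bin from data-data and random-random counts; None where RR = 0."""
    dd = pair_counts(data, bin_edges)
    rr = pair_counts(randoms, bin_edges)
    nd, nr = len(data), len(randoms)
    norm = (nr * (nr - 1)) / (nd * (nd - 1))
    return [norm * d / r - 1.0 if r else None for d, r in zip(dd, rr)]

# A uniform 4x4 grid stands in for the unclustered comparison catalog.
grid = [(float(i), float(j)) for i in range(4) for j in range(4)]
edges = [0.0, 1.2, 2.4, 5.0]
xi_uniform = natural_estimator(grid, grid, edges)   # no excess clustering
```

Tree-based or binned pair counting reduces the cost in practice; the brute-force version is shown only because it makes the definition explicit.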
Some astrophysical challenges for machine learning
  Galaxy classification
    Humans are better than computers at this task: citizen science (e.g. Galaxy Zoo)
    (however: 20 B galaxies in LSST)
  Transient classification
    See Darko's talk
  Photometric redshifts
    How to invert the (galaxy type + "distance") -> u g r i z y (& morphology) relation to retrieve distance and galaxy properties?
    Spectroscopic training sample smaller by ~10^3
    Recovering hidden parameters...
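One common empirical way to approach the photometric redshift inversion is to learn it from the spectroscopic training sample. The sketch below uses k-nearest-neighbour regression on the five colors built from the six bands (u-g, g-r, r-i, i-z, z-y). This is an illustrative choice of method, not one prescribed by the slides, and the training values are invented toy numbers.

```python
import math

def knn_photoz(train_colors, train_z, query_colors, k=3):
    """Mean spectroscopic redshift of the k training galaxies nearest in color space."""
    ranked = sorted(
        (math.dist(colors, query_colors), z)
        for colors, z in zip(train_colors, train_z)
    )
    nearest = ranked[:k]
    return sum(z for _, z in nearest) / len(nearest)

# TOY DATA (invented): colors (u-g, g-r, r-i, i-z, z-y) and spectroscopic z.
train_colors = [
    (1.0, 0.5, 0.2, 0.1, 0.0),
    (1.4, 0.9, 0.4, 0.2, 0.1),
    (1.9, 1.3, 0.6, 0.3, 0.1),
    (0.8, 0.3, 0.1, 0.0, 0.0),
]
train_z = [0.2, 0.5, 0.9, 0.1]

query = (1.05, 0.55, 0.2, 0.1, 0.0)
z_k1 = knn_photoz(train_colors, train_z, query, k=1)
z_k2 = knn_photoz(train_colors, train_z, query, k=2)
```

With a training sample ~10^3 times smaller than the photometric one, regions of color space with few or no spectroscopic neighbours are where such a method is least reliable.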
Thoughts about bridging expertise
  Big Data research needs data!
    Informatics research needs reference (and documented) data sets to experiment with.
    LSST has handy precursor data (SDSS, CFHTLS, DES, HSC...)
    Simulation is mandatory to assess performance / detect biases
  Solving specific issues
    Machine-learning-aware Geo- and Astro- researchers (WG3)
    Need to select/apply existing methods to Astro- and Geo- data
    Need to find the questions where learning will provide answers
  Disentangle machine learning and Big Data mining
    Not all problems are impacted the same way by scalability
    «Classical» learning can still lead to good results.
    Bottleneck in integrating learning methods and data
    Some algorithmic approaches are specific to Big Data (1-pass algorithms, sublinear methods...)
  Matching Algorithms, Data and Issues is the key!
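As one concrete instance of a 1-pass algorithm, Welford's online update computes a mean and variance while reading each value exactly once and keeping only three numbers in memory. The class name `RunningStats` and the sample values are illustrative.

```python
class RunningStats:
    """Single-pass mean and population variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else float("nan")

stats = RunningStats()
for flux in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:   # stand-in for a data stream
    stats.update(flux)

print(stats.n, stats.mean, stats.variance)
```

The same pattern applies to any statistic that can be updated incrementally, which is what makes it usable on tables too large to hold in memory or to scan twice.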