Big Data Era Challenges and Opportunities in Astronomy: How SOM/LVQ and Related Learning Methods Can Contribute?




Prof. Pablo Estévez, Ph.D., Department of Electrical Engineering, Universidad de Chile & Millennium Institute of Astrophysics, Chile. WSOM 2016, Houston, TX, January 8, 2016.

Contents: Astronomy Context; Large Synoptic Survey Telescope; Millennium Institute of Astrophysics (MAS); Big Data; SOM/LVQ; Two Examples; Conclusions

Mirrors of the largest telescopes [figure; recoverable label: SALT]. By 2024 Chile will concentrate 70% of the global observing capacity. We have the responsibility to capitalize on this opportunity.

Large Synoptic Survey Telescope (LSST) Cerro Pachón, Chile, 2022

3 x 3 degree field of view. Covers the whole southern hemisphere in 3 days, repeatedly for 10 years. In one year it will collect more data than all previous telescopes combined (15 PB/year). LSST will produce a 3D video of the Universe. Cosmic cinematography: exploration of the time domain. Real-time data management. 100,000 transients per night.

Challenges for Chile. Access to 100% of the data, but LSST will not do the analysis. An avalanche of data. Huge challenges in computational intelligence and data analysis. A new way of doing science: data-driven science.

LSST: Big Data Challenges. Mining in real time a massive data stream of ~2 terabytes per hour for 10 years. Classifying more than 50 billion objects and following up many of these events in real time. A spectrograph to separate light into its frequency spectrum. Extracting knowledge in real time for ~2 million events per night. Discovering the unknown unknowns (serendipity): the things that we do not even know that we don't know!

Big Data: Four V's

Astronomy Data Production Comparison Credits: ALMA, Maccarena Gonzalez:

Big Data Analytics. How did the Milky Way form? High-performance computing, GPGPUs. Challenge: can data mining discover new patterns and correlations? What vs. why. Challenge: providing good visualization tools for doing science.

Pragmatic Approach. Knowing what, not why, is good enough. This is done by finding valuable correlations (including non-linear relationships). Correlation lets us analyze a phenomenon by identifying a good proxy for it. The grand challenge is the problem of inference: turning data into knowledge through models. A data-driven approach is used instead of a hypothesis-driven one.

SOM/LVQ Methods. What is the best classifier in the world? Source: Fernandez-Delgado et al., JMLR, 2014, including all the relevant classifiers available today. Comparison of 179 classifiers on 121 data sets:
Ranking  Classifier
1        Parallel Random Forest
2        Random Forest
3        SVM-C
77       LVQ
119      Supervised SOM
Why are the GLVQ* classifiers not in the list? Implementation in R or Python. Easy interface. Automatic parameter tuning.

SOM Journal Papers. 385 SOM papers published in ISI journals (2010-2015) [figure: chart of papers per year].

Semi-supervised variable star clustering. A new paradigm in ML/CI is semi-supervised learning. Unsupervised part: find structures, patterns or clusters by measuring similarities between samples. Supervised part: incorporate labels if available; this guides the unsupervised half (label propagation). It offers the possibility of detecting something novel (patterns not in the training set) while still discriminating the known classes. Example: clustering 10,000 periodic variable stars from EROS-2 (Sammon visualization). Only 10% of the data is labeled. Purple: EB, blue: CEPH, yellow: RRL, green: LPV, red: unknown.
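The label-propagation idea can be sketched on toy data. This is only an illustration under stated assumptions, not the method used on the EROS-2 light curves: the function name `propagate_labels`, the RBF width `sigma`, and the two synthetic 2-D clusters are all hypothetical, and the scheme shown is a plain graph-based propagation with clamped labels.

```python
import numpy as np

def propagate_labels(X, y, n_classes, sigma=1.0, n_iter=200):
    """Graph-based label propagation; y holds -1 for unlabeled samples."""
    # RBF affinities between all pairs of samples (no self-loops).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)  # row-normalized transition matrix

    idx = np.flatnonzero(y >= 0)          # positions of the labeled samples
    F = np.zeros((len(X), n_classes))
    F[idx, y[idx]] = 1.0
    clamp = F[idx].copy()
    for _ in range(n_iter):
        F = P @ F                         # spread label mass to neighbors
        F[idx] = clamp                    # keep the known labels fixed
    return F.argmax(axis=1)

# Toy data (assumption): two well-separated clusters, 10% labeled.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
truth = np.repeat([0, 1], 50)
y = np.full(100, -1)
y[:5] = 0
y[50:55] = 1
pred = propagate_labels(X, y, n_classes=2)
accuracy = (pred == truth).mean()
```

The sketch does not cover novelty detection; flagging samples whose propagated label mass stays low for every known class would be one possible extension.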

Active learning with a human in the loop. Active learning: the machine can query the expert for labels. In practice, far fewer labels are needed than in the supervised case. Query strategies: (1) ask for labels of the most uncertain samples (near class boundaries), (2) minimize the expected error, (3) minimize the output variance, etc. Example: AL query interface for variable star classification, which shows a pair of samples and asks the expert whether they belong to the same class.
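Query strategy (1) can be sketched as entropy-based uncertainty sampling. The helper names `entropy` and `most_uncertain` and the class probabilities below are made up for illustration; any probabilistic classifier could supply the probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain(prob_rows, n_queries):
    """Indices of the n_queries samples with the highest predictive entropy."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]), reverse=True)
    return ranked[:n_queries]

# Hypothetical classifier outputs for five unlabeled stars (3 classes).
probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
]
queries = most_uncertain(probs, 2)  # samples to send to the expert
```

Here the two flattest distributions (indices 3 and 1) are selected, which are the samples the classifier is least sure about.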

Millennium Institute of Astrophysics (MAS). Started in January 2014. Passion for the exploration of the natural world.

Millennium Institute of Astrophysics (MAS): Milky Way; Astroinformatics, Astrostatistics; Exoplanets, Transients; Supernovae.

Astronomical Time Series: Light Curves (work by "Los Pablos"). Light curve: stellar brightness (magnitude or flux) versus time. Variable stars: stars whose luminosity varies over time (3% of the stars in the universe are variable, and 1% are periodic variable stars). Light curve analysis: useful for period detection, event detection, stellar classification, extrasolar planet discovery, measuring distance from Earth, etc.

An Example of a Light Curve

Variable stars: eclipsing binary stars; pulsating stars.

Folded Light Curves. The transformation t mod T, where T is the period, plots successive cycles atop one another. Usually all periods within a range are tried (a sweep) to find the one that maximizes a criterion. [Figures: folding a light curve; estimating the period.]
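Folding and the period sweep can be sketched as follows. The criterion here is a simple smoothness measure of the folded curve (sum of squared jumps between phase-adjacent points, in the spirit of string-length methods); it is swapped in for illustration and is not the correntropy-based criterion of the talk. The names `fold`, `roughness`, `sweep` and the synthetic sinusoid are assumptions.

```python
import math
import random

def fold(times, period):
    """Map each observation time to a phase in [0, 1) via t mod T."""
    return [(t % period) / period for t in times]

def roughness(times, mags, period):
    """Sum of squared magnitude jumps between phase-adjacent points."""
    order = sorted(range(len(times)), key=lambda i: times[i] % period)
    return sum((mags[order[k + 1]] - mags[order[k]]) ** 2
               for k in range(len(order) - 1))

def sweep(times, mags, trial_periods):
    """Return the trial period whose folded light curve is smoothest."""
    return min(trial_periods, key=lambda p: roughness(times, mags, p))

# Synthetic, irregularly sampled sinusoid with a little noise (assumption).
random.seed(1)
true_period = 2.5
times = sorted(random.uniform(0.0, 100.0) for _ in range(300))
mags = [math.sin(2.0 * math.pi * t / true_period) + random.gauss(0.0, 0.05)
        for t in times]
trials = [1.0 + 0.01 * k for k in range(400)]  # 1.00 to 4.99
best = sweep(times, mags, trials)
phases = fold(times, best)
```

The trial grid deliberately stops below twice the true period, since integer multiples of the period also fold a periodic signal cleanly.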

Automated period detection. Correntropy (generalized correlation) is used to compute similarities between samples. It goes beyond second-order statistics by taking higher-order moments into account, and it is robust to outliers and noise. Spectral decomposition of correntropy using advanced signal processing techniques: Gaussian basis functions are used instead of sinusoids, going beyond the Fourier representation to obtain super-resolution and more localized, sparser spectra.
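A minimal sketch of the sample autocorrentropy with a Gaussian kernel, V[m] = mean over n of G_sigma(x[n] - x[n-m]), may help make the robustness claim concrete. It assumes an evenly sampled series, which real light curves are not (the references use a slotted variant for irregular sampling); the kernel width `sigma`, the test signal, and the injected outliers are illustrative choices.

```python
import math

def gaussian_kernel(u, sigma):
    """Gaussian kernel G_sigma(u)."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def autocorrentropy(x, max_lag, sigma):
    """Sample autocorrentropy of an evenly sampled series for lags 0..max_lag."""
    n = len(x)
    return [sum(gaussian_kernel(x[i] - x[i - m], sigma) for i in range(m, n)) / (n - m)
            for m in range(max_lag + 1)]

# Sinusoid with a period of 20 samples, corrupted by three large outliers.
period = 20
x = [math.sin(2.0 * math.pi * i / period) for i in range(200)]
for i in (37, 90, 151):
    x[i] = 50.0
v = autocorrentropy(x, max_lag=30, sigma=0.2)
peak_lag = max(range(10, 31), key=lambda m: v[m])  # strongest similarity away from lag 0
```

Because the kernel saturates for large differences, each outlier only removes a few terms from the average instead of dominating it, so the peak still falls at the true period.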

Example

EROS-2 Survey. Survey of the Magellanic Clouds and the Galactic bulge. Data taken from the ESO Observatory in La Silla, Chile. 38.2 million light curves with two channels each (blue and red). The EROS dataset was processed automatically in 18 hours using a GPGPU cluster. We found 120,000 periodic variables. Near future (within MAS): be able to process a billion light curves per day.

DEMO Period_finding_demo_Python

Real-time Transient Detection Pipeline (Pancho's work). Quest to complete the luminosity-time diagram for low luminosities and short cadences. Discovery of new transient phenomena. New instruments like DECam and LSST will allow us to detect, for example, the explosion of a supernova in real time. A custom real-time transient pipeline has been developed.

High Cadence Transient Survey (F. Forster et al.). Dark Energy Camera (DECam). HiTS scientific objective: find evidence of shock breakouts (SBO). SBO: an event that occurs moments after the explosion of a supernova. Supernova: an explosion at the end of the life cycle of a massive star. 1. Formation of neutron star (~secs); 2. Shock emergence (~hrs); 3. Glowing ejecta (~days/weeks); 4. Remnant diffusion (~kyrs). Figures: nasa.gov

HiTS Image Reduction Pipeline: Data Capture -> Preprocessing -> Alignment -> PSF Matching -> Subtraction -> Candidate Selection -> Candidate Filtering -> Visual Inspection. At this point, candidates are dominated by artifacts (1:10K). ML to find the needles in the haystack.

Non-negative Matrix Factorization

Feature Extraction using NMF. In NMF, we aim to decompose V into factors W and H by solving a constrained minimization (in its standard, unregularized form, $\min_{W \ge 0,\, H \ge 0} \|V - WH\|_F^2$), where the non-negativity constraints are element-wise. Non-negativity: only additive combinations are allowed, which gives sparse and part-based decompositions. The NMF problem is non-convex in W and H jointly (ill-posed). Regularization can alleviate this.
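A minimal sketch of NMF using the multiplicative update rules of Lee and Seung for the unregularized squared-error objective. The slide's exact objective (including the regularization term) is not recoverable from the transcript, so the plain Frobenius objective, the function name `nmf`, and the synthetic rank-3 matrix are assumptions.

```python
import numpy as np

def nmf(V, k, n_iter=1000, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (n x m) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic non-negative matrix of exact rank 3 (illustrative data only).
rng = np.random.default_rng(42)
V = rng.random((30, 3)) @ rng.random((3, 12))
W, H = nmf(V, k=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Because the updates only multiply by non-negative ratios, no projection step is needed; the price is that the result depends on the random initialization, which is one face of the non-convexity noted above.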

Principal Component Analysis Lee and Seung, Nature 1999

Non-negative Matrix Factorization Lee and Seung, Nature 1999

Cosmic rays and noise. [Figure: example image stamps; panels: Current, Reference, Difference.]

Stars! [Figure: example image stamps; panels: Current, Reference, Difference.]

2D Visualization of the astronomical images using Self-Organizing Maps (SOM). The SOM is an unsupervised neural network used for visualization... The database contains ~1000 images of 21x21 pixels of objects labeled as variable by the CMM pipeline. We use non-negative matrix factorization (NMF) to capture the different behaviors in the stamps and reduce dimensionality (441 -> 16). We train the SOM with the NMF coefficients and obtain a U-matrix visualization.
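The SOM-plus-U-matrix step can be sketched with a small online SOM. Everything specific here is an assumption for illustration: the 6x6 map size, the linear decay schedules, the random 16-dimensional vectors standing in for NMF coefficients, and the helper names `train_som` and `u_matrix`.

```python
import numpy as np

def train_som(data, rows, cols, n_epochs=20, lr0=0.5, sigma0=None, seed=0):
    """Online SOM training on a rectangular grid; returns the weight cube."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every unit, used by the neighborhood function.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    sigma0 = sigma0 or max(rows, cols) / 2.0
    n_steps, step = n_epochs * len(data), 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                      # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5 * frac   # neighborhood shrinks to 0.5
            dist = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(dist.argmin(), dist.shape)  # best-matching unit
            g2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
            h = np.exp(-g2 / (2.0 * sigma ** 2))
            weights += lr * h[:, :, None] * (x - weights)
            step += 1
    return weights

def u_matrix(weights):
    """Mean distance from each unit's weights to its 4-connected grid neighbors."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            neigh = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= r + dr < rows and 0 <= c + dc < cols]
            u[r, c] = np.mean([np.linalg.norm(weights[r, c] - weights[i, j])
                               for i, j in neigh])
    return u

# Stand-in for 16 NMF coefficients per image stamp (random data, assumption).
data = np.random.default_rng(1).random((200, 16))
weights = train_som(data, rows=6, cols=6)
U = u_matrix(weights)
```

High values in `U` mark units whose neighbors are far away in feature space, which is how cluster borders show up in a U-matrix plot.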

The whole picture

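Proposed method: offline training phase. [Figure: block diagram. Training data of size N x 3 x 441 goes through slicing and scaling into N x 441 image matrices (Im1, Im2, Ims); NMF factorizes each into an N x K matrix and a K x 441 matrix (factors W1/H1, W2/H2, Ws/Hs), which are used to train an RF model.] Training set: 500,000 labeled HiTS candidates. NMF parameters: K and Lambda.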

Results: Classification Performance. NMF outperforms PCA and the raw model.
Panel            FPR    FNR (NMF)  FNR (PCA)  FNR (raw)
(a) SNR>6        0.1%   4%         10%        15%
(b) 5<SNR<6      0.1%   15%        30%        40%
NMF is less affected when SNR decreases.
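For reference, the two error rates quoted here can be computed from classifier scores as below. The scores, labels, and threshold are made-up illustrative values, and `error_rates` is a hypothetical helper, not part of the HiTS pipeline.

```python
def error_rates(scores, labels, threshold):
    """FPR and FNR when candidates with score >= threshold are called real."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    return fp / n_neg, fn / n_pos

# Label 1 = real transient, label 0 = artifact (toy values).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
fpr, fnr = error_rates(scores, labels, threshold=0.5)
```

Raising the threshold lowers the FPR at the cost of a higher FNR; sweeping it traces out a detection error tradeoff curve.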

Traditional Pattern Recognition Model

Inspired by the movie Inception: "Pancho, we need to go deeper."

Deep Learning. For many years shallow neural network architectures were used: the classical multilayer perceptron trained by error backpropagation (gradient descent). Problems for deeper architectures: vanishing gradients, the number of examples, and computational time.
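The vanishing-gradient problem can be illustrated with a deliberately simplified chain of one-unit sigmoid layers. The unit weights, the input value, and the helper `input_gradient` are assumptions chosen only to make the effect visible; real networks have many units and learned weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, x=0.5, weight=1.0):
    """d(output)/d(input) for a chain of `depth` one-unit sigmoid layers."""
    grad, a = 1.0, x
    for _ in range(depth):
        a = sigmoid(weight * a)
        grad *= weight * a * (1.0 - a)  # chain rule: sigmoid'(z) = a * (1 - a)
    return grad

grads = {d: input_gradient(d) for d in (1, 5, 10, 20)}
```

Since the sigmoid derivative never exceeds 0.25, each extra layer multiplies the gradient by at most a quarter in this setup, so the signal reaching the first layers shrinks geometrically with depth.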

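Convolutional Neural Nets Applied to HiTS. Guillermo Cabrera, Ignacio Reyes, Pablo Estevez, Francisco Förster, Juan-Carlos Maureira. [Figure: network architecture alternating convolutional and pooling layers. Labels in the figure: 4 input images of 21x21; 5x5 and 3x3 filters; 3x3 and 2x2 pooling; feature maps of 17x17, 6x6, 4x4 and 2x2; 20 and 50 feature maps; two layers of 200 units.]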
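The numbers attached to the architecture figure are consistent with one particular reading, sketched below as an assumption rather than a confirmed specification: a 21x21 input, a 5x5 convolution, 3x3 pooling, a 3x3 convolution, and 2x2 pooling, with 50 feature maps in the last stage. The helpers `conv_out` and `pool_out` are hypothetical and assume stride-1 "valid" convolutions and pooling that keeps a partial border window.

```python
import math

def conv_out(size, kernel):
    """Output width of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, window):
    """Output width of non-overlapping pooling that keeps a partial border."""
    return math.ceil(size / window)

size = 21  # stamp width in pixels
shapes = []
for kind, k in [("conv", 5), ("pool", 3), ("conv", 3), ("pool", 2)]:
    size = conv_out(size, k) if kind == "conv" else pool_out(size, k)
    shapes.append(size)

# Flattened feature count if the last stage has 50 feature maps (assumption).
flat_features = 50 * shapes[-1] ** 2
```

Under this reading the spatial sizes come out as 17, 6, 4 and 2, and the flattened output has 200 features, matching the sizes printed in the figure.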

ConvNet Applied to HiTS. [Figure: detection error tradeoff curve.] Data set of 455,393 samples: 350,393 training set, 5,000 validation set, 100,000 test set. It takes ~12 min using Theano on a Tesla K20 GPU (graphics processing unit).

Movies! http://www.das.uchile.cl/~fforster/atel/summary_das.html


Movies! 20 young supernovae were detected in real time, plus another 70 supernovae, in 6 nights of observation with DECam.

Conclusions. Big Data is here to stay. Deep learning is a computational intelligence/machine learning technique that allows us to extract features automatically. Multidisciplinary teams are important. To-do list: larger and more heterogeneous sets, combined features (NMF, PCA, DL, engineered features), active learning.

References
P. Huijse, P. Estevez, P. Protopapas, J. C. Principe, P. Zegers, "Computational Intelligence Challenges and Applications on Large-Scale Astronomical Time Series Databases," IEEE Computational Intelligence Magazine, August 2014.
P. Protopapas, P. Huijse, P. Estevez, P. Zegers, J. C. Principe, J. B. Marquette, "A novel, fully automated pipeline for period estimation in the EROS 2 data set," Astrophysical Journal Supplement Series, 216:25, February 2015.
P. Huijse, P. Estevez, P. Protopapas, P. Zegers, J. Principe, "An Information Theoretic Algorithm for Finding Periodicities in Stellar Light Curves," IEEE Transactions on Signal Processing, Vol. 60, no. 10, pp. 5135-5145, 2012.
P. Huijse, P. Estevez, P. Zegers, P. Protopapas, J. Principe, "Period Estimation in Astronomical Time Series using Slotted Correntropy," IEEE Signal Processing Letters, Vol. 18, no. 6, pp. 371-374, 2011.