Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition
Michael A. Schuh (1), Rafal A. Angryk (2)
(1) Montana State University, Bozeman, MT
(2) Georgia State University, Atlanta, GA
Introduction: Big Data in Solar Physics
- SDO produces ~70,000 high-definition images of the Sun per day (~1.5 TB of 4K-resolution data per day)
- Traditional manual (human-in-the-loop) analyses are infeasible at this scale
- Future missions are only getting larger (DKIST, ...)
- Content-based image retrieval (CBIR): similarity search and indexing
Overview
- Goal: create large-scale solar image datasets for further scientific inquiry
- Approach: combine generalized image parameters with automated event labels
- Today's talk:
  - Domain-specific data analysis and validation
  - Dataset curation and dissemination
  - Preliminary feature recognition results
  - Future directions
Background: Solar Dynamics Observatory (SDO)
- Part of NASA's Living With a Star (LWS) program
- AIA instrument: 10 wavelength channels from 94 Å to 4500 Å
- Dedicated downlink with near-real-time data stream processing and lossless archival to tape
Background: Feature Finding Team (FFT)
- International consortium of 16 automated feature (event) recognition modules
- Direct access to the SDO pipeline
Background: FFT Computer Science Trainable Module
- Developed by the Montana State University Data Mining Lab
- Ten generalized image parameters extracted from each cell of a 64 x 64 grid (see the sketch below):
  - 4,096 cells per image
  - 64 x 64 pixels per cell
- Applied to all AIA channels
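A minimal sketch of the grid-based extraction, assuming a 4096 x 4096 full-disk image; a few simple statistics stand in for the module's actual ten texture parameters (mean, standard deviation, entropy, etc.):

```python
import numpy as np

GRID = 64   # 64 x 64 grid of cells per image
CELL = 64   # 64 x 64 pixels per cell (so a 4096 x 4096 image)

def cell_parameters(image):
    """Compute per-cell parameters over a GRID x GRID tiling.
    Returns an array of shape (GRID, GRID, n_params)."""
    assert image.shape == (GRID * CELL, GRID * CELL)
    # Regroup pixels so each (row, col) cell's pixels are contiguous.
    cells = (image.reshape(GRID, CELL, GRID, CELL)
                  .transpose(0, 2, 1, 3)
                  .reshape(GRID, GRID, CELL * CELL))
    # Stand-in statistics; the real module computes ten texture parameters.
    return np.stack([cells.mean(axis=2),
                     cells.std(axis=2),
                     cells.min(axis=2),
                     cells.max(axis=2)], axis=2)

params = cell_parameters(np.random.rand(4096, 4096))   # shape (64, 64, 4)
```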
The Data
- Combining raster images (MSU Trainable Module parameter data) with vector event objects (seven FFT event-specific modules) to create labeled feature (parameter) vectors
The Data: Overview of the Process
- Steps: collection, validation, transformation and labeling, dataset creation
- Example data range: January 2012 (one month); currently processing 2012 to present
- Highlighting the 5 V's of Big Data: Volume, Velocity, Veracity, Variety, Value
Collection
- MSU FFT module: 6-minute cadence
  - ~240 images per day, per wave (x 10 waves)
  - ~9 million image cells per day, ~3.25 billion per year (see the arithmetic below)
- Stored at MSU, accessible through a web API (soon)
- FFT event modules: reported to and retrieved from the Heliophysics Event Knowledgebase (HEK)
  - Reporting variability depends on event type: cadence, counts, durations, etc.
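A quick back-of-the-envelope check of these volume figures; the slide's per-year count is below the full-uptime bound, presumably reflecting the data gaps discussed under Validation:

```python
# Volume implied by the 6-minute cadence over 10 AIA waves.
images_per_day_per_wave = 24 * 60 // 6            # 240
cells_per_image = 64 * 64                         # 4,096
waves = 10

cells_per_day = images_per_day_per_wave * waves * cells_per_image
print(f"{cells_per_day:,} cells/day")             # 9,830,400 (~9-10 million)
print(f"{cells_per_day * 365:,} cells/year")      # ~3.6 billion at full uptime
```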
Validation
- Ensuring the quality of raw data products
- Sanity checks for human maintainers
- Crucial before further work (trusting downstream results)
- Pipeline checks (see the sketch below):
  - Large-scale, scalable analyses
  - Expected cadences
  - Alerting on data gaps and possible outages
  - Sensible event counts and parameter values
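A minimal sketch of one such pipeline check, flagging gaps between consecutive parameter-file timestamps; the one-minute tolerance is a hypothetical threshold, not a value from the actual pipeline:

```python
from datetime import timedelta

EXPECTED = timedelta(minutes=6)    # MSU FFT image parameter cadence
TOLERANCE = timedelta(minutes=1)   # hypothetical slack before alerting

def find_gaps(timestamps):
    """Yield (previous, current, gap) for consecutive timestamps whose
    spacing exceeds the expected cadence plus tolerance."""
    for prev, curr in zip(timestamps, timestamps[1:]):
        gap = curr - prev
        if gap > EXPECTED + TOLERANCE:
            yield prev, curr, gap

# Usage: `timestamps` is a sorted list of datetimes for one AIA channel.
# for prev, curr, gap in find_gaps(timestamps):
#     print(f"data gap of {gap} between {prev} and {curr}")
```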
Validation (cont.)
[Figure: time difference (in minutes) between image parameter files for each AIA channel.]
Validation (cont.)
[Figure: time difference (in hours) between reports to the HEK for each event type.]
Validation (cont.)
[Figure: number of event reports for each unique timestamp, for all event types.]
Validation (cont.)
[Figure: 3-statistic (min, avg, max) of parameter P2 (mean) over all AIA channels.]
Data Transformation
Two considerations: removing the need for solar expertise and choosing appropriate labels.
- Spatial: coordinate conversions
  - Helioprojective Cartesian (HPC) to pixel space, which requires solar metadata (see the sketch below)
  - Bounding boxes vs. chain codes
- Temporal: timestamp locations
  - Events have a start and end time (duration)
  - Instantaneous vs. continuous labeling
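A minimal sketch of the HPC-to-pixel conversion, assuming a linear, unrotated WCS described by standard FITS keywords from the image metadata; a production pipeline would use a full WCS library and handle rotation and 1- vs. 0-based indexing:

```python
def hpc_to_pixel(x_arcsec, y_arcsec, meta):
    """Convert helioprojective-Cartesian coordinates (arcsec) to pixel
    coordinates using linear FITS WCS keywords."""
    px = meta["CRPIX1"] + (x_arcsec - meta["CRVAL1"]) / meta["CDELT1"]
    py = meta["CRPIX2"] + (y_arcsec - meta["CRVAL2"]) / meta["CDELT2"]
    return px, py

# Illustrative AIA-like metadata (placeholder values, not a real header).
meta = {"CRPIX1": 2048.5, "CRVAL1": 0.0, "CDELT1": 0.6,   # 0.6 arcsec/pixel
        "CRPIX2": 2048.5, "CRVAL2": 0.0, "CDELT2": 0.6}
print(hpc_to_pixel(-150.0, 320.0, meta))
```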
Dataset Creation
- Goal: event-specific benchmarking, as clean as possible for a given event type
- Instance generation: apply an event's label to every image cell it covers (see the sketch below)
  - Instance: <MSU parameters, event label>
- Considerations:
  - Cells vs. regions
  - Multi-class labeling
  - Unlabeled ("quiet") areas
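A minimal bounding-box sketch of instance generation; chain-code labels would instead test each cell against the event polygon:

```python
import math

GRID = 64   # cells per image side
CELL = 64   # pixels per cell side

def cells_covered(bbox):
    """Yield (row, col) indices of grid cells overlapped by a
    pixel-space bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    for row in range(max(int(y1) // CELL, 0), min(math.ceil(y2 / CELL), GRID)):
        for col in range(max(int(x1) // CELL, 0), min(math.ceil(x2 / CELL), GRID)):
            yield row, col

def make_instances(params, events):
    """Pair each covered cell's parameter vector with its event label.
    params: (GRID, GRID, n_params) array for one image;
    events: list of (bbox, label) pairs at the same timestamp."""
    for bbox, label in events:
        for row, col in cells_covered(bbox):
            yield params[row, col], label
```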
Case Study
- Two event types from the SPoCA module: Active Regions (AR) and Coronal Holes (CH)
- Labels:
  - Instantaneous time (smallest, but cleanest set)
  - Bounding boxes vs. chain codes
- Instances:
  - Cells: region-based cells
  - Regions: 3-stat x 10 params over all covered cells (see the sketch below)
- Basic classification evaluation:
  - Comparing labeling methods
  - Validating a proof of concept for recognition
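A minimal sketch of the region-level summary: the 3-statistic (min, avg, max) of each of the ten parameters over a region's cells yields a 30-dimensional feature vector:

```python
import numpy as np

def region_feature_vector(cell_params):
    """Collapse a region's cells into one 3-stat summary per parameter.
    cell_params: (n_cells, 10) array of the ten image parameters for
    every cell the region covers; returns a 30-dimensional vector."""
    return np.concatenate([cell_params.min(axis=0),
                           cell_params.mean(axis=0),
                           cell_params.max(axis=0)])
```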
Case Study: Preliminary Machine Learning Benchmarks
- Algorithms: Naive Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forests (RF) (see the sketch below)
- Labeling methods: Bounding Box (BB), Chain Code (CC)
- Instances: region-based cells (R-cells), region feature vectors (R-fvs)
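A minimal benchmarking loop over the five algorithms using scikit-learn defaults; the synthetic data is a placeholder standing in for the labeled instances (e.g., 30-dimensional R-fvs), and the actual experiments' hyperparameters and evaluation protocol are not specified here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "NB":  GaussianNB(),
    "DT":  DecisionTreeClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "RF":  RandomForestClassifier(),
}

# Placeholder data standing in for labeled instances (e.g., AR vs. CH R-fvs).
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```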
Conclusions and Future Work
- A foundational starting place for data use: validations and expectations
- Dissemination of dataset(s): http://cbsir.cs.montana.edu/solar
  - Each individual event type, monthly and cumulatively
  - Stats, charts, benchmarks, etc.
- Advanced uses:
  - Spatiotemporal classification
  - Event tracking and continuous labeling
  - Content-based image retrieval with region-based querying
  - Many, many more...
Questions? Thank you.
Michael A. Schuh
michael.schuh@cs.montana.edu