Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition

Michael A. Schuh (Montana State University, Bozeman, MT)
Rafal A. Angryk (Georgia State University, Atlanta, GA)

Introduction
- Big Data in Solar Physics (SDO)
  - SDO: ~70,000 high-definition images of the Sun per day (1.5 TB of 4K-resolution data per day)
  - Traditional manual (human-in-the-loop) analyses cannot feasibly cover this scale
  - Future missions will only get larger (DKIST, ...)
- Content-based Image Retrieval (CBIR)
  - Similarity search and indexing

Overview
- Goal: creating large-scale solar image datasets for further scientific inquiry
  - Combining generalized image parameters and automated event labels
- Today's talk:
  - Domain-specific data analysis and validation
  - Dataset curation and dissemination
  - Preliminary feature recognition results
  - Future directions

Background: Solar Dynamics Observatory (SDO)
- NASA's Living with a Star (LWS) program
- AIA instrument: 10 wavelength channels from 94 Å to 4500 Å
- Dedicated downlink with near-real-time data stream processing and lossless archival to tape

Background: Feature Finding Team (FFT)
- International consortium of 16 automated feature (event) recognition modules
- Direct access to the SDO pipeline

Background: FFT Computer Science Trainable Module
- Montana State University Data Mining Lab
- Ten generalized image parameters extracted from each cell over a 64 x 64 grid
  - 4,096 cells per image
  - 64 x 64 pixels per cell
  - All AIA channels
[Figure: example]
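The grid-based extraction described on this slide can be sketched as follows. This is a minimal illustration, not the Trainable Module's code: the function name `cell_parameters` is hypothetical, and only two statistics are computed per cell. The mean corresponds to what a later slide calls P2; the standard deviation is an assumption chosen for illustration, since the slides do not list the ten parameters.

```python
from statistics import mean, pstdev

def cell_parameters(image, grid=64):
    """Split a square image (a list of pixel rows) into grid x grid cells.

    Returns a dict mapping (row, col) -> (mean, std) of the cell's pixels.
    Hypothetical helper: the real module extracts ten parameters per cell.
    """
    size = len(image)
    if size % grid != 0:
        raise ValueError("image size must be divisible by the grid size")
    cell = size // grid  # pixels per cell side (64 for a 4096-pixel image)
    params = {}
    for r in range(grid):
        for c in range(grid):
            pixels = [
                image[y][x]
                for y in range(r * cell, (r + 1) * cell)
                for x in range(c * cell, (c + 1) * cell)
            ]
            params[(r, c)] = (mean(pixels), pstdev(pixels))
    return params

# Tiny synthetic demonstration: an 8 x 8 image on a 4 x 4 grid (2 x 2 pixels per cell).
demo = [[x + 8 * y for x in range(8)] for y in range(8)]
demo_params = cell_parameters(demo, grid=4)
```

For a full 4096 x 4096 image with `grid=64`, the same loop would yield the 4,096 cells per image quoted above; a production version would more likely use an array library than nested Python lists.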

The Data
- Combining raster images with vector objects to create labeled feature (parameter) vectors
- Seven FFT event-specific modules + MSU Trainable Module parameter data

The Data: Overview of the process
- Collection
- Validation
- Transformation and labeling
- Dataset creation
- Example data range: Jan 2012 (one month)
  - Currently processing 2012 to present
- Highlighting the 5 V's of Big Data: Volume, Velocity, Veracity, Variety, Value

Collection
- MSU FFT
  - 6-minute cadence: ~240 images per day, per wavelength (x10)
  - ~9 million image cells per day (~3.25 billion per year)
  - Stored at MSU, accessible through a web API (soon)
- FFT Event Modules
  - Reported to and retrieved from the Heliophysics Event Knowledgebase (HEK)
  - Reporting varies by event type: cadence, counts, durations, etc.
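The collection volume can be checked with a few lines of arithmetic. This is a sketch: the function name and defaults are illustrative, with the cadence, channel count, and grid size taken from the slides. With all ten channels the product comes out somewhat above the rounded ~9 million cells per day quoted above; the slides do not say how that figure was rounded, so the numbers are best read as order-of-magnitude estimates.

```python
MINUTES_PER_DAY = 24 * 60

def cells_per_day(cadence_minutes=6, channels=10, grid=64):
    """Estimate image cells produced per day (illustrative helper)."""
    images_per_channel = MINUTES_PER_DAY // cadence_minutes  # 240 at a 6-minute cadence
    return images_per_channel * channels * grid * grid

daily = cells_per_day()
yearly = daily * 365
print(f"{daily:,} cells per day, {yearly:,} cells per year")
```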

Validation
- Ensuring quality of raw data products
  - Sanity checks for human maintainers
  - Crucial before further work (trusting results)
- Pipeline checks
  - Large-scale, scalable analyses
  - Expected cadences
  - Alerting on data gaps and possible outages
  - Sensible event counts and parameter values
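One way such a cadence check might look is sketched below. The function name `find_gaps`, the one-minute tolerance, and the synthetic timestamps are all assumptions for illustration; the slides do not describe how the actual pipeline checks are implemented.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(minutes=6), tolerance=timedelta(minutes=1)):
    """Return (earlier, later, delta) for consecutive files spaced wider than expected."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        delta = later - earlier
        if delta > expected + tolerance:
            gaps.append((earlier, later, delta))
    return gaps

# Synthetic timestamps at a 6-minute cadence with two files deliberately missing.
start = datetime(2012, 1, 1)
stamps = [start + timedelta(minutes=6 * i) for i in range(10) if i not in (4, 5)]
gaps = find_gaps(stamps)
```

Running the same check per AIA channel and per event type would reproduce the kind of time-difference summaries the validation plots describe.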

Validation (cont.)
[Figure: Time difference (in minutes) between image parameter files for each AIA channel.]

Validation (cont.)
[Figure: Time difference (in hours) between reports to the HEK for each event type.]

Validation (cont.)
[Figure: Number of event reports for each unique timestamp for all event types.]

Validation (cont.)
[Figure: Three statistics (min, avg, max) of parameter P2 (mean) over all AIA channels.]

Data Transformation
- Two considerations
  - Removing the need for solar expertise
  - Choosing appropriate labels
- Spatial: coordinate conversions
  - Helioprojective Cartesian (HPC) to pixel space (requires solar metadata)
  - Bounding boxes vs. chain codes
- Temporal: timestamp locations
  - Events have a start and end time (duration)
  - Instantaneous vs. continuous labeling
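The spatial conversion can be sketched with the usual linear FITS-style mapping, pixel = reference pixel + (coordinate - reference value) / scale. This is a simplified illustration, not the authors' code: it ignores image rotation, the parameter names follow common FITS header keywords (CRPIX, CDELT, CRVAL), and the metadata values in the example are hypothetical rather than read from a real header.

```python
def hpc_to_pixel(x_arcsec, y_arcsec, crpix1, crpix2, cdelt1, cdelt2,
                 crval1=0.0, crval2=0.0):
    """Map helioprojective coordinates (arcsec) to pixel coordinates.

    Linear approximation only: no rotation or distortion terms are applied.
    """
    px = crpix1 + (x_arcsec - crval1) / cdelt1
    py = crpix2 + (y_arcsec - crval2) / cdelt2
    return px, py

def bounding_box(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of boundary points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

# Hypothetical metadata: disk center at pixel (2048.5, 2048.5), 0.5 arcsec per pixel.
meta = dict(crpix1=2048.5, crpix2=2048.5, cdelt1=0.5, cdelt2=0.5)
boundary_hpc = [(600.0, -300.0), (650.0, -300.0), (650.0, -250.0), (600.0, -250.0)]
boundary_px = [hpc_to_pixel(x, y, **meta) for x, y in boundary_hpc]
box_px = bounding_box(boundary_px)
```

A chain-code label would keep the full list of boundary points, while a bounding-box label keeps only the four extremes, which is the trade-off between the two labeling methods.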

Dataset Creation
- Goal: event-specific benchmarking
  - As clean as possible for a given event type
- Instance generation
  - Apply an event label to all image cells it covers
  - Instance: <MSU parameters, event label>
- Considerations
  - Cells vs. regions
  - Multi-class labeling
  - Unlabeled ("quiet") areas
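Instance generation from a bounding-box label might look like the sketch below. The helper names and the dictionary layout of `cell_params` are hypothetical; the 64-pixel cell size and 64 x 64 grid come from the earlier slides, and the box is assumed to already be in pixel coordinates.

```python
def cells_in_box(x_min, y_min, x_max, y_max, cell_size=64, grid=64):
    """Return the (row, col) indices of every grid cell a pixel-space box touches."""
    c0 = max(0, int(x_min // cell_size))
    r0 = max(0, int(y_min // cell_size))
    c1 = min(grid - 1, int(x_max // cell_size))
    r1 = min(grid - 1, int(y_max // cell_size))
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def make_instances(cell_params, box, label):
    """Pair each covered cell's parameter vector with the event label."""
    return [(cell_params[rc], label) for rc in cells_in_box(*box) if rc in cell_params]

# Synthetic example: placeholder parameter vectors for a handful of cells.
cell_params = {(r, c): [float(r), float(c)] for r in range(4) for c in range(4)}
instances = make_instances(cell_params, (100, 60, 130, 70), "AR")
```

Cells that no event box covers would form the unlabeled ("quiet") pool, and a cell covered by boxes of several event types is where the multi-class consideration arises.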

Case Study
- Two event types from the SPoCA module
  - Active Regions (AR)
  - Coronal Holes (CH)
- Labels
  - Instantaneous time (smallest, but cleanest set)
  - Bounding boxes
  - Chain codes
- Cells
  - Region-based cells
- Regions
  - Regions: 3-stat x 10 params (over all cells)
- Basic classification evaluation
  - Comparing labeling methods
  - Validating proof of concept for recognition
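The region-level representation (three statistics over ten parameters) can be sketched as a 30-value vector built from the cells a region covers. The function name and the ordering of the output (min, avg, max for each parameter in turn) are assumptions; the slides only state that the three statistics are taken over all cells.

```python
def region_feature_vector(cell_vectors):
    """Collapse per-cell parameter vectors into (min, avg, max) per parameter."""
    if not cell_vectors:
        raise ValueError("a region must cover at least one cell")
    n_params = len(cell_vectors[0])
    features = []
    for p in range(n_params):
        values = [cell[p] for cell in cell_vectors]
        features.extend([min(values), sum(values) / len(values), max(values)])
    return features

# Synthetic region of three cells with ten placeholder parameters each.
cells = [[float(i + j) for j in range(10)] for i in range(3)]
region_fv = region_feature_vector(cells)
```

Each region then contributes one instance regardless of its size, whereas the region-based cells approach contributes one instance per covered cell.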

Case Study: Preliminary Machine Learning Benchmarks
- Algorithms: Naive Bayes (NB), Decision Tree (DT), Support Vector Machine (SVM), K-nearest neighbor (KNN), Random Forests (RF)
- Methods: Bounding Box (BB), Chain Code (CC)
- Instances: region-based cells (R-cells), region feature vectors (R-fvs)
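To make the evaluation setup concrete, here is a minimal K-nearest-neighbor classifier with an accuracy measure, written with the standard library only. It is a toy sketch, not the benchmark code: the slides do not say which implementations or settings were used, and the two-dimensional feature vectors below are made up purely to exercise the functions.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training instances."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k=3):
    """Fraction of test instances whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)

# Made-up instances of the form (feature vector, event label).
train = [((0.9, 0.8), "AR"), ((0.8, 0.9), "AR"), ((1.0, 0.7), "AR"),
         ((0.1, 0.2), "CH"), ((0.2, 0.1), "CH"), ((0.0, 0.3), "CH")]
test = [((0.85, 0.85), "AR"), ((0.15, 0.15), "CH")]
score = accuracy(train, test)
```

The same `accuracy` loop could be run once per labeling method (BB or CC) and per instance type (R-cells or R-fvs) to compare them on equal terms.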

Conclusions and Future Work
- Foundational starting place for data use
  - Validations and expectations
- Dissemination of dataset(s)
  - Each individual event type
  - Monthly and cumulatively
  - Stats, charts, benchmarks, etc.
- Advanced uses
  - Spatiotemporal classification
  - Event tracking and continuous labeling
  - Content-based image retrieval with region-based querying
  - Many more...
- http://cbsir.cs.montana.edu/solar

Questions? Thank you. Michael A. Schuh michael.schuh@cs.montana.edu