Conquering the Astronomical Data Flood through Machine



Similar documents
Learning from Big Data in

The Tonnabytes Big Data Challenge: Transforming Science and Education. Kirk Borne George Mason University

How To Teach Data Science

Computational Science and Informatics (Data Science) Programs at GMU

Data analysis of L2-L3 products

LSST and the Cloud: Astro Collaboration in 2016 Tim Axelrod LSST Data Management Scientist

Big Data a threat or a chance?

Introduction to Data Mining

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

College of Science George Mason University Fairfax, VA 22030

The Scientific Data Mining Process

Big Data Processing and Analytics for Mouse Embryo Images

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Big Data with Rough Set Using Map- Reduce

Introduction. A. Bellaachia Page: 1

Introduction to Data Mining

DATA MANAGEMENT FOR THE INTERNET OF THINGS

Concept and Project Objectives

Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC

Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research

Information Management course

Data Isn't Everything

Galaxy Morphological Classification

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

SEYMOUR SLOAN IDEAS THAT MATTER

Chapter ML:XI. XI. Cluster Analysis

COMP9321 Web Application Engineering

Connecting library content using data mining and text analytics on structured and unstructured data

locuz.com Big Data Services

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

Detecting Anomalous Behavior with the Business Data Lake. Reference Architecture and Enterprise Approaches.

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

The Big Data methodology in computer vision systems

Fight fire with fire when protecting sensitive data

Azure Machine Learning, SQL Data Mining and R

Data Driven Discovery In the Social, Behavioral, and Economic Sciences

Journal of Chemical and Pharmaceutical Research, 2015, 7(3): Research Article. E-commerce recommendation system on cloud computing

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Convergence of Big Data and Cloud

3rd International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2016) March 10-11, 2016 VIT University, Chennai, India

Cloud Computing and the Future of Internet Services. Wei-Ying Ma Principal Researcher, Research Area Manager Microsoft Research Asia

Big Data Hope or Hype?

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Big Data & Analytics for Semiconductor Manufacturing

Big Data and Science: Myths and Reality

ICT Perspectives on Big Data: Well Sorted Materials

Big Data and Analytics: Challenges and Opportunities

Machine Learning, Data Mining, and Knowledge Discovery: An Introduction

Big Data Text Mining and Visualization. Anton Heijs

Data Mining and Machine Learning in Bioinformatics

Big Data: What You Should Know. Mark Child Research Manager - Software IDC CEMA

Leading Genomics. Diagnostic. Discove. Collab. harma. Shanghai Cambridge, MA Reykjavik

Big Data in Subsea Solutions

A Case of Study on Hadoop Benchmark Behavior Modeling Using ALOJA-ML

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

Inner Classification of Clusters for Online News

Pragmatic Web 4.0. Towards an active and interactive Semantic Media Web. Fachtagung Semantische Technologien September 2013 HU Berlin

Pallas Ludens. We inject human intelligence precisely where automation fails. Jonas Andrulis, Daniel Kondermann

Business Intelligence meets Big Data: An Overview on Security and Privacy

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Search and Information Retrieval

Using Data Mining and Machine Learning in Retail

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

UPS battery remote monitoring system in cloud computing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Big Data Integration: A Buyer's Guide

Getting the Most Out of SIEM. Presentation Title. Data in Big Data. Presented By: Dr. Char Sample, CERT

ANALYTICS IN BIG DATA ERA

Manjula Ambur NASA Langley Research Center April 2014

Implementation of Big Data and Analytics Projects with Big Data Discovery and BICS March 2015

Industry 4.0 and Big Data

Transcription:

Conquering the Astronomical Data Flood through Machine Learning and Citizen Science Kirk Borne George Mason University School of Physics, Astronomy, & Computational Sciences http://spacs.gmu.edu/

The Problem: Big Data is a Big Challenge 2

LSST big data challenge #1 Each night for 10 years LSST will obtain roughly the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey Our grad students will be asked to mine these data (~20 TB each night 40,000 CDs filled with data): A truckload of CDs each and every day for 10 yrs Cumulatively, a football stadium full of 100 million CDs Cumulatively, a football stadium full of 100 million CDs after 10 yrs

LSST big data challenge #1 Each night for 10 years LSST will obtain roughly the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey Our grad students will be asked to mine these data (~20 TB each night 40,000 CDs filled with data): A truckload of CDs each and every day for 10 yrs Cumulatively, a football stadium full of 100 million CDs after 10 yrs The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data. Yes, more is most definitely different!

LSST big data challenge # 2 Approximately 2,000,000 times each night for 10 years LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events. What will we do with all of these events? flux time

LSST big data challenge # 2 Approximately 2,000,000 times each night for 10 years LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events. What will we do with all of these events? flux Characterize first! (Unsupervised Learning) Classify later. time

Characterization includes Feature Detection and Extraction: Identifying and describing features in the data via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science) Extracting feature descriptors from the data Curating these features for search, re-use, & discovery Finding other parameters and features from other archives, other databases, other information sources and using those to help characterize (ultimately classify) each new event.

Data-driven Discovery (Unsupervised Learning) i.e., What can I do with characterizations? 1. Class Discovery Clustering 2. Principal Component Analysis Dimension Reduction 3. Outlier (Anomaly/ Deviation / Novelty) Detection Surprise Discovery 4. Link Analysis Association Analysis Network Analysis 5. and more.

The Promise: Big Data leads to Big Insights and New Discoveries http://kdd2012.sigkdd.org/ 9

Scary News: Big Data is taking us to a Tipping Point http://bit.ly/huqmu5 10

Good News: Big Data is Sexy http://dilbert.com/strips/comic/2012-09-05/ 11

There are many technologies associated with Big Data http://siliconangle.com/blog/2012/07/13/big-data-nightmares/ 12

One approach to Big Data: Hadoop and Map/Reduce (Computational Science) http://www.bigdatabytes.com/wp-content/uploads/2012/01/big-data.jpg 13

Another approach to Big Data: Data Science (Informatics) 14

A third approach to Big Data: Citizen Science (crowdsourcing) 15

Modes of Computing Numerical Computation (in silico) Fast, efficient Processing power is rapidly increasing Model-dependent, subjective, only as good as your best hypothesis Computational Intelligence Data-driven, objective (machine learning) Often relies on human-generated training data Often generated by a single investigator Primitive algorithms Not as good as humans on most tasks Human Computation (Carbon-based Computing) Data-driven, objective (human cognition) Creates training sets, Cross-checks machine results Excellent at finding patterns, image classification Capable of classifying anomalies that machines don t understand Slow at numerical processing, low bandwidth, easily distracted

Galaxy Zoo: example of Citizen Science (crowdsourcing) http://astrophysics.gsfc.nasa.gov/outreach/podcast/wordpress/index.php/2010/10/08/saras-blog-be-a-scientist/ 17

Galaxy Zoo: example of Citizen Science (crowdsourcing) http://astrophysics.gsfc.nasa.gov/outreach/podcast/wordpress/index.php/2010/10/08/saras-blog-be-a-scientist/ 18

There are 2 main types of galaxies in the Universe: Spiral & Elliptical (plus there are some peculiar & irregular galaxies) Spiral Elliptical

Gallery of Elliptical Galaxies M32 M59 M87 M105 M110=NGC205

Gallery of Face-on Spiral Galaxies

There are lots of Peculiar Galaxies also!

There are lots of things you can do with these peculiar galaxies Spell your name in galaxies @ http://writing.galaxyzoo.org k i r k b o r n e 23

Colliding and Merging Galaxies = Interacting Galaxies Galaxies Gone Wild! Colliding and Merging Galaxies = Interacting Galaxies

Merging/Colliding Galaxies are the building blocks of the Universe: 1+1=1

Galaxy Mergers Zoo Gallery Sloan image Actual sky image. Volunteer-selected models, based on viewing tens of thousands of real-time numerical simulations. SDSS 587722984435351614

Galaxy Mergers Zoo Gallery Sloan image SDSS 587722984435351614

Galaxy Mergers Zoo Gallery Sloan image SDSS 587726033843585146

Galaxy Mergers Zoo Gallery Sloan image SDSS 587739646743412797

Galaxy Mergers Zoo Gallery Sloan image SDSS 587739721900163101

Galaxy Mergers Zoo Gallery Sloan image SDSS 587727222471131318

Galaxy Mergers Zoo Gallery Sloan image SDSS 588011124116422756

Key Feature of Zooniverse: Data mining from the volunteer-contributed labels Train the automated pipeline classifiers with: Improved classification algorithms Better identification of anomalies Fewer classification errors Millions of training examples (V&V) Hundreds of millions of class labels Statistics deluxe! Users (see paper: http://arxiv.org/abs/0909.2925 ) Uncertainty Quantification (UQ) Classification certainty vs. Classification dispersion

Astroinformatics for Eventful Astronomy Report discoveries back to the science database for community reuse Basic astronomical objects (informatics granules) are annotated... with follow-up observations of any kind with new knowledge discovered with common knowledge with inter-relationships between objects and their properties with concepts with context With assertions (e.g., classifications, concepts, quality flags, relationships, references, observational parameters, common knowledge, inter-connectivity with other objects) with experimental parameters with observer / observatory descriptors Enables knowledge-sharing and reuse Semantics! Provenance (traceback)

Astroinformatics for Eventful Astronomy In order to facilitate filtering and prioritization of events for rapid follow-up observations, a near real-time characterization provider of tags (end-user annotations) for each object and event is needed. The semantic integration of real-time survey data products with federated VO-accessible archival information resources will facilitate the sharing of knowledge-rich quantifiable astronomical features (event characterizations) to the research community. An astroinformatics-enabled characterization service for large sky surveys provides uniform tags, metadata, labels, terminology. Use cases of the characterization service include knowledge capture, annotation, data mining, & queries of distributed knowledgebases. The addition of human-provided annotations and semantic tagging, in structured form, will enhance and improve eventful astronomy research and worldwide astronomical knowledge.

Responding to Big Data in Science X-Informatics (e.g., X = Bio, Geo, Astro, ): addresses the scientific data lifecycle challenges in the era of Big Data and data-intensive science via data science techniques for indexing, accessing, searching, fusing, integrating, mining, and analyzing massive data repositories. Includes automatic (autonomous) tagging and annotation Citizen Science (user-guided, informatics-powered): Human computation (e.g., tagging, labeling, classification) characterized by enormous cognitive capacity and pattern recognition efficiency (carbon-based computing) Semantic e-science and Volunteer Citizen Science Tagging everything, everywhere: Analytics in the Cloud

Astroinformatics and Citizen Science: Resistance is futile, you will be assimilated