1 Conquering the Astronomical Data Flood through Machine Learning and Citizen Science Kirk Borne George Mason University School of Physics, Astronomy, & Computational Sciences
2 The Problem: Big Data is a Big Challenge 2
3 LSST big data challenge #1 Each night for 10 years LSST will obtain roughly the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey Our grad students will be asked to mine these data (~20 TB each night 40,000 CDs filled with data): A truckload of CDs each and every day for 10 yrs Cumulatively, a football stadium full of 100 million CDs Cumulatively, a football stadium full of 100 million CDs after 10 yrs
4 LSST big data challenge #1 Each night for 10 years LSST will obtain roughly the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey Our grad students will be asked to mine these data (~20 TB each night 40,000 CDs filled with data): A truckload of CDs each and every day for 10 yrs Cumulatively, a football stadium full of 100 million CDs after 10 yrs The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data. Yes, more is most definitely different!
5 LSST big data challenge # 2 Approximately 2,000,000 times each night for 10 years LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events. What will we do with all of these events? flux time
6 LSST big data challenge # 2 Approximately 2,000,000 times each night for 10 years LSST will detect a new sky event, and the astronomical community will be challenged with classifying these events. What will we do with all of these events? flux Characterize first! (Unsupervised Learning) Classify later. time
7 Characterization includes Feature Detection and Extraction: Identifying and describing features in the data via machine algorithms or human inspection (including the potentially huge contributions from Citizen Science) Extracting feature descriptors from the data Curating these features for search, re-use, & discovery Finding other parameters and features from other archives, other databases, other information sources and using those to help characterize (ultimately classify) each new event.
8 Data-driven Discovery (Unsupervised Learning) i.e., What can I do with characterizations? 1. Class Discovery Clustering 2. Principal Component Analysis Dimension Reduction 3. Outlier (Anomaly/ Deviation / Novelty) Detection Surprise Discovery 4. Link Analysis Association Analysis Network Analysis 5. and more.
9 The Promise: Big Data leads to Big Insights and New Discoveries 9
10 Scary News: Big Data is taking us to a Tipping Point 10
11 Good News: Big Data is Sexy 11
12 There are many technologies associated with Big Data 12
13 One approach to Big Data: Hadoop and Map/Reduce (Computational Science) 13
14 Another approach to Big Data: Data Science (Informatics) 14
15 A third approach to Big Data: Citizen Science (crowdsourcing) 15
16 Modes of Computing Numerical Computation (in silico) Fast, efficient Processing power is rapidly increasing Model-dependent, subjective, only as good as your best hypothesis Computational Intelligence Data-driven, objective (machine learning) Often relies on human-generated training data Often generated by a single investigator Primitive algorithms Not as good as humans on most tasks Human Computation (Carbon-based Computing) Data-driven, objective (human cognition) Creates training sets, Cross-checks machine results Excellent at finding patterns, image classification Capable of classifying anomalies that machines don t understand Slow at numerical processing, low bandwidth, easily distracted
17 Galaxy Zoo: example of Citizen Science (crowdsourcing) 17
18 Galaxy Zoo: example of Citizen Science (crowdsourcing) 18
19 There are 2 main types of galaxies in the Universe: Spiral & Elliptical (plus there are some peculiar & irregular galaxies) Spiral Elliptical
20 Gallery of Elliptical Galaxies M32 M59 M87 M105 M110=NGC205
21 Gallery of Face-on Spiral Galaxies
22 There are lots of Peculiar Galaxies also!
23 There are lots of things you can do with these peculiar galaxies Spell your name in k i r k b o r n e 23
24 Colliding and Merging Galaxies = Interacting Galaxies Galaxies Gone Wild! Colliding and Merging Galaxies = Interacting Galaxies
25 Merging/Colliding Galaxies are the building blocks of the Universe: 1+1=1
26 Galaxy Mergers Zoo Gallery Sloan image Actual sky image. Volunteer-selected models, based on viewing tens of thousands of real-time numerical simulations. SDSS
27 Galaxy Mergers Zoo Gallery Sloan image SDSS
28 Galaxy Mergers Zoo Gallery Sloan image SDSS
29 Galaxy Mergers Zoo Gallery Sloan image SDSS
30 Galaxy Mergers Zoo Gallery Sloan image SDSS
31 Galaxy Mergers Zoo Gallery Sloan image SDSS
32 Galaxy Mergers Zoo Gallery Sloan image SDSS
33 Key Feature of Zooniverse: Data mining from the volunteer-contributed labels Train the automated pipeline classifiers with: Improved classification algorithms Better identification of anomalies Fewer classification errors Millions of training examples (V&V) Hundreds of millions of class labels Statistics deluxe! Users (see paper: ) Uncertainty Quantification (UQ) Classification certainty vs. Classification dispersion
34 Astroinformatics for Eventful Astronomy Report discoveries back to the science database for community reuse Basic astronomical objects (informatics granules) are annotated... with follow-up observations of any kind with new knowledge discovered with common knowledge with inter-relationships between objects and their properties with concepts with context With assertions (e.g., classifications, concepts, quality flags, relationships, references, observational parameters, common knowledge, inter-connectivity with other objects) with experimental parameters with observer / observatory descriptors Enables knowledge-sharing and reuse Semantics! Provenance (traceback)
35 Astroinformatics for Eventful Astronomy In order to facilitate filtering and prioritization of events for rapid follow-up observations, a near real-time characterization provider of tags (end-user annotations) for each object and event is needed. The semantic integration of real-time survey data products with federated VO-accessible archival information resources will facilitate the sharing of knowledge-rich quantifiable astronomical features (event characterizations) to the research community. An astroinformatics-enabled characterization service for large sky surveys provides uniform tags, metadata, labels, terminology. Use cases of the characterization service include knowledge capture, annotation, data mining, & queries of distributed knowledgebases. The addition of human-provided annotations and semantic tagging, in structured form, will enhance and improve eventful astronomy research and worldwide astronomical knowledge.
36 Responding to Big Data in Science X-Informatics (e.g., X = Bio, Geo, Astro, ): addresses the scientific data lifecycle challenges in the era of Big Data and data-intensive science via data science techniques for indexing, accessing, searching, fusing, integrating, mining, and analyzing massive data repositories. Includes automatic (autonomous) tagging and annotation Citizen Science (user-guided, informatics-powered): Human computation (e.g., tagging, labeling, classification) characterized by enormous cognitive capacity and pattern recognition efficiency (carbon-based computing) Semantic e-science and Volunteer Citizen Science Tagging everything, everywhere: Analytics in the Cloud
37 Astroinformatics and Citizen Science: Resistance is futile, you will be assimilated
DATA MINING CONCEPTS AND TECHNIQUES Marek Maurizio E-commerce, winter 2011 INTRODUCTION Overview of data mining Emphasis is placed on basic data mining concepts Techniques for uncovering interesting data
32 Big Data: present and future Big Data: present and future Mircea Răducu TRIFU, Mihaela Laura IVAN University of Economic Studies, Bucharest, Romania email@example.com, firstname.lastname@example.org
International Journal of Computer Science and Applications, Technomathematics Research Foundation Vol. 11, No. 3, pp. 116 127, 2014 ANALYTICS ON BIG AVIATION DATA: TURNING DATA INTO INSIGHTS RAJENDRA AKERKAR
Challenges and Opportunities with Big Data A community white paper developed by leading researchers across the United States Executive Summary The promise of data-driven decision-making is now being recognized
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 4, Number 1 (2014), pp. 33-40 International Research Publications House http://www. irphouse.com /ijict.htm Big Data
American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-03, Issue-05, pp-266-270 www.ajer.org Research Paper Open Access Convergence of Big Data and Cloud Sreevani.Y.V.
Auto-Grouping Emails For Faster E-Discovery Sachindra Joshi IBM Research, India email@example.com Prasad M Deshpande IBM Research, India firstname.lastname@example.org Danish Contractor IBM Research, India email@example.com
NESSI White Paper, December 2012 Big Data A New World of Opportunities Contents 1. Executive Summary... 3 2. Introduction... 4 2.1. Political context... 4 2.2. Research and Big Data... 5 2.3. Purpose of
TABLE OF CONTENTS Introduction... 3 The Importance of Triplestores... 4 Why Triplestores... 5 The Top 8 Things You Should Know When Considering a Triplestore... 9 Inferencing... 9 Integration with Text
The Mystery Machine: End-to-end performance analysis of large-scale Internet services Michael Chow, David Meisner, Jason Flinn, Daniel Peek, Thomas F. Wenisch University of Michigan Facebook, Inc. Abstract
ENSEMBLE WHITE PAPER ENABLING THE REAL-TIME ENTERPRISE BUSINESS ACTIVITY MONITORING WITH ENSEMBLE 1 INTRODUCTION ENABLING THE REAL-TIME ENTERPRISE Executive Summary Business Activity Monitoring (BAM) enhances
SPAN White Paper!? Sentiment Analysis on Big Data Machine Learning Approach Several sources on the web provide deep insight about people s opinions on the products and services of various companies. Social
Big-Data Computing: Creating revolutionary breakthroughs in commerce, science, and society Randal E. Bryant Carnegie Mellon University Randy H. Katz University of California, Berkeley Version 8: December
ANALYTICS ON BIG FAST DATA USING REAL TIME STREAM DATA PROCESSING ARCHITECTURE Dibyendu Bhattacharya Architect-Big Data Analytics HappiestMinds Manidipa Mitra Principal Software Engineer EMC Table of Contents
BIG DATA: CHALLENGES AND OPPORTUNITIES IN LOGISTICS SYSTEMS Branka Mikavica a*, Aleksandra Kostić-Ljubisavljević a*, Vesna Radonjić Đogatović a a University of Belgrade, Faculty of Transport and Traffic
Customer Cloud Architecture for Big Data and Analytics Executive Overview Using analytics reveals patterns, trends and associations in data that help an organization understand the behavior of the people
WHITE PAPER Balancing Access to Information While Preserving Privacy, Security and Governance in the Era of Big Data. EXECUTIVE SUMMARY This white paper explores the critical role that privacy, security
For Big Data Analytics There s No Such Thing as Too Big The Compelling Economics and Technology of Big Data Computing March 2012 By: 4syth.com Emerging big data thought leaders Forsyth Communications 2012.
INTELLIGENT BUSINESS STRATEGIES W H I T E P A P E R Architecting A Big Data Platform for Analytics By Mike Ferguson Intelligent Business Strategies October 2012 Prepared for: Table of Contents Introduction...
JANUARY 2013 REPORT OF THE DEFENSE SCIENCE BOARD TASK FORCE ON Cyber Security and Reliability in a Digital Cloud JANUARY 2013 Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics
IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE, VOL. XX, NO. X, XXXX 20XX 1 Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives Zhi-Hua Zhou, Nitesh V. Chawla, Yaochu Jin, and
NOVEMBER 2014 TDWI E-Book Predictive Analytics: Revolutionizing Business Decision Making 1 Q&A: Predictive Analytics 101 3 Who Should Be Building Predictive Models? 5 Exploratory Predictive Analytics 8
An Oracle White Paper March 2013 Big Data Analytics Advanced Analytics in Oracle Database Advanced Analytics in Oracle Database Disclaimer The following is intended to outline our general product direction.
Data-Intensive Text Processing with MapReduce i Jimmy Lin and Chris Dyer University of Maryland, College Park Manuscript prepared April 11, 2010 This is the pre-production manuscript of a book in the Morgan