Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research



Similar documents
The World-Wide Telescope, an Archetype for Online Science

Migrating a (Large) Science Database to the Cloud

Data Mining Challenges and Opportunities in Astronomy

What is the Sloan Digital Sky Survey?

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

LSST and the Cloud: Astro Collaboration in 2016 Tim Axelrod LSST Data Management Scientist

Learning from Big Data in

Information Management course

Computational Science and Informatics (Data Science) Programs at GMU

MAST: The Mikulski Archive for Space Telescopes

How To Teach Data Science

Visualizing and Analyzing Massive Astronomical Datasets with Partiview

Data analysis of L2-L3 products

Big Data Hope or Hype?

Einstein Rings: Nature s Gravitational Lenses

The Status of Astronomical Control Systems. Steve Scott OVRO

Big Data a threat or a chance?

The Virtual Observatory: What is it and how can it help me? Enrique Solano LAEFF / INTA Spanish Virtual Observatory

Conquering the Astronomical Data Flood through Machine

Big Data Management and Analytics

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

The Astronomical Data Warehouse

Unleash your intuition

The Data Engineer. Mike Tamir Chief Science Officer Galvanize. Steven Miller Global Leader Academic Programs IBM Analytics

The Tonnabytes Big Data Challenge: Transforming Science and Education. Kirk Borne George Mason University

SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

DAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID

One-Size-Fits-All: A DBMS Idea Whose Time has Come and Gone. Michael Stonebraker December, 2008

Sanjeev Kumar. contribute

The Crafoord Prize 2005

ASKAP Science Data Archive: Users and Requirements CSIRO ASTRONOMY AND SPACE SCIENCE (CASS)

I N T E L L I G E N T S O L U T I O N S, I N C. DATA MINING IMPLEMENTING THE PARADIGM SHIFT IN ANALYSIS & MODELING OF THE OILFIELD

Good morning. It is a pleasure to be with you here today to talk about the value and promise of Big Data.

The Solar Journey: Modeling Features of the Local Bubble and Galactic Environment of the Sun

Introduction to Data Mining

The Challenge of Data in an Era of Petabyte Surveys Andrew Connolly University of Washington

A Critical Review of Scientific Visualization in Astronomy

Galaxy Survey data analysis using SDSS-III as an example

The Scientific Data Mining Process

Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. and Alex Gray

EMC NETWORKER AND DATADOMAIN

Modeling the Expanding Universe

Introduction of Information Visualization and Visual Analytics. Chapter 2. Introduction and Motivation

Exploring Space in Cyberspace: Cyber-Enabled Research and

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Clustering & Visualization

Environmental Remote Sensing GEOG 2021

Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery

Data Management using irods

Information Visualization WS 2013/14 11 Visual Analytics

and the VO-Science Francisco Jiménez Esteban Suffolk University

Galaxy Morphological Classification

MODIS IMAGES RESTORATION FOR VNIR BANDS ON FIRE SMOKE AFFECTED AREA

The PACS Software System. (A high level overview) Prepared by : E. Wieprecht, J.Schreiber, U.Klaas November, Issue 1.

Dominique Fouchez. 12 Fevrier 2011

Efficient data reduction and analysis of DECam images using multicore architecture Poor man s approach to Big data

Challenges in e-science: Research in a Digital World

Calculation of Minimum Distances. Minimum Distance to Means. Σi i = 1

NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons

Data Requirements from NERSC Requirements Reviews

Transcription:

Astrophysics with Terabyte Datasets Alex Szalay, JHU and Jim Gray, Microsoft Research

Living in an Exponential World Astronomers have a few hundred TB now 1 pixel (byte) / sq arc second ~ 4TB Multi-spectral, temporal, 1PB They mine it looking for new (kinds of) objects or more of interesting ones (quasars), density variations in 400-D space correlations in 400-D space Data doubles every year Driven by Moore s Law Data is public after 1 year 50% of the data is public Same access for everyone 1970 1975 1980 1985 1000 100 10 1 0.1 2000 1995 1990 CCDs Glass

Evolving Science Thousand years ago: science was empirical describing natural phenomena Last few hundred years: theoretical branch using models, generalizations Last few decades: a computational branch simulating complex phenomena Today: data exploration (escience) synthesizing theory, experiment and computation with advanced data management and statistics. a a 2 4πGρ c = Κ 3 a 2 2

Making Discoveries Where are discoveries made? At the edges and boundaries Going deeper, collecting more data, using more colors. Metcalfe s law Utility of computer networks grows as the number of possible connections: O(N 2 ) Data federations Federation of N archives has utility O(N 2 ) Possibilities for new discoveries grow as O(N 2 ) Current sky surveys have proven this Very early discoveries from SDSS, 2MASS, DPOSS

Smart Data If there is too much data to move around, take the analysis to the data! Do all data manipulations at database Build custom procedures and functions in the database Automatic parallelism guaranteed Easy to build-in custom functionality Databases & Procedures being unified Example temporal and spatial indexing Pixel processing Easy to reorganize the data Multiple views, each optimal for certain analyses Building hierarchical summaries are trivial Scalable to Petabyte datasets active databases!

Next-Generation Data Analysis Looking for Needles in haystacks the Higgs particle Haystacks: Dark matter, Dark energy Needles are easier than haystacks Optimal statistics have poor scaling Correlation functions are N 2, likelihood techniques N 3 For large data sets main errors are not statistical As data and computers grow with Moore s Law, we can only keep up with N logn A way out? Discard notion of optimal (data is fuzzy, answers are approximate) Don t assume infinite computational resources or memory Requires combination of statistics & computer science

Astronomical Data Imaging 2D map of the sky at multiple (20+) wavelengths Derived catalogs subsequent processing of images (segmentation) extracting object parameters (400+ per object) Spectroscopic follow-up spectra: more detailed object properties clues to physical state and formation history lead to distances: 3D maps Numerical simulations (1B+ objects) All inter-related!

Imaging Data

3D Maps

N-body Simulations

Visualization Mostly connecting pixels to objects Data correlated in multiple dimensions Scatter plots useless for cardinalities of 100M Very large density contrast in data: Rare object fractions are 10-8 Time domain astronomy beginning Petabytes/yr in 2010 Bimodal approach Scientists use primitive tools for obsdata (scatter plot, xgobi) Slightly better visualization tools for simulations High end visualization for publicity only (Hayden ) Schoolkidsare used to better visualization than scientists

The Virtual Observatory Premise: most data is (or could be online) The Internet is the world s best telescope: It has data on every part of the sky In every measured spectral band: optical, x-ray, radio.. As deep as the best instruments (2 years ago). It is up when you are up The seeing is always great It s a smart telescope: links objects and data to literature on them Software became the capital expense Share, standardize, reuse..

SkyServer SkyServer is an educational website, based on the Sloan Digital Sky Survey data Access to an underlying multiterabyte database More than 50 hours of educational exercises Background on astronomy Tutorials and documentation Searchable web pages Interactive simple visual tools for data exploration Prototype escience lab http://skyserver.sdss.org/

Summary Data growing exponentially in many different areas Publishing so much data requires a new model Multiple challenges for different communities publishing, statistics/data mining, data visualization, digital libraries, education, web services, Information at your fingertips: Students see the same data as professional scientists More data coming: Petabytes/year by 2010 We need scalable analysis algorithms Move analysis to the data! Same thing happening in all sciences High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, Data Exploration: an emerging new branch of science Currently has no owner