Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research
|
|
|
- Berniece Cummings
- 10 years ago
- Views:
Transcription
1 Astrophysics with Terabyte Datasets Alex Szalay, JHU and Jim Gray, Microsoft Research
2 Living in an Exponential World Astronomers have a few hundred TB now 1 pixel (byte) / sq arc second ~ 4TB Multi-spectral, temporal, 1PB They mine it looking for new (kinds of) objects or more of interesting ones (quasars), density variations in 400-D space correlations in 400-D space Data doubles every year Driven by Moore s Law Data is public after 1 year 50% of the data is public Same access for everyone CCDs Glass
3 Evolving Science Thousand years ago: science was empirical describing natural phenomena Last few hundred years: theoretical branch using models, generalizations Last few decades: a computational branch simulating complex phenomena Today: data exploration (escience) synthesizing theory, experiment and computation with advanced data management and statistics. a a 2 4πGρ c = Κ 3 a 2 2
4 Making Discoveries Where are discoveries made? At the edges and boundaries Going deeper, collecting more data, using more colors. Metcalfe s law Utility of computer networks grows as the number of possible connections: O(N 2 ) Data federations Federation of N archives has utility O(N 2 ) Possibilities for new discoveries grow as O(N 2 ) Current sky surveys have proven this Very early discoveries from SDSS, 2MASS, DPOSS
5 Smart Data If there is too much data to move around, take the analysis to the data! Do all data manipulations at database Build custom procedures and functions in the database Automatic parallelism guaranteed Easy to build-in custom functionality Databases & Procedures being unified Example temporal and spatial indexing Pixel processing Easy to reorganize the data Multiple views, each optimal for certain analyses Building hierarchical summaries are trivial Scalable to Petabyte datasets active databases!
6 Next-Generation Data Analysis Looking for Needles in haystacks the Higgs particle Haystacks: Dark matter, Dark energy Needles are easier than haystacks Optimal statistics have poor scaling Correlation functions are N 2, likelihood techniques N 3 For large data sets main errors are not statistical As data and computers grow with Moore s Law, we can only keep up with N logn A way out? Discard notion of optimal (data is fuzzy, answers are approximate) Don t assume infinite computational resources or memory Requires combination of statistics & computer science
7 Astronomical Data Imaging 2D map of the sky at multiple (20+) wavelengths Derived catalogs subsequent processing of images (segmentation) extracting object parameters (400+ per object) Spectroscopic follow-up spectra: more detailed object properties clues to physical state and formation history lead to distances: 3D maps Numerical simulations (1B+ objects) All inter-related!
8 Imaging Data
9 3D Maps
10 N-body Simulations
11 Visualization Mostly connecting pixels to objects Data correlated in multiple dimensions Scatter plots useless for cardinalities of 100M Very large density contrast in data: Rare object fractions are 10-8 Time domain astronomy beginning Petabytes/yr in 2010 Bimodal approach Scientists use primitive tools for obsdata (scatter plot, xgobi) Slightly better visualization tools for simulations High end visualization for publicity only (Hayden ) Schoolkidsare used to better visualization than scientists
12 The Virtual Observatory Premise: most data is (or could be online) The Internet is the world s best telescope: It has data on every part of the sky In every measured spectral band: optical, x-ray, radio.. As deep as the best instruments (2 years ago). It is up when you are up The seeing is always great It s a smart telescope: links objects and data to literature on them Software became the capital expense Share, standardize, reuse..
13 SkyServer SkyServer is an educational website, based on the Sloan Digital Sky Survey data Access to an underlying multiterabyte database More than 50 hours of educational exercises Background on astronomy Tutorials and documentation Searchable web pages Interactive simple visual tools for data exploration Prototype escience lab
14 Summary Data growing exponentially in many different areas Publishing so much data requires a new model Multiple challenges for different communities publishing, statistics/data mining, data visualization, digital libraries, education, web services, Information at your fingertips: Students see the same data as professional scientists More data coming: Petabytes/year by 2010 We need scalable analysis algorithms Move analysis to the data! Same thing happening in all sciences High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, Data Exploration: an emerging new branch of science Currently has no owner
The World-Wide Telescope, an Archetype for Online Science
The World-Wide Telescope, an Archetype for Online Science Jim Gray, Microsoft Research Alex Szalay, Johns Hopkins University June 2002 Technical Report MSR-TR-2002-75 Microsoft Research Microsoft Corporation
Migrating a (Large) Science Database to the Cloud
The Sloan Digital Sky Survey Migrating a (Large) Science Database to the Cloud Ani Thakar Alex Szalay Center for Astrophysical Sciences and Institute for Data Intensive Engineering and Science (IDIES)
Data Mining Challenges and Opportunities in Astronomy
Data Mining Challenges and Opportunities in Astronomy S. G. Djorgovski (Caltech) With special thanks to R. Brunner, A. Szalay, A. Mahabal, et al. The Punchline: Astronomy has become an immensely datarich
What is the Sloan Digital Sky Survey?
What is the Sloan Digital Sky Survey? Simply put, the Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken. The survey will map one-quarter of the entire sky in detail, determining
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
LSST and the Cloud: Astro Collaboration in 2016 Tim Axelrod LSST Data Management Scientist
LSST and the Cloud: Astro Collaboration in 2016 Tim Axelrod LSST Data Management Scientist DERCAP Sydney, Australia, 2009 Overview of Presentation LSST - a large-scale Southern hemisphere optical survey
Learning from Big Data in
Learning from Big Data in Astronomy an overview Kirk Borne George Mason University School of Physics, Astronomy, & Computational Sciences http://spacs.gmu.edu/ From traditional astronomy 2 to Big Data
Information Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])
Computational Science and Informatics (Data Science) Programs at GMU
Computational Science and Informatics (Data Science) Programs at GMU Kirk Borne George Mason University School of Physics, Astronomy, & Computational Sciences http://spacs.gmu.edu/ Outline Graduate Program
MAST: The Mikulski Archive for Space Telescopes
MAST: The Mikulski Archive for Space Telescopes Richard L. White Space Telescope Science Institute 2015 April 1, NRC Space Science Week/CBPSS A model for open access The NASA astrophysics data archives
How To Teach Data Science
The Past, Present, and Future of Data Science Education Kirk Borne @KirkDBorne http://kirkborne.net George Mason University School of Physics, Astronomy, & Computational Sciences Outline Research and Application
Visualizing and Analyzing Massive Astronomical Datasets with Partiview
Visualizing and Analyzing Massive Astronomical Datasets with Partiview Brian P. Abbott 1, Carter B. Emmart 1, Stuart Levy 2, and Charles T. Liu 1 1 American Museum of Natural History & Hayden Planetarium,
Data analysis of L2-L3 products
Data analysis of L2-L3 products Emmanuel Gangler UBP Clermont-Ferrand (France) Emmanuel Gangler BIDS 14 1/13 Data management is a pillar of the project : L3 Telescope Caméra Data Management Outreach L1
Big Data Hope or Hype?
Big Data Hope or Hype? David J. Hand Imperial College, London and Winton Capital Management Big data science, September 2013 1 Google trends on big data Google search 1 Sept 2013: 1.6 billion hits on big
Einstein Rings: Nature s Gravitational Lenses
National Aeronautics and Space Administration Einstein Rings: Nature s Gravitational Lenses Leonidas Moustakas and Adam Bolton Taken from: Hubble 2006 Science Year in Review The full contents of this book
The Status of Astronomical Control Systems. Steve Scott OVRO
The Status of Astronomical Control Systems Steve Scott OVRO Astronomical Control System Status 10 years ago radio control systems were much more sophisticated than optical. Why? Single pixel detectors
Big Data a threat or a chance?
Big Data a threat or a chance? Helwig Hauser University of Bergen, Dept. of Informatics Big Data What is Big Data? well, lots of data, right? we come back to this in a moment. certainly, a buzz-word but
The Virtual Observatory: What is it and how can it help me? Enrique Solano LAEFF / INTA Spanish Virtual Observatory
The Virtual Observatory: What is it and how can it help me? Enrique Solano LAEFF / INTA Spanish Virtual Observatory Astronomy in the XXI century The Internet revolution (the dot com boom ) has transformed
Conquering the Astronomical Data Flood through Machine
Conquering the Astronomical Data Flood through Machine Learning and Citizen Science Kirk Borne George Mason University School of Physics, Astronomy, & Computational Sciences http://spacs.gmu.edu/ The Problem:
Big Data Management and Analytics
Big Data Management and Analytics Lecture Notes Winter semester 2015 / 2016 Ludwig-Maximilians-University Munich Prof. Dr. Matthias Renz 2015 Based on lectures by Donald Kossmann (ETH Zürich), as well
Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy
Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering
The Astronomical Data Warehouse
The Astronomical Data Warehouse Clive G. Page Department of Physics & Astronomy, University of Leicester, Leicester, LE1 7RH, UK Abstract. The idea of the astronomical data warehouse has arisen as the
Unleash your intuition
Introducing Qlik Sense Unleash your intuition Qlik Sense is a next-generation self-service data visualization application that empowers everyone to easily create a range of flexible, interactive visualizations
The Data Engineer. Mike Tamir Chief Science Officer Galvanize. Steven Miller Global Leader Academic Programs IBM Analytics
The Data Engineer Mike Tamir Chief Science Officer Galvanize Steven Miller Global Leader Academic Programs IBM Analytics Alessandro Gagliardi Lead Faculty Galvanize Businesses are quickly realizing that
The Tonnabytes Big Data Challenge: Transforming Science and Education. Kirk Borne George Mason University
The Tonnabytes Big Data Challenge: Transforming Science and Education Kirk Borne George Mason University Ever since we first began to explore our world humans have asked questions and have collected evidence
SAP Solution Brief SAP HANA. Transform Your Future with Better Business Insight Using Predictive Analytics
SAP Brief SAP HANA Objectives Transform Your Future with Better Business Insight Using Predictive Analytics Dealing with the new reality Dealing with the new reality Organizations like yours can identify
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
DAME Astrophysical DAta Mining Mining & & Exploration Exploration GRID
DAME Astrophysical DAta Mining & Exploration on GRID M. Brescia S. G. Djorgovski G. Longo & DAME Working Group Istituto Nazionale di Astrofisica Astronomical Observatory of Capodimonte, Napoli Department
One-Size-Fits-All: A DBMS Idea Whose Time has Come and Gone. Michael Stonebraker December, 2008
One-Size-Fits-All: A DBMS Idea Whose Time has Come and Gone Michael Stonebraker December, 2008 DBMS Vendors (The Elephants) Sell One Size Fits All (OSFA) It s too hard for them to maintain multiple code
Sanjeev Kumar. contribute
RESEARCH ISSUES IN DATAA MINING Sanjeev Kumar I.A.S.R.I., Library Avenue, Pusa, New Delhi-110012 [email protected] 1. Introduction The field of data mining and knowledgee discovery is emerging as a
The Crafoord Prize 2005
I N F O R M A T I O N F O R T H E P U B L I C The Royal Swedish Academy of Sciences has decided to award the Crafoord Prize in Astronomy 2005 to James Gunn, Princeton University, USA, James Peebles, Princeton
ASKAP Science Data Archive: Users and Requirements CSIRO ASTRONOMY AND SPACE SCIENCE (CASS)
ASKAP Science Data Archive: Users and Requirements CSIRO ASTRONOMY AND SPACE SCIENCE (CASS) Jessica Chapman, Data Workshop March 2013 ASKAP Science Data Archive Talk outline Data flow in brief Some radio
I N T E L L I G E N T S O L U T I O N S, I N C. DATA MINING IMPLEMENTING THE PARADIGM SHIFT IN ANALYSIS & MODELING OF THE OILFIELD
I N T E L L I G E N T S O L U T I O N S, I N C. OILFIELD DATA MINING IMPLEMENTING THE PARADIGM SHIFT IN ANALYSIS & MODELING OF THE OILFIELD 5 5 T A R A P L A C E M O R G A N T O W N, W V 2 6 0 5 0 USA
Good morning. It is a pleasure to be with you here today to talk about the value and promise of Big Data.
Good morning. It is a pleasure to be with you here today to talk about the value and promise of Big Data. 1 Advances in information technologies are transforming the fabric of our society and data represent
The Solar Journey: Modeling Features of the Local Bubble and Galactic Environment of the Sun
The Solar Journey: Modeling Features of the Local Bubble and Galactic Environment of the Sun P.C. Frisch and A.J. Hanson Department of Astronomy and Astrophysics University of Chicago and Computer Science
Introduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
The Challenge of Data in an Era of Petabyte Surveys Andrew Connolly University of Washington
The Challenge of Data in an Era of Petabyte Surveys Andrew Connolly University of Washington We acknowledge support from NSF IIS-0844580 and NASA 08-AISR08-0081 The science of big data sets Big Questions
A Critical Review of Scientific Visualization in Astronomy
A Critical Review of Scientific Visualization in Astronomy Christopher Fluke & Amr Hassan Those not properly initiated into the mysteries of visualization research often seek to understand the images rather
Galaxy Survey data analysis using SDSS-III as an example
Galaxy Survey data analysis using SDSS-III as an example Will Percival (University of Portsmouth) showing work by the BOSS galaxy clustering working group" Cosmology from Spectroscopic Galaxy Surveys"
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. and Alex Gray
Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data Željko Ivezić, Andrew J. Connolly, Jacob T. VanderPlas University of Washington and Alex
EMC NETWORKER AND DATADOMAIN
EMC NETWORKER AND DATADOMAIN Capabilities, options and news Madis Pärn Senior Technology Consultant EMC [email protected] 1 IT Pressures 2009 0.8 Zettabytes 2020 35.2 Zettabytes DATA DELUGE BUDGET DILEMMA
Modeling the Expanding Universe
H9 Modeling the Expanding Universe Activity H9 Grade Level: 8 12 Source: This activity is produced by the Universe Forum at NASA s Office of Space Science, along with their Structure and Evolution of the
Introduction of Information Visualization and Visual Analytics. Chapter 2. Introduction and Motivation
Introduction of Information Visualization and Visual Analytics Chapter 2 Introduction and Motivation Overview! 2 Overview and Motivation! Information Visualization (InfoVis)! InfoVis Application Areas!
Exploring Space in Cyberspace: Cyber-Enabled Research and
011 Exploring Space in Cyberspace: 01100 Cyber-Enabled Research and 1010011 Discovery in Astronomy 00101000 S. George Djorgovski 1110100011 (Caltech) 001001110110110 100101010001011101 CDI Workshop, Seattle,
AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW
AGENDA What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story Hadoop PDW Our BIG DATA Roadmap BIG DATA? Volume 59% growth in annual WW information 1.2M Zetabytes (10 21 bytes) this
Clustering & Visualization
Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery
Supervised Classification workflow in ENVI 4.8 using WorldView-2 imagery WorldView-2 is the first commercial high-resolution satellite to provide eight spectral sensors in the visible to near-infrared
Data Management using irods
Data Management using irods Fundamentals of Data Management September 2014 Albert Heyrovsky Applications Developer, EPCC [email protected] 2 Course outline Why talk about irods? What is irods?
Information Visualization WS 2013/14 11 Visual Analytics
1 11.1 Definitions and Motivation Lot of research and papers in this emerging field: Visual Analytics: Scope and Challenges of Keim et al. Illuminating the path of Thomas and Cook 2 11.1 Definitions and
and the VO-Science Francisco Jiménez Esteban Suffolk University
The Spanish-VO and the VO-Science Francisco Jiménez Esteban CAB / SVO (INTA-CSIC) Suffolk University The Spanish-VO (SVO) IVOA was created in June 2002 with the mission to facilitate the international
Galaxy Morphological Classification
Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,
MODIS IMAGES RESTORATION FOR VNIR BANDS ON FIRE SMOKE AFFECTED AREA
MODIS IMAGES RESTORATION FOR VNIR BANDS ON FIRE SMOKE AFFECTED AREA Li-Yu Chang and Chi-Farn Chen Center for Space and Remote Sensing Research, National Central University, No. 300, Zhongda Rd., Zhongli
The PACS Software System. (A high level overview) Prepared by : E. Wieprecht, J.Schreiber, U.Klaas November,5 2007 Issue 1.
The PACS Software System (A high level overview) Prepared by : E. Wieprecht, J.Schreiber, U.Klaas November,5 2007 Issue 1.0 PICC-ME-DS-003 1. Introduction The PCSS, the PACS ICC Software System, is the
Dominique Fouchez. 12 Fevrier 2011
données données CPPM 12 Fevrier 2011 The Data données one 6.4-gigabyte image every 17 seconds 15 terabytes of raw scientific image data / night 60-petabyte final image data archive 20-petabyte final database
Efficient data reduction and analysis of DECam images using multicore architecture Poor man s approach to Big data
Efficient data reduction and analysis of DECam images using multicore architecture Poor man s approach to Big data Instituto de Astrofísica Pontificia Universidad Católica de Chile Thomas Puzia, Maren
Challenges in e-science: Research in a Digital World
Challenges in e-science: Research in a Digital World Thom Dunning National Center for Supercomputing Applications National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
Calculation of Minimum Distances. Minimum Distance to Means. Σi i = 1
Minimum Distance to Means Similar to Parallelepiped classifier, but instead of bounding areas, the user supplies spectral class means in n-dimensional space and the algorithm calculates the distance between
NIH Commons Overview, Framework & Pilots - Version 1. The NIH Commons
The NIH Commons Summary The Commons is a shared virtual space where scientists can work with the digital objects of biomedical research, i.e. it is a system that will allow investigators to find, manage,
Data Requirements from NERSC Requirements Reviews
Data Requirements from NERSC Requirements Reviews Richard Gerber and Katherine Yelick Lawrence Berkeley National Laboratory Summary Department of Energy Scientists represented by the NERSC user community
