The Sloan Digital Sky Survey: From Big Data to Big Database to Big Compute. Heidi Newberg, Rensselaer Polytechnic Institute




Summary
History of the data deluge from a personal perspective.
The transformation of astronomy with the Sloan Digital Sky Survey.
The discovery of density substructure in the Milky Way stellar spheroid.
Using MilkyWay@home to fit more complex models to the data.

The new 1024x1024 CCD camera required a new computer to store the data from just one night of observing (2 megabytes every five minutes). We also needed to write to Exabyte tape drives rather than magnetic tapes, so the data would be easier to carry home on the airplane.

The beginning of the data deluge (1990s)
New CCD cameras produced enough data that we could no longer look at each astronomical object individually. Automated algorithms were needed.
Mag tapes hold 100 Mbytes each, ~2 hrs of observing time per tape. (Requires a large backpack to transport home.)
Exabyte tapes made data transport easier.
I still own all of these tapes, but it is likely that they are not readable. All astronomical data from that era is lost forever.

The Sloan Digital Sky Survey (SDSS) is a joint project of The University of Chicago, Fermilab, the Institute for Advanced Study, the Japan Participation Group, The Johns Hopkins University, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Princeton University, the U.S. Naval Observatory, and the University of Washington (11 institutions). Funding for the project has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Aeronautics and Space Administration, the National Science Foundation, the U.S. Department of Energy, the Japanese Monbukagakusho, and the Max Planck Society.

The Data
Images of 14,000 square degrees of sky in 5 passbands (raw data 20 TB)
A catalog of a billion objects detected in those images (20 TB SQL database), ~400 parameters per object
Other data products (DAS 34 TB)
1.5 million spectra of galaxies, stars, and quasars (3.3 TB)
Spectral parameters (450 Gbytes)
Data reduction??

I discussed the data processing,

Alex Szalay and his group at Johns Hopkins took on the enormous task of putting all of this data into a database, preserving as much provenance as possible, and making the data as accessible as possible. There are serious issues with speed in a database of this size, so his group needed to think hard about how the data would be accessed, and thus how it should be organized.

Scientists were asked for example scientific queries, so the database could be optimized.
The 20 Queries
Q1: Find all galaxies without saturated pixels within 1' of a given point of ra=75.327, dec=21.023.
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10<super galactic latitude (sgb)<10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity>0.5, and with the major axis of the ellipse having a declination of between 30 and 60 arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα >40Å (Hα is the main hydrogen spectral line).
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g>1 and r<21.5 over 60<declination<70, and 200<right ascension<210, on a grid of 2', and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75, output it in a form adequate for visualization.
Q14: Find stars with multiple measurements that have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is where the color ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160<right ascension<170, -25<declination<35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
From talk by Jim Gray (2001)
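To give a feel for what a query like Q1 asks of the database, here is a minimal sketch of a cone search over a tiny in-memory catalog. It reads Q1 as excluding objects flagged as saturated, which is an interpretation. The catalog layout, the field names (type, saturated, ra, dec), and the function names are hypothetical and are not the SDSS schema; the real database answers this with SQL and a spatial index rather than a linear scan.

```python
import math

def angular_separation_arcmin(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcminutes (haversine formula), inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 60.0

def q1_cone_search(catalog, ra0=75.327, dec0=21.023, radius_arcmin=1.0):
    """Unsaturated galaxies within radius_arcmin of (ra0, dec0); linear scan."""
    return [
        obj for obj in catalog
        if obj["type"] == "galaxy"
        and not obj["saturated"]
        and angular_separation_arcmin(obj["ra"], obj["dec"], ra0, dec0) <= radius_arcmin
    ]

# Hypothetical toy catalog, for illustration only.
catalog = [
    {"id": 1, "type": "galaxy", "saturated": False, "ra": 75.330, "dec": 21.030},
    {"id": 2, "type": "galaxy", "saturated": True,  "ra": 75.327, "dec": 21.023},
    {"id": 3, "type": "star",   "saturated": False, "ra": 75.327, "dec": 21.024},
    {"id": 4, "type": "galaxy", "saturated": False, "ra": 75.400, "dec": 21.023},
]
matches = q1_cone_search(catalog)
print([obj["id"] for obj in matches])
```

A linear scan like this is fine for four objects and hopeless for a billion, which is why the organization of the data mattered so much.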

Sky survey Navigate tool lets you browse through the images

Over a billion hits to the SDSS site, leveling off at 150 million per year. Over 2,000,000 SQL queries per month on the database.

Computational Science
Traditional Empirical Science
  Scientist gathers data by direct observation
  Scientist analyzes data
Computational Science
  Data captured by instruments, or data generated by simulator
  Processed by software
  Placed in a database
  Scientist analyzes database
From talk by Jim Gray 10/10/2001

What's needed? (not drawn to scale)
Scientists: Science Data & Questions
Miners: Data Mining Algorithms
Plumbers: Database to store data and execute queries
Question & Answer, Visualization Tools
Slide from talk by Jim Gray 4/10/2002

Astronomy Information Age
Astronomical data is processed without anyone looking at the individual images/spectra.
Astronomers used to classify galaxies by eye. Sometimes a graduate student would classify thousands of galaxies from a computer screen. At three per minute, this might take hours, days, or even weeks of time. The SDSS found 10^8 galaxies. At three per minute, classification would take 63 years at 24 hours per day, seven days per week. The Galaxy Zoo is a project that allows private citizens to look at data by eye, and contribute classifications to scientists.
More data is obtained than anyone can analyze alone (drinking from a fire hose).
The SDSS SkyServer, the Virtual Observatory, Google Sky, and WikiSky are all projects aimed at giving people better access to the data from SDSS.
New surveys, including Pan-STARRS, LSST, Guo Shou Jing (LAMOST), DES, RAVE, SEGUE, HERMES, and WFMOS are planned or in progress, patterned on the success of the Sloan Digital Sky Survey.
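The 63-year figure can be checked with a few lines of arithmetic. This sketch assumes the SDSS galaxy count is 10^8, which is the value that reproduces the quoted number, and a steady three classifications per minute around the clock; the function name is just for illustration.

```python
def classification_years(n_galaxies, rate_per_minute=3.0):
    """Years of nonstop (24/7) work needed to classify n_galaxies by eye."""
    minutes = n_galaxies / rate_per_minute
    minutes_per_year = 60 * 24 * 365.25
    return minutes / minutes_per_year

years = classification_years(1e8)
print(f"{years:.0f} years of continuous classification")
```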

$\rho \propto r^{-\alpha}$, $\alpha \approx 2$ to $3.5$, $r^2 = x^2 + y^2 + (z/q)^2$
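As a rough illustration of a flattened halo model, the sketch below evaluates the flattened radius r^2 = x^2 + y^2 + (z/q)^2 and a power-law density in that radius. The power-law form, the symbol alpha, the normalization rho0, and the function names are assumptions made for the example; q < 1 gives a squashed (oblate) halo, q = 1 a spherical one, and q > 1 a prolate one.

```python
import math

def flattened_radius(x, y, z, q):
    """Radius with the z axis rescaled by the flattening q (q=1 is spherical)."""
    return math.sqrt(x * x + y * y + (z / q) ** 2)

def power_law_density(x, y, z, q, alpha, rho0=1.0):
    """Assumed power-law halo density, rho0 * r**(-alpha), in the flattened radius."""
    return rho0 * flattened_radius(x, y, z, q) ** (-alpha)

# For a squashed halo (q = 0.5), a point 1 unit up the z axis is as
# sparse as a point 2 units out in the plane.
on_axis = power_law_density(0.0, 0.0, 1.0, q=0.5, alpha=3.0)
in_plane = power_law_density(2.0, 0.0, 0.0, q=0.5, alpha=3.0)
print(on_axis, in_plane)
```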

The SDSS survey was funded as an extragalactic project, but Galactic stars could not be completely avoided.

Statistical Photometric Parallax: the use of statistical knowledge of the absolute magnitudes of stellar populations to determine the density distributions of stars.
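The underlying relation is the standard distance modulus, m - M = 5 log10(d / 10 pc). The sketch below shows how an assumed absolute magnitude turns an apparent magnitude into a distance, and how a spread in absolute magnitude within the population becomes a spread in inferred distance. The value M = 4.2 and the 0.6 mag spread are illustrative assumptions, not numbers taken from the slides.

```python
def distance_pc(apparent_mag, absolute_mag):
    """Distance in parsecs from the distance modulus m - M = 5*log10(d/10 pc)."""
    return 10.0 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)

M_ASSUMED = 4.2   # illustrative absolute magnitude for a turnoff star (assumption)
SPREAD = 0.6      # illustrative population spread in magnitudes (assumption)

m = 19.2
d_typical = distance_pc(m, M_ASSUMED)
d_near = distance_pc(m, M_ASSUMED + SPREAD)   # intrinsically fainter star, so closer
d_far = distance_pc(m, M_ASSUMED - SPREAD)    # intrinsically brighter star, so farther
print(f"{d_near / 1000:.1f} to {d_far / 1000:.1f} kpc, typical {d_typical / 1000:.1f} kpc")
```

Because any single star's distance is this uncertain, the method works on the density of the population as a whole rather than on individual stars.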

Figure (Newberg et al. 2002), labeled structures: Vivas overdensity, or Virgo Stellar Stream; Monoceros stream, Stream in the Galactic Plane, Galactic Anticenter Stellar Stream, Canis Major Stream, Argo Navis Stream; Stellar Spheroid?; Sagittarius Dwarf Tidal Stream

Figure panels (Newberg et al. 2002): squashed halo, spherical halo, prolate halo, exponential disk

Kathryn Johnston

David Law

A map of stars in the outer regions of the Milky Way Galaxy, derived from the SDSS images of the northern sky, shown in a Mercator-like projection. The color indicates the distance of the stars, while the intensity indicates the density of stars on the sky. Structures visible in this map include streams of stars torn from the Sagittarius dwarf galaxy, a smaller 'orphan' stream crossing the Sagittarius streams, the 'Monoceros Ring' that encircles the Milky Way disk, trails of stars being stripped from the globular cluster Palomar 5, and excesses of stars found towards the constellations Virgo and Hercules. Circles enclose new Milky Way companions discovered by the SDSS; two of these are faint globular star clusters, while the others are faint dwarf galaxies. Credit: V. Belokurov and the Sloan Digital Sky Survey.

Why is this important?
Small dwarf galaxies are merging with the Milky Way at the present time.
The Milky Way itself was created by a long history of merging smaller galaxies to make larger ones.
The tidal streams are an archeological record of the merger history that created our galaxy.
The tidal streams encode the gravitational potential through which the dwarf galaxy traveled, and can therefore tell us about the distribution of dark matter in the Milky Way.


Fitting model parameters
Previous astronomers fit 3 parameters to the entire stellar halo. We want to fit 20 parameters to each of eighteen 2.5-degree wide stripes = 360 parameters.
The time needed to compute the likelihood increases with the number of stars and with the required accuracy of the calculation.
At four hours per likelihood evaluation, with 50 likelihood calculations per iteration of a conjugate gradient descent method and 50 iterations, 10,000 hours are required to optimize one stripe. This would take more than 400 days on a single processor.
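The cost estimate is a simple product, sketched here with the numbers quoted on the slide; the variable names are just for illustration.

```python
HOURS_PER_LIKELIHOOD = 4        # one likelihood evaluation
LIKELIHOODS_PER_ITERATION = 50  # evaluations per conjugate gradient iteration
ITERATIONS = 50                 # iterations to converge
STRIPES = 18
PARAMETERS_PER_STRIPE = 20

hours_per_stripe = HOURS_PER_LIKELIHOOD * LIKELIHOODS_PER_ITERATION * ITERATIONS
days_per_stripe = hours_per_stripe / 24
total_parameters = STRIPES * PARAMETERS_PER_STRIPE

print(f"{hours_per_stripe} hours per stripe, about {days_per_stripe:.0f} days on one processor")
print(f"{total_parameters} parameters over {STRIPES} stripes")
```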

Began: November 9, 2007
Computing power: 0.5 PetaFLOPS (high over 2 PetaFLOPS)
Number of volunteers (total people): 146,863
Number of computers volunteered (total): 291,944
Number of active volunteers: 25,670
Number of active computers being volunteered: 35,686
Number of volunteers as of 10/4/2012

206 countries (of which 193 are UN members)

Volunteer Computing with 150,000 volunteers:
Let us use their CPUs for scientific calculations
Continuously upgrade their hardware
Populate extensive forum discussions on science, technical support, and, well, anything
Monitor the health of our system (especially our volunteer moderator)
Wrote the first GPU version of our software
Donate money and hardware

Volunteer Computing with 150,000 volunteers also:
Compete with each other for BOINC credits
Become angry if another person or team is getting an unfair number of credits
Return garbage results (which require zero computations) so they can earn credit faster
Insult each other on public forum boards
Link anti-semitic websites to ours

Astronomy students write algorithms.
Algorithms are adapted to run in an asynchronous, heterogeneous, parallel computing environment.
The code is compiled and tested on 16 platforms, including CPUs and GPUs, and attached to the server.
Mechanisms are created to start and end runs.
The MySQL database is maintained.
The MilkyWay@home server sends out jobs to volunteers and collects results.
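Since results come back from untrusted machines, a server in this position needs some way to decide which returned values to believe. One common approach in volunteer computing is to send the same job to more than one volunteer and accept a value only when enough independent reports agree. The sketch below illustrates that idea only; it is not a description of how MilkyWay@home validates results, and the function name, quorum size, and tolerance are hypothetical.

```python
import math

def validate_results(reported, quorum=2, rel_tol=1e-9):
    """Return the agreed likelihood if at least `quorum` reports match, else None.

    `reported` is a list of likelihood values returned by different
    volunteers for the same job (hypothetical data layout).
    """
    for candidate in reported:
        agreeing = [r for r in reported if math.isclose(r, candidate, rel_tol=rel_tol)]
        if len(agreeing) >= quorum:
            return sum(agreeing) / len(agreeing)
    return None

# Two honest volunteers agree; one returned a garbage value.
accepted = validate_results([-3.1234, 0.0, -3.1234])
# Only two reports and they disagree: nothing can be accepted yet.
pending = validate_results([-3.1234, 0.0])
print(accepted, pending)
```

The price of this kind of check is that every job is computed at least twice, and it does not protect against volunteers who return the same garbage value.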

This was originally accomplished with a $750,000 grant shared between astronomy and computer science faculty. But there is no model for maintaining this, since it is no longer an interesting computer science problem, and it is very expensive for an individual astronomy grant.
We need lighter tools.

Data from one stripe
Stream 1 (6 parameters)
Stream 2 (6 parameters)
Stream 3 (6 parameters)
Smooth (3 parameters)
We can fit 20 parameters to each 2.5-degree wide stripe of data. We recently analyzed 18 stripes of data from DR7 (300-400 parameters).

We can compare the position of the stream in the sky (left; Newby et al., submitted) with N-body simulations of Sgr dwarf galaxy disruption (right; Law & Majewski 2010). The stream positions in the left panel are calculated for each 2.5-degree wide stripe.

Polar plots of 1.9 million SDSS F turnoff stars in the north Galactic Cap (top). Using our density model, we place each star in either the Sgr panel (lower left, 160,000 stars with Sgr density) or the non-Sgr panel (lower right, 1.7 million non-Sgr stars), with the probability given by the model. The stars in the Sgr panel are not guaranteed to be from the stream, but they collectively have the spatial properties of the Sgr stream.
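The separation step can be pictured as a weighted coin flip per star. In this minimal sketch each star carries a model probability of belonging to the stream and is assigned to one panel by a random draw. The probabilities here are made up, and the function and variable names are hypothetical; in the actual analysis the probability would come from the fitted density model.

```python
import random

def split_by_probability(p_sgr, seed=0):
    """Assign each star index to the Sgr or non-Sgr list with probability p_sgr[i]."""
    rng = random.Random(seed)
    sgr, non_sgr = [], []
    for index, p in enumerate(p_sgr):
        # rng.random() is uniform on [0, 1), so p=1 always lands in sgr and p=0 never does.
        if rng.random() < p:
            sgr.append(index)
        else:
            non_sgr.append(index)
    return sgr, non_sgr

# Made-up probabilities: 10% of the stars' density is attributed to the stream.
probabilities = [0.1] * 100_000
sgr, non_sgr = split_by_probability(probabilities)
print(len(sgr), len(non_sgr))
```

No individual assignment is reliable, but the set of stars that lands in the stream list follows the stream's density by construction.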

Determining the total mass, lumpiness, and flattening of the Galaxy's dark matter halo
We now want to fit parameters of the Milky Way galaxy and the dwarf galaxies that fell in, by using N-body simulations of the merging and comparing them to the density parameters we measured in the data.
(1) We would like to fit N-body simulations (100,000 particles in the dwarf) instead of orbits (1 particle).
(2) We would like to fit multiple streams at the same time.
(3) We would like to fit distances, velocities, positions, and densities of the streams, and simultaneously fit measurements of the Milky Way's rotation curve.
(4) We need to consider internal properties of the dwarfs.
Since modeling one dwarf requires ~30 minutes on a CPU, this requires substantial computational power. But then, we have MilkyWay@home.

Right now we have a version of the Barnes and Hut (1986) code, with checkpointing, that works across CPU platforms for MilkyWay@home, and we hope it will be running on GPUs sometime within the coming year.
Sample 100,000 particle (sub-sampled above) semi-analytic N-body simulations of the tidal disruption of the Orphan Stream. Fit only the Plummer sphere parameters for the dwarf galaxy.
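For readers unfamiliar with the Plummer sphere, the sketch below draws particle radii from a Plummer density profile by inverting its enclosed-mass fraction, M(<r)/M = r^3 / (r^2 + a^2)^(3/2), which gives r = a / sqrt(X^(-2/3) - 1) for X uniform on (0, 1). This only illustrates how the initial radii of a dwarf galaxy model might be generated; it is not the MilkyWay@home code, the function name and parameter values are hypothetical, and velocities are omitted.

```python
import math
import random
import statistics

def sample_plummer_radii(n, scale_radius, seed=0):
    """Draw n radii from a Plummer sphere with scale radius `scale_radius`."""
    rng = random.Random(seed)
    radii = []
    while len(radii) < n:
        x = rng.random()                     # enclosed mass fraction, uniform on [0, 1)
        if x == 0.0:
            continue                         # X = 0 would put the particle at r = 0 exactly
        denom = x ** (-2.0 / 3.0) - 1.0
        if denom <= 0.0:
            continue                         # X rounds to 1: radius would be infinite
        radii.append(scale_radius / math.sqrt(denom))
    return radii

radii = sample_plummer_radii(20_000, scale_radius=1.0)
# The half-mass radius of a Plummer sphere is a / sqrt(2**(2/3) - 1), about 1.3 a.
expected_half_mass = 1.0 / math.sqrt(2.0 ** (2.0 / 3.0) - 1.0)
print(statistics.median(radii), expected_half_mass)
```

The median of the sampled radii should land close to the analytic half-mass radius, which is a quick sanity check on the sampler.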