Migrating a (Large) Science Database to the Cloud



Similar documents
Libraries and Large Data

Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research

Cross-Matching Very Large Datasets

Software challenges in the implementation of large surveys: the case of J-PAS

An ArrayLibraryforMS SQL Server

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Low-Power Amdahl-Balanced Blades for Data-Intensive Computing

ArcGIS for Server in the Amazon Cloud. Michele Lundeen Esri

Amazon Elastic Compute Cloud Getting Started Guide. My experience

Large science databases are cloud services ready for them?

Use of Cloud Computing for scalable geospatial data processing and access

SQL Azure vs. SQL Server

Intro to Sessions 3 & 4: Data Management & Data Analysis. Bob Mann Wide-Field Astronomy Unit University of Edinburgh

Data Management Plan Extended Baryon Oscillation Spectroscopic Survey

A Very Brief Introduction To Cloud Computing. Jens Vöckler, Gideon Juve, Ewa Deelman, G. Bruce Berriman

ur skills.com

RDS Migration Tool Customer FAQ Updated 7/23/2015

Cloud computing - Architecting in the cloud

Last time. Today. IaaS Providers. Amazon Web Services, overview

How To Choose Between A Relational Database Service From Aws.Com

Scalable Application. Mikalai Alimenkou

March Lynn Langit twitter

Leveraging Public Clouds to Ensure Data Availability

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

SQL Server 2016 New Features!

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

Introduction to Database Systems CSE 444

USER CONFERENCE 2011 SAN FRANCISCO APRIL Running MarkLogic in the Cloud DEVELOPER LOUNGE LAB

Amazon Web Services Student Tutorial

National Center for Education Statistics. Amazon Hosted ESRI ArcGIS Servers Project Final Report

The Legacy Value of Large Public Surveys: the SDSS Archive. Alexander Szalay The Johns Hopkins University

Elastic Detector on Amazon Web Services (AWS) User Guide v5

Creating A Galactic Plane Atlas With Amazon Web Services

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

SQL Server What s New? Christopher Speer. Technology Solution Specialist (SQL Server, BizTalk Server, Power BI, Azure) v-cspeer@microsoft.

SUSE Manager in the Public Cloud. SUSE Manager Server in the Public Cloud

Amazon Elastic Beanstalk

Composable cost estimation and monitoring for computational applications in cloud computing environments

Vol. 10, No. 1 January/February 2008

Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Resource Sizing: Spotfire for AWS

Online Backup Guide for the Amazon Cloud: How to Setup your Online Backup Service using Vembu StoreGrid Backup Virtual Appliance on the Amazon Cloud

McAfee Public Cloud Server Security Suite

The age of Big Data is rapidly changing

AWS Schema Conversion Tool. User Guide Version 1.0

GeoCloud Project Report GEOSS Clearinghouse

Cloud Computing and Amazon Web Services. CJUG March, 2009 Tom Malaher

Every Silver Lining Has a Vault in the Cloud

OTM in the Cloud. Ryan Haney

Storage and Disaster Recovery

Web Application Hosting in the AWS Cloud Best Practices

Cloud Computing. Adam Barker

Cloudera Manager Training: Hands-On Exercises

Migration Scenario: Migrating Batch Processes to the AWS Cloud

Distributed Data Parallel Computing: The Sector Perspective on Big Data

Building a SQL Server Test Lab. Ted Krueger SQL Server MVP Data Architect

How AWS Pricing Works

Cloud Computing For Bioinformatics

MS SQL Server 2014 New Features and Database Administration

An Introduction to Cloud Computing Concepts

Amazon EC2 Product Details Page 1 of 5

Public Cloud Offerings and Private Cloud Options. Week 2 Lecture 4. M. Ali Babar

Why back up the Cloud?

Visualization of Semantic Windows with SciDB Integration

Cloud Models and Platforms

Description of Application

Best Practices for Using MySQL in the Cloud

Cloud computing is a marketing term that means different things to different people. In this presentation, we look at the pros and cons of using

The World-Wide Telescope, an Archetype for Online Science

MANAGING AND MINING THE LSST DATA SETS

OpenTOSCA Release v1.1. Contact: Documentation Version: March 11, 2014 Current version:

Postgres Plus Cloud Database!

WaterfordTechnologies.com Frequently Asked Questions

Deployment of Intersystems Caché with GUMS on Amazon EC2

Data analysis of L2-L3 products

The Inside Scoop on Hadoop

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

Microsoft SQL Server 2008 Administrator's Pocket Consultant

Deploying System Center 2012 R2 Configuration Manager

Drobo How-To Guide. Cloud Storage Using Amazon Storage Gateway with Drobo iscsi SAN

Deployment Options for Microsoft Hyper-V Server

Viswanath Nandigam Sriram Krishnan Chaitan Baru

A programming model in Cloud: MapReduce

Chapter 3 Cloud Infrastructure. Cloud Computing: Theory and Practice. 1

What is Analytic Infrastructure and Why Should You Care?

Introduction to Database Systems CSE 444. Lecture 24: Databases as a Service

Big Data on Microsoft Platform

The Research Data Revolution Harvard/Purdue Data Symposium Sayeed Choudhury

Web Application Deployment in the Cloud Using Amazon Web Services From Infancy to Maturity

Hyper-V Replica Essentials

Deploying for Success on the Cloud: EBS on Amazon VPC Session ID#11312

Big data, big deal? Presented by: David Stanley, CTO PCI Geomatics

Cloud Computing for Education Workshop

Big data blue print for cloud architecture

Oracle Public Cloud. Peter Schmidt Principal Sales Consultant Oracle Deutschland BV & CO KG

Amazon Relational Database Service (RDS)

Cloud Computing. Chapter 1 Introducing Cloud Computing

Daniel J. Adabi. Workshop presentation by Lukas Probst

This presentation provides an overview of the architecture of the IBM Workload Deployer product.

W I S E. SQL Server 2012 Database Engine Technical Update WISE LTD.

Transcription:

The Sloan Digital Sky Survey Migrating a (Large) Science Database to the Cloud Ani Thakar Alex Szalay Center for Astrophysical Sciences and Institute for Data Intensive Engineering and Science (IDIES) The Johns Hopkins University

Motivation Exponential growth in scientific data size Astronomy no exception, SDSS leading the way Large scientific databases in the cloud large = multi-tb, database = SQL Server On-demand resources very attractive to scientists Don t want to become FT data managers Most projects don t have have deep IT pockets Focus on migrating data in this talk Is it easy? Is it even possible? Performance evaluation premature at best Economics even more so ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 2

Data Avalanche in Astronomy Moore s Law squared 2009 Physics Nobels: CCD technology Optical fiber technology Both were critical for SDSS, new generation of astronomy surveys SDSS now few TB Pan-STARRS next 50x SDSS, 100 TB LSST later One SDSS every 2-3 nights, PBs ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 3

Cloudy with a chance of pain? Can entire DB be moved to the cloud as is? Restrictions on data size For now, use subset of original database In future, will have to partition the data first Migrating the data to the cloud How to copy data into the cloud? Will it need to be extracted from the DBMS? Will schema have to be modified? Will science functionality be compromised? What about query performance? ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 4

Cloud Experiments Dataset Sloan Digital Sky Survey (SDSS) Public dataset, no restrictions Reasonably complex schema, usage model Easy to generate subset of arbitrary size Only available as SQL Server database Clouds Amazon Elastic Cloud 2 (EC2) AWS wanted to host public SDSS dataset Microsoft Azure Natural fit for SQL Server databases ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 5

The Sloan Digital Sky Survey Largest astronomy archive to date Digital mapping of quarter of sky 10000 sq. degrees, 5 wavelength bands Dedicated 2.5m telescope (APO, NM) 40 TB raw data International collaboration Funded by: Sloan Foundation, NSF, NASA, DOE Final data release: DR7 (Oct 31, 2008) 357M objects (180M stars, 175M galaxies) 1.6M spectra (~ 1M galaxies, 500k stars, 120k QSOs) Extracted catalog science archive: 5 TB SDSS Catalog Archive Server (CAS) Microsoft SQL Server DBMS SkyServer Web Interface SDSS Sky Coverage www.sdss.org ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 6

SDSS Catalog DB Schema ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 7

SDSS UDFs and SPs 190 user-defined functions(udfs) Utility (coordinated conversions, etc.) Spatial search (invoke HTM indexing primitives) UI (Schema Browser, Help pages, etc.) Specialty (region algebra, cosmology, etc.) Admin (not accessible to Web users) 135 stored procedures (SPs) Mostly admin SPs (~ 100) Some user accessible SPs for nearest neighbor searches, special queries etc. ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 8

Migrating data to Amazon EC2 1 TB data size limit (per instance) SDSS DR6 100 GB subset: BestDR6_100GB Actually more like 150 GB Large enough for performance tests, small enough to be migrated in a few days/weeks Several manual steps to create DB instance Could not connect to DB from outside Preliminary performance test, but without optimizing within cloud ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 9

SDSS public dataset on EC2 Public datasets on Amazon Stored as snapshot available on AWS Advertised on AWS blog Anyone can pull into their account Data is free, but not the usage Create a running instance Multiple instances deployed manually First SQL Server dataset on EC2 (?) AWS also created a LINUX snapshot of SQL Server DB! ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 10

AWS Blog Entry http://aws.typepad.com/ ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 11

How it s supposed to work With other db dumps, assumption is that users will set up their own DB and import the data Public SDSS dataset to serve two kinds of users: a) People who currently access the SDSS CAS b) General AWS users who are interested in the data For a), should be able to replicate the same services that SDSS currently has, but using a SQL Server instance on EC2 For b), users should have everything they need on the AWS public dataset page for SDSS ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 12

Steps to create DB in EC2 Create snapshot of database first Create storage for DB: 200 GB EBS volume Instantiate snapshot as volume of required size Create SQL Server 2005 Amazon Machine Image (AMI) AMI instance from snapshot Attach AMI instance to EBS volume Creates running instance of DB Get Elastic IP to point to instance ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 13

Steps to create Web interface Create volume (small) with win2003 Create instance with Win 2003 AMI (only IIS, no SQL Server) Attach volume to instance Get public DNS, admin account BUT couldn t connect to SQL Server IP So outside world cannot connect to the data as yet ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 14

Steps to Create DB Server Amazon EC2 AWS Management Console 200 GB Volume Attach Instance Create EBS Volume Create Instance Select AMI Create Elastic IP SDSS DB Server Repeat for each instance of DB server ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 15

Comparing Cost & Performance 1200 GrayWulf GW_CPUvs EC2 1000 35 test queries 800 600 Run 3 times Measure CPU/elapsed time, phys. IO, #rows Compare performance on Graywulf node 400 200 Virtualization and/or infrastructure? Bottom line: 1.5h on GW node, 13h on EC2 Cost on Amazon 0 EC2_CPU log(gw_io) GW_Elapsed EC2_Elapsed log(ec2_io) Large volume SQL Server instance: ~ $500 Charged for as long as instance is active 8 7 6 5 4 3 2 1 0 ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 16

SDSS Production Cluster at FNAL ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 17

Migrating data to Microsoft Azure 10 GB data size limit ( 50 GB last week!) SDSS DR6 10 GB subset: BestDR6_10GB Two ways to migrate database Script it out of source db (very painful) Many options to select Runs out of memory on server! Use SQL Azure Migration Wizard (much better!) Runs for few hours Produces huge trace of errors, many items skipped But does produce a working db in Azure ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 18

SQL Azure Migration Wizard ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 19

Unsupported MS-SQL features No references to other databases Can t run command (including SQL) scripts from Master DB Global temp objects disallowed Can t use performance counters in test query suite Difficult to benchmark and compare performance SQL-CLR function bindings not supported Can t use our HTM spatial indexing library Makes spatial searches much slower ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 20

Unsupported MS-SQL features T-SQL directives e.g., to set the level of parallelism Built-in T-SQL functions Probably can do without these for now More for convenience Deprecated features Lose the SQL Server 2000 baggage ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 21

Azure sign of success? Ok, so data in Azure, but at what cost?! Meaningful performance comparison not possible Dataset too small Schema features stripped No spatial index, so many queries crippled Connected to Azure DB from outside cloud Hooked up SkyServer WI Connected with SQL Server client (SSMS) Ran simple queries from both ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 22

Conclusions Migrating scientific databases to the cloud not really possible at the moment Even migrating smaller DBs can be painful Several steps to deploy each copy Cloud may not support full functionality Problems limited to SQL Server DBs? Large DBs will have to be partitioned Set up distributed databases in the cloud Query distributed databases from outside Haven t even talked about economics yet ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 23

The one who says it can t be done should never stop the one who s doing it. Anon. Hope to hear more positive experiences of others! ScienceCloud 2010, June 21, Chicago Ani Thakar, JHU 24

Thank you! ScienceCloud2010, June 21, Chicago Ani Thakar, JHU 25