Cloud Computing for Research Roger Barga Cloud Computing Futures, Microsoft Research



Similar documents
Dennis Gannon Cloud Computing Futures extreme Computing Group Microsoft Research

Dennis Gannon Cloud Computing Futures extreme Computing Group Microsoft Research


Building Cloud Applications to Support Research: the Microsoft Azure Research Engagement Project

Enabling Science in the Cloud: A Remote Sensing Data Processing Service for Environmental Science Analysis

Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson,Nelson Araujo, Dennis Gannon, Wei Lu, and

Data Semantics Aware Cloud for High Performance Analytics

How To Handle Big Data With A Data Scientist

Chapter 4 Cloud Computing Applications and Paradigms. Cloud Computing: Theory and Practice. 1

Cloud Computing with Microsoft Azure

SQL Server 2012 Performance White Paper

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Windows Azure Platform

SQL Server In-Memory by Design. Anu Ganesan August 8, 2014

HIGH-SPEED BRIDGE TO CLOUD STORAGE

Cloud Computing Is In Your Future

MS 20487A Developing Windows Azure and Web Services

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

WINDOWS AZURE DATA MANAGEMENT

Enterprise Architectures for Large Tiled Basemap Projects. Tommy Fauvell

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Assignment # 1 (Cloud Computing Security)

Windows Azure Data Services (basics) 55093A; 3 Days

CLOUD COMPUTING & WINDOWS AZURE

The MODIS online archive and on-demand processing

Scaling Analysis Services in the Cloud

Big Data on Microsoft Platform

Cloud Computing for Fundamental Spatial Operations on Polygonal GIS Data

Data Warehouse in the Cloud Marketing or Reality? Alexei Khalyako Sr. Program Manager Windows Azure Customer Advisory Team

Building COBOL applications for Microsoft Azure. Jim Lane Senior Solution Engineer

System Models for Distributed and Cloud Computing

In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps. Yu Su, Yi Wang, Gagan Agrawal The Ohio State University

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

Enterprise GIS Architecture Deployment Options. Andrew Sakowicz

Ground up Introduction to In-Memory Data (Grids)

Learn. Connect. Explore.

In Memory Accelerator for MongoDB

Data-centric Renovation of Scientific Workflow in the Age of Big Data

Information Processing, Big Data, and the Cloud

Oracle Database In-Memory The Next Big Thing

Big Data Challenges in Bioinformatics

The Berkeley AMPLab - Collaborative Big Data Research

This module provides an overview of service and cloud technologies using the Microsoft.NET Framework and the Windows Azure cloud.

SharePoint & Azure: Digital Asset Management

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Understanding the Benefits of IBM SPSS Statistics Server

Data-Intensive Science and Scientific Data Infrastructure

Data Refinery with Big Data Aspects

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Overview: X5 Generation Database Machines

Astrophysics with Terabyte Datasets. Alex Szalay, JHU and Jim Gray, Microsoft Research

With DDN Big Data Storage

Cloud Computing Trends

Chapter 7. Using Hadoop Cluster and MapReduce

EMC XtremSF: Delivering Next Generation Performance for Oracle Database

Enterprise Applications

PACE Predictive Analytics Center of San Diego Supercomputer Center, UCSD. Natasha Balac, Ph.D.

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

NetApp Big Content Solutions: Agile Infrastructure for Big Data

EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server

Big Data Analytics. Chances and Challenges. Volker Markl

Storage I/O Control: Proportional Allocation of Shared Storage Resources

Developing Microsoft Azure Solutions 20532A; 5 days

Enabling Real-Time Sharing and Synchronization over the WAN

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

bigdata Managing Scale in Ontological Systems

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Application Migration Best Practices. Gregory Shepard Senior Consultant InCycle Software

Comparison of Windows IaaS Environments

Advanced Big Data Analytics with R and Hadoop

Pre-Talk Talk. What does ESS look like as more of this CI arrives?

Benchmarking Cassandra on Violin

Binary search tree with SIMD bandwidth optimization using SSE

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010

Partner Camp Leistungsstarkes Log-Management für physische, virtuelle und cloud-basierte Umgebungen. Tomas Baublys

Traditional v/s CONVRGD

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Big Data Management in the Clouds and HPC Systems

ENZO UNIFIED SOLVES THE CHALLENGES OF OUT-OF-BAND SQL SERVER PROCESSING

Testing & Assuring Mobile End User Experience Before Production. Neotys

Big Data Mining Services and Knowledge Discovery Applications on Clouds

Big Data in the context of Preservation and Value Adding

NextGen Infrastructure for Big DATA Analytics.

Milestone Solution Partner IT Infrastructure MTP Certification Report Scality RING Software-Defined Storage

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

Scaling Database Performance in Azure

Big Data Executive Survey

Big Data a threat or a chance?

Wrangler: A New Generation of Data-intensive Supercomputing. Christopher Jordan, Siva Kulasekaran, Niall Gaffney

Big Data Systems CS 5965/6965 FALL 2014

KPACK: SQL Capacity Monitoring

White Paper. Version 1.2 May 2015 RAID Incorporated

Big Data With Hadoop

Ariett Purchasing & Expense Management. Go Paperless, Go Mobile, Go Easy.

Building a cost-effective and high-performing public cloud. Sander Cruiming, founder Cloud Provider

Windows Azure Storage Scaling Cloud Storage Andrew Edwards Microsoft

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

Obtaining and Processing MODIS Data

AzureBlast: A Case Study of Developing Science Applications on the Cloud

Transcription:

Cloud Computing for Research Roger Barga Cloud Computing Futures, Microsoft Research

Trends: Data on an Exponential Scale Scientific data doubles every year Combination of inexpensive sensors + exponentially faster computing Cost of acquiring data has dropped close to zero Changes the nature of research computing Cuts across disciplines (escience) Becoming increasingly harder to extract knowledge Got data, now what? And it is really is about data, not the FLOPS (going faster) Very extended distribution: data sets on all scales! When data collection does grow large, not able to analyze. Tools are limited, must dedicate resources to build analysis tools (and this doesn t help complete the actual research).

Trend: Lack of Broad Access 1M 14M 70M 55M Little to no access to high performance data-intensive Computing capacity Scientists & Engineers 53% of technical organizations are forced to scale down their advanced problems to fit within their technology limitations, and 57% of companies have problems they can t solve with existing computers. US Council on Competitiveness, 2010.

Widespread On Demand Access High Performance Cloud Computing for Data-intensive Compute Capacity and Data Intensive Research 1M 80% 14M 20% 70M 55M Little to no access to high performance data-intensive Computing capacity Scientists & Engineers

Bridging the Gap with the Cloud With cloud computing, researchers can: Marshal needed storage and compute resources on demand without knowing or caring how it happens Access curated reference datasets, in the cloud. Run key algorithms as Software as a Service without having to know coding details or how to install software Collaborate and share data and algorithms. Use familiar client tools, leverage the cloud as a resource and intellectual amplifier. Offer new (higher value) data services for research Scalable analytics to explore data sets Interactive data visualizations Result provenance

Suggested Application Model Using queues for reliable messaging To scale, add more of either 1) Receive work Web Role ASP.NET, WCF, etc. Worker Role main() { } 4) Do work 2) Put work in queue 3) Get work from queue Queue

Storage Service in Windows Azure Azure applications can use Azure storage services or SQL Azure Blobs: large, unstructured data (audio, video, etc) Tables: simply structured data, accessed using ADO.NET Data Services Queues: serially accessed messages or requests, allowing web-roles and worker-roles to interact SQL Azure

Modern Data Center Network Internet Data Center Layer 3 CR Internet AR AR AR AR CR Layer 2 LB S S LB 10 GigE GigE S S S S Key: CR (L3 Border Router) AR (L3 Access Router) S (L2 Switch) LB (Load Balancer) A (20 Server Rack/TOR) Source: Albert Greenberg and Cisco

Cumulative Percentage Cumulative Percentage Worker Role TCP-Endpoints Latency 100% Bandwidth 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 0 20 40 60 80 100 RTT ms MB/sec 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

http://azurescope.cloudapp.net/ > Azure service, Silverlight frontend, data archived in SQL Azure > Microbenchmarks, algorithm benchmarks, best practices > Includes utility to measure local to cloud throughput (download) > Run weekly on datacenters across the world > New version(s) in development pipeline Your input welcome

Demo AzureBLAST Bioinformatics in the Cloud Most important app in bioinformatics, exponential data growth Research groups are unable to secure sufficient compute resources External research group, blasted ~5K proteins (700K sequences) Normal BLAST job, between 700 ~1000 CPU hours, couldn t run on NCBI 3 day run was reduced to 30 min, a 30 minutes reduced to 30 sec Total cost $100, publishable result after one hour of cloud computing Reduced a weeks work of research down to one afternoon of computing All-against-all non-redundant protein database (10m sequences) Theoretically 100 billion sequence comparisons, 6 years on 8 core node One of the largest BLAST jobs completed to date; 1.85 billion results

35 Nodes experience blob writing failure at same time The $24,000 Slide Deck 4000 cores (500 XL VMs), running for over 7 days

Application Benchmarks Inform Design Task size vs. Performance Benefit of the warm cache effect 100 sequences per partition is best choice Instance size vs. Performance Super-linear speedup with larger size worker instances Primarily due to the memory capability. Task Size/Instance Size vs. Cost Extra-large instance generated the best and the most economical throughput Fully utilize the resource

AzureMODIS Computing Evapotranspiration (ET) in the Cloud Project with UC Berkeley, Univ Virginia, Indiana University, and LBNL Two MODIS satellites Terra, launched 12/1999 Aqua, launched 05/2002 Near polar orbits Global coverage two days Sensitive in 36 spectral bands

AzureMODIS Four stage image processing pipeline http://research.microsoft.com/en-us/projects/azure/azuremodis.aspx Pipeline for download, processing, and reduction of satellite imagery. ~35 different data products; Atmospheric and land products are in different projections; Re-projection, spatial temporal resolve. Integrate data from different swaths, days, 5 TB data processed (600,000 files) 35,000 hours reprojection stage 12,000 hours derivation reduction stage 3,000 hours analysis state Total cost for computing one US year ET computation: ~ $2000.

Supporting Smart Sensors and Data Fusion in collaboration with Ed Lazowska, Bill Howe, Keith Grochow & Mark Stoermer, U of Washington NSF Ocean Observing Initiative Hundreds of cabled sensors and robots exploring the sea floor Data to be collected, curated, mined Demo

Conclusions Clouds are the largest scale computer centers ever constructed and have the potential to be important to both large and small scale research. Suitable for loosely coupled data parallel applications, but tightly coupled low-latency applications perform poorly on clouds today. Landscape fast changing, research computing market is significant Provide valuable fault tolerance and scalability abstractions Use patterns enormously benefit from databases and data services Rapidly extract small subsets of large data sets Compute aggregates, perform compression, etc. Fast sequential read performance is criticial. Software is becoming a new kind instrument Value added federated data sets Simulations Hierarchical data replication

Questions? barga@microsoft.com http://research.microsoft.com/en-us/people/barga/ Microsoft Research extreme Computing: research.microsoft.com/en-us/labs/xcg/default.aspx Microsoft Research Cloud Research Team: www.research.microsoft.com/cloud The Fourth Paradigm: www.research.microsoft.com/fourthparadigm Microsoft Cloud Services: www.microsoft.com/cloud