Distributed Data Parallel Computing: The Sector Perspective on Big Data



Robert Grossman
July 25, 2010
Laboratory for Advanced Computing, University of Illinois at Chicago
Open Data Group
Institute for Genomics & Systems Biology, University of Chicago

Part 1.

Open Cloud Testbed
9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s
Networks: CENIC, C-Wave, MREN, Dragon
Software: Hadoop, Sector/Sphere, Thrift, KVM VMs, Nova, Eucalyptus VMs

Open Science Data Cloud
NSF OSDC PIRE Project: working with 5 international partners (all connected with 10 Gbps networks).
Includes the Sky Cloud and Bionimbus (biology & health care).

[Chart: variety of analysis vs. data size. A scientist with a laptop: wide variety of analysis, small data, no infrastructure. Open Science Data Cloud: medium variety, medium to large data, general infrastructure. High energy physics and astronomy: low variety, very large data, dedicated infrastructure.]

Part 2. What's Different About Data Center Computing?

Data center scale computing provides storage and computational resources at the scale and with the reliability of a data center.

A very nice recent book on this topic is Barroso and Hölzle, The Datacenter as a Computer.

Scale is new

Elastic, Usage-Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.

Simplicity of the Parallel Programming Framework Is New
A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.

Three regimes, three goals:
HPC: minimize latency and control heat.
Large data clouds: maximize data (with matching compute) and control cost.
Elastic clouds: minimize the cost of virtualized machines & provide them on demand.

[Timeline figure: 1609, 30x (experimental science); 1670, 250x; 1976, 10x-100x (simulation science); 2003, 10x-100x (data science).]

Databases vs. Data Clouds
Scalability: 100s of TB vs. 100s of PB.
Functionality: full SQL-based queries, including joins, vs. single keys.
Optimization & consistency: databases are optimized for safe writes with ACID (Atomicity, Consistency, Isolation & Durability); data clouds are optimized for efficient reads with eventual consistency.
Parallelism: difficult in databases because of the ACID model (shared nothing is possible); data clouds parallelize over commodity components.
Scale: racks vs. a data center.

Grids vs. Clouds
Problem: too few cycles vs. too many users & too much data.
Infrastructure: clusters and supercomputers vs. data centers.
Architecture: federated virtual organizations vs. hosted organizations.
Programming model: powerful but difficult to use vs. not as powerful but easy to use.

Part 3. How Do You Program a Data Center?

How Do You Build a Data Center?
Containers are used by Google, Microsoft & others; a data center consists of 10-60+ containers.
[Photo: Microsoft Data Center, Northlake, Illinois]

What Is the Operating System?
[Diagram: a data center operating system managing VM 1 through VM 50,000, by analogy with a workstation operating system.]
Data center services include: VM management services, VM fail-over and restart, security services, power management services, etc.

Architectural Models: How Do You Fill a Data Center?
[Diagram: applications run either on on-demand computing instances or on large data cloud services: Cloud Data Services (BigTable, etc.) and quasi-relational data services, over Cloud Compute Services (MapReduce & generalizations), over Cloud Storage Services.]

Instances, Services & Frameworks
[Diagram: systems arranged on two axes, from operating system to instance (IaaS) to service to framework (PaaS), and from single instance to many instances. Examples: Amazon's EC2, Amazon's SQS, S3, Microsoft Azure, Azure Services, Google AppEngine, Hadoop DFS & MapReduce, VMware VMotion.]

Some Programming Models for Data Centers
Operations over a data center of disks:
- MapReduce ("string-based")
- Iterative MapReduce (Twister)
- DryadLINQ
- User-defined functions (UDFs) over the data center
- SQL and quasi-SQL over the data center
- Data analysis / statistics functions over the data center

More Programming Models
Operations over a data center of memory:
- Memcached (distributed in-memory key-value store)
- Grep over distributed memory
- UDFs over distributed memory
- SQL and quasi-SQL over distributed memory
- Data analysis / statistics over distributed memory

Part 4. Stacks for Big Data

The Google Data Stack
- The Google File System (2003)
- MapReduce: Simplified Data Processing (2004)
- BigTable: A Distributed Storage System (2006)

Map-Reduce Example
Input is a file with one document per record. The user specifies the map function: key = document URL, value = terms the document contains. For example:
(doc.cdickens, "it was the best of times") → map → (it, 1), (was, 1), (the, 1), (best, 1)

Example (cont'd)
The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase), and the user-defined reduce function combines all the values associated with a given key:
key = it, values = 1, 1; key = was, values = 1, 1; key = best, values = 1; key = worst, values = 1
→ reduce → (it, 2), (was, 2), (best, 1), (worst, 1)
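To make the example concrete, here is a minimal word-count sketch in the style of Hadoop streaming (the same mechanism used with Python in the MalStone benchmark later in this talk). It is an illustrative sketch, not code from Hadoop or Sector; the mapper and reducer simply exchange tab-separated key/value lines on stdin/stdout.

```python
#!/usr/bin/env python
# Word count in the Hadoop streaming style: run with "map" to emit
# (term, 1) pairs, or with "reduce" on sorted mapper output.
import sys
import itertools

def mapper(stream):
    # Emit (term, 1) for every term in every document record.
    for line in stream:
        for term in line.strip().split():
            print("%s\t1" % term)

def reducer(stream):
    # The shuffle/sort phase delivers pairs sorted by key, so all
    # values for one key are adjacent and can be summed in one pass.
    pairs = (line.strip().split("\t", 1) for line in stream)
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        print("%s\t%d" % (key, sum(int(v) for _, v in group)))

if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)
```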

Applying MapReduce to the Data in a Storage Cloud
[Diagram: map/shuffle, then reduce, applied across the storage cloud.]

Google's Large Data Cloud (Google's Stack)
- Applications
- Compute services: Google's MapReduce
- Data services: Google's BigTable
- Storage services: Google File System (GFS)

Hadoop's Large Data Cloud (Hadoop's Stack)
- Applications
- Compute services: Hadoop's MapReduce
- Data services: NoSQL databases
- Storage services: Hadoop Distributed File System (HDFS)

Amazon Style Data Cloud
[Diagram: a load balancer in front of groups of EC2 instances, with the Simple Queue Service and SDB alongside, and S3 storage services underneath.]

Evolution of NoSQL Databases
The standard architecture for simple web apps is a front end of load-balanced web servers, a business logic layer in the middle, and a backend database. Databases, however, do not scale well to very large numbers of users or very large amounts of data. Alternatives include sharded (partitioned) databases, master-slave databases, and memcached (see the sketch below).
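As a sketch of the sharding alternative, the snippet below routes each key to one of several database shards by hashing the key; the shard endpoints and shard count are hypothetical placeholders, and real deployments typically add replication and consistent hashing on top of this idea.

```python
import hashlib

# Hypothetical shard endpoints; in practice, connections to
# separate database servers.
SHARDS = ["db0.example.com", "db1.example.com",
          "db2.example.com", "db3.example.com"]

def shard_for(key: str) -> str:
    # Hash the key and map it onto one shard. Every client using the
    # same hash sees the same key -> shard assignment, so reads and
    # writes for a given key always go to the same server.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for("user:42"))  # prints one of the four shard endpoints
```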

NoSQL Systems
The name suggests "no SQL support," but also "not only SQL." One or more of the ACID properties is not supported, joins are generally not supported, and schemas are usually flexible. Some well known examples: Google's BigTable, Amazon's S3 & Facebook's Cassandra, along with several recent open source systems.

Different Types of NoSQL Systems
- Distributed key-value systems: Amazon's S3 key-value store (Dynamo), Voldemort
- Column-based systems: BigTable, HBase, Cassandra
- Document-based systems: CouchDB

Cassandra vs. MySQL Comparison (> 50 GB of data)
MySQL: writes average ~300 ms; reads average ~350 ms.
Cassandra: writes average 0.12 ms; reads average 15 ms.
Source: Avinash Lakshman & Prashant Malik, "Cassandra: Structured Storage System over a P2P Network," static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf

CAP Theorem
Proposed by Eric Brewer, 2000. Of the three properties of a shared-data system (consistency, availability, and partition tolerance), you can have at most two. Since scale-out requires partitions, most large web-based systems choose availability over consistency. References: Brewer, PODC 2000; Gilbert & Lynch, SIGACT News 2002.

Eventual Consistency
All updates eventually propagate through the system, and all nodes will eventually be consistent (assuming no further updates); eventually, a node is either updated or removed from service. Eventual consistency can be implemented with a gossip protocol, an approach popularized by Amazon's Dynamo. Sometimes this is called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID.
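A minimal illustration of the gossip idea, under strong simplifying assumptions (one key, last-writer-wins by timestamp, no failures); this is a sketch of the concept, not of Dynamo's actual protocol.

```python
import random

class Node:
    # Each replica stores (value, timestamp) for a single key.
    def __init__(self):
        self.value, self.ts = None, 0

    def merge(self, value, ts):
        # Last-writer-wins: keep the update with the newest timestamp.
        if ts > self.ts:
            self.value, self.ts = value, ts

def gossip_round(nodes):
    # Each node pushes its state to one randomly chosen peer.
    for node in nodes:
        random.choice(nodes).merge(node.value, node.ts)

nodes = [Node() for _ in range(8)]
nodes[0].merge("v1", ts=1)             # an update lands on one replica
while len({n.ts for n in nodes}) > 1:
    gossip_round(nodes)                # gossip until every replica agrees
print(all(n.value == "v1" for n in nodes))  # True: eventually consistent
```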

Part 5. Sector Architecture

Design Objectives
1. Provide Internet-scale data storage for large data: support multiple data centers connected by high speed wide area networks.
2. Simplify data intensive computing for a larger class of problems than MapReduce covers: support applying user-defined functions to the data managed by a storage cloud, with transparent load balancing and fault tolerance.

Sector's Large Data Cloud (Sector's Stack)
- Applications
- Compute services: Sphere's UDFs
- Data services
- Storage services: Sector's Distributed File System (SDFS)
- Routing & transport services: UDP-based Data Transport Protocol (UDT)

Apply User-Defined Functions (UDFs) to Files in a Storage Cloud
[Diagram: the map/shuffle and reduce stages generalized as UDFs applied to files; a sketch of the model follows.]
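To convey the processing model (and not Sector's actual C++ API), here is a sketch of applying an arbitrary UDF to each input segment and pushing results into bucket files, which is how Sphere generalizes map and shuffle; all names here are illustrative.

```python
from collections import defaultdict

def apply_udf(segments, udf):
    # Apply the UDF to every input segment; the UDF yields
    # (bucket_id, record) pairs that are appended to bucket files.
    buckets = defaultdict(list)
    for seg in segments:                    # in Sector, each segment is
        for bucket_id, record in udf(seg):  # processed on a node near the data
            buckets[bucket_id].append(record)
    return buckets

def word_bucket_udf(segment, n_buckets=4):
    # Example UDF: route each word to a bucket by hash, a shuffle-like
    # UDF; any user-defined function could be substituted here.
    for word in segment.split():
        yield hash(word) % n_buckets, word

out = apply_udf(["it was the best of times",
                 "it was the worst of times"], word_bucket_udf)
```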

UDT (udt.sourceforge.net)
UDT has been downloaded 25,000+ times. [Logos: Sterling Commerce, Movie2Me, Globus, Nifty TV, Power Folder.]

Alternatives to TCP
AIMD protocols increase the packet sending rate x additively and decrease it multiplicatively: x ← x + α(x) per acknowledged interval, and x ← (1 − β)·x on a loss, where α(x) is the increase function and β the decrease factor. AIMD (TCP NewReno), HighSpeed TCP, Scalable TCP, and UDT differ in their choices of α(x) and β.
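A toy illustration of the AIMD family, using TCP NewReno's commonly cited parameters (α = 1 per round trip, β = 1/2) as an assumed example; UDT and the high speed TCP variants plug in different α(x) and β.

```python
def aimd_step(rate, loss, alpha=1.0, beta=0.5):
    # One AIMD update: additive increase on success, multiplicative
    # decrease on loss. alpha=1, beta=0.5 roughly matches TCP NewReno;
    # UDT, Scalable TCP, and HighSpeed TCP use other alpha(x) and beta.
    if loss:
        return (1.0 - beta) * rate
    return rate + alpha

rate = 10.0
for loss_event in [False, False, False, True, False]:
    rate = aimd_step(rate, loss_event)
    print(round(rate, 2))   # 11.0, 12.0, 13.0, 6.5, 7.5
```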

System Architecture
[Diagram: a security server (user accounts, data protection, system security) and the masters (metadata, scheduling, service provider) communicate over SSL; clients (system access tools, application programming interfaces) also connect to the masters over SSL; data moves between clients and slaves over UDT, with optional encryption; the slaves provide storage and processing.]

Storage Cloud Programming Models: Hadoop vs. Sector
- File system: Hadoop DFS is block-based; Sector DFS is file-based.
- Programming model: MapReduce vs. UDFs & MapReduce.
- Protocol: TCP vs. the UDP-based protocol UDT.
- Replication: at write vs. at write or periodically.
- Security: not yet in Hadoop; Sector is HIPAA capable.
- Language: Java vs. C++.

MapReduce vs. Sphere
- Storage: disk vs. disk & in-memory.
- Processing: map followed by reduce vs. arbitrary user-defined functions.
- Data exchange: reducers pull results from mappers vs. UDFs push results to bucket files.
- Input data locality: input data is assigned to the nearest mapper vs. the nearest UDF.
- Output data locality: not available vs. can be specified.

Terasort Benchmark

          1 Rack    2 Racks   3 Racks   4 Racks
Nodes     32        64        96        128
Cores     128       256       384       512
Hadoop    85m 49s   37m 0s    25m 14s   17m 45s
Sector    28m 25s   15m 20s   10m 19s   7m 56s
Speedup   3.0       2.4       2.4       2.2

Sector/Sphere 1.24a and Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.

MalStone
[Diagram: the MalStone model: sites visited by entities over time windows d_{k-2}, d_{k-1}, d_k.]

MalStone Benchmark

                               MalStone A   MalStone B
Hadoop                         455m 13s     840m 50s
Hadoop streaming with Python   87m 29s      142m 32s
Sector/Sphere                  33m 40s      43m 44s
Speedup (Sector vs. Hadoop)    13.5x        19.2x

Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node.

[Diagram: Sphere data flow: input segments are read from disks, processed by UDFs, and written by bucket writers to output segments on disks. Files are not split into blocks; directory directives and in-memory objects are supported.]

Sector Summary
- Sector is the fastest open source large data cloud, as measured by MalStone & Terasort.
- Sector is easy to program: UDFs, MapReduce & Python over streams.
- Sector does not require extensive tuning.
- Sector is secure: a HIPAA-compliant Sector cloud is being launched.
- Sector is reliable: it supports multiple active master node servers.

Part 6. Sector Applications

App 1: Bionimbus (www.bionimbus.org)

App 2: Cistrack & Flynet

[Diagram: Cistrack architecture: a web portal & widgets in front of Cistrack elastic cloud services, the Cistrack database, analysis pipelines & re-analysis services, ingestion services, and Cistrack large data cloud services.]

App 3: Bulk Download of the SDSS

Source    Destination   LLPR*   Link      Bandwidth
Chicago   Greenbelt     0.98    1 Gb/s    615 Mb/s
Chicago   Austin        0.83    10 Gb/s   8000 Mb/s

*LLPR = local / long distance performance; Sector's LLPR varies between 0.61 and 0.98. A recent Sloan Digital Sky Survey (SDSS) data release is 14 TB in size.

App 4: Anomalies in Network Data

Sector Applications
- Distributing the 15 TB Sloan Digital Sky Survey to astronomers around the world (with JHU, 2005)
- Managing and analyzing high throughput sequence data (Cistrack, University of Chicago, 2007)
- Detecting emergent behavior in distributed network data (Angle, which won the SC 07 Analytics Challenge)
- Wide area clouds (won the SC 09 Bandwidth Challenge with a 100 Gbps wide area computation)
- New ensemble-based algorithms for trees
- Graph processing
- Image processing (OCC Project Matsu)

Credits
Sector was developed by Yunhong Gu of the University of Illinois at Chicago and verycloud.com.

For More Information
Please visit:
- sector.sourceforge.net
- rgrossman.com (Robert Grossman)
- users.lac.uic.edu/~yunhong (Yunhong Gu)