Distributed Data Parallel Computing: The Sector Perspective on Big Data
Robert Grossman
July 25, 2010
Laboratory for Advanced Computing, University of Illinois at Chicago
Open Data Group
Institute for Genomics & Systems Biology, University of Chicago
Part 1.
Open Cloud Testbed
9 racks, 250+ nodes, 1000+ cores, 10+ Gb/s
Networks: CENIC, C-Wave, MREN, Dragon
Software: Hadoop, Sector/Sphere, Thrift, KVM VMs, Nova, Eucalyptus VMs
Open Science Data Cloud
NSF OSDC PIRE Project: working with 5 international partners (all connected with 10 Gbps networks).
Includes the sky cloud and Bionimbus (biology & health care).
Scientist with laptop: wide variety of analysis, small data size, no infrastructure.
Open Science Data Cloud: medium variety of analysis, medium to large data size, general infrastructure.
High energy physics, astronomy: low variety of analysis, very large data size, dedicated infrastructure.
Part 2. What's Different About Data Center Computing?
Data center scale computing provides storage and computational resources at the scale and with the reliability of a data center.
A very nice recent book: Barroso and Hölzle, The Datacenter as a Computer.
Scale is new
Elastic, Usage-Based Pricing Is New
120 computers in three racks for 1 hour costs the same as 1 computer in a rack for 120 hours.
Simplicity of the Parallel Programming Framework Is New
A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
HPC: minimize latency and control heat.
Large data clouds: maximize data (with matching compute) and control cost.
Elastic clouds: minimize the cost of virtualized machines & provide them on demand.
[Figure: eras of science. Experimental science: 1609 (30x), 1670 (250x). Simulation science: 1976 (10x-100x). Data science: 2003 (10x-100x).]
Databases vs Data Clouds
- Scalability: 100s of TB vs 100s of PB
- Functionality: full SQL-based queries, including joins, vs single keys
- Optimized for: safe writes (databases) vs efficient reads (data clouds)
- Consistency model: ACID (Atomicity, Consistency, Isolation & Durability) vs eventual consistency
- Parallelism: difficult because of the ACID model (shared-nothing is possible) vs parallelism over commodity components
- Scale: racks vs data center
Grids vs Clouds
- Problem: too few cycles vs too many users & too much data
- Infrastructure: clusters and supercomputers vs data centers
- Architecture: federated virtual organization vs hosted organization
- Programming model: powerful, but difficult to use vs not as powerful, but easy to use
Part 3. How Do You Program a Data Center?
How Do You Build a Data Center?
Containers are used by Google, Microsoft & others; a data center consists of 10-60+ containers.
(Pictured: Microsoft Data Center, Northlake, Illinois)
What Is the Operating System?
A workstation runs a handful of VMs (VM 1 ... VM 5); a data center operating system manages tens of thousands (VM 1 ... VM 50,000).
Data center services include: VM management services, VM failover and restart, security services, power management services, etc.
Architectural Models: How Do You Fill a Data Center?
On-demand computing instances: applications running in individual VM instances.
Large data cloud services: applications built over Cloud Storage Services, Cloud Compute Services (MapReduce & generalizations), Cloud Data Services (BigTable, etc.), and quasi-relational data services.
Instances, Services & Frameworks
A grid ranging from a single instance to many instances, and from instance (IaaS) through service and framework (PaaS) to operating system.
Examples: Amazon's EC2 (instances); Amazon's SQS, S3, and Azure Services (services); Hadoop DFS & MapReduce, Microsoft Azure, and Google AppEngine (frameworks); VMware VMotion (operating system).
Some Programming Models for Data Centers
Operations over a data center of disks:
- MapReduce ("string-based")
- Iterative MapReduce (Twister)
- DryadLINQ
- User-Defined Functions (UDFs) over the data center
- SQL and quasi-SQL over the data center
- Data analysis / statistics functions over the data center
More Programming Models
Operations over a data center of memory:
- Memcached (distributed in-memory key-value store)
- Grep over distributed memory (sketched below)
- UDFs over distributed memory
- SQL and quasi-SQL over distributed memory
- Data analysis / statistics over distributed memory
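As a concrete illustration of one model above, "grep over distributed memory": each node scans its own in-memory partition and the matches are merged. This is a minimal single-process sketch in Python; the partition layout, log records, and thread-based fan-out are assumptions for illustration, not any particular system's API.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for distributed memory: each "node" holds a partition of records.
partitions = [
    ["GET /index.html 200", "POST /login 500"],
    ["GET /about.html 200", "GET /missing 404"],
]

def grep_partition(records, pattern):
    """Scan one node's in-memory partition for lines matching the pattern."""
    regex = re.compile(pattern)
    return [line for line in records if regex.search(line)]

# Fan the query out to every partition, then merge the matches.
with ThreadPoolExecutor() as pool:
    per_node = pool.map(lambda p: grep_partition(p, r" (4|5)\d\d$"), partitions)
    matches = [line for node_result in per_node for line in node_result]
print(matches)  # ['POST /login 500', 'GET /missing 404']
```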
Part 4. Stacks for Big Data
The Google Data Stack
- The Google File System (2003)
- MapReduce: Simplified Data Processing (2004)
- BigTable: A Distributed Storage System (2006)
MapReduce Example
Input is a file with one document per record.
The user specifies a map function: key = document URL, value = the terms that the document contains.
("doc.cdickens", "it was the best of times") → map → ("it", 1), ("was", 1), ("the", 1), ("best", 1), ...
Example (cont'd)
The MapReduce library gathers together all pairs with the same key (the shuffle/sort phase).
The user-defined reduce function then combines all the values associated with the same key:
key = "it", values = 1, 1; key = "was", values = 1, 1; key = "best", values = 1; key = "worst", values = 1
→ reduce → ("it", 2), ("was", 2), ("best", 1), ("worst", 1)
(A code sketch of this word count follows below.)
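The word-count example above can be written as two small functions plus an in-memory stand-in for the shuffle/sort phase. This is an illustrative Python sketch, not Hadoop's or Google's API; the document name and text follow the slide's example.

```python
from collections import defaultdict

def map_fn(doc_url, text):
    """Emit (term, 1) for every term in the document."""
    for term in text.split():
        yield term, 1

def reduce_fn(term, counts):
    """Combine all the counts associated with the same term."""
    return term, sum(counts)

documents = {"doc.cdickens": "it was the best of times it was the worst of times"}

# Simulate the shuffle/sort phase: group map outputs by key.
groups = defaultdict(list)
for url, text in documents.items():
    for term, count in map_fn(url, text):
        groups[term].append(count)

results = [reduce_fn(term, counts) for term, counts in sorted(groups.items())]
print(results)  # includes ('it', 2), ('was', 2), ('best', 1), ('worst', 1)
```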
Applying MapReduce to the Data in a Storage Cloud
[Figure: the map/shuffle stage followed by the reduce stage running over data in the storage cloud]
Google's Large Data Cloud (Google's stack)
- Applications
- Compute services: Google's MapReduce
- Data services: Google's BigTable
- Storage services: Google File System (GFS)
Hadoop's Large Data Cloud (Hadoop's stack)
- Applications
- Compute services: Hadoop's MapReduce
- Data services: NoSQL databases
- Storage services: Hadoop Distributed File System (HDFS)
Amazon-Style Data Cloud
[Figure: a load balancer in front of pools of EC2 instances; the Simple Queue Service (SQS) and SimpleDB (SDB) connect the tiers; S3 provides the storage services]
Evolution of NoSQL Databases
Standard architecture for simple web apps:
- Front end: load-balanced web servers
- Business logic layer in the middle
- Backend database
Databases do not scale well with very large numbers of users or very large amounts of data. Alternatives include:
- Sharded (partitioned) databases (sketched below)
- Master-slave databases
- Memcached
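A minimal sketch of the "sharded (partitioned) databases" alternative listed above: data for a given user is routed to one of N shards by hashing the user id. The shard count, in-memory dicts, and function names are illustrative assumptions, not a specific product's interface.

```python
NUM_SHARDS = 4
# Each shard would be a separate database server; dicts stand in for them here.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(user_id):
    """All data for one user lives on a single shard.
    (A real system would use a stable hash such as MD5 of the key.)"""
    return shards[hash(user_id) % NUM_SHARDS]

def save_profile(user_id, profile):
    shard_for(user_id)[user_id] = profile

def load_profile(user_id):
    return shard_for(user_id).get(user_id)

save_profile("alice", {"plan": "free"})
print(load_profile("alice"))  # {'plan': 'free'}
```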
NoSQL Systems
- The name suggests no SQL support; also read as "Not Only SQL"
- One or more of the ACID properties not supported
- Joins generally not supported
- Usually flexible schemas
- Some well-known examples: Google's BigTable, Amazon's S3 & Facebook's Cassandra
- Several recent open source systems
Different Types of NoSQL Systems (see the sketch below)
- Distributed key-value systems: Amazon's S3 key-value store (Dynamo), Voldemort
- Column-based systems: BigTable, HBase, Cassandra
- Document-based systems: CouchDB
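To make the three styles concrete, the sketch below shows how the same user record might be modeled in each kind of system. The field names and layouts are illustrative assumptions, not the actual data models of S3/Dynamo, BigTable, or CouchDB.

```python
# Key-value (Dynamo/S3-style): an opaque value looked up by a single key.
kv_record = ("user:42", b'{"name": "Ada", "city": "Chicago"}')

# Column-based (BigTable/HBase/Cassandra-style): row key -> column family -> columns.
column_record = {
    "user:42": {
        "info": {"name": "Ada", "city": "Chicago"},
        "stats": {"logins": "17"},
    }
}

# Document-based (CouchDB-style): a self-describing JSON document with an id.
document_record = {"_id": "user:42", "name": "Ada", "city": "Chicago", "logins": 17}

print(kv_record[0], column_record["user:42"]["info"]["name"], document_record["name"])
```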
Cassandra vs MySQL Comparison (> 50 GB of data)
- MySQL: writes average ~300 ms, reads average ~350 ms
- Cassandra: writes average 0.12 ms, reads average 15 ms
Source: Avinash Lakshman and Prashant Malik, "Cassandra: Structured Storage System over a P2P Network", static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf
CAP Theorem
- Proposed by Eric Brewer, 2000
- Three properties of a shared-data system: consistency, availability, and partition tolerance
- You can have at most two of these three properties at once
- Scaling out requires partition tolerance, so most large web-based systems choose availability over consistency
Reference: Brewer, PODC 2000; Gilbert & Lynch, SIGACT News 2002
Eventual Consistency
- All updates eventually propagate through the system, and all nodes will eventually be consistent (assuming no more updates)
- Eventually, a node is either updated or removed from service
- Can be implemented with a gossip protocol (sketched below); Amazon's Dynamo popularized this approach
- Sometimes called BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
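A toy sketch of gossip-based eventual consistency, assuming a simple last-writer-wins rule keyed on a timestamp (real systems such as Dynamo use vector clocks and more careful conflict resolution): an update lands on one node, nodes gossip their state to random peers, and every node eventually holds the same value.

```python
import random

class Node:
    """Each node keeps (value, timestamp) per key; the higher timestamp wins."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def put(self, key, value, ts):
        self.data[key] = (value, ts)

    def gossip_to(self, peer):
        # Push state to a peer; the peer keeps the newer version of each key.
        for key, (value, ts) in self.data.items():
            if key not in peer.data or peer.data[key][1] < ts:
                peer.data[key] = (value, ts)

nodes = [Node(f"n{i}") for i in range(5)]
nodes[0].put("cart:42", ["book"], ts=1)      # the update lands on one node
for _ in range(20):                          # a few random gossip rounds
    a, b = random.sample(nodes, 2)
    a.gossip_to(b)
print([n.data.get("cart:42") for n in nodes])  # converges on all nodes after enough rounds
```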
Part 5. Sector Architecture
Design Objectives
1. Provide Internet-scale data storage for large data: support multiple data centers connected by high-speed wide-area networks.
2. Simplify data-intensive computing for a larger class of problems than covered by MapReduce: support applying user-defined functions (UDFs) to the data managed by a storage cloud, with transparent load balancing and fault tolerance.
Sector's Large Data Cloud (Sector's stack)
- Applications
- Compute services: Sphere's UDFs
- Data services
- Storage services: Sector's Distributed File System (SDFS)
- Routing & transport services: UDP-based Data Transfer protocol (UDT)
Apply User-Defined Functions (UDFs) to Files in a Storage Cloud
[Figure: the map/shuffle and reduce stages generalized to arbitrary UDFs]
UDT (udt.sourceforge.net)
UDT has been downloaded 25,000+ times.
Users include Sterling Commerce, Movie2Me, Globus, Nifty TV, and PowerFolder.
Alternatives to TCP
AIMD-style protocols adjust the packet sending rate x with an additive increase, x ← x + α(x), and a multiplicative decrease on congestion, x ← (1 − β) x, where α(x) is the increase of the packet sending rate and β is the decrease factor.
Protocols in this family, with different choices of α(x) and β: AIMD (TCP NewReno), HighSpeed TCP, Scalable TCP, UDT. (A sketch of the update rule follows below.)
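A toy sketch of the AIMD update rule above; the alpha and beta values are placeholders for illustration, not the actual parameters of UDT, Scalable TCP, HighSpeed TCP, or TCP NewReno.

```python
def aimd_step(rate, loss, alpha=1.0, beta=0.5):
    """One AIMD update: additive increase, multiplicative decrease."""
    if loss:
        return (1 - beta) * rate   # x <- (1 - beta) * x on congestion
    return rate + alpha            # x <- x + alpha(x) otherwise

rate = 10.0
for loss_event in [False, False, False, True, False]:
    rate = aimd_step(rate, loss_event)
    print(round(rate, 2))          # 11.0, 12.0, 13.0, 6.5, 7.5
```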
System Architecture
- Security server: user accounts, data protection, system security
- Masters: metadata, scheduling, service provider
- Clients: system access tools, application programming interfaces
- Slaves: storage and processing
The security server, masters, and clients communicate over SSL; data moves between clients and slaves over UDT (encryption optional).
Hadoop DFS vs Sector DFS
- Storage cloud: block-based file system vs file-based
- Programming model: MapReduce vs UDF & MapReduce
- Protocol: TCP vs UDP-based protocol (UDT)
- Replication: at write vs at write or periodically
- Security: not yet vs HIPAA-capable
- Language: Java vs C++
MapReduce vs Sphere
- Storage: disk data vs disk & in-memory
- Processing: map followed by reduce vs arbitrary user-defined functions
- Data exchange: reducers pull results from mappers vs UDFs push results to bucket files
- Input data locality: input data is assigned to the nearest mapper vs input data is assigned to the nearest UDF
- Output data locality: NA vs can be specified
Terasort Benchmark
            1 rack     2 racks    3 racks    4 racks
Nodes       32         64         96         128
Cores       128        256        384        512
Hadoop      85m 49s    37m 0s     25m 14s    17m 45s
Sector      28m 25s    15m 20s    10m 19s    7m 56s
Speedup     3.0        2.4        2.4        2.2
Sector/Sphere 1.24a and Hadoop 0.20.1 with no replication, on Phase 2 of the Open Cloud Testbed with co-located racks.
MalStone
[Figure: entities visit sites over a sequence of time windows d_{k-2}, d_{k-1}, d_k]
MalStone Benchmark
                               MalStone A    MalStone B
Hadoop                         455m 13s      840m 50s
Hadoop streaming with Python   87m 29s       142m 32s
Sector/Sphere                  33m 40s       43m 44s
Speedup (Sector vs Hadoop)     13.5x         19.2x
Sector/Sphere 1.20 and Hadoop 0.18.3 with no replication, on Phase 1 of the Open Cloud Testbed in a single rack. Data consisted of 20 nodes with 500 million 100-byte records per node.
Sphere data flow: data on disks is read as input segments (whole files, which are not split into blocks; directories and in-memory objects can also serve as inputs); a UDF is applied to each segment, its results go to bucket writers, and the output segments are written back to disk. (A sketch follows below.)
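A single-process sketch of this flow, assuming a word-count-like UDF and a hash-based assignment of keys to bucket files; the function names, file names, and bucket rule are illustrative, not Sector/Sphere's C++ API.

```python
import os
from collections import defaultdict

def udf(segment_text):
    """User-defined function: emit (key, value) pairs for one input segment."""
    for word in segment_text.split():
        yield word, 1

def process_segments(segment_paths, num_buckets, out_dir):
    """Run the UDF over each segment and append its results to bucket files."""
    os.makedirs(out_dir, exist_ok=True)
    buckets = defaultdict(list)
    for path in segment_paths:            # each segment is a whole file
        with open(path) as f:
            for key, value in udf(f.read()):
                buckets[hash(key) % num_buckets].append((key, value))
    for b, pairs in buckets.items():      # bucket writers produce the output segments
        with open(os.path.join(out_dir, f"bucket_{b}.txt"), "a") as out:
            out.writelines(f"{k}\t{v}\n" for k, v in pairs)

# Example (hypothetical file names):
# process_segments(["seg_0001.txt", "seg_0002.txt"], num_buckets=4, out_dir="out")
```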
Sector Summary
- Sector is the fastest open source large data cloud, as measured by MalStone & Terasort
- Sector is easy to program: UDFs, MapReduce & Python over streams
- Sector does not require extensive tuning
- Sector is secure: a HIPAA-compliant Sector cloud is being launched
- Sector is reliable: Sector supports multiple active master node servers
Part 6. Sector Applications
App 1: Bionimbus (www.bionimbus.org)
App 2: Cistrack & Flynet
Cistrack components: web portal & widgets, elastic cloud services, the Cistrack database, analysis pipelines & re-analysis services, ingestion services, and large data cloud services.
App 3: Bulk Download of the SDSS
The recent Sloan Digital Sky Survey (SDSS) data release is 14 TB in size.
Source     Destination   LLPR*   Link       Bandwidth
Chicago    Greenbelt     0.98    1 Gb/s     615 Mb/s
Chicago    Austin        0.83    10 Gb/s    8000 Mb/s
*LLPR = local / long-distance performance; Sector's LLPR varies between 0.61 and 0.98.
App 4: Anomalies in Network Data
Sector Applications
- Distributing the 15 TB Sloan Digital Sky Survey to astronomers around the world (with JHU, 2005)
- Managing and analyzing high-throughput sequence data (Cistrack, University of Chicago, 2007)
- Detecting emergent behavior in distributed network data (Angle, won the SC 07 Analytics Challenge)
- Wide-area clouds (won the SC 09 BWC with a 100 Gbps wide-area computation)
- New ensemble-based algorithms for trees
- Graph processing
- Image processing (OCC Project Matsu)
Credits: Sector was developed by Yunhong Gu from the University of Illinois at Chicago and verycloud.com.
For More Information
Please visit:
- sector.sourceforge.net
- rgrossman.com (Robert Grossman)
- users.lac.uic.edu/~yunhong (Yunhong Gu)