Scalable Data-Intensive Processing for Science on Clouds: A-Brain and Z-CloudFlow. Lessons Learned and Future Directions
Gabriel Antoniu, Inria
Joint work with Radu Tudoran, Benoit Da Mota, Alexandru Costan, Elena Apostol, Bertrand Thirion (co-PI for A-Brain), Ji Liu, Luis Pineda, Esther Pacitti, Patrick Valduriez (co-PI for Z-CloudFlow) and the Microsoft Azure team from MSR ATL Europe
EIT Digital Future Cloud Symposium, Rennes, 19-20 October 2015
Inria Teams Involved in Cloud-Related Projects of the MSR-Inria Joint Centre
KERDATA: Data Storage and Processing
PARIETAL: Neuroimaging
ZENITH: Scientific Data Management
[Map: Inria research centres across France]
KerData's Focus: How to efficiently store and share data at large scale for next-generation, data-intensive applications?
Scientific challenges: massive data; geographically distributed; fine-grain access (MB) for reading and writing; high concurrency without locking
Major goal: high throughput under heavy concurrency
Our contribution: design and implementation of distributed algorithms; validation with real apps on real platforms with real users
Motivating Application: A-Brain
Detect risk factors for brain diseases by finding associations p(brain image, genetic data) between ~10^6 imaging variables and ~10^6 genetic variables
Brain imaging: anatomical MRI, functional MRI, diffusion MRI
Genetic data: DNA arrays (SNP/CNV), gene expression data, others
>2,000 subjects
IEEE Cluster 15, Chicago, USA, 10 September 2015
Approach: A-Brain as Map-Reduce Processing
Challenges: Overview
Scaling the processing: multi-site MapReduce; enabling large-scale scientific processing; enabling scientific discovery
Data management across sites: high-performance Big Data management across cloud data centers; optimized inter-site transfers
High-performance streaming: cloud-provided transfer service; streaming across cloud sites; configurable cost-performance tradeoffs
Data Management on Public Clouds
Cloud compute nodes access a cloud-provided storage service: computation-to-data latency is high!
TomusBlobs: Leverage Virtual Disks
Collocating computation and data in PaaS clouds:
Federate the virtual disks of compute nodes
Self-configuration, automatic deployment and scaling of the data management system
Apply to MapReduce and workflow processing
Leveraging TomusBlobs for MapReduce Processing
[Diagram: a client submits jobs via Azure Queues to Map and Reduce workers]
New MapReduce prototype (no Hadoop on Azure at that point)
Relies on versioning to support high throughput under heavy concurrency, leveraging BlobSeer (KerData, Inria, Rennes)
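The queue-based control flow above can be sketched as follows. This is an illustrative Python model, not the actual prototype's API: the queue and store names are stand-ins for Azure Queues and the TomusBlobs storage layer, and word counting (as in the Most Frequent Words benchmark used later) serves as the example job.

```python
from collections import defaultdict
from queue import Queue

# Illustrative stand-ins for Azure Queues and the federated blob store.
task_queue = Queue()    # client -> map workers
result_queue = Queue()  # map workers -> reduce workers
blob_store = {}         # shared key/value storage (stand-in for TomusBlobs)

def client_submit(chunks):
    """Split the input and enqueue one map task per chunk."""
    for i, chunk in enumerate(chunks):
        task_queue.put((i, chunk))

def run_mapper():
    """Pull map tasks from the queue; write partial counts to the store."""
    while not task_queue.empty():
        task_id, chunk = task_queue.get()
        counts = defaultdict(int)
        for word in chunk.split():
            counts[word] += 1
        blob_store[f"map-{task_id}"] = dict(counts)
        result_queue.put(f"map-{task_id}")  # tell reducers where to look

def run_reducer():
    """Aggregate every partial result referenced in the result queue."""
    totals = defaultdict(int)
    while not result_queue.empty():
        key = result_queue.get()
        for word, n in blob_store[key].items():
            totals[word] += n
    return dict(totals)

client_submit(["to be or", "not to be"])
run_mapper()
totals = run_reducer()
print(totals)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In the real system, mappers and reducers run in separate VMs and the queues decouple them, so workers can be added or removed without reconfiguring the job.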
Background: BlobSeer, a Software Platform for Scalable, Distributed BLOB Management
Started in 2008; 6 PhD theses (Gilles Kahn/SPECIF PhD Thesis Award in 2011)
Main goal: optimized data accesses under heavy concurrency
Three key ideas:
Decentralized metadata management
Lock-free concurrent writes (enabled by versioning): a write creates a new version of the data
Data and metadata patching rather than updating
A back-end for higher-level data management systems: short term, highly scalable distributed file systems; middle term, storage for cloud services
Our approach: design and implementation of distributed algorithms; experiments on the Grid'5000 grid/cloud testbed; validation with real apps on real platforms (Nimbus, Azure, OpenNebula clouds)
http://blobseer.gforge.inria.fr/
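The versioning idea can be illustrated with a toy single-node store. This is a sketch of the concept only, not BlobSeer's actual interface: each write publishes a new immutable version instead of updating in place, so readers pin a version and never block on writers (real BlobSeer serializes version publication with distributed metadata, not a local lock).

```python
import threading

class VersionedBlob:
    """Toy BLOB with versioning-based writes: no in-place updates;
    each write publishes a new immutable snapshot (copy-on-write)."""

    def __init__(self):
        self._versions = []            # immutable snapshots, oldest first
        self._lock = threading.Lock()  # only orders the cheap version append

    def write(self, offset, data):
        """Patch a range by creating a new version; old versions untouched."""
        with self._lock:
            base = self._versions[-1] if self._versions else b""
            new = (base[:offset].ljust(offset, b"\0")
                   + data
                   + base[offset + len(data):])
            self._versions.append(new)
            return len(self._versions) - 1  # version id

    def read(self, version=-1):
        """Readers pin a version: no lock, no interference from writers."""
        return self._versions[version]

blob = VersionedBlob()
v0 = blob.write(0, b"hello")
v1 = blob.write(5, b" world")
print(blob.read(v0))  # b'hello'  (old snapshot still readable)
print(blob.read(v1))  # b'hello world'
```

Because snapshots are immutable, a reader that obtained version v0 keeps a consistent view even while concurrent writers publish v1, v2, and so on; this is what makes high throughput under heavy concurrency possible without locking data.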
Initial A-Brain Experimentation
Scenario: 100-node deployment on Azure
Comparison with an Azure Blobs-based MapReduce: TomusBlobs is clearly faster than Azure storage
Beyond MapReduce: MapIterativeReduce
Unique result with parallel reduction
No central control entity
No synchronization barrier
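A minimal sketch of the reduction scheme (illustrative Python, not the actual implementation): partial results are combined pairwise and the result is fed back until a single value remains. In the real system the get/combine/put loop runs concurrently on many reducers, with no coordinator and no barrier between the map and reduce phases.

```python
from queue import Queue

def map_iterative_reduce(partials, combine):
    """Reduce a list of partial results to a unique result.

    Any available reducer may take two partials, combine them, and feed
    the result back; the reduction ends when one value is left."""
    q = Queue()
    for p in partials:
        q.put(p)
    while q.qsize() > 1:
        a, b = q.get(), q.get()
        q.put(combine(a, b))
    return q.get()

# Word-count partials from four hypothetical mappers:
parts = [{"a": 1}, {"a": 2, "b": 1}, {"b": 3}, {"c": 1}]
merge = lambda x, y: {k: x.get(k, 0) + y.get(k, 0) for k in {*x, *y}}
total = map_iterative_reduce(parts, merge)
print(total)  # a: 3, b: 4, c: 1 (key order may vary)
```

Since any pair of available partials can be combined by any reducer, fast mappers feed the reduction early instead of waiting for the slowest mapper, which is the point of removing the synchronization barrier.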
The Global Gain
The Most Frequent Words benchmark: data set 3.2 GB to 32 GB
A-Brain initial experimentation: data set 5 GB to 50 GB
Experimental setup: 200-node deployment on Azure
MapIterativeReduce halves the execution timespan
Single-Site Computation on the Cloud
Timespan estimation for a single-core machine: 5.3 years
Parallelized and executed on the Azure cloud across 350 cores using TomusBlobs
Achievements: reduced execution time to 5.6 days; demonstrated that this technique is sensitive to outliers and cannot get results at the required scale
Why geographically distributed processing?
Get more data: 1 billion euro needed
More robust analysis: computation timespan increases to 86 years (for a single core)
Going Geo-Distributed
Azure data centers
Hierarchical multi-site MapReduce: MapIterativeReduce, global reduce
Data management: TomusBlobs (intra-site), cloud storage (inter-site)
IterativeReduce technique for minimizing transfers of partial results
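The hierarchical scheme can be sketched as follows (illustrative Python; the site names and combine function are assumptions, not the real deployment): each site first reduces its own mappers' outputs locally via TomusBlobs, so only one aggregate per site crosses the expensive inter-site links for the global reduce.

```python
from functools import reduce

def hierarchical_reduce(partials_per_site, combine):
    """Two-level reduction across data centers.

    Level 1 (intra-site): each site reduces its mappers' outputs locally,
    so bulky partial results never leave the site.
    Level 2 (inter-site): one aggregate per site is transferred and
    merged by the global reduce."""
    site_aggregates = [reduce(combine, parts)
                       for parts in partials_per_site.values()]
    return reduce(combine, site_aggregates)

merge = lambda x, y: {k: x.get(k, 0) + y.get(k, 0) for k in {*x, *y}}
partials = {
    "eu-west": [{"a": 1}, {"a": 2}],          # reduced inside the EU site
    "us-east": [{"a": 1, "b": 1}, {"b": 2}],  # reduced inside the US site
}
result = hierarchical_reduce(partials, merge)
print(result)  # a: 4, b: 3 (key order may vary)
```

With n partial results per site, inter-site traffic drops from n transfers to one per site, which is what makes the 1,000-core, 3-data-center A-Brain run on the next slide affordable.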
Executing the A-Brain Application at Large Scale
The TomusBlobs data-storage layer developed within the A-Brain project was demonstrated to scale up to 1,000 cores on 3 Azure data centers (from EU, US)
Gain compared to Azure BLOBs: close to 50%
Experiment duration: ~14 days
More than 210,000 hours of computation used
Cost of the experiments: 20,000 euros (VM price, storage, outbound traffic)
28,000 map jobs (each lasting about 2 hours) and ~600 reduce jobs
Scientific discovery: provided the first statistical evidence of the heritability of functional signals in a failed-stop task in the basal ganglia
People Involved
Gabriel Antoniu (Inria, project lead)
Benoit Da Mota (Inria)
Bertrand Thirion (Inria, project lead)
Hakan Soncu (Microsoft Research)
Pierre Louis Xech (Microsoft)
Alexandru Costan (Inria)
Götz-Philip Brasche (Microsoft Research, now at Huawei)
Radu Tudoran (Inria, now at Huawei)
What's Next? Z-CloudFlow: Data-Intensive Workflows in the Cloud
Scientific Workflow Scenario
1. Data is generated and collected (with provenance data)
2. It is locally evaluated
3. A large volume of data is produced...
4. ...which needs to be processed (HPC)
5. Final results (e.g., phylogenetic trees) are generated in a reasonable time
Why Use Multisite Clouds for Workflows?
Multisite cloud = a cloud with multiple data centers, each with its own cluster, data and programs
Matches well the requirements of scientific apps, with different labs and groups at different sites
Multisite Cloud Data Management: Challenges
What strategies to use, and how, for efficient data transfers?
How to handle metadata across data centers?
How to group tasks and datasets together to minimize data transfers?
Main obstacle: the network!
[Chart: metadata update latency (seconds, up to ~1,400) vs. number of files (100 to 5,000), for local, regional and remote accesses; remote update latency grows fastest]
Metadata Management (for workflows on distributed clouds)
Workflow features (what we know):
Many small files (when striping makes no sense)
Common data access patterns: pipeline, gather, scatter, reduce and broadcast; applications are a combination of them
Typical scheme: write once, read many times
Design principles (how we handle it):
Hybrid distributed/replicated DHT-based architecture
In-memory caching
Leverage workflow metadata for data provisioning
Eventual consistency for geo-distributed metadata: lazy metadata updates
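The write-once, read-many pattern is what makes aggressive in-memory caching safe: once a file's metadata has been fetched from a remote site, the cached entry can never go stale because the file is not overwritten. A toy sketch of this design principle (the class and catalog names are hypothetical; the real system caches inside each site's metadata servers):

```python
class CachedMetadataClient:
    """Metadata client with an in-memory cache.

    Workflow files are written once and read many times, so a cached
    entry never becomes stale: the expensive cross-site lookup is paid
    at most once per file per site."""

    def __init__(self, remote_lookup):
        self._remote_lookup = remote_lookup  # expensive cross-site call
        self._cache = {}
        self.remote_calls = 0  # instrumentation for this sketch

    def get(self, path):
        if path not in self._cache:
            self.remote_calls += 1
            self._cache[path] = self._remote_lookup(path)
        return self._cache[path]

# Hypothetical remote catalog at another data center:
catalog = {"/wf/out1.dat": {"site": "eu-west", "size": 4096}}
client = CachedMetadataClient(catalog.__getitem__)
for _ in range(1000):
    meta = client.get("/wf/out1.dat")  # read many times...
print(client.remote_calls)             # ...only one remote lookup
```

Had the files been mutable, every read would need an invalidation check across data centers; the write-once assumption removes that cost entirely, which is why it is listed among the design principles.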
Four Strategies
Centralized: baseline
Replicated: local metadata accesses, with a synchronization agent
Decentralized, non-replicated: metadata scattered across sites, DHT-based
Decentralized, replicated: metadata stored locally and replicated to a remote location (using hashing)
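The fourth strategy can be sketched as follows (a toy model; the site list and hashing scheme are assumptions): metadata is written to the local site for fast access and, by hashing the file name, replicated to one deterministic "home" site, so any site can locate an entry in at most one remote hop without a central registry.

```python
import hashlib

SITES = ["eu-north", "eu-west", "us-east", "us-west"]  # assumed data centers
stores = {s: {} for s in SITES}  # one in-memory metadata store per site

def home_site(path):
    """Deterministic replica location: every site computes the same
    hash, so lookups know where the replica lives without asking
    a central registry."""
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return SITES[h % len(SITES)]

def put_metadata(local_site, path, meta):
    stores[local_site][path] = meta       # fast local write
    stores[home_site(path)][path] = meta  # lazy/async replica in the real system

def get_metadata(local_site, path):
    """Try locally first; otherwise one hop to the hashed home site."""
    local = stores[local_site].get(path)
    return local if local is not None else stores[home_site(path)].get(path)

put_metadata("eu-west", "/wf/out1.dat", {"size": 4096})
# A task scheduled on another site still finds the metadata in one hop:
meta = get_metadata("us-east", "/wf/out1.dat")
print(meta)  # {'size': 4096}
```

The decentralized non-replicated strategy is this sketch minus the local copy (only the hashed home site stores the entry), which explains the tradeoff on the later slides: it scales for parallel jobs, while the replicated variant wins for sequential, tightly dependent jobs that keep reading locally.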
Architecture and Implementation
Communication and distributed synchronization manager
In-memory metadata storage
Optimistic concurrency model: no locks during operations
Experimental Setup
Azure cloud (PaaS), 4 data centers: 2 EU, 2 US
Up to 128 nodes; 1 CPU core, 1.75 GB memory, 127 GB disk per node
Impact of Decentralization on Makespan
Completion time vs. speedup, for small and large workflow provisioning:
No significant gain in small settings
50% improvement at large scale
Impact of local replication: speedup of 1.25+
Scalability
Up to 128 nodes, 5,000 operations/node
Degradation of the replicated approach at large scale
Support for Real-Life Workflows
BuzzFlow: pipeline-like; correlation in large scientific databases
Montage: split + parallelized jobs + merge; astronomy application to create mosaics of the sky
Three scenarios:

Scenario                  Small Scale   Comp. Int.   Metadata Int.
Operations / node         100           200          1,000
Computation time / node   1 s           5 s          1 s
Total ops (BuzzFlow)      7,200         14,400       72,000
Total ops (Montage)       16,000        32,000       160,000
Matching Strategies to Workflows
Centralized: still better at small scale
Replicated: benefits from intensive computations on large files
Decentralized approaches: suitable for large-scale, metadata-intensive apps handling a large number of small files
Non-replicated: for parallel jobs with linear metadata access
Replicated: for sequential, tightly dependent jobs; data available locally
Overall Achievements
Publications
Book chapter: in Cloud Computing for Data-Intensive Applications, Springer 2015
Journal articles: Frontiers in Neuroinformatics 2014; Concurrency and Computation: Practice and Experience 2013; ERCIM Electronic Journal 2012
International conference publications: IEEE Cluster 2015; 3 papers at IEEE/ACM CCGrid 2012 and 2014; IEEE SRDS 2014; IEEE Big Data 2013; ACM DEBS 2014; IEEE TrustCom/ISPA 2013
Workshop papers, posters and demos: MapReduce in conjunction with ACM HPDC (rank A); CloudCP in conjunction with ACM EuroSys (rank A); IPDPSW in conjunction with IEEE IPDPS (rank A); Microsoft CloudFutures, ResearchNext, PhD Summer School; demo in conjunction with ACM DEBS
Software
PaaS data management middleware, available with Microsoft GenericWorker
MapReduce engine for the Azure cloud
Cloud service for bio-informatics
SaaS for benchmarking the performance of data stage-in to cloud data centers, available on the Azure cloud
Middleware for batch-based, high-performance streaming across cloud sites, binding with Microsoft StreamInsight
External Collaborators
Microsoft Research ATLE, Cambridge
Argonne National Laboratory
Inria Saclay
Inria Sophia Antipolis
Future Directions
Multi-site workflows across geographically distributed sites: incorporate the metadata registry with a workflow execution engine to support multi-site scheduling
Self-* processing: cost/performance/energy tradeoffs; one size does not fit all!
Cloud stream processing: management of many small events, latency constraints for distributed queries
Scalable Data-Intensive Processing for Science on Clouds: A-Brain and Z-CloudFlow Contact: gabriel.antoniu@inria.fr team.inria.fr/kerdata Thank you!