Will They Blend?: Exploring Big Data Computation atop Traditional HPC NAS Storage

1 Will They Blend?: Exploring Big Data Computation atop Traditional HPC NAS Storage
Ellis H. Wilson III (1,2), Mahmut Kandemir (1), Garth Gibson (2,3)
(1) Department of Computer Science and Engineering, The Pennsylvania State University
(2) Panasas, Inc.
(3) Department of Computer Science, Carnegie Mellon University
July 3rd, 2014

2 Before We Begin: Get the Slides and Paper
Slides and paper are available at:

3
1 Introduction and Background
    From 10,000 Feet: Considering Hadoop's Fit in HPC
    Goals of this Research: MapReduce in HPC?
2 for
    Overview of Architectures
    Reliability and Performance Implications
3 Performance of
    Setup and Benchmarks
    Performance Results

4 Motivation
- The divide between HPC and Big Data is increasingly foggy
- The Big Data processing framework MapReduce (MR) promises faster time-to-solution for data-intensive science
- But MR often comes tightly coupled with the Hadoop Distributed File System (HDFS)
- Standard HDFS requires disks local to the compute nodes for distributed storage
- HPC typically already has its own Parallel File System (PFS) solutions in place
- Using Hadoop threatens to require large capital and maintenance investments
- Totally dropping MPI and similar solutions for MR is impossible
- Copying massive amounts of data from Network-Attached Storage (NAS) to HDFS and back is a common problem
- Dividing your storage into two pools, NAS and HDFS, will exacerbate the Compute-Storage gap

5 Hurdles to Adoption of Hadoop in HPC
- Loss of Infrastructure Consolidation
- Forced Import/Export
- I/O Performance Degradation
- Loss of High-Availability
- No Modification to Files
- Inefficient Compute-Storage Coupling

6 Goals of this Research
Three Main Goals/Contributions:
1 Explore if/how one can enable MR to run on traditional NAS
    - Enables reuse of existing storage and infrastructure consolidation
2 Explore whether one can use MR alongside MPI and others without copying
    - Improves utility of capacity, reduces network contention, fights the I/O Gap
3 Identify the relative efficiencies and reliabilities of potential solutions
    - Examine four different architecture approaches

7 First: Consider Traditional Hadoop
[Figure: Typical Hadoop Architecture: Example of Write Path]

8 Exploration of Four Possible Architectures
Possible Architectures:
- Traditional HDFS Pointed at a PFS
    Configure HDFS with PFS paths rather than local disks
- HDFS as a Wire Protocol in the PFS NAS Heads
    Run DataNodes (DNs) on NAS heads instead of all clients
- No HDFS, MR Directly to the PFS
    Run MR configured to send data directly to the PFS
- Replicating Array of Independent NAS File System
    New Hadoop filesystem designed specifically to intermediate between MR and the PFS
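
The first and third architectures are largely a matter of Hadoop configuration. The sketch below is illustrative only and is not taken from the slides: it renders the two relevant properties using their Hadoop 2.x names (`dfs.datanode.data.dir` and `fs.defaultFS`; Hadoop 1.x calls them `dfs.data.dir` and `fs.default.name`). The mount point `/mnt/pfs`, the per-host directory layout, and the helper names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a Hadoop *-site.xml document from a dict of property names to values."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical PFS mount point shared by every client (an assumption, not from the slides).
PFS_MOUNT = "/mnt/pfs"


def hdfs_on_pfs(hostname):
    """Architecture 1: each DataNode keeps its block files on the PFS, not on local disk.

    A per-host subdirectory is assumed here so that DataNodes sharing one
    mount do not collide in the same storage directory.
    """
    return hadoop_site_xml({"dfs.datanode.data.dir": f"{PFS_MOUNT}/hdfs/{hostname}"})


def mr_direct_to_pfs():
    """Architecture 3: no HDFS; MR uses the POSIX-mounted PFS via the local filesystem."""
    return hadoop_site_xml({"fs.defaultFS": "file:///"})


print(hdfs_on_pfs("client01"))
print(mr_direct_to_pfs())
```

With `fs.defaultFS` set to `file:///`, job input and output paths would then be ordinary paths under the PFS mount, which is what lets other tools read the same files without an import or export step.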

9 Architecture Details: Traditional HDFS
Pros:
- Simplicity
Cons:
- Performance Degradation: One full replica in network contention
- Reliability Limits: Duplication is the ceiling
- Copy Required: Distinct namespace

10 Architecture Details: HDFS as a Wire Protocol
Pros:
- HDFS becomes Yet Another Protocol
- Reliability limits go away
Cons:
- Performance Bottleneck: NAS Head limits throughput
- NAS Invasion: May not be possible (easy) with many NAS solutions
- Copy Required: Distinct namespace

11 Architecture Details: No HDFS
Pros:
- High-Performance: Alleviates overheads and bottlenecks
- No Copies: Operates on typical POSIX namespace
Cons:
- Requires Single Namespace: No HDFS to intermediate between distinct NAS
- No Replication: Must rely solely on RAID

12 Hadoop vs. HPC Storage: A Reliability Divergence
HPC Storage
- Enterprise storage solutions
- RAID 5/6
- ECC-enabled hardware (sometimes end-to-end)
- Redundant hardware (PSU/NIC/etc.)
Hadoop Storage (HDFS)
- Commodity hard drives in compute nodes
- Replication performed across nodes/racks
- No ECC
- No redundant hardware

13 Converged Reliability Guarantees
                                                  RAID 5   RAID 6   RAID 5   RAID 6   RAID 5   RAID 6
                                                  Repl. 1  Repl. 1  Repl. 2  Repl. 2  Repl. 3  Repl. 3
DN-on-Client                                      1 / 0    2 / 0    3 / 1    5 / 1      /        /
DN-on-NAS Node                                    1 / 0    2 / 0    3 / 1    5 / 1    5 / 2    8 / 2
No HDFS                                           1 / 0    2 / 0      /        /        /        /
Replicating Array of Independent NAS File System  1 / 0    2 / 0    3 / 1    5 / 1    5 / 2    8 / 2
Two main failure modes for converged HDFS/HPC storage:
- Failure of a disk
- Failure of a rack
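
The filled-in cells of the table above follow a simple pattern. The sketch below is one reading of it, not something the slides state: it assumes each cell is "disk failures tolerated / rack failures tolerated", that every replica lives in its own RAID group on its own rack, and that RAID 5 and RAID 6 survive one and two disk losses per group respectively. The function name is hypothetical.

```python
def tolerated_failures(raid_parity, replicas):
    """Return (disk_failures, rack_failures) survivable in the worst case.

    Assumed model: data is lost only when every replica is lost. Killing one
    replica takes raid_parity + 1 disk failures in its RAID group, so the
    worst case tolerates replicas * (raid_parity + 1) - 1 disk failures.
    Each replica sits on a separate rack, so replicas - 1 racks may fail.
    """
    disks = replicas * (raid_parity + 1) - 1
    racks = replicas - 1
    return disks, racks


for replicas in (1, 2, 3):
    for name, parity in (("RAID 5", 1), ("RAID 6", 2)):
        disks, racks = tolerated_failures(parity, replicas)
        print(f"Repl. {replicas}, {name}: {disks} / {racks}")
```

Under these assumptions the function reproduces the six values in the fully populated rows (1 / 0, 2 / 0, 3 / 1, 5 / 1, 5 / 2, 8 / 2).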

14 Locality Confusion: Write Transport
[Figure: Errant Pass-Through Behavior on Write. Network throughput (MB/s), received and sent, versus time since start (minutes:seconds).]

15 Read Transport
[Figure: Errant Pass-Through Behavior on Read. Network throughput (MB/s), received and sent, versus time since start (minutes:seconds).]

16 Design Desiderata
Four main goals for the Replicating Array of Independent NAS File System:
1 Client-Level Federation of NAS Systems: Enable performance of all available NAS systems concurrently and maintain discrete failure domains
2 Full Replication: Restore replication ability in MapReduce
3 No Data Pass-Throughs: Writes/reads should never go through another client node
4 A Fair Namespace: Create a framework-agnostic namespace where no imports or exports are required

17 Main Implementation Mechanisms
Symbolic Links (symlinks)
- Symlinks on the master failure domain point at replica zero on one of the NAS systems
- Placement of replica zero is chosen randomly; the following replicas are placed round-robin
- MR can read from MPI output; MPI can read from MR output
- Key algorithms and their synchronization issues are covered in the paper
Hidden Metadata File
- Sits beside, and is named similarly to, the symlink
- Manages where replicas exist, up/down state, etc.
- Avoids a dedicated, centralized metadata manager daemon
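
The symlink and hidden-metadata mechanisms can be illustrated with a small sketch. This is not the authors' implementation, and it ignores the synchronization issues the slide defers to the paper. The function names, the `.<name>.meta` naming, and the JSON layout are all hypothetical; only the random choice of replica zero, the round-robin placement of later replicas, the symlink to replica zero, and the metadata file beside the symlink come from the slide.

```python
import json
import os
import random


def place_replicas(nas_roots, replication, rng=random):
    """Pick replica zero at random, then place the remaining replicas round-robin."""
    if replication > len(nas_roots):
        raise ValueError("need at least one NAS system per replica")
    first = rng.randrange(len(nas_roots))
    return [nas_roots[(first + i) % len(nas_roots)] for i in range(replication)]


def write_file(master_dir, nas_roots, name, data, replication=2, rng=random):
    """Write data to each chosen NAS, then publish a symlink and hidden metadata."""
    paths = []
    for root in place_replicas(nas_roots, replication, rng):
        path = os.path.join(root, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    # The symlink in the master failure domain points at replica zero, so any
    # POSIX program (MPI or otherwise) can open the file without an export.
    os.symlink(paths[0], os.path.join(master_dir, name))
    # The hidden metadata file sits beside the symlink and records where the
    # replicas live and whether each is up (hypothetical JSON layout).
    meta = {"replicas": [{"path": p, "up": True} for p in paths]}
    with open(os.path.join(master_dir, "." + name + ".meta"), "w") as f:
        json.dump(meta, f)
    return paths
```

Calling `write_file(master, [nas_a, nas_b, nas_c], "part-00000", b"...")` with three existing directories would leave two copies on neighbouring NAS roots and a symlink named `part-00000` in `master`; because the state lives in files beside the symlink, no separate metadata daemon is involved.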

18 Setup and Benchmarks in Use
Hardware Environment
- Cluster of 50 multi-core machines at Carnegie Mellon
- CentOS 5.5 running as a VM on KVM
- DirectFlow(tm) network-attached protocol to 5 shelves of Panasas ActiveStor 12
Benchmarks in Use
- Ubiquitous Yahoo! TeraSort Benchmark Suite
    - TeraGen: Write-intensive
    - TeraSort: Mixed, CPU-intensive
    - TeraValidate: Read-intensive

19 Impact of Architecture on Throughput Performance
[Figure: Yahoo! TeraSort Benchmark (50 clients, 500 GB of data). Throughput (MB/s) for DN-on-Client, DN-on-NAS, and No-DN in four panels:
(a) Rep. Level 1: Write- and Read-Intensive (TeraGen, TeraValidate)
(b) Rep. Level 1: Mixed (TeraSort)
(c) Rep. Level 2: Write- and Read-Intensive (TeraGen, TeraValidate)
(d) Rep. Level 2: Mixed (TeraSort)]

20 Impact of Replication on Performance
Relative Throughput Slow-Downs for Replication per Architecture
Relative Slowdown: DN-on-Client, DN-on-NAS
TeraGen 1.52

21 Conclusion
Conclusions:
- Convergence of Big Data and HPC is happening: compute is easy, storage is hard
- Numerous pitfalls/caveats, especially relating to reliability
- Four different architectures using Hadoop MapReduce on NAS were explored; each has its own pros/cons
- The Replicating Array of Independent NAS File System demonstrates optimal performance with the highest reliability

22 Questions?

23 Backup Slides
Begin Backup Slides

24 Super-Moore's Law CPU Scaling in Supercomputers
[Figure: Top 500 Supercomputer HPLinpack Results over Time. HPLinpack Rmax by year for the #1, mean, and #500 systems; mean Rmax/CPUs by year (doubling performance every 13.5 months); mean CPU count by year, normalized to 1 at start (doubling CPU count every 22 months).]

25 HDD Capacity and Bandwidth Scaling
[Figure: Historical data from over 1500 hard drives. Capacity (GB) by year, fit-line exp(0.452*x ): doubling capacity every 18 months. Maximum read bandwidth (MB/s) versus capacity (GB), fit-line * ln(x): doubling bandwidth only once per decade!]
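
The "doubling capacity every 18 months" figure can be checked against the capacity fit-line. The short sketch below assumes that x in exp(0.452*x ) is measured in years (the axis is labelled Year) and that any constant offset lost from the fit does not change the growth rate; the function name is hypothetical.

```python
import math


def doubling_time_months(rate_per_year):
    """Doubling time, in months, of a quantity growing as exp(rate_per_year * years)."""
    return 12.0 * math.log(2.0) / rate_per_year


# Capacity fit-line exponent from the slide: 0.452 per year.
print(round(doubling_time_months(0.452), 1))  # about 18.4 months
```

This agrees with the 18-month doubling quoted on the slide.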

26 Taken In Perspective: A Grim Reality
Time taken to access all on-disk data: seconds, minutes, day
