Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Size: px

Start display at page:

Download "Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013"

Harry Parks
10 years ago
Views:

1 Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others.

2 Agenda Hadoop Intro Why run Hadoop on Lustre? OpAmizing Hadoop for Lustre Performance What s next? 2 2

3 3 A LiHle Intro of Hadoop Open source MapReduce framework for data- intensive compuang Simple programming model two funcaons: Map and Reduce Map: Transforms input into a list of key value pairs Map(D) List[Ki, Vi] Reduce: Given a key and all associated values, produces result in the form of a list of values Reduce(Ki, List[Vi]) List[Vo] Parallelism hidden by framework Highly scalable: can be applied to large datasets (Big Data) and run on commodity clusters Comes with its own user- space distributed file system (HDFS) based on the local storage of cluster nodes 3

form of a list of values Reduce(Ki, List[Vi]) List[Vo] Parallelism hidden by framework Highly scalable: can be applied to large datasets (Big

4 A LiHle Intro of Hadoop (cont.) Input Split Map Shuffle Reduce Output Input Split Map Reduce Output Input Split Map 4 Framework handles most of the execuaon Splits input logically and feeds mappers ParAAons and sorts map outputs (Collect) Transports map outputs to reducers (Shuffle) Merges output obtained from each mapper (Merge) 4

Map 4 Framework handles most of the execuaon Splits input logically and feeds

5 Why Hadoop with Lustre? HPC moving towards Exascale. SimulaAons will only get bigger Need tools to run analyses on resulang massive datasets Natural allies: Hadoop is the most popular sozware stack for big data analyacs Lustre is the file system of choice for most HPC clusters Easier to manage a single storage pla[orm No data transfer overhead for staging inputs and extracang results No need to paraaon storage into HPC (Lustre) and AnalyAcs (HDFS) Also, HDFS expects nodes with locally a]ached disks, while most HPC clusters have diskless compute nodes with a separate storage cluster 5 5

for big data analyacs Lustre is the file system of choice for most HPC clusters Easier to manage a single storage pla[orm No data transfer overhead

6 How to make them cooperate? Hadoop uses pluggable extensions to work with different file system types Lustre is POSIX compliant: Use Hadoop s built- in LocalFileSystem class Uses naave file system support in Java Extend and override default behavior: LustreFileSystem Defines new URL scheme for Lustre lustre:/// Controls Lustre striping info Resolves absolute paths to user- defined directory Leaves room for future enhancements org.apache.hadoop.fs FileSystem RawLocalFileSystem LustreFileSystem 6 Allow Hadoop to find it in config files 6

LocalFileSystem class Uses naave file system support in Java Extend and override default behavior: LustreFileSystem Defines new URL

7 7 Sort, Shuffle & Merge M Number of Maps, R Number of Reduces Map output records (Key- Value pairs) organized into R paraaons ParAAons exist in memory. Records within a paraaon are sorted A background thread monitors the buffer, spills to disk if full Each spill generates a spill file and a corresponding index file Eventually, all spill files are merged (paraaon- wise) into a single file Final index is file created containing R index records Index Record = [Offset, Compressed Length, Original Length] A Servlet extracts paraaons and streams to reducers over HTTP Reducer merges all M streams on disk or in memory before reducing 7

index file Eventually, all spill files are merged (paraaon- wise) into a single file Final index is file created containing R index records Index Record =

8 Sort, Shuffle & Merge (Cont.) Mapper X TaskTracker Reducer Y Sort Merge Servlet Copy Reduce Map Partition 1 Idx 1 Merged Streams Partition 2 Idx 2 Map 1:Partition Y Partition R Idx R Map 2:Partition Y Input Split X Output Index HDFS Map M:Partition Y Output Part Y 8

9 OpAmized Shuffle for Lustre Why? Biggest (but inevitable) bo]leneck bad performance on Lustre! How? Shared File System HTTP transport is redundant 9 How would reducers access map outputs? First Method: Let reducers read paraaons from map outputs directly But, index informaaon sall needed Either, let reducers read index files, as well Results in (M*R) small (24 bytes/record) IO operaaons Or, let Servlet convey index informaaon to reducer Advantage: Read enare index file at once, and cache it Disadvantage: Seeking paraaon offsets + HTTP latency Second Method: Let mappers put each paraaon in a separate file Three birds with one stone: No index files, no disk seeks, no HTTP 9

First Method: Let reducers read paraaons from map outputs directly But, index informaaon sall needed Either, let reducers read index files, as well Results in (M*R)

10 OpAmized Shuffle for Lustre (Cont.) Mapper X Sort Merge Reducer Y Reduce Map Input Split X Map 1:Partition Y Map 2:Partition Y Merged Streams Output Part Y Map X:Partition 1 Map X:Partition Y Map X:Partition R 10 Lustre Map M:Partition Y 10

Map 1:Partition Y Map 2:Partition Y Merged Streams Output

11 Performance Tests Standard Hadoop benchmarks were run on the Rosso cluster ConfiguraAon Hadoop (Intel Distro v1.0.3): 8 nodes, 2 SATA disks per node (used only for HDFS) One with dual configuraaon, i.e. master and slave ConfiguraAon Lustre (v2.3.0): 4 OSS nodes, 4 SATA disks per node (OSTs) 1 MDS, 4GB SSD MDT All storage handled by Lustre, local disks not used 11 11

3): 8 nodes, 2 SATA disks per node (used only for HDFS) One with dual configuraaon, i.e. master and slave ConfiguraAon Lustre (v2.

12 TestDFSIO Benchmark Tests the raw performance of a file system Write and read very large files (35G each) in parallel One mapper per file. Single reducer to collect stats Embarrassingly parallel, does not test shuffle & sort Throughput "! filesize % $ #!time ' & MB/s HDFS Lustre 12 More is better! 20 0 Write Read 12

Single reducer to collect stats Embarrassingly parallel, does not test shuffle &

13 Terasort Benchmark Distributed sort: The primary Map- Reduce primiave Sort a 1 Billion records, i.e. approximately 100G Record: Randomly generated 10 byte key + 90 bytes garbage data Terasort only supplies a custom paraaoner for keys, the rest is just default mapreduce behavior. Block Size: 128M, Maps: 4/node, Reduces: 2/node Lustre HDFS Lustre 10-15% Faster Runtime (seconds) Less is better! 13

custom paraaoner for keys, the rest is just default map- reduce behavior.

14 Work in progress Planned so far More exhausave tesang needed Test at scale: Verify that large scale jobs don t thro]le MDS Port to IDH 3.x (Hadoop 2.x): New architecture, More decoupled Scenarios with other tools in the Hadoop Stack: Hive, HBase, etc. Further Work Experiment with caching Scheduling Enhancements ExploiAng Locality 14 14

x): New architecture, More decoupled Scenarios with other tools in the Hadoop Stack:

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce