A RAM-disk provisioning service for high performance data analysis

Allan Espinosa
Mentors: M. Woitaszek and J. Dennis
University of Chicago, National Center for Atmospheric Research
July 29

Outline
1. Motivation: data analysis
2. Approach and challenges
3. Implementation
4. Target applications
5. Conclusions

Motivation: data-intensive post-processing
[Diagram labels: computing center; simulation results; analysis cluster; transfer nodes; Analysis 1, Analysis 2, ..., Analysis n; spinning disk-based parallel file system; tape archive]
Multiple trips to disk are slow

Approach: Run analysis on RAM
Fast I/O access
  tmpfs or formatted /dev/ram on one analysis node. Problem: restricted parallelism
  NFS-exported RAM. Problem: restricted data size
  Split data over multiple nodes. Problem: requires thorough I/O management
  Lustre parallel RAM file system
[Diagrams: CPUs attached to RAM-based disks in each configuration]

Solution: Automatically-provisioned parallel file system
[Diagram labels: Kraken; file system; transfer node; WAN; Polynya analysis cluster; user client; submit jobs; scheduler; control node; transfer node; analysis nodes; archive node; parallel RAM file system; tape archive]

Remote triggering the workflow
Kraken: simulation finishes, trigger workflow
Polynya workflow: request space, transfer datasets, run analysis, archive datasets, trigger cleanup

Requesting RAM-based disk space
Implementation: PBS Torque+Maui scheduler generic resource
Parameters: amount of space, duration of allocation
Steps:
1. Route to control node
2. Prepare space
3. Sleep until allocation expiration
4. Notice before expiration
5. Clean up space
Job script:
    #PBS -W x="gres:ramdisk@25"
    #PBS -l walltime="48:00:00"
    #PBS -q ramdisk_service
    #PBS -l prologue=allocate.sh
    #PBS -l epilogue=cleanup.sh
    sleep 45h
    mail user@cluster...
    sleep 3h
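
The job script holds the allocation by sleeping for 45 hours, mailing a notice, and then sleeping for 3 more hours, which together fill the 48-hour walltime. A minimal sketch of how the two sleep durations could be derived from a requested allocation length follows. The function name `split_allocation` and the idea of parameterizing the notice lead time are assumptions for illustration, not part of the original service.

```shell
#!/bin/sh
# Hypothetical helper: split an allocation (in hours) into the sleep before
# the expiration notice and the sleep after it.
# Usage: split_allocation TOTAL_HOURS NOTICE_HOURS
split_allocation() {
    total=$1
    notice=$2
    # The notice lead time must fit inside the allocation.
    if [ "$notice" -ge "$total" ]; then
        echo "notice lead time must be shorter than the allocation" >&2
        return 1
    fi
    echo "$((total - notice))h ${notice}h"
}

# Values from the slide: 48-hour walltime, notice 3 hours before expiry.
split_allocation 48 3   # prints "45h 3h"
```

The two values printed correspond to the `sleep` arguments before and after the `mail` line.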

Transferring datasets
Implementation: route request to transfer nodes
  Striped GridFTP data nodes
  Co-located as RAM-based disk space provider
Other administrative components:
  GridFTP control channel server
  Key-authenticated SSH
  X509-authenticated GRAM5
  Remote trigger mechanism

Example application: AMWG diagnostics
Compares CESM simulation data, observational data, reanalysis data
Parallel implementation in Swift (parallel scripting engine)
Parameters: dataset name, number of time segments (years)
Dataset volume: 2.8 GB per year (1 data)
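
As a rough sizing aid, the per-year volume can be turned into an estimate of how much RAM-disk space the simulation data for a run needs. This is a minimal sketch, assuming space is requested in whole gigabytes and rounded up. The function name `estimate_space_gb` is hypothetical, and observational or reanalysis data would add to the total.

```shell
#!/bin/sh
# Hypothetical helper: estimate space (whole GB, rounded up) for N years of
# simulation data at 2.8 GB per year, the figure quoted on the slide.
# Integer arithmetic in tenths of a GB avoids floating point in the shell.
estimate_space_gb() {
    years=$1
    tenths=$((years * 28))        # 2.8 GB/year expressed in tenths of a GB
    echo $(((tenths + 9) / 10))   # round up to the next whole GB
}

estimate_space_gb 2   # 5.6 GB rounds up to 6
estimate_space_gb 5   # 14.0 GB stays 14
```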

Data movement benchmarks
Units in MB/s, from D. Duplyakin's experiments

File system    IOR-8 write
/dev/null      3,190
Lustre disk    111
tmpfs RAM      2,983
XFS RAM        2,296
Lustre RAM     2,881

GridFTP to Polynya (from Frost, from Kraken): 32 MB TCP buffer, 16 MB block size; 4 streams, then 16 streams
GridFTP from Kraken to Frost: 216 MB/s
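
The GridFTP settings listed for these benchmarks (32 MB TCP buffer, 16 MB block size, 16 streams) map onto standard `globus-url-copy` options. The sketch below only prints the command rather than running it. The endpoint URLs and the function name `gridftp_command` are placeholders, and whether the benchmarks used exactly these flags is an assumption.

```shell
#!/bin/sh
# Print (do not run) a globus-url-copy command line for a striped transfer.
# Sizes are given in megabytes and converted to bytes for -tcp-bs and -bs.
# Usage: gridftp_command STREAMS TCP_BUFFER_MB BLOCK_MB SRC_URL DST_URL
gridftp_command() {
    streams=$1
    tcp_bytes=$(($2 * 1024 * 1024))
    block_bytes=$(($3 * 1024 * 1024))
    echo "globus-url-copy -stripe -p $streams -tcp-bs $tcp_bytes -bs $block_bytes $4 $5"
}

# Placeholder endpoints; substitute real GridFTP servers and paths.
gridftp_command 16 32 16 \
    gsiftp://source.example.org/data/run1.nc \
    gsiftp://dest.example.org/ramdisk/run1.nc
```

Changing the first argument from 16 to 4 gives the lower-stream configuration that was also measured.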

Application performance
Ran on a 64-CPU node, 2-year time segment (8.2 GB total)

File system    Runtime (s)
Lustre disk    213
tmpfs RAM      29
XFS RAM        29
Lustre RAM
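
The runtimes reported here imply a speedup of roughly 7x for the tmpfs and XFS RAM file systems over Lustre on disk. The arithmetic is shown as a small sketch; the function name `speedup` is illustrative, and `awk` handles the floating-point division.

```shell
#!/bin/sh
# Speedup of a RAM-backed run relative to the disk-backed baseline.
# Usage: speedup BASELINE_SECONDS CANDIDATE_SECONDS
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.1f\n", base / cand }'
}

# Lustre disk (213 s) versus tmpfs RAM or XFS RAM (29 s each).
speedup 213 29   # prints 7.3
```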

Application performance
[Chart: data transfer from Frost and AMWG analysis time (s) for Lustre disk, tmpfs RAM, XFS RAM, Lustre RAM]

End-to-end workflow
[Timeline chart, time (s): request space; transfer; Analysis 1, Analysis 2, ..., Analysis n; archive; cleanup]

Other use case: Interactive jobs
Automated workflow split component-wise
Each step is run by the user manually
Steps:
1. Request space
2. Transfer data to allocated space (globus-url-copy or Globus Online)
3. Run analysis on allocated space
4. Notice before expiration
5. Clean up by deleting request job
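
The manual steps could look roughly like the following session. This is a dry-run sketch that only prints commands. The script names `request_space.pbs` and `run_analysis.sh`, the URL placeholders, and the job ID are hypothetical; `qsub`, `globus-url-copy`, and `qdel` are the standard Torque and GridFTP command-line clients.

```shell
#!/bin/sh
# Dry run: print the commands a user would issue for each interactive step.
# All file names, URLs, and the job ID below are placeholders.
# Usage: interactive_steps JOB_ID
interactive_steps() {
    jobid=$1
    echo "qsub request_space.pbs"            # step 1: request space
    echo "globus-url-copy SRC_URL DST_URL"   # step 2: transfer data
    echo "sh run_analysis.sh"                # step 3: run analysis
    # Step 4 (notice before expiration) arrives by mail; no command needed.
    echo "qdel $jobid"                       # step 5: delete the request job
}

interactive_steps 12345
```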

Conclusions
End-to-end analysis platform without touching spinning disk
Access through the familiar PBS interface
Workflow automation to drive analysis
Network bandwidth critical to performance
Future work:
  Tune network for high performance data movement
  Application-perspective file system scalability
  Explore framework on other resources: disk, bandwidth, etc.

Questions?
Allan Espinosa (aespinosa@cs.uchicago.edu)