Scientific Computing Meets Big Data Technology: An Astronomy Use Case


1 Scientific Computing Meets Big Data Technology: An Astronomy Use Case
Zhao Zhang, AMPLab and BIDS, UC Berkeley
In collaboration with Kyle Barbary, Frank Nothaft, Evan Sparks, Oliver Zahn, David Patterson, Michael Franklin, Saul Perlmutter

2 Berkeley Institute for Data Science
- Founded in 2013 jointly by the Moore and Sloan Foundations
- A central hub of research and education to facilitate and nurture data-intensive science
- Directed by Nobel Laureate Saul Perlmutter

3 The Kira Project
- Kira is an astronomy image processing toolkit built on Apache Spark
- The first collaborative project from BIDS
- A joint work between AMPLab and the Berkeley Center for Cosmological Physics

4 Counting the Stars
[Image; the courtesy credit is truncated in the source]

5 Science and Missions
- Use the galaxy distribution to study the biggest contemporary mysteries in astrophysics: dark matter, dark energy, and the properties of gravity
- Detect near-Earth objects to protect the Earth from collisions
[Two images; the courtesy credits are truncated in the source]

6 The Data Problem

             Sloan Digital Sky Survey   PanSTARRS      Large Synoptic Survey Telescope
Start Date   [...]                      [...]          Coming Soon
Data Rate    170 GB/night               1.2 TB/night   ~20 TB/night
Total Size   116 TB in Release 12       N/A            60 PB raw data, 15 PB catalog
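To get a feel for the scale gap in the table above, the figures can be combined in a short back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 PB = 1,000 TB, 1 TB = 1,000 GB), treats the "~20 TB/night" figure as exact, and assumes the 60 PB of raw data accumulates at that constant nightly rate. None of these assumptions come from the slides.

```python
# Figures taken from the table; unit conversions are assumed to be decimal.
SDSS_GB_PER_NIGHT = 170
LSST_TB_PER_NIGHT = 20        # the slide says "~20TB/night"; treated as exact here
LSST_RAW_TOTAL_PB = 60

# How many times more data LSST produces per night than SDSS.
rate_ratio = (LSST_TB_PER_NIGHT * 1000) / SDSS_GB_PER_NIGHT

# Observing nights needed to accumulate the raw total at a constant rate.
nights_to_total = (LSST_RAW_TOTAL_PB * 1000) / LSST_TB_PER_NIGHT

print(f"LSST nightly rate is about {rate_ratio:.0f}x the SDSS rate")
print(f"About {nights_to_total:.0f} observing nights to reach {LSST_RAW_TOTAL_PB} PB")
```

Under these assumptions the nightly rate grows by roughly two orders of magnitude between the two surveys.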

7 A Typical Supernovae Detection Pipeline
Images -> Source Extraction -> Point Spread Function Estimation -> Image Reprojection -> Image Coaddition -> Source Extraction -> Object Classification -> Catalogs

8 Common Attributes
- Computations are embarrassingly parallel
- Computations are I/O intensive
- Computations are done in memory
- Parallelism is from data
- Built from existing C and Fortran programs
- These programs access data through a POSIX compatible interface
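The "parallelism is from data" attribute can be illustrated with a minimal sketch: the same per-image function is mapped independently over many inputs, with no communication between tasks. The function `extract_sources`, the file names, and the pixel values below are hypothetical stand-ins, not part of Kira or of any existing astronomy program.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_sources(image):
    """Hypothetical stand-in for an existing per-image program.

    Counts pixels brighter than a fixed cutoff; a real pipeline would
    invoke a C or Fortran routine here instead.
    """
    name, pixels = image
    return name, sum(1 for p in pixels if p > 100)


def run_data_parallel(images, workers=4):
    """Apply the per-image function to every image independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract_sources, images))


# Eight made-up "images", each a short list of pixel values.
images = [("img_%d.fits" % i, [i * 10, 150, 90 + i]) for i in range(8)]
catalog = run_data_parallel(images)
print(catalog)
```

Because each image is processed on its own, the work can be spread over any number of workers without changing the per-image code.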

9 Existing Parallel Approaches on Supercomputers
- Scientific Workflow System (Galaxy, HTCondor, Swift, Pegasus)
  - Inflexible programming interface
- Message Passing Interface (MPI)
  - Requires custom data management for efficient I/O
  - Not fault tolerant

10 Big Data and the Cloud
Can scientific computing take advantage of big data technologies such as Spark and HDFS in the cloud?
- Implicit parallelism
- Optimized locality
- Lineage-based resilience

11 The Big Data Ecosystem
- Apache Spark (from AMPLab)
  - A distributed processing engine
  - Rich data flow pattern support
  - Implicit parallelism
  - In-memory computation
  - Lineage-based resilience
- Apache HDFS
  - Locality exposed to clients
  - Replication-based resilience
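The idea behind lineage-based resilience can be sketched in a few lines: each dataset remembers its parent and the function that derived it, so a lost result is recomputed rather than restored from a replica. The `Dataset` class below is a hypothetical toy written for this illustration; it is not the Spark API and omits partitioning, scheduling, and distribution.

```python
class Dataset:
    """Toy dataset that records how it was derived (its lineage)."""

    def __init__(self, compute, parent=None):
        self._compute = compute   # function that rebuilds this dataset
        self.parent = parent      # dataset this one was derived from
        self._cache = None        # in-memory result, may be lost

    def map(self, fn):
        """Derive a new dataset; only the recipe is stored, not the data."""
        return Dataset(lambda data: [fn(x) for x in data], parent=self)

    def collect(self):
        """Return the data, recomputing from lineage if the cache is empty."""
        if self._cache is None:
            source = self.parent.collect() if self.parent else None
            self._cache = self._compute(source)
        return self._cache

    def lose(self):
        """Simulate losing the in-memory copy, for example on node failure."""
        self._cache = None


pixels = Dataset(lambda _: [1, 2, 3])
squared = pixels.map(lambda v: v * v)

first = squared.collect()
squared.lose()                 # drop the cached result
recovered = squared.collect()  # rebuilt by replaying the lineage
print(first, recovered)
```

HDFS takes the other route named on the slide: it keeps multiple copies of each block, so recovery means reading another replica instead of recomputing.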

12 Conventional Wisdom
- JVM is slow, so we should not build scientific applications with Java or Scala.
- Supercomputers have higher bandwidth networks for I/O, so they should run my applications faster.

13 Source Extractor
Source Extraction -> Point Spread Function Estimation -> Image Reprojection -> Image Coaddition -> Source Extraction -> Object Classification

14 Source Extractor Steps
- Background Estimation
- Background Subtraction
- Object Detection through Convolution
- Object Statistics Evaluation
Output catalog columns: Number, X, Y, FLUX, FLUXERR, KRON_R, FLUX_A, FLUX_AERR, FLAGS
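The four steps above can be sketched on a tiny synthetic image. This is a toy illustration under strong simplifying assumptions, not the Source Extractor algorithm: it uses a single global median as the background, a 3x3 box kernel for the convolution, 4-connected grouping for detection, and only a flux sum and flux-weighted centroid as statistics. All function names and the test image are made up for this example.

```python
from statistics import median


def estimate_background(image):
    """Step 1: one global background level (median of all pixels)."""
    return median(v for row in image for v in row)


def subtract_background(image, level):
    """Step 2: remove the background level from every pixel."""
    return [[v - level for v in row] for row in image]


def convolve_box3(image):
    """Step 3a: smooth with a 3x3 box kernel; out-of-bounds pixels count as 0."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if 0 <= y + dy < h and 0 <= x + dx < w:
                        total += image[y + dy][x + dx]
            out[y][x] = total / 9.0
    return out


def detect_objects(smoothed, threshold):
    """Step 3b: group above-threshold pixels into 4-connected objects."""
    h, w = len(smoothed), len(smoothed[0])
    seen, objects = set(), []
    for y in range(h):
        for x in range(w):
            if smoothed[y][x] <= threshold or (y, x) in seen:
                continue
            seen.add((y, x))
            stack, pixels = [(y, x)], []
            while stack:
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen
                            and smoothed[ny][nx] > threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            objects.append(pixels)
    return objects


def evaluate_statistics(subtracted, objects):
    """Step 4: total flux and flux-weighted centroid (assumes nonzero flux)."""
    rows = []
    for number, pixels in enumerate(objects, start=1):
        flux = sum(subtracted[y][x] for y, x in pixels)
        x_c = sum(x * subtracted[y][x] for y, x in pixels) / flux
        y_c = sum(y * subtracted[y][x] for y, x in pixels) / flux
        rows.append({"NUMBER": number, "X": x_c, "Y": y_c, "FLUX": flux})
    return rows


def extract(image, threshold):
    """Run the four steps in order and return the catalog rows."""
    sub = subtract_background(image, estimate_background(image))
    return evaluate_statistics(sub, detect_objects(convolve_box3(sub), threshold))


# Synthetic 9x9 image: flat background of 10 with two point sources.
image = [[10] * 9 for _ in range(9)]
image[2][2] += 90
image[6][6] += 45
catalog = extract(image, threshold=4.0)
for row in catalog:
    print(row)
```

The remaining catalog columns on the slide (FLUXERR, KRON_R, FLUX_A, FLUX_AERR, FLAGS) are left out of this sketch because they depend on noise models and aperture definitions that the slides do not describe.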

15 Kira SE Architecture
[Architecture diagram; the one surviving label reads "A Source Extractor C library"]

16 Kira SE VS. C in the Cloud

                     Kira                         C
Platform             EC2 m2.4xlarge, 8 core CPU   EC2 m2.4xlarge, 8 core CPU
Parallel Framework   Apache Spark                 Shell
Storage              Apache HDFS                  Ext4/GlusterFS

Dataset
- Small: 12GB SDSS DR2 (2,310 files)
- Medium: 65GB SDSS DR2 (11,150 files)
- Large: 1TB SDSS DR7 (176,938 files)

17 Kira SE VS. C Performance Penalty
[Figure: single-node time-to-solution (seconds), Kira SE+Ext4 vs. C+Ext4, on 1, 2, 4, and 8 cores. Dataset: Small 12GB. Annotation: I/O bound, 75% of time is I/O. Surviving bar labels: 408 and 371.]

Cores                 1      2      4     8
Performance Penalty   7.4x   4.5x   3x    2.2x

18 Kira SE VS. C Multiple Node Performance
[Figure: time-to-solution in log scale (seconds), Kira SE+HDFS vs. C+GlusterFS, on 1, 4, 8, 16, 32, and 64 nodes; the data labels are not legible in the source]

              HDFS   GlusterFS
Replication   2      2
Dataset       Medium 65GB

19 Kira SE VS. C One Time Locality Measurement
[Figure: locality hit rate (0% to 100%), Kira SE+HDFS vs. C+GlusterFS, on 1, 4, 8, 16, 32, and 64 nodes]

              HDFS   GlusterFS
Replication   2      2
Dataset       Medium 65GB
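A locality hit rate of the kind plotted on this slide can be defined as the fraction of tasks that run on a node holding a replica of their input. The sketch below is an assumption about how such a metric might be computed; the slides do not give the exact definition used in the measurement. The file names, node names, replica layout, and both placement strategies are hypothetical.

```python
def locality_hit_rate(assignments, replicas):
    """Fraction of tasks placed on a node that stores their input file."""
    hits = sum(1 for task, node in assignments.items() if node in replicas[task])
    return hits / len(assignments)


def schedule_with_locality(replicas):
    """Place each task on the least-loaded node that holds a replica."""
    load, assignments = {}, {}
    for task in sorted(replicas):
        node = min(sorted(replicas[task]), key=lambda n: load.get(n, 0))
        assignments[task] = node
        load[node] = load.get(node, 0) + 1
    return assignments


nodes = ["n0", "n1", "n2", "n3"]

# Replication factor 2: file i is stored on node i and its neighbour.
replicas = {
    "f%d" % i: {nodes[i % 4], nodes[(i + 1) % 4]} for i in range(8)
}

# Locality-aware placement versus a placement that ignores replica locations.
aware = schedule_with_locality(replicas)
blind = {"f%d" % i: nodes[(3 * i) % 4] for i in range(8)}

aware_rate = locality_hit_rate(aware, replicas)
blind_rate = locality_hit_rate(blind, replicas)
print(aware_rate, blind_rate)
```

A scheduler that knows where replicas live can always pick a local node in this toy layout, while a placement that ignores replica locations pays for remote reads on every miss.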

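20 Kira SE VS. C 1TB Dataset Performance
[Figure: time-to-solution (seconds, axis up to 6000), Kira SE+HDFS vs. C+GlusterFS Replicated, at two cluster sizes including 64 nodes; a "Metadata Overhead" segment is marked; the data labels are only partly legible in the source]

              HDFS   GlusterFS-Replicated
Replication   2      2
Dataset       Large 1TB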

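21 Kira SE VS. C 1TB Dataset Performance
[Figure: time-to-solution (seconds, axis up to 6000), Kira SE+HDFS vs. C+GlusterFS Replicated vs. C+GlusterFS, at two cluster sizes including 64 nodes. Fully legible data labels: 2,280, 4,918, 1,295, 1,565; the remaining labels are truncated in the source.]

              HDFS    GlusterFS   GlusterFS-Replicated
Replication   [...]   [...]       [...]
Dataset       Large 1TB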

22 Kira SE VS. C Supercomputer Performance

Platform
  Kira SE+EC2: EC2 m2.4xlarge, 8-core 2.4GHz vCPU, 20MB L3 cache
  C+Edison:    12-core 2.4GHz CPU, 30MB L3 cache
Parallel Framework
  Kira SE+EC2: Apache Spark
  C+Edison:    Shell
Storage
  Kira SE+EC2: Apache HDFS
  C+Edison:    Lustre, 48GB/s bandwidth
Network
  Kira SE+EC2: 10Gb Ethernet
  C+Edison:    Aries interconnect, [...] us latency, ~8GB/s bandwidth
Dataset
  Large: 1TB SDSS DR7 (176,938 files)

23 Kira SE VS. C Supercomputer 1TB Dataset Performance
[Figure: time-to-solution (seconds, axis up to 2000), Kira SE+EC2 vs. C+Edison; the core count and most data labels are truncated in the source; one legible label reads 1,295]

                    Kira SE        C on Edison
Best Performance    1272 seconds   886 seconds
Worst Performance   1327 seconds   2005 seconds
Dataset             Large 1TB
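The best and worst times reported on this slide can be turned into two simple ratios: the run-to-run spread of each system, and the Kira-to-C ratio in the best and worst cases. The short calculation below uses only the four numbers from the slide; the variable names are chosen for this illustration.

```python
# (best, worst) time-to-solution in seconds, as reported on the slide.
runs = {
    "Kira SE on EC2": (1272, 1327),
    "C on Edison": (886, 2005),
}

# Spread: how much slower the worst run is than the best run.
spread = {name: worst / best for name, (best, worst) in runs.items()}

# Kira time divided by C time, for the best and the worst runs.
kira_best, kira_worst = runs["Kira SE on EC2"]
c_best, c_worst = runs["C on Edison"]
best_case_ratio = kira_best / c_best
worst_case_ratio = kira_worst / c_worst

for name, value in spread.items():
    print(f"{name}: worst run is {value:.2f}x the best run")
print(f"Best case:  Kira takes {best_case_ratio:.2f}x the C time")
print(f"Worst case: Kira takes {worst_case_ratio:.2f}x the C time")
```

The Kira runs vary by about 4%, while the Edison runs vary by more than a factor of two, which is the variability that the next slide attributes to a shared network.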

24 Conventional Wisdom Revisited
- JVM is slow, so we should not build scientific applications with Java or Scala.
  - The JVM is slow, but other bottlenecks can dominate CPU time
- Supercomputers have higher bandwidth networks for I/O, so they should run my applications faster.
  - With big data frameworks, the locality optimization makes the network less important
  - On a shared network, the performance of data-intensive applications can be highly variable

25 Conclusions
- Kira SE demonstrates linear scalability with both data size and cluster size
- Due to its superior data locality, Kira SE runs up to 3.7x faster than the equivalent C implementation on GlusterFS at scale
- Kira SE on Amazon EC2 performs comparably to the C version on the NERSC Edison supercomputer
- Leveraging a big data platform such as Spark/HDFS would enable scientists to benefit from the rapid pace of innovation and the large range of systems that are being driven by widespread interest in big data analytics

26 Future Work
- Move beyond source extraction and implement the rest of the pipeline
- Look into fine-grained lineage information of the pipeline to help users debug code and data
