Green-Marl: A DSL for Easy and Efficient Graph Analysis



Transcription:

Green-Marl: A DSL for Easy and Efficient Graph Analysis. Sungpack Hong*, Hassan Chafi*†, Eric Sedlar†, and Kunle Olukotun*. *Pervasive Parallelism Lab, Stanford University; †Oracle Labs

Graph Analysis
Classic graphs; new applications: Artificial Intelligence, Computational Biology, SNS apps (LinkedIn, Facebook, ...). Graph analysis: a process of drawing out further information from a given graph data-set.
Example (movie database): What would be the avg. hop-distance between any two (Australian) actors? Is Kevin Bacon a central figure in the movie network? How much? Do these actors (Ben Stiller, Jack Black, Owen Wilson) work together more frequently than others? (Example graph nodes include Sam Worthington, Linda Hamilton, James Cameron, Sigourney Weaver, Aliens, Avatar.)

More Formally
Graph data-set: a graph G = (V, E) capturing arbitrary relationships (E) between data entities (V), plus properties. A property P is any extra data associated with each vertex or edge of G (e.g. the name of a person). Your data-set = (G, Π) = (G, P1, P2, ...).
Graph analysis on (G, Π):
- Compute a scalar value, e.g. avg. distance, conductance, eigenvalue, ...
- Compute a (new) property, e.g. (max) flow, betweenness centrality, PageRank, ...
- Identify a specific subset of G, e.g. minimum spanning tree, connected components, community structure detection, ...

The Performance Issue
Traditional single-core machines showed limited performance on graph analysis problems: a lot of random memory accesses plus data that does not fit in cache, so performance is bound by memory latency. Conventional hardware (e.g. floating-point units) does not help much. Instead, use parallelism to accelerate graph analysis: there is plenty of data-parallelism in large graph instances, and performance then depends on memory bandwidth, not latency. Exploit modern parallel computers: multi-core CPU, GPU, Cray XMT, clusters, ...

New Issue: Implementation Overhead
It is challenging to implement a graph algorithm correctly + efficiently + while applying parallelism + differently for each execution environment. Are we really expecting a single (average-level) programmer to do all of the above?

Our Approach: DSL
We design a domain-specific language (DSL) for graph analysis. The user writes his/her algorithm concisely in our DSL, and the compiler translates it into the target language (e.g. parallel C++ or CUDA) via source-to-source translation. The compiler exploits (1) inherent data-parallelism, (2) good implementation templates, and (3) high-level optimization, turning an intuitive description of a graph algorithm into an efficient (parallel) implementation. For example, the DSL statement
  Foreach (t: G.Nodes) t.sigma += ...
is translated into target-language code of the form
  for (i = 0; i < G.numNodes(); i++) { fetch_and_add(G.nodes[i], ...); }
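To make the translation idea concrete, here is a hedged sketch in Python (the paper's actual targets are parallel C++ and CUDA; Python is used throughout this page only for illustration, and the graph, the weight function `w`, and all names below are assumptions, not the compiler's output). The DSL accumulation `Foreach (t: G.Nodes) sigma += w(t)` is a data-parallel loop whose updates must be atomic, which the generated C++ expresses with fetch_and_add; here a lock stands in for that atomic.

```python
import threading

# Toy "graph": nodes are 0..N-1; w(t) is an arbitrary per-node contribution.
N = 1000
w = lambda t: t % 7

# DSL-level semantics: Foreach (t: G.Nodes) sigma += w(t)
sigma_reference = sum(w(t) for t in range(N))

# Generated-code-level sketch: parallel loop + atomic accumulation
# (the lock emulates the fetch_and_add the C++ back-end would emit).
sigma = 0
lock = threading.Lock()

def worker(lo, hi):
    global sigma
    for t in range(lo, hi):
        with lock:  # emulates fetch_and_add(&sigma, w(t))
            sigma += w(t)

threads = [threading.Thread(target=worker, args=(i * 250, (i + 1) * 250))
           for i in range(4)]
for th in threads: th.start()
for th in threads: th.join()

assert sigma == sigma_reference
```

Without the atomic update, concurrent `+=` on shared state could lose increments; that is exactly the hazard the compiler-inserted fetch_and_add avoids.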

Example: Betweenness Centrality
Betweenness centrality (BC): a measure that tells how central a node is in the graph; used in social network analysis. Definition: how many shortest paths between any two nodes go through this node. (Low BC vs. high BC; e.g. Kevin Bacon, Ayush K. Kehdekar.) [Image source: Wikipedia]

Example: Betweenness Centrality [Brandes 2001]
Init BC for every node and begin the outer loop (over sources s). The algorithm uses queues, lists, and a stack; it looks complex. Is this parallelizable? Forward phase (BFS order): compute sigma from parents. Backward phase (reverse BFS order): compute delta from children and accumulate delta into BC.

Example: Betweenness Centrality [Brandes 2001]
(Diagram: in BFS order, compute sigma from parents; in reverse BFS order, compute delta from children.)

Example: Betweenness Centrality [Brandes 2001]
(The same diagram, annotated with the parallelism available: parallel iteration, parallel assignment, and parallel BFS in the forward phase computing sigma from parents; reverse BFS order computing delta from children, with a reduction.)

DSL Approach: Benefits
Three benefits: productivity, portability, and performance.

Productivity Benefits
A common limiting resource in software development is your brain power (i.e. how long can you focus?). A C++ implementation of BC from SNAP (a parallel graph library from GT) is ~400 lines of code (with OpenMP), vs. 24 LOC in Green-Marl*.
*Green-Marl (그린 말) means "Depicted Language" in Korean.

Productivity Benefits

Procedure      Manual LOC   Green-Marl LOC   Source     Misc
BC             ~400         24               SNAP       C++, OpenMP
Vertex Cover   71           21               SNAP       C++, OpenMP
Conductance    42           10               SNAP       C++, OpenMP
Page Rank      75           15               http://..  C++, single thread
SCC            65           15               http://..  Java, single thread

It is about more than LOC: focusing on the algorithm, not its implementation; more intuitive and less error-prone; rapidly explore many different algorithms.

Portability Benefits (On-going Work)
Multiple compiler targets, selected by a command-line argument. From one DSL description, the DSL compiler emits: (parallelized) C++ for the SMP back-end; CUDA for the GPU back-end (* for small instances; we know some tricks [Hong et al. PPOPP 2011]); and code for the cluster back-end (* for large instances) that works on the Pregel API [Malewicz et al. SIGMOD 2010]. Each target is paired with its own library (& runtime).

Performance Benefits
Back-end-specific optimization: optimized data structures & code templates. Compiler pipeline: parsing & checking, then architecture-independent optimization, then architecture-dependent optimization, then code generation, all using high-level semantic information. Inputs: Green-Marl code, the target architecture (SMP? GPU? distributed?), a threading library (e.g. OpenMP), and a graph data structure. Output: target code (e.g. C++).

Arch-Indep-Opt: Loop Fusion

Foreach(t: G.Nodes) t.a = t.c + 1;
Foreach(s: G.Nodes) s.b = s.a + s.c;

A C++ compiler cannot merge the equivalent low-level loops, because independence is not guaranteed:

Map<Node, int> A, B, C;
List<Node>& Nodes = G.getNodes();
List<Node>::iterator t, s;
for(t = Nodes.begin(); t != Nodes.end(); t++) A[*t] = C[*t] + 1;
for(s = Nodes.begin(); s != Nodes.end(); s++) B[*s] = A[*s] + C[*s];

The Green-Marl compiler knows G.Nodes is a set of nodes (elements are unique), so it can fuse the loops:

Foreach(t: G.Nodes) { t.a = t.c + 1; t.b = t.a + t.c; }

Optimization enabled by high-level (semantic) information.
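A minimal sketch of why the fusion is legal (illustrative only; properties are modeled as arrays indexed by node id, and the names mirror the slide): because each node appears exactly once in G.Nodes, `A[t]` is already final when the fused body reads it.

```python
# Illustrative-only sketch of the loop-fusion transformation.
nodes = list(range(8))
C = [n * 2 for n in nodes]

# Before fusion: two passes over G.Nodes.
A1 = [0] * len(nodes); B1 = [0] * len(nodes)
for t in nodes:
    A1[t] = C[t] + 1
for s in nodes:
    B1[s] = A1[s] + C[s]

# After fusion: one pass. Legal because node ids are unique, so each
# iteration's write to A is the only one its read of A depends on.
A2 = [0] * len(nodes); B2 = [0] * len(nodes)
for t in nodes:
    A2[t] = C[t] + 1
    B2[t] = A2[t] + C[t]

assert A1 == A2 and B1 == B2
```

Fusing halves the number of passes over the node set, which matters when the bottleneck is memory traffic rather than compute.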

Arch-Indep-Opt: Flipping Edges
A graph-specific optimization. "Counting, at each node, the incoming neighbors whose B value is positive":

Foreach(t: G.Nodes)
    Foreach(s: t.innbrs)(s.b > 0)
        t.a += 1;

becomes "adding 1 to all outgoing neighbors, if my B value is positive":

Foreach(t: G.Nodes)(t.b > 0)
    Foreach(s: t.outnbrs)
        s.a += 1;

Why? Reverse edges may not be available, or may be expensive to compute. Optimization using a domain-specific property.
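A sketch checking that the flipped form computes the same result while touching only forward edges (the toy graph and B values are illustrative assumptions):

```python
# Directed toy graph as out-adjacency lists.
out_nbrs = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
B = {0: 1, 1: -1, 2: 5, 3: 2}
nodes = out_nbrs.keys()

# Original form: needs in-neighbors, i.e. reverse edges.
in_nbrs = {v: [u for u in nodes if v in out_nbrs[u]] for v in nodes}
A_in = {v: 0 for v in nodes}
for t in nodes:
    for s in in_nbrs[t]:
        if B[s] > 0:
            A_in[t] += 1

# Flipped form: only forward edges are traversed.
A_out = {v: 0 for v in nodes}
for t in nodes:
    if B[t] > 0:
        for s in out_nbrs[t]:
            A_out[s] += 1

assert A_in == A_out
```

Note the trade: the flipped form's `s.a += 1` writes to neighbors, so a parallel back-end must make that update atomic, whereas the original form only writes to the loop's own node.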

Arch-Dep-Opt: Selective Parallelization
The compiler flattens nested parallelism with a heuristic. Three levels of nested parallelism + reductions:

Foreach(t: G.Nodes) {
    Foreach(s: G.Nodes)(s.X > t.y) {
        Foreach(r: s.nbrs) {
            s.a += r.b;
        }
        t.c *= s.a;
    }
    val min= t.c;
}

The compiler chooses the parallel region heuristically. [Why?] The graph is large, the core count is small, and there is overhead for parallelization. Keeping only the middle loop parallel, the reductions in the serialized loops become normal reads & writes:

For(t: G.Nodes) {
    Foreach(s: G.Nodes)(s.X > t.y) {
        For(r: s.nbrs) {
            s.a = s.a + r.b;
        }
        t.c *= s.a;
    }
    val = (t.c < val) ? t.c : val;
}

Optimization enabled by both architectural and domain knowledge.
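A small sketch of the last point, that serializing a loop lets a reduction decay into a plain read & write (the data is illustrative): `val min= t.c` needs atomic-min semantics when the t-loop runs in parallel, but once the compiler serializes the t-loop an ordinary comparison suffices.

```python
import random

cs = [random.randint(0, 100) for _ in range(50)]

# Semantic specification: a min-reduction over all iterations.
val_reduced = min(cs)

# Serialized form the compiler can emit: normal read & write.
val = float("inf")
for c in cs:
    val = c if c < val else val   # (t.c < val) ? t.c : val

assert val == val_reduced
```

Dropping the atomic removes synchronization cost from every iteration of the hot loop, which is exactly the payoff the heuristic is after.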

Code-Gen: Saving DownNbrs in BFS
Prepare a data structure for reverse BFS traversal during the forward traversal, but only if required:

InBFS(t: G.Nodes From s) {
    ...
}
InRBFS {
    Foreach (s: t.downnbrs) ...
}

The compiler detects that down-neighbors are used in the reverse traversal, so the generated forward-BFS code saves the edges to down-neighbors, and the generated reverse traversal iterates only those edges:

// Forward BFS (generated); k is an out-edge of s
for (k ...) {
    node_t child = get_node(k);
    if (is_not_visited(child)) {
        ...; // normal BFS code here
        edge_bfs_child[k] = true;
    }
}

// Reverse BFS (generated); k is an out-edge of s
for (k ...) {
    if (!edge_bfs_child[k]) continue;
    ...
}

Optimization enabled by code analysis (i.e. no BFS library could do this automatically).
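A sketch of the idea (names are illustrative; this version marks every edge that descends one BFS level, a slight generalization of the first-visit marking shown in the transcribed snippet): the forward pass records down-edges so the reverse pass never scans edges that do not descend a level.

```python
from collections import deque

adj = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}

def bfs_with_down_edges(adj, root):
    level = {root: 0}
    down_edge = set()          # plays the role of edge_bfs_child[]
    order, q = [], deque([root])
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                q.append(v)
            if level[v] == level[u] + 1:
                down_edge.add((u, v))   # edge goes one level down
    return order, level, down_edge

order, level, down_edge = bfs_with_down_edges(adj, 0)
assert down_edge == {(0, 1), (0, 2), (1, 3), (2, 3)}

# Reverse traversal touches only the saved down-edges.
for u in reversed(order):
    for v in adj[u]:
        if (u, v) not in down_edge:
            continue
        assert level[v] == level[u] + 1
```

Skipping non-down edges in the reverse phase is what a generic BFS library cannot do for you, since it does not know the reverse traversal exists.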

Code-Gen: Code Templates
Data structures: the graph is similar to a conventional graph library; collections use a custom implementation. Code-generation templates: BFS follows Hong et al. PACT 2011 (for CPU and GPU), and better implementations can be adopted transparently as they arrive; DFS is inherently sequential. The compiler thus also takes any benefit that a (template) library would give.

Experimental Results
Betweenness centrality implementations: (1) [Bader and Madduri ICPP 2006]; (2) [Madduri et al. IPDPS 2009], which applies some new optimizations and improved performance over (1) by ~2.3x on a Cray XMT. The parallel implementation available in the SNAP library (for x86) is based on (1), not (2). Our experiment: start from the DSL description (as shown previously) and let the compiler apply the optimizations of (2) automatically.

Experimental Results
(Charts: parallel performance, and the effects of other optimizations: flipping edges, saving BFS children.) Setup: Nehalem (8 cores x 2 HT), 32M nodes, 256M edges (two different synthetic graphs). The charts show speed-up over the baseline, SNAP (single thread). Better single-thread performance comes from (1) efficient BFS code and (2) no unnecessary locks.

Other Results
Conductance: performance similar to the manual implementation (loop fusion, privatization). Vertex cover: test and test-set, privatization; the original code had a data race, and the naive correction (omp_critical) caused serialization.

Other Results
PageRank: compared against a sequential implementation. Strongly connected components: DFS + BFS, so the max speed-up is 2 (Amdahl's Law). Automatic parallelization goes only as far as the exposed data parallelism (i.e. there is no black magic).
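For reference, the PageRank computation mentioned above can be sketched as a power iteration (the damping factor, iteration count, toy graph, and dangling-node handling here are illustrative choices, not the paper's benchmark code):

```python
def pagerank(out_nbrs, d=0.85, iters=50):
    """Power-iteration PageRank on out-adjacency lists."""
    nodes = list(out_nbrs)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}
        for u in nodes:
            if out_nbrs[u]:
                share = d * rank[u] / len(out_nbrs[u])
                for v in out_nbrs[u]:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

g = {0: [1], 1: [2], 2: [0], 3: [2]}
pr = pagerank(g)
assert abs(sum(pr.values()) - 1.0) < 1e-9   # ranks form a distribution
assert pr[2] > pr[3]                        # node 2 receives more links
```

The inner loop over `out_nbrs[u]` is the data-parallel part; the slide's "no black magic" point is that the achievable speed-up is bounded by exactly that exposed parallelism.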

Conclusion
Green-Marl: a DSL designed for graph analysis, with three benefits: productivity, performance, and portability (soon).
Project page: ppl.stanford.edu/main/green_marl.html
GitHub repository: github.com/stanford-ppl/green-marl