Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

Similar documents

Green-Marl: A DSL for Easy and Efficient Graph Analysis

FPGA-based Multithreading for In-Memory Hash Joins

Multi-Threading Performance on Commodity Multi-Core Processors

Big Graph Processing: Some Background

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Parallel Programming Survey

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

Parallel Algorithm Engineering

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Graph Analytics in Big Data. John Feo Pacific Northwest National Laboratory

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Clustering Billions of Data Points Using GPUs

Accelerating In-Memory Graph Database traversal using GPGPUS

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Next Generation GPU Architecture Code-named Fermi

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Load Balancing and Dynamic Parallelism for Static and Morph Graph Algorithms

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

HPC with Multicore and GPUs

Stream Processing on GPUs Using Distributed Multimedia Middleware

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Computer Graphics Hardware An Overview

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

High Performance Computing in CST STUDIO SUITE

GPGPU accelerated Computational Fluid Dynamics

Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Introduction to GPU Programming Languages

Accelerating CFD using OpenFOAM with GPUs

Benchmarking Cassandra on Violin

MapGraph. A High Level API for Fast Development of High Performance Graphic Analytics on GPUs.

Rethinking SIMD Vectorization for In-Memory Databases

Parallel Firewalls on General-Purpose Graphics Processing Units

CHAPTER 1 INTRODUCTION

ultra fast SOM using CUDA

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Introduction to Cloud Computing

Why the Network Matters

GPUs for Scientific Computing

Final Project Report. Trading Platform Server

Introduction to GPU Computing

STINGER: High Performance Data Structure for Streaming Graphs

Enabling Technologies for Distributed and Cloud Computing

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

PHAST: Hardware-Accelerated Shortest Path Trees

Choosing a Computer for Running SLX, P3D, and P5

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Turbomachinery CFD on many-core platforms experiences and strategies

Multi-core and Linux* Kernel

Scaling from Datacenter to Client

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

Optimizing Code for Accelerators: The Long Road to High Performance

GPU Computing - CUDA

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Parallel Simplification of Large Meshes on PC Clusters

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

A1 and FARM scalable graph database on top of a transactional memory layer

Binary search tree with SIMD bandwidth optimization using SSE

Performance Analysis of Flash Storage Devices and their Application in High Performance Computing

bigdata Managing Scale in Ontological Systems

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

Comparison of Hybrid Flash Storage System Performance

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

How To Compare Amazon Ec2 To A Supercomputer For Scientific Applications

YALES2 porting on the Xeon- Phi Early results

Introduction to GPGPU. Tiziano Diamanti

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

HIGH PERFORMANCE BIG DATA ANALYTICS

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model

Introduction to GPU Architecture

Hardware Acceleration for CST MICROWAVE STUDIO

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Overlapping Data Transfer With Application Execution on Clusters

Parallel Computing with MATLAB

High Performance Computing in the Multi-core Area

Understanding the Benefits of IBM SPSS Statistics Server

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

Transcription:

Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun

Graph and its Applications Graph Fundamental data structure G = (N,E): Arbitrary relationship (E) between data entities (N) Wide range of Applications Scheduling task graphs PDE (Partial Differential Equation) solver on mesh Artificial Intelligence Bayesian network Bioinformatics molecular interaction graph Social network analysis Web graphs Graph database schema-less data management Requires large graph analysis

Performance Issues Single-core machines showed limited performance for large graph analysis problems A lot of random memory accesses + Data does not fit in cache Performance is bound to memory latency Conventional hardware units (e.g. floating point, branch predictors, out-of-order) do not help much Use parallelism to accelerate graph analysis Plenty of data-parallelism in large graph instances Latency bound Bandwidth bound Exploit recent proliferation of parallel computers: Multi-core CPU and GPU

Graph Exploration Breadth first search (BFS) A systematic way to traverse the graph A building block for many other algorithms s-t connectivity, betweeness centrality, connected component, community detection, max-flow Can be parallelized (c.f. depth first search) More about this in the next slide Many previous researches on implementation For various architectures: Cluster, Cell, Cray, Multicore/SMP, GPU, Preferred as parallel benchmark See graph500.org

Parallel BFS Algorithm Start from a root, and visit all the connected nodes in a graph Nodes closer to the root are visited first Nodes of the same hop-distance (level) from the root can be visited in parallel 2 1 1 0 1 1 2 2 2 2 2 Three Node-sets Nodes of the current level Neighbors of current level nodes Synchronization at the end of each level Add non-visited neighbors to Next and Visited set

Implementation for Multi-Core CPU Level Synchronous Parallel BFS Requires synchronization at everyl evel Degree of parallelism limited by (# nodes) in each level State-of-Art Implementation [Agarwal et. al. SC 2010] V bitmap Maximize cache hit ratio Atomic update required: test and test-and-set C, N queue Local Queue + Global Queue Outperformed previous implementations Complex queue implementation based on ticket-lock and fast forwarding Not so much details revealed in their paper Avoid unnecessary cache-to-cache traffic

Can we do better? Issues Requires complex queue implementation Can we do better even without it? Our two implementations Queue-Based Implementation Approximate Agarwal et. al. s approach Bitmap Test and Test-and-Set Local Q + Global Q Standard Queue Another implementation Exploit properties of the graphs Exploit properties of the machines

Observation on Graphs Small-World Property [Watts and Strogatz, Nature 1998] Any randomly-shaped graphs has a small diameter ( Six-degrees of separation ) A fundamental property : web graphs, social graphs, molecular graphs, [Corollary] There must be at least one level that has O(N) nodes. Regular Graph: Diameter O(N) Adding Ramdom re-wiring: Diameter O(c) Total execution time is governed by these critical levels (e.g.) Number of nodes at each BFS level (16 million node graph)

Read-based implementation Another implementation of ours V: Bitmap C, N: Level-Array A single O(N)-sized array that keeps the level of each node 0 1 2 N-1 inf inf inf inf inf inf 0 inf inf inf Initialize 1 Level 0 Read the entire array! while (!finished) { foreach (c: G.Nodes) { if (level[c]!= curr_lev) continue; } lev++; } 1 0 inf 1 inf 0 N-2 Level 1 Iterate nodes in Current set 1 0 2 1 2 Adding nodes to Next set N-1 2 Instead of keeping queues, update the value in the level array.

What s the benefit of that? 10x Diff (1) The array is read sequentially (1)-b Overall access pattern become more sequential as well (2) There are only a few level; In critical levels, you have to visit O(N) nodes anyway. But cannot eliminate all the natural random accesses. Current Queue 0 4 2 9 7 1 0 2 4 Level List 0 1 2 3 4 I 1 1 1 N F 1 I N F I N F 1 0 1 0 1 2 4 Packed Adjacency List a b c h i j k n o p q Packed Adjacency List a b c d e f g h i j k n o p q (a) Data-Access Pattern of Queue-Based Method (b) Data-Access Pattern of Read-Based Method

Queue-Based vs. Read-Based Level-wise execution time breakdown Small increase in non-critical levels (1,2, 6 and 7) (e.g.) Number of nodes at each BFS level (16 million node graph) Reduction in critical levels (3, 4 and 5)

What about big-world graphs? Worst-case inputs for Read-based method: 1. High-diameter graphs Recent graph applications (e.g. social network) deal with small-world graphs more frequently Still, there are high-diameter graphs: e.g. mesh 2. Small search instance When the graph is not (strongly) connected Your traversal finishes after visiting only small portion of the graph

Preventing worst case execution Our solution: hybrid method Choose appropriate method (Read or Queue), adaptively at each level Based on the size of Next set and its growth rate. Finite State Machine Go Parallel (with Queues) when there are enough # nodes. If Next is large enough (e.g. p% of num nodes) or exponential growing, migrate to Read method Process the root node, sequentially. Return to Queue method when Next is shrinking Transient state: Read from Queue, Write to Array

Result: worst-case avoidance BFS on tree Y-axis: time (high is bad) Mix of large search instances (good for Read) and small search instances (good for Queue) The FSM allows best of both methods Tree Large search instances Small search instances

Result: worst-case avoidance 2-D Mesh 4000x4000 Diameter is O(sqrt(N)) (# nodes) at each level increases not exponentially, but linearly Read-based method showed a lot of overheads Hybrid Queue+Read method avoids it

Graph Exploration on GPU GPU Benefits Large memory bandwidth (GDDR, # channels) Massively parallel hardware HW multi-threading + SIMD(/SIMT) HW Traits similar to Cray-XMT But much cheaper GPU Issues Limited capacity (~ a few GB) Our approach: Use GPU, only if the graph fits Use multi-core CPU, otherwise But how much performance does this give?

Graph Exploration on GPU BFS on GPU [Harish and Narayanan, HiPC 2007], [Hong et al, PPoPP2011] Similar to Queue-based implementation Visited, Next, Current Level Array If level[node] is INF, then node is not visited Hard to do bitwise atomic operation efficiently on GPU A node can be written multiple times by different parents Okay, because the written level value is always same 1 1 2... But it has the same issue as Queue-based method Bad for small or long-diameter graphs

Hybrid CPU+GPU An extension to the previous FSM Next T 4 SEQ Next > T 1 QUEUE FOR GPU Migrate to GPU, only if the graph is growing exponentially QUEUE Next > α * Curr and Next >T 4 CopyBack InitGPU Need data copy to GPU. (i.e. ship out the contents in the queue) Otherwise, go back to CPUbased FSM Finished GPU Copy back level values from GPU

GPU: Worst-case avoidance BFS on tree with GPU Small Searches GPU: Worse for small instances (fixed cost) Hybrid method avoids worst-case Large Searches

Experiments on Small-world Graphs Multi-Core CPU GPU Intel Nehalem (X5550) 2.67GHz 2 Socket x 4 Core x 2 HT LLC: 8MB x 2 Main Memory: 24GB Nvidia Fermi (C2050) 1.15GHz 14 SM x (2 warps) x 32 SIMT LLC: 2MB Main Memory: 3GB Measurement Start from multiple root nodes Measure average execution time from multiple executions Graphs Two kinds of widely accepted synthetic graphs Random (Erdos-Renyi) Simple uniform random RMAT Skewed degree distribution (good) Many (~50%) unconnected nodes (bad) 32mil nodes, 256 mil edges

Performance Results Multi-Core CPU Result y-axis: processing rate (Higher is better) SC10-EP: numbers from [Agarwal et. al SC10] Measured for same sized graph on a faster (2.9Ghz) machine Read+Queue gives small performance improvement Read-based > Queue-based Queue-Based approximates SC10 (a) Uniform RMAT > Uniform (due to unconnected nodes) (b) RMAT 16 thread uses hyper-threading

Performance Results GPU Result Same graph inputs GPU:1.5x ~ 2x (compared to best CPU 16 threads) CPU + GPU can give small performance improvement (a) Uniform (b) RMAT

Changing Graph Size Varying number of nodes 1mil ~ 64 mil # Edges = (# Nodes) x 8 # Threads = 16 Performance difference widens as graph size grows (cache-cache miss doesn t matter much) (a) Uniform (b) RMAT

Changing Graph Size Varying number of edges 256 mil ~ 2048 mil # Nodes = 32 mil # Threads = 16 Performance gap widens as the graph size grows GPU still performs better; but has hit size limit (a) Uniform (b) RMAT

Architectural Effects Nehalem Fermi Core Tesla SC10-EP SC10-EX Freq. 2.67GHz 1.15GHz 2.33GHz 1.40GHz 2.93GHz 2.26GHz (# Cores) 2 x 4 (x 2) 14 x 2 2 x 4 30 2 x 4 (x 2) 4 x 8 (x 2) SIMD/SIMT - 32-32 - - LLC (MB) 16 MB 2 MB 8 MB - 16 MB 96 MB Memory 24 GB 3 GB 32 GB 896 MB 48 GB 256 GB Rnd Read 0.98 GB/s 2.71 GB/s 0.25 GB/s 3.15 GB/s - - Core vs. Nehalem: Memory BW Fermi vs. Tesla: L2 Cache as write buffer Nehalem Nehalem CPU vs. GPU? (1) Size of graph (2) What you can afford #Node :16/32 mil Avg. Degree = 8

Summary Why rather than How Exploited properties of graphs and machines Small-world property Bandwidth difference between sequential access and random access A simple state-machine to avoid worst-case execution Graph exploration on GPU Limited capacity Faster execution due to memory bandwidth

Thank you Questions?