Efficient Parallel Graph Exploration on Multi-Core CPU and GPU
|
|
|
- Berenice Berry
- 10 years ago
- Views:
Transcription
1 Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun
2 Graph and its Applications Graph Fundamental data structure G = (N,E): Arbitrary relationship (E) between data entities (N) Wide range of Applications Scheduling task graphs PDE (Partial Differential Equation) solver on mesh Artificial Intelligence Bayesian network Bioinformatics molecular interaction graph Social network analysis Web graphs Graph database schema-less data management Requires large graph analysis
3 Performance Issues Single-core machines showed limited performance for large graph analysis problems A lot of random memory accesses + Data does not fit in cache Performance is bound to memory latency Conventional hardware units (e.g. floating point, branch predictors, out-of-order) do not help much Use parallelism to accelerate graph analysis Plenty of data-parallelism in large graph instances Latency bound Bandwidth bound Exploit recent proliferation of parallel computers: Multi-core CPU and GPU
4 Graph Exploration Breadth first search (BFS) A systematic way to traverse the graph A building block for many other algorithms s-t connectivity, betweeness centrality, connected component, community detection, max-flow Can be parallelized (c.f. depth first search) More about this in the next slide Many previous researches on implementation For various architectures: Cluster, Cell, Cray, Multicore/SMP, GPU, Preferred as parallel benchmark See graph500.org
5 Parallel BFS Algorithm Start from a root, and visit all the connected nodes in a graph Nodes closer to the root are visited first Nodes of the same hop-distance (level) from the root can be visited in parallel Three Node-sets Nodes of the current level Neighbors of current level nodes Synchronization at the end of each level Add non-visited neighbors to Next and Visited set
6 Implementation for Multi-Core CPU Level Synchronous Parallel BFS Requires synchronization at everyl evel Degree of parallelism limited by (# nodes) in each level State-of-Art Implementation [Agarwal et. al. SC 2010] V bitmap Maximize cache hit ratio Atomic update required: test and test-and-set C, N queue Local Queue + Global Queue Outperformed previous implementations Complex queue implementation based on ticket-lock and fast forwarding Not so much details revealed in their paper Avoid unnecessary cache-to-cache traffic
7 Can we do better? Issues Requires complex queue implementation Can we do better even without it? Our two implementations Queue-Based Implementation Approximate Agarwal et. al. s approach Bitmap Test and Test-and-Set Local Q + Global Q Standard Queue Another implementation Exploit properties of the graphs Exploit properties of the machines
8 Observation on Graphs Small-World Property [Watts and Strogatz, Nature 1998] Any randomly-shaped graphs has a small diameter ( Six-degrees of separation ) A fundamental property : web graphs, social graphs, molecular graphs, [Corollary] There must be at least one level that has O(N) nodes. Regular Graph: Diameter O(N) Adding Ramdom re-wiring: Diameter O(c) Total execution time is governed by these critical levels (e.g.) Number of nodes at each BFS level (16 million node graph)
9 Read-based implementation Another implementation of ours V: Bitmap C, N: Level-Array A single O(N)-sized array that keeps the level of each node N-1 inf inf inf inf inf inf 0 inf inf inf Initialize 1 Level 0 Read the entire array! while (!finished) { foreach (c: G.Nodes) { if (level[c]!= curr_lev) continue; } lev++; } 1 0 inf 1 inf 0 N-2 Level 1 Iterate nodes in Current set Adding nodes to Next set N-1 2 Instead of keeping queues, update the value in the level array.
10 What s the benefit of that? 10x Diff (1) The array is read sequentially (1)-b Overall access pattern become more sequential as well (2) There are only a few level; In critical levels, you have to visit O(N) nodes anyway. But cannot eliminate all the natural random accesses. Current Queue Level List I N F 1 I N F I N F Packed Adjacency List a b c h i j k n o p q Packed Adjacency List a b c d e f g h i j k n o p q (a) Data-Access Pattern of Queue-Based Method (b) Data-Access Pattern of Read-Based Method
11 Queue-Based vs. Read-Based Level-wise execution time breakdown Small increase in non-critical levels (1,2, 6 and 7) (e.g.) Number of nodes at each BFS level (16 million node graph) Reduction in critical levels (3, 4 and 5)
12 What about big-world graphs? Worst-case inputs for Read-based method: 1. High-diameter graphs Recent graph applications (e.g. social network) deal with small-world graphs more frequently Still, there are high-diameter graphs: e.g. mesh 2. Small search instance When the graph is not (strongly) connected Your traversal finishes after visiting only small portion of the graph
13 Preventing worst case execution Our solution: hybrid method Choose appropriate method (Read or Queue), adaptively at each level Based on the size of Next set and its growth rate. Finite State Machine Go Parallel (with Queues) when there are enough # nodes. If Next is large enough (e.g. p% of num nodes) or exponential growing, migrate to Read method Process the root node, sequentially. Return to Queue method when Next is shrinking Transient state: Read from Queue, Write to Array
14 Result: worst-case avoidance BFS on tree Y-axis: time (high is bad) Mix of large search instances (good for Read) and small search instances (good for Queue) The FSM allows best of both methods Tree Large search instances Small search instances
15 Result: worst-case avoidance 2-D Mesh 4000x4000 Diameter is O(sqrt(N)) (# nodes) at each level increases not exponentially, but linearly Read-based method showed a lot of overheads Hybrid Queue+Read method avoids it
16 Graph Exploration on GPU GPU Benefits Large memory bandwidth (GDDR, # channels) Massively parallel hardware HW multi-threading + SIMD(/SIMT) HW Traits similar to Cray-XMT But much cheaper GPU Issues Limited capacity (~ a few GB) Our approach: Use GPU, only if the graph fits Use multi-core CPU, otherwise But how much performance does this give?
17 Graph Exploration on GPU BFS on GPU [Harish and Narayanan, HiPC 2007], [Hong et al, PPoPP2011] Similar to Queue-based implementation Visited, Next, Current Level Array If level[node] is INF, then node is not visited Hard to do bitwise atomic operation efficiently on GPU A node can be written multiple times by different parents Okay, because the written level value is always same But it has the same issue as Queue-based method Bad for small or long-diameter graphs
18 Hybrid CPU+GPU An extension to the previous FSM Next T 4 SEQ Next > T 1 QUEUE FOR GPU Migrate to GPU, only if the graph is growing exponentially QUEUE Next > α * Curr and Next >T 4 CopyBack InitGPU Need data copy to GPU. (i.e. ship out the contents in the queue) Otherwise, go back to CPUbased FSM Finished GPU Copy back level values from GPU
19 GPU: Worst-case avoidance BFS on tree with GPU Small Searches GPU: Worse for small instances (fixed cost) Hybrid method avoids worst-case Large Searches
20 Experiments on Small-world Graphs Multi-Core CPU GPU Intel Nehalem (X5550) 2.67GHz 2 Socket x 4 Core x 2 HT LLC: 8MB x 2 Main Memory: 24GB Nvidia Fermi (C2050) 1.15GHz 14 SM x (2 warps) x 32 SIMT LLC: 2MB Main Memory: 3GB Measurement Start from multiple root nodes Measure average execution time from multiple executions Graphs Two kinds of widely accepted synthetic graphs Random (Erdos-Renyi) Simple uniform random RMAT Skewed degree distribution (good) Many (~50%) unconnected nodes (bad) 32mil nodes, 256 mil edges
21 Performance Results Multi-Core CPU Result y-axis: processing rate (Higher is better) SC10-EP: numbers from [Agarwal et. al SC10] Measured for same sized graph on a faster (2.9Ghz) machine Read+Queue gives small performance improvement Read-based > Queue-based Queue-Based approximates SC10 (a) Uniform RMAT > Uniform (due to unconnected nodes) (b) RMAT 16 thread uses hyper-threading
22 Performance Results GPU Result Same graph inputs GPU:1.5x ~ 2x (compared to best CPU 16 threads) CPU + GPU can give small performance improvement (a) Uniform (b) RMAT
23 Changing Graph Size Varying number of nodes 1mil ~ 64 mil # Edges = (# Nodes) x 8 # Threads = 16 Performance difference widens as graph size grows (cache-cache miss doesn t matter much) (a) Uniform (b) RMAT
24 Changing Graph Size Varying number of edges 256 mil ~ 2048 mil # Nodes = 32 mil # Threads = 16 Performance gap widens as the graph size grows GPU still performs better; but has hit size limit (a) Uniform (b) RMAT
25 Architectural Effects Nehalem Fermi Core Tesla SC10-EP SC10-EX Freq. 2.67GHz 1.15GHz 2.33GHz 1.40GHz 2.93GHz 2.26GHz (# Cores) 2 x 4 (x 2) 14 x 2 2 x x 4 (x 2) 4 x 8 (x 2) SIMD/SIMT LLC (MB) 16 MB 2 MB 8 MB - 16 MB 96 MB Memory 24 GB 3 GB 32 GB 896 MB 48 GB 256 GB Rnd Read 0.98 GB/s 2.71 GB/s 0.25 GB/s 3.15 GB/s - - Core vs. Nehalem: Memory BW Fermi vs. Tesla: L2 Cache as write buffer Nehalem Nehalem CPU vs. GPU? (1) Size of graph (2) What you can afford #Node :16/32 mil Avg. Degree = 8
26 Summary Why rather than How Exploited properties of graphs and machines Small-world property Bandwidth difference between sequential access and random access A simple state-machine to avoid worst-case execution Graph exploration on GPU Limited capacity Faster execution due to memory bandwidth
27 Thank you Questions?
Green-Marl: A DSL for Easy and Efficient Graph Analysis
Green-Marl: A DSL for Easy and Efficient Graph Analysis Sungpack Hong*, Hassan Chafi* +, Eric Sedlar +, and Kunle Olukotun* *Pervasive Parallelism Lab, Stanford University + Oracle Labs Graph Analysis
FPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 2 - CUDA Memories Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 1 Warp scheduling CUDA Memory hierarchy
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices
Parallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile
GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding
Graph Analytics in Big Data. John Feo Pacific Northwest National Laboratory
Graph Analytics in Big Data John Feo Pacific Northwest National Laboratory 1 A changing World The breadth of problems requiring graph analytics is growing rapidly Large Network Systems Social Networks
Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) [email protected] http://www.mzahran.com Modern GPU
Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms
Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms Björn Rocker Hamburg, June 17th 2010 Engineering Mathematics and Computing Lab (EMCL) KIT University of the State
Clustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu [email protected] Bin Zhang [email protected] Meichun Hsu [email protected] ABSTRACT In this paper, we report our research on using GPUs to accelerate
Accelerating In-Memory Graph Database traversal using GPGPUS
Accelerating In-Memory Graph Database traversal using GPGPUS Ashwin Raghav Mohan Ganesh University of Virginia High Performance Computing Laboratory [email protected] Abstract The paper aims to provide
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.
Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core
Load Balancing and Dynamic Parallelism for Static and Morph Graph Algorithms
Load Balancing and Dynamic Parallelism for Static and Morph Graph Algorithms A Project Report submitted in partial fulfilment of the requirements for the Degree of Master of Technology in Computational
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
GPGPU accelerated Computational Fluid Dynamics
t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing
Accelerating Simulation & Analysis with Hybrid GPU Parallelization and Cloud Computing Innovation Intelligence Devin Jensen August 2012 Altair Knows HPC Altair is the only company that: makes HPC tools
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
Accelerating CFD using OpenFOAM with GPUs
Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide
Benchmarking Cassandra on Violin
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
MapGraph. A High Level API for Fast Development of High Performance Graphic Analytics on GPUs. http://mapgraph.io
MapGraph A High Level API for Fast Development of High Performance Graphic Analytics on GPUs http://mapgraph.io Zhisong Fu, Michael Personick and Bryan Thompson SYSTAP, LLC Outline Motivations MapGraph
Rethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
CHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
Why the Network Matters
Week 2, Lecture 2 Copyright 2009 by W. Feng. Based on material from Matthew Sottile. So Far Overview of Multicore Systems Why Memory Matters Memory Architectures Emerging Chip Multiprocessors (CMP) Increasing
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
15-418 Final Project Report. Trading Platform Server
15-418 Final Project Report Yinghao Wang [email protected] May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support
Introduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
STINGER: High Performance Data Structure for Streaming Graphs
STINGER: High Performance Data Structure for Streaming Graphs David Ediger Rob McColl Jason Riedy David A. Bader Georgia Institute of Technology Atlanta, GA, USA Abstract The current research focus on
Enabling Technologies for Distributed and Cloud Computing
Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance
PHAST: Hardware-Accelerated Shortest Path Trees
PHAST: Hardware-Accelerated Shortest Path Trees Daniel Delling Andrew V. Goldberg Andreas Nowatzyk Renato F. Werneck Microsoft Research Silicon Valley, 1065 La Avenida, Mountain View, CA 94043, USA September
Choosing a Computer for Running SLX, P3D, and P5
Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
Multi-core and Linux* Kernel
Multi-core and Linux* Kernel Suresh Siddha Intel Open Source Technology Center Abstract Semiconductor technological advances in the recent years have led to the inclusion of multiple CPU execution cores
Scaling from Datacenter to Client
Scaling from Datacenter to Client KeunSoo Jo Sr. Manager Memory Product Planning Samsung Semiconductor Audio-Visual Sponsor Outline SSD Market Overview & Trends - Enterprise What brought us to NVMe Technology
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
Optimizing Code for Accelerators: The Long Road to High Performance
Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?
GPU Computing - CUDA
GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
Parallel Simplification of Large Meshes on PC Clusters
Parallel Simplification of Large Meshes on PC Clusters Hua Xiong, Xiaohong Jiang, Yaping Zhang, Jiaoying Shi State Key Lab of CAD&CG, College of Computer Science Zhejiang University Hangzhou, China April
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University [email protected] COMP
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large
A1 and FARM scalable graph database on top of a transactional memory layer
A1 and FARM scalable graph database on top of a transactional memory layer Miguel Castro, Aleksandar Dragojević, Dushyanth Narayanan, Ed Nightingale, Alex Shamis Richie Khanna, Matt Renzelmann Chiranjeeb
Binary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
Performance Analysis of Flash Storage Devices and their Application in High Performance Computing
Performance Analysis of Flash Storage Devices and their Application in High Performance Computing Nicholas J. Wright With contributions from R. Shane Canon, Neal M. Master, Matthew Andrews, and Jason Hick
bigdata Managing Scale in Ontological Systems
Managing Scale in Ontological Systems 1 This presentation offers a brief look scale in ontological (semantic) systems, tradeoffs in expressivity and data scale, and both information and systems architectural
Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
Comparison of Hybrid Flash Storage System Performance
Test Validation Comparison of Hybrid Flash Storage System Performance Author: Russ Fellows March 23, 2015 Enabling you to make the best technology decisions 2015 Evaluator Group, Inc. All rights reserved.
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
How To Compare Amazon Ec2 To A Supercomputer For Scientific Applications
Amazon Cloud Performance Compared David Adams Amazon EC2 performance comparison How does EC2 compare to traditional supercomputer for scientific applications? "Performance Analysis of High Performance
YALES2 porting on the Xeon- Phi Early results
YALES2 porting on the Xeon- Phi Early results Othman Bouizi Ghislain Lartigue Innovation and Pathfinding Architecture Group in Europe, Exascale Lab. Paris CRIHAN - Demi-journée calcul intensif, 16 juin
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi
Assessing the Performance of OpenMP Programs on the Intel Xeon Phi Dirk Schmidl, Tim Cramer, Sandra Wienke, Christian Terboven, and Matthias S. Müller [email protected] Rechen- und Kommunikationszentrum
HIGH PERFORMANCE BIG DATA ANALYTICS
HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning
Seeking Opportunities for Hardware Acceleration in Big Data Analytics
Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who
LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.
LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability
Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
X. SHI ET AL. 1 Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model Xuanhua Shi 1, Xuan Luo 1, Junling Liang 1, Peng Zhao 1, Sheng Di 2, Bingsheng He 3, and Hai Jin 1 1 Services Computing
Introduction to GPU Architecture
Introduction to GPU Architecture Ofer Rosenberg, PMTS SW, OpenCL Dev. Team AMD Based on From Shader Code to a Teraflop: How GPU Shader Cores Work, By Kayvon Fatahalian, Stanford University Content 1. Three
Hardware Acceleration for CST MICROWAVE STUDIO
Hardware Acceleration for CST MICROWAVE STUDIO Chris Mason Product Manager Amy Dewis Channel Manager Agenda 1. Introduction 2. Why use Hardware Acceleration? 3. Hardware Acceleration Technologies 4. Current
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
Overlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer
Parallel Computing with MATLAB
Parallel Computing with MATLAB Scott Benway Senior Account Manager Jiro Doke, Ph.D. Senior Application Engineer 2013 The MathWorks, Inc. 1 Acceleration Strategies Applied in MATLAB Approach Options Best
High Performance Computing in the Multi-core Area
High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable
Understanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0
PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0 15 th January 2014 Al Chrosny Director, Software Engineering TreeAge Software, Inc. [email protected] Andrew Munzer Director, Training and Customer
