Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine



Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos

Forecast
Efficient mapping of wavefront algorithms onto the Cell Broadband Engine
Double buffering and data streaming across the cores
Unique data layout optimizations within the cores
Development of an accurate performance prediction model
End result: near-linear scalability, or near-constant efficiency, with respect to the number of cores on the chip

Outline
Introduction
Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine
Optimizations for the Accelerator Cores
Evaluation / Results
Conclusion

Outline
Introduction
  The Cell Broadband Engine (B.E.)
  The Cell B.E. Architecture
  The QS20 Cell Blade
  The Wavefront Pattern
Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine
Optimizations for the Accelerator Cores
Evaluation / Results
Conclusion

The Cell Broadband Engine
Highlights:
  9 cores, 10 threads
  3.2 GHz clock frequency
  > 200 GFlops single-precision (SP)
  Up to 25 GB/s memory bandwidth
  > 300 GB/s Element Interconnect Bus (EIB) bandwidth
Source: IBM Corporation

The Cell B.E. Architecture
[Block diagram, source: IBM Corporation] One PPE (a 64-bit Power Architecture PPU with VMX, L1 and L2 caches, and PXU) and eight SPEs (each an SPU/SXU with its own local store (LS) and Memory Flow Controller (MFC)), connected by the Element Interconnect Bus (up to 96 B/cycle, with 16 B/cycle ports per unit), plus the memory interface controller (MIC) to dual Rambus XDR memory and the bus interface controller (BIC) to FlexIO.

The QS20 Cell Blade
[Board photograph, source: IBM Corporation] 1GB of XDR memory, Cell processors, I/O controllers, and the IBM BladeCenter interface.

Outline
Introduction
  The Cell Broadband Engine (B.E.)
  The Cell B.E. Architecture
  The QS20 Cell Blade
  The Wavefront Pattern
Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine
Optimizations for the Accelerator Cores
Evaluation / Results
Conclusion

The Wavefront Pattern
Dependency: each element depends on its north (N), west (W), and north-west (NW) neighbors, so the computation sweeps the matrix along anti-diagonals (see the sketch below).
Areas of utility:
  Computational biology: Smith-Waterman
  Linear algebra: LU decomposition
  Multimedia: video encoding
  Computational physics: particle physics simulations
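To make the dependency concrete, here is a minimal, self-contained sketch (not from the paper) of filling a matrix in wavefront order; `compute_cell` is a hypothetical per-element kernel standing in for whatever recurrence the application uses. All cells on one anti-diagonal are mutually independent, which is exactly the parallelism the Cell mapping exploits.

```c
#include <stddef.h>

/* Hypothetical per-cell kernel: combines the west, north, and north-west
 * neighbors, as in Smith-Waterman matrix filling. */
extern float compute_cell(float west, float north, float northwest);

/* Fill a rows x cols matrix in wavefront order; row 0 and column 0 are
 * assumed to hold boundary values. Cells on the same anti-diagonal d = i + j
 * have no dependencies on each other and could be computed in parallel. */
void wavefront_fill(float *M, size_t rows, size_t cols)
{
    for (size_t d = 2; d <= rows + cols - 2; d++) {          /* anti-diagonals */
        size_t i_lo = (d >= cols) ? d - cols + 1 : 1;
        size_t i_hi = (d <= rows) ? d - 1 : rows - 1;
        for (size_t i = i_lo; i <= i_hi; i++) {               /* cells on diagonal d */
            size_t j = d - i;
            M[i * cols + j] = compute_cell(M[i * cols + (j - 1)],        /* W  */
                                           M[(i - 1) * cols + j],        /* N  */
                                           M[(i - 1) * cols + (j - 1)]); /* NW */
        }
    }
}
```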

Outline
Introduction
Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine
  Tiled-Wavefront
  Model for Performance Prediction
Optimizations for the SPEs
Evaluation / Results
Conclusion

Mapping to the Cell B.E. SPEs (naïve mapping)
Each element is processed on an individual SPE (S_i)
Each anti-diagonal is computed in parallel
Bus overhead due to concurrent DMA calls (reads and writes) between the SPEs, the Element Interconnect Bus, and the matrix store in main memory
Scalability issue
[Diagram: matrix elements on successive anti-diagonals labeled S_1 through S_6, connected through the Element Interconnect Bus to the matrix store in main memory]

Mapping to the Cell B.E. (tiled-wavefront)
Elements are grouped into square tiles: larger granularity, and the tile dimension can be tuned
Each tile is processed on an individual SPE (S_i)
Each tile-diagonal is computed in parallel (see the sketch below)
[Diagram: the matrix partitioned into tiles, organized in tile-rows and tile-columns, with the tiles of each tile-diagonal labeled S_1, S_2, S_3, ...]
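The tile-level version of the same traversal can be sketched as follows; `run_on_spe` and `wait_for_tile_diagonal` are hypothetical placeholders for SPE dispatch and synchronization, and the per-diagonal barrier shown here is only the simplest correct scheme. The tile-scheduling slides that follow pipeline block-rows instead.

```c
/* A sketch of the tiled-wavefront idea (illustrative names, not the paper's
 * API): the matrix is split into TILE_DIM x TILE_DIM tiles, and all tiles on
 * the same tile-diagonal are independent of one another. */
#define TILE_DIM 64   /* e.g., 64x64 integers fits comfortably in the 256KB local store */

extern void run_on_spe(int spe, int tile_row, int tile_col); /* hypothetical dispatch */
extern void wait_for_tile_diagonal(void);                    /* hypothetical barrier  */

void tiled_wavefront(int tile_rows, int tile_cols, int num_spes)
{
    for (int d = 0; d <= tile_rows + tile_cols - 2; d++) {    /* tile-diagonals */
        int r_lo = (d >= tile_cols) ? d - tile_cols + 1 : 0;
        int r_hi = (d < tile_rows) ? d : tile_rows - 1;
        for (int r = r_lo; r <= r_hi; r++)
            run_on_spe(r % num_spes, r, d - r);  /* cyclic assignment of tiles to SPEs */
        wait_for_tile_diagonal();                /* simplest scheme: synchronize per
                                                    diagonal; the real design pipelines
                                                    block-rows instead */
    }
}
```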

Tile-Scheduling
Cyclic assignment of the SPEs to tile-rows
Block-row: group of active tile-rows
Computation overlap between consecutive block-rows (e.g., time steps t_9 to t_13 in the diagram)
[Diagram: tile-rows assigned to SPEs S_1 through S_6; the tiles of each tile-row are processed at successive time steps t_i, t_{i+1}, ..., proceeding along the direction of computation, so the next block-row starts before the current one finishes]

Tile-Scheduling (continued)
S = number of active SPEs
S iterations pass before all SPEs are fully utilized
Block-row = S tile-rows (see the scheduling sketch below)
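Assuming unit per-tile cost, the cyclic schedule described on these two slides can be reproduced with a small simulation. This is only an illustration of the t_1, t_2, ... numbering (SPE r mod S owns tile-row r; a tile starts once its north and west tiles are done and its SPE is idle), not the paper's scheduler code, and ROWS, COLS and SPES are arbitrary example values.

```c
#include <stdio.h>

#define ROWS 12   /* tile-rows                 */
#define COLS 8    /* tile-columns per tile-row */
#define SPES 6    /* S = number of active SPEs */

int main(void)
{
    int done[ROWS][COLS];       /* time step at which each tile finishes   */
    int spe_free[SPES] = {0};   /* time step at which each SPE becomes idle */

    for (int r = 0; r < ROWS; r++) {
        int s = r % SPES;                      /* cyclic assignment of SPEs to tile-rows */
        for (int c = 0; c < COLS; c++) {
            int ready = spe_free[s];                                       /* SPE idle  */
            if (r > 0 && done[r - 1][c] > ready) ready = done[r - 1][c];   /* north tile */
            if (c > 0 && done[r][c - 1] > ready) ready = done[r][c - 1];   /* west tile  */
            done[r][c] = ready + 1;            /* one time step per tile */
            spe_free[s] = done[r][c];
        }
    }
    /* With ROWS=12, COLS=8, SPES=6 this reproduces the overlap between
     * consecutive block-rows shown on the slide. */
    printf("last tile finishes at step t_%d\n", done[ROWS - 1][COLS - 1]);
    return 0;
}
```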

Computation-Communication Pattern
West tile boundary: obtained through a local buffer copy (the same SPE produced the previous tile in this tile-row)
North tile boundary: fetched via DMA once a "ready" message arrives from the SPE that produced it
Finished tile: DMA'd back to main memory, where the south and east tiles will consume its boundaries
(A sketch of the per-tile traffic follows.)
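On the SPE side, the per-tile traffic above could look roughly like the sketch below, assuming the standard SPU MFC interface from spu_mfcio.h; the buffer names, tile geometry, effective addresses, and the omitted mailbox signalling are illustrative assumptions rather than the paper's actual code.

```c
#include <spu_mfcio.h>

#define TILE_DIM 64
#define TAG 0

/* 64x64 integers = 16KB, the maximum size of a single DMA transfer. */
static int tile[TILE_DIM][TILE_DIM] __attribute__((aligned(128)));
static int north_row[TILE_DIM]      __attribute__((aligned(128)));

void process_one_tile(unsigned long long north_ea, unsigned long long tile_ea)
{
    /* 1. North boundary: fetch the last row of the tile above via DMA
     *    (after the producing SPE has signalled that it is ready). */
    mfc_get(north_row, north_ea, sizeof(north_row), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();

    /* 2. West boundary: already resident in the local store, copied from
     *    the previous tile in this tile-row (a buffer copy, no DMA). */

    /* 3. Compute the tile in wavefront order (kernel omitted). */

    /* 4. Write the finished tile back to main memory, assuming it is
     *    stored contiguously there. */
    mfc_put(tile, tile_ea, sizeof(tile), TAG, 0, 0);
    mfc_write_tag_mask(1 << TAG);
    mfc_read_tag_status_all();

    /* 5. Signal the SPE that owns the tile-row below that its north
     *    boundary is ready, e.g. through a mailbox message (omitted). */
}
```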

Model for Performance Prediction
T_matrix_filling = T_parallel_code + T_serial_code, where
T_parallel_code = (T_tile + T_DMA) * (number of tile-diagonals) = (T_tile + T_DMA) * [(m * n) + S]
Time for processing a tile-diagonal in a block-row (or a single tile) = T_tile + T_DMA, independent of the number of tiles in the tile-diagonal
Number of tile-diagonals = (m * n) + S
Model usage:
  Sampling phase: measure T_tile, T_DMA and T_serial_code
  Calculate m, n and S from the input problem size, tile dimension and number of available SPEs
[Diagram: tiled matrix with m block-rows of n tile-diagonals each, and S tile-diagonals of computation overlap between consecutive block-rows]
(A worked transcription of the formula follows.)
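The model is simple enough to evaluate directly once the sampling phase has produced T_tile, T_DMA and T_serial_code; the helper below is an illustrative transcription of the formula, not code from the paper.

```c
/* Predicted matrix-filling time from the sampled parameters.
 * Variable names mirror the slide; the function itself is illustrative. */
double predict_matrix_filling_time(double t_tile,    /* per-tile compute time        */
                                   double t_dma,     /* per-tile DMA/transfer time   */
                                   double t_serial,  /* serial (non-wavefront) part  */
                                   int m,            /* block-rows                   */
                                   int n,            /* tile-diagonals per block-row */
                                   int s)            /* S = active SPEs              */
{
    /* T_parallel = (T_tile + T_DMA) * [(m * n) + S]: each tile-diagonal of a
     * block-row costs T_tile + T_DMA regardless of how many tiles it holds. */
    double t_parallel = (t_tile + t_dma) * ((double)m * n + s);
    return t_parallel + t_serial;
}
```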

Outline
Introduction
Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine
Optimizations for the SPEs
  Tile Representation
  Vector Computations
Evaluation / Results
Conclusion

Tile Representation
Logical representation: the tile stored in its natural row/column order
Physical representation: the same elements re-ordered so that the cells processed together (e.g., the cells of an anti-diagonal) lie contiguously in memory, enabling vector loads and stores (see the sketch below)
[Diagram: a small example tile shown in both the logical and the physical layout]
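One plausible way to realize such a layout transformation (the exact packing and padding used in the paper may differ) is to copy each anti-diagonal of the logical tile into its own contiguous run:

```c
/* Sketch of a diagonal-major ("physical") layout: each anti-diagonal of a
 * TILE_DIM x TILE_DIM tile is packed contiguously so that a whole diagonal
 * can be processed with vector loads. Padding/offsets are simplified. */
#define TILE_DIM 8   /* small tile for illustration */

void pack_diagonals(const int logical[TILE_DIM][TILE_DIM],
                    int physical[2 * TILE_DIM - 1][TILE_DIM])
{
    for (int d = 0; d < 2 * TILE_DIM - 1; d++) {
        int i = (d < TILE_DIM) ? d : TILE_DIM - 1;   /* start at the bottom of diagonal d */
        int j = d - i;
        for (int k = 0; i - k >= 0 && j + k < TILE_DIM; k++)
            physical[d][k] = logical[i - k][j + k];  /* walk up the anti-diagonal */
    }
}
```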

Vector Computations
Goal: vectorize as much of the tile computation as possible
[Diagram: an example tile in which the long anti-diagonals are computed with vector operations, while the few remaining cells are left to serial computations]
(A minimal SIMD sketch follows.)
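As a minimal illustration of why contiguous diagonals matter, the sketch below updates four cells of one anti-diagonal per vector operation using GCC's generic vector extensions; on the SPE the same idea would map to the spu_* intrinsics. The max-of-neighbours update is only a stand-in for the real recurrence, and 16-byte-aligned buffers are assumed.

```c
typedef int v4si __attribute__((vector_size(16)));   /* four 32-bit ints */

/* Update `len` independent cells of one anti-diagonal: each new cell is a
 * function of its already-computed north and west neighbours, here a simple
 * elementwise max as a placeholder for the real recurrence. */
void update_diagonal(int *restrict cur, const int *restrict north,
                     const int *restrict west, int len)
{
    int k = 0;
    for (; k + 4 <= len; k += 4) {                    /* vectorized part */
        v4si n = *(const v4si *)&north[k];
        v4si w = *(const v4si *)&west[k];
        v4si mask = n > w;                            /* per-lane 0 / -1 */
        *(v4si *)&cur[k] = (n & mask) | (w & ~mask);  /* lane-wise max   */
    }
    for (; k < len; k++)                              /* scalar remainder */
        cur[k] = north[k] > west[k] ? north[k] : west[k];
}
```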

Outline
Introduction
Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine
Optimizations for the SPEs
Evaluation / Results
  Experimental Setup
  Scalability Charts
  Performance Model Verification
Conclusion

Experimental Setup
Compute platforms:
  QS20 dual-Cell blade at Georgia Tech for the parallel implementation
  2.8 GHz dual-core Intel processor with 2GB memory for the serial implementation
Example wavefront algorithm: Smith-Waterman
  A fundamental algorithm in bioinformatics used for homology search; matches nucleotide/protein sequences
  8 different sequences chosen from the range 1KB to 8KB
  Two-phase dynamic programming: matrix filling (wavefront pattern) and backtracing (sequential code); the filling recurrence is sketched below
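For reference, the matrix-filling phase evaluates the Smith-Waterman recurrence below for every cell, which is what creates the N/W/NW wavefront dependency. A linear gap penalty is shown for brevity, whereas the implementation in the paper uses an affine one, and the scoring parameters are placeholders.

```c
/* One cell of the Smith-Waterman matrix-filling recurrence (linear gap
 * penalty for brevity; the paper's implementation uses an affine penalty). */
static inline int sw_cell(int nw, int n, int w, char a, char b,
                          int match, int mismatch, int gap)
{
    int diag = nw + (a == b ? match : mismatch);   /* extend an alignment    */
    int up   = n - gap;                            /* gap in one sequence    */
    int left = w - gap;                            /* gap in the other       */
    int best = diag > up ? diag : up;
    best = best > left ? best : left;
    return best > 0 ? best : 0;                    /* local alignment: never negative */
}
```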

Scalability Chart
Sequence size: 8KB; wavefront matrix size: 8000x8000 integers
Similar results for all other input sequence sizes (1KB to 8KB)
Why are the sequence sizes limited to 8KB? Larger sequences make the matrix overflow the 1GB XDRAM
Why are the tile dimensions limited to 64x64? Larger tiles overflow the 256KB local store
[Chart: speedup and efficiency vs. number of SPEs (1 to 16) for tile dimensions 64x64, 32x32, 16x16 and 8x8]
Near-constant efficiency irrespective of the tile dimensions

Performance Model Verification
Sequence size: 8KB; wavefront matrix size: 8000x8000 integers; tile dimension: 64x64 integers
Similar results for all other input configurations: sequence sizes 1KB to 8KB, tile dimensions 32x32, 16x16 and 8x8 integers
[Chart: measured vs. predicted execution time (seconds, also shown normalized) for 1 to 16 SPEs]
Mean error rate = 3%; max. error rate = 10%

Performance Model
Why do we need the performance model?
  Predict the execution time offline, based on pluggable input parameters
  Evaluate the tradeoffs between different input configurations before actually deploying the application
  See the section "Sequence Throughput" in the paper
[Chart repeated: measured vs. predicted normalized execution time for 1 to 16 SPEs; mean error rate = 3%, max. error rate = 10%]

Outline
Introduction
Mapping and Modeling Wavefront Algorithms on the Cell Broadband Engine
Optimizations for the SPEs
Evaluation / Results
Conclusion and Future Work

Conclusion
Efficiently mapped wavefront algorithms onto the Cell Broadband Engine
Developed a highly scalable design that streams tiles across the SPEs
Devised a unique data layout scheme to maximize the vector processing capabilities of the SPEs
Built an accurate prediction model of the execution time based on a number of pluggable parameters

Future Work
Validate the tiled-wavefront approach for other wavefront algorithms and for other emergent CMP architectures (e.g., GPUs)
Integrate the parallelized Smith-Waterman code into sequence-search toolkits
Extend the design to a cluster of Cell-based nodes

For more information
CS @ VT: www.cs.vt.edu
The SyNeRGy Lab: synergy.cs.vt.edu
Center for High-End Computing Systems (CHECS): www.checs.eng.vt.edu
Contacts: Ashwin Aji: aaji@cs.vt.edu; Wu Feng: feng@cs.vt.edu

Related Work: Smith-Waterman on the Cell
IBM's approach:
  Coarse-grained parallelization: one sequence pair per SPE
  Max sequence size = 2048
  O(m) space
  No backtrace? Which gap penalty, linear or affine?
Our approach:
  Fine-grained parallelization: one sequence pair across all available SPEs
  Max sequence size ~= 8200
  O(mn) space
  Includes backtrace; affine gap penalty
  A more realistic scenario, but requires more memory