High Performance Computing Lab Exercises




(Make sense of the theory!)

Rubin H Landau
With Sally Haerer and Scott Clark

[Figure: memory hierarchy from CPU through cache and RAM to main store; labels on the diagram: 32 KB, 2 MB, 2 GB, 32 TB, 6 GB/s, 111 Mb/s]

Computational Physics for Undergraduates BS Degree Program: Oregon State University
Engaging People in Cyber Infrastructure
Supported by EPICS/NSF & OSU

Introductory Computational Science

Programming for Virtual Memory

[Figure: memory layout diagram. Array elements A(1), ..., A(N) and M(1,1), ..., M(N,N) are stored in pages (Page 1, ..., Page N) that span RAM and swap space (virtual memory); the data cache holds lines A(1), ..., A(16) through A(2032), ..., A(2048) and feeds the CPU registers]

- Avoid page faults ($$)
- Avoid resource conflicts (I/O): they hurt other users and hurt yourself

VM Programming Rules
1. Worry only if you use a lot of memory
2. Look at the entire program (global optimization)
3. Make successive calculations on data subsets that fit in RAM
4. Avoid simultaneous calculations on the same data set, e.g. $M^2/\det(M)$
5. Group together in memory the data that are used together, e.g. $\sum_i M_{ii}$

Comparison: Java vs F90 & C
(Don't be a programming bigot)

Java
- most universal, most portable
- good for scientific programming (error, precision)
- slowest for HPC (days?): your time vs computer time; is the wait a problem?
- compiler: not designed for speed
- interpreter, Just-In-Time (JIT) compiler
- class files work on all Java Virtual Machines

Fortran (legacy, CSE) & C (CS)
- compilers perfected for speed
- compiler examines the entire code and rewrites it (e.g. nested do's)
- careful with arrays, cache lines
- optimized for specific architectures
- compiled code: not universal or portable
- source code: often not 100% universal or portable
- not for business, Web

Experiment: Good & Bad Virtual Memory Use

- Run simple examples, look inside your computer(s)
- Measure the time for each program (time in Unix)
- Write your own memory-intensive force method:

    Do j = 1, n
      Do i = 1, n
        f12(i,j) = force12(pion(i), pion(j))   // Fill f12
        f21(i,j) = force21(pion(i), pion(j))   // Fill f21
        ftot = f12(i,j) + f21(i,j)             // Compute
      EndDo
    EndDo

- Each loop iteration requires all the data
- Large working set size
- Better: separate the components (next)

GOOD Program, Separate Loops

    Do j = 1, n
      Do i = 1, n
        f12(i,j) = force12(pion(i), pion(j))   // Fill f12
      EndDo
    EndDo
    Do j = 1, n
      Do i = 1, n
        f21(i,j) = force21(pion(i), pion(j))   // Fill f21
      EndDo
    EndDo
    Do j = 1, n
      Do i = 1, n
        ftot = f12(i,j) + f21(i,j)             // Compute
      EndDo
    EndDo

- Separate loops, separate data access
- 1/2 the working set size
- Groups together the data that are used together
- Yet more complicated, less elegant
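The fused and separated loop structures can be sketched in C for comparison. This is only an illustration under stated assumptions: `force12` and `force21` are hypothetical stand-ins for the slide's force functions, `ftot` is stored per pair rather than as a single scalar, and the arrays are flat and row-major, so the inner loop runs over the second index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the slide's force12 and force21. */
static double force12(double a, double b) { return a - b; }
static double force21(double a, double b) { return 2.0 * (b - a); }

/* One fused loop nest: each iteration touches f12, f21 and ftot,
   so all three n*n arrays are in the working set at once. */
void forces_fused(size_t n, const double *pion,
                  double *f12, double *f21, double *ftot)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            size_t k = i * n + j;      /* row-major index, stride 1 in j */
            f12[k] = force12(pion[i], pion[j]);
            f21[k] = force21(pion[i], pion[j]);
            ftot[k] = f12[k] + f21[k];
        }
    }
}

/* Separate loop nests: the first two passes each write a single array;
   only the last pass reads both while writing ftot. */
void forces_split(size_t n, const double *pion,
                  double *f12, double *f21, double *ftot)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            f12[i * n + j] = force12(pion[i], pion[j]);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            f21[i * n + j] = force21(pion[i], pion[j]);
    for (size_t k = 0; k < n * n; k++)
        ftot[k] = f12[k] + f21[k];
}
```

Both functions fill the arrays with identical values; any timing difference comes only from how much data each pass keeps in use, and it should show up only when the arrays are too large for RAM.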

Try it! Java vs Fortran Optimization

- Aim: make a numerically intensive program run faster
- Compare languages and their options while increasing the matrix size
- Program tune solves a matrix eigenvalue problem (power method):

$$ [H][c] = E[c] \tag{1} $$

$$ E = ?, \qquad [c] = ? \tag{2} $$

$$ [H_{i,j}] = \begin{pmatrix} 1 & 0.3 & 0.09 & 0.027 & \cdots \\ 0.3 & 2 & 0.3 & 0.09 & \cdots \\ 0.09 & 0.3 & 3 & 0.3 & \cdots \\ \vdots & & & & \ddots \end{pmatrix} \tag{3} $$

- Almost diagonal: $E \simeq H_{i,i}$, $[c] \simeq (1, 0, 0, \ldots)^T$
- H[2000][2000]: 4,000,000 elements
- 11 iterations give the smallest E to 6 places
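As a starting point for experimenting, the test matrix can be built and the "almost diagonal" estimate checked in a few lines of C. The sketch below is not the tune program itself: the function names are hypothetical, and the element rule ($H_{i,i} = i$ and $H_{i,j} = 0.3^{|i-j|}$ off the diagonal, with 1-based indices) is inferred from the entries displayed on the slide. The second function evaluates the Rayleigh quotient $c^T H c / c^T c$, which gives an energy estimate for any trial vector.

```c
#include <assert.h>
#include <stddef.h>

/* Fill the n x n row-major matrix h with H(i,i) = i (1-based) and
   H(i,j) = 0.3^|i-j| elsewhere. The rule is inferred from the slide. */
void fill_hamiltonian(size_t n, double *h)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        h[i * n + i] = (double)(i + 1);
        double v = 1.0;
        for (size_t j = i + 1; j < n; j++) {
            v *= 0.3;              /* one more power of 0.3 per step away */
            h[i * n + j] = v;      /* upper triangle */
            h[j * n + i] = v;      /* symmetric lower triangle */
        }
    }
}

/* Energy estimate E = (c . H c) / (c . c) for a trial vector c. */
double rayleigh_quotient(size_t n, const double *h, const double *c)
{
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; i++) {
        double hc = 0.0;
        for (size_t j = 0; j < n; j++)
            hc += h[i * n + j] * c[j];   /* row i of H times c, stride 1 */
        num += c[i] * hc;
        den += c[i] * c[i];
    }
    assert(den > 0.0);
    return num / den;
}
```

With the trial vector (1, 0, 0, ...) the quotient equals the first diagonal element, which is the starting guess $E \simeq H_{1,1}$ that an iterative solver would then refine.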

Effects of: Dimensions, Optimize, Languages

1. Try the optimizing compiler options (-O1, -O2, -O3)
2. Language: 3-10 X time difference
3. Memory: page faults

[Figure: Execution Time vs Matrix Size; time (seconds) against matrix dimension on logarithmic axes, with plots labeled "Fortran, Solaris" and "Java, Solaris"]

4. Loop unrolling
   - Improves memory access
   - Compiler fills cache better
   - Doubles speed (ugly)

    for (i = 1; i <= L - 2; i = i + 2) {       /* two columns per pass */
        ovlp1 = ovlp1 + coef[i] * coef[i];
        ovlp2 = ovlp2 + coef[i+1] * coef[i+1];
        t1 = 0.0;
        t2 = 0.0;
        for (j = 1; j <= L; j++) {             /* inner sum over j */
            t1 = t1 + coef[j] * ham[j][i];
            t2 = t2 + coef[j] * ham[j][i+1];
        }
        sigma[i] = t1;
        sigma[i+1] = t2;
    }
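Loop unrolling is easiest to see on the overlap sum alone. The following C sketch (hypothetical function names, 0-based indexing) computes the sum of squares once with an ordinary loop and once unrolled by two with separate accumulators, plus a cleanup step for an odd number of elements. Optimizing compilers often perform this transformation themselves, so a hand-unrolled loop is worth timing rather than assuming it is faster.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one multiply-add per iteration. */
double overlap_rolled(size_t n, const double *coef)
{
    assert(coef != NULL);
    double ovlp = 0.0;
    for (size_t i = 0; i < n; i++)
        ovlp += coef[i] * coef[i];
    return ovlp;
}

/* Unrolled by two: two independent accumulators per iteration. */
double overlap_unrolled2(size_t n, const double *coef)
{
    assert(coef != NULL);
    double ovlp1 = 0.0, ovlp2 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        ovlp1 += coef[i] * coef[i];
        ovlp2 += coef[i + 1] * coef[i + 1];
    }
    if (i < n)                         /* leftover element when n is odd */
        ovlp1 += coef[i] * coef[i];
    return ovlp1 + ovlp2;
}
```

Because the two versions add the terms in a different order, their results can differ in the last bits for general floating-point data; they agree exactly when all partial sums are exactly representable.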

High Performance Computing Lab Exercises (part II)
(examples)

Avoid Discontinuous Memory (page faults)

BAD: discontinuous

    Common/Double zed, ylt(9), part(9), zpart(100000), med2(9)
    Do j = 1, n
      ylt(j) = zed * part(j) / med2(9)   // Discontinuous
    EndDo

- zed, ylt, part: together on 1 memory page
- med2: at the end of Common (after the memory hog zpart), on a different page
- Can't assure same page; improve your chances

GOOD: continuous

    Common/Double zed, ylt(9), part(9), med2(9), zpart(100000)
    Do j = 1, n
      ylt(j) = zed*part(j)/med2(9)       // Continuous
    EndDo
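The same layout effect can be made visible in C with `offsetof`. The sketch below mirrors the two Common blocks as structs (the struct names and the helper `page_index` are hypothetical) and reports which page a member falls on. It assumes 8-byte doubles, a block that starts on a page boundary, and a 4096-byte page, which is a common size but not a universal one.

```c
#include <assert.h>
#include <stddef.h>

/* Layout with the large array between the small, frequently used members. */
struct scattered {
    double zed;
    double ylt[9];
    double part[9];
    double zpart[100000];   /* memory hog */
    double med2[9];
};

/* Layout with the small members grouped ahead of the large array. */
struct grouped {
    double zed;
    double ylt[9];
    double part[9];
    double med2[9];
    double zpart[100000];
};

/* Page number that holds a given byte offset, counting from the start
   of a block assumed to begin on a page boundary. */
size_t page_index(size_t byte_offset, size_t page_size)
{
    assert(page_size > 0);
    return byte_offset / page_size;
}
```

For example, `page_index(offsetof(struct scattered, med2), 4096)` lands far from page 0, while the same call on `struct grouped` stays on page 0 together with `zed`, `ylt` and `part`.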

Method: Programming for Cache

[Figure: memory hierarchy diagram. Pages (Page 1, ..., Page N) holding A(1), ..., A(N) and M(1,1), ..., M(N,N) span swap space and fast RAM; the very fast data cache holds lines A(1), ..., A(16) through A(2032), ..., A(2048); the CPU registers are ultra-fast]

- Important for HPC: cache misses (10 × T)
- Stride: number of array elements stepped through per operation
- Good (stride 1): $c(i) = x(i) + x(i+1)$
- Bad (large stride): $\mathrm{Tr}\,A = \sum_{i=1}^{N} a(i,i)$
- Rule: keep the stride low, preferably 1
- Fortran (column-major): vary the left-hand array index first
- C & Java (row-major): vary the right-hand array index first
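The two stride examples translate directly to C. In this sketch (hypothetical function names, flat row-major storage) the trace steps through memory n + 1 elements at a time, while the neighbor sum walks the array with stride 1.

```c
#include <assert.h>
#include <stddef.h>

/* Trace of an n x n row-major matrix: consecutive diagonal elements
   are n + 1 array elements apart (large stride). */
double trace(size_t n, const double *a)
{
    assert(n > 0);
    double tr = 0.0;
    for (size_t i = 0; i < n; i++)
        tr += a[i * (n + 1)];
    return tr;
}

/* c[i] = x[i] + x[i+1] for i = 0 .. n-2: neighbors in memory (stride 1). */
void neighbor_sums(size_t n, const double *x, double *c)
{
    assert(n > 1);
    for (size_t i = 0; i + 1 < n; i++)
        c[i] = x[i] + x[i + 1];
}
```

The trace cannot avoid its stride, since the diagonal is simply spread out in memory; the rule matters most where the loop order is a free choice.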

Exercise: Cache Misses

- Aim of HPC: have the data to be processed in cache
- F90 2-D array storage: column-major order
- C, Java 2-D array storage: row-major order
- Compare times: M(column x column) vs M(row x row)

    Do j = 1, 9999
      x(j) = m(1,j)   // Sequential column reference
    EndDo

    Do j = 1, 9999
      x(j) = m(j,1)   // Sequential row reference
    EndDo

- One version always makes large jumps through memory
- RAM, virtual memory: the full story
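One way to set up the timing comparison in C is sketched below. The names are hypothetical, the matrix is flat and row-major, and `clock()` from the standard library measures CPU time with limited resolution, so many repetitions on a large matrix are needed before any difference becomes measurable.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Copy row r of an n x n row-major matrix: stride 1 in C. */
void copy_row(size_t n, const double *m, size_t r, double *x)
{
    assert(r < n);
    for (size_t j = 0; j < n; j++)
        x[j] = m[r * n + j];
}

/* Copy column c of the same matrix: stride n in C. */
void copy_col(size_t n, const double *m, size_t c, double *x)
{
    assert(c < n);
    for (size_t j = 0; j < n; j++)
        x[j] = m[j * n + c];
}

/* CPU seconds spent on reps copies, cycling through the rows or columns. */
double time_copies(void (*copy)(size_t, const double *, size_t, double *),
                   size_t n, const double *m, double *x, int reps)
{
    clock_t start = clock();
    for (int k = 0; k < reps; k++)
        copy(n, m, (size_t)k % n, x);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_copies(copy_row, ...)` and `time_copies(copy_col, ...)` on the same matrix gives the two numbers to compare; in Fortran the roles of row and column are reversed.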

Good & Bad Cache Flow (OK to try these at home)

GOOD f90 / BAD C: Minimum/Maximum Stride

    Dimension Vec(idim,jdim)               // Loop A
    Do j = 1, jdim
      Do i = 1, idim                       // Vary row rapidly
        Ans = Ans + Vec(i,j)*Vec(i,j)      // Stride 1 in Fortran
      EndDo
    EndDo

BAD f90 / GOOD C: Maximum/Minimum Stride

    Dimension Vec(idim,jdim)               // Loop B
    Do i = 1, idim
      Do j = 1, jdim                       // Vary column rapidly
        Ans = Ans + Vec(i,j)*Vec(i,j)      // Stride idim in Fortran
      EndDo
    EndDo
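Loops A and B can be tried in C as well, where the good and bad roles swap. A minimal sketch with hypothetical function names and a flat row-major array:

```c
#include <assert.h>
#include <stddef.h>

/* Inner loop over the column index j: stride 1 in C (the good order). */
double sumsq_row_order(size_t rows, size_t cols, const double *vec)
{
    assert(rows > 0 && cols > 0);
    double ans = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            ans += vec[i * cols + j] * vec[i * cols + j];
    return ans;
}

/* Inner loop over the row index i: stride cols in C (the bad order). */
double sumsq_col_order(size_t rows, size_t cols, const double *vec)
{
    assert(rows > 0 && cols > 0);
    double ans = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            ans += vec[i * cols + j] * vec[i * cols + j];
    return ans;
}
```

Both functions visit every element once and differ only in the order of the visits, so any speed difference on a large array is down to memory access alone.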

Exercise: Large Matrix Multiplication

$$ [C] = [A][B], \qquad c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj} $$

Must compromise, as both rows and columns are used.

BAD f90 / GOOD C: Max/Min Stride (loop order i, j, k)

    Do i = 1, N                              // Row
      Do j = 1, N                            // Column
        c(i,j) = 0.0                         // Initial
        Do k = 1, N
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        EndDo
      EndDo
    EndDo

GOOD f90 / BAD C: Min/Max Stride (loop order j, k, i)

    Do j = 1, N
      Do i = 1, N
        c(i,j) = 0.0
      EndDo
      Do k = 1, N
        Do i = 1, N
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        EndDo
      EndDo
    EndDo
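For C, the corresponding experiment compares the textbook i, j, k order with an i, k, j order that keeps the innermost loop at stride 1 for both `b` and `c`. The sketch below uses hypothetical function names and flat row-major n x n arrays; it shows one possible reordering, not the only one.

```c
#include <assert.h>
#include <stddef.h>

/* i, j, k order: a is read with stride 1, but b is read with stride n. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* i, k, j order: the inner loop walks rows of b and c with stride 1. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++)
            c[i * n + j] = 0.0;            /* initialize row i of c */
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];     /* reused across the inner loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

This mirrors the Fortran j, k, i version with the index roles exchanged: in each language the innermost loop runs over the index that is contiguous in memory.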