High Performance Computing: Introduction. Salvatore Orlando 1
General information about the course. Web site: register on moodle.unive.it, under OFFERTA FORMATIVA (course catalogue) > Corsi di Laurea Magistrale (Master's degree programmes) > Dipartimento di Scienze Ambientali, Informatica e Statistica > Informatica - Computer Science [CM9] > High Performance Computing [CM0227]. Course website: http://www.dsi.unive.it/~calpar/ The exam is subdivided into two parts: a written exam (60%) and home assignments or a project (40%). Texts, reading list, and didactic material: the course slides, plus
A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing, 2nd Ed., Addison-Wesley, 2003.
B. Wilkinson, M. Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd Ed., Prentice-Hall, 2003.
J. L. Hennessy, D. A. Patterson. Computer Architecture: A Quantitative Approach, 5th Ed., Morgan Kaufmann, 2012.
I. Foster. Designing and Building Parallel Programs. Addison-Wesley, 1995; online version available at http://www-unix.mcs.anl.gov/dbpp. 2
HPC and Parallel Computing. High-performance computing (HPC) exploits parallel hardware and software. HPC uses supercomputers, computer clusters, and heterogeneous computing architectures to solve advanced computational problems. Today, computer systems approaching the teraflops scale (10^12 floating-point operations per second) are counted as HPC. 3
Sequential vs. Parallel Computing. Sequential computing: solve a problem with an algorithm whose instructions are executed in sequence; the corresponding computational model is characterized by a single processor. Parallel computing: solve a problem with an algorithm whose instructions are executed in parallel; the corresponding computational model is characterized by multiple processors and the associated mechanisms of cooperation. The algorithm has to efficiently exploit the parallelism that can be made explicit in the problem, with the goal of speeding up execution. 4
The parallelization process. General methodology: subdivide (decompose) the problem among the various processors/workers; processors work on their own parts of the whole problem, but cooperate to solve shared portions of it. The final goals are to balance the load among processors, to reduce as much as possible the communications/synchronizations due to processor cooperation, and to avoid replication of the same task by multiple processors; a minimal sketch of this pattern follows. 5
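A minimal, illustrative sketch of this decompose/compute/combine pattern (an assumption of this text, not part of the course material), using Python's multiprocessing module: the input is decomposed into equally sized chunks (load balancing), each worker computes an independent partial result (no task replication), and the only cooperation is combining the partial results at the end.

```python
from multiprocessing import Pool

import numpy as np

def partial_sum(chunk):
    # each worker processes only its own part of the problem
    return float(np.sum(chunk))

if __name__ == "__main__":
    data = np.arange(10_000_000, dtype=np.float64)
    n_workers = 4
    # decomposition: equally sized chunks give a balanced load
    chunks = np.array_split(data, n_workers)
    with Pool(n_workers) as pool:
        partials = pool.map(partial_sum, chunks)  # independent work, no replication
    # cooperation: the only communication is the final combination step
    print(sum(partials), float(np.sum(data)))
```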
Pros and Cons of Parallel Computing. The benefit of exploiting parallelism to speed up applications has been a debated topic over the years. Today parallelism is in the mainstream: nowadays parallel processors even equip mobile phones. In the following, we discuss some arguments used as pros and cons for exploiting parallel computing. 6
Cons. Parallel software development is very complex: the complexity stems from the need to specify concurrent tasks and coordinate their execution, with performance guarantees. Software is not portable between different platforms. The programming environments are not standardized enough. Tuning the performance (performance debugging) is expensive. Continuous improvements in the performance of microprocessors: what is the advantage of devoting so much time and money to developing complex parallel software, when Moore's law also used to hold for the performance of microprocessors (roughly doubling every two years)? However, this continuous increase of uniprocessor performance is no longer true. 7
Pros: Limits in the improvement of unicores. Limit of Moore's law: the performance curve of sequential (single-core) microprocessors has stopped increasing, even though Moore's law still applies to the density of transistors (2x transistors/chip every 1.5 years); for a long time this density increase also entailed a similar performance increase. The performance curve is measured with the SPEC sequential benchmarks, relative to the VAX-11/780. [Figure 1.1 from Hennessy & Patterson: growth in processor performance since the late 1970s, plotted relative to the VAX-11/780 as measured by the SPEC benchmarks; roughly 25%/year before the mid-1980s, 52%/year during the RISC era, and 22%/year after the move to multi-core, with recent multi-core Intel Xeon and Core i7 systems reaching roughly 20,000x the VAX.] 8
Pros: Limits in the improvement of unicores. In the last decades the performance improvement of microprocessors (unicores) was impressive, due to: Moore's law (2x transistors/chip every 1.5 years); parallelism/pipelining in instruction execution, i.e., implicit ILP (Instruction-Level Parallelism), with the average CPI (cycles per instruction) reduced by an order of magnitude in a decade; improved HLL compilers; deep memory hierarchies, with large and multilevel caches (as a result, the memory bottleneck has become more evident); very short clock cycles (GHz frequencies). 9
Pros: Limits in the improvement of unicores. Trends in recent years: limits in the automatic extraction of further ILP (independent instructions) from a single sequential program; no benefit from adding further pipelined functional units to unicores; limits to further increases of the clock frequency. 10
Pros: Trends in parallel hw. Due to the limits of unicore performance, CPU manufacturers are now integrating two or more individual processors (cores) onto a single chip: multi-core microprocessors. The plan is to double the number of cores per microprocessor with each semiconductor technology generation (about every two years), and to add SIMD (data-parallel) coprocessors to accelerate applications. Multi-core architectures need to be explicitly programmed in parallel. 11
Pros: Trends in parallel hw. Clock frequency scaling has been replaced by scaling the number of cores: high-frequency processors need high power. 12
Pros: Trends in parallel hw. Low cost of standard electronic equipment (off-the-shelf microprocessors and other components such as GP-GPUs, memory, and network boards): this has permitted the widespread diffusion of computer clusters interconnected by fast communication networks. 13
Pros: Trends in hw. Clusters are made of many multicore nodes, which may also host GPU (graphics) boards and several disks. Clusters are used not only for scientific computing, but also for business data-intensive applications, e.g., as parallel data-intensive platforms hosting web servers, databases, search engines, data mining, and rendering farms. They provide not only greater aggregate computational bandwidth, but also greater aggregate bandwidth and size at the various memory levels. 14
Other Pros. High latency of main and secondary memory: it is well known that the overall performance also depends on the capability of the memory hierarchy to promptly supply data to the computational core. Memory hierarchies were introduced to reduce the consequences of the Von Neumann bottleneck (caches reduce memory access latency). Parallel platforms may reduce the consequences of this bottleneck: on these platforms we can exploit the aggregate memory banks and disks at the various levels of the hierarchy. This increases the overall memory size and bandwidth, and it also reduces the average memory latency if we are able to increase the cache hit rate. However, in order to effectively exploit aggregate memory, parallel algorithms must be carefully designed to exploit locality. 15
Other Pros: Parallelism in distributed infrastructures. Computer networks and distributed infrastructures: to solve large-scale problems, often with data already distributed across remote sites, we can nowadays exploit many computers connected by wide-area networks (the Internet), forming a large-scale, heterogeneous, parallel/distributed platform. This requires parallel/distributed software and suitable problem-decomposition strategies; unfortunately, not all problems can be parallelized successfully on these platforms. Examples: SETI@home (Search for Extra-Terrestrial Intelligence), which employs the power of desktop computers to analyze electromagnetic signals coming from space; factorization of large integers, with applications in cryptography; optimization problems. 16
When parallelism is necessary. Traditional scientific paradigm: first theorize, then run lab experiments to confirm or deny the theory. Traditional engineering paradigm: first design, then build a laboratory prototype. Both paradigms are being replaced by numerical experiments and numerical prototyping, because real phenomena are too complicated to model directly (e.g., climate prediction) and real experiments are too hard, too expensive, too slow, or too dangerous for a laboratory (e.g., oil reservoir simulation, large wind tunnels, overall aircraft design, galactic evolution, whole-factory or product life-cycle design and optimization, etc.). Computational science, i.e., large-scale numerical simulation: a new way to do science. 17
When parallelism is necessary. Many of the fundamental questions in science (especially those with potentially broad social, political, and scientific impact) are sometimes referred to as "Grand Challenge" problems. Many of the Grand Challenge problems can only be solved computationally: they require a lot of storage and computational resources to be solved in a reasonable time, along with suitable parallel software. 18
Some particularly challenging computations. Science: global climate modeling; astrophysical modeling; biology (genomics, protein folding, drug design); computational chemistry; computational material sciences and nanosciences. Engineering: crash simulation; semiconductor design; earthquake and structural modeling; computational fluid dynamics (airplane design); combustion (engine design). Business: financial and economic modeling; data mining and social network analysis; search engines. Defense: nuclear weapons (testing by simulation); cryptography. 19
Global Weather Forecasting. The simulation algorithm requires the atmosphere to be subdivided into disjoint regions (three-dimensional cells). The algorithm executes repeated computations for each cell, at each time step. 20
Global Weather Forecasting. Example: cells of 1 mile x 1 mile x 1 mile, with a total atmosphere height of 10 miles (10 cells), give a total of about 5x10^8 cells. Suppose that each per-cell computation requires 200 FP ops (floating-point operations); a whole time step then requires a total of about 10^11 FP ops. Using a simulation time step of 10 minutes, we need 6 steps per hour, and a total of 1440 steps (about 1.4x10^14 FP ops) for 10 days of weather prediction. A computer able to execute 100 Mflops (10^8 FP ops/sec) takes more than 10^6 sec (about 16 days!) to do this; to run the same computation in 10 minutes, we would need a computer able to execute about 0.2 Tflops (0.2x10^12 FP ops/sec). What would happen if we increased the number of steps per hour, or made the cell size smaller? 21
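The estimate can be checked with a few lines of arithmetic (a sketch that simply re-derives the figures above):

```python
cells      = 5e8           # ~1 cubic-mile cells, atmosphere 10 cells high
flops_cell = 200           # FP ops per cell per time step (assumed above)
steps      = 6 * 24 * 10   # 10-minute steps for 10 days = 1440 steps

total_flops = cells * flops_cell * steps             # about 1.4e14 FP ops
print(f"total work           : {total_flops:.2e} FP ops")
print(f"time at 100 Mflops   : {total_flops / 1e8:.2e} s "
      f"(~{total_flops / 1e8 / 86400:.1f} days)")
print(f"rate for a 10-min run: {total_flops / 600:.2e} FP ops/s (~0.24 Tflops)")
```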
Motion of celestial objects (N-bodies). Given N bodies with masses m_1, ..., m_N in three-dimensional (physical) space, with initial positions and velocities, where the force of attraction between each pair of bodies is Newtonian, determine the position of each body at every future moment of time. For N bodies, we must compute N-1 forces per body, i.e., approximately N^2 in total; a sketch of this direct method follows. There exist approximate algorithms that require only about N log2 N force computations: they exploit the fact that distant groups of bodies (galaxies, solar systems) appear very close together and can be treated as a single large body centered at their center of mass, so force interactions are computed between agglomerates rather than single bodies (e.g., the Solar System approximated by its center of mass to compute the force interaction with the star Proxima Centauri). [Figure: the Solar System's center of mass interacting with Proxima Centauri, a star with diameter about one-seventh that of the Sun.] 22
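A minimal sketch of the direct O(N^2) force computation (illustrative NumPy code; the softening term eps, added to avoid division by zero, is an assumption of this sketch):

```python
import numpy as np

G = 6.674e-11  # gravitational constant

def direct_forces(masses, positions, eps=1e-9):
    """Direct O(N^2) step: every body interacts with every other body.
    masses: shape (N,); positions: shape (N, 3)."""
    n = len(masses)
    forces = np.zeros((n, 3))
    for i in range(n):
        diff = positions - positions[i]          # vectors from body i to all bodies
        dist2 = (diff ** 2).sum(axis=1) + eps
        dist2[i] = np.inf                        # exclude the self-interaction
        mag = G * masses[i] * masses / dist2     # Newtonian attraction magnitudes
        forces[i] = (mag[:, None] * diff / np.sqrt(dist2)[:, None]).sum(axis=0)
    return forces

# small random system: 1000 bodies -> about 10^6 pairwise interactions per step
rng = np.random.default_rng(0)
forces = direct_forces(rng.random(1000), rng.random((1000, 3)))
```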
Motion of celestial objects (N-bodies). The approximate algorithm (Barnes-Hut) exploits an irregular data structure organized as a tree (an octree): the construction of this tree dynamically divides the simulation volume into cubic cells, and the tree maintains information about the bodies that fall in the same box (the agglomerates) and about the centers of mass of the agglomerates. 23
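A minimal sketch of the octree construction, where every node records the total mass and center of mass of its agglomerate (illustrative only; the leaf size, depth guard, and the assumption that bodies have distinct positions are choices of this sketch, not of the Barnes-Hut paper):

```python
import numpy as np

def build_octree(masses, positions, center, half_size, leaf_size=8, depth=0):
    """Recursively split a cubic cell into 8 octants; every node stores the
    total mass and center of mass of the bodies it contains (the agglomerate)."""
    total_mass = masses.sum()
    com = (masses[:, None] * positions).sum(axis=0) / total_mass
    node = {"center": center, "half_size": half_size,
            "mass": total_mass, "com": com, "children": []}
    if len(masses) <= leaf_size or depth > 30:   # leaf cell (or depth guard)
        return node
    # octant index of each body: one bit per coordinate w.r.t. the cell center
    octant = ((positions > center) * np.array([4, 2, 1])).sum(axis=1)
    for k in range(8):
        sel = octant == k
        if not sel.any():
            continue
        offset = half_size / 2 * np.where([k & 4, k & 2, k & 1], 1.0, -1.0)
        node["children"].append(build_octree(masses[sel], positions[sel],
                                             center + offset, half_size / 2,
                                             leaf_size, depth + 1))
    return node

rng = np.random.default_rng(1)
root = build_octree(np.ones(10_000), rng.random((10_000, 3)),
                    center=np.full(3, 0.5), half_size=0.5)
```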
Motion of celestial objects (N-bodies). The simulation algorithms for N-bodies are iterated for many time steps: at every step the new positions of the bodies are updated. Suppose we have a galaxy of 10^11 stars. If the computation of each force took 1 µs (10^-6 sec), one iteration would need roughly 3x10^8 years with the O(N^2) algorithm, but only about 40 days with the O(N log2 N) algorithm. 24
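Again, the figures follow from simple arithmetic (a sketch using the 1 µs-per-force assumption above):

```python
import math

N       = 1e11            # stars in the galaxy
t_force = 1e-6            # seconds per force computation (assumption above)
year    = 365 * 24 * 3600

direct = N * N * t_force                # O(N^2) interactions per iteration
approx = N * math.log2(N) * t_force     # O(N log2 N) interactions per iteration

print(f"direct algorithm: {direct:.1e} s = {direct / year:.1e} years per iteration")
print(f"tree algorithm  : {approx:.1e} s = {approx / 86400:.0f} days per iteration")
```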
Web Search Engines and parallel components. [Architecture diagram of a search engine, whose components can run in parallel: crawlers and a crawl-control module, a page repository, an indexer and a collection-analysis module, the indexes (text, structure, utility), and a query engine with ranking, which takes queries and produces results.] 25
Web Search Engines and PageRank computation. PageRank computes the relevance of a web page based on the link structure of the Web. PageRank relies on the democratic nature of the Web: it interprets a hyperlink from page x to page y as a vote by page x for page y. Let W be the (normalized) adjacency matrix of the Web graph: W[i,j] = 1/o(j) if page j links to page i, where o(j) is the number of outgoing links of j, and 0 otherwise. W[i,j] is thus the probability that the random surfer moves from page j to page i. W has N^2 entries, where N is the number of Web pages. 26
Web Search Engines and PageRank computation. PageRank is computed by iteratively performing a matrix-by-vector multiplication with W (the power method). At the end, the result is the vector of N PageRank values, one per Web page. The PageRank value of a page reflects the probability that the random surfer lands on that page by clicking on links at random, occasionally restarting the navigation from a random page. Example: let N = 10^10 pages. Each iteration (say, of 50 iterations) takes N^2 multiplications (and about as many additions), so the power method takes about 10^20 x 50 = 5x10^21 FLOPs. Assuming a modern processor with a performance of 1 GFLOPS, the total running time is 5x10^21 / 10^9 = 5x10^12 seconds. Obviously this is not the best implementation, because the matrix is sparse. 27
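A minimal sketch of a sparse power-method implementation (illustrative; it uses SciPy's sparse matrices, a damping factor of 0.85 for the random restart, and assumes every page has at least one outgoing link). With a sparse W, each iteration costs time proportional to the number of links rather than N^2:

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(links, n, d=0.85, iters=50):
    """links: list of (src, dst) pairs; n: number of pages.
    Builds the column-stochastic matrix W (W[i, j] = 1/o(j) if j links to i)
    and iterates r = d*W*r + (1-d)/n (link following plus random restart)."""
    src = np.array([s for s, _ in links])
    dst = np.array([t for _, t in links])
    outdeg = np.bincount(src, minlength=n)
    W = csr_matrix((1.0 / outdeg[src], (dst, src)), shape=(n, n))
    r = np.full(n, 1.0 / n)              # start from the uniform distribution
    for _ in range(iters):
        r = d * (W @ r) + (1.0 - d) / n  # one sparse matrix-vector product
    return r

# tiny Web graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
print(pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3))
```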
Data Mining (DM). A typical business application, although there are many scientific uses of DM tools: the exploration and analysis, in an automatic or semi-automatic way, of large data sets to discover and extract significant patterns and rules. Patterns and rules must be novel, valid, potentially useful, and understandable. DM algorithms on huge data sets are computationally and I/O expensive. DM is one step of the KDD (Knowledge Discovery in Databases) process. 28
Data Mining (DM). Some of the main DM tasks: classification [predictive, supervised]; clustering [descriptive, unsupervised]; association rule discovery [descriptive, unsupervised]. The traditional sequential algorithms do not work well because of: huge data sets with a high number of dimensions; the heterogeneous and distributed nature of the data sets; the need for analysts to explore data with interactive and responsive DM tools, which requires knowledge extraction to be very fast. 29
DM: Clustering. Given N objects, each associated with a d-dimensional feature vector, find a meaningful partition of the N objects into K subsets or groups; K may be given, or discovered. We have to define some notion of similarity between objects. The feature vectors can be all numeric (well-defined distances), or categorical or mixed (harder to define similarity; geometric notions don't work). 30
K-means: a Centroid-based Clustering Algorithm. Specify K, the number of clusters, and guess the K seed cluster centroids. Then: 1. look at each point and assign it to the closest centroid; 2. recalculate the centers (means). Iterate steps 1 and 2 until the centers converge, or for a fixed number of iterations. 31
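A minimal K-means sketch along these lines (illustrative NumPy code; choosing the seed centroids as K random points is an assumption of this sketch):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """points: (N, d) array; k: number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # step 1: assign each point to the closest centroid (N*K*d distance work)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when the centers converge
            break
        centroids = new_centroids
    return centroids, labels

pts = np.random.default_rng(1).random((1000, 2))
centers, assignment = kmeans(pts, k=3)
```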
K-means: a Centroid-based Clustering Algorithm. Main operation: calculate the distance to all K means (centroids) for each point. Other operations: find the closest centroid for each point; calculate the mean squared error (MSE) over all points; recalculate the centroids. Number of operations (distance computations) per iteration: O(N K d), where N is the number of objects, K the number of clusters, and d the number of attributes (dimensions) of the feature vectors; note that the cost of each distance computation depends on d. If the distance computation optimistically takes 1 flop per dimension, then with N=10^6, K=10^2, and d=10^2 the number of flops per iteration is 10^10; with 10^2 iterations, the total number of flops is 10^12. Since DM tools need to be interactive, we need a computer able to execute about 1 Teraflops (10^12 flops/sec). 32
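The cost estimate itself is a few lines of arithmetic (a sketch using the figures above):

```python
N, K, d  = 1e6, 1e2, 1e2     # objects, clusters, dimensions
iters    = 1e2
flops_it = N * K * d         # ~1 flop per dimension per distance: 1e10 per iteration
total    = flops_it * iters  # 1e12 flops overall
print(f"per iteration: {flops_it:.0e} flops, total: {total:.0e} flops,"
      f" ~{total / 1e12:.0f} s on a 1-Tflops machine")
```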