High Performance Computing
1 High Performance Computing: Introduction
Salvatore Orlando
2 General information about the course
Web site: register in moodle.unive.it, under OFFERTA FORMATIVA > Corsi di Laurea Magistrale > Dipartimento di Scienze Ambientali, Informatica e Statistica > Informatica - Computer Science [CM9] > High Performance Computing [CM0227]
The exam is subdivided into two parts: a written exam (60%) and home assignments or a project (40%).
Texts, reading list, didactic material:
- Slides
- A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing, 2nd Ed., Addison-Wesley.
- B. Wilkinson, M. Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd Ed., Prentice-Hall.
- J. L. Hennessy, D. A. Patterson. Computer Architecture: A Quantitative Approach, 5th Ed., Morgan Kaufmann.
- I. Foster. Designing and Building Parallel Programs. Addison-Wesley, 1995 (an online version is available).
3 HPC and Parallel Computing
High-performance computing (HPC) exploits parallel hardware and software. HPC uses supercomputers, computer clusters, and heterogeneous computing architectures to solve advanced computational problems. Today, computer systems approaching the TeraFLOPS range (10^12 floating-point operations per second) are counted as HPC.
4 Sequential vs. Parallel Computing
Sequential computing: solve a problem with an algorithm whose instructions are executed in sequence; the corresponding computational model is characterized by a single processor.
Parallel computing: solve a problem with an algorithm whose instructions are executed in parallel; the corresponding computational model is characterized by multiple processors and associated mechanisms of cooperation. The algorithm has to efficiently exploit the parallelism that can be made explicit in the problem, with the goal of making the execution faster.
5 The parallelization process
General methodology: subdivide (decompose) the problem among the various processors/workers; processors work on their own parts of the whole problem, but cooperate in solving shared portions of it. The final goals are:
- to balance the load among processors
- to reduce as much as possible the communications/synchronizations due to processor cooperation
- to avoid replication of the same task by multiple processors
6 Pros and Cons of Parallel Computing
The benefit of exploiting parallelism to speed up applications has been a disputed argument over the years, but today parallelism is in the mainstream: parallel processors now equip even mobile phones. In the following, we discuss some arguments used as pros and cons of exploiting parallel computing.
7 Cons
- Parallel software development is very complex. The complexity stems from the need to specify concurrent tasks and to coordinate their execution with performance guarantees.
- Software is not portable between different platforms: the programming environments are not standardized enough.
- Tuning the performance (performance debugging) is expensive.
- Continuous improvements in the performance of microprocessors: what are the advantages of devoting so much time and money to developing complex parallel software, when Moore's law also holds for the performance of microprocessors (doubling approximately every two years)? However, this continuous increase of uniprocessor performance is no longer true.
8 Pros: Limits in the improvement of unicores
Limit of Moore's law: the performance curve of sequential (single-core) microprocessors has stopped increasing. Even if Moore's law still applies to the density of transistors (2x transistors/chip every 1.5 years), in the past this density increase also entailed a similar performance increase. The performance curve is measured by the SPEC sequential benchmarks relative to the VAX-11/780.
[Figure 1.1 of Hennessy & Patterson: growth in processor performance since the late 1970s, plotted relative to the VAX-11/780 as measured by the SPEC benchmarks; growth of about 25%/year until the mid-1980s, 52%/year afterwards, then flattening with the move to multi-core processors (Intel Core/Xeon, AMD Athlon, IBM Power, Digital Alpha, etc.).]
9 Pros: Limits in the improvement of unicores
In the last decades the performance improvement of microprocessors (unicores) was impressive, due to:
- Moore's law (2x transistors/chip every 1.5 years)
- parallelism/pipelining in instruction execution: implicit ILP (Instruction-Level Parallelism); the average CPI (cycles per instruction) was reduced by an order of magnitude in a decade
- improved HLL compilers
- deep memory hierarchies, with large and multilevel caches (the memory bottleneck has become more evident)
- very short clock cycles (GHz frequencies)
10 Pros: Limits in the improvement of unicores
Trends in recent years:
- limits in the automatic extraction of further ILP (independent instructions) from a single sequential program
- no benefit from adding further pipelined functional units in unicores
- limits in further increasing the clock frequency
11 Pros: Trends in parallel hardware
Due to the limits of unicore performance, CPU manufacturers are now integrating two or more individual processors (cores) onto a single chip: multi-core microprocessors.
- The plan is to double the number of cores per microprocessor with each semiconductor technology generation (about every two years).
- SIMD (data-parallel) coprocessors accelerate applications.
Multi-core architectures need to be explicitly programmed in parallel.
12 Pros: Trends in parallel hardware
Clock-frequency scaling has been replaced by scaling the number of cores: high-frequency processors need high power.
13 Pros: Trends in parallel hardware
The low cost of standard electronic equipment (off-the-shelf microprocessors and other components such as GP-GPUs, memory, and network boards) has permitted the widespread diffusion of computer clusters interconnected by fast communication networks.
14 Pros: Trends in hardware
Clusters are made of many multicore nodes, host GPU (graphics) boards, and include several disks. Clusters are used not only for scientific computing, but also for business data-intensive applications, e.g., as parallel data-intensive platforms hosting web servers, databases, search engines, data mining, and rendering farms. They provide not only greater aggregate computational bandwidth, but also greater aggregate bandwidth and size at the various memory levels.
15 Other Pros
High latency of main and secondary memory: it is well known that overall performance also depends on the capability of the memory hierarchy to promptly supply data to the computational core. Memory hierarchies were introduced to reduce the consequences of the Von Neumann bottleneck: caches reduce the memory access latency.
Parallel platforms may reduce the consequences of this bottleneck: we can exploit the aggregate memory banks and disks at the various levels of the hierarchy. This increases the overall memory size and bandwidth, but it also reduces the average memory latency, if we are able to increase the cache hit rate. However, to effectively exploit aggregate memory, parallel algorithms must be carefully designed to exploit locality.
16 Other Pros: Parallelism in distributed infrastructures
Computer networks and distributed infrastructures: to solve large-scale problems, often with data already distributed across remote sites, we can nowadays exploit many computers connected by wide-area networks (the Internet), i.e., a large-scale, heterogeneous, parallel/distributed platform. This requires parallel/distributed software and suitable problem-decomposition strategies; unfortunately, not all problems can be parallelized successfully on these platforms.
Examples: SETI@home (Search for Extra-Terrestrial Intelligence) employs the power of desktop computers to analyze the electromagnetic signals coming from space; the factorization of large integers, with applications in cryptography and optimization problems.
17 When parallelism is necessary
Traditional scientific paradigm: first theorize, then run lab experiments to confirm or deny the theory. Traditional engineering paradigm: first design, then build a laboratory prototype. Both paradigms are being replaced by numerical experiments and numerical prototyping, because:
- real phenomena are too complicated to model (e.g., climate prediction)
- real experiments are too hard, too expensive, too slow, or too dangerous for a laboratory (e.g., oil reservoir simulation, large wind tunnels, overall aircraft design, galactic evolution, whole-factory or product life-cycle design and optimization, etc.)
Computational science and large-scale numerical simulation: a new way to do science.
18 When parallelism is necessary
Many of the fundamental questions in science (especially those with potentially broad social, political, and scientific impact) are sometimes referred to as "Grand Challenge" problems. Many of the Grand Challenge problems can only be solved computationally: they require a lot of storage and computational resources to be solved in a reasonable time, together with suitable parallel software.
19 Some particularly challenging computations
- Science: global climate modeling; astrophysical modeling; biology (genomics, protein folding, drug design); computational chemistry; computational material sciences and nanosciences
- Engineering: crash simulation; semiconductor design; earthquake and structural modeling; computational fluid dynamics (airplane design); combustion (engine design)
- Business: financial and economic modeling; data mining, social network analysis; search engines
- Defense: nuclear weapons (tested by simulations); cryptography
20 Global weather forecasting
The simulation algorithm requires the atmosphere to be subdivided into disjoint regions (three-dimensional cells). The algorithm executes repeated computations for each cell, at each time step.
21 Global weather forecasting
Example: cells of 1 mile x 1 mile x 1 mile, for a total height of the atmosphere of 10 miles (10 cells), give a total of about 5 x 10^8 cells. Suppose that a per-cell computation requires 200 FP ops (floating-point operations); then a whole time step requires a total of about 10^11 FP ops. Using a simulation time step of 10 minutes, we need 6 steps for 1 hour, and a total of 1440 steps (about 1.4 x 10^14 FP ops) for 10 days of weather prediction.
- A computer able to execute 100 Mflops (10^8 FP ops/sec) takes more than 10^6 sec (about 16 days!) to run the prediction.
- To execute the same computation in 10 min, we would need a computer able to execute about 0.2 Tflops (2 x 10^11 FP ops/sec).
What would happen if we increased the number of steps per hour, or made the cells smaller?
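A quick back-of-the-envelope check of these figures (a minimal sketch in Python; the cell count and per-cell cost are the assumptions stated above):

```python
# Cost model for the global weather-forecasting example.
cells = 5e8              # 1x1x1-mile cells, atmosphere 10 miles high
flops_per_cell = 200     # FP ops per cell per time step
steps = 6 * 24 * 10      # 6 steps/hour, 10 days -> 1440 steps

flops_per_step = cells * flops_per_cell       # ~1e11 FP ops
total_flops = flops_per_step * steps          # ~1.4e14 FP ops

print(f"at 100 Mflops: {total_flops / 1e8:.1e} s")              # ~1.4e6 s, about 16 days
print(f"for a 10-min run: {total_flops / 600:.1e} FP ops/sec")  # ~2e11, i.e. ~0.2 Tflops
```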
22 Motion of celestial objects (N-bodies)
Given N bodies with masses m_1, ..., m_N in three-dimensional (physical) space, with initial positions and velocities, where the force of attraction between each pair of bodies is Newtonian: determine the position of each body at every future moment of time.
For N bodies, we must compute N-1 forces per body, i.e., approximately N^2 forces in total. There exist approximate algorithms that require N log2 N force computations: they exploit the fact that distant groups of bodies (galaxies, solar systems) appear very close together and can be treated as a single large body centered at their center of mass, so that force interactions are computed between agglomerates rather than single bodies (e.g., the Solar system approximated by its center of mass to compute the force interaction with the Proxima Centauri star, whose diameter is one-seventh that of the Sun).
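As a minimal sketch of the direct O(N^2) method (the NumPy-based code and all names are illustrative assumptions, not taken from the slides):

```python
import numpy as np

G = 6.674e-11  # gravitational constant

def direct_forces(pos, mass):
    """Direct O(N^2) computation of the Newtonian forces on N bodies:
    pos is an (N, 3) array of positions, mass an (N,) array of masses."""
    n = len(mass)
    forces = np.zeros((n, 3))
    for i in range(n):
        for j in range(n):
            if i != j:
                r = pos[j] - pos[i]              # vector from body i to body j
                dist = np.linalg.norm(r)
                forces[i] += G * mass[i] * mass[j] * r / dist**3
    return forces

# Tiny usage example with 3 random bodies.
rng = np.random.default_rng(0)
print(direct_forces(rng.random((3, 3)), rng.random(3)))
```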
23 Motion of celestial objects (N-bodies)
The approximate algorithm (Barnes-Hut) exploits an irregular data structure organized as a sort of tree (oct-tree): the creation of the tree dynamically divides the simulation volume into cubic cells, and the tree maintains information about the bodies that fall in the same box (agglomerates), and about the centers of mass of the agglomerates.
24 Motion of celestial objects (N-bodies)
The simulation algorithms for N-bodies are iterated for many time steps: at every step the new positions of the bodies are updated. Suppose we have a galaxy of 10^11 stars. If the computation of each force took 1 µs (10^-6 sec), we would need:
- about 10^9 years per iteration by employing the algorithm of complexity N^2
- about one year per iteration by employing the algorithm of complexity N log2 N
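Checking the orders of magnitude (a sketch; the exact constants depend on rounding, but the gap of several orders of magnitude between the two algorithms is the point):

```python
import math

N = 1e11                  # stars in the galaxy (assumed above)
sec_per_force = 1e-6      # 1 microsecond per force computation
sec_per_year = 3.15e7

quadratic = N * N * sec_per_force / sec_per_year          # on the order of 1e8-1e9 years
nlogn = N * math.log2(N) * sec_per_force / sec_per_year   # well under a year
print(f"N^2:      ~{quadratic:.1e} years per iteration")
print(f"N log2 N: ~{nlogn:.1e} years per iteration")
```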
25 Web Search Engines and parallel components
[Architecture diagram: Crawlers, driven by a Crawl Control module, feed a Page Repository; an Indexer and a Collection Analysis module build the Text, Structure, and Utility Indexes; a Query Engine answers Queries using the Indexes, and a Ranking module orders the Results.]
26 Web Search Engines and PageRank computation
PageRank computes the relevance of a web page based on the link structure of the Web. PageRank relies on the democratic nature of the Web: it interprets a hyperlink from page x to page y as a vote by page x for page y.
Let W be the adjacency matrix of the Web graph, with W[i,j] = 1/o(j), where o(j) is the number of outgoing links from page j: W[i,j] is the probability that the random surfer goes from page j to page i. W has N^2 entries, where N is the number of Web pages.
27 Web Search Engines and PageRank computation
PageRank is computed by iteratively performing a matrix (W) by vector multiplication. At the end, the result is the vector of N PageRank values, each associated with a Web page. The PageRank value of a page reflects the probability that the random surfer lands on that page by clicking on links randomly, while also restarting the navigation from a random page.
Example: let N = 10^10 pages. Each iteration (e.g., of 50 iterations) takes N^2 multiplications (and N^2 additions). The power method thus takes about 10^20 x 50 = 5 x 10^21 FLOPs. Assuming a modern processor (with a performance of 1 GFLOPS), the total running time is 5 x 10^21 / 10^9 = 5 x 10^12 seconds. Obviously this is not the best implementation, due to matrix sparsity.
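A minimal dense sketch of this power iteration on a toy graph (the damping factor 0.85 and all names here are illustrative assumptions; a realistic implementation would exploit the sparsity of W):

```python
import numpy as np

def pagerank(W, d=0.85, iters=50):
    """Power method: repeatedly multiply the column-stochastic matrix W
    by the current rank vector; the (1 - d) term models random restarts."""
    n = W.shape[0]
    r = np.full(n, 1.0 / n)            # start from the uniform distribution
    for _ in range(iters):
        r = d * (W @ r) + (1 - d) / n  # one matrix-by-vector multiplication
    return r

# Toy Web graph with 3 pages: W[i, j] = 1/o(j) if page j links to page i.
W = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
print(pagerank(W))                     # one PageRank value per page
```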
28 Data Mining (DM)
A typical business application, even if there are many scientific uses of DM tools: the exploration and analysis, in an automatic or semi-automatic way, of large data sets to discover and extract significant patterns and rules. Patterns and rules must be innovative, valid, potentially useful, and understandable. DM algorithms on huge data sets are computationally and I/O expensive. DM is a step of the KDD (Knowledge Discovery in Databases) process.
29 Data Mining (DM)
Some of the main DM tasks:
- Classification [predictive, supervised]
- Clustering [descriptive, unsupervised]
- Association rule discovery [descriptive, unsupervised]
Traditional sequential algorithms do not work, due to:
- huge data sets with a high number of dimensions
- the heterogeneous and distributed nature of data sets
- analysts needing to explore data using interactive and responsive DM tools, so that the knowledge extraction must be very fast
30 DM: Clustering
Given N objects, each associated with a d-dimensional feature vector, find a meaningful partition of the N examples into K subsets or groups (K may be given, or discovered). We have to define some notion of similarity between objects. The feature vector can be:
- all numeric (well-defined distances)
- all categorical, or mixed (harder to define similarity; geometric notions don't work)
31 K-means: a Centroid-based Clustering Algorithm
Specify K, the number of clusters, and guess the K seed cluster centroids; then:
1. Look at each point and assign it to the centroid that is the closest.
2. Recalculate the centers (means).
Iterate steps 1 and 2 until the centers converge, or for a fixed number of iterations.
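A minimal NumPy sketch of these two steps (function and variable names are illustrative assumptions; it also assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means: assign each point to its closest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]  # K seed centroids
    for _ in range(iters):
        # Step 1: distance of every point to every centroid; pick the closest.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recalculate the centers (means) of the K clusters.
        new_centroids = np.array([points[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centroids, centroids):  # centers have converged
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.random.default_rng(1).random((200, 2)), k=3)
```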
32 K-means: a Centroid-based Clustering Algorithm
Main operation: calculate the distance to all K means (or centroids) for each point. Other operations: find the closest centroid for each point; calculate the mean squared error (MSE) over all points; recalculate the centroids.
Number of operations (distance computations) per iteration: O(N K d), where N is the number of objects, K is the number of clusters, and d is the number of attributes (dimensions) of the feature vectors; note that the complexity of each distance computation depends on d.
If the computation of the distance along each dimension optimistically takes 1 Flop, and we have N = 10^6, K = 10^2, and d = 10^2, then the number of Flops per iteration is 10^10; if the number of iterations is 10^2, the total number of Flops is 10^12. Since we need DM tools to be interactive, we need a computer able to execute about 1 Teraflops (10^12 Flops/sec).
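And a sketch of the cost arithmetic above:

```python
N, K, d = 1e6, 1e2, 1e2        # objects, clusters, dimensions (from the slide)
iterations = 1e2

flops_per_iter = N * K * d                  # 1e10: one Flop per dimension per distance
total_flops = flops_per_iter * iterations   # 1e12 Flops overall
print(f"per iteration: {flops_per_iter:.0e}, total: {total_flops:.0e} Flops")
# An interactive (~1 s) response therefore needs a ~1 Teraflops machine.
```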
