Exploiting Transparent Remote Memory Access for Non-Contiguous- and One-Sided-Communication
|
|
|
- Harvey Atkinson
- 10 years ago
- Views:
Transcription
1 Workshop for Communication Architecture in Clusters, IPDPS 2002: Exploiting Transparent Remote Memory Access for Non-Contiguous- and One-Sided-Communication Joachim Worringen, Andreas Gäer, Frank Reker Lehrstuhl für Betriebssysteme, RWTH Aachen
2 Agenda Introduction into SCI via PCI MPI via SCI: SCI-MPICH Non-Contiguous Datatype Communication One-sided Communication Conclusion
3 SCI Principle Scalable Coherent Interface: Protocol for a memory-coupling interconnect standardized in 1992 (IEEE ) Not a physical bus but offers bus-like services transparent operation for source and target operates within a global 64-bit adress space Good scalability through flexible topologies and small packet sizes optional: efficient, distributed cache-coherence
4 Integrated Systems SCI Implementation PCI-SCI adapter: 64-bit 66 MHz PCI-PCI Bridge Key performance values on IA-32 platform: - Remote-write (word) 1,5 us - Remote-read (word) 3,2 us - Bandwidth PIO 170 MB/s (IA-32 PCI is the limit) (peak bandwidth for 512 byte blocksize) DMA 250 MB/s (DMA engine is the limit) - Synchronization ~ 15 us - Remote interrupt ~ 30 us
5 PCI-SCI Peformance on IA-32: Bandwidth DMA engine PCI Memory
6 PCI-SCI Peformance on IA-32: Latency Native packet size
7 Message Passing over SCI Similar to local shared memory, but: Different performance characteristics - Read / write latency - I/O-bus behaves differently than system bus - Granularity and contiguousity of access important - Load on independant inter-node links (collective operations) Different memory consistency model Connecting / Mapping incurs more overhead Resources are limited It s still a network: connection monitoring Naive approach map everything and do shmem is neither efficient nor scalable
8 Communicating Non-Contiguous Data
9 MPI Datatypes Basic datatypes (char, double,...) Derived datatypes Concanate elements: communicat vectors Associations of elements: communicate structs Using offsets, strides and upper/lower bounds, noncontiguous datatypes can be defined Example: Columns of a matrix ( C storage)
10 Sending Non-Contiguous Data The Generic Way What is the problem? Many communication devices have simple data specification <data location, data length> N send operations for N separate data chunks inefficient Generic solution: Sender Receiver Pack Transfer Gather Unpack
11 Sending Non-Contiguous Data The SCI Way Goal: avoid intermediate copy operations! SCI Advantage: high bandwidth for small data chunks Problem: requires fast, space-efficient, restartable pack/unpack algorithm: direct_pack_ff Sender Receiver Pack & Transfer Gather & Unpack
12 Performance (I) Simple Vector
13 Performance (II) Complex Type
14 Performance (III) - Alignment
15 Reduced Cache-Pollution
16 One-Sided Communication SCI is one-sided communication, but: Remote read bandwidth < 10 MB/s Only fragments of each process address space are accessable These fragments need to be allocated specifically (until now) Multi-protocol approach required to Handle every kind of memory type Achieve optimal performance for each type of access
17 Different Access Modes Direct access of remote memory: Window is allocated from an SCI shared segment Put-operations of any size Get-operations sized up to a threshold Emulated access of remote memory: Access initiated by origin via messages, executed by target: Windows allocated from private memory Get-Operations beyond threshold Accumulate-operations Synchronous or asynchronous completion Different overhead & synchronization semantics
18 Latency of Remote Accesses latency [µs] put shared put priv async put priv sync get shared get priv async get priv sync number elements ("double") 124
19 Bandwidth of Remote Accesses 100 bandwidth [MiB/s] 10 put shared put priv async put priv sync get shared get priv async get priv sync number elements ("double") 124
20 Scalability SCI ringlet bandwidth: 633 MiB/s Outgoing peak traffic for MPI_Put() per node: 122 MiB/s Worst case scalability: nodes offered load per node [MiB/s] total [MiB/s] efficiency 4 76,3 % 120,7 482,8 76,3 % 5 95,3 % 115,8 579,0 91,5% 6 114,4 % 97,8 586,5 92,7 % 7 133,5 % 79,3 555,1 87,7 % 8 152,5 % 62,8 502,2 79,3 %
21 Conclusions Avoiding separate pack/unpack for non-contiguous datatypes Highly efficient communication Reduced cache pollution and working set size Performance of one-sided communication comparable to integrated systems Multi-protocol approach ensures unlimited usability with optimal performance for each scenario Techniques are directly usable for SCI and local shmem Limits (and possible solutions): Address space of IA-32 architecture (IA-64 or others) Limited address translation resources (better hw, or sw cache) Ringlet scalability (increase link clock) Remote write latency (PCI-X w/ 133MHz -> down to 0.8 µs) Remote read-bandwidth (no real solution in sight) Interrupt latency (interrupt-on-write will halve it)
22 Thank you!
23 Spare Slides
24 Pack on the fly Store type information on commit of datatype: offset, length, stride, repetition count Usual data structure tree: recursive, complex restart Naive data structure list: not space-efficient Suitable data structure: linked list of stacks Bandwidth break-even point between Overhead for intermediate-copies Reduced bandwidth for fine-grained copy operations List sorted by size of stack entries ( leaves ) to allow for optimization: buffering for small and misaligned (parts of) leaves
25 Integration in SCI-MPICH Building the stack: MPIR_build_ff (struct MPIR_DATATYPE *type, int *leaves) Moving data: MPID_SMI_Pack_ff (char *inbuf, struct MPIR_DATATYPE *dtype_ptr, char *outbuf, int dest, int max, int *outlen) MPID_SMI_UnPack_ff (char *inbuf, struct MPIR_DATATYPE *dtype_ptr, char *outbuf, int from, int max, int *outlen)
26 Type Matching Sorting of basic datatypes (leafs) by size: Improves performance by mixing gathering & direct write Changes order in incoming buffer at receiver!
27 Efficiency Comparision: Non-contig Vector 1,2 1,0 0,8 0,6 0,4 0,2 ff via SCI ff via shmem Score Myrinet Score shmem SunFire Gigabit SunFire shmem Cray T3E 0,
28 MPI_Alloc_mem(): Below threshold 1: Below threshold 2: Memory Allocation allocate via malloc() allocate from shared mem pool Beyond threshold 2: create specific shared memory Attributes: Keys private, shared, pinned: type of memory buffer Key align with value: align memory buffer MPI_Free_mem(): Unpin memory, release shared memory segment Implicitely invoke remote segment callbacks
29 sparse micro-benchmark Simulate access to remote sparse matrix: MPI_Win_create (..., winsize,...) for (increasing values of access_cnt) { } offset = 0 stride = access_cnt * sizeof(datatype) flush_cache() time = MPI_Wtime() while (offset + access_size < winsize) { } MPI_Get/Put (.., partner, offset, access_cnt,..) offset = offset + stride MPI_Win_fence(...) time = MPI_Wtime() - time
30 Latency of Remote Accesses
31 Bandwidth of Remote Accesses
32 Point-to-Point Comparison
33 Scaling Comparison MPI_Put
Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck
Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering
Architecture of Hitachi SR-8000
Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data
Performance Evaluation of InfiniBand with PCI Express
Performance Evaluation of InfiniBand with PCI Express Jiuxing Liu Server Technology Group IBM T. J. Watson Research Center Yorktown Heights, NY 1598 [email protected] Amith Mamidala, Abhinav Vishnu, and Dhabaleswar
Building an Inexpensive Parallel Computer
Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University [email protected] COMP
CS5460: Operating Systems
CS5460: Operating Systems Lecture 13: Memory Management (Chapter 8) Where are we? Basic OS structure, HW/SW interface, interrupts, scheduling Concurrency Memory management Storage management Other topics
LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2015. Hermann Härtig
LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2015 Hermann Härtig ISSUES starting points independent Unix processes and block synchronous execution who does it load migration mechanism
- An Essential Building Block for Stable and Reliable Compute Clusters
Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative
Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak
Cray Gemini Interconnect Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak Outline 1. Introduction 2. Overview 3. Architecture 4. Gemini Blocks 5. FMA & BTA 6. Fault tolerance
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Hardware Based Virtualization Technologies. Elsie Wahlig [email protected] Platform Software Architect
Hardware Based Virtualization Technologies Elsie Wahlig [email protected] Platform Software Architect Outline What is Virtualization? Evolution of Virtualization AMD Virtualization AMD s IO Virtualization
High Performance Computing. Course Notes 2007-2008. HPC Fundamentals
High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs
Installation Guide for Dolphin PCI-SCI Adapters
Installation Guide for Dolphin PCI-SCI Adapters Part Number: DI950-10353 Version: 1.5 Date: December 7th 2006 Dolphin Interconnect Solutions ASA Olaf Helsets vei 6 P.O.Box 150, Oppsal N-0619 Oslo, Norway
EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE
EXPLOITING SHARED MEMORY TO IMPROVE PARALLEL I/O PERFORMANCE Andrew B. Hastings Sun Microsystems, Inc. Alok Choudhary Northwestern University September 19, 2006 This material is based on work supported
OpenSPARC T1 Processor
OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware
18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
Integrating PCI Express into the PXI Backplane
Integrating PCI Express into the PXI Backplane PCI Express Overview Serial interconnect at 2.5 Gbits/s PCI transactions are packetized and then serialized Low-voltage differential signaling, point-to-point,
An Architectural study of Cluster-Based Multi-Tier Data-Centers
An Architectural study of Cluster-Based Multi-Tier Data-Centers K. VAIDYANATHAN, P. BALAJI, J. WU, H. -W. JIN, D. K. PANDA Technical Report OSU-CISRC-5/4-TR25 An Architectural study of Cluster-Based Multi-Tier
An Active Packet can be classified as
Mobile Agents for Active Network Management By Rumeel Kazi and Patricia Morreale Stevens Institute of Technology Contact: rkazi,[email protected] Abstract-Traditionally, network management systems
Power-Aware High-Performance Scientific Computing
Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan
Computer Systems Structure Input/Output
Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices
Lecture 2 Parallel Programming Platforms
Lecture 2 Parallel Programming Platforms Flynn s Taxonomy In 1966, Michael Flynn classified systems according to numbers of instruction streams and the number of data stream. Data stream Single Multiple
COS 318: Operating Systems. I/O Device and Drivers. Input and Output. Definitions and General Method. Revisit Hardware
COS 318: Operating Systems I/O and Drivers Input and Output A computer s job is to process data Computation (, cache, and memory) Move data into and out of a system (between I/O devices and memory) Challenges
GPI Global Address Space Programming Interface
GPI Global Address Space Programming Interface SEPARS Meeting Stuttgart, December 2nd 2010 Dr. Mirko Rahn Fraunhofer ITWM Competence Center for HPC and Visualization 1 GPI Global address space programming
Automatic Workload Management in Clusters Managed by CloudStack
Automatic Workload Management in Clusters Managed by CloudStack Problem Statement In a cluster environment, we have a pool of server nodes with S running on them. Virtual Machines are launched in some
Interconnection Networks
Advanced Computer Architecture (0630561) Lecture 15 Interconnection Networks Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interconnection Networks: Multiprocessors INs can be classified based on: 1. Mode
Ethernet. Ethernet Frame Structure. Ethernet Frame Structure (more) Ethernet: uses CSMA/CD
Ethernet dominant LAN technology: cheap -- $20 for 100Mbs! first widely used LAN technology Simpler, cheaper than token rings and ATM Kept up with speed race: 10, 100, 1000 Mbps Metcalfe s Etheret sketch
TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance
TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented
Big Graph Processing: Some Background
Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs
Technical Computing Suite Job Management Software
Technical Computing Suite Job Management Software Toshiaki Mikamo Fujitsu Limited Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster Outline System Configuration and Software Stack Features The major functions
PCI Express: Interconnect of the future
PCI Express: Interconnect of the future There are many recent technologies that have signalled a shift in the way data is sent within a desktop computer in order to increase speed and efficiency. Universal
Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology
Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as
Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308
Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant [email protected] Applied Expert Systems, Inc. 2011 1 Background
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Gabriele Jost and Haoqiang Jin NAS Division, NASA Ames Research Center, Moffett Field, CA 94035-1000 {gjost,hjin}@nas.nasa.gov
TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to
Introduction to TCP Offload Engines By implementing a TCP Offload Engine (TOE) in high-speed computing environments, administrators can help relieve network bottlenecks and improve application performance.
AMD Opteron Quad-Core
AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced
High Speed I/O Server Computing with InfiniBand
High Speed I/O Server Computing with InfiniBand José Luís Gonçalves Dep. Informática, Universidade do Minho 4710-057 Braga, Portugal [email protected] Abstract: High-speed server computing heavily relies on
Why Compromise? A discussion on RDMA versus Send/Receive and the difference between interconnect and application semantics
Why Compromise? A discussion on RDMA versus Send/Receive and the difference between interconnect and application semantics Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400
Bandwidth Calculations for SA-1100 Processor LCD Displays
Bandwidth Calculations for SA-1100 Processor LCD Displays Application Note February 1999 Order Number: 278270-001 Information in this document is provided in connection with Intel products. No license,
COS 318: Operating Systems. Virtual Memory and Address Translation
COS 318: Operating Systems Virtual Memory and Address Translation Today s Topics Midterm Results Virtual Memory Virtualization Protection Address Translation Base and bound Segmentation Paging Translation
Bridging the Gap between Software and Hardware Techniques for I/O Virtualization
Bridging the Gap between Software and Hardware Techniques for I/O Virtualization Jose Renato Santos Yoshio Turner G.(John) Janakiraman Ian Pratt Hewlett Packard Laboratories, Palo Alto, CA University of
Computer Network. Interconnected collection of autonomous computers that are able to exchange information
Introduction Computer Network. Interconnected collection of autonomous computers that are able to exchange information No master/slave relationship between the computers in the network Data Communications.
50. DFN Betriebstagung
50. DFN Betriebstagung IPS Serial Clustering in 10GbE Environment Tuukka Helander, Stonesoft Germany GmbH Frank Brüggemann, RWTH Aachen Slide 1 Agenda Introduction Stonesoft clustering Firewall parallel
Principles and characteristics of distributed systems and environments
Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single
TCP Adaptation for MPI on Long-and-Fat Networks
TCP Adaptation for MPI on Long-and-Fat Networks Motohiko Matsuda, Tomohiro Kudoh Yuetsu Kodama, Ryousei Takano Grid Technology Research Center Yutaka Ishikawa The University of Tokyo Outline Background
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
CS2510 Computer Operating Systems
CS2510 Computer Operating Systems HADOOP Distributed File System Dr. Taieb Znati Computer Science Department University of Pittsburgh Outline HDF Design Issues HDFS Application Profile Block Abstraction
How To Understand The Concept Of A Distributed System
Distributed Operating Systems Introduction Ewa Niewiadomska-Szynkiewicz and Adam Kozakiewicz [email protected], [email protected] Institute of Control and Computation Engineering Warsaw University of
Optimizing TCP Forwarding
Optimizing TCP Forwarding Vsevolod V. Panteleenko and Vincent W. Freeh TR-2-3 Department of Computer Science and Engineering University of Notre Dame Notre Dame, IN 46556 {vvp, vin}@cse.nd.edu Abstract
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Introduction to PCI Express Positioning Information
Introduction to PCI Express Positioning Information Main PCI Express is the latest development in PCI to support adapters and devices. The technology is aimed at multiple market segments, meaning that
CSE 6040 Computing for Data Analytics: Methods and Tools
CSE 6040 Computing for Data Analytics: Methods and Tools Lecture 12 Computer Architecture Overview and Why it Matters DA KUANG, POLO CHAU GEORGIA TECH FALL 2014 Fall 2014 CSE 6040 COMPUTING FOR DATA ANALYSIS
Myrinet Express (MX): A High-Performance, Low-Level, Message-Passing Interface for Myrinet. Version 1.0 July 01, 2005
Myrinet Express (MX): A High-Performance, Low-Level, Message-Passing Interface for Myrinet Version 1.0 July 01, 2005 Table of Conten ts I OVERVIEW...1 I.1 UPCOMING FEATURES...3 II CONCEPTS...4 II.1 N
Architecting High-Speed Data Streaming Systems. Sujit Basu
Architecting High-Speed Data Streaming Systems Sujit Basu stream ing [stree-ming] verb 1. The act of transferring data to or from an instrument at a rate high enough to sustain continuous acquisition or
PART II. OPS-based metro area networks
PART II OPS-based metro area networks Chapter 3 Introduction to the OPS-based metro area networks Some traffic estimates for the UK network over the next few years [39] indicate that when access is primarily
A Transport Protocol for Multimedia Wireless Sensor Networks
A Transport Protocol for Multimedia Wireless Sensor Networks Duarte Meneses, António Grilo, Paulo Rogério Pereira 1 NGI'2011: A Transport Protocol for Multimedia Wireless Sensor Networks Introduction Wireless
Portable Scale-Out Benchmarks for MySQL. MySQL User Conference 2008 Robert Hodges CTO Continuent, Inc.
Portable Scale-Out Benchmarks for MySQL MySQL User Conference 2008 Robert Hodges CTO Continuent, Inc. Continuent 2008 Agenda / Introductions / Scale-Out Review / Bristlecone Performance Testing Tools /
Computer Organization & Architecture Lecture #19
Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications
Ethernet, VLAN, Ethernet Carrier Grade
Ethernet, VLAN, Ethernet Carrier Grade Dr. Rami Langar LIP6/PHARE UPMC - University of Paris 6 [email protected] www-phare.lip6.fr/~langar RTEL 1 Point-to-Point vs. Broadcast Media Point-to-point PPP
I/O. Input/Output. Types of devices. Interface. Computer hardware
I/O Input/Output One of the functions of the OS, controlling the I/O devices Wide range in type and speed The OS is concerned with how the interface between the hardware and the user is made The goal in
PCI Technology Overview
PCI Technology Overview February 2003 February 2003 Page 1 Agenda History and Industry Involvement Technology Information Conventional PCI PCI-X 1.0 2.0 PCI Express Other Digi Products in PCI/PCI-X environments
Cluster Grid Interconects. Tony Kay Chief Architect Enterprise Grid and Networking
Cluster Grid Interconects Tony Kay Chief Architect Enterprise Grid and Networking Agenda Cluster Grid Interconnects The Upstart - Infiniband The Empire Strikes Back - Myricom Return of the King 10G Gigabit
Quality of Service in the Internet. QoS Parameters. Keeping the QoS. Traffic Shaping: Leaky Bucket Algorithm
Quality of Service in the Internet Problem today: IP is packet switched, therefore no guarantees on a transmission is given (throughput, transmission delay, ): the Internet transmits data Best Effort But:
PCI Express IO Virtualization Overview
Ron Emerick, Oracle Corporation Author: Ron Emerick, Oracle Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and
Distributed communication-aware load balancing with TreeMatch in Charm++
Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration
Memory Architecture and Management in a NoC Platform
Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011 Overview Motivation State of the Art Data Management Engine
Advanced Computer Networks. High Performance Networking I
Advanced Computer Networks 263 3501 00 High Performance Networking I Patrick Stuedi Spring Semester 2014 1 Oriana Riva, Department of Computer Science ETH Zürich Outline Last week: Wireless TCP Today:
CSC 2405: Computer Systems II
CSC 2405: Computer Systems II Spring 2013 (TR 8:30-9:45 in G86) Mirela Damian http://www.csc.villanova.edu/~mdamian/csc2405/ Introductions Mirela Damian Room 167A in the Mendel Science Building [email protected]
Gigabit Ethernet Design
Gigabit Ethernet Design Laura Jeanne Knapp Network Consultant 1-919-254-8801 [email protected] www.lauraknapp.com Tom Hadley Network Consultant 1-919-301-3052 [email protected] HSEdes_ 010 ed and
Unified Fabric: Cisco's Innovation for Data Center Networks
. White Paper Unified Fabric: Cisco's Innovation for Data Center Networks What You Will Learn Unified Fabric supports new concepts such as IEEE Data Center Bridging enhancements that improve the robustness
The proliferation of the raw processing
TECHNOLOGY CONNECTED Advances with System Area Network Speeds Data Transfer between Servers with A new network switch technology is targeted to answer the phenomenal demands on intercommunication transfer
What s New in 2013. Mike Bailey LabVIEW Technical Evangelist. uk.ni.com
What s New in 2013 Mike Bailey LabVIEW Technical Evangelist Building High-Performance Test, Measurement and Control Systems Using PXImc Jeremy Twaits Regional Marketing Engineer Automated Test & RF National
Overlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer
NVIDIA Tools For Profiling And Monitoring. David Goodwin
NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale
Implementing MPI-IO Shared File Pointers without File System Support
Implementing MPI-IO Shared File Pointers without File System Support Robert Latham, Robert Ross, Rajeev Thakur, Brian Toonen Mathematics and Computer Science Division Argonne National Laboratory Argonne,
I/O Virtualization Using Mellanox InfiniBand And Channel I/O Virtualization (CIOV) Technology
I/O Virtualization Using Mellanox InfiniBand And Channel I/O Virtualization (CIOV) Technology Reduce I/O cost and power by 40 50% Reduce I/O real estate needs in blade servers through consolidation Maintain
Scheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
Communication Systems Internetworking (Bridges & Co)
Communication Systems Internetworking (Bridges & Co) Prof. Dr.-Ing. Lars Wolf TU Braunschweig Institut für Betriebssysteme und Rechnerverbund Mühlenpfordtstraße 23, 38106 Braunschweig, Germany Email: [email protected]
Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc()
CS61: Systems Programming and Machine Organization Harvard University, Fall 2009 Lecture 10: Dynamic Memory Allocation 1: Into the jaws of malloc() Prof. Matt Welsh October 6, 2009 Topics for today Dynamic
Lustre Networking BY PETER J. BRAAM
Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information
COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
PCI Express Impact on Storage Architectures and Future Data Centers. Ron Emerick, Oracle Corporation
PCI Express Impact on Storage Architectures and Future Data Centers Ron Emerick, Oracle Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA. Member companies
SCSI vs. Fibre Channel White Paper
SCSI vs. Fibre Channel White Paper 08/27/99 SCSI vs. Fibre Channel Over the past decades, computer s industry has seen radical change in key components. Limitations in speed, bandwidth, and distance have
Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based
1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
I/O Device and Drivers
COS 318: Operating Systems I/O Device and Drivers Prof. Margaret Martonosi Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall11/cos318/ Announcements Project
A New Chapter for System Designs Using NAND Flash Memory
A New Chapter for System Designs Using Memory Jim Cooke Senior Technical Marketing Manager Micron Technology, Inc December 27, 2010 Trends and Complexities trends have been on the rise since was first
