LOCKUP-FREE INSTRUCTION FETCH/PREFETCH CACHE ORGANIZATION
|
|
|
- Collin Moody
- 10 years ago
- Views:
Transcription
1 LOCKUP-FREE INSTRUCTION FETCH/PREFETCH CACHE ORGANIZATION DAVID KROFT Control Data Canada, Ltd. Canadian Development Division Mississauga, Ontario, Canada ABSTRACT In the past decade, there has been much literature describing various cache organizations that exploit general programming idiosyncrasies to obtain maximum hit rate (the probability that a requested datum is now resident in the cache). Little, if any, has been presented to exploit: (1) the inherent dual input nature of the cache and (2) the many-datum reference type central processor instructions. No matter how high the cache hit rate is, a cache miss may impose a penalty on subsequent cache references. This penalty is the necessity of waiting until the missed requested datum is received from central memory and, possibly, for cache update. For the two cases above, the cache references following a miss do not require the information of the datum not resident in the cache, and are therefore penalized in this fashion. In this paper, a cache organization is presented that essentially eliminates this penalty. This cache organizational feature has been incorporated in a cache/memory interface subsystem design, and the design has been implemented and prototyped. An existing simple instruction set machine has verified the advantage of this feature; future, more extensive and sophisticated instruction set machines may obviously take more advantage. Prior to prototyping, simulations verified the advantage. INTRODUCTION A cache buffer 1,2 is a small, fast memory holding most recently accessed data and its surrounding neighbors, Because the access time of this buffer is usually an order of magnitude greater than main or central memory, and the standard software practice is to localize data, the effective memory access time is considerably reduced when a cache buffer is included. The cost increment for this when compared with the cost of central memory along with the above access time advantage infers cost effectiveness. Now, accepting the usefulness of a cache buffer, one looks into ways of increasing its effectiveness; that is, further decreasing the effective memory access time. Considerable research has been done to fine tune a cache design for various requirements. 3,6 This fine tuning consisted of selecting optimal total cache buffer size, block size (the number of bytes to be requested on a cache miss), space allocation, and replacement algorithms to maximize hit rate. Another method presented to increase the hit rate was selective prefetching. 7 All these methods assume the cache can handle only one request at a time; on a miss, the cach stays busy servicing the request until the data is received from memory and, possibly, for cache buffer update. In this paper, a cache organization is presented that increases the effectiveness of a normal cache inclusion by using the inherent dual input nature of an overall cache and the many data reference instructions. In oth words, it would be extremely useful to pipeline the requests into the cache at the cache hit throughput ratq regardless of any misses. If this could be accomplisheq then all fetch and/or prefetch of instructions could be totally transparent to the execution unit. Also, for instructions that require a number of data references, the requests could be almost entirely overlapped. Obviously, requests could not be streamed into the cache at the hit throughput rate indefinitely. There is a limit. This organization's limit is imposed by the numb. of misses that have not been completely processed th~ the cache will keep track of simultaneously without Iockup. ORGANIZATION In addition to the standard blocks, this cache organization requires the following: 1. One unresolved miss information/status holding register (MSHR) for each miss that will be handle concurrently. 2. One n way comparator, in which n is the number of MSHR registers, for registering hits on data in transit from memory. 3. An input stack to hold the total number of received data words possibly outstanding. The size of this stack, consequently, is equal to the EEE 81
2 block size in words times the number of MSHR registers. 4. MSHR status update and collecting networks. 5. The appropriate control unit enhancement to accommodate 1 through 4. Figure 1 is a simplified block diagram of the cache organization. (A set-associative operation is assumed.) Included are the required blocks for a set-associative cache (tag arrays and control, cache buffer), the central memory interface blocks (memory requestor, memory receiver), and the cache enhancement blocks (miss info holding registers, miss comparator and status collection, input stack). The miss info holding registers hold all necessary information to (1) handle the central memory received data properly and (2) inform the main cache control, through the miss comparator and status collector, of all hit and other status of data in transit from memory. The input stack is necessary to leave the main cache buffer available for overlapped reads and writes. Note that this organization allows for data just received from memory or in the input stack to be sent immediately to the requesting CPU units. Of course, the number of MSHR registers is important. As with set size (blocks per set), the incremental value decreases rapidly with the number of registers. This is good, because the cost increases significantly with the number of registers. Figure 2 presents a qualitative curve. The average delay time is caused by lockout on outstanding misses. This delay time, of course, is also dependent on cache input request and hit rates. In the degenerate case, 1 MSHR register of reduced size is required; 2 MSHR registers allow for overlap while one miss is outstanding, but still would lock up the cache input on multiple misses outstanding. Owing to cost considerations and incremental effectiveness gained on increasing the number of MSHR registers, 4 registers appear to be optimal. 8 The necessary information contained within one of these MSHR registers includes the following: First, the cache buffer address, along with the input request address, is required. The cache buffer address is kept to know where to place the returning memory data; the input request address is saved to determine if, on subsequent requests, the data requested is on its way from central memory. Second, input request identification tags, along with the send-to-cpu status, are stored. This information permits the cache to return to CPU requesting units only the data requested and return it with its identification tag. Third, in-input-stack indicators are used to allow for reading data directly from the input stack. Fourth, a code (for example, one bit per byte for partial write) is held for each word to indicate what bytes of the word have been written to the cache buffer. This code controls the cache buffer write update and allows dispensing of data for buffer areas that have been totally written after requested. The cache, thus, has the capability of processing partial write input requests 'ton the fly" without purging. (Of course, this partial write code may not be incorporated if the cache block is purged on a partial write request to a word in a block in transit from memory.) Last, some control information (the register contains valid information only for returning requested data, but not for cache buffer update and the number of words of the block that have been received and written, if required, into the cache buffer) is needed. Therefore, each MSHR register contains: 1. Cache buffer address 2. Input request address 3. Input identification tags (one per word) 4. Send-to-CPU indicators (one per'word) 5. In-input-stack indicators (one per word) 6. Partial write codes (one per word) 7. Number of words of blocks processed 8. Valid information indicator 9. Obsolete indicator (information not valid for cache update or MSHR hit on data in transit) OPERATION The operation can be split into two basic parts: memory receiver/input stack operations and tag array control operations. For memory receiver/input stack operations, the fields of MSHR interrogated are the following: 1. Send-to-CPU indicator 2. Input identification tags 3. Cache buffer address 4. Partial write codes 82
3 5. Obsolete indicator 6. Valid indicator When a word is received from memory, it is sent to the CPU requesting unit if the send-to-cpu indicator is set; the appropriate identification tag accompanies the data. This word is also written into the input stack if the word's space has not been previously totally written in the cache buffer or if MSHR is not obsolete (invalid for cache update). The words of data are removed from this input stack on a first-in, first-out basis and are written into the cache buffer using fields 3 and 4. Of course, MSHR must hold valid information when interrogated, or an error signal will be generated. A slight diversion is necessary at this point to explain cache data tagging. On a miss, the cache requests a block of words. Along with each word, a cache tag is sent. This tag points to the particular assigned MSHR and indicates the word of the block. Note that the cache saves in MSHR the requesting unit's identification tag. This tagging closes the remaining open link for the handling of data returned from memory and removes all restrictions on memory on the order of responses. If a particular processor/memory interface allows for a data width of a block of words for cache to central memory requests, the cache data tagging may be simplified by merely pointing to the particular assigned MSHR. If, however, all other data paths are still one word wide, the main operations would be essentially unchanged. Consequently, this extended interface would not significantly reduce the control complexity or the average lockout time delay per request. The fields of the MSHR updated during memory receiver/input stack operations are the following: 1. In-input-stack indicators 2. Partial write codes 3. Number of words of block processed 4. Valid information indicator (being used indicator) The in-input-stack indicators are set when the data word is written into the input stack and cleared when data is removed from the input stack and written into the cache buffer. The partial write code is set to indicate totally written when the data word from central memory indicates the cache buffer. In addition, whenever a data word. is disposed of because of being totally written or having an obsolete MSHR, or is written into the cache buffer, the number-of-words-processed counter is incremented. On number-of-words-processed counter overflow (all words for a block have been received), the valid or used MSHR indicator is cleared. For tag array control operations, the following fields of MSHRs are interrogated: 1. Input request addresses 2. Send-to-CPU indicators 3. In-input-stack indicators 4. Partial write codes 5. Valid indicator 6. Obsolete indicator Fields 1, 5, and 6 are used along with current input request address and the n way MSHR comparator to determine if there is a hit on previously missed data still being handled (previous miss hit). Fields 2, 3, and 4 produce one of the following states for the previous miss hit: Partially written (Partial write code has at least one bit set.) Totally written (Partial write code is all l"s.) In-input-stack Already-asked-for (Send-to-CPU indicator is already set.) Figure 3 indicates the actions followed by the tag array control under all the above combinations for a previous miss hit. On a miss, a MSHR is assigned, and the following is performed: 1. Valid indicator set 2. Obsolete indicator cleared 3. Cache buffer address saved in assigned MSHR 4. Input request address saved in assigned MSHR 5. Appropriate send-to-cpu indicator set and others cleared 83
4 6. Input identification tag saved in appropriate position 7. All partial write codes associated with assigned MSHR cleared 8. All MSHRs pointing to same cache buffer address purged (Set partial write code to all l's) Note that actions 5 and 6 will vary if the cache function was a prefetch (all send-to-cpu indicators are cleared, and no tag is saved). Action 8 prevents data from a previous allocation of a cache buffer block from overwriting the present allocation's data. On a miss and previous miss hit (the cache buffer block was reallocated for the same input address before all data was received), MSHR is set obsolete to prevent possible subsequent multiple hits in the MSHR comparator. SIMULTANEITY CONCLUSIONS This cache organization has been designed, prototyped, and verified. The design allows for the disabling of the MSHR registers. Using this capability, the direct effect of the number of MSHR registers on the execution times of a number of applications was noted. The reduced execution times of these applications directly demonstrated the effectiveness of this enhancement. (It is beyond the scope of this paper to analyze quantitatively the average lockout delay/request with respect to the number of enabled MSHR registers for different cache input rates and hit rates [cache buffer sizes]. This analysis will be reported in future work.) The cost of the 4 MSHR additions to the design was about 10% of the total cache cost. ACKNOWLEDGMENT The author thanks Control Data Canada for the opportuni~ to develop the new cache organization presented in this paper. A previous miss hit on a data word just being received is definitely possible. Depending on the control operation, this word may have its corresponding send-to-cpu indicator's output forced to the send condition or may be read out of the input stack on the, next minor cycle. DIAGNOSABILITY To diagnose this cache enhancement more readily, cache input functions should be added to clear and set the valid indicators of the MSHR registers. This would allow the following error conditions to be forced: Cache tag points to nonvalid MSHR register Multiple hit with MSHR comparator Previous miss hit status-totally written and not partially written All other fields of the MSHR registers may be verified by using these special cache input functions in combination with the standard input functions with all combinations of addresses, identification tags and data. 184
5 REFERENCES 1C. J. Conti. Concepts of buffer storage, IEEE Computer Group News, 2 (March 1969). 2R. i. Meade. How a cache memory enhances a computer's performance, Electronics (Jan. 1972). 3K. R. Kaplan and R. O. Winder. Cache-based computer systems, IEEE Computer (March 1973). 4j. Bell, D. Casasent, and C. G. Bell. An investigation of alternative cache organizations. IEEE Transactions on Computers, C-23 (April 1974). 5j. H. Kroeger and R. M. Meade (of Cogar Corporation, Woppingers Fall, NY). Cache buffer memory specification. 6A. V. Pohm, O. P. Agrawal, and R. N. Monroe. The cost and performance tradeoffs of buffered memories. Proceedings of the IEEE, 63 (Aug. 1973). 7A. J. Smith. Sequential program prefetching in memory hierachies, IEEE Computer (Dec 1978). 8G. H. Toole. Instruction Iookahead and execution traffic considerations for the cache design (Development division internal paper), Control Data-Canada,
6 MEMORY REQUESTOR // CENTRAL MEMORY AODRESSIS~ DATA CPU INST UNIT CPU EXEC UNIT --TAG CONTROL ADDRESS & CONTROL ADDRESS & DATA MISS INFO I HOLDING REGISTERS < ADDRESS i.~-~ CACHE BUFFER ADDRESS STATUS MISS COMPARATOR AND STATUS COLLECTION J INP ST~ DATA CPU UNITS CENTRAL MEMORY MEMORY RECEIVER DATA Figure 1. Cache Organization p- O Lu NO. OF MISS INFO HOLDING REGISTERS Figure 2. Qualitative Curve for Lockout Delay 86
7 INPUT PARTIALLY TOTALLY IN ALREADY FUNCTION WRITTEN WRITTEN INPUT ASKED ACTION STACK FOR READ NO NO NO NO SET SEND-TO-DPU BIT SAV~ IDENT REAO NO NO NO YES READ FROM CENTRAL MEMORY (BY- PASS) READ NO NO YES X READ FROM STACX READ YES NO X X R_AO FROM CENTRAL MEMORY (BY-PASS) RFa,0 YES YES X X READ FROM CACHE BUFFER PREFETCH X X X X NO ACTION WRITE BYTES TO CACHE SUFFER. SET APPROPRIATE WRITE X X X X PARTIAL WRITE SITS. WHERE X IS DON'T CARE Figure 3. Previous Miss Hit Operations 87
Chapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
Chapter 11 I/O Management and Disk Scheduling
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization
find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1
Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems
Performance Impacts of Non-blocking Caches in Out-of-order Processors
Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada [email protected] Micaela Serra
Database Management Systems
4411 Database Management Systems Acknowledgements and copyrights: these slides are a result of combination of notes and slides with contributions from: Michael Kiffer, Arthur Bernstein, Philip Lewis, Anestis
Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory
Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in
Windows Server Performance Monitoring
Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly
A3 Computer Architecture
A3 Computer Architecture Engineering Science 3rd year A3 Lectures Prof David Murray [email protected] www.robots.ox.ac.uk/ dwm/courses/3co Michaelmas 2000 1 / 1 6. Stacks, Subroutines, and Memory
Chapter 11 I/O Management and Disk Scheduling
Operatin g Systems: Internals and Design Principle s Chapter 11 I/O Management and Disk Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles An artifact can
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
Operating Systems. Virtual Memory
Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page
OC By Arsene Fansi T. POLIMI 2008 1
IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5
In-memory Tables Technology overview and solutions
In-memory Tables Technology overview and solutions My mainframe is my business. My business relies on MIPS. Verna Bartlett Head of Marketing Gary Weinhold Systems Analyst Agenda Introduction to in-memory
Dell Reliable Memory Technology
Dell Reliable Memory Technology Detecting and isolating memory errors THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS
CS 61C: Great Ideas in Computer Architecture Virtual Memory Cont.
CS 61C: Great Ideas in Computer Architecture Virtual Memory Cont. Instructors: Vladimir Stojanovic & Nicholas Weaver http://inst.eecs.berkeley.edu/~cs61c/ 1 Bare 5-Stage Pipeline Physical Address PC Inst.
18-548/15-548 Associativity 9/16/98. 7 Associativity. 18-548/15-548 Memory System Architecture Philip Koopman September 16, 1998
7 Associativity 18-548/15-548 Memory System Architecture Philip Koopman September 16, 1998 Required Reading: Cragon pg. 166-174 Assignments By next class read about data management policies: Cragon 2.2.4-2.2.6,
Java Virtual Machine: the key for accurated memory prefetching
Java Virtual Machine: the key for accurated memory prefetching Yolanda Becerra Jordi Garcia Toni Cortes Nacho Navarro Computer Architecture Department Universitat Politècnica de Catalunya Barcelona, Spain
what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, [email protected] Assistant Professor, Information
Homework # 2. Solutions. 4.1 What are the differences among sequential access, direct access, and random access?
ECE337 / CS341, Fall 2005 Introduction to Computer Architecture and Organization Instructor: Victor Manuel Murray Herrera Date assigned: 09/19/05, 05:00 PM Due back: 09/30/05, 8:00 AM Homework # 2 Solutions
Definition of RAID Levels
RAID The basic idea of RAID (Redundant Array of Independent Disks) is to combine multiple inexpensive disk drives into an array of disk drives to obtain performance, capacity and reliability that exceeds
Hardware/Software Co-Design of a Java Virtual Machine
Hardware/Software Co-Design of a Java Virtual Machine Kenneth B. Kent University of Victoria Dept. of Computer Science Victoria, British Columbia, Canada [email protected] Micaela Serra University of Victoria
How to handle Out-of-Memory issue
How to handle Out-of-Memory issue Overview Memory Usage Architecture Memory accumulation 32-bit application memory limitation Common Issues Encountered Too many cameras recording, or bitrate too high Too
Thread level parallelism
Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.
SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA [email protected], [email protected],
Timing of a Disk I/O Transfer
Disk Performance Parameters To read or write, the disk head must be positioned at the desired track and at the beginning of the desired sector Seek time Time it takes to position the head at the desired
Computer Organization and Components
Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides
CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING
CHAPTER 6 CROSS LAYER BASED MULTIPATH ROUTING FOR LOAD BALANCING 6.1 INTRODUCTION The technical challenges in WMNs are load balancing, optimal routing, fairness, network auto-configuration and mobility
Case Study: Load Testing and Tuning to Improve SharePoint Website Performance
Case Study: Load Testing and Tuning to Improve SharePoint Website Performance Abstract: Initial load tests revealed that the capacity of a customized Microsoft Office SharePoint Server (MOSS) website cluster
A Dell Technical White Paper Dell Compellent
The Architectural Advantages of Dell Compellent Automated Tiered Storage A Dell Technical White Paper Dell Compellent THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
Impact of Control Theory on QoS Adaptation in Distributed Middleware Systems
Impact of Control Theory on QoS Adaptation in Distributed Middleware Systems Baochun Li Electrical and Computer Engineering University of Toronto [email protected] Klara Nahrstedt Department of Computer
Key Components of WAN Optimization Controller Functionality
Key Components of WAN Optimization Controller Functionality Introduction and Goals One of the key challenges facing IT organizations relative to application and service delivery is ensuring that the applications
Concept of Cache in web proxies
Concept of Cache in web proxies Chan Kit Wai and Somasundaram Meiyappan 1. Introduction Caching is an effective performance enhancing technique that has been used in computer systems for decades. However,
VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5
Performance Study VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5 VMware VirtualCenter uses a database to store metadata on the state of a VMware Infrastructure environment.
Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems
1 Any modern computer system will incorporate (at least) two levels of storage: primary storage: typical capacity cost per MB $3. typical access time burst transfer rate?? secondary storage: typical capacity
Understanding Neo4j Scalability
Understanding Neo4j Scalability David Montag January 2013 Understanding Neo4j Scalability Scalability means different things to different people. Common traits associated include: 1. Redundancy in the
OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect
OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE Guillène Ribière, CEO, System Architect Problem Statement Low Performances on Hardware Accelerated Encryption: Max Measured 10MBps Expectations: 90 MBps
Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
The Design of the Inferno Virtual Machine. Introduction
The Design of the Inferno Virtual Machine Phil Winterbottom Rob Pike Bell Labs, Lucent Technologies {philw, rob}@plan9.bell-labs.com http://www.lucent.com/inferno Introduction Virtual Machine are topical
Optimizing Performance. Training Division New Delhi
Optimizing Performance Training Division New Delhi Performance tuning : Goals Minimize the response time for each query Maximize the throughput of the entire database server by minimizing network traffic,
Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan
Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan Abstract AES is an encryption algorithm which can be easily implemented on fine grain many core systems.
The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy
The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And
Garbage Collection in the Java HotSpot Virtual Machine
http://www.devx.com Printed from http://www.devx.com/java/article/21977/1954 Garbage Collection in the Java HotSpot Virtual Machine Gain a better understanding of how garbage collection in the Java HotSpot
The Big Picture. Cache Memory CSE 675.02. Memory Hierarchy (1/3) Disk
The Big Picture Cache Memory CSE 5.2 Computer Processor Memory (active) (passive) Control (where ( brain ) programs, Datapath data live ( brawn ) when running) Devices Input Output Keyboard, Mouse Disk,
Performance Testing. Slow data transfer rate may be inherent in hardware but can also result from software-related problems, such as:
Performance Testing Definition: Performance Testing Performance testing is the process of determining the speed or effectiveness of a computer, network, software program or device. This process can involve
WHITE PAPER Guide to 50% Faster VMs No Hardware Required
WHITE PAPER Guide to 50% Faster VMs No Hardware Required Think Faster. Visit us at Condusiv.com GUIDE TO 50% FASTER VMS NO HARDWARE REQUIRED 2 Executive Summary As much as everyone has bought into the
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies
Stress Testing Technologies for Citrix MetaFrame. Michael G. Norman, CEO December 5, 2001
Stress Testing Technologies for Citrix MetaFrame Michael G. Norman, CEO December 5, 2001 Scapa Technologies Contents Executive Summary... 1 Introduction... 1 Approaches to Stress Testing...1 Windows Applications...1
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
AES Power Attack Based on Induced Cache Miss and Countermeasure
AES Power Attack Based on Induced Cache Miss and Countermeasure Guido Bertoni, Vittorio Zaccaria STMicroelectronics, Advanced System Technology Agrate Brianza - Milano, Italy, {guido.bertoni, vittorio.zaccaria}@st.com
COMPUTERS ORGANIZATION 2ND YEAR COMPUTE SCIENCE MANAGEMENT ENGINEERING UNIT 5 INPUT/OUTPUT UNIT JOSÉ GARCÍA RODRÍGUEZ JOSÉ ANTONIO SERRA PÉREZ
COMPUTERS ORGANIZATION 2ND YEAR COMPUTE SCIENCE MANAGEMENT ENGINEERING UNIT 5 INPUT/OUTPUT UNIT JOSÉ GARCÍA RODRÍGUEZ JOSÉ ANTONIO SERRA PÉREZ Tema 5. Unidad de E/S 1 I/O Unit Index Introduction. I/O Problem
Switch Fabric Implementation Using Shared Memory
Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today
Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007
Application Note 195 ARM11 performance monitor unit Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007 Copyright 2007 ARM Limited. All rights reserved. Application Note
Testing Database Performance with HelperCore on Multi-Core Processors
Project Report on Testing Database Performance with HelperCore on Multi-Core Processors Submitted by Mayuresh P. Kunjir M.E. (CSA) Mahesh R. Bale M.E. (CSA) Under Guidance of Dr. T. Matthew Jacob Problem
Locality Based Protocol for MultiWriter Replication systems
Locality Based Protocol for MultiWriter Replication systems Lei Gao Department of Computer Science The University of Texas at Austin [email protected] One of the challenging problems in building replication
Sitecore Health. Christopher Wojciech. netzkern AG. [email protected]. Sitecore User Group Conference 2015
Sitecore Health Christopher Wojciech netzkern AG [email protected] Sitecore User Group Conference 2015 1 Hi, % Increase in Page Abondonment 40% 30% 20% 10% 0% 2 sec to 4 2 sec to 6 2 sec
What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus
Datorteknik F1 bild 1 What is a bus? Slow vehicle that many people ride together well, true... A bunch of wires... A is: a shared communication link a single set of wires used to connect multiple subsystems
HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD* *University of
An Implementation Of Multiprocessor Linux
An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than
Technical Note. Micron NAND Flash Controller via Xilinx Spartan -3 FPGA. Overview. TN-29-06: NAND Flash Controller on Spartan-3 Overview
Technical Note TN-29-06: NAND Flash Controller on Spartan-3 Overview Micron NAND Flash Controller via Xilinx Spartan -3 FPGA Overview As mobile product capabilities continue to expand, so does the demand
Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan
Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software
Using Synology SSD Technology to Enhance System Performance Synology Inc.
Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...
Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs)
Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs) 1. Foreword Magento is a PHP/Zend application which intensively uses the CPU. Since version 1.1.6, each new version includes some
Redpaper. Performance Metrics in TotalStorage Productivity Center Performance Reports. Introduction. Mary Lovelace
Redpaper Mary Lovelace Performance Metrics in TotalStorage Productivity Center Performance Reports Introduction This Redpaper contains the TotalStorage Productivity Center performance metrics that are
OPTIMIZING VIRTUAL TAPE PERFORMANCE: IMPROVING EFFICIENCY WITH DISK STORAGE SYSTEMS
W H I T E P A P E R OPTIMIZING VIRTUAL TAPE PERFORMANCE: IMPROVING EFFICIENCY WITH DISK STORAGE SYSTEMS By: David J. Cuddihy Principal Engineer Embedded Software Group June, 2007 155 CrossPoint Parkway
Overlapping Data Transfer With Application Execution on Clusters
Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm [email protected] [email protected] Department of Computer Science Department of Electrical and Computer
Operating Systems, 6 th ed. Test Bank Chapter 7
True / False Questions: Chapter 7 Memory Management 1. T / F In a multiprogramming system, main memory is divided into multiple sections: one for the operating system (resident monitor, kernel) and one
Technology Insight Series
Evaluating Storage Technologies for Virtual Server Environments Russ Fellows June, 2010 Technology Insight Series Evaluator Group Copyright 2010 Evaluator Group, Inc. All rights reserved Executive Summary
WHITE PAPER FUJITSU PRIMERGY SERVER BASICS OF DISK I/O PERFORMANCE
WHITE PAPER BASICS OF DISK I/O PERFORMANCE WHITE PAPER FUJITSU PRIMERGY SERVER BASICS OF DISK I/O PERFORMANCE This technical documentation is aimed at the persons responsible for the disk I/O performance
Data Memory Alternatives for Multiscalar Processors
Data Memory Alternatives for Multiscalar Processors Scott E. Breach, T. N. Vijaykumar, Sridhar Gopal, James E. Smith, Gurindar S. Sohi Computer Sciences Department University of Wisconsin-Madison 1210
OpenSPARC T1 Processor
OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected].
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University [email protected] Review Computers in mid 50 s Hardware was expensive
ZooKeeper. Table of contents
by Table of contents 1 ZooKeeper: A Distributed Coordination Service for Distributed Applications... 2 1.1 Design Goals...2 1.2 Data model and the hierarchical namespace...3 1.3 Nodes and ephemeral nodes...
CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of
CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis
SOS: Software-Based Out-of-Order Scheduling for High-Performance NAND Flash-Based SSDs
SOS: Software-Based Out-of-Order Scheduling for High-Performance NAND -Based SSDs Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim Department of Computer Science and Engineering, Seoul National University,
Cray: Enabling Real-Time Discovery in Big Data
Cray: Enabling Real-Time Discovery in Big Data Discovery is the process of gaining valuable insights into the world around us by recognizing previously unknown relationships between occurrences, objects
Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1
Introduction - 8.1 I/O Chapter 8 Disk Storage and Dependability 8.2 Buses and other connectors 8.4 I/O performance measures 8.6 Input / Ouput devices keyboard, mouse, printer, game controllers, hard drive,
AMD Opteron Quad-Core
AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced
DELL RAID PRIMER DELL PERC RAID CONTROLLERS. Joe H. Trickey III. Dell Storage RAID Product Marketing. John Seward. Dell Storage RAID Engineering
DELL RAID PRIMER DELL PERC RAID CONTROLLERS Joe H. Trickey III Dell Storage RAID Product Marketing John Seward Dell Storage RAID Engineering http://www.dell.com/content/topics/topic.aspx/global/products/pvaul/top
Computer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 11 Memory Management Computer Architecture Part 11 page 1 of 44 Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin
EE361: Digital Computer Organization Course Syllabus
EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)
A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems*
A Content-Based Load Balancing Algorithm for Metadata Servers in Cluster File Systems* Junho Jang, Saeyoung Han, Sungyong Park, and Jihoon Yang Department of Computer Science and Interdisciplinary Program
Central Processing Unit
Chapter 4 Central Processing Unit 1. CPU organization and operation flowchart 1.1. General concepts The primary function of the Central Processing Unit is to execute sequences of instructions representing
RAID Overview: Identifying What RAID Levels Best Meet Customer Needs. Diamond Series RAID Storage Array
ATTO Technology, Inc. Corporate Headquarters 155 Crosspoint Parkway Amherst, NY 14068 Phone: 716-691-1999 Fax: 716-691-9353 www.attotech.com [email protected] RAID Overview: Identifying What RAID Levels
Introduction. What is RAID? The Array and RAID Controller Concept. Click here to print this article. Re-Printed From SLCentral
Click here to print this article. Re-Printed From SLCentral RAID: An In-Depth Guide To RAID Technology Author: Tom Solinap Date Posted: January 24th, 2001 URL: http://www.slcentral.com/articles/01/1/raid
PRT-CTRL-SE. Protege System Controller Reference Manual
PRT-CTRL-SE Protege System Controller Reference Manual The specifications and descriptions of products and services contained in this document were correct at the time of printing. Integrated Control Technology
High Performance Tier Implementation Guideline
High Performance Tier Implementation Guideline A Dell Technical White Paper PowerVault MD32 and MD32i Storage Arrays THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS
Operating Systems 4 th Class
Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science
Computer Systems Structure Main Memory Organization
Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory
Multicore Programming with LabVIEW Technical Resource Guide
Multicore Programming with LabVIEW Technical Resource Guide 2 INTRODUCTORY TOPICS UNDERSTANDING PARALLEL HARDWARE: MULTIPROCESSORS, HYPERTHREADING, DUAL- CORE, MULTICORE AND FPGAS... 5 DIFFERENCES BETWEEN
Job Reference Guide. SLAMD Distributed Load Generation Engine. Version 1.8.2
Job Reference Guide SLAMD Distributed Load Generation Engine Version 1.8.2 June 2004 Contents 1. Introduction...3 2. The Utility Jobs...4 3. The LDAP Search Jobs...11 4. The LDAP Authentication Jobs...22
Parallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.
Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core
