High-speed image processing algorithms using MMX hardware
J. W. V. Miller and J. Wood
The University of Michigan-Dearborn

ABSTRACT

Low-cost PC-based machine vision systems have become more common due to faster processing capabilities and the availability of compatible high-speed image acquisition and processing hardware. One development that is likely to have a very favorable impact on this trend is the enhanced multimedia capability present in new processor chips such as the Intel MMX and Cyrix M2 processors. Special instructions are provided with this type of hardware which, combined with a SIMD parallel processing architecture, provide a substantial speed improvement over more traditional processors. Eight simultaneous byte or four simultaneous double-byte operations are possible. The new instructions are similar to those provided by DSP chips, such as multiply and accumulate, and are quite useful for linear processing operations like convolution. However, only four pixels may be processed simultaneously in such operations because of the limited dynamic range of byte data. Given the inherent limitations with respect to looping in SIMD hardware, nonlinear operations such as erosion and dilation would seem to be difficult to implement; however, special instructions are available for the required operations. Benchmarks for a number of image-processing operations are provided in the paper to illustrate the advantages of the new multimedia extensions for vision applications.

Keywords: machine vision, image processing algorithms, multimedia extensions, single instruction multiple data, performance gains

1. INTRODUCTION

Along with other recent advances in microprocessor developments for personal computers, the advent of a special instruction set for multimedia applications significantly decreases processing time for many machine vision algorithms. This instruction set was originally specified and developed by Intel, which designated it MMX technology [1]. Such processors are fully compatible with all earlier X86 software and are only modestly more expensive than processors without the new capabilities. At least two other companies, AMD and Cyrix, have developed compatible hardware with similar performance.

Applications targeted for the new hardware include motion video, combined graphics and video, image processing, audio synthesis, and speech synthesis and compression, among others. These applications are ideal for MMX because they all share the same fundamental characteristics, namely:

- Small integer data types (for example, 8-bit graphics pixels)
- Small, highly recurring loops
- Compute-intensive algorithms
- Highly parallel operations
- Algorithms compatible with a Single Instruction Multiple Data (SIMD) architecture

2. OVERVIEW OF THE NEW MULTIMEDIA ARCHITECTURE

An SIMD architecture is used to improve performance by processing multiple data elements in parallel. This feature, along with packed data types and special instructions, provides significantly faster processing.

2.1 New data types and registers

The principal data type of the MMX instruction set is the packed fixed-point integer, where multiple integer quantities are grouped into a single 64-bit entity. The MMX data types are:

- Packed byte: eight bytes packed into a single 64-bit quantity
- Packed word: four 16-bit words packed into a single 64-bit quantity
- Packed doubleword: two 32-bit doublewords packed into a single 64-bit quantity
- Quadword: one 64-bit quantity

The new architecture provides eight 64-bit general-purpose registers. These registers are aliases for the floating-point registers and are directly accessed using the register names mm0 to mm7.
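To make the packed-byte layout concrete, the short C fragment below (a minimal illustrative sketch, not part of the original paper) loads eight adjacent 8-bit pixels into a single 64-bit quantity, which is the view an MMX register takes of image data after a movq load; on a little-endian x86 machine the first pixel occupies the least-significant byte.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char pixels[8] = { 10, 200, 37, 255, 0, 128, 64, 99 }; /* eight 8-bit pixels */
        unsigned long long quad;    /* one 64-bit quadword, as an MMX register would hold it */

        memcpy(&quad, pixels, sizeof quad);  /* reinterpret the eight bytes as one packed value */
        printf("packed quadword: %016llx\n", quad);
        return 0;
    }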
2.2 New instructions

The MMX instruction set operates on all of the data types packed into a single 64-bit quantity to perform the following functions in parallel:

- Data transfer (move instructions) for MMX register-to-register and register-to-memory transfers
- Basic arithmetic operations (add, subtract, multiply, and multiply-add)
- Conversions between the data types (packing data together, and unpacking from small to large data types)
- Logical operations (and, and-not, or, xor)
- Shift operations
- Comparison operations

The combinations of the different packed integer data types with each of the functions above create the 57 new instruction op codes.

3. TRADITIONAL CODING FOR MACHINE VISION APPLICATIONS

Typical machine vision algorithms provide different types of image processing such as median filtering, morphology, and edge detection. Filtering applications process one or more input images and then produce one output image accordingly. Generally, the complete input image is stored in image buffers as either one byte per pixel (values 0-255) or two bytes per pixel (values 0-65535), depending upon the dynamic range required. Because every pixel (except in some cases border pixels) must be processed in sequence, image processing operations often require substantial amounts of time for completion. This often imposes substantial limitations on the amount of processing possible for real-time applications. If a given application requires additional processing, the classical approach would be to utilize expensive special-purpose hardware.

An implementation of a simple one-dimensional algorithm for edge detection using a digital derivative can be coded in C as follows:

    for (i = 0; i < nrow; i++)
        for (j = 0; j < npixel - 1; j++)
            outbuf[i][j] = inbuf[i][j] - inbuf[i][j+1];

The outer loop steps through the image buffer row by row, and the innermost loop steps through the image buffer pixel by pixel. This filter simply subtracts an adjacent pixel from the current pixel and stores the result in the output image buffer. Note that the result may be less than 0 or greater than 255. When this happens the value may be clipped to either 0 if the result is less than 0, or 255 if the result is greater than 255, assuming that the image buffers are of type unsigned char. The additional coding required to accomplish this is not shown here but will increase processing time (a sketch of such clipping code is given at the end of this section).

Morphological filters can be implemented using a similar approach, but in place of the subtraction operation, maximum and minimum operations are used instead. These filters perform the standard dilate, erode, open, and close operations for a variety of structuring elements. Other operations include image sums and differences, logical operators, general convolution, and median filtering. All of these processing techniques are candidates for the new multimedia architecture.
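For completeness, here is a minimal sketch of the clipping logic described above; it is not taken from the original paper, and it reuses the variable names of the fragment above. The subtraction is performed in a wider integer type so the test against the 0-255 range is well defined:

    for (i = 0; i < nrow; i++)
        for (j = 0; j < npixel - 1; j++) {
            int d = (int)inbuf[i][j] - (int)inbuf[i][j+1];  /* compute in a wider type */
            if (d < 0)   d = 0;     /* clip negative results to 0 */
            if (d > 255) d = 255;   /* clip overflow to 255 (cannot occur for a simple
                                       difference, but kept for generality, e.g. for sums) */
            outbuf[i][j] = (unsigned char)d;
        }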
4. SPEED IMPROVEMENTS WITH THE NEW ARCHITECTURE

Machine vision applications possess all the characteristics required to exploit the advantages of the new multimedia architecture. Substantial performance gains, if realized, will greatly increase the amount of processing that can be accomplished for real-time applications without the need for expensive special-purpose hardware.

4.1 Potential performance gains

From a casual review of the architecture and instruction set, dramatic performance gains would be expected. However, other factors besides raw processing power affect the actual improvements that are achieved, as will be discussed later. As a starting point, however, the following are not unreasonable expectations:

- 8-bits/pixel images can be processed up to eight times faster than with the original code by processing eight pixels in parallel.
- 16-bits/pixel images can see performance gains of as much as four times the original code by processing four pixels in parallel.

4.2 Image processing with MMX hardware

As previously noted, with traditional coding one pixel is processed at a time. For example, if a gray-scale image has a resolution of 640 x 480 pixels, then 307,200 iterations need to take place to process the entire image. With MMX hardware, however, only 38,400 iterations are needed since 8 pixels can be processed at a time, which should be considerably faster. In order to achieve this, however, the original filter code must be revised to use the MMX instruction set. Performance gains for specific filters such as morphology, noise reduction, and edge detection filters implementing MMX technology are shown near the end of this paper.

Only hardware and software that support MMX technology could be used to develop and test the new code. The hardware used for evaluation was a computer with 16 Mbytes of main memory, a 512K pipelined secondary cache, and an AMD K6 166 MHz processor. The original routines (with no MMX instructions, for comparison purposes) were written in C and compiled using the DOS port of GCC (DJGPP). Since MMX hardware is new, most high-level language compilers do not support it. This problem was overcome by writing the majority of the filter code in C and writing the portions that benefit from MMX (the innermost loop) in assembly code. The software currently being used is Microsoft Visual C/C++ V5.0 for compiling the C code and NASM V0.94 for assembling the MMX assembly code. Intel created a C/C++ plug-in for use with Microsoft Visual C/C++ that includes special MMX C type syntax; however, this software was not used because it did not add much benefit over standard assembly code. The following approach was used to implement the various functions:

- Code the MMX assembly routines as external functions callable from C.
- Remove the innermost pixel-by-pixel iteration loop and replace it with the equivalent MMX code.
- Optimize the MMX code to maximize processing speed.

4.3 Accessing assembly code functions

Only the segments of the filters that benefit from MMX use assembly code; the rest remain in C. The best way to combine C with assembly code is to call the MMX assembly code as an external function [2]. External assembly code functions act exactly the same as standard ANSI C functions except that the internal code is in assembly instead of C. The assembly code is assembled with NASM V0.94 and then linked with the C code to form the final executable.

4.4 Checking for the presence of MMX hardware

A check for the presence of MMX hardware was implemented to prevent aberrant operation if the hardware is not present. A special function was written to detect MMX hardware and run either straight C code or code that uses the MMX hardware accordingly; a minimal sketch of such a check is shown below.
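The detection function itself is not listed in the paper; the NASM fragment below is an illustrative sketch of one way to perform the check. With EAX set to 1, the CPUID instruction returns the processor feature flags in EDX, and bit 23 of EDX indicates MMX support. The sketch assumes the CPUID instruction itself is available, which is the case for all MMX-class processors.

    ;
    ; int mmx_present(void);
    ; Returns 1 in eax if the processor reports MMX support, 0 otherwise.
    ; (Illustrative sketch; not the detection function used in the paper.)
    ;
    _mmx_present:
            push ebx            ; cpuid overwrites ebx, which the C caller expects to be preserved
            mov eax, 1          ; request the feature-information leaf
            cpuid               ; feature flags are returned in edx
            mov eax, edx
            shr eax, 23         ; move the MMX flag (bit 23 of edx) into bit 0
            and eax, 1          ; return 0 or 1
            pop ebx
            ret

From C, the returned value can then be used to select between the conventional routine and its MMX counterpart at run time.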
4.5 Using MMX instructions in the innermost loop

The segments that benefit the most from MMX are the innermost filter loops. The original C algorithms process each pixel one at a time; however, MMX can process up to eight pixels at a time by eliminating the innermost loop and replacing it with MMX code. Note that the outer loop could also have been implemented with MMX. However, this approach was not taken for two reasons: first, it would not provide a significant performance increase, and second, the MMX functions become much more versatile when row pointers can be passed from C instead of setting up specific row pointers inside the assembly code. Here is a code fragment for a morphology filter that finds the maximum of three pixel values and places the result in the output buffer. Both the original C code and the equivalent MMX code are shown here:
Original C code algorithm:

    #define max(a,b) (((a) > (b)) ? (a) : (b))

    /* imbuf[1] is the output image buffer and imbuf[0] is the input image buffer */
    for (i = 1; i < nl[infile] - 1; i++)
        for (j = 1; j < nb[infile] - 1; j++)
            imbuf[1][i][j] = max(imbuf[0][i-1][j-1],
                                 max(imbuf[0][i][j], imbuf[0][i+1][j+1]));

Equivalent C code implementing MMX technology:

    int ncolsets = nb[infile] / 8;
    for (i = 1; i < nl[infile] - 1; i++) {
        unsigned char *InPtr0 = imbuf[0][i-1],     *InPtr1 = imbuf[0][i] + 1;
        unsigned char *InPtr2 = imbuf[0][i+1] + 2, *OutPtr = imbuf[1][i] + 1;
        dilate_mmx(InPtr0, InPtr1, InPtr2, OutPtr, ncolsets);
    }

Notice that only the outer for loop is required for the MMX implementation. This loop accesses each line as in the original code; however, its purpose is to set up line pointers for the MMX assembly code function (here called dilate_mmx). Since the input image buffer is of data type unsigned char, the MMX code can process eight pixels at a time. The line pointers are set up to point at the line and the first pixel to be processed by the MMX assembly code, which loads eight pixels starting from this first pointer position. Each of the eight pixels is processed independently of the other pixels. The assembly code function takes these three sets of eight pixel values, calculates the byte-wise maximum of the three sets, and stores the maximum values at the output buffer pointer (OutPtr). The variable nb[infile] represents the number of pixels per line in the input image. Since eight pixels are processed per iteration with MMX technology, the number of loop iterations is reduced by a factor of eight; dividing nb[infile] by eight determines the total number of eight-pixel groups that the assembly code should process.

Here is the actual code for the function dilate_mmx:

    ;
    ; dilate_mmx( unsigned char *InPtr0, unsigned char *InPtr1, unsigned char *InPtr2,
    ;             unsigned char *OutPtr, int ncolsets );
    ;
    _dilate_mmx:
            push ebp
            mov ebp, esp
            mov esi, [ebp + 24]     ; esi contains the value of ncolsets
            mov ebx, [ebp + 8]      ; ebx contains the InPtr0 pointer
            mov ecx, [ebp + 12]     ; ecx contains the InPtr1 pointer
            mov edx, [ebp + 16]     ; edx contains the InPtr2 pointer
            mov edi, [ebp + 20]     ; edi contains the OutPtr pointer
            movq mm6, [signoffset]  ; mm6 contains the packed value 128 for converting from unsigned to signed values

    dilate_loop:                    ; the innermost loop begins here to process all interior pixels
            movq mm0, [ebx]         ; move the current eight pixels from pointer InPtr0 into mm0
            add ebx, 8              ; advance the pointer by 8 to point at the next eight pixels
            movq mm1, [ecx]         ; move the InPtr1 eight pixels into mm1
            add ecx, 8              ; advance the pointer to point at the next eight pixels
            psubb mm0, mm6          ; subtract 128 from each pixel to convert to signed values for the PCMPGTB op-code
            psubb mm1, mm6
            movq mm4, mm0           ; copy mm0 to mm4
            pcmpgtb mm4, mm1        ; store the greater-than mask of InPtr0 and InPtr1 in mm4
            pand mm0, mm4           ; keep only the greater values in mm0
            pandn mm4, mm1          ; keep only the greater values in mm4
            por mm0, mm4            ; OR mm0 with mm4 to gather the greater values of both mm0 and mm4
            movq mm2, [edx]         ; move the InPtr2 eight pixels into mm2
            add edx, 8
            movq mm4, mm0
            psubb mm2, mm6
            pcmpgtb mm4, mm2        ; store the greater-than mask of the current maxima and InPtr2 in mm4
            pand mm0, mm4           ; keep only the greater values in mm0
            pandn mm4, mm2          ; keep only the greater values in mm4
            por mm0, mm4            ; now mm0 holds the greatest eight values of InPtr0, InPtr1, and InPtr2
            paddb mm0, mm6          ; add 128 to restore the original pixel range
            movq [edi], mm0         ; move the greater values into the output buffer pointer OutPtr
            add edi, 8              ; advance the output pointer by eight to point at the next eight pixels
            dec esi                 ; decrement the ncolsets variable; when ncolsets reaches zero the loop ends
            jg dilate_loop

            mov esp, ebp
            pop ebp
            ret

    signoffset dd 80808080h
               dd 80808080h

The variable signoffset is required because the MMX instruction PCMPGTB (Packed Compare Greater Than Byte) works only with signed values; since the pixel values are unsigned, they must be temporarily converted to signed values. This is done by subtracting 80h (128 in decimal) from each packed byte, thus creating a signed value; once all the calculations are complete, 80h is added back to restore the original values without the loss of any data.
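As a cross-check of this bias trick, the scalar C function below (an illustrative model, not part of the original paper) computes what each of the eight byte lanes of the code above computes for a single pair of pixels:

    /* Scalar model of one byte lane of dilate_mmx:
       bias to signed, compare, select the larger value, then remove the bias. */
    unsigned char byte_max(unsigned char a, unsigned char b)
    {
        signed char sa = (signed char)(a - 128);   /* psubb mm0, mm6 */
        signed char sb = (signed char)(b - 128);   /* psubb mm1, mm6 */
        signed char m  = (sa > sb) ? sa : sb;      /* pcmpgtb / pand / pandn / por */
        return (unsigned char)(m + 128);           /* paddb mm0, mm6 */
    }

Exchanging which input the mask selects (the pand and pandn operands) would give the corresponding byte-wise minimum, which is what an erosion routine would use.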
4.6 Optimizing MMX assembly code

High-level languages such as C automatically optimize their code to obtain maximum performance; assembly code, however, must be optimized by the programmer to achieve the same results. Pentium-type processors allow two instructions to be paired in each clock cycle [4], which means that for each clock cycle two instructions can be executed simultaneously. The two execution pipelines in the processor are the U and V pipelines. The U pipeline can execute any instruction, while only basic instructions may be executed in the V pipeline. To achieve the desired performance gains, much consideration should be given to pairing up instructions in the U and V pipelines. The general guidelines for pairing MMX instructions in the U and V pipelines are:

- MMX instructions that access either memory or the integer register file can be issued in the U-pipe only.
- The MMX destination register of the U-pipe instruction should not match the source or destination register of the V-pipe instruction (dependency check).
- Two MMX instructions that both use the MMX multiplier unit (pmull, pmulh, and pmadd type instructions) cannot pair, since there is only one MMX multiplier unit.
- Two MMX instructions that both use the MMX shifter unit (pack, unpack, and shift instructions) cannot pair, since there is only one MMX shifter unit. Shift operations may be issued in either the U-pipe or the V-pipe, but not in both in the same clock cycle.

4.7 Memory architecture considerations

MMX hardware would appear to impose special demands on computer memory. With classical implementations of image processing algorithms, one pixel at a time is processed, compared to eight with the new hardware. This means that much more data must be read out of memory in approximately the same time interval, and if the bandwidth is not available, stalls will occur which lessen the effectiveness of the MMX hardware. The amount of cache memory available has a very significant effect on memory bandwidth. The main memory, which uses DRAM, operates at a much slower rate than the CPU. Without cache memory, performance would be greatly degraded even when processing a single pixel at a time. Two caches are provided to improve performance. A large secondary cache is used to buffer the main memory. However, this cache only operates at around 60 MHz and is two to three times slower than the processor clock. With no additional hardware, this would almost completely negate the advantage of clock-multiplied CPU chips. By providing a small but very fast primary cache, good performance can be expected even with the demands from MMX hardware, as long as there are not too many primary cache misses. However, when processing large images that are too big for the primary cache, image data will have to be obtained from the secondary cache, with the potential for a significant performance loss.

5. PERFORMANCE RESULTS TO DATE

The available test results provide a number of surprises and inconsistencies. A variety of routines were evaluated using a 640 x 480 test image. A small 160 x 10 window of this image was also used to evaluate performance so that only the primary cache would be used, in order to evaluate the significance of primary cache misses with large images. The results were obtained under the following conditions:

1. Performance results were obtained on an AMD K6 166 MHz microprocessor computer with 16 Mbytes of memory and a 512K pipelined secondary cache. The primary instruction and data caches were both 32 Kbytes.
2. The operating system was Windows.
3. The computer was lightly loaded (no other significant tasks were running concurrently) during testing.
4. C routines were compiled using Microsoft Visual C/C++ V5.0 with compiler options set to produce Pentium code and optimization set to maximum speed.
5. MMX technology routines were assembled using NASM V0.94 under DJGPP.

5.1 Performance results with large images

Six different image processing routines were evaluated using both standard and MMX coding: a digital derivative, two different dilation functions, logical operations between two images, sums of 8 bit/pixel and 16 bit/pixel images, and the difference of two 8 bit/pixel images. (As an illustration of how simply one of these maps onto MMX, a sketch of the inner loop for the 8-bit image sum follows this paragraph.)
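The sketch below is not the routine actually benchmarked in the paper; it shows how the inner loop of a saturated sum of two 8-bit images might look in NASM, following the same calling pattern as dilate_mmx above. The MMX instruction PADDUSB adds eight unsigned bytes with saturation, so the clipping that required extra code in Section 3 comes for free.

    ;
    ; add8_mmx( unsigned char *InPtrA, unsigned char *InPtrB,
    ;           unsigned char *OutPtr, int ncolsets );
    ; Illustrative sketch of a saturated 8-bit image sum, not the paper's routine.
    ;
    _add8_mmx:
            push ebp
            mov ebp, esp
            push ebx                ; preserve the registers the C caller expects to survive
            push esi
            push edi
            mov ebx, [ebp + 8]      ; ebx contains the InPtrA pointer
            mov ecx, [ebp + 12]     ; ecx contains the InPtrB pointer
            mov edi, [ebp + 16]     ; edi contains the OutPtr pointer
            mov esi, [ebp + 20]     ; esi contains the value of ncolsets

    add8_loop:
            movq mm0, [ebx]         ; load eight pixels from the first image
            paddusb mm0, [ecx]      ; add eight pixels from the second image, saturating at 255
            movq [edi], mm0         ; store the eight result pixels
            add ebx, 8              ; advance all three pointers to the next eight pixels
            add ecx, 8
            add edi, 8
            dec esi                 ; one fewer eight-pixel group left to process
            jg add8_loop

            emms                    ; clear the MMX state so later floating-point code is unaffected
            pop edi
            pop esi
            pop ebx
            mov esp, ebp
            pop ebp
            ret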
Processing rates are impressive even without using the MMX instructions. However, significant gains are apparent when the multimedia hardware is used, as given in Table 1. Improvement factors ranged from 1.55 to 4.83 and were generally best when using 8 bit/pixel data, as expected. The most significant improvements can be noted in routines that manipulate data in more complex fashions.
Filter description               Conditions       Conventional    MMX            Performance
                                                  frames/sec.     frames/sec.    gain
digital derivative (edges)       1,000 passes
rectangle (morphology)           100 iterations
diagonals (morphology)           100 iterations
logical (AND, OR, XOR)           1,000 passes
sum of two 8-bit images          1,000 passes
sum of two 16-bit images         1,000 passes
difference of two 8-bit images   1,000 passes

Table 1: Processing rates for 640 x 480 resolution images

5.2 Performance results with small images

By reducing the image size sufficiently so that the images can be stored in the primary cache, processing speed increases should be expected. In addition, greater improvements should be expected for MMX processing because of its higher memory bandwidth requirements. These expectations are generally verified, as shown in Table 2. Even the conventionally coded routines are faster (on a per-pixel basis) for the small image sizes, but the MMX routines benefit to a substantially greater degree.

Filter description               Conditions        Conventional    MMX            Performance
                                                   frames/sec.     frames/sec.    gain
digital derivative (edges)       300,000 passes
rectangles (morphology)          100,000 passes
diagonals (morphology)           100,000 passes
logical (AND, OR, XOR)           300,000 passes
sum of two 8-bit images          300,000 passes
sum of two 16-bit images         300,000 passes
difference of two 8-bit images   300,000 passes

Table 2: Processing rates using 160 x 10 resolution input image(s)

5.3 A simple coding strategy for a small primary cache

While using small images provides a significant performance boost, this is clearly not a valid approach for most applications since image sizes are generally much larger. The best approach is to improve the memory bandwidth, for example with a larger primary cache. As noted previously, as the image data from a large image is processed it will be continually flushed out of the primary cache with a corresponding loss of processing speed. However, there is at least one approach that can be employed for some algorithms to alleviate this problem, although it will cause an increase in coding complexity. The basic idea is to process image data on a line-by-line basis in which several different algorithms process a given line of pixels consecutively. Since relatively few pixels are being processed at one time, they will remain in the primary cache for rapid access (a sketch of this strategy is given below).
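A minimal C sketch of this line-by-line strategy is shown below; it is not code from the paper, and the two per-line operations (a one-dimensional derivative and a threshold) are chosen only for illustration. Because each pass touches a single line plus one small scratch buffer, the working set stays within the primary cache between steps.

    /* One-dimensional digital derivative along a row, clipped at zero (see Section 3). */
    static void derivative_line(const unsigned char *in, unsigned char *out, int npixel)
    {
        int j;
        for (j = 0; j < npixel - 1; j++) {
            int d = (int)in[j] - (int)in[j + 1];
            out[j] = (unsigned char)(d < 0 ? 0 : d);
        }
        out[npixel - 1] = 0;                    /* border pixel */
    }

    /* Threshold a row of pixels to a binary (0/255) result. */
    static void threshold_line(const unsigned char *in, unsigned char *out,
                               int npixel, unsigned char t)
    {
        int j;
        for (j = 0; j < npixel; j++)
            out[j] = (unsigned char)((in[j] > t) ? 255 : 0);
    }

    /* Apply both operations to each line in turn; the line and the scratch buffer
       remain resident in the primary cache between the two steps. */
    void process_image(unsigned char **in, unsigned char **out, int nrow, int npixel)
    {
        static unsigned char scratch[4096];     /* assumes npixel <= 4096 */
        int i;
        for (i = 0; i < nrow; i++) {
            derivative_line(in[i], scratch, npixel);
            threshold_line(scratch, out[i], npixel, 40);
        }
    }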
5.4 The Intel Image Processing Library

Intel has created an image processing library that contains functions similar to those described in this paper, including arithmetic, logical, and morphological operations [6]. The library is designed to take full advantage of multimedia hardware when it is available. Intel has released performance gain results, with and without MMX hardware, that are similar to the results included in this paper [7]. Additional comparisons with the Intel image processing library are planned to verify the performance gain results obtained here.

6. CONCLUSIONS

Clearly the new multimedia hardware provides dramatic image processing capabilities for machine vision applications without the need for expensive custom hardware. Many additional real-time vision applications can be addressed with the MMX architecture in a very inexpensive manner. While the speed increases have not been as great as anticipated, they are still very significant and offer the vision engineer the ability to address many applications that previously were too constrained by cost.

ACKNOWLEDGEMENTS

This project was supported in part by a National Science Foundation Research Experiences for Undergraduates grant and by funding from the Center for Engineering Education and Practice at the University of Michigan-Dearborn.

REFERENCES

Norden, Robert C., Pentium Processors - Explained!, http://