CSEE W4824 Computer Architecture Fall 2012
Lecture 2: Performance Metrics and Quantitative Principles of Computer Design
Luca Carloni, Department of Computer Science, Columbia University in the City of New York

Announcements: CS Distinguished Lecture
Wed, Oct. 12th, 11:00 am, Davis Auditorium
"What Should a Well-informed Person Know about Computers?"
Brian Kernighan (Princeton Univ.)
- His book with Dennis Ritchie, the creator of the C programming language, is considered the bible of C
- At Bell Labs he contributed to the development of Unix, working with the Unix creators K. Thompson and D. Ritchie
- He is also a coauthor of the widely used AWK and AMPL programming languages, and of the EQN and PIC typesetting languages
- In collaboration with Shen Lin he devised well-known heuristics for two important NP-complete optimization problems:
  - graph partitioning
  - travelling salesman problem

Computer Architects and Quantitative Approach
Design ideas and trade-offs are tested with tools that estimate their impact on performance, power, and cost (an iterative process):
- analytical reasoning and fundamental design principles
  - equations for basic metrics: cost, performance, power
- simulations at various levels
  - system level, ISA, micro-architecture, memory, RTL, gate, circuit level
- benchmark programs representing typical workloads

How to Define Performance?
Airplane | Passenger Capacity | Cruising Range (miles) | Cruising Speed (m.p.h.) | Passenger Throughput (passengers x m.p.h.)
Boeing ,750
Boeing ,700
Concorde ,200
Douglas DC ,424

Two Key Performance Metrics
- Time to run the task: execution time, response time, elapsed time, latency
- Tasks per time unit: execution rate, bandwidth, throughput

Airplane | DC to Paris | Speed | Passengers | Throughput (passengers x mph)
Boeing | hours | 610 mph | | ,700
Concorde | 3 hours | 1350 mph | | ,200

Latency vs. Throughput
Latency
- real time necessary to complete a task
- important when the focus is on a single task
  - a computer user who is working with a single application
  - a critical task of a real-time embedded system
Throughput (aka Bandwidth)
- number of tasks completed per unit of time
  - a metric independent from the exact number of executed tasks
- important when the focus is on running many tasks
  - a manager of a large data-processing center is interested in the total amount of work done in a given time

Latency lags Bandwidth
- Bandwidth has outpaced latency across the main computer technologies
- There is an old network saying: "Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed: you can't bribe God." [Anonymous]

Latency and Throughput: The Classic 5-Stage Pipeline
- Pipelining increases the instruction throughput (number of instructions completed per unit of time)
- but it does not reduce (in fact, it usually slightly increases) the execution time of an individual instruction
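
The pipelining remark above can be made concrete with a small numerical sketch. The stage delays and the latch overhead below are hypothetical values chosen for illustration, not figures from the lecture; the point is only that the per-instruction latency grows slightly while the steady-state throughput grows a lot.

```python
# Hypothetical stage delays (ns) for a 5-stage pipeline: IF, ID, EX, MEM, WB.
STAGE_DELAYS_NS = [1.0, 0.8, 1.0, 1.0, 0.8]
LATCH_OVERHEAD_NS = 0.1  # assumed pipeline-register overhead per stage


def unpipelined_latency(stage_delays):
    """Without pipelining, an instruction passes through all stages back to back."""
    return sum(stage_delays)


def pipelined_cycle(stage_delays, overhead):
    """The clock period must fit the slowest stage plus the latch overhead."""
    return max(stage_delays) + overhead


def pipelined_latency(stage_delays, overhead):
    """Each instruction still visits every stage, spending one clock cycle in each."""
    return len(stage_delays) * pipelined_cycle(stage_delays, overhead)


lat_unpiped = unpipelined_latency(STAGE_DELAYS_NS)
cycle = pipelined_cycle(STAGE_DELAYS_NS, LATCH_OVERHEAD_NS)
lat_piped = pipelined_latency(STAGE_DELAYS_NS, LATCH_OVERHEAD_NS)

# Steady-state throughput in instructions per ns:
# unpipelined completes one instruction per full latency,
# pipelined completes one instruction per clock cycle.
thr_unpiped = 1.0 / lat_unpiped
thr_piped = 1.0 / cycle

print(f"latency:    {lat_unpiped:.2f} ns unpipelined vs {lat_piped:.2f} ns pipelined")
print(f"throughput: {thr_unpiped:.3f} vs {thr_piped:.3f} instructions/ns")
```

The model ignores hazards and stalls, so the pipelined throughput it reports is an upper bound under these assumptions.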

Performance Metrics
Machine X is n times faster than machine Y:

$n = \frac{\text{executiontime}(Y)}{\text{executiontime}(X)} = \frac{\text{performance}(X)}{\text{performance}(Y)}$

- Performance and execution time are reciprocal
  - improve performance = increase performance
  - improve execution time = decrease execution time
- Example: executiontime(Y) = 4.8, executiontime(X) = 3.6, so n = 1.33, i.e. X is 33% faster than Y

Make the Common Case Fast
- the most important, pervasive, and simple principle of computer design
  - in making a design trade-off, favor the frequent case rather than the infrequent case
  - when determining how to allocate resources, favor the frequent event rather than the rare event
  - when optimizing the design of a module, target the average functional behavior
  - besides, the frequent case is often simpler
1. How to determine what the frequent case is?
2. How to determine the amount of the possible performance gain in making the frequent case faster?
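
Returning to the "n times faster" definition from the Performance Metrics slide, the example with execution times 4.8 and 3.6 can be checked with a few lines of Python. The function name `times_faster` is a hypothetical helper introduced here for illustration.

```python
def times_faster(exec_time_y, exec_time_x):
    """Return n such that machine X is n times faster than machine Y."""
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Values from the slide's example (time units are not given in the source).
n = times_faster(exec_time_y=4.8, exec_time_x=3.6)
percent_faster = (n - 1.0) * 100.0

print(f"n = {n:.2f}, i.e. X is about {percent_faster:.0f}% faster than Y")
```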

Simulation and Simulation Levels
- ISA (functional) simulator
  - execute program and get ISA-level statistics
  - frequency of instructions
- Memory simulator
  - ISA simulator is run together with a model of the memory system
  - get cache hit/miss rates, study memory hierarchy options
- Full performance simulator
  - adds a detailed performance model to a functional simulator
  - model all interactions, stalls, (mis)speculations
  - generate accurate statistics

Simulation Tradeoffs
- ISA simulator
  - 10x slower than the real processor
  - x faster than a detailed performance simulator
- Key points
  - use the right level of simulation to answer a specific question
    - e.g., ISA simulator to get instruction mix statistics
  - use fast, idealized models for non-critical components
    - e.g., assume a perfect main memory for applications that present an optimal cache hit ratio
  - simulation is a powerful tool for architectural explorations, but analytical reasoning should always be applied before starting long simulations
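
As an illustration of the kind of ISA-level statistic mentioned above, the following sketch computes an instruction mix from an execution trace. The trace, the opcode names, and the class mapping are all hypothetical; a real functional simulator would produce the trace while executing a program.

```python
from collections import Counter

# Hypothetical dynamic instruction trace, as a functional simulator might emit it.
TRACE = ["load", "add", "add", "store", "branch",
         "add", "load", "sub", "branch", "add"]

# Assumed mapping from opcode to instruction class.
CLASS_OF = {
    "add": "ALU",
    "sub": "ALU",
    "load": "Load",
    "store": "Store",
    "branch": "Branch",
}


def instruction_mix(trace):
    """Return the relative frequency of each instruction class in the trace."""
    counts = Counter(CLASS_OF[op] for op in trace)
    total = sum(counts.values())
    return {cls: count / total for cls, count in counts.items()}


mix = instruction_mix(TRACE)
for cls, freq in sorted(mix.items()):
    print(f"{cls:<6} {freq:.0%}")
```

No timing model is needed for this question, which is why the fast functional level of simulation is sufficient here.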

Benchmark Suites
- Sets of programs to simulate typical workloads
- Several types
  - real software applications (GCC, Word, ...)
    - most accurate but typically longer to process
    - portability problems (OS/compiler dependencies), GUI
  - kernels (Livermore Loops, Linpack, ...)
    - small, key pieces taken from real programs
    - limited picture, but good to isolate the performance of individual features of a machine
  - synthetic benchmarks (Whetstone, Dhrystone, ...)
    - try to match the average frequency of operations on operands of a real program
    - may easily mislead compiler and hardware designers

Amdahl's Law
What is the overall speedup after improving a component x of a system?

$\text{speedup} = \frac{\text{original execution time}}{\text{new execution time}} = \frac{\text{new performance}}{\text{original performance}}$

If component x is improved by $S_x$ and component x affects a fraction $F_x$ of the overall execution time, then

$\text{speedup} = \frac{1}{(1 - F_x) + \frac{F_x}{S_x}}$

where $(1 - F_x)$ is the original execution time of the unimproved part and $F_x / S_x$ is the new execution time of the improved part.

Amdahl's Law - Example

$\text{speedup} = \frac{1}{(1 - F_x) + \frac{F_x}{S_x}}$

If we optimize the module for the floating-point instructions by a factor of 2, but the system normally runs programs with only 20% floating-point instructions, then the speedup is only

$\text{speedup} = \frac{1}{(1 - 0.2) + \frac{0.2}{2}} = \frac{1}{0.9} \approx 1.11$

Amdahl's Law - Example
If $S_x = 100$, what is the overall speedup as a function of $F_x$?
[Plot: Speedup vs. Optimized Fraction]
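
Both Amdahl's Law examples can be reproduced numerically. The sketch below is a minimal implementation of the speedup formula; the function name `amdahl_speedup` and the choice of sample points for the sweep are assumptions made here for illustration.

```python
def amdahl_speedup(fx, sx):
    """Overall speedup when a fraction fx of execution time is sped up by sx."""
    if not 0.0 <= fx <= 1.0:
        raise ValueError("fx must be a fraction between 0 and 1")
    if sx <= 0.0:
        raise ValueError("sx must be positive")
    return 1.0 / ((1.0 - fx) + fx / sx)


# Floating-point example from the slide: Fx = 20%, Sx = 2.
fp_example = amdahl_speedup(fx=0.2, sx=2.0)
print(f"Fx = 0.2, Sx = 2   -> speedup = {fp_example:.3f}")

# Second example: Sx = 100, speedup as a function of Fx (sample points assumed).
sweep = [(step / 10, amdahl_speedup(step / 10, 100.0)) for step in range(11)]
for fx, speedup in sweep:
    print(f"Fx = {fx:.1f}, Sx = 100 -> speedup = {speedup:.2f}")
```

With Sx = 100, the overall speedup stays below 2 until more than half of the execution time is affected by the improvement, which is the shape the missing plot would show.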

Amdahl's Law and the Law of Diminishing Returns
- the closer $F_x$ is to 1, the closer the overall speedup is to $S_x$, i.e. [make common case fast]
- however, as $S_x \to \infty$, speedup $\to \frac{1}{1 - F_x}$
  - i.e., once $F_x / S_x$ is small with respect to $(1 - F_x)$, the price/performance ratio falls rapidly as $S_x$ is increased
- the incremental improvement in speedup gained by an additional improvement in the performance of just a portion of the computation diminishes as improvements are added

Amdahl's Law - Reference
Gene Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS '67

Amdahl's Law - special case of parallelization
- if F is the fraction of a calculation that can be parallelized and (1 - F) is the fraction that is sequential (i.e. cannot benefit from parallelization), then Amdahl's Law gives the maximum speedup that can be achieved by using N processors as

$\text{speedup} = \frac{1}{(1 - F) + \frac{F}{N}}$

- Example: if F is only 90%, the calculation can be sped up by at most a factor of 10, no matter how many processors are used
- the key to parallel computing is to increase F
- but there is also Gustafson's Law
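
The parallel form of Amdahl's Law and its limit can be explored with the following sketch. The function names and the list of processor counts are assumptions chosen for illustration; only F = 90% comes from the slide.

```python
def parallel_speedup(f, n):
    """Speedup with n processors when a fraction f of the work is parallelizable."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


def speedup_limit(f):
    """Upper bound on the speedup as the number of processors grows without bound."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


F = 0.9  # parallelizable fraction from the slide's example
limit = speedup_limit(F)

for n in (1, 2, 10, 100, 1000, 1_000_000):
    print(f"N = {n:>9} -> speedup = {parallel_speedup(F, n):.3f}")
print(f"limit for F = {F}: {limit:.1f}")
```

Each tenfold increase in N buys less additional speedup than the previous one, which is the diminishing-returns behavior described above.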

Principle of Locality
- Temporal Locality
  - a resource that is referenced at one point in time will be referenced again sometime in the near future
- Spatial Locality
  - the likelihood of referencing a resource is higher if a resource near it was just referenced
- 90/10 Locality Rule of Thumb
  - a program spends 90% of its execution time in only 10% of its code
  - hence, it is possible to predict with reasonable accuracy what instructions and data a program will use in the near future based on its accesses in the recent past
  - this is a consequence of how we program and how we store the data in memory

Principle of Locality - Example
- Cache memory directly exploits temporal locality, providing faster access to a smaller subset of the main memory that contains copies of recently used data
- but the data in the cache are not necessarily spatially close in the main memory
- still, when a cache miss occurs, a fixed-size block of contiguous memory cells is retrieved from the main memory, based on the principle of spatial locality
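
To see how block-based fetching rewards locality, consider this minimal sketch of a direct-mapped cache that only counts misses. The cache geometry (4 lines of 8 words) and the three access patterns are hypothetical, and the direct-mapped organization is an assumption made here for simplicity, not something the slides specify.

```python
def count_misses(addresses, num_lines, block_size):
    """Count misses in a direct-mapped cache for a sequence of word addresses."""
    lines = [None] * num_lines  # each entry holds the memory block cached there
    misses = 0
    for addr in addresses:
        block = addr // block_size   # contiguous words share one memory block
        index = block % num_lines    # each block maps to exactly one cache line
        if lines[index] != block:
            misses += 1
            lines[index] = block     # fetch the whole block on a miss
    return misses


NUM_LINES, BLOCK_SIZE = 4, 8  # assumed cache geometry

sequential = list(range(64))            # spatial locality: walk through memory
repeated = [0, 1, 2, 3] * 16            # temporal locality: reuse the same words
strided = [i * 8 for i in range(64)]    # no locality: every access hits a new block

seq_misses = count_misses(sequential, NUM_LINES, BLOCK_SIZE)
rep_misses = count_misses(repeated, NUM_LINES, BLOCK_SIZE)
str_misses = count_misses(strided, NUM_LINES, BLOCK_SIZE)

for name, m in (("sequential", seq_misses), ("repeated", rep_misses), ("strided", str_misses)):
    print(f"{name:<10} {m:>2} misses out of 64 accesses")
```

All three patterns issue 64 accesses, so the difference in miss counts comes entirely from how well each pattern matches the block and reuse behavior of the cache.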

CPU Time
- user CPU Time: time spent in the user program
- system CPU Time: time spent in the OS performing tasks required by the program
  - harder to measure and to compare across architectures
- CPU performance = user CPU time on an unloaded system

CPU Time = (Clock Cycles for a Program) x (Clock Cycle Time)
         = (Clock Cycles for a Program) / (Clock Frequency)

- most computers run with a single clock signal (strictly synchronous design) whose discrete time events are called cycles, periods, or ticks
- a µP with a 1 ns clock period runs at a clock frequency of 1 GHz

CPU Time: Three Main Factors
CPU Time = (Clock Cycles for a Program) x CCT, where CCT is the clock cycle time
- IC = instruction count
  - number of instructions executed for a program
- CPI = clock cycles per instruction = (Clock Cycles for a Program) / IC
  - average number of clock cycles per instruction of a program
  - its reciprocal is IPC = instructions per clock cycle

CPU Time = IC x CPI x CCT

- CPU Time depends equally on these three factors
  - a 10% improvement in any of these leads to a 10% improvement in CPU time
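
The CPU time equation and the "10% improvement" claim can be checked with a short sketch. The baseline values for IC and CPI are hypothetical; the 1 ns clock cycle time matches the 1 GHz example above.

```python
def cpu_time_seconds(ic, cpi, cct_s):
    """CPU Time = IC x CPI x CCT, with the clock cycle time given in seconds."""
    return ic * cpi * cct_s


def clock_frequency_hz(cct_s):
    """Clock frequency is the reciprocal of the clock cycle time."""
    return 1.0 / cct_s


# Hypothetical baseline: one billion instructions, CPI of 2, 1 ns clock period.
IC, CPI, CCT_S = 1_000_000_000, 2.0, 1e-9

base = cpu_time_seconds(IC, CPI, CCT_S)
freq = clock_frequency_hz(CCT_S)

# Reduce each factor by 10% in turn while holding the other two fixed.
improved = {
    "IC": cpu_time_seconds(IC * 0.9, CPI, CCT_S),
    "CPI": cpu_time_seconds(IC, CPI * 0.9, CCT_S),
    "CCT": cpu_time_seconds(IC, CPI, CCT_S * 0.9),
}

print(f"baseline: {base:.3f} s at {freq / 1e9:.1f} GHz")
for factor, t in improved.items():
    print(f"10% lower {factor:<3}: {t:.3f} s ({(1 - t / base):.0%} less CPU time)")
```

In practice the factors are not fully independent, so changing one may shift another; the sketch holds the other two fixed to isolate each effect.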

CPU Time - Dependencies
CPU Time = IC x CPI x CCT
[Diagram: Program, Compiler, ISA, HW organization, and HW technology mapped onto the factors IC, CPI, and CCT]
- some interdependencies, but many techniques improve a single factor

Improving Performance by Exploiting Parallelism
- at the system level
  - use multiple processors, multiple disks
  - scalability is key to adaptively distribute workload in server apps
- at the single microprocessor level
  - exploit instruction level parallelism (ILP)
  - e.g., pipelining overlaps the execution of instructions to reduce the overall program CPU Time
    - reduces CPI by overlapping instructions in time
    - possible because many subsequent instructions are independent
  - e.g., parallel computation
    - reduces CPI by overlapping instructions in space
    - duplicate hardware modules such as ALUs
- at the circuit level
  - carry-lookahead adders speed up sums from linear to logarithmic

CPU Time broken down per instruction
CPU Time = IC x CPI x CCT

$\text{CPU Time} = \sum_i (IC_i \times CPI_i) \times CCT$

$CPI = \frac{\sum_i (IC_i \times CPI_i)}{IC} = \sum_i (IF_i \times CPI_i)$, where $IF_i = IC_i / IC$ is the frequency of instruction class i

- frequent instructions have larger contributions to CPI
- CPI should be measured to include pipeline/memory effects
  - it is not sufficient to calculate it from the reference manual table
- NOTE: it is OK to compare two designs based on CPI (or IPC) alone only if IC and CCT are the same!

Example: Average Instruction Execution Time
Assuming a simple un-pipelined processor with CCT = 2 ns

Operation | IF_i | CPI_i | IF_i x CPI_i | (% Time)
ALU
Load
Store
Branch

$CPI = \sum_i (IF_i \times CPI_i) = 4.3$
Average instruction execution time = CPI x CCT = 8.6 ns
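
The weighted-average CPI calculation can be sketched as follows. The per-class frequencies and cycle counts below are hypothetical placeholders, since the table values did not survive in this transcript; they are not the lecture's numbers and do not reproduce its CPI of 4.3. Only CCT = 2 ns is taken from the example.

```python
CCT_NS = 2.0  # clock cycle time from the example above

# Hypothetical instruction mix: class -> (frequency IF_i, cycles CPI_i).
MIX = {
    "ALU": (0.5, 1),
    "Load": (0.2, 5),
    "Store": (0.1, 3),
    "Branch": (0.2, 2),
}


def average_cpi(mix):
    """CPI = sum over classes of IF_i x CPI_i; frequencies must sum to 1."""
    total_freq = sum(freq for freq, _ in mix.values())
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    cpi = average_cpi(mix)
    return {cls: freq * cycles / cpi for cls, (freq, cycles) in mix.items()}


cpi = average_cpi(MIX)
avg_instruction_time_ns = cpi * CCT_NS
shares = time_share(MIX)

print(f"CPI = {cpi:.2f}, average instruction time = {avg_instruction_time_ns:.2f} ns")
for cls, share in shares.items():
    print(f"{cls:<6} {share:.0%} of execution time")
```

With these placeholder numbers, loads account for the largest share of time even though ALU operations are the most frequent, which illustrates why both frequency and cycle count matter.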

Example: Speedup From 5-stage Pipelining
- Assumption: after pipelining, the slowest stage forces an effective clock period equal to (CCT + clock overhead) = (2 + 0.2) ns = 2.2 ns
- Question: what is the speedup from pipelining?

$\text{speedup} = \frac{(\text{Average Instruction Time})_{\text{unpipelined}}}{(\text{Average Instruction Time})_{\text{pipelined}}} = \frac{8.6}{2.2} = 3.9$

Another Key Metric: Power Dissipation
- Energy: measured in Joules
- Power: rate of energy consumption [Watts = Joules/sec]
- instantaneous power P = V * I: the voltage drop across a component times the current flowing through it
- Example
  - system A: higher peak power, lower total energy
  - system B: lower peak power, higher total energy
[Source: K. Asanovic, MIT]
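
The pipelining speedup example can be reproduced step by step. The sketch assumes an ideal pipelined CPI of 1 (one instruction completes per clock cycle), which is what the 2.2 ns average instruction time implies; the 0.2 ns clock overhead is inferred here from that 2.2 ns figure and the 2 ns CCT.

```python
CPI_UNPIPELINED = 4.3     # from the average instruction execution time example
CCT_NS = 2.0              # clock cycle time of the un-pipelined processor
CLOCK_OVERHEAD_NS = 0.2   # inferred: 2.2 ns effective period minus 2 ns CCT
CPI_PIPELINED = 1.0       # assumption: ideal pipeline, no stalls

# Un-pipelined: each instruction takes CPI cycles of CCT each.
avg_time_unpipelined = CPI_UNPIPELINED * CCT_NS

# Pipelined: one instruction completes per (slightly longer) clock cycle.
avg_time_pipelined = CPI_PIPELINED * (CCT_NS + CLOCK_OVERHEAD_NS)

speedup = avg_time_unpipelined / avg_time_pipelined

print(f"unpipelined: {avg_time_unpipelined:.1f} ns per instruction")
print(f"pipelined:   {avg_time_pipelined:.1f} ns per instruction")
print(f"speedup:     {speedup:.1f}")
```

The speedup stays below 5 on a 5-stage pipeline because the un-pipelined CPI is 4.3 rather than 5 and the clock overhead lengthens each cycle.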

Power Consumption of CMOS Transistors
- Dynamic Power
  - traditionally the dominant component
  - dissipated when a transistor switches (i.e. data dependent)
- Static Power
  - becoming more important with transistor scaling
  - due to leakage current that flows even if there is no switching activity
  - proportional to the number of transistors on the chip
- Challenges
  - power is the key limitation to chip design
  - distribute power on-chip
  - remove heat
  - prevent hot spots
  - low power design (clock gating, DVFS)

Example: Dynamic Power Consumption
- Assume a 0.25 µm CMOS chip with a voltage supply Vdd = 2.5 V, a clock frequency F = 500 MHz, and an average load capacitance of CL = 15 fF/gate (assuming a fan-out of 4)
- What is the power consumption per gate? Approximately, Pavg = 50 µW
- For a design with 1 million gates, assuming that a transition occurs at every clock edge, this would result in an average power consumption of ~50 W!
- In reality, not all gates on the chip switch at the full rate of 500 MHz. The actual activity is substantially lower and it is estimated by the switching capacitance
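
The per-gate and per-chip numbers in the dynamic power example follow from the usual switching-power relation P = C x V^2 x F. The sketch below evaluates it with the values from the slide; the `activity` parameter and the value 0.1 used for it are assumptions added here to illustrate the remark that real switching activity is much lower.

```python
def dynamic_power_per_gate(c_load_f, vdd_v, freq_hz, activity=1.0):
    """Dynamic power in watts: activity x C x Vdd^2 x F.

    activity is the assumed fraction of clock cycles in which the gate switches.
    """
    return activity * c_load_f * vdd_v ** 2 * freq_hz


C_LOAD_F = 15e-15   # 15 fF per gate
VDD_V = 2.5         # supply voltage
FREQ_HZ = 500e6     # 500 MHz clock
NUM_GATES = 1_000_000

p_gate = dynamic_power_per_gate(C_LOAD_F, VDD_V, FREQ_HZ)
p_chip = p_gate * NUM_GATES

# Hypothetical activity factor: each gate switches in 10% of the cycles.
p_chip_realistic = dynamic_power_per_gate(C_LOAD_F, VDD_V, FREQ_HZ, activity=0.1) * NUM_GATES

print(f"per gate: {p_gate * 1e6:.1f} uW (the slide rounds this to ~50 uW)")
print(f"1M gates, switching every cycle: {p_chip:.1f} W")
print(f"1M gates, assumed activity 0.1:  {p_chip_realistic:.2f} W")
```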

Dynamic Voltage Frequency Scaling
- DVFS is a low-power design technique that is becoming pervasive in modern processors
- Example: if the voltage and frequency of a processing core are both reduced by 15%, what would be the impact on dynamic power?

$\text{Power Save} = \frac{P_{new}}{P_{old}} = \frac{C \times (V \times 0.85)^2 \times (F \times 0.85)}{C \times V^2 \times F} = 0.85^3 = 0.61$

- $P_{new}$ is 64% more power efficient than $P_{old}$ (since $1 / 0.61 \approx 1.64$)

Assigned Readings
Computer Architecture: A Quantitative Approach, Fifth Edition
by John Hennessy (Stanford University) and Dave Patterson (UC Berkeley)
Morgan Kaufmann (Elsevier)
Read Sections
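
For the DVFS example earlier in this lecture, the power ratio depends only on the scaling factors, because the capacitance cancels. A minimal sketch, with the function name `dynamic_power_ratio` introduced here for illustration:

```python
def dynamic_power_ratio(v_scale, f_scale):
    """P_new / P_old for dynamic power C x V^2 x F when V and F are scaled."""
    return v_scale ** 2 * f_scale


# Voltage and frequency both reduced by 15%, as in the DVFS example.
ratio = dynamic_power_ratio(v_scale=0.85, f_scale=0.85)
saving = 1.0 - ratio

print(f"P_new / P_old = {ratio:.2f}")
print(f"dynamic power reduced by {saving:.0%}")
```

The same 15% frequency reduction lengthens execution time, so the energy saved per task is smaller than the power saving suggests.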
More informationCS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of
CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis
More informationParallel Scalable Algorithms- Performance Parameters
www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for
More informationInstruction scheduling
Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationnumascale White Paper The Numascale Solution: Extreme BIG DATA Computing Hardware Accellerated Data Intensive Computing By: Einar Rustad ABSTRACT
numascale Hardware Accellerated Data Intensive Computing White Paper The Numascale Solution: Extreme BIG DATA Computing By: Einar Rustad www.numascale.com Supemicro delivers 108 node system with Numascale
More informationIntroducción. Diseño de sistemas digitales.1
Introducción Adapted from: Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Original from Computer Organization and Design, Patterson & Hennessy, 2005, UCB] Diseño de sistemas digitales.1
More informationEFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com
More informationComputer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.
Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University cwliu@twins.ee.nctu.edu.tw Review Computers in mid 50 s Hardware was expensive
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationHardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
More informationThe Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy
The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And
More informationEnterprise Applications
Enterprise Applications Chi Ho Yue Sorav Bansal Shivnath Babu Amin Firoozshahian EE392C Emerging Applications Study Spring 2003 Functionality Online Transaction Processing (OLTP) Users/apps interacting
More informationPyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts
PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts Workshop on Computer Architecture Education 2015 Dan Connors, Kyle Dunn, Ryan Bueter Department of Electrical Engineering University
More informationDigital Systems Design! Lecture 1 - Introduction!!
ECE 3401! Digital Systems Design! Lecture 1 - Introduction!! Course Basics Classes: Tu/Th 11-12:15, ITE 127 Instructor Mohammad Tehranipoor Office hours: T 1-2pm, or upon appointments @ ITE 441 Email:
More informationReconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra
More informationwhat operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
More informationSystem Models for Distributed and Cloud Computing
System Models for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Classification of Distributed Computing Systems
More informationUse-Case Power Management Optimization: Identifying & Tracking Key Power Indicators ELC- E Edimburgh, 2013-10- 24 Patrick Ti:ano
Use-Case Power Management Optimization: Identifying & Tracking Key Power Indicators ELC- E Edimburgh, 2013-10- 24 Patrick Ti:ano DRAFT!!! This is a draft version of the presentation Practical examples
More informationQLIKVIEW SERVER MEMORY MANAGEMENT AND CPU UTILIZATION
QLIKVIEW SERVER MEMORY MANAGEMENT AND CPU UTILIZATION QlikView Scalability Center Technical Brief Series September 2012 qlikview.com Introduction This technical brief provides a discussion at a fundamental
More informationWhy Latency Lags Bandwidth, and What it Means to Computing
Why Latency Lags Bandwidth, and What it Means to Computing David Patterson U.C. Berkeley patterson@cs.berkeley.edu October 2004 Bandwidth Rocks (1) Preview: Latency Lags Bandwidth Over last 20 to 25 years,
More information路 論 Chapter 15 System-Level Physical Design
Introduction to VLSI Circuits and Systems 路 論 Chapter 15 System-Level Physical Design Dept. of Electronic Engineering National Chin-Yi University of Technology Fall 2007 Outline Clocked Flip-flops CMOS
More informationOBJECTIVE ANALYSIS WHITE PAPER MATCH FLASH. TO THE PROCESSOR Why Multithreading Requires Parallelized Flash ATCHING
OBJECTIVE ANALYSIS WHITE PAPER MATCH ATCHING FLASH TO THE PROCESSOR Why Multithreading Requires Parallelized Flash T he computing community is at an important juncture: flash memory is now generally accepted
More informationIntel Labs at ISSCC 2012. Copyright Intel Corporation 2012
Intel Labs at ISSCC 2012 Copyright Intel Corporation 2012 Intel Labs ISSCC 2012 Highlights 1. Efficient Computing Research: Making the most of every milliwatt to make computing greener and more scalable
More informationAMD Opteron Quad-Core
AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced
More informationARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler
ARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler DAC 2008 Philip Watson Philip Watson Implementation Environment Program Manager ARM Ltd Background - Who Are We? Processor
More informationProgram Optimization for Multi-core Architectures
Program Optimization for Multi-core Architectures Sanjeev K Aggarwal (ska@iitk.ac.in) M Chaudhuri (mainak@iitk.ac.in) R Moona (moona@iitk.ac.in) Department of Computer Science and Engineering, IIT Kanpur
More informationPower-Aware High-Performance Scientific Computing
Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationChapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju
Chapter 7: Distributed Systems: Warehouse-Scale Computing Fall 2011 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note:
More informationPerformance metrics for parallel systems
Performance metrics for parallel systems S.S. Kadam C-DAC, Pune sskadam@cdac.in C-DAC/SECG/2006 1 Purpose To determine best parallel algorithm Evaluate hardware platforms Examine the benefits from parallelism
More informationNaveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi University of Utah & HP Labs 1 Large Caches Cache hierarchies
More information