- Nishad Nerurkar - Aniket Mhatre




The Single Chip Cloud Computer (SCC) is a research project developed by Intel, with contributions from Intel Labs in Bangalore, America, and Germany. It is part of a larger project, Tera-scale computing, which aims at executing trillions of calculations per second (teraflops).

Parallel computing is a technique in which multiple calculations are performed simultaneously. This is achieved by dividing larger problems into smaller ones and solving them concurrently. On a small scale, it can be implemented using multi-core and multi-processor computers that have multiple processing elements within a single machine. On a large scale, clusters, MPPs, and grids use multiple computers to work on highly parallel tasks concurrently.

Servers and Mainframes: Used in medium-scale enterprises for parallel data access and data processing.
Distributed Systems: Multiple autonomous computers that communicate through a computer network and interact with each other in order to process data simultaneously.
Cloud Computing: An extension of distributed systems in which multiple computers are connected over the internet; computation, software, data access, and storage services are provided by the network.

Data centers handle and service large amounts of data, so there is a need for low-power data centers and for faster parallel computing to service requests while at the same time reducing the power consumption of a network cloud. Scheduling and coordinating multiple computing elements under time and speed constraints is a further challenge.

The Single Chip Cloud Computer achieves the previously mentioned goals. It is analogous to a cloud network, hence the name cloud computer. The SCC contains an on-die cloud network and 48 cores along with on-chip memory and caches. It can work as a multi-core processor as well as a fast message-routing and data-servicing network. Its power consumption ranges from 25 W to 125 W, which is much lower than that of the Intel i7 (156 W).

Processor Mode: In processor mode, the cores are operational. The cores execute code from system memory and perform programmed I/O through the system interface, which is connected off-die to the system board FPGA. Loading memory and configuring the processor for bootstrapping is currently done by software running on the Management Console.

Mesh Mode: In mesh mode, the cores are off and the router is stressed for performance measurements. The mesh and the traffic generators are on and are sending and receiving large amounts of data. Because the core logic is off, there is no memory map; traffic generators and routers are programmed through the SCC JTAG interface.

Programs may be loaded into SCC memory by the Management Console through the system interface and via the SCC on-die mesh. The memory on the SCC may be dynamically mapped into the address space of the 48 cores. It may also be mapped into the memory space of the Management Console for program loading or debug. I/O instructions on the SCC processor cores are likewise mapped to the system interface and, by default, to the Management Console interface. Programs on the Management Console PC (MCPC) can capture and display such programmed I/O.

The SCC comprises a 6 x 4 tile mesh, giving 24 tiles, along with:
Four on-die memory controllers.
A voltage regulator controller (VRC), which allows any core or the system interface to adjust the voltage of its block as well as the voltage of the entire router mesh.
An external system interface controller (SIF), which communicates between the controller located on the system board and the router on the mesh network.

Each tile contains:
Two P54C-based IA processing cores with associated L1 and L2 caches.
A five-port crossbar router.
A traffic generator (TG) for testing the mesh.
A mesh interface unit (MIU) that handles all memory and message-passing requests.
Memory lookup tables (LUTs).
A message-passing buffer (MPB), also called the local memory buffer (LMB).
Assorted clock generation and synchronization circuitry (GCU and CCF) for crossing asynchronous boundaries.

The P54C core has been modified as follows:
The L1 I-cache and D-cache are 16 KB each.
The bus-to-cache controller interface (M-unit) has been integrated into the core.
The caches are 8-way associative.

The P54C ISA was extended with a new instruction (CL1INVMB) and a new memory type (MPBT) to facilitate the use of message data. All accesses to MPBT data bypass the L2 cache. The new instruction invalidates all L1 cache lines typed as MPBT, which maintains coherency between the caches and the message data. A write-combine buffer was added to the M-unit to accelerate message transfer between cores.
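
The C sketch below illustrates how software on a core might use CL1INVMB before reading message data from the message-passing buffer. The helper name and the 0x0F 0x0A opcode encoding are assumptions made for illustration; only the instruction name and its semantics (invalidating MPBT-typed L1 lines) come from the description above.

    /* Issue CL1INVMB; the opcode bytes used here are an assumption. */
    static inline void cl1invmb(void)
    {
        __asm__ volatile(".byte 0x0f, 0x0a" ::: "memory");
    }

    /* Copy a message that a sender placed in MPBT-typed memory (the MPB).
     * Invalidating the MPBT lines first ensures the reads fetch fresh data. */
    void receive_message(volatile char *mpb_src, char *dst, int len)
    {
        cl1invmb();
        for (int i = 0; i < len; i++)
            dst[i] = mpb_src[i];
    }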

Each core has its own private 256 KB L2 cache and an associated controller. During a miss, the cache controller sends the address to the mesh interface unit (MIU) for decoding and retrieval. Each core can have only one outstanding memory request and will stall on missed reads until the data are returned. On missed writes, the processor continues operation until another miss of either type occurs. Once the data have arrived, the processor continues normal operation. Tiles with multiple outstanding requests can be supported by the network and memory system.

Local Memory Buffer (LMB): Each 16 KB buffer provides the equivalent of 512 full cache lines of memory (512 lines x 32 bytes = 16 KB). There are 24 buffers on the chip, capable of fast read/write operations, and any core can read or write any buffer.
DDR3 Memory Controller: Four memory controllers provide up to 64 GB of DDR3 memory, with two unbuffered DIMMs per channel and two ranks per DIMM. Memory accesses are serviced in order; accesses to different banks and ranks are interleaved.

There is one LUT per core. Each LUT contains 256 entries, one for each 16 MB segment of the core's 4 GB physical memory address space. LUTs are programmed at start time through the system interface from the Management Console. The LUT acts like a page table: on an L2 cache miss, it directs the memory request to main memory.
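
As a rough illustration, the numbers above (256 entries covering a 4 GB space in 16 MB segments) imply that the top 8 bits of a core physical address select the LUT entry. The C sketch below assumes this; the lut_entry_t field names (router ID, sub-destination, bypass bit, system address bits) are illustrative placeholders, not the actual register layout.

    #include <stdint.h>

    typedef struct {
        uint8_t  route;      /* destination router (tile) ID on the mesh      */
        uint8_t  sub_dest;   /* intra-tile sub-ID (e.g. memory controller)    */
        uint8_t  bypass;     /* nonzero = access the tile-local MPB directly  */
        uint32_t sys_addr;   /* upper bits of the translated system address   */
    } lut_entry_t;

    /* Translate a 32-bit core physical address into a system address. */
    uint64_t lut_translate(const lut_entry_t lut[256], uint32_t core_addr)
    {
        uint32_t index  = core_addr >> 24;        /* which 16 MB segment       */
        uint32_t offset = core_addr & 0xFFFFFF;   /* offset within the segment */
        const lut_entry_t *e = &lut[index];
        return ((uint64_t)e->sys_addr << 24) | offset;
    }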

The MIU connects the tile to the mesh. It packetizes data out to the mesh and de-packetizes data in from the mesh. It controls the flow of data on the mesh with a credit-based protocol. It uses a round-robin scheme to arbitrate between the two cores on the tile. The function of the MIU can be thought of as catching a cache miss and decoding the core address into a system address.

The MIU contains the following components:
Packetizer and de-packetizer: Translates data to/from the tile agents and to/from the mesh.
Command interpretation and address decode/lookup: Decodes the address using the LUTs and places the request into the queue.
Local configuration registers: For traffic coming from the router, the MIU routes the data to the appropriate local destination using these registers.
Link-level flow control and credit management: Ensures the flow of data on the mesh using a credit-based protocol.
Arbiter: Controls which tile element may access the MIU at any given time via a round-robin scheme, as sketched below.
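
A minimal sketch of such a round-robin arbiter between the tile's two cores, assuming a simple last-granted pointer; the type and function names are illustrative, not the MIU's actual implementation.

    #include <stdbool.h>

    typedef struct {
        int last_granted;   /* core (0 or 1) that was granted most recently */
    } arbiter_t;

    /* Decide which requesting core gets the MIU this cycle; -1 if neither requests. */
    int arbitrate(arbiter_t *a, bool core0_req, bool core1_req)
    {
        bool req[2] = { core0_req, core1_req };
        int first = (a->last_granted == 0) ? 1 : 0;   /* the other core gets priority */
        for (int i = 0; i < 2; i++) {
            int c = (first + i) % 2;
            if (req[c]) {
                a->last_granted = c;
                return c;
            }
        }
        return -1;   /* no request pending this cycle */
    }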

The on-die 2D mesh network has 24 packet-switched routers connected in a 6 x 4 configuration and runs on its own power supply and clock source. This enables power-performance trade-offs that ensure the mesh delivers the required performance while consuming minimal power.

Router (RXB): The RXB is the next-generation router for future many-core 2D mesh fabrics; through it, the different tile agents on the mesh fabric communicate with each other. A packet consists of a single flit or multiple flits (up to three), with header, body, and tail flits. Control flits are used to communicate control information.
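
To make the packet format concrete, here is a hedged C sketch of a packet built from up to three flits; the field names and widths are assumptions for illustration, since the text above does not specify the actual flit layout.

    #include <stdint.h>

    /* Flit types named above; the enumeration values are assumptions. */
    typedef enum { FLIT_HEADER, FLIT_BODY, FLIT_TAIL, FLIT_CONTROL } flit_type_t;

    typedef struct {
        flit_type_t type;
        uint8_t     dest_router;   /* destination router ID (header flit)          */
        uint8_t     sub_dest;      /* intra-tile sub-destination                   */
        uint64_t    payload;       /* data in body/tail flits (width illustrative) */
    } flit_t;

    /* A packet is one to three flits: header, optional body, and tail. */
    typedef struct {
        flit_t flits[3];
        int    num_flits;          /* 1 to 3 */
    } packet_t;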

Flow control on the SCC mesh is credit-based. Each router has eight credits to give per port, and a router can send a packet to another router only when it holds a credit from that router. Credits are automatically routed back to the sender when the packet moves on to the next destination.
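
A minimal sketch of this credit scheme, assuming a simple per-port counter of eight credits; the structures and function names are illustrative, not the router's hardware interface.

    #include <stdbool.h>

    #define CREDITS_PER_PORT 8

    typedef struct {
        int credits;   /* credits currently held for the downstream router */
    } port_t;

    /* A router may forward a packet only while it holds a credit for that port. */
    bool try_send(port_t *out)
    {
        if (out->credits == 0)
            return false;    /* downstream buffers are full: hold the packet */
        out->credits--;      /* consume one credit for the packet being sent */
        /* ... transmit the packet's flits on the link ... */
        return true;
    }

    /* When the downstream router forwards the packet onward, the credit
     * is automatically returned to the sender. */
    void on_credit_return(port_t *out)
    {
        if (out->credits < CREDITS_PER_PORT)
            out->credits++;
    }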

The SCC provides low-level capabilities to support a variety of programming models. It also supports different memory configurations through the configuration registers and lookup tables in each tile. At any time, software can change the memory map for a core.

Each of the SCC's four memory controllers provides access to from 4 GB to 16 GB of main memory, for a maximum of 64 GB in total. Lookup tables map addresses from core physical addresses to system physical addresses. The shared space may be used for sending data between the cores or for storing shared data such as an in-memory database.

The SCC lookup table (LUT) unit performs the address translation from core addresses to system addresses. The operating mode of this unit is governed by the tile-level configuration registers. Two LUTs, one for each core, translate all outgoing core addresses into system addresses. To ensure proper physical routing of system addresses, the LUTs also provide router destination IDs and intra-tile sub-IDs. A bypass bit is also provided for local tile memory buffer access.

Interrupts to a core are signalled by setting and resetting the appropriate bit in the core configuration registers of the tile. Software can generate a non-maskable, maskable, or system management interrupt by using the appropriate bit in the configuration registers. Core handling of interrupts is configured in the Local Vector Table (LVT) of the local APIC.
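
A hedged sketch of the set-then-reset pattern described above, assuming the target core's configuration register is visible to the sender as a memory-mapped word; the bit position and names are placeholders, not the actual SCC register layout.

    #include <stdint.h>

    #define INTR_BIT (1u << 1)   /* placeholder position of the interrupt bit */

    /* Raise an interrupt on a remote core by setting and then resetting
     * the interrupt bit in that core's configuration register. */
    void send_interrupt(volatile uint32_t *core_cfg_reg)
    {
        uint32_t v = *core_cfg_reg;
        *core_cfg_reg = v | INTR_BIT;    /* assert the interrupt bit  */
        *core_cfg_reg = v & ~INTR_BIT;   /* de-assert (reset) the bit */
    }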

Applications: supercomputing, computer vision and 3D vision, low-power data centers, servers, high-end game consoles, and building blocks of tera-scale computers.