Motivation: Smartphone Market



Similar documents
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

Architectures and Platforms

Configuring Memory on the HP Business Desktop dx5150

OpenSPARC T1 Processor

Reduced Precision Hardware for Ray Tracing. Sean Keely University of Texas, Austin

Computer Architecture

A N. O N Output/Input-output connection

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

Choosing a Computer for Running SLX, P3D, and P5

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

Latency on a Switched Ethernet Network

ProCAP Transfer with Omneon Interface

Security Systems. Technical Note. iscsi sessions if VRM Load Balancing in All Mode is used Version 1.0 STVC/PRM. Page 1 of 6

MCTS Guide to Microsoft Windows 7. Chapter 10 Performance Tuning

1. Memory technology & Hierarchy

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Memory Basics. SRAM/DRAM Basics

Stream Processing on GPUs Using Distributed Multimedia Middleware

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

- Nishad Nerurkar. - Aniket Mhatre

A case study of mobile SoC architecture design based on transaction-level modeling

Q & A From Hitachi Data Systems WebTech Presentation:

Memory unit. 2 k words. n bits per word

Chapter 11 I/O Management and Disk Scheduling

DDR4 Memory Technology on HP Z Workstations

Microsoft SQL Server OLTP Best Practice

Intel 965 Express Chipset Family Memory Technology and Configuration Guide

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

Tableau Server Scalability Explained

Introduction to SwiftBroadband

Computer Graphics Hardware An Overview

Supported Upgrade Paths for FortiOS Firmware VERSION

ASTERIX Format Analysis and Monitoring Tool

Accelerating Development and Troubleshooting of Data Center Bridging (DCB) Protocols Using Xgig

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

A Comparison of Client and Enterprise SSD Data Path Protection

CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014

OC By Arsene Fansi T. POLIMI

Computer Architecture

Intel Pentium 4 Processor on 90nm Technology

Adobe Marketing Cloud Data Workbench Monitoring Profile

Resource Aware Scheduler for Storm. Software Design Document. Date: 09/18/2015

IP Video Rendering Basics

Table Of Contents. Page 2 of 26. *Other brands and names may be claimed as property of others.

SABRE Lite Development Kit

Chapter 3 ATM and Multimedia Traffic

21152 PCI-to-PCI Bridge

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Samsung Solid State Drive RAPID mode

FAST 11. Yongseok Oh University of Seoul. Mobile Embedded System Laboratory

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

NVIDIA GeForce GTX 580 GPU Datasheet

DDR subsystem: Enhancing System Reliability and Yield

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

Oracle Database In-Memory The Next Big Thing

SAP HANA PLATFORM Top Ten Questions for Choosing In-Memory Databases. Start Here

Chapter 1 Computer System Overview

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux

Computer & Information Systems (22:198:604) Fall 2008 Homework 2 Solution

N /150/151/160 RAID Controller. N MegaRAID CacheCade. Feature Overview

ALL-AIO-2321P ZERO CLIENT

Logitech ConferenceCam CC3000e. Best Practices for use with Software Clients. UC for Real People

Safe Harbor Statement

Securing and Accelerating Databases In Minutes using GreenSQL

Notes and terms of conditions. Vendor shall note the following terms and conditions/ information before they submit their quote.

Windows Server 2008 R2 Hyper V. Public FAQ

NAND Flash FAQ. Eureka Technology. apn5_87. NAND Flash FAQ

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Chapter 2 Heterogeneous Multicore Architecture

INSTALLATION GUIDE ENTERPRISE DYNAMICS 9.0

theguard! ApplicationManager System Windows Data Collector

Hardware Performance Optimization and Tuning. Presenter: Tom Arakelian Assistant: Guy Ingalls

White Paper. Optimizing the Performance Of MySQL Cluster

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Computer System Design. System-on-Chip

Lecture 2: Computer Hardware and Ports.

Top 10 Performance Tips for OBI-EE

Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze

DDR3 memory technology

Benchmarking Cassandra on Violin

Application Performance Analysis and Troubleshooting

Texture Cache Approximation on GPUs

Configuring and using DDR3 memory with HP ProLiant Gen8 Servers

HP ProLiant DL580 Gen8 and HP LE PCIe Workload WHITE PAPER Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc.

my forecasted needs. The constraint of asymmetrical processing was offset two ways. The first was by configuring the SAN and all hosts to utilize

W I S E. SQL Server 2008/2008 R2 Advanced DBA Performance & WISE LTD.

Enhancing SQL Server Performance

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Network Performance Optimisation: The Technical Analytics Understood Mike Gold VP Sales, Europe, Russia and Israel Comtech EF Data May 2013

CSE 237A Final Project Final Report

FNT EXPERT PAPER. // From Cable to Service AUTOR. Data Center Infrastructure Management (DCIM)

Transcription:

Motivation: Smartphone Market

Smartphone Systems External Display Device Display

Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics Processing Unit Local Storage Main Memory Modem Encoder Decoder Specialised Specialised Hardware Specialised Hardware ElementHardware Element Element strict and competing requirements: energy vs. real-time performance are conventional solutions appropriate?

This work a step towards smartphone-appropriate memory scheduler designs trace-based methodology memory traces with dependence information software-based methodology to approximate hardware accelerator behavior we study: address mapping schemes memory schedulers Video Conference Workload other smartphone workloads

Content 1. Video Conference 2. Modeling Specialised Hardware 3. Infrastructure Overview 4. Results Address Mapping Scheme Scheduler Comparisons 5. Summary

Video Conference two-way video call a conversation between 2 persons, a video conference between meeting rooms, etc.

Video Conference Information Flow Main Memory Camera Frame Coded Frame Modem Modem Display Encoder Decoder Encoding stage Decoding stage

Video Conference Information Flow Frame Main Memory Coded Frame Camera Modem Modem Display Encoder Decoder Encoding stage Decoding stage

Video Conference Information Flow Main Memory Camera Modem Modem Display Encoder Frame Coded Frame Decoder Encoding stage Decoding stage

Video Conference Information Flow Main Memory Camera Modem Modem Display Coded Frame Encoder Decoder Frame Encoding stage Decoding stage

Video Conference Information Flow Coded Frame Main Memory Frame Camera Modem Modem Display Encoder Decoder Encoding stage Decoding stage

Video Conference Information Flow Main Memory Camera Coded Frame Modem Modem Frame Display Encoder Decoder Encoding stage Decoding stage

Video Conference Information Flow Main Memory Camera Modem Modem Encoder? Decoder? Display Encoding stage Decoding stage But how to simulate the encoder and decoder?

Modelling Specialised Hardware devices designed for a specific use optimised execution paths buffers and small caches right where needed ideally: collect traffic from a real device our approach: Encoder Decoder instrument software application and collect traces use cache to filter accesses and Trace get Collector the essential pattern Collected requests Filter cache Trace files

Maintaining Ordering Constraints traces -> no relationships between requests store dataflow information to limit requests Original stream Regular traces Our methodology READ X READ Y Z := X + Y READ X READ Y WRITE Z READ X READ Y WRITE Z : depends on X, Y WRITE Z

Infrastructure overview Memory Scheduler Transaction Queue Command Queue Rank 0 Command Queue Rank 1 to DRAM chips

DRAM Organisation Rank 0 Bank 0 Bank 2 Bank 4 Bank 6 Bank 1 Bank 3 Bank 5 Bank 7 Rank 1 Bank 0 Bank 2 Bank 4 Bank 6 Bank 1 Bank 3 Bank 5 Bank 7

DRAM Organisation 4GB DDR3 SDRAM at 800 MHz logical organisation: 2 ranks [1 bit] 8 banks [3 bits] 16384 row [14 bits] 256 columns [8 bits] Rank 0 Bank 0 Bank 2 Bank 4 Bank 6 Bank 1 Bank 3 Bank 5 Bank 7 Rank 1 Bank 0 Bank 2 Bank 4 Bank 6 Bank 1 Bank 3 Bank 5 Bank 7

Requests, Location and Scheduling A B mapping A and B to the same bank, different rows t PRECH. ACTIVATE READ A PRECH. ACTIVATE READ B mapping A and B to the same bank and row t PRECH. ACTIVATE READ A READ B mapping A and B to different banks t PRECH. ACTIVATE READ A PRECH. ACTIVATE READ B 20

Requests, Location and Scheduling A B C = A++ mapping A and B to the same bank, different rows t PRECH. ACTIVATE READ A PRECH. ACTIVATE READ B... mapping A and B to the same bank and row t PRECH. ACTIVATE READ A READ B PRECH. ACTIVATE READ C mapping A and B to different banks t PRECH. ACTIVATE READ A PRECH. ACTIVATE READ B 21

Requests, Location and Scheduling A B C = A++ mapping A and B to the same bank, different rows t PRECH. ACTIVATE READ A READ PRECH. C PRECH. ACTIVATE ACTIVATE READ BREAD B mapping A and B to the same bank and row t PRECH. ACTIVATE READ A READ B PRECH. ACTIVATE READ C mapping A and B to different banks t PRECH. ACTIVATE READ A PRECH. ACTIVATE READ B Address mapping and scheduling can have great effect! 22

Content 1. Video Conference 2. Modeling Specialised Hardware 3. Infrastructure Overview 4. Results Address Mapping Scheme Scheduler Comparisons 5. Summary

Content 1. Video Conference 2. Modeling Specialised Hardware 3. Infrastructure Overview 4. Results Address Mapping Scheme Scheduler Comparisons 5. Summary

Test Configurations scheduler configuration: Max Buffer Hits Command Queue Transaction Queue Write Queue High Mark Limited 32 8 24 16 12 8 Maximum 1024 512 512 64 60 50 Low Mark * Our implementation of TFRR does not use TQ, so we increase CQ to 20 and 1024 we examine: address mapping on the VCW workload schedulers on a combination of: Web Browsing, Face Detection, VCW

Address Mapping Schemes K designates the rank bits

Address Mapping Schemes: Results Execution time (20 frames): lower is better

Address Mapping Schemes: Results Execution time (20 frames): lower is better RKBC performs best due to high row buffer locality combined with good parallelism

Schedulers First Ready First Come, First Served (FR-FCFS) baseline, simple FR-FCFS with Write Drain (FR-FCFS-WD) delays WRITEs to improve READ latency Thread-Fair memory scheduler (TF) prioritises requests from ROB head Thread Clustering memory scheduler (TCM) thread-ranking strict prioritisation

Scheduler Comparison Execution time: lower is better

Scheduler Comparison Execution time: lower is better Slower by 7% Best performer Slower by 1.5%

Summary our contributions trace-based methodology request issuing limited by dataflow uses cache to model specialised hardware no validation (yet) Video Conference Workload model typical smartphone usage our tests show it is memory bound

Summary: Our Findings address mapping has significant impact best scheme runs in 1/5 time of the worst one compare schedulers older, simpler: FR-FCFS, FR-FCFS-WD newer, thread-concious: Thread-Fair, Thread Clustering found that simpler perform better Write Drain mode useful

THANK YOU FOR YOUR ATTENTION!