Energy Discounted Computing on Multicore Smartphones. Meng Zhu and Kai Shen

Similar documents
big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices

Software Engagement with Sleeping CPUs

Application Performance Analysis of the Cortex-A9 MPCore

Update on big.little scheduling experiments. Morten Rasmussen Technology Researcher

Executive Summary. Methodology

NVIDIA Tegra 4 Family CPU Architecture

big.little Technology: The Future of Mobile Making very high performance available in a mobile envelope without sacrificing energy efficiency

Smart Anytime, Safe Anywhere. Climax Home Portal Platform. Envisage and Enable a Connected Future

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Comparing Power Saving Techniques for Multi cores ARM Platforms

TAR25 audio recorder. User manual Version 1.03

Delivering Quality in Software Performance and Scalability Testing

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Tableau Server 7.0 scalability

AMD PhenomII. Architecture for Multimedia System Prof. Cristina Silvano. Group Member: Nazanin Vahabi Kosar Tayebani

A Powerful solution for next generation Pcs

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

System-Level Display Power Reduction Technologies for Portable Computing and Communications Devices

SNAPPIN.IO. FWR is a Hardware & Software Factory, which designs and develops digital platforms.

Optimizing Shared Resource Contention in HPC Clusters

Performance and scalability of a large OLTP workload

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build

Chapter 1 Computer System Overview

System Requirements Table of contents

E21 Mobile Users Guide

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

LCMON Network Traffic Analysis

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

Agenda. Context. System Power Management Issues. Power Capping Overview. Power capping participants. Recommendations

Improving Grid Processing Efficiency through Compute-Data Confluence

White Paper Perceived Performance Tuning a system for what really matters

Cost-Efficient Job Scheduling Strategies

Cloud Computing through Virtualization and HPC technologies

Visualizing gem5 via ARM DS-5 Streamline. Dam Sunwoo ARM R&D December 2012

CYCLOPE let s talk productivity

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

Hardware/Software Guidelines

MAUI: Dynamically Splitting Apps Between the Smartphone and Cloud

Energy-aware Memory Management through Database Buffer Control

Operating Systems 4 th Class

GraySort on Apache Spark by Databricks

CHAPTER FIVE RESULT ANALYSIS

Ahsay Online Backup Suite v5.0. Whitepaper Backup speed

System requirements. for Installation of LANDESK Service Desk Clarita-Bernhard-Str. 25 D Muenchen. Magelan GmbH

Feb.2012 Benefits of the big.little Architecture

Ready Time Observations

Tableau Server Scalability Explained

Whitepaper. The Benefits of Quad Core CPUs in Mobile Devices

LOOKING FOR AN AMAZING PROCESSOR. Product Brief 6th Gen Intel Core Processors for Desktops: S-series

Mirtrak 6 Powered by Cyclope

Industry First X86-based Single Board Computer JaguarBoard Released

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010

Power Benefits Using Intel Quick Sync Video H.264 Codec With Sorenson Squeeze

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Multi-Threading Performance on Commodity Multi-Core Processors

Performance of Host Identity Protocol on Nokia Internet Tablet

XProtect Mobile 1. Specification Sheet

HP Z Turbo Drive PCIe SSD

Whitepaper. The Benefits of Multiple CPU Cores in Mobile Devices

SYSTEM SETUP FOR SPE PLATFORMS

Muse Server Sizing. 18 June Document Version Muse

Configuring a U170 Shared Computing Environment

ARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler

Infor Web UI Sizing and Deployment for a Thin Client Solution

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

GPUs for Scientific Computing

Grant Management. System Requirements

Xpresstransfer Online Backup Suite v5.0 Whitepaper Backup speed analysis

Hardware Configuration Guide

Hypervisor-based Background Encryption

vsphere Resource Management

Oracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011

HyperThreading Support in VMware ESX Server 2.1

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Infrastructure Matters: POWER8 vs. Xeon x86

Cloud computing is a marketing term that means different things to different people. In this presentation, we look at the pros and cons of using

En Wireless Mobile Utility (Android) User s Manual. D610, D600, D7100, D5300, D5200, D3300, Df

Variations in Performance and Scalability when Migrating n-tier Applications to Different Clouds

Oracle Database Scalability in VMware ESX VMware ESX 3.5

VP/GM, Data Center Processing Group. Copyright 2014 Cavium Inc.

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

Transform: Cloud Client Computing

1) SETUP ANDROID STUDIO

SQL Server Consolidation Using Cisco Unified Computing System and Microsoft Hyper-V

Which ARM Cortex Core Is Right for Your Application: A, R or M?

Comparing Multi-Core Processors for Server Virtualization

Symantec Endpoint Protection Sizing and Scalability Best Practices White Paper

Energy-aware job scheduler for highperformance

Gateway Portal Load Balancing

Understanding the Performance of an X User Environment

Improving the performance of data servers on multicore architectures. Fabien Gaud

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi

Managing Capacity Using VMware vcenter CapacityIQ TECHNICAL WHITE PAPER

SQL Server 2008 is Microsoft s enterprise-class database server, designed

How To Test For Performance And Scalability On A Server With A Multi-Core Computer (For A Large Server)

Disk Storage Shortfall

Enjoy greater flexibility and effortless collaboration with wireless shared access.

How System Settings Impact PCIe SSD Performance

Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources

Transcription:

Energy Discounted Computing on Multicore Smartphones Meng Zhu and Kai Shen

Power- Ful & Hungry Smartphones Multi-core Popularity and Low Battery Anxiety Tri-cluster Deca-core 2

Multicore Energy Disproportionality Multicore processors are not energy proportional, despite all the efforts Aggressive hardware sharing Shared clocking circuitry forces cores to operate at the same frequency Power-gating enables low power idle state, but deep idle states can only be entered during simultaneous sleep 3

Multicore Energy Disproportionality State Name Power Description C0 Wait for interrupt (WFI) 403 mw Individual core clock gated. C1 Individual powerdown 365 mw - Individual core power gated. - L1 cache content lost C2 Cluster powerdown 214 mw - Enter during simultaneous sleeps - All state (e.g. L2$) lost Device: Kirin 925 SoC on Huawei Mate7 Smartphone, cluster has four cores 4

Multicore Energy Disproportionality Tegra 3 based Nexus 7 tablet 4 Kirin 925 based Huawei Mate7 4 Power (Watts) 3 2 1 Power (Watts) 3 2 1 0 0 0 1 2 3 4 0 4 8 Number of active CPUs Number of active CPUs Very energy efficient during peak utilization Consume minimum power when all cores are quiescent Inefficient when only one core is utilized 5

Mobile Apps Lack Parallelism Typical smartphone applications are built on eventdriven, UI-centric framework and serve a single user Do not have sufficient parallelism to utilize multiple CPU cores simultaneously On a quad-core system, of all the non-idle time: All four cores are utilized: less than 1% Only one core is utilized: 68% Test conducted with a variety of popular mobile applications (Gao et al. ISPASS 15)

Opportunity Hardware energy disproportionality and Lack of thread level parallelism > Computing resources at additional cores are available at a deep energy discount Utilize these resources to run best-effort tasks: useful tasks on a smartphone but do not involve direct user interaction (thus its time of execution is flexible) 7

Best-effort Tasks Upload and download System maintenance work Background sensing Proactive Tasks 8

Energy Discounted Computing Bundling tasks to save energy on smartphones is not new Lane et al. [Sensys 2013]: Piggybacking sensing activities Nikzad et al. [ICSE 2014]: Annotation language for developers to delay certain work Our contribution: Maximum energy discount is only realized when the co-run best-effort task execution does not elevate the overall system power state. 9

Best-effort task execution must NOT Disrupt the multicore CPU idle state [ C state] Follow the step of interactive task execution Non-work-conserving-scheduling Increase the core frequency [ P state] Invisible to the system frequency adjustment Do not affect frequency scaling for interactive tasks Affect the smartphone s suspension period [ S state] Hide behind the CPU power profile of interactive tasks 10

Implementation (Huawei Mate7, Android 4.4, Linux 3.10.30) Best-effort tasks are put into a control group Idle state preservation: Each core maintains a status bit: BUSY, IDLE, BEST-EFFORT Regular tasks have absolute priority over best-effort ones If a best-effort task is picked, check sibling cores to see if anyone is BUSY. If no one is BUSY, enter idle state directly (non-work-conserving scheduling) 11

Implementation CPU frequency preservation: DVFS is controlled by cpufreq governor, adjust frequency based on load Best-effort tasks are ignored during load calculation Performance of regular tasks are not affected Suspension state preservation: Best-effort tasks are not allowed to hold wakelocks 12

Contention Mitigation Co-run applications leads to contention on multicore resources, cause performance degradation to interactive applications Scheduler priority modification does not remove contention on shared resources, i.e. cache and memory bandwidth Monitoring last-level-cache miss rate using performance counters and throttle best-effort tasks accordingly 13

Implementation Contention mitigation: Monitor last-level-cache miss rate ARMV7_A15_PERFCTR_L2_CACHE_REFILL_RE AD/WRITE Sample every 20 ms Stop scheduling best-effort tasks when the miss rate reaches certain threshold Overhead: less than 1% for all of our benchmarks 14

Evaluation: Setup Device: Huawei Mate7 (late 2014) 1.8 GHz ARM Cortex-A15 Quad Core 2MB L2 cache, 2GB RAM Power measurement using Monsoon power meter with smartphone battery detached 15

Evaluation: Benchmark Interactive application: Bbench: load locally cached websites Angry bird: casual game Best-effort tasks: Spin, Compression, Encryption, AppOpt, FaceAnalysis 16

Evaluation: Test flow Use automate UI testing tools (RERAN [1] ) to minimize variations Launch two applications roughly at the same time Configure the workload such that application executions mostly overlap Discount s = 1 E co-run E interactive alone E best-effort alone [1] Gomez et al. ICSE 13 17

Energy Discount Default co run Power states preservation scheduling Energy discount ratio 60% 40% 20% 0 AngryBird Spin Compress Encrypt AppOpt Face Energy discount ratio 40% 30% 20% 10% 0 Bbench Spin Compress Encrypt AppOpt Face 18

Current Wave (Bbench + Spin) P state disruption Current (Amps) 1 Bbench running alone 0 0 20 40 Time (Seconds) 1 Bbench + Spin default co run 0 0 20 40 Time (Seconds) 1 Bbench + Spin with power state preserving scheduling 0 0 20 40 Time (Seconds) C state and P state disruption 19

Energy Discount 60 AngryBird Default co run Power states preservation scheduling Power states preservation and contention aware scheduling FPS 40 20 0 Spin Compress Encrypt AppOpt Face Bbench Slowdown ratio 8% 6% 4% 2% 0 Spin Compress Encrypt AppOpt Face 20

Abundance of Discounted CPU Cycles Category Web browsing Video streaming Abundance Frames of face analyzed Minutes of video encrypted 1.63 30 21 2.41 4 3 Gaming 1.61 21 15 Navigation 2.42 13 9 Messaging 2.88 3 2 Social network 1.88 12 9 * Abundance of discounted CPU cycles is the ratio of energy discounted CPU cycles to the active CPU cycles used by the corresponding interactive application.

Summary - Energy disproportionality of multicore CPUs and lack of parallelism of smartphone applications provide abundant opportunities to run useful best-effort tasks at deep energy discount - Maximum energy discount can only be realized when overall system power states are preserved - Contention-aware scheduling based on monitoring hardware performance counters is effective in mitigating interactivity slowdown - Experiments show significant energy savings (up to 63%) and little performance impact (less than 4% in the worst case) to the smartphone interactivity

Thank you Questions? Energy Discounted Computing on Multicore Smartphones Meng Zhu and Kai Shen