Application Performance Analysis of the Cortex-A9 MPCore




Bryan Lawrence

This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development.

Agenda
- Motivation
- Experimentation platforms
- Performance exploration of different application classes
- Performance evaluation of multiple concurrent applications
- Summary and conclusion

Phone++: Upcoming Use Cases
- Mobile Internet browsing
- Video conferencing
- Gaming on the go
  - Multi-player over 3G / 4G network
- 3D navigation

Mobile Phone Applications: Compute Intensive

Tablet Applications: Compute Intensive

Achieving Scalable Performance
- Processor clock frequency is not the only metric of performance.
- Scalable, energy-efficient performance is required everywhere from mobile devices (phones, tablets) to large enterprise computing.
- Can multicore processors provide a potential solution?

Hardware Platforms
- Versatile Express
  - ARM-NEC Cortex-A9 processor test chip, ~400 MHz
  - Cortex-A9 x 4
  - 4x NEON/FPU
  - 32 KB individual I & D L1 caches
  - 512 KB L2 cache
  - 1 GB RAM (32-bit DDR2)
- Early Partner Silicon
  - Cortex-A9 x 2 @ 1 GHz
  - 1 GB RAM

Video Decode / Encode
- Hardware encoders/decoders are common in consumer devices.
- Video/audio codec standards evolve rapidly.
- Many codecs are used too infrequently to justify dedicated hardware.
- Consumer applications involve other video processing that differs from encode / decode (e.g. video editing).
- Simultaneous encode / decode is required for video conferencing.

H.264 Decode / Encode
- FFmpeg is used for decode.
- The x264 library is used with FFmpeg for video encode.
- CIF and VGA resolutions
  - Commonly used in video conferencing
- Movie trailers are used as test streams.
  - Order of computation is higher than for video-conferencing streams
  - Compression factor of 100-200

H.264 Decode / Encode
- Results are for single-core operation.
- Normalized logarithmic scales are used.
- Encode is more compute intensive than decode (at least ~2-3 times).
- Writing decoded streams out to secondary storage is limited by the bandwidth of the storage media.

H.264 Decode / Encode
- Concurrent video decode + encode
  - Important use case for video conferencing
- Excellent scalability is observed up to all 4 cores.
- Encoding is at least 2-3 times as compute intensive as decode.
- Ideally, more resources should be dedicated to encode.
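The point about dedicating more resources to encode can be illustrated with a minimal sketch that splits cores in proportion to compute weight. The function name `split_cores` and the 3:1 default weighting are assumptions for illustration (the ratio is taken from the upper end of the "2-3 times" estimate above); the slides do not describe how threads were actually assigned.

```python
def split_cores(total_cores, encode_weight=3.0, decode_weight=1.0):
    """Split cores between encode and decode in proportion to compute weight.

    The 3:1 default is an assumption based on the "2-3 times" estimate;
    it is not a measured value from the presentation.
    """
    if total_cores < 2:
        raise ValueError("need at least one core each for encode and decode")
    share = encode_weight / (encode_weight + decode_weight)
    # Round to the nearest core, but always leave one core for each task.
    encode_cores = min(max(round(total_cores * share), 1), total_cores - 1)
    return encode_cores, total_cores - encode_cores


if __name__ == "__main__":
    encode, decode = split_cores(4)
    print(f"encode cores: {encode}, decode cores: {decode}")
```

On a four-core part such as the Versatile Express test chip, this weighting gives three cores to encode and one to decode.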

On2/Google VP8
- The libvpx library (from the WebM project) is used for decoding VP8.
- libvpx is multi-threaded and actively takes advantage of the parallelism available in the VP8 codec.
- Comparative results were obtained on the Versatile Express and 1 GHz dual-core platforms.

On2/Google VP8
- Shows good scalability with the number of cores.
- Scalability is relatively independent of the number of partitions in the video frame.
- Saturation is observed when the number of threads exceeds the number of cores.
- Designers can query the platform for the number of cores to determine the available parallelism.
- Charts: 1 GHz dual-core; Versatile Express
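Querying the platform for its core count, and capping the worker thread count at that value to avoid the saturation noted above, might look like the following minimal sketch. The helper name `pick_thread_count` is hypothetical; how the resulting value is handed to a particular decoder is not covered by the slides and is left out here.

```python
import os


def pick_thread_count(requested=None):
    """Return a thread count that never exceeds the number of cores.

    os.cpu_count() can return None when the count is undetermined,
    so fall back to a single thread in that case.
    """
    cores = os.cpu_count() or 1
    if requested is None:
        return cores
    # More threads than cores gave no further gain in the results above.
    return max(1, min(requested, cores))


if __name__ == "__main__":
    print("threads to use:", pick_thread_count())
```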

Compilation: FFmpeg
- Code compilation has inherent parallelism at the module level.
- Most build systems allow this parallelism to be exploited.
  - E.g. make -j 4
- Compilation of FFmpeg and the Linux kernel is shown here.
- Charts: 1 GHz dual-core; Versatile Express
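As a small illustration of matching the build's job count to the hardware, the sketch below assembles a `make -j N` command line with N defaulting to the detected core count. The helper `make_command` is hypothetical, and the choice of one job per core is an assumption rather than the setting used for the measurements.

```python
import os


def make_command(jobs=None, target=None):
    """Build a 'make -j N' argument list, defaulting N to the core count."""
    jobs = jobs or os.cpu_count() or 1
    cmd = ["make", "-j", str(jobs)]
    if target:
        cmd.append(target)
    return cmd


if __name__ == "__main__":
    # Print the command only; pass the list to subprocess.run(..., check=True)
    # from inside a source tree to actually start the build.
    print(" ".join(make_command()))
```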

Compilation: Linux Kernel
- Almost linear speed-up is observed with the number of cores in both cases.
- Effectively doubles (quadruples) the utilized memory bandwidth for 2 cores (4 cores).
- Charts: 1 GHz dual-core; Versatile Express

Browsers
- Browser benchmark using a collection of web pages similar to the mix found in common browsing.
- A speed-up of 1.54 times is observed between single-core and dual-core execution.
- The WebCore fraction of the execution-time pie grows for multicore execution.
- Charts: normalized performance (1.54x); execution time decomposition
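One way to read the 1.54x figure is through Amdahl's law, which is not used in the presentation and is applied here purely as an assumed model. Under that assumption, the sketch below inverts the law to estimate what fraction of the browser workload behaves as parallel on two cores; the function names are illustrative.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speed-up predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def amdahl_parallel_fraction(speedup, cores):
    """Invert Amdahl's law: parallel fraction implied by an observed speed-up."""
    if cores < 2 or not 1.0 <= speedup <= cores:
        raise ValueError("expected cores >= 2 and 1 <= speedup <= cores")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    # 1.54x on 2 cores is the browser result reported on this slide.
    p = amdahl_parallel_fraction(1.54, 2)
    print(f"implied parallel fraction: {p:.2f}")  # about 0.70
```

The estimate is only as good as the model: it ignores cache and memory-bandwidth effects, which the presentation notes can push results above or below the simple prediction.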

Multiple Concurrent Applications
- Multitasking is becoming mainstream in mobile devices today.
- Common combinations include:
  - Browser + audio playback (e.g. Internet radio)
  - Browser + background download
- Independent applications can benefit immensely from parallelization.

Browser + Pandora Internet Radio
- Speed-up factor of 1.9.
- Super-linear speed-up can sometimes be observed due to reduced cache pollution from conflicting applications.
- The speed-up can be traded for energy by slowing the cores down (depends on the fabrication process technology used).
- Charts: normalized performance (1.9x); execution time decomposition

Browser + Internet File Download
- Speed-up factor of 1.64x.
- A common use case is downloading an app from an application store or marketplace while browsing the Internet.
- Email synchronization in the background is a similar use case.
- Charts: normalized performance (1.64x); execution time decomposition

Cortex-A9 MP Benefits: Performance

Normalized performance    Browser (single app)
1 core                    1
2 cores                   1.54

Cortex-A9 MP Benefits: Richer Experience

Normalized performance    Browser + Download    Browser (single app)    Browser + Pandora
1 core                    0.73                  1                       0.78
2 cores                   1.20                  1.54                    1.50
Speed-up                  1.64x                 1.54x                   1.9x
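The speed-up factors quoted for each scenario follow directly from the normalized 1-core and 2-core figures on this slide. A minimal sketch of that arithmetic, with the dictionary layout and function name chosen only for illustration:

```python
# Normalized performance from the slide: (1 core, 2 cores).
NORMALIZED = {
    "Browser + Download": (0.73, 1.20),
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
}


def speedups(table):
    """Dual-core speed-up = 2-core performance / 1-core performance."""
    return {name: two / one for name, (one, two) in table.items()}


if __name__ == "__main__":
    for name, factor in speedups(NORMALIZED).items():
        print(f"{name}: {factor:.2f}x")
```

The Pandora ratio works out to slightly above the rounded 1.9x shown on the slide, since 1.50 / 0.78 is a little over 1.92.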

Summary and Conclusion
- This presentation demonstrates the scalability of the ARM Cortex-A9 MPCore processor across various classes of applications, on currently available software.
- Better power/performance can be achieved with an efficient, low-power ARM multicore processor than with a single processor at a much higher frequency.
- Next-generation software will make more intensive use of threads, and scalability will improve further.

Thank You
- Please visit www.arm.com for ARM-related technical details.
- For any queries, contact Salesinfo-IN@arm.com.