Visualizing gem5 via ARM DS-5 Streamline. Dam Sunwoo (dam.sunwoo@arm.com) ARM R&D December 2012



Similar documents
Development With ARM DS-5. Mervyn Liu FAE Aug. 2015

Performance Optimization and Debug Tools for mobile games with PlayCanvas

TEGRA X1 DEVELOPER TOOLS SEBASTIEN DOMINE, SR. DIRECTOR SW ENGINEERING

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

High Performance or Cycle Accuracy?

GPU Profiling with AMD CodeXL

Android Development: a System Perspective. Javier Orensanz

Development Studio 5 (DS-5)

STLinux Software development environment

VM Application Debugging via JTAG: Android TRACE32 JTAG Debug Bridge ADB Architecture Stop-Mode implications for ADB JTAG Transport Outlook

DS-5 ARM. Using the Debugger. Version Copyright ARM. All rights reserved. ARM DUI 0446M (ID120712)

DS-5 ARM. Using the Debugger. Version 5.7. Copyright 2010, 2011 ARM. All rights reserved. ARM DUI 0446G (ID092311)

Embedded Development Tools

Accelerate your Mobile Apps and Games for Android on ARM. Matthew Du Puy Software Engineer, ARM

ARM DS-5 Development Studio

Perf Tool: Performance Analysis Tool for Linux

Linux. Reverse Debugging. Target Communication Framework. Nexus. Intel Trace Hub GDB. PIL Simulation CONTENTS

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

AppScope: Application Energy Metering Framework for Android Smartphones using Kernel Activity Monitoring

Getting Started with CodeXL

OPTIMIZE DMA CONFIGURATION IN ENCRYPTION USE CASE. Guillène Ribière, CEO, System Architect

The Intel VTune Performance Analyzer

Comparing Power Saving Techniques for Multi cores ARM Platforms

DS-5 ARM. Using the Debugger. Version Copyright ARM. All rights reserved. ARM DUI0446P

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Sequential Performance Analysis with Callgrind and KCachegrind

How To Develop For A Powergen 2.2 (Tegra) With Nsight) And Gbd (Gbd) On A Quadriplegic (Powergen) Powergen Powergen 3

Real-time Debugging using GDB Tracepoints and other Eclipse features

DB2 Connect for NT and the Microsoft Windows NT Load Balancing Service

Reminders. Lab opens from today. Many students want to use the extra I/O pins on

New Relic & JMeter - Perfect Performance Testing

Directions for VMware Ready Testing for Application Software

Optimizing Application Performance with CUDA Profiling Tools

file://d:\webs\touch-base.com\htdocs\documentation\androidplatformnotes52.htm

Introduction to the NI Real-Time Hypervisor

Software Tracing of Embedded Linux Systems using LTTng and Tracealyzer. Dr. Johan Kraft, Percepio AB

Kernel comparison of OpenSolaris, Windows Vista and. Linux 2.6

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices

WEBSPHERE APPLICATION SERVER ADMIN V8.5 (on Linux and Windows) WITH REAL-TIME CONCEPTS & REAL-TIME PROJECT

Sequential Performance Analysis with Callgrind and KCachegrind

RTOS Debugger for ecos

Debugging with TotalView

GPU Tools Sandra Wienke

Inside Android's UI Embedded Linux Conference Europe 2012 Karim

Deploying Cisco Unified Contact Center Express Volume 1

Debugging Multi-threaded Applications in Windows

Multi-/Many-core Modeling at Freescale

Using System Tracing Tools to Optimize Software Quality and Behavior

Android Fundamentals 1

PyCompArch: Python-Based Modules for Exploring Computer Architecture Concepts

Multi-core and Linux* Kernel

Free performance monitoring and capacity planning for IBM Power Systems

Application-Level Debugging and Profiling: Gaps in the Tool Ecosystem. Dr Rosemary Francis, Ellexus

Finding Performance and Power Issues on Android Systems. By Eric W Moore

ERIKA Enterprise pre-built Virtual Machine

Low power GPUs a view from the industry. Edvard Sørgård

Open Source DBMS CUBRID 2008 & Community Activities. Byung Joo Chung bjchung@cubrid.com

Tutorial: Load Testing with CLIF

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

GETTING STARTED WITH ANDROID DEVELOPMENT FOR EMBEDDED SYSTEMS

Enabling Technologies for Distributed and Cloud Computing

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

A Study of Performance Monitoring Unit, perf and perf_events subsystem

D5.6 Prototype demonstration of performance monitoring tools on a system with multiple ARM boards Version 1.0

Using SNMP to Obtain Port Counter Statistics During Live Migration of a Virtual Machine. Ronny L. Bull Project Writeup For: CS644 Clarkson University

Introduction to Android

EE8205: Embedded Computer System Electrical and Computer Engineering, Ryerson University. Multitasking ARM-Applications with uvision and RTX

Java Troubleshooting and Performance

CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014

Overview of CS 282 & Android

OS Thread Monitoring for DB2 Server

Introduction to TIZEN SDK

Hybrid Platform Application in Software Debug

Multi-core Curriculum Development at Georgia Tech: Experience and Future Steps

Using Power to Improve C Programming Education

Multi-Threading Performance on Commodity Multi-Core Processors

Hardware Based Virtualization Technologies. Elsie Wahlig Platform Software Architect

Database Services for CERN

Recent Advances in Periscope for Performance Analysis and Tuning

HAVmS: Highly Available Virtual machine Computer System Fault Tolerant with Automatic Failback and close to zero downtime

M85 OpenCPU Solution Presentation

Performance Testing in Virtualized Environments. Emily Apsey Product Engineer

Monitoring, Tracing, Debugging (Under Construction)

Getting Started with Tizen SDK : How to develop a Web app. Hong Gyungpyo 洪 競 杓 Samsung Electronics Co., Ltd

Load Imbalance Analysis

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Optimizing Unity Games for Mobile Platforms. Angelo Theodorou Software Engineer Unite 2013, 28 th -30 th August

Building Embedded Systems

Compilers and Tools for Software Stack Optimisation

Holistic Performance Analysis of J2EE Applications

Enabling Technologies for Distributed Computing

Process Scheduling CS 241. February 24, Copyright University of Illinois CS 241 Staff

locuz.com HPC App Portal V2.0 DATASHEET

Tekla Structures 18 Hardware Recommendation

Transcription:

Visualizing gem5 via ARM DS-5 Streamline Dam Sunwoo (dam.sunwoo@arm.com) ARM R&D December 2012 1

The Challenge! System-level research and performance analysis becoming ever so complicated! More cores and IPs in system! More threads in workloads! Many interesting aspects of system remain in thread-level and temporal behavior! Many architectural simulators (including gem5) only provide text-based statistics! Hard to get insight into complex system-level behavior Good visualization is key! 2

ARM DS-5 Streamline: System Performance Analyzer for Linux and Android! Software based solution! Support for Linux kernel 2.6.32+ on target! Eclipse plug-in or command line! Lightweight sample profiling! Time- or event*-based sampling! Process to C/C++ source code profiler! Low probe effect; <5% typically! Multiple data sources! CPU and GPU H/W and S/W counters! Tracepoints! Code instrumentation! Originally developed for real H/W platforms User Space Applications & Middleware OpenGL ES Mali Drivers Linux Kernel ARM Processor gator Daemon gator Driver TCP/IP Target Device * Event-based sampling is available on kernels 3.0 or later 3

Timeline: The Big Picture! Find hotspots, system glitches, critical conditions at a glance Select from 40+ CPU counters, OS level and custom metrics Select one or more processes to visualize their instant load on CPU Accumulate counters, measure time and find instant hotspots Combined task switch trace and sampled profile for all threads 4

SMP Analysis! Take advantage of multicore SMP platforms! Visually trace core migration and per-core statistics! Spot non-optimal thread synchronization and improve parallelism Per core, per process activity 5

Streamline + gem5 Demo 6

Sample Screenshot running BBench 7

Visual Annotation of LCD Frame Buffers 8

Sample Screenshot running Angry Birds 9

CPU Load Comparison on MP-little-big Config! The two BBench runs with different schedulers resumed from exact same checkpoint! amp-aware scheduler correctly puts more load on big core! BBench finishes 23% sooner with ampaware scheduler in this experiment! Default Scheduler! amp-aware Scheduler CPU loads out of 50% per core Little Core Load Big Core Load 23% improvement Little Core Load Big Core Load 10

Original Streamline Capture Flow Target System User App / Benchmark Gator Daemon Streamline Linux Kernel Gator Module.apc unpack! Relies on gator kernel module and daemon S/W H/W CPU PMU.apd! Reads out counters and process information and dumps to file 11

Streamline+gem5 Flow S/W Simulated System Linux Kernel User App / Benchmark Process Info.apc Streamline unpack! Gator-free!! Special kernel module/ daemon not required anymore! Zero probe effects! Can capture data for bare-metal runs as well H/W CPU Postprocess.apd! Linux process/thread info and gem5 stats dumped at every context switch gem5 gem5 stats Stats/ Task files! Simple single-pass flow 12

How do I get started?! Streamline 5.12 Community Edition available now for free!! Details on http://www.arm.com/products/tools/streamline-for-gem5.php! Slightly modified Linux/Android kernel! Add m5struct to let gem5 know of offsets of certain kernel struct fields (pid, tgid, comm (task name), mm (mem map), etc.)! Enable enablecontextswitchstatsdump flag in LinuxArmSystem! Dumps stats at context switches (callback for switch_to() )! Dumps process info (pid, tgid, task name, cpu id) at context switches! Enable frame_capture (optional)! Dump frame-by-frame output in gzipped bmp format for visual annotation! Post-process script! Uses gem5 stats / process info / frames to generate Streamline.apc project file from scratch (without gator) Streamline available for download now! gem5 changes and scripts to be available very shortly. Stay tuned! 13

Summary! Streamline+gem5 enables great visualization of complex system behavior in an effortless manner! Process / Thread information! Crucial in understanding OS scheduling behavior in complex multi-threaded benchmarks! Temporal behavior of benchmarks! Easier to digest than Giga-bytes of text in stats file! Better visualization! Various features and views to help better understand results! Pretty screenshots for papers and presentations "! Any questions or feedback are welcome (dam.sunwoo@arm.com) 14