This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development Application Performance Analysis of the Cortex-A9 MPCore Bryan Lawrence 1
Agenda Motivation Experimentation platforms Performance exploration of different application classes Performance evaluation of multiple concurrent applications Summary and conclusion 2
Phone ++ Upcoming Use Cases Mobile Internet Browsing Video conferencing Gaming on the Go Multi-player over 3G / 4G Network 3D Navigation 3
Mobile Phone Applications Compute Intensive 4
Tablet Applications Compute Intensive 5
Achieving Scalable Performance Clock frequency of the processor is not the only metric of performance. Scalable, energy-efficient performance is required across devices, from mobile phones and tablets to large enterprise computing. Can multicore processors provide a potential solution? 6
Hardware Platforms Versatile Express: ARM-NEC Cortex-A9 processor test-chip, ~400MHz, Cortex-A9 x 4, 4x NEON/FPU, 32KB individual I&D L1 caches, 512KB L2 cache, 1GB RAM (32b DDR2). Early Partner Silicon: Cortex-A9 x 2 @ 1GHz, 1GB RAM. 7
Video Decode / Encode Hardware encoders/decoders are common in consumer devices. Video/audio codec standards evolve rapidly. Many codecs are used too infrequently to justify dedicated hardware. Consumer applications involve other video processing, different from encode / decode (e.g. video editing). Simultaneous encode / decode is required for video conferencing. 8
H.264 Decode / Encode FFmpeg used for decode. x264 library used with FFmpeg for video encode. CIF & VGA resolutions: commonly used in video conf. Movie trailers used: order of computation higher than for video conf. streams, compression factor of 100-200. 9
H.264 Decode / Encode Results for single core operation Normalized logarithmic scales used Encode is more compute intensive than decode (at least ~2-3 times) Writing out decoded streams to secondary storage media limited by media bandwidth 10
H.264 Decode / Encode Concurrent video decode + encode: an important use case for video conferencing. Excellent scalability is observed up to all 4 cores. Encoding is at least 2-3 times more compute intensive than decode. Ideally, more resources should be dedicated to encode. 11
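One way to read "more resources should be dedicated to encode" is a core split proportional to relative cost. The sketch below is illustrative only: `split_cores` and its weights are hypothetical names, and the 3:1 default is simply the upper end of the 2-3x figure quoted on the slide, not a measured value.

```python
def split_cores(total_cores, encode_weight=3.0, decode_weight=1.0):
    """Split cores between encode and decode in proportion to relative cost.

    The weights are assumptions: the slide only says encode is at least
    2-3 times more compute intensive than decode.
    """
    if total_cores < 2:
        raise ValueError("need at least two cores to give each task its own core")
    share = encode_weight / (encode_weight + decode_weight)
    encode_cores = round(total_cores * share)
    # Keep at least one core for each task.
    encode_cores = max(1, min(total_cores - 1, encode_cores))
    return encode_cores, total_cores - encode_cores


if __name__ == "__main__":
    # Quad-core platform such as the Versatile Express test-chip.
    print(split_cores(4))
```

On a 4-core platform this assigns three cores to encode and one to decode; on a dual-core platform each task gets one core.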
On2/Google VP8 Libvpx library used for decoding VP8 (from WebM project) Libvpx uses multi-threading and actively takes advantage of parallelizability available in the VP8 codec. Comparative results obtained on Versatile Express and 1GHz dual core platforms 12
On2/Google VP8 Shows good scalability with the number of cores. Scalability is relatively independent of the number of partitions in the video frame. Saturation is observed for no. of threads > no. of cores. Designers can query the platform to fetch the no. of cores and determine the available parallelism. [Figure panels: 1GHz dual-core; Versatile Express] 13
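Querying the platform for the core count, and capping the thread count there because the slide reports saturation beyond it, could look roughly like this. It is a minimal sketch using Python's standard library rather than the libvpx API; `pick_thread_count` is a hypothetical helper.

```python
import os


def pick_thread_count(requested=None):
    """Return a worker-thread count capped at the number of cores.

    os.cpu_count() reports the logical CPUs in the system and may return
    None, in which case we fall back to a single thread.
    """
    cores = os.cpu_count() or 1
    if requested is None:
        return cores
    # More threads than cores showed no further gain in the slide's results.
    return max(1, min(requested, cores))


if __name__ == "__main__":
    print("threads to use:", pick_thread_count())
```

On Linux, `len(os.sched_getaffinity(0))` is an alternative that counts only the CPUs the current process is allowed to run on.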
Compilation - FFmpeg Code compilation has inherent parallelism at the module level. Most build systems allow this parallelism to be exploited, e.g. make -j 4. Compilation of FFmpeg and the Linux kernel is shown here. [Figure panels: 1GHz dual-core; Versatile Express] 14
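A build script can derive the `-j` value from the platform instead of hard-coding 4. The sketch below only constructs the command line and does not run it; `make_command` is a hypothetical helper, and matching jobs to cores is an assumption rather than a setting taken from the slides.

```python
import os


def make_command(jobs=None, target=None):
    """Build a parallel `make` invocation as an argument list."""
    # Default to one job per core; fall back to 1 if the count is unknown.
    jobs = jobs or os.cpu_count() or 1
    cmd = ["make", "-j", str(jobs)]
    if target:
        cmd.append(target)
    return cmd


if __name__ == "__main__":
    print(" ".join(make_command(4)))  # make -j 4
```

The resulting list can be passed to `subprocess.run` from inside the source tree being built.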
Compilation - Linux Kernel Almost linear speed-up is observed with no. of cores for both cases. Effectively doubles (quadruples) the utilized memory bandwidth for 2 cores (4 cores). [Figure panels: 1GHz dual-core; Versatile Express] 15
Browsers Browser benchmark using a collection of web pages similar to the mix found in common browsing. Speed-up of 1.54 times observed between single and dual core execution. The WebCore fraction of the pie grows for multicore execution. [Figures: Normalized Performance (1.54x); Execution time decomposition] 16
Multiple Concurrent Applications Multitasking is becoming mainstream in mobile devices today Common combinations include Browser + Audio playback E.g. Internet Radio Browser + background download Independent applications can benefit immensely from parallelization 17
Browser + Pandora Internet Radio Speed-up factor of 1.9. Super-linear speed-up can sometimes be observed due to reduced cache pollution from conflicting applications. The speed-up can be traded for energy by slowing the cores down (depends on the fabrication process technology used). [Figures: Normalized Performance (1.9x); Execution time decomposition] 18
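The energy trade-off can be illustrated with a back-of-the-envelope model. This sketch is not from the presentation: it assumes performance scales linearly with clock frequency, considers dynamic power only (proportional to cores x V^2 x f), and uses a hypothetical voltage scale of 0.8 for the slower clock. Actual savings depend on the process technology, as the slide notes.

```python
def relative_dynamic_power(cores, freq_scale, volt_scale):
    """Dynamic power relative to one core at nominal frequency and voltage.

    Uses the common approximation P ~ N * V^2 * f and ignores leakage.
    """
    return cores * volt_scale ** 2 * freq_scale


speedup = 1.9               # dual-core speed-up quoted on the slide
freq_scale = 1.0 / speedup  # slow both cores until throughput matches one core

# Same voltage: two slower cores draw about as much as one fast core.
same_voltage = relative_dynamic_power(2, freq_scale, 1.0)

# Hypothetical: the lower clock allows supply voltage to drop to 0.8x nominal.
lower_voltage = relative_dynamic_power(2, freq_scale, 0.8)

if __name__ == "__main__":
    print(f"same voltage:  {same_voltage:.2f}x single-core power")
    print(f"lower voltage: {lower_voltage:.2f}x single-core power")
```

Under these assumptions the saving comes from the voltage reduction that the lower clock permits, not from the frequency reduction itself.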
Browser + Internet File Download Speed-up factor of 1.64x. Common use case involves downloading an app from an application store or marketplace while browsing the internet. Email synchronization in the background also forms a similar use case. [Figures: Normalized Performance (1.64x); Execution time decomposition] 19
Cortex-A9 MP Benefits Performance
          Browser (single app)
1 Core    1
2 Core    1.54
20
Cortex-A9 MP Benefits Richer Experience
          Browser + Download   Browser (single app)   Browser + Pandora
1 Core    0.73                 1                      0.78
2 Core    1.20                 1.54                   1.50
21
Cortex-A9 MP Benefits Richer Experience
          Browser + Download   Browser (single app)   Browser + Pandora
1 Core    0.73                 1                      0.78
2 Core    1.20                 1.54                   1.50
Speed-up  1.64x                                       1.9x
22
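The quoted speed-up factors follow from dividing each 2-core figure by the matching 1-core figure. The short check below assumes the values pair with the columns in the order shown on the slide; the dictionary layout is illustrative.

```python
# Normalized performance from the slide: (1 core, 2 cores).
normalized = {
    "Browser + Download": (0.73, 1.20),
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
}

# Speed-up = dual-core performance / single-core performance.
speedups = {name: two / one for name, (one, two) in normalized.items()}

if __name__ == "__main__":
    for name, value in speedups.items():
        print(f"{name}: {value:.2f}x")
```

The ratios come out at about 1.64, 1.54 and 1.92, matching the 1.64x, 1.54x and 1.9x figures quoted in the deck.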
Summary and Conclusion This presentation demonstrates the scalability of the ARM Cortex-A9 MPCore processor across various classes of applications, on today's available software. Better power/performance can be achieved using an efficient low-power ARM multicore processor, as compared to a single processor at a much higher frequency. Next-generation software will make more intensive use of threads, and scalability will improve further. 23
Thank You Please visit www.arm.com for ARM related technical details For any queries contact < Salesinfo-IN@arm.com > 24