Going Linux on Massive Multicore




Embedded Linux Conference Europe 2013
Going Linux on Massive Multicore
Marta Rybczyńska
24th October, 2013

Agenda
- Architecture
- Linux Port
- Core
- Peripherals
- Debugging
- Summary and Future Plans

First MPPA-256 Chips with TSMC 28nm CMOS
- 256 Processing Engine cores + 32 Resource Management cores
- 256 (+32) user-programmable, generic cores
- Architecture and software scalability
- High processing performance
- High energy efficiency
- Execution predictability
- PCIe Gen3, Ethernet 10G, NoCX

The MPPA-256 Processor (1)
- 16 compute clusters
- 4 Input/Output clusters with rich peripheral access
  - DDR, PCI Express, Ethernet, Flash, GPIO, SPI, I2C etc.
- Connected by a Network-on-Chip (NoC)

The MPPA-256 Processor (2)
- Compute cluster includes:
  - 16+1 cores
  - Shared memory
  - Network-on-Chip interfaces
  - Debug unit (DSU)
- IO cluster includes:
  - 4 cores
  - Shared memory
  - Peripherals

The MPPA-256 Processor Core
- ISA: same on IO and compute clusters
- 5-issue Very Long Instruction Word (VLIW)
- 32/64-bit IEEE 754 floating point unit
- DSP instructions
- Advanced bitwise instructions
- Hardware loops
- MMU
- Idle modes

Kalray Software Development Kit for MPPA-256 (1)
- Standard Embedded C/C++ Programming
  - GCC, GNU binutils, newlib, uClibc, ...
- Simulators, Profilers, Debuggers & System Trace
  - GDB
  - Cycle-based simulator
  - Hardware system trace
- Operating Systems & Device Drivers
  - NodeOS on the compute clusters: one thread per core
  - RTEMS port on the IO clusters
  - PCI Express driver for the Linux Host

Kalray Software Development Kit for MPPA-256 (2)
- Programming Models
  - Dataflow Programming: FPGA style
  - POSIX-level Programming: DSP style
  - Streaming Programming: GPU style

The MPPA POSIX Programming Model
- POSIX-like process management
  - Spawn 16 processes from the IO cluster
  - Process execution on the 16 clusters starts with main(argc, argv) and environment
  - On each cluster, support for up to 16 concurrent threads
- Inter-process communication (IPC)
  - POSIX file descriptor operations on 'NoC Connectors'
  - Extension to the PCIe interface with the 'PCIe Connectors'
  - Rich communication and synchronization
- Multi-threading inside clusters
  - Standard GCC/G++ OpenMP support
  - POSIX threads interface
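
The multi-threading part of this model can be illustrated with plain POSIX threads. The following is a minimal sketch, not Kalray SDK code: it assumes only the standard pthread API, and NUM_THREADS, worker and run_cluster_job are hypothetical names chosen for illustration. It starts 16 threads (the per-cluster limit given above), each summing its own slice of an array.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 16   /* assumption: one thread per core of a cluster */

struct slice {
    const int *data;   /* first element of this thread's slice */
    size_t     len;    /* number of elements in the slice */
    long       sum;    /* result written by the worker */
};

/* Each worker sums its own slice; nothing is shared, so no lock is needed. */
static void *worker(void *arg)
{
    struct slice *s = arg;
    s->sum = 0;
    for (size_t i = 0; i < s->len; i++)
        s->sum += s->data[i];
    return NULL;
}

/* Split data[0..n) evenly across NUM_THREADS threads and return the total. */
long run_cluster_job(const int *data, size_t n)
{
    pthread_t    tid[NUM_THREADS];
    struct slice part[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;
    long total = 0;

    assert(n % NUM_THREADS == 0);   /* keeps the sketch simple */

    for (int t = 0; t < NUM_THREADS; t++) {
        part[t].data = data + (size_t)t * chunk;
        part[t].len  = chunk;
        part[t].sum  = 0;
        int rc = pthread_create(&tid[t], NULL, worker, &part[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);   /* join makes part[t].sum visible here */
        total += part[t].sum;
    }
    return total;
}
```

A caller would pass an array whose length is a multiple of 16. On older toolchains the program may need to be linked with -pthread.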

The MPPA & Linux
- High-performance SMP on the IO clusters
- Device Drivers
  - SPI, I2C, GPIO (sensors, small peripherals)
- Full network stack
- Existing user libraries
- Rich configuration options

History
- 2.6.33-rc4
- 2.6.33
- 2.6.35
- 3.2: SMP, first specific drivers
- 3.10: tracing, MMU (in progress)

Features
- The kernel
  - Single core or SMP
  - Device-tree, generic headers
  - Supports userspace with FDPIC
  - Initial tracing support
- Drivers
  - Generic drivers tested
  - Specific drivers: PCI Express, console, ...
- Userspace: buildroot-based with custom toolchain & uClibc

Optimization Case: Atomic Ops
- Single core case: disable/enable interrupts
  - Cheap on MPPA
- SMP case: compare-and-swap
  - Huge performance/code size impact
  - Latest version uses implementation details of the caches and the write buffer
- Lessons from experience
  - Improved atomic ops for the next version of the core
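
The SMP strategy named above, building atomic operations on compare-and-swap, can be sketched in portable C11. This is an illustrative sketch using <stdatomic.h>, not the K1 kernel implementation (which, per the slide, relies on cache and write-buffer details not described here); cas_add_return is a hypothetical name. The single-core variant (disable interrupts, update, re-enable) needs privileged code and is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Atomically add `delta` to *v and return the new value, built only from
 * a compare-and-swap retry loop. */
int cas_add_return(atomic_int *v, int delta)
{
    assert(v != NULL);

    int old = atomic_load_explicit(v, memory_order_relaxed);

    /* If another CPU changed *v in the meantime, the exchange fails, `old`
     * is refreshed with the current value, and the loop retries. */
    while (!atomic_compare_exchange_weak_explicit(
               v, &old, old + delta,
               memory_order_seq_cst, memory_order_relaxed))
        ;

    return old + delta;
}
```

Every other read-modify-write operation (sub, and, or, xchg) can follow the same retry pattern, which hints at why the slide reports a large performance and code size impact for the SMP case.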

Lessons Learned 1: Read the Specs Carefully (1)
- SMP version started deadlocking
  - Spinlock ticket value showed corruption
- Debugging
  - Spinlock ordering problem?
  - Careful platform code analysis
  - Enabled spinlock debug
  - Detailed simulator trace analysis
- Reason
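
For readers unfamiliar with the "ticket value" mentioned above, here is a minimal sketch of a generic ticket spinlock in C11 atomics. It is not the kernel's or Kalray's code, and it does not reproduce the bug (the transcript does not state its cause); struct ticket_lock and the three functions are hypothetical names. It only shows why a corrupted ticket can deadlock: a waiter spins until `owner` equals its own ticket, so if either counter is damaged, that moment never comes.

```c
#include <assert.h>
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;    /* next ticket to hand out */
    atomic_uint owner;   /* ticket currently being served */
};

void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

/* Take a ticket, then spin until it is served. Returns the ticket taken. */
unsigned ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;   /* busy-wait for our turn */
    return me;
}

/* Pass the lock to the holder of the next ticket. */
void ticket_lock_release(struct ticket_lock *l)
{
    unsigned cur = atomic_load_explicit(&l->owner, memory_order_relaxed);

    /* While the lock is held, at least one ticket is outstanding. */
    assert(cur != atomic_load_explicit(&l->next, memory_order_relaxed));

    atomic_store_explicit(&l->owner, cur + 1, memory_order_release);
}
```

The acquire/release ordering on `owner` is what makes writes inside the critical section visible to the next holder. A mistake at that level would look like an ordering problem, which is one of the hypotheses listed in the slide.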

Lessons Learned 2: Use GCC
- K1 core is a VLIW: multiple instructions (one bundle) per cycle
  - High performance gain
- GCC handles it well
- Manual bundling OK for short code, hard for longer code
- Result
  - Preferring built-ins over inline asm
  - Less assembly in the code
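
The preference for built-ins can be illustrated with two generic GCC built-ins. The transcript does not say which built-ins the K1 port uses, so the choice of __builtin_popcount and __builtin_clz is an assumption made purely for illustration, as are the wrapper names popcount32 and highest_bit32. The point is that the compiler selects the instructions and, on a VLIW target, remains free to bundle them with neighbouring code, which hand-written inline asm would prevent.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits. The compiler chooses the best instruction sequence
 * for the target instead of a hard-coded asm block.
 * Assumes unsigned int is 32 bits wide. */
unsigned popcount32(uint32_t x)
{
    return (unsigned)__builtin_popcount(x);
}

/* Index (0..31) of the highest set bit. x must be non-zero because
 * __builtin_clz(0) is undefined. */
unsigned highest_bit32(uint32_t x)
{
    assert(x != 0);
    return 31u - (unsigned)__builtin_clz(x);
}
```

For example, highest_bit32(0x00010000u) yields 16 and popcount32(0xF0F0u) yields 8.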

Interrupt Controllers
- Interrupt lines from internal peripherals
- Two-level interrupt control
  - One private to each core (timers, NoC)
  - Shared between 4 cores (PCI Express, Ethernet, other peripherals)
- Basic operations on the core interrupt controller
- Multiple devices on the same line
- Configuration issue: configuring lanes for devices
[Figure: interrupt lines from peripherals enter the shared interrupt controller; 4 interrupt lines from the shared controller go to Core 0, Core 1, Core 2 and Core 3]

Memory Areas
- Two areas
  - Big DDR
  - Small internal shared memory
- Different latency, optimization opportunity
- Shared memory is visible through PCI Express
- Solution:
  - Allocation in the DDR by default
  - Separate allocator for the shared memory
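
The "default allocator plus a separate allocator for the small memory" idea can be sketched in user-space C. Everything here is a stand-in: smem_pool is an ordinary static array playing the role of the internal shared memory, malloc plays the role of DDR allocation, and SMEM_SIZE, smem_alloc, ddr_alloc and is_shared_mem are hypothetical names. The kernel port's real allocator is not described in the transcript.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SMEM_SIZE 4096   /* hypothetical size of the small shared memory */

static _Alignas(16) unsigned char smem_pool[SMEM_SIZE];
static size_t smem_used;

/* Explicit path: bump-allocate from the small pool.
 * Returns NULL when the pool is exhausted; there is no free() here. */
void *smem_alloc(size_t size)
{
    assert(size > 0);   /* zero-sized requests are a caller bug here */

    size_t aligned = (size + 15u) & ~(size_t)15u;   /* round up to 16 */
    if (aligned < size || aligned > SMEM_SIZE - smem_used)
        return NULL;

    void *p = &smem_pool[smem_used];
    smem_used += aligned;
    return p;
}

/* Default path: ordinary allocation, standing in for DDR. */
void *ddr_alloc(size_t size)
{
    return malloc(size);
}

/* Tell the two areas apart, e.g. to decide how a buffer may be shared. */
int is_shared_mem(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t base = (uintptr_t)smem_pool;
    return a >= base && a < base + SMEM_SIZE;
}
```

Keeping the small pool behind its own function makes every use of the scarce, low-latency memory an explicit decision, while everything else falls through to the large area by default.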

The PCI Express Driver (1)
- PCI Express is the main interface to the MPPA
- Host (Linux) driver is ~10 KLOC
  - Boot, synchronization, DMA, message passing, ...
  - Two interfaces per MPPA (visible as two PCI Express devices)
- Host-side framework in Linux is mature
- Less standard on the device side
  - Code reuse (protocols)
  - First synchronization in the bootloader

The PCI Express Driver (2)
- PCI Express allows the Host to memory-map memory zones of the device
- Hardware resources
  - Doorbell registers (shared, in BAR0)
  - Shared memory accessible by the Host directly (BAR2)
  - Both DDR and shared memory accessible by DMA
- Cache effects
- Shared zones negotiated (especially shared memory)

Boot
- Using custom bootloaders at the moment
- Boot of IO cluster by PCI Express
  - Initiated by the Host
  - First code in shared memory
  - Then complete image in shared memory + DDR
- Boot of clusters from the IO cluster
  - Cluster executables in IO cluster memory
  - Also transferred by PCI Express

Network-on-Chip
- The way to communicate between clusters
- High performance interface
- Shared resources
- Used in different places
  - Boot, drivers in the kernel space
  - User code (IPC)
- No Linux kernel API to reuse

The Ethernet Device Drivers
- Up to 8x10 Gbit/s
- MAC controller, PHY
- Uses NoC
- Distributed network stack
  - Packet dispatch to different clusters
  - Potentially applications using different protocols in different clusters
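
The transcript does not say how packets are assigned to clusters. One common approach, assumed here purely for illustration, is to hash the flow identifier so that all packets of a flow reach the same cluster. The sketch below uses the FNV-1a hash; struct flow_key, fnv1a_step, pick_cluster and NUM_CLUSTERS are hypothetical names, with NUM_CLUSTERS set to the 16 compute clusters of the MPPA-256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CLUSTERS 16   /* compute clusters on the MPPA-256 */

/* Hypothetical flow identifier (classic 5-tuple). */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Mix the four bytes of `word` into hash `h` (FNV-1a, 32-bit). */
static uint32_t fnv1a_step(uint32_t h, uint32_t word)
{
    for (int i = 0; i < 4; i++) {
        h ^= (word >> (8 * i)) & 0xffu;
        h *= 16777619u;   /* FNV prime */
    }
    return h;
}

/* Map a flow to a cluster index in [0, NUM_CLUSTERS).
 * Fields are hashed one by one so struct padding never enters the hash. */
unsigned pick_cluster(const struct flow_key *k)
{
    uint32_t h = 2166136261u;   /* FNV offset basis */

    assert(k != NULL);
    h = fnv1a_step(h, k->src_ip);
    h = fnv1a_step(h, k->dst_ip);
    h = fnv1a_step(h, ((uint32_t)k->src_port << 16) | k->dst_port);
    h = fnv1a_step(h, k->proto);
    return h % NUM_CLUSTERS;
}
```

Because the mapping depends only on the key, two packets of the same flow always land on the same cluster, which lets each cluster keep per-flow protocol state locally.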

Traces

printf(), printk() and GDB

The Simulator

Lessons Learned 3: What was Difficult
- Documentation missing
  - With great exceptions!
  - Need to read other platform code
  - No examples (yet) for our architecture
- Reaching limits of the existing software
  - Not always a bug in the Linux port
- Trying to be mainline-compatible
- Initial port hard to split between developers
  - Time-consuming, needed alternative for early testing (RTEMS port)

Lessons Learned 4: What was Easy
- Debugging
  - Mixed mode depending on the testcase: simulator, gdb, FPGA
  - Simulator trace postprocessing
- Generic headers cover a good part of the code
- Recently merged architectures give good examples

Future Plans
- Complete driver support (Ethernet, NoC, others...)
- Optimized MMU
- Replace generic implementations with optimized ones
- Public release
- Mainlining

Marta Rybczynska
marta.rybczynska@kalray.eu
http://www.kalray.eu