Flexible Agent Based Simulation for Pedestrian Modelling on GPU Hardware
Paul Richmond, The University of Sheffield, UK (www.dcs.shef.ac.uk/~paul)

Related publications:
- Richmond, P., Coakley, S., Romano, D., "Cellular Level Agent Based Modelling on the Graphics Processing Unit (with FLAME GPU)", selected for review in the special issue "Parallel and Ubiquitous Methods and Tools in Systems Biology" of the international journal Briefings in Bioinformatics
- Richmond, P., Coakley, S., Romano, D. (2009), "Cellular Level Agent Based Modelling on the Graphics Processing Unit", Proc. of HiBi09 (High Performance Computational Systems Biology), 14-16 October 2009, Trento, Italy
- Richmond, P., Coakley, S., Romano, D. (2009), "A High Performance Agent Based Modelling Framework on Graphics Card Hardware with CUDA", Proc. of the 8th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2009), May 10-15, 2009, Budapest, Hungary
- Richmond, P., Romano, D. (2008), "A High Performance Framework For Agent Based Pedestrian Dynamics On GPU Hardware", Proc. of EUROSIS ESM 2008 (European Simulation and Modelling), October 27-29, 2008, Université du Havre, Le Havre, France
Introduction and Scope

Agent Based Modelling (ABM)
- Emergence of complex natural behaviour from simple rules
- Individuals are agents with memory
- Agents update their own memory by considering neighbours

Of Pedestrian Behaviour
- Continuous space mobile agents
- Discrete time steps

On the GPU
- Why? Performance and real time visualisation
- Aim is flexibility: modellers should be able to harness the GPU's power without having to understand GPU programming
- Not continuum based (Treuille 06) and not restricted to mobile discrete agents (D'Souza 07)
Outline

FLAME and FLAME GPU
- About FLAME
- A simple example of a pedestrian model specification

Implementing FLAME on the GPU
- Brief overview of GPU technology
- Mapping agent data and functions to the GPU
- Agent communication patterns

Case Study
- Pedestrian modelling
- Discrete agents
- Performance results

Conclusions
What is FLAME?

What is FLAME (and what FLAME is not)?
- Flexible Large-scale Agent Modelling Environment
- XML model specification based on the X-Machine
- Template systems for generating simulation code: single CPU, grid, GPU
- Not a modelling application itself (dynamically generated API)

Why extend FLAME to the GPU?
- Complete modelling environment (beyond that of simple swarms)
- Formal and portable specification technique based on the X-Machine
- Many existing models to be used for benchmarking

What is FLAME GPU?
- Data parallel implementation of FLAME using CUDA
- Offers real time visualisation
- Cost effective solution for high performance ABM
FLAME and Formal Agent Specification

The X-Machine is formally defined by Eilenberg (Eilenberg 74) as an 8-tuple (Σ, Γ, Q, M, Φ, F, q0, m0), where:
- Σ and Γ are the finite input and output alphabets respectively;
- Q is the finite set of states;
- M is the (possibly) infinite set called memory;
- Φ is a finite set of partial functions ø that map an input and a memory state to an output and a new memory state, ø: Σ × M → Γ × M;
- F is the next state partial function that, given a state and a function from the type Φ, provides the next state, F: Q × Φ → Q (F is often described as a state transition diagram);
- q0 and m0 are the initial state and memory respectively.
Agents as Communicating X-Machines

- Each agent is a Communicating Stream X-Machine (Balanescu 99)
- Stream: input and output are streams of data
- Communicating: agents input and output messages
- State transitions (functions) describe agent behaviour:
  - Update agent memory
  - Output messages (and new agents) and process input messages
Specifying an Agent in XMML

<xagent>
  <name>pedestrian</name>
  <memory>
    <variable><type>float</type><name>x</name></variable>
    <variable><type>float</type><name>y</name></variable>
    <variable><type>float</type><name>velx</name></variable>
    <variable><type>float</type><name>vely</name></variable>
  </memory>
  <states>
    <state><name>start_state</name></state>
    <state><name>wait_input</name></state>
    <initialstate>start_state</initialstate>
  </states>
  <functions>
    <function>
      <name>output_location</name>
      <currentstate>start_state</currentstate>
      <nextstate>wait_input</nextstate>
      <outputs>
        <output><messagename>pedestrian_location</messagename></output>
      </outputs>
    </function>
    <function>
      <name>input_locations</name>
      <currentstate>wait_input</currentstate>
      <nextstate>start_state</nextstate>
      <inputs>
        <input><messagename>pedestrian_location</messagename></input>
      </inputs>
    </function>
  </functions>
  <type>continuous</type>
</xagent>
Specifying Agent Communication in XMML

<message>
  <name>pedestrian_location</name>
  <variables>
    <variable><type>float</type><name>x</name></variable>
    <variable><type>float</type><name>y</name></variable>
    <variable><type>float</type><name>velx</name></variable>
    <variable><type>float</type><name>vely</name></variable>
  </variables>
  <partitioningspatial>
    <radius>25</radius>
    <xmin>-100.0</xmin>
    <xmax>100.0</xmax>
    <ymin>-100.0</ymin>
    <ymax>100.0</ymax>
    <zmin>0.0</zmin>
    <zmax>25</zmax>
  </partitioningspatial>
</message>

Alternative partitioning schemes:

<partitioningnone/>

<partitioningdiscrete>
  <radius>0</radius>
</partitioningdiscrete>
Specifying the Function Order

<layers>
  <layer>
    <layerfunction><name>output_location</name></layerfunction>
  </layer>
  <layer>
    <layerfunction><name>input_locations</name></layerfunction>
  </layer>
</layers>

[Figure: state diagram showing output_location() moving the agent from start_state to wait_input while writing agent memory (x, y, vel_x, vel_y) out to the pedestrian_location message list, and input_locations() reading the message list and returning the agent to start_state]
Simulation Process and Code Generation

XMML Model File
- Syntax validated through an XML Schema
- Base XMML Schema describes the basic structure of an X-Machine agent
- GPU specific extensions (partitioning) available through an XMMLGPU Schema
- Object oriented approach to extension of the base model

XSLT Simulation Templates
- Translate an XMML model file into simulation source code
- Templates are written in XML (using the XSLT Schema) so can be syntax validated
- XSLT processors implement a W3C specification: any compliant processor can be used to generate code
- FLAME GPU is therefore not dependent on internal tools or parsers

C Function Files
- Agent transition function behaviour scripted in C

XML Input Data
- Defines the internal memory of an initial population of agents
Outline

FLAME and FLAME GPU
- About FLAME
- A simple example of a pedestrian model specification

Implementing FLAME on the GPU
- Brief overview of GPU technology
- Mapping agent data and functions to the GPU
- Agent communication patterns

Case Study
- Pedestrian modelling
- Discrete agents
- Performance results

Conclusions
Programming the GPU

Purpose of the GPU
- Data parallel device for operating on streams of data

Programming for general purpose use
- Graphics API technique: not ideal
- High level alternatives:
  - BrookGPU (Buck 04): SIMD stream programming extension for C
  - Sh (McCool 02): C++ language with a compiler for GPU backends
- Hardware specific:
  - Stream SDK: low level ATI specific native instruction set, with high level support through Brook+
  - CUDA: NVIDIA GPU programming using a compiler and C syntax with extensions
  - OpenCL: new standard, but support is currently limited
NVIDIA CUDA Programming Model

- GPU is a coprocessor to the CPU (with its own global memory)
- Many parallel threads of execution
- Each thread runs the same kernel program (SPMD)
- Threads are grouped into regular sized blocks
- Threads within a block can communicate through shared memory (with a simple synchronisation primitive)
- Threads across blocks cannot communicate

[Figure: a grid of blocks (Block 0 to Block 3), each block containing a set of threads (Thread 0 to Thread N)]
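The SPMD model above can be sketched serially on the CPU: every "thread" runs the same kernel body and derives a unique global index from its block and thread index. This is an illustrative emulation, not CUDA or FLAME GPU API code; all names (kernel_body, launch, BLOCK_DIM) are assumptions.

```c
#include <assert.h>

#define BLOCK_DIM 4  /* threads per block (illustrative) */

/* The per-thread kernel body: every thread runs this same code (SPMD)
 * and selects its data element via its block and thread index. */
static void kernel_body(int block_idx, int thread_idx,
                        float *x, const float *vel, int n) {
    int i = block_idx * BLOCK_DIM + thread_idx;  /* global thread index */
    if (i < n)                                   /* guard the partial last block */
        x[i] += vel[i];
}

/* A serial stand-in for a grid launch: iterate every block and thread. */
static void launch(float *x, const float *vel, int n) {
    int num_blocks = (n + BLOCK_DIM - 1) / BLOCK_DIM;
    for (int b = 0; b < num_blocks; b++)
        for (int t = 0; t < BLOCK_DIM; t++)
            kernel_body(b, t, x, vel, n);
}
```

On the GPU the blocks and threads execute concurrently; the guard on the global index is what makes a population size that is not a multiple of the block size safe.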
CUDA Hardware Model

- Thread blocks are mapped to multiprocessors (MPs)
- Multiprocessors are a set of SIMD thread (vector) processors
- Limited shared memory per MP (and hence per block)
- Limited cache and registers per MP

[Figure: GPU device containing N multiprocessors, each with an instruction unit, registers, shared memory, constant cache, texture cache and a set of vector processors, all attached to GPU DRAM device memory]
Mapping Agent Functions to the GPU

- Each transition function is wrapped by a GPU kernel
- Each agent is a thread performing the function
- Functions can input and output messages
- Functions can output new agents (agent birth)
- An agent can be removed (agent death) by returning a non-zero value

FLAME_GPU_FUNC int input_function(
    xmachine_memory_pedestrian* xmemory,
    xmachine_message_pedestrian_location_list* location_messages)
{
    /* Get the first message */
    xmachine_message_pedestrian_location* location_message =
        get_first_pedestrian_location_message(location_messages);

    /* Repeat until there are no more messages */
    while (location_message) {
        /* Process the message */
        if (distance_check(xmemory, location_message)) {
            updatesteervelocity(xmemory, location_message);
        }
        /* Get the next message */
        location_message = get_next_pedestrian_location_message(
            location_message, location_messages);
    }

    /* Update any other xmemory variables */
    xmemory->x += xmemory->vel_x * time_step;
    ...
    return 0;
}
Mapping X-Machine Agent Data to the GPU

- All data (agents and messages) is mapped to global memory on the GPU
- Lists are stored using a Structure of Arrays (SoA) rather than an Array of Structures (AoS)
- Data is read from global memory into registers
- Agents and messages are referenced as C structures within function code

/* Structure of Arrays (SoA): one array per agent variable */
typedef struct agent_list {
    float x[N];
    float y[N];
} xm_memory_agent_list;

/* Array of Structures (AoS): one struct per agent */
typedef struct agent {
    float x;
    float y;
} xm_memory_agent;
xm_memory_agent agent_list[N];
Use of Parallel Compaction

- Need to avoid divergence within thread blocks
- Agents are stored and processed in state lists to avoid conditional branching
- Sparse lists still occur as a result of:
  - Agent births
  - Function filters
  - Message outputs
- Sparse lists are compacted using a parallel prefix sum algorithm

[Figure: an agent function produces an agent list and a list of output flags; a prefix sum over the flags gives each surviving agent its output index, and a compact step scatters agents into a dense new agent list]
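The compaction step can be sketched serially: an exclusive prefix sum over the keep flags gives each surviving element its output slot, and the scan total gives the compacted length. FLAME GPU performs the scan in parallel on the GPU; the names here (compact, flags) are illustrative.

```c
#include <assert.h>

/* Stream compaction via an exclusive prefix sum over keep-flags:
 * the running total at element i is its output slot, and the final
 * total is the length of the compacted list. */
static int compact(const int *flags, const int *in, int *out, int n) {
    int running = 0;                 /* exclusive prefix sum accumulator */
    for (int i = 0; i < n; i++) {
        if (flags[i])
            out[running] = in[i];    /* scatter survivor to its dense slot */
        running += flags[i];
    }
    return running;                  /* number of surviving elements */
}
```

The same scatter pattern handles agent births (flags mark new agents) and function filters (flags mark agents passing the filter).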
Brute Force Message Communication

- Tile message lists into shared memory to reduce global memory access (Nyland 07)
- Each thread in the thread block loads a single message into shared memory on the load_first_message call
- Each call to load_next_message then iterates through the messages in shared memory
- When a call to load_next_message is made after every message in shared memory has been returned, a new batch of messages is tiled
- Repeat until all messages have been considered
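The tiling scheme can be sketched serially: messages are staged through a small buffer standing in for shared memory and iterated from there, so "global" memory is touched once per tile load rather than once per reader. The TILE size and function names are illustrative assumptions.

```c
#include <assert.h>

#define TILE 4  /* = thread block size: each thread loads one message per tile */

/* Iterate every message in TILE-sized batches staged through a buffer
 * that stands in for shared memory (here summing a float payload). */
static float process_all_messages(const float *messages, int n) {
    float shared[TILE];                  /* stand-in for the shared memory tile */
    float acc = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        int count = (n - base < TILE) ? (n - base) : TILE;
        for (int j = 0; j < count; j++)  /* cooperative load of one tile */
            shared[j] = messages[base + j];
        for (int j = 0; j < count; j++)  /* read messages back from the tile */
            acc += shared[j];
    }
    return acc;
}
```

On the GPU every thread in a block reuses the same tile, so each message is fetched from global memory once per block rather than once per thread.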
Effect of Optimisations for Brute Force Message Communication

- Simple benchmarking model
- Efficient data access methods double performance
- Massive performance gain from using shared memory

[Figure: relative speedup over FLAME (CPU) against population size (1024 to 131072) for SoA with shared memory, SoA without shared memory, and AoS without shared memory]
Limited Range Message Communication

For each message output:
- The environment is split into discrete partitions equal in size to the message range (each with a unique identifier)
- The message list is sorted by the partition each message falls within
- A boundary matrix indicates how many messages are within each partition by giving the start and end index of messages within the sorted list
- To read all messages within a partition, the boundary matrix gives the range of the message list which needs to be iterated
- Each agent reads 27 partitions (for a 3D environment) including its own, which guarantees all messages within the range are processed; roughly 2/3 of the messages considered are outside the range, but this is much better than O(n²)
- The texture cache is used to read messages from global memory
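A serial 1D sketch of the partitioning scheme (FLAME GPU does this in 3D and sorts in parallel): a message's partition index comes from dividing its position by the message radius, and a boundary array over the partition-sorted message list records where each partition's run starts and ends. The constants and names are illustrative assumptions.

```c
#include <assert.h>

#define RADIUS 25.0f    /* partition width = message range */
#define XMIN  -100.0f   /* environment lower bound */

/* Partition (cell) index of a message at position x. */
static int partition_index(float x) {
    return (int)((x - XMIN) / RADIUS);
}

/* Given messages sorted by partition, record each partition's start index
 * and one-past-the-end index in the sorted list (-1 marks an empty cell). */
static void build_boundaries(const int *cells, int n, int num_cells,
                             int *start, int *end) {
    for (int c = 0; c < num_cells; c++) { start[c] = -1; end[c] = -1; }
    for (int i = 0; i < n; i++) {
        if (start[cells[i]] < 0) start[cells[i]] = i;
        end[cells[i]] = i + 1;
    }
}
```

Reading a radius then means iterating the runs of a cell and its immediate neighbours (27 cells in 3D), rather than the whole message list.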
Evaluation of Limited Range Communication

N        | 32     | 64     | 96     | 128    | 160    | 192    | 224    | 256
1024     | 0.94   | 1.05   | 0.90   | 0.86   | 0.93   | 0.89   | 0.95   | 0.88
4096     | 1.24   | 1.25   | 1.30   | 1.22   | 1.39   | 1.22   | 1.24   | 1.25
16384    | 2.45   | 2.48   | 2.62   | 2.53   | 2.76   | 2.81   | 2.77   | 2.60
65536    | 9.09   | 9.34   | 9.47   | 9.23   | 9.22   | 9.31   | 9.45   | 9.42
262144   | 33.74  | 37.99  | 36.88  | 37.39  | 36.61  | 36.83  | 37.81  | 38.12
1048576  | 136.28 | 169.73 | 147.39 | 172.98 | 145.21 | 165.34 | 151.26 | 177.06

[Figure: stacked percentage of GPU time against population size (1024 to 1048576), broken down by kernel: GPUFLAME_inputdata, GPUFLAME_move, GPUFLAME_outputdata, radixsort_kernel, merge_kernel, memcpyHtoD and other GPUFLAME kernels]
Discrete Agent Communication

Discrete agents reading discrete messages
- Load messages into shared memory
- 2D message output is loaded in batches, with the grid wrapping toroidally at the borders

[Figure: an 8x8 discrete message grid split across four thread blocks, showing the sequence of shared memory message loads (Message Load 1 to 9) including the wrapped border cells]

Continuous agents reading discrete messages
- Cannot ensure all messages are loaded into shared memory
- Use the texture cache instead
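A serial sketch of a discrete agent reading its 2D message neighbourhood with the wrapped borders shown in the figure; the grid size, payload and names are illustrative assumptions.

```c
#include <assert.h>

#define W 8  /* width and height of the discrete agent grid */

/* Sum the message payloads in the (2*range+1)^2 neighbourhood of cell
 * (x, y), wrapping toroidally at the grid borders (requires range < W). */
static int read_neighbourhood(const int *grid, int x, int y, int range) {
    int acc = 0;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int nx = (x + dx + W) % W;   /* toroidal wrap in x */
            int ny = (y + dy + W) % W;   /* toroidal wrap in y */
            acc += grid[ny * W + nx];
        }
    return acc;
}
```

On the GPU the block cooperatively stages this neighbourhood (plus its wrapped halo) into shared memory before the threads read it, which is why shared memory only pays off for small ranges.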
Performance of Discrete Message Communication

- Cellular Automaton model (Game of Life)
- Over 1 million agents
- Shared memory is only suitable for very small interaction ranges

[Figure: GPU time (ms) against message range (1 to 8) for texture cache and shared memory implementations with block sizes of 64 and 256]
Outline

FLAME and FLAME GPU
- About FLAME
- A simple example of a pedestrian model specification

Implementing FLAME on the GPU
- Brief overview of GPU technology
- Mapping agent data and functions to the GPU
- Agent communication patterns

Case Study
- Pedestrian modelling
- Discrete agents
- Performance results

Conclusions
A Simple Pedestrian Model

Inter-agent interaction (using spatially partitioned messaging) is based on a hybrid of Reynolds style steering and social forces:

- Social repulsion force
  - Navigates pedestrians towards areas of low concentration
  - Limited forward vision: preference for agents in the direct line of sight
  - Scaled depending on distance to the neighbour
- Close range interaction force
  - Very short range with no limited vision
  - Acts as collision avoidance
Visualisation Technique

- Agent data is already on the GPU
- Agent positions are made available to OpenGL by mapping them to a Buffer Object
- We can also store geometry on the GPU to reduce draw calls

For complex models (lots of vertices)
- Store a single instance of the geometry in a Vertex Array
- Draw the array for each agent, setting a Vertex Attribute each time to indicate the agent index
- A GLSL vertex shader is used to displace the vertices in the same way

For simple models
- Use a single large Vertex Array holding a geometry instance for each agent
- Associate each vertex with an agent using a Vertex Attribute stored in a Vertex Attribute Array
- Only suitable for simple geometry, but very few draw calls
Animation and Level of Detail (LOD)

Animation (very simple)
- Interpolate between 2 key frames
- Rotate the model depending on velocity direction
- Performed in a vertex shader

LOD (all data is maintained on the GPU so the technique must remain parallel)
- Set the view position as a global simulation variable
- Use an agent script to calculate the viewing distance
- Save the LOD level in an agent variable
- Use a parallel reduction to count the number of agents per level
- Secondary sort of the agents by LOD level, then render in groups
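The LOD selection and counting steps can be sketched serially; the distance thresholds and names are illustrative assumptions. In FLAME GPU the selection runs per agent on the GPU and the per-level counts come from a parallel reduction.

```c
#include <assert.h>

/* Per-agent LOD selection from viewing distance (illustrative thresholds). */
static int lod_level(float view_distance) {
    if (view_distance < 50.0f)  return 0;  /* full-detail geometry */
    if (view_distance < 150.0f) return 1;  /* reduced geometry */
    return 2;                              /* billboard / lowest detail */
}

/* Serial analogue of the parallel per-level agent count. */
static void count_levels(const float *distances, int n, int counts[3]) {
    counts[0] = counts[1] = counts[2] = 0;
    for (int i = 0; i < n; i++)
        counts[lod_level(distances[i])]++;
}
```

The per-level counts give the offsets needed to sort agents by LOD level and render each level's group with its own geometry.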
Demo

Agents coloured by LOD level
Performance Results

Observations
- Performance is dependent on communication radius: a larger radius means fewer partitions, so more agents are considered per update
- The LOD technique has a cost: don't use it for small populations
- Very large population sizes are possible in real time

[Figure: frames per second (FPS) against pedestrian population (64 to 1048576) for billboards, detail levels 0 to 2, and dynamic LOD]
Environment Collision Avoidance

- A discrete grid of agents encodes the environment
- Static discrete agents
  - Repulsive forces direct pedestrians away from walls
  - Automatically generated in advance
- Continuous pedestrian agents read the discrete messages
  - Apply a collision force
  - Displace pedestrian agents by the height value
Long Range Navigation

- Many agents follow similar paths, so a global solution is used
- A fluid flow route is stored for each path through the environment
  - Calculated offline in advance by backtracking from the exit point
  - Smooth movement around obstacles
- Discrete agents are also responsible for pedestrian birth allocation
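The flow-grid lookup can be sketched as follows: each environment cell stores a precomputed goal direction, and a pedestrian simply samples the cell it stands in. The grid dimensions, cell size and names are illustrative assumptions.

```c
#include <assert.h>

#define GRID_W 4     /* flow grid width in cells (illustrative) */
#define CELL  10.0f  /* cell size in world units (illustrative) */

typedef struct { float x, y; } vec2;

/* Sample the precomputed flow direction at world position (px, py):
 * quantise the position to a cell and return that cell's stored vector. */
static vec2 sample_flow(const vec2 *flow, float px, float py) {
    int cx = (int)(px / CELL);
    int cy = (int)(py / CELL);
    return flow[cy * GRID_W + cx];
}
```

The expensive path computation (backtracking from the exit) happens once offline; at runtime every agent's navigation is a constant-time grid read, which is what makes a global solution cheap for many agents on similar paths.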
Conclusions and Future Work

Summary
- Flexible agent architecture for the GPU, suitable for force based models
- Easily extensible
- Massive performance/cost benefits

Scope for future work
- Multi GPU
  - Would enable extremely large populations to be simulated
  - For spatial partitioning, only partition boundaries would need to be communicated between GPU devices
- Improved pedestrian models
  - More accurate collision detection
  - Long range individual path planning without flow grids
  - Physically accurate animation and movement
  - Much larger models (need appropriate scenarios)
References

- A. Treuille, S. Cooper, and Z. Popović, "Continuum crowds", in SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, New York, NY, USA: ACM, 2006, pp. 1160-1168.
- R. M. D'Souza, M. Lysenko, and K. Rahmani, "SugarScape on steroids: simulating over a million agents at interactive rates", in Proceedings of Agent2007, 2007.
- S. Eilenberg, Automata, Languages, and Machines, Academic Press, Inc., Orlando, FL, USA, 1974.
- T. Balanescu, A. J. Cowling, H. Georgescu, M. Gheorghe, M. Holcombe, and C. Vertan, "Communicating stream X-machines systems are no more than X-machines", Journal of Universal Computer Science, 5(9):494-507, 1999. http://www.jucs.org/jucs_5_9/communicating_stream_x_machines
- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: stream computing on graphics hardware", ACM Trans. Graph., 23(3):777-786, 2004.
- M. D. McCool, Z. Qin, and T. S. Popa, "Shader metaprogramming", in HWWS '02: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 57-68, Aire-la-Ville, Switzerland, 2002. Eurographics Association.
- L. Nyland, M. Harris, and J. Prins, "Fast N-body simulation with CUDA", in H. Nguyen, editor, GPU Gems 3, chapter 31, Addison Wesley Professional, August 2007.