PARALLEL NEURAL NETWORK TRAINING ON MULTI-SPERT

PHILIPP FÄRBER AND KRSTE ASANOVIĆ
International Computer Science Institute, Berkeley, CA 94704

Multi-Spert is a scalable parallel system built from multiple Spert-II boards, which we have constructed to speed error backpropagation neural network training for speech recognition research. We present the Multi-Spert hardware and software architecture, and describe our implementation of two alternative parallelization strategies for the backprop algorithm. We have developed detailed analytic models of the two strategies which allow us to predict performance over a range of network and machine parameters. The models' predictions are validated by measurements for a prototype five-node Multi-Spert system. This prototype achieves a neural network training performance of over 530 million connection updates per second (MCUPS) while training a realistic speech application neural network. The model predicts that performance will scale to over 800 MCUPS for eight nodes.

1 Background and Motivation

The Spert-II board is a general-purpose workstation accelerator based on the T0 vector microprocessor [1]. Originally, it was designed for the training of large artificial neural networks used in continuous speech recognition. For phoneme classification, we use 3-layer perceptrons with 56-61 output units; the networks in our experiments have input layers of around 150 units and hidden layers ranging from hundreds to thousands of units. The speech databases used in our experiments typically contain millions of patterns, and training requires 6-8 passes through this data to converge. For this task, Spert-II surpasses the performance of today's workstations by factors of 3-10, reducing training time from several weeks to just a few days. To further reduce training time and to make use of even larger volumes of input data, we have developed the Multi-Spert system, which combines several Spert-II boards to form a small-scale multicomputer. Multi-Spert enables us to run large experiments in less than a day.
We train our networks using a variant of the error backpropagation algorithm, and so in this paper we present the parallelization and resulting performance of backprop on Multi-Spert. We concentrate on the two components which dominate the runtime of our training experiments: the training routine, which consists of forward pass, error backpropagation, and weight update; and the forward pass alone, which is used for cross-validation of the training process and in recognition.

2 Multi-Spert Architecture

2.1 Spert-II Board

A Spert-II board comprises a 40 MHz T0 vector microprocessor with 8 MB of memory mounted on a double-slot SBus card [1]. T0 is capable of sustaining 320 million multiply-accumulate operations per second, with 16-bit fixed-point multiplies and 32-bit fixed-point accumulates. The T0 processor has a byte-serial port that allows the host to access T0 memory at a peak bandwidth of 30 MB/s. The Spert-II board contains a field-programmable gate array (FPGA) that transparently maps SBus read and write transactions to T0 serial port transactions. The current FPGA design supports data transfer rates of up to 10 MB/s for writes from the host and 4 MB/s for reads. For a realistic training run using our optimized training code, a single board achieves the performance listed in Table 1. The second column contains results for online training, where weights are updated after every pattern presentation. The third and fourth columns contain results for bunch mode training, where weights are updated only once after several patterns, or a bunch, have been presented. Bunch mode allows more frequent weight updates than fully off-line, or batch mode, training, where weights are updated only once after all patterns have been presented. Compared to the matrix-vector operations used in online training, bunch mode allows the use of more efficient matrix-matrix operations [2]. Performance is approximately doubled and reaches over 100 MCUPS. Note that a small bunch size of 12 is sufficient to reap most of the performance benefits. Experimental results show that moderate bunch sizes of about 100 patterns do not significantly change training convergence or accuracy. For larger bunch sizes of several thousand patterns, we have found that the total data set size must be roughly 10 times larger than the bunch size to achieve satisfactory convergence and final classification performance.
Practical bunch size can also be limited by other factors, such as buffer memory size and numeric overflow of accumulated errors. The online training performances presented here are somewhat lower than those we reported earlier for a single board [1]. The routines measured in this paper store weights to 32-bit precision but use only the most significant 16 bits for the forward pass and error backpropagation, whereas previously [1] we used 16-bit weights throughout. Although 16-bit weights give frame-level classification equivalent to single-precision floating point, we found that word-level utterance recognition performance suffered in some cases. Training with 32-bit weight updates yields word-level recognition equivalent to single-precision floating point.

MLP size (input-hidden-output) | Online | Bunch=12 | Bunch=96
153--56                        | 37     | 74       | 72
153--56                        | 41     | 63       | 69
153-1-56                       | 45     | 95       | 17
153--56                        | 47     | 12       | -

Table 1: Single Spert-II board backprop training performance in MCUPS.

Figure 1: Multi-Spert system: a Sun host connected over SBus to multiple Spert-II boards.

2.2 Multi-Spert Hardware

To construct a Multi-Spert system, we use commercial SBus expander boxes to enable a moderate number of Spert-II boards to be connected to the same Sun workstation host, as shown in Figure 1. The result is a shared-bus master/slave architecture. The Sun host acts as master, controlling all data transfer and synchronization. Multi-Spert required little additional hardware and system software development over that for Spert-II, but the system has two main limitations. First, SBus only supports transactions with one slave device at a time, so broadcasts from the host have to be repeated for each Spert-II board. Second, a board cannot overlap host communication and T0 computation, so the host must ensure the board's T0 processor is idle before initiating a data transfer.

2.3 Multi-Spert Programming Environment

Our software environment supports any number of Spert-II boards, either plugged into expander boxes or plugged directly into host SBus slots. A Multi-Spert program consists of a host master program and one or more slave programs, all typically written in C++. The Multi-Spert system software provides routines that allow a master program to allocate a Spert-II board and then to download a slave application executable to the board. Once the slave program

is running, the master program can use the C++ Spert Procedure Call library to transfer data and parameters and to invoke remote routines on the slave. Slave routines can be called asynchronously, allowing the host to invoke concurrent routines on other slave boards. Synchronization commands allow the host to wait for slaves to complete execution.

Figure 2: Pattern Parallel (a) and Network Parallel (b) distribution strategies.

3 Parallelizing Error Backpropagation

We have implemented two different strategies for parallelizing backprop training: pattern parallel (PP) and network parallel (NP).

3.1 Pattern Parallel Training

The PP strategy replicates the entire network on each node and presents different patterns in parallel, as illustrated in Figure 2 (a). The PP version does not impose any constraints on the network topology, as long as the weight updates of different patterns may be accumulated independently and combined. After each bunch of patterns, the nodes synchronize by exchanging weight update information. Smaller bunch sizes require more frequent weight communication, but larger bunch sizes can affect training convergence. Because the network is replicated on each board, the size of trainable networks is limited to that which will fit into a single board's memory. In our PP implementation, we overlap the communication of patterns to one node with computation on the other nodes. To reduce the buffer storage required for intermediate values, we only process 24 patterns at a time on each node. Updates are accumulated until the end of the bunch. We overlap reading back weight updates from one node with computation finishing on the other nodes. All slaves then wait for the host to broadcast new weight values.

3.2 Network Parallel Training

The NP strategy splits the network across all the boards, as shown in Figure 2 (b). Compared with the PP strategy, the NP strategy has the disadvantage that every pattern must be sent to every board. In our application, we use 3-layer networks with many more hidden units than output units. We take advantage of this in our NP implementation by only communicating the final output activations. Hidden unit computations and all weight updates are entirely local to each node. Because the connection weights are now distributed across the slaves, the maximum network size scales with the number of boards, enabling us to train very large networks. As in the PP implementation, we overlap communication of patterns to each node with computation on the other nodes, but for NP we use a bunch size of 12 patterns. In addition, we overlap the calculation of output errors on the host. When a slave finishes computing the output activations for a bunch, we immediately send the error values for the previous bunch along with the input values for the next bunch. This means each bunch's errors are calculated using the weights from the last-but-one bunch's update. We found this delayed weight update did not affect convergence.

4 Performance Evaluation

We have developed analytic models that accurately predict the performance of our two strategies for a given number of slave boards. Full details of the models are given in a separate technical report [3]. We validated the performance models with timing measurements made on a Multi-Spert prototype consisting of a Sun Ultra-1/170 workstation host and five Spert-II boards. Figure 3 shows model predictions and measurements of training performance for the two strategies. Overall, the behaviour of the Multi-Spert system coincides well with the models. As the number of hidden units grows, scalability improves because computation increases relative to communication.
For PP, larger bunch sizes reduce overhead and increase performance, as can be seen by comparing the two top curves, which are for 1000 hidden units with bunch sizes of 50 and 100. For NP with a given network size, increasing the number of nodes increases the number of times each pattern is communicated but decreases the size of each node's subnetwork. Performance drops rapidly once communication time overtakes computation time. PP is more efficient for smaller networks, but NP is more efficient for larger networks. The crossover point is at just over 1000 hidden units for a four-node system. The model predicts that NP performance should continue to scale for 3000 hidden units with six nodes.

Figure 3: Predicted and measured Multi-Spert training performance for the PP and NP strategies. Networks have 153 input units, 56 output units, and varying numbers of hidden units. Bunch sizes are indicated on the graphs.

Figure 4: Multi-Spert performance for forward pass. Networks have 153 input units, 56 output units, and varying numbers of hidden units.

Figure 4 shows forward pass performance. Forward pass is mostly communication bandwidth limited. The NP strategy has the disadvantage that all patterns are sent to all nodes, so parallel efficiency is poor on smaller networks. The PP strategy scales much better, since every pattern is sent only once regardless of the number of processors. The one advantage of the NP approach is that it can handle networks that will not fit on a single board using PP. We attain around 1 GCPS with five nodes.

We can take advantage of the structure of the patterns used in the speech application to reduce pattern communication overhead in the NP strategy. Each pattern represents a sliding time window over 9 consecutive frames of spectral features. By keeping the pattern database on the slave, we need only send each frame once and can then reuse it 9 times at different window alignments.

Figure 5: Multi-Spert NP training performance with optimized pattern presentation.

Figure 5 shows the predicted and measured performance of this optimization of the NP algorithm. Performance is measured at over 530 MCUPS for five nodes. The model predicts that scalability improves to around 800 MCUPS on eight nodes for 3000 hidden units.

5 Related Work

Modern workstations achieve training performances of up to 20-30 MCUPS on this task, but several special-purpose machines achieve performance levels competitive with Multi-Spert. The MUSIC and GF11 systems report training performances of 247 and 901 MCUPS, using 60 and 356 processors, respectively [4]. The full-custom CNAPS with 512 processors on 8 chips achieves over 1 GCUPS peak performance, but uses only 16-bit weights throughout. We have not found any published backpropagation performance numbers for the Synapse-1 [5]. Compared with these other machines, Multi-Spert is more flexible, lower cost, and easier to program, yet it sustains high performance on parallel neural network training for real-world applications. T0 has a general-purpose vector architecture based on the industry-standard MIPS instruction set [1], and has been used to implement a wide variety of algorithms. Multi-Spert supports multiple simultaneous jobs, each running on a subset of the available boards; this feature increases system utilization when a variety of network sizes are being trained. The Multi-Spert system is a low-cost system consisting of a few boards plugged into a host workstation, and it is programmed in C++ with a simple master/slave model.

6 Summary and Future Work

We presented the Multi-Spert system and described two strategies for parallelizing the error backpropagation training algorithm. We presented performance predictions and measurements for neural networks taken from a speech recognition task. The measurements validate our models, and demonstrate over 530 MCUPS performance with five nodes for a 3000-hidden-unit network. The model predicts a performance of over 800 MCUPS for eight nodes. We are working to extend measurements and predictions to larger systems with up to 16 Spert-II boards. Future work includes improvements to the FPGA SBus interface to increase communication bandwidth. We are also planning to port other applications to Multi-Spert.

Acknowledgments

We would like to thank Jim Beck, who designed the Spert-II board and Xilinx interface; John Hauser, who wrote the device driver software for the Spert-II board; and David Johnson, who wrote the original uniprocessor neural network training code on which the Multi-Spert version was based. Primary support for this work was from ONR URI Grant N00014-92-J-1617, ARPA contract number N0001493-C249, and NSF Grant No. MIP-931198. Additional support was provided by the International Computer Science Institute.

References

1. John Wawrzynek, Krste Asanović, Brian Kingsbury, James Beck, David Johnson, and Nelson Morgan. Spert-II: A Vector Microprocessor System. IEEE Computer, 29(3):79-86, March 1996.
2. Davide Anguita and Benedict A. Gomes. MBP on T0: Mixing Floating- and Fixed-Point Formats in BP Learning. Technical Report TR-94-38, International Computer Science Institute, August 1994.
3. Philipp Färber. QuickNet: Neural Network Training on Multi-Spert. Technical report, International Computer Science Institute, 1997. To appear.
4. Urs A. Müller, Bernhard Bäumle, Peter Kohler, Anton Gunzinger, and Walter Guggenbühl. Achieving Supercomputer Performance for Neural Net Simulation with an Array of Digital Signal Processors. IEEE Micro, 12(5):55-64, October 1992.
5. U. Ramacher, J. Beichter, W. Raab, J. Anlauf, N. Brüls, M. Hachmann, and M. Wesseling. Design of a 1st Generation Neurocomputer. In VLSI Design of Neural Networks. Kluwer Academic, 1991.