
Performance Monitor Unit Design for an AXI-based Multi-Core SoC Platform

Hyun-min Kyung, Gi-ho Park, Jong Wook Kwak, WooKyeong Jeong, Tae-Jin Kim, Sung-Bae Park
Processor Architecture Lab, SOC R&D Center, System LSI Division, Semiconductor Business, Samsung Electronics, Yongin-City, Kyeong-gi Do, Korea
{hyunmin.kyung, giho.park, jongwook.kwak, wk.jeong, taejinkim, sung.park}@samsung.com

ABSTRACT
As the physical gate count of a System-on-Chip (SOC) increases and system design complexity grows steadily, it becomes more and more difficult to achieve good resource utilization by assigning each task to a particular hardware IP and to trace the execution patterns of each task efficiently. Performance monitoring features are therefore increasingly important for easing system monitoring and performance debugging. In this paper, we present a performance monitoring unit (PMU) for the AMBA Advanced eXtensible Interface (AXI) bus. The PMU can measure major performance metrics, such as the bus latency of specific master requests and the amount of memory traffic over specific durations. It can also measure the contention among the bus masters and slaves in the SOC. We also present the distributor and the synchronization method that allow multiple performance counting units to be used together. The performance monitoring unit has been verified on a platform FPGA board with a 9-by-4 AXI interconnect configuration. These monitoring features give insight to the system architect by helping to find and analyze the performance bottlenecks of the target system.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Performance attributes

General Terms
Performance

Keywords
Performance monitor, AXI, AMBA, SOC platform, architecture exploration

1. INTRODUCTION
As the complexity of SOCs increases, understanding the behavior of the internal interactions in the SOC is crucial for effective SOC design. Because an SOC integrates various IP blocks into a single chip, conventional board-level system debug approaches, such as probing signals with logic analyzers, are not effective: the most interesting transactions, such as bus contentions, cannot be observed off-chip. SOC architects usually hand-calculate the memory traffic of a specific function IP, such as an H.264 decoder, from the IP specification and algorithm analysis in order to design the bus system. This method can be effective for a simple SOC that has only a single CPU and one function IP, where the CPU is usually idle while the function IP is running, but it cannot account for the dynamic behavior of the SOC. For an SOC with multiple cores, especially when a programmable core processes large applications such as multimedia codecs, understanding the dynamic behavior of the system is crucial for designing the bus architecture. This kind of information is also very useful for optimizing the software running on the SOC.
In this paper, we present the performance monitoring unit (PMU) designed for monitoring and analyzing system behavior on the AXI bus of a multi-core SOC. The PMU will be embedded in the SAVm IV SOC platform. We are building the first version of the SAViT (Standard/Synergistic/Superior Architecture for Versatile information Technology) platform for multimedia processing [11]; the SAVm IV SOC platform is the first prototype of the SAViT series. This platform SOC has a heterogeneous multiprocessor architecture with an ARM CPU and a StarCore DSP. The on-chip system bus of this SOC is based on the AMBA AXI bus system for high-performance, high-bandwidth applications. The PMU can measure major performance metrics, such as the bus latency of specific master requests and the amount of memory traffic over specific durations. We expect it to be used for performance analysis of the bus system and for software optimization for the SOC.

The rest of this paper is organized as follows. Section 2 describes related work on performance monitoring units. An overview of the AXI protocol and the architecture of the PMU are described in Sections 3 and 4, respectively. We present the implementation and a sample usage of the PMU in Section 5. Finally, Section 6 concludes the paper.

2. RELATED WORKS
Adding performance monitoring features has clearly created many opportunities to analyze and tune current microarchitectures and embedded systems. To evaluate system performance efficiently, research on hardware performance monitors (HPMs) has been carried out. It includes the evolution of the PowerPC performance monitors [3], which started in 1991. When the Power2 processors were originally designed, a full-scale performance monitor with 22 counters was integrated onto the 4-chip processor. The performance monitoring features of the Pentium 4 [4] are another example of an HPM; they support simultaneous multithreaded execution through qualification of event detection by thread ID and qualification of event counting by thread mode. Such monitors are usually embedded in the processor itself and are mainly used for software optimization. Collard et al. [5] present a system-wide instruction-level performance monitor called SWIFT (System-Wide Information for Tuning), and Borril et al. [6] introduce IPM (Integrated Performance Monitoring), a performance profiling tool applied to a cosmology application. DSPs also require performance monitoring features to achieve maximum performance; performance monitoring tools for DSPs, however, must address DSP-specific requirements such as multimedia video processing (MVP) [9]. In addition, Wisniewski et al. and Mink et al. introduce performance monitoring tools for operating systems and the PCI bus, respectively [7], [8]. Recently, IBM proposed the PLB Performance Monitor (PPM) [10], which provides hardware for counting certain events associated with processor local bus (PLB) transactions. The PPM contains a set of counters whose contents may be read by software and used to analyze and enhance PLB performance. The main roles of the PPM are event-occurrence counting and event-duration counting. The main difference of the PPM from previous research is that it is designed for system-wide performance measurement, especially of PLB bus performance.

Compared to the previous works mentioned above, we design and implement a performance monitoring unit (PMU) for a System-on-Chip (SOC) platform. It is designed to gather bus-performance-related metrics: bus-transaction events, including the number of requests and the size of burst transfer requests, and bus-contention events. The PMU also takes the characteristics of the AMBA AXI protocol into account.

3. THE OVERVIEW OF AMBA AXI
The SAViT platform is based on the AMBA Advanced eXtensible Interface (AXI) bus system. The AXI is an enhanced bus protocol that builds on the existing Advanced High-performance Bus (AHB). We carefully considered the nature of the AXI bus protocol when designing the performance monitoring unit of our SOC. In this section, the features of the AXI protocol and interconnect relevant to the performance counter design are described briefly.

3.1 AXI Protocol [1]
The AMBA AXI is targeted at high-performance, high-frequency system designs and improves on the existing AHB and APB protocols in terms of bus performance. The key features of the AXI protocol are as follows:

- Separate address/control and data phases
- Support for unaligned data transfers using byte strobes
- Burst-based transactions with only the start address issued
- Separate read and write data channels
- Ability to issue multiple outstanding addresses

The AXI protocol has five independent channels, each consisting of a set of information signals, and each channel uses a two-way VALID and READY handshake mechanism. The five channels are the read address channel, the write address channel, the read data channel, the write data channel, and the write response channel. The read address, write address, and write data channels are initiated by a master device; the read data and write response channels are driven by a slave device.
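As an aside not found in the original paper, the handshake rule can be summarized in a few lines of C. The sketch below is a simplified software model under our own naming; on real hardware these are per-channel signals sampled on the bus clock, and a transfer completes only in a cycle where both VALID and READY are high.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified model of one AXI channel's two-way handshake.
 * (Illustrative only; names are ours, not from the AXI specification.) */
struct axi_channel {
    bool valid;  /* asserted by the source when information is available */
    bool ready;  /* asserted by the destination when it can accept it */
};

/* A transfer completes in any clock cycle where both are asserted. */
static bool transfer_completes(const struct axi_channel *ch)
{
    return ch->valid && ch->ready;
}

int main(void)
{
    struct axi_channel ar = { .valid = true, .ready = false };
    printf("%d\n", transfer_completes(&ar)); /* 0: slave stalls the master */
    ar.ready = true;
    printf("%d\n", transfer_completes(&ar)); /* 1: handshake done this cycle */
    return 0;
}
```

Because each of the five channels carries its own VALID/READY pair, a monitor can observe address, data, and response traffic independently, which the PMU design below exploits.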
3.2 AXI Interconnect
The AXI interconnect is needed to connect multiple AXI buses. A number of master and slave devices are connected together through some form of interconnect, as shown in Figure 1. The AXI interconnect has interfaces to connect master and slave devices. An interface can be either a master port or a slave port, and each port behaves like another device, presenting symmetrical master and slave ports to the real master and slave devices. The key feature of the AXI interconnect is its full crossbar structure between master and slave ports. This enables multiple master devices to access different slave devices simultaneously; however, if the destination slave devices overlap, contention between master devices can occur. Similarly, multiple slave devices can respond to different masters concurrently. This is one of the most distinctive features compared with the AHB protocol, where only one master can occupy the bus at a time.

Figure 1. AXI Interconnect (Example: 3x2)

We define a Master Bus as the interface from a master device to a slave port of an interconnect, and a Slave Bus as the interface from a master port of an interconnect to a slave device. The AXI interconnect developed in our system is based on the PrimeCell AXI Configurable Interconnect (PL300) [2]. A 9-by-4 AXI interconnect structure (i.e., 9 masters and 4 slaves) is used in our system implementation for the performance monitoring unit design.

4. THE PERFORMANCE MONITORING UNITS
In this section, we explain the performance monitoring unit (PMU) for the AXI bus system in detail. First, the overall architecture of the PMU is presented; the sub-module architectures of the PMU are described in the following subsections.

4.1 Architectural Overview
The PMU provides monitoring hardware for counting certain events related to AXI bus transactions. The PMU is a set of hardware monitors embedding a set of counters whose contents may be set and read by software and used to analyze and enhance the performance of the entire system. Figure 2 shows the overall structure of the performance monitoring unit. The PMU is designed for monitoring and analyzing both bus-transaction events (BTE) and bus-contention events (BCE). BTE can be measured by observing a specified bus; counting of BTE is accomplished via a set of 32-bit counters that increment once for each occurrence of the selected event. BCE, however, must be counted by monitoring bus occupation and arbitration signals from multiple buses, since contention and arbitration take place inside the AXI interconnect. BCE is measured by a set of counters that increment once for each conflict between a selected target port and the other target ports, where a target port can be either a master port or a slave port. A conflict is defined as two or more targets wanting to transfer data to the same destination at the same time.

The PMU is composed of two independent modules, the Bus Monitor (BM) and the AXI Interconnect Contention Monitor (CM). The BM counts BTE on a master bus, and the CM measures BCE. The CM is further divided into the Master Contention Monitor (MCM) and the Slave Contention Monitor (SCM), according to the contending devices. To increase the effectiveness of the PMU, we use a distributor for the bus monitors and interconnect contention monitors. The distributor provides flexible connections between master/slave buses and monitoring units, as shown in Figure 2. We can connect any combination of masters to the bus monitor modules, up to the number of monitor modules. For example, if three bus monitor modules are available in the 9-by-4 AXI interconnect, we can observe any three of the nine master ports simultaneously by setting the registers of the distributor. Detailed features of each module are explained in the next subsections.

4.2 Bus Monitor
The Bus Monitor (BM) counts bus-transaction-related events. Since bus transactions are generated by a master device, the BM observes the signals of a master bus. The Bus Monitor gathers the following information:

- Address range of the request (implemented to gather statistics for three address ranges specified by register settings)
- Total transfer count (read/write)
- Total transfer size (read/write), in bytes
- Read latency distribution
- Write latency distribution
- Global clock count

The BM has three address-range divisions because a master transfers data to several destination slaves, as shown in Figure 2. An architect can modify an address range to observe the specified bus-transaction events by setting the control register of the bus monitor. The address ranges can be used either to distinguish destination slaves or to divide a single slave's address range into three sections. The BM contains three performance counter sets (PCSs) that count bus-transaction events independently for the three address ranges, as shown in Figure 3. Each PCS unit, denoted S0_PC, S1_PC, and S2_PC in Figure 3, measures the bus-transaction events that a master generates within the specified address range.
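As an illustration of the address-range mechanism, the C sketch below shows how a decoded transaction address could select at most one of the three PCS units. This code is not from the paper; the software view of a range as a base/limit register pair is our own assumption.

```c
#include <stdint.h>

/* Hypothetical software view of one BM address-range register pair. */
struct addr_range {
    uint32_t base;   /* inclusive lower bound, programmed by software */
    uint32_t limit;  /* inclusive upper bound, programmed by software */
};

/* The three ranges correspond to the S0_PC, S1_PC, and S2_PC sets. */
static struct addr_range pcs_range[3];

/* Return the index of the PCS whose range covers the address,
 * or -1 if the transaction falls outside all monitored ranges. */
static int select_pcs(uint32_t addr)
{
    for (int i = 0; i < 3; i++) {
        if (addr >= pcs_range[i].base && addr <= pcs_range[i].limit)
            return i;
    }
    return -1; /* not counted by any PCS */
}
```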
Figure 2. Overall Structure of the Performance Monitoring Unit for the 9x4 AXI Interconnect

Figure 3. Structure of the Bus Monitor

The counters in each performance counter set store the bus-transaction event counts: the total transfer count, the total transfer size (in bytes), the read latency distribution, and the write latency distribution. The total transfer count is the number of transactions that a master generates during a measurement period. The transfer count does not distinguish between single and burst transactions; it is incremented by one in both cases. The transfer size, however, does account for burst transactions: the PCS computes the total transfer size of the current transaction from the ARLEN/AWLEN and ARSIZE/AWSIZE signals when the address transaction is issued.
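For AXI3 signals, AxLEN[3:0] encodes the number of beats minus one and AxSIZE[2:0] encodes log2 of the bytes per beat, so the byte count of an aligned burst follows directly. The short C sketch below (our own illustration, not the paper's hardware) shows the arithmetic a PCS effectively performs:

```c
#include <stdint.h>

/* Bytes moved by one aligned AXI3 burst:
 * AxLEN[3:0]  = number of beats - 1   (1..16 beats)
 * AxSIZE[2:0] = log2(bytes per beat)  (1..128 bytes)
 * total bytes = (AxLEN + 1) << AxSIZE                      */
static uint32_t burst_bytes(uint8_t axlen, uint8_t axsize)
{
    uint32_t beats = (uint32_t)(axlen & 0xFu) + 1u;
    return beats << (axsize & 0x7u);
}

/* Example: AWLEN = 7 (8 beats) and AWSIZE = 2 (4 bytes per beat)
 * give 32 bytes, which would be added to the transfer size counter. */
```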

Read and write latency is defined as the time a master waits to complete a transaction. There are two possible sources of this waiting time: other masters competing for the bus, and the response time of the slave device. For a read, the master has to wait to receive the read data from the slave after issuing the read address; read latency is therefore defined as the time from the start of the read address transaction to the beginning of the read data service. For overlapping transactions we define read latency differently, as the time from the end of the read data service of the previous transaction to the beginning of the read data service of the current transaction, as shown in Figure 4; this gives a more meaningful latency figure when requests overlap. Write latency is defined as the period a master needs to complete the write data transaction. Write latency does not include the time from the write address transaction to the write data transaction, because that interval depends not on the bus status but on the master's own operating time.

Because the read/write latency has a wide distribution, from a couple of cycles to several hundreds of cycles, dedicated counters for each latency value would not be practical. We therefore use an interval-based counting method: software can select eight intervals for read latency and four intervals for write latency, and a set of 32-bit counters (eight for read, four for write) is incremented once for each transaction whose latency falls within the corresponding interval. The bus monitor also has four global clock counters, each 64 bits wide, used to measure the total clock count over a measurement period.

Figure 4. BM Read Latency for Overlapping Transactions

4.3 Contention Monitor
The Contention Monitor (CM) counts bus-contention-related events. Bus contention can occur between master devices or between slave devices. Since the AXI interconnect is a full crossbar, we have to observe the convergence points where multiple devices share the same destination in order to monitor bus-contention events. In the case of the 9x4 interconnect there are 13 convergence points: 9 points are merged by the slave devices and 4 points are converged by the master devices, as shown in Figure 2. The monitoring features of the CM are as follows:

- Conflict count between the specified master and the other masters on the address channel / write data channel
- Conflict count among the other masters, excluding the specified master, on the address channel / write data channel
- Conflict count between the specified slave and the other slaves on the read data channel / write response channel
- Conflict count among the other slaves, excluding the specified slave, on the read data channel / write response channel

The CM measures conflict counts for four independent channels: the address channel, write data channel, read data channel, and write response channel. Since each channel has dedicated signals on the AXI bus, the four channels can be active independently without interfering with one another. The CM is further divided into two modules according to the transaction source: the Master Contention Monitor (MCM) and the Slave Contention Monitor (SCM). The address and write data channels carry signals that a master generates and a slave responds to; contention on these channels is handled by the MCM. Conversely, the read data and write response channels carry signals that a slave generates and a master responds to; the SCM measures conflicts between slave devices on these channels.

The MCM block is connected to interconnect-internal signals that carry each master's transaction information and the ID of the master currently occupying the bus (ADDRMASTER[3:0], WRITEMASTER[3:0]), as well as slave bus signals, as shown in Figure 5. The slave bus signals indicate the point in time at which a transaction is generated on the slave bus. If, at that time, masters other than the currently granted master try to generate transactions, it is regarded as a conflict. As the number of masters grows, the number of possible conflict pairs increases rapidly. We therefore select one user-defined master, named MasterX: the MCM measures, for each channel, the conflict counts between MasterX and the other masters, and the aggregate conflict counts among the other masters excluding the user-defined master. Using this feature, we can observe the contention between the single most interesting master and the other masters, in addition to the contention among the less interesting masters.

Figure 5. Master Contention Monitor (Interconnect 9x4)

The SCM block is similar to the MCM block, as shown in Figure 6. The SCM measures conflict counts between a user-defined slave (SlaveX) and the other slaves; it also counts conflicts among the other slaves, excluding the user-defined slave, for the read data channel and the write response channel.
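To make the conflict bookkeeping concrete, here is a minimal C sketch of the per-cycle classification the MCM performs for one monitored slave bus. It is our own illustrative model (names and encoding invented), not the paper's hardware:

```c
#include <stdint.h>

/* Hypothetical per-cycle conflict bookkeeping for one monitored slave
 * bus (our own model; the PMU implements this in hardware). */
struct mcm_counters {
    uint32_t masterx_conflicts; /* MasterX contends with other masters */
    uint32_t other_conflicts;   /* contention not involving MasterX */
};

static int count_bits(uint16_t v)
{
    int n = 0;
    while (v) { n += v & 1u; v >>= 1; }
    return n;
}

/* req_mask: bit i is set if master i wants the monitored slave in this
 * cycle; masterx is the user-defined master of interest. */
static void mcm_update(struct mcm_counters *c, uint16_t req_mask,
                       int masterx)
{
    if (count_bits(req_mask) < 2)
        return;                 /* a single requester cannot conflict */
    if (req_mask & (1u << masterx))
        c->masterx_conflicts++; /* MasterX is one of the contenders */
    else
        c->other_conflicts++;   /* conflict among the other masters */
}
```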

Figure 6. Slave Contention Monitor (Interconnect 9x4)

4.4 Distributor: Configurable and Scalable Design
The distributor provides configurable connections between the AXI interconnect bus signals and the bus/contention monitors, as shown in Figure 2. There are three independent distributors: the BM distributor, the MCM distributor, and the SCM distributor. Each distributor enables its monitors to measure multiple AXI interconnect bus signals according to a configurable setting. In the case of the 9x4 (9 masters, 4 slaves) AXI interconnect crossbar, the BM distributor connects to the 9 master buses (not all AXI bus signals, only the transaction-related ones). Since the BM distributor has a full-crossbar structure, it can link any master bus to a BM unit, so a single BM can measure the transactions of any master bus. A BM unit cannot, however, observe multiple master buses simultaneously, so multiple BMs must be connected to the BM distributor to monitor multiple master buses concurrently. The number of BM units connected to the BM distributor can be changed by software, up to the total number of BM units. Figure 7 shows the (9 x n) BM distributor, which connects the 9 master buses to n BM units under control register settings.

The MCM distributor is connected to the 4 slave buses and to the AXI-internal signals that carry the 9 masters' transaction information and the ID of the master currently transferring on each slave bus. Since the MCM distributor also has a full-crossbar structure, it can flexibly link any MCM unit to any slave bus and its associated signals. Similarly, the SCM distributor connects multiple SCM units to the signals related to contention between slave devices. Like the BM distributor, the MCM and SCM distributors can connect their monitor units to any of the corresponding bus signals. A single BM, MCM, and SCM distributor is needed per AXI interconnect, but multiple monitors can be connected to each distributor.

Figure 7. BM Distributor (9 x n)

4.5 Multi-PM Synchronization
This section explains the measurement capabilities available when multiple Performance Monitor (PM) instances, which can be either Bus Monitors or Contention Monitors, are implemented in the system. When multiple PMs are activated concurrently, they need a way to interact with each other for synchronization. Each PM's Enable-In and Enable-Out ports can be wired to allow synchronization among the multiple monitors. Figure 8 illustrates a wiring example in which all PM instances operate concurrently: one instance acts as the primary, with its Enable-Out signal connected to the Enable-In input of all remaining (secondary) PM instances. This configuration is selected by enabling Sync_out in the primary instance and Sync_in in the secondary instances, and it allows all instances to begin and finish their measurements at the same time. In this configuration, all secondary instances are programmed first, but none begins functioning until the primary instance has been programmed and started; therefore, all instances operate concurrently.

Figure 8. Multiple PMU Synchronization
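The start condition this wiring creates can be summarized in a few lines of C. The model below is our own sketch with invented field names, not the paper's hardware description:

```c
#include <stdbool.h>

/* Hypothetical software model of one PM instance's sync behavior. */
struct pm {
    bool programmed; /* control registers have been written */
    bool sync_in;    /* configured as a secondary instance */
    bool enable_in;  /* wired from the primary's Enable-Out */
};

/* A primary counts as soon as it is programmed and started; a
 * secondary additionally waits for the primary's Enable-Out to reach
 * its Enable-In, so all instances start and stop together. */
static bool pm_is_counting(const struct pm *p)
{
    if (!p->programmed)
        return false;
    return p->sync_in ? p->enable_in : true;
}
```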

5. IMPLEMENTATION OF THE PMU
In this section, we present the implementation and verification of the performance monitoring unit and discuss a possible example usage of the PMU.

5.1 PMU Implementation
We have implemented and verified the performance monitoring unit on a SAVm IV platform FPGA board. The FPGA board has two Xilinx XC2V8000 devices and one XC2V4000 for the bus and IPs, together with SDRAM/DDR, UART, LCD, SIM card, keypad, and modem interfaces. It also carries ARM1136 and StarCore SC2400 test chips to reflect the SAVm IV architecture, a heterogeneous multiprocessor architecture with an ARM11 processor and an SC2400 DSP. Figure 9 shows the architecture and IP features of the SAVm IV SOC platform. It has an external bus interface, the AXI/AHB bus, so that other IPs can easily be integrated with the platform. The SAVm IV adopts a 32-bit bus architecture including the AXI, AHB, and APB buses. The primary system bus is the AXI spine bus; AHB/APB buses and bridges are used for interfacing with various IPs. The AXI-spine interconnect is implemented as a 9-by-4 structure on the FPGA board. The ARM acts as the primary processor, performing control processing, while the StarCore carries out data processing such as H.264 decoding. The system and the two core clocks run at 20 MHz on the FPGA board. The experiments on the FPGA board use the following monitors: one (9x6) BM distributor, one (4x4) MCM distributor, one (9x9) SCM distributor, one BM, one MCM, and one SCM unit. The location of the PMU on the SAVm IV platform is shown in Figure 9.

5.2 Verification of the PMU
The performance monitoring unit has been verified on the SAVm IV FPGA board. The functionality of the PMU was verified by running a test scenario for each module of the PMU, as presented below. Table 1 shows the bus monitor parameter settings of the PMU used to verify the functionality of the bus monitor and the BM distributor. We wrote a simple program that generates master transactions and checked the resulting bus monitor counter values against the configuration presented in Table 1.

Table 1. Bus Monitoring Parameters

IP                       Configuration Registers
Bus Monitor0             BM enable; three address ranges; eight read
                         latency ranges; three write latency ranges;
                         global clock enable; sync enable
BM distributor (9 x 6)   BM distributor enable; BM0~5 bus selection
                         (e.g., BM0: Master1 bus)

As explained earlier, the BM distributor can connect any master bus to any BM module. In our test scenario, BM0 is used to monitor the master transactions, and the BM distributor registers are set accordingly, as shown in Table 1. In this example, the BM0 bus selection register is configured as the Master1 bus, connecting the Master1 bus to BM0; the BM1~5 bus selections, which are not connected to any BM, are programmed as disabled. Then the control registers of BM0 are configured to monitor the specified events, such as bus latencies and bus-transaction events within a specified address range. In addition, synchronization with other monitors is controlled by the Sync_Enable register. By setting various parameters of the BM distributor and the Bus Monitor, we can verify the functions of both.

The Master Contention Monitor is activated by selecting the slave bus on which contention is to be monitored and the master ID whose contention on that slave bus is of interest. For example, to monitor contention between Master 1 and the other masters on Slave bus 2, the register settings are as shown in Table 2.
To synchronize with other monitors, the Sync register can also be set.

Table 2. Master Contention Monitoring Parameters

IP                           Configuration Registers
Master Contention Monitor 0  MCM enable; user-defined Master ID
                             (MasterX) selection (e.g., Master 1
                             device); sync enable
MCM distributor (4 x 4)      MCM distributor enable; MCM0~3 bus
                             selection (e.g., MCM0: Slave bus 2)

Similarly to the MCM case, the Slave Contention Monitor is operated by selecting the master bus on which contention is to be monitored and the slave ID whose contention on that master bus is of interest. If contention between Slave 0 and the other slave devices on Master bus 1 is to be measured, the control registers are configured as in Table 3.

Table 3. Slave Contention Monitoring Parameters

IP                          Configuration Registers
Slave Contention Monitor 0  SCM enable; user-defined Slave (SlaveX)
                            selection (e.g., Slave 0 device); sync
                            enable
SCM distributor (9 x 9)     SCM distributor enable; SCM0~8 bus
                            selection (e.g., SCM0: Master bus 1)
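The paper does not publish the PMU's memory map, so the following C sketch is purely illustrative: all register addresses, offsets, and encodings are invented to show what the programming sequence of Tables 1 and 2 could look like from software.

```c
#include <stdint.h>

/* All addresses and encodings below are hypothetical; the paper does
 * not disclose the PMU register map. */
#define PMU_BASE        0x40000000u        /* assumed base address     */
#define BM_DIST_ENABLE  (PMU_BASE + 0x00u) /* BM distributor enable    */
#define BM0_BUS_SEL     (PMU_BASE + 0x04u) /* master bus routed to BM0 */
#define BM0_CTRL        (PMU_BASE + 0x10u) /* BM0 enable/sync bits     */
#define MCM_DIST_ENABLE (PMU_BASE + 0x40u) /* MCM distributor enable   */
#define MCM0_BUS_SEL    (PMU_BASE + 0x44u) /* slave bus routed to MCM0 */
#define MCM0_MASTERX    (PMU_BASE + 0x48u) /* user-defined Master ID   */

static inline void reg_write(uintptr_t addr, uint32_t val)
{
    *(volatile uint32_t *)addr = val; /* memory-mapped register write */
}

/* Route the Master1 bus to BM0 and watch Master1's contention on
 * Slave bus 2, mirroring the settings of Tables 1 and 2. */
static void configure_pmu(void)
{
    reg_write(BM_DIST_ENABLE, 1);  /* enable the BM distributor     */
    reg_write(BM0_BUS_SEL, 1);     /* BM0 observes the Master1 bus  */
    reg_write(MCM_DIST_ENABLE, 1); /* enable the MCM distributor    */
    reg_write(MCM0_BUS_SEL, 2);    /* MCM0 observes Slave bus 2     */
    reg_write(MCM0_MASTERX, 1);    /* MasterX = Master 1            */
    reg_write(BM0_CTRL, 1);        /* enable BM0 and start counting */
}
```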

Figure 9. SAVm IV Platform Block Diagram

Table 4. Example Test Scenario of the Bus Monitor and Contention Monitor

Measured events: number of bus transactions of Master1 (ARM) and Master7 (DMAX), and bus contention related to Master1 on Slave bus 2.

IP: BM distributor (9 x 6)
  Configuration: BM distributor enable = enable; BM0 bus selection = Master1 or Master7 bus; BM1~5 bus selection = disable
  Measurement: connects the Master1 (or Master7) bus to BM0

IP: BM0
  Configuration: BM enable = enable; three address ranges = one range fitted to the SDRAM access address map; eight read latency ranges = desired read latency ranges; three write latency ranges = desired write latency ranges; global clock enable = enable; Sync Enable = Sync Out (2'b10)
  Measurement: bus-transaction events (transfer counts, transfer size, read latency distribution, write latency distribution, simulation time, etc.) on the Master1 (or Master7) bus

IP: MCM distributor (4 x 4)
  Configuration: MCM distributor enable = enable; MCM0 bus selection = Slave bus 2; MCM1~3 bus selection = disable
  Measurement: connects Slave bus 2 to MCM0

IP: MCM0
  Configuration: MCM enable = enable; user-defined Master ID (MasterX) selection = Master1; Sync Enable = Sync In (2'b01)
  Measurement: bus-contention events between Master1 and the other masters

Table 4 shows the test scenario used to verify the functionality of the bus monitor and the contention monitor simultaneously. We assume a simple case in which the ARM1136 core (Master1) and the DMAX (Master7) access the SDRAM (Slave bus 2) at the same time. The configuration presented in Table 4 enables the bus monitor and the contention monitor to gather information about the contention between the two IPs, i.e., Master1 (ARM) and Master7 (DMAX). To synchronize BM0 and MCM0, the Sync Enable of BM0 is set to Sync Out and the Sync Enable of MCM0 is set to Sync In; with this configuration, the operation of MCM0 is synchronized to the BM0 enable.
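Reusing the hypothetical reg_write helper and register map sketched earlier, the Table 4 scenario could be driven as below. Per Section 4.5, the secondary (MCM0, Sync In) is programmed before the primary (BM0, Sync Out), whose enable then starts both monitors; the SYNC field placement and counter offsets are, again, our own assumptions.

```c
/* Continues the previous sketch (reg_write, PMU_BASE, BM0_CTRL).
 * Hypothetical sync encodings: a 2-bit SYNC field assumed at bits
 * [2:1], holding 2'b10 (Sync Out) or 2'b01 (Sync In). */
#define SYNC_OUT        (0x2u << 1)
#define SYNC_IN         (0x1u << 1)
#define ENABLE          0x1u
#define MCM0_CTRL       (PMU_BASE + 0x4Cu)
#define BM0_XFER_COUNT  (PMU_BASE + 0x20u) /* total transfer count   */
#define MCM0_X_CONFLICT (PMU_BASE + 0x50u) /* MasterX conflict count */

static inline uint32_t reg_read(uintptr_t addr)
{
    return *(volatile uint32_t *)addr;
}

static void run_table4_scenario(void)
{
    /* Program the secondary first: armed, waiting for Enable-In. */
    reg_write(MCM0_CTRL, SYNC_IN | ENABLE);
    /* Starting the primary raises Enable-Out and starts both PMs. */
    reg_write(BM0_CTRL, SYNC_OUT | ENABLE);

    /* ... run the workload so Master1 and Master7 access SDRAM ... */

    reg_write(BM0_CTRL, 0);  /* stopping the primary stops MCM0 too */
    uint32_t transfers = reg_read(BM0_XFER_COUNT);
    uint32_t conflicts = reg_read(MCM0_X_CONFLICT);
    (void)transfers; (void)conflicts; /* e.g., report over UART */
}
```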

We present only a couple of sample test scenarios in this section, although many other scenarios have been used to check the PMU functionality, including tests of the bus monitor and the slave contention monitor together.

5.3 Usage of the PMU
We will use the PMU to evaluate system performance with an H.264 decoding example on the FPGA board. Figure 10 shows the data flow of H.264 decoding, with dotted lines indicating which master devices generate transactions to which slaves. As shown in Figure 10, the most congested path is SDRAM access via the DMC (PL340). Since the AXI spine is the primary system bus, we will estimate system performance at the AXI-spine interconnect. The AXI-spine interconnect has a 9x4 structure, and the DMC (Dynamic Memory Controller) is connected to the second slave bus of the AXI-spine interconnect. We are currently porting the H.264 decoding software to the FPGA board to measure the performance of the system, including bus utilization and latencies. The effectiveness of the performance monitoring unit will be verified using this real application, and the features of the performance counting unit can be adjusted based on this analysis.

Figure 10. H.264 Decoding Data Flow with Regard to SDRAM Access

6. CONCLUSION
We have presented a performance monitoring unit (PMU) for the AMBA AXI bus. The PMU can measure major performance metrics, such as the bus latency of specific master requests and the amount of memory traffic over specific durations, and it can also measure the contention among the bus masters and slaves in the SOC. We also designed the distributor and the synchronization method that allow multiple performance monitoring units to be used together. The performance monitoring unit has been verified on a platform FPGA board with a 9-by-4 AXI interconnect configuration. It will be used to evaluate the system architecture, especially the bus architecture, of our platform SOC design.

7. REFERENCES
[1] AMBA AXI Protocol Specification v1.0, ARM, 2003.
[2] PrimeCell AXI Configurable Interconnect (PL300) Technical Reference Manual, ARM, 2004.
[3] Roth, C. and Levine, F. PowerPC performance monitor evolution. IEEE International Performance, Computing, and Communications Conference (IPCCC), Feb. 1997, pp. 331-336.
[4] Sprunt, B. Pentium 4 performance-monitoring features. IEEE Micro, July-Aug. 2002, pp. 72-82.
[5] Collard, J.F., Jouppi, N., and Yehia, S. System-wide performance monitors and their application to the optimization of coherent memory access. PPoPP, ACM, June 2005.
[6] Borril, J., Carter, J., Oliker, L., Skinner, D., and Biswas, R. Integrated performance monitoring of a cosmology application on leading HEC platforms. IEEE International Conference on Parallel Processing, 2005, pp. 119-128.
[7] Wisniewski, R.W. and Rosenburg, B. Efficient, unified, and scalable performance monitoring for multiprocessor operating systems. Supercomputing, ACM/IEEE Conference, Nov. 2003.
[8] Mink, A., Salamon, W., Hollingsworth, J.K., and Arunachalam, R. Performance measurement using low perturbation and high precision hardware assists. Real-Time Systems Symposium, Dec. 1998, pp. 379-388.
[9] Kim, J. and Kim, Y. Performance analysis and tuning for a single-chip multiprocessor DSP. IEEE Concurrency, Jan.-March 1997, pp. 68-79.
[10] PLB Performance Monitor User's Manual, IBM, 2002.
[11] Park, G., et al. Architecture exploration and performance verification environments of multi-core SOC for mobile multimedia embedded systems. ISOCC, 2006.