Test compression bandwidth management in system-on-a-chip designs


Politechnika Poznańska
Wydział Elektroniki i Telekomunikacji

Poznań University of Technology
Faculty of Electronics and Telecommunications

Ph.D. Thesis

Test compression bandwidth management in system-on-a-chip designs

by Jakub Janicki

Supervisors:
prof. dr inż. Janusz Rajski
prof. dr hab. inż. Jerzy Tyszer

Poznań, Poland 2013


To my dearest wife Natalia


Acknowledgements

The work presented here was carried out between July 2009 and October 2013 at the Faculty of Electronics and Telecommunications at Poznań University of Technology and at Mentor Graphics Polska. Numerous people helped me through this effort, and I would like to mention some of them here.

I would like to express my sincere gratitude to my supervisor, Professor Jerzy Tyszer. His guidance, enthusiasm, inspiration, insightful remarks, and, last but not least, continual motivation were all invaluable. I will never forget the atmosphere of the time spent on research work. I also wish to thank my co-supervisor, Professor Janusz Rajski from Mentor Graphics Corporation, for his vision, technological review, and encouragement.

I extend my thanks to all the people from Mentor Graphics who helped me during my work toward the PhD. In particular, I would like to express my gratitude to Dr. Grzegorz Mrugalski for insightful discussions, cooperation, and for what I have learned from them. I would also like to thank Dr. Nilanjan Mukherjee and Dr. Yu Huang for their help in preparing experiments and for the valuable discussions we had during the technology transfer process.

I gratefully acknowledge the scholarships from Mentor Graphics and Semiconductor Research Corporation that allowed me to attend many international conferences and summer internships and to join the test community.

Finally, my deepest thanks, love, and affection go to my wife Natalia, my family, and all my tutors. For their love, patience, and support I am forever grateful.

Poznań, October 2013


Contents

Contents
List of Symbols and Abbreviations
List of Figures
List of Tables

1 Introduction

2 State of the art
    Testing in scan-based designs
    Test data compression
    System-on-a-chip testing
    Test architectures
    Test scheduling
    Test sharing and broadcasting
    Test architectures and scheduling co-optimization
    Test architectures with test data compression
    Motivation
    Thesis overview

3 Bandwidth-aware test compression in EDT environment
    Specified bits in test cubes
    Solver with channel-aware pivoting
    Channel underutilization
    Test data circulation
    3.5 Channel selection order
    Bandwidth-aware compression
    Experimental results

4 Bandwidth-aware test response compaction
    Output data selector architectures
    Basic output data selector
    XOR tree-based output data selector
    XOR tree-based output data selector with X-filtering
    Selection of observation points
    Experimental results

5 Bandwidth management in EDT environment
    General SoC test environment
    Core level statistics
    Test pattern set descriptor
    Bandwidth management
    Two-stage test scheduling algorithm
    Test access mechanism architecture
    Generic approach
    Test access mechanism architecture
    Test scheduling algorithm
    Experimental results

6 Test time reduction in SoC environment
    Single core test time reduction
    Impact on test data volume
    Test time reduction scheme
    Minimization of control configurations
    Conditional merging
    Changing application order
    Experimental results

7 Test compression bandwidth management - practical scenarios
    Test flow
    Control data delivery
    The use of IJTAG
    Dedicated control chain
    Pipeline architecture
    Optimization of SoC pin allocation
    Handling physical constraints
    Experimental results

8 Conclusion

Bibliography


List of Symbols and Abbreviations

Abbreviation   Description
ACM            Association for Computing Machinery
ATE            Automatic Test Equipment
ATPG           Automatic Test Pattern Generation
ATS            Asian Test Symposium
BIST           Built-In Self-Test
CAD            Computer-Aided Design
CUT            Circuit Under Test
DAC            Design Automation Conference
DATE           Design, Automation & Test in Europe
DFT            Design For Testability
EDT            Embedded Deterministic Test
ETS            European Test Symposium
FSM            Finite State Machine
IC             Integrated Circuit
ICCAD          International Conference on Computer Aided Design
ICCD           International Conference on Computer Design
IEEE           Institute of Electrical and Electronics Engineers
IP             Intellectual Property
ITC            International Test Conference
LFSR           Linear Feedback Shift Register
MISR           Multiple-Input Signature Register
PRPG           Pseudo Random Pattern Generator
SiP            System-in-a-Package
SoC            System-on-a-Chip
STUMPS         Self-Testing Using MISR and Parallel shift register Sequence generator
TPG            Test Pattern Generator
VLSI           Very Large Scale Integration
VTS            VLSI Test Symposium

List of Figures

2.1 Basic scan-design architecture
The BIST architecture
The STUMPS architecture
Combinational decompressor
Illinois scan
Sequential decompressor
EDT architecture
Example of 10-output 16-bit EDT decompressor
TestShell
Test access mechanism architectures
Test scheduling techniques
Relationship between TAT and TAM width
Test-architecture with test-data compression
Structure of the thesis
Test cube fill rate profile
Test cube fill rate profile and channel demands
Encoding efficiency
Single core channel underutilization
Variable propagation (circulation)
Conventional EDT injector placement
Distribution of variables injected through different channels
Test cube channel demands
Evenly distributed EDT injectors
Channel profiles for three EDT setups
4.1 Output data selector architectures
Fault observation on XOR tree-based ODS
Percentage of test patterns with different numbers of output channels
General SoC test environment
Core-based channel utilization for C
Base clusters and test schedules for the best-fit algorithm
TAM input interconnection network
Examples of channel assignment
Simple interconnection networks
Bipartite graphs for the networks of Figure
ATE channels vs. test application time for design D
Best-fit-based test schedule for design D1 with 28 ATE input channels
Balanced-fit-based test schedule for design D1 with 41 ATE input channels
Test schedule for design D3 with 19 ATE input channels
Test schedule for design D4 with 20 ATE input channels
Pattern count for the industrial design
Shift cycles for the industrial design
Test data volume vs. the number of input channels
Conditional merging
Changing application order
ATE channel reduction vs. test time reduction and test compression
Using IJTAG network to transfer control data
Dedicated control chain-based architecture
Pipeline architecture
Test data volume for industrial design
Pin layout constraints
The baseline bin-packing diagrams
Test time reduction
The optimized test schedules

List of Tables

3.1 Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Experimental results - circuit D
Correlated cores statistics
Single core wire network assignment
SoC characteristics I
Experimental results - non-isolated cores
Experimental results - isolated cores
SoC characteristics II
Experimental results bit long scan chains
SoC characteristics III
SoC characteristics IV
Experimental results


Chapter 1
Introduction

Intensive technological progress in semiconductor fabrication has made it possible to shrink chip features below 50 nanometers and to move toward three-dimensional integrated circuits. As a consequence, we observe unparalleled growth in gate counts and in circuit operating frequencies. This keeps Moore's law relevant, with the transistor count in a typical design doubling every two years. The growth in circuit size has forced the division of design projects into independent functional parts. This approach has had a profound impact on the design process and forms the basis for producing modular system-on-a-chip (SoC) and system-in-a-package (SiP) circuits. These designs can include a variety of digital, analog, mixed-signal, memory, optical, micro-electro-mechanical, and radio-frequency cores. Most often, individual cores are delivered by various vendors as license-driven IP cores embracing reusable units of logic, cell, or chip layout designs.

The popularity of SoC circuits has led to an unprecedented increase in the cost of test, which has become a serious challenge. This rise is primarily attributed to the difficulty in accessing embedded cores during test, long test development and test application times, and large volumes of test data. Applying new materials together with new fabrication processes reduces the cost of a single transistor but, at the same time, introduces new types of manufacturing defects and changes the distribution of traditional failures. Increases in test data volume drive up the cost of testing by elevating both test application time and tester storage requirements. These aspects determine the cost-effective density of transistors on a chip. This is why the electronics industry expects new test solutions that enable an acceptable ratio of test cost to production cost while maintaining high test quality.

Both test application time and test data volume can be reduced by compression of test stimuli and compaction of test responses. However, the diversity of test requirements in designs with a range of core counts has led to ineffective usage of available tester channels. Fortunately, dynamic channel allocation, which precisely allocates resources for every single test, allows effective usage of the assigned channels. Whereas relatively high compression ratios can be obtained for a single core, the overall advantage can be achieved by sharing all available channel resources between design cores. As a result, a new bandwidth management scheme has been introduced.

The primary objective of the thesis is to introduce new methods for dynamic channel allocation of the Embedded Deterministic Test (EDT) decompressor and to propose new bandwidth management schemes for system-on-a-chip testing. The thesis provides a comprehensive presentation and analysis of algorithms developed by the author and described earlier in multiple conference papers [48], [45], [52], [51], as well as journal publications [49], [50]. The presented solutions are also covered by pending patent applications [47], [46].

The thesis is organized as follows. The next chapter discusses the most important aspects of test data compression, test compaction, and system-on-a-chip testing. Chapter 2 also provides a brief review of state-of-the-art system-on-a-chip testing methods presented in the open technical literature. The motivation for this work is included at the end of that chapter. Chapter 3 presents new bandwidth-aware decompression of test data vectors in an EDT environment. In Chapter 4, a new scheme to handle test responses in a system-on-a-chip is presented. A new architecture for bandwidth management and a scheduling algorithm are presented in Chapter 5. Chapter 6 presents an original method for test time reduction in system-on-a-chip testing. Applications and practical scenarios of the introduced methods are presented in Chapter 7. Finally, Chapter 8 concludes the thesis.

Chapter 2
State of the art

2.1 Testing in scan-based designs

Long-standing research in very large scale integration (VLSI) design prototyping has led to the formulation of a set of rules that, when followed, guarantee the expected test quality. The main paradigms of the design for testability (DFT) methodology assume that a circuit under test (CUT) is controllable, observable, and predictable. Controllability quantifies the capability to set every internal state of a circuit using only its externally available inputs. This is complemented by observability, which allows one to determine every state in a circuit by ensuring propagation of the signal to the circuit's outputs. Finally, predictability denotes the ability to determine the state of a circuit in response to given input stimuli. All these properties are exploited by an Automatic Test Pattern Generator (ATPG) to deliberately control signal propagation in the design structure and to generate test stimuli assuring testability (controllability and observability of targeted faults).

The growing complexity of designs and the increase in sequential depth made ad hoc techniques (test point insertion, disabling internally generated clocks, removing logical redundancy and global feedback paths, isolating memory arrays and embedded structures, partitioning large circuits), which guaranteed high controllability and observability of simple circuits, insufficient. Difficulties in testing large circuits stimulated a search for techniques that make state variables directly controllable and observable. The most crucial milestone was the scan design concept. It introduces a special test mode into a circuit in such a way that, during testing, virtually all memory elements form shift registers (scan paths). A basic single scan-path design, illustrated in Figure 2.1, assumes that the circuit features three extra pins (test mode, scan-in, scan-out) and additional multiplexers. A CUT prepared in this manner can work in two modes of operation. In normal mode, all memory elements perform their regular functions, whereas in test mode, test patterns are shifted in through the scan-in terminal, and test responses are subsequently shifted out through the scan-out pin. This approach became a universal rule in the VLSI design process because testing such circuits can be treated as testing multiple simple cones of combinational logic. As a result, scan-based DFT significantly simplifies test pattern generation. On the other hand, it introduces several limitations, such as area overhead, performance degradation due to the presence of multiplexers, and increased test application time. They were, however, accepted by the semiconductor industry as the only way to assure high production efficacy.

Figure 2.1: Basic scan-design architecture

One of the well-established approaches that utilizes scan-based DFT is built-in self-test (BIST), which allows a CUT to test itself. This is feasible because a typical BIST architecture, presented in Figure 2.2, contains the apparatus for both test pattern generation and test response compaction. A BIST controller, which communicates with a tester using only a few tester pins, is in charge of test validation. Test vectors may be applied and test responses captured in parallel for multiple scan chains in every clock cycle, or serially through a single scan path. In particular, a significant reduction of test application time is possible with self-testing using a multiple-input signature register (MISR) and a parallel shift register sequence generator (STUMPS), as shown in Figure 2.3. It is worth noting that the quality of the applied test pattern generator and test response compactor has a considerable impact on test time and fault coverage. Hence, a pseudo-random pattern generator (PRPG) should provide pseudo-random test stimuli with an appropriate level of randomness, while a test response compactor should minimize the probability of masking an error (aliasing). Nevertheless, manufacturing test may require additional deterministic tests, delivered by automatic test equipment (ATE) or stored on a chip, to achieve the desired level of fault coverage. This, in turn, requires efficient test compression to reduce the volume of stored test data.

Figure 2.2: The BIST architecture

Figure 2.3: The STUMPS architecture
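The interplay between a PRPG and a MISR described in this section can be illustrated with a minimal sketch. The 4-bit register width, the feedback taps, the seed, and the toy "circuit" below are all assumptions chosen for illustration; none of them is taken from the thesis, and a real STUMPS structure drives many scan chains rather than a single 4-bit function.

```python
TAPS = (0, 1)   # hypothetical feedback taps (recurrence a[n+4] = a[n] ^ a[n+1])
WIDTH = 4       # hypothetical register width


def lfsr_step(state):
    """Advance the LFSR by one clock and return the new state."""
    feedback = 0
    for tap in TAPS:
        feedback ^= (state >> tap) & 1
    return (state >> 1) | (feedback << (WIDTH - 1))


def misr_step(signature, response):
    """Clock the MISR once: shift like the LFSR, then XOR in the parallel response bits."""
    return lfsr_step(signature) ^ response


def run_bist(cut, seed=0b0001, n_patterns=15):
    """Apply pseudo-random patterns to `cut` and compact its responses into a signature."""
    pattern, signature = seed, 0
    for _ in range(n_patterns):
        signature = misr_step(signature, cut(pattern))
        pattern = lfsr_step(pattern)
    return signature


def good_cut(pattern):
    """Toy fault-free 'circuit': a 4-bit combinational function (hypothetical)."""
    return (pattern ^ (pattern >> 1)) & 0xF


def faulty_cut(pattern):
    """The same toy circuit with output bit 2 stuck at 0."""
    return good_cut(pattern) & 0b1011


def prpg_period(seed=0b0001):
    """Count the clocks until the PRPG returns to its seed."""
    state, steps = lfsr_step(seed), 1
    while state != seed:
        state, steps = lfsr_step(state), steps + 1
    return steps


reference = run_bist(good_cut)
print("PRPG period:", prpg_period())
print("fault detected:", run_bist(faulty_cut) != reference)
```

Because the MISR is a linear machine, some error sequences can cancel out and leave the fault-free signature unchanged; this is the aliasing that a good compactor is expected to make unlikely.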

2.2 Test data compression

In the last decade, test data compression has become an efficient technique to reduce both test data volume and test application time [91]. Many of the proposed schemes exploit regularities and the high number of unspecified bits in test patterns. A low fill rate of care bits is a consequence of the shallow combinational logic [95] in contemporary designs developed to work with increasing clock frequencies. As a result, the fill rate of even highly specified test patterns does not exceed 3-5% [78], [29]. In the general scheme, compressed test stimuli, stored in the tester memory, are delivered through tester channels to an on-chip decompressor, which restores the expected data and loads them into scan chains. Based on the test pattern encoding technique, we can distinguish four major groups [91] of test data compression schemes.

Combinational decompressors

The first group, combinational decompressors, is based on linear operations and mostly uses linear XOR networks [4], [5], [6], [58], [69]. Figure 2.4 shows a general scheme of a four-input combinational decompressor feeding eight scan chains. Because the number of input channels is lower than the number of scan chains, each input channel drives several scan chains. The lack of memory elements in such a structure causes the current input channel value to be immediately injected into the scan chains. Techniques based on this principle may provide satisfactory results only if test patterns contain uniformly distributed care bits in every scan slice. However, experiments show that care bits are often clustered. In addition, the internal structure of the CUT means that it is not always possible to change their placement. As a result, additional input channels are required to handle highly specified scan slices, and thus the ability of this approach to significantly reduce the volume of test data becomes limited.
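The limitation of combinational decompressors can be made concrete with a small sketch of a four-input XOR network feeding eight scan chains. The wiring in `XOR_NETWORK` is a hypothetical example, not the network of Figure 2.4, and the exhaustive search over the 16 input combinations is used only because the example is tiny.

```python
from itertools import product

# Hypothetical wiring: scan chain k receives the XOR of the listed input channels.
XOR_NETWORK = [
    (0,), (1,), (2,), (3,),
    (0, 1), (1, 2), (2, 3), (0, 3),
]


def expand(inputs):
    """Expand 4 channel bits into one 8-bit scan slice through the XOR network."""
    return [sum(inputs[i] for i in taps) % 2 for taps in XOR_NETWORK]


def encode_slice(care_bits):
    """Return channel bits that match every care bit, or None if none exist.

    care_bits maps a scan-chain index to its required value; every other
    cell of the slice is a don't-care.
    """
    for inputs in product((0, 1), repeat=4):
        slice_bits = expand(inputs)
        if all(slice_bits[chain] == value for chain, value in care_bits.items()):
            return inputs
    return None


sparse = {4: 1, 6: 0}            # two care bits: easy to encode
clustered = {0: 1, 1: 1, 4: 1}   # chain 4 = chain 0 XOR chain 1, so this conflicts

print("sparse slice   ->", encode_slice(sparse))
print("clustered slice ->", encode_slice(clustered))
```

With no memory elements, each slice must be encoded from the channel bits of its own cycle alone, so a slice whose care bits hit linearly dependent chains cannot be encoded regardless of how empty the neighbouring slices are.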
The simplest subclass of combinational decompressors broadcasts an identical test vector to several scan chains, as presented in [64]. Illinois scan [26], [31], based on that scheme, is presented in Figure 2.5. Test data injected into adjacent scan chains are fed from the same input channel. ATPG, however, may require that the corresponding scan cells hold different, conflicting values. ATPG therefore either takes these constraints into account or, if appropriate test coverage cannot be assured because of them, the scan chains are concatenated and filled directly with no compression. Hence, the resultant compression depends on the ratio between the number of test patterns delivered in the broadcast and serial modes. Multiple techniques have been proposed to solve the problem of conflicting values in scan chains driven by the same channel. All of them require either dynamic or static reconfiguration of scan chains, in space (different scan chains connected in different sessions to the same channel), in time (different groups of scan chains), or a mix of the two.

Figure 2.4: Combinational decompressor

Figure 2.5: Illinois scan [26]

Code-based schemes

Many of the proposed test compression techniques use well-known data compression algorithms developed earlier within information theory. They reduce the statistical redundancy in test data, a result of structural dependency between faults in the circuit, to represent stimuli more concisely. The most popular are code-based schemes that replace repeating parts (symbols) in test data with code words. An on-chip decompression process restores the original data by converting code words back into the corresponding symbols. Among such schemes we can distinguish approaches that exploit run-length [10], [73], [37], [36], [54], statistical [73], [37], [36], [53], constructive [96], Golomb [9], and dictionary [66], [81], [88], [99] codes. A similar concept is used in packet-based encoding [93] and nine-coded compression [74], [89]. All those techniques require an on-chip decoder storing the code words. As a result, the hardware synthesized from test data characteristics may be very complex.

Static reseeding

The diverse distribution of care bits across scan slices limits the effectiveness of combinational decompressors. To make test cubes easier to encode, scan slices with a high number of care bits should be able to use some variables from adjacent slices that contain fewer care bits. To enable such operation, a decompressor must contain memory elements. Indeed, any type of generic sequential circuitry may be used as the decompressor. Linear feedback shift register (LFSR) coding was originally proposed in [56] (see Figure 2.6) and improved in a number of approaches [28], [57], [60], [80], [2]. The compressed test data, in the form of an initial seed for the LFSR, are loaded at the beginning of every test pattern application. In subsequent clock cycles, the LFSR moves from one state to the next while producing a pseudo-random sequence that matches the original test vector in all the care bit positions. A proof presented in [56] formulates a principle: to encode S specified bits with high probability (the exact bound is given in [56]), the LFSR needs at least S + 20 sequential elements. Hence, the maximum number of care bits specified in a test cube defines the minimal allowed size of the LFSR. Several enhancements [94], [59], [98] have been proposed to avoid this limitation, which is hard to satisfy in practice.
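The seed computation behind LFSR reseeding can be sketched as a system of linear equations over GF(2): every care bit of the test cube constrains an XOR of seed bits. The sketch below assumes a plain shift-register LFSR feeding a single scan chain; `WIDTH`, `TAPS`, and the test cubes are hypothetical values chosen for illustration and are not taken from [56] or from the thesis.

```python
WIDTH = 8             # hypothetical LFSR size
TAPS = (0, 2, 3, 4)   # hypothetical feedback taps


def output_expressions(n_cycles):
    """Symbolically simulate the LFSR.

    Returns one bitmask per clock cycle; bit i of a mask is set when seed
    bit i takes part in the XOR that forms the output in that cycle.
    """
    cells = [1 << i for i in range(WIDTH)]  # cell i initially holds seed bit i
    exprs = []
    for _ in range(n_cycles):
        exprs.append(cells[0])              # cell 0 drives the scan chain
        feedback = 0
        for tap in TAPS:
            feedback ^= cells[tap]
        cells = cells[1:] + [feedback]
    return exprs


def solve_gf2(equations):
    """Gauss-Jordan elimination over GF(2); returns a seed or None if inconsistent."""
    pivots = {}  # pivot bit -> (mask, value), kept in reduced form
    for mask, value in equations:
        for bit, (pmask, pvalue) in pivots.items():
            if (mask >> bit) & 1:
                mask, value = mask ^ pmask, value ^ pvalue
        if mask == 0:
            if value:
                return None                 # 0 = 1: the cube cannot be encoded
            continue
        new_bit = mask.bit_length() - 1
        for bit, (pmask, pvalue) in list(pivots.items()):
            if (pmask >> new_bit) & 1:
                pivots[bit] = (pmask ^ mask, pvalue ^ value)
        pivots[new_bit] = (mask, value)
    # Free variables are set to 0, so each pivot variable equals its row value.
    return sum(1 << bit for bit, (_, value) in pivots.items() if value)


def encode(cube):
    """Compute a seed for a test cube given as a string over '0', '1', 'x'."""
    exprs = output_expressions(len(cube))
    equations = [(exprs[t], int(c)) for t, c in enumerate(cube) if c != "x"]
    return solve_gf2(equations)


def expand(seed, n_cycles):
    """Run the concrete LFSR from `seed` and return the bits shifted into the chain."""
    bits, state = [], seed
    for _ in range(n_cycles):
        bits.append(state & 1)
        feedback = 0
        for tap in TAPS:
            feedback ^= (state >> tap) & 1
        state = (state >> 1) | (feedback << (WIDTH - 1))
    return bits


cube = "x1xxx0xxx1x1"
seed = encode(cube)
print("seed:", seed, "expanded:", expand(seed, len(cube)))
```

An inconsistent system, reported here as `None`, corresponds to a test cube that the given LFSR cannot produce; the probability of that outcome is what the register-size rule quoted from [56] is meant to keep low.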
The enhanced schemes allow one to use reduced-size LFSRs, assuming that only selected scan slices per test pattern are encoded.

Dynamic reseeding

Loading all the variables required to encode the care bits of a test pattern at once can be replaced by continuously providing variables to the LFSR. This compression paradigm, called dynamic reseeding, was proposed for EDT [78]. It delivers a few variables in every shift-in cycle instead of all of them in a single load. As a result, the size of the decompressor does not depend on the total number of specified bits, and the size of the LFSR may be reduced. Such a scheme made it possible to achieve unprecedented test data compression and test time reduction. Moreover, it was coupled with the standard scan and ATPG methodology, guaranteeing a very simple flow and wide applicability. A general scheme of using an EDT decompressor is presented in Figure 2.7. It consists of an r-bit ring generator [70] and an associated phase shifter driving s scan chains. Compressed test cubes are delivered to the decompressor through c external channels in a continuous manner, i.e., a new c-bit vector is injected into the ring generator every scan shift cycle, effectively moving this linear finite state machine from one state to another.

Figure 2.6: Sequential decompressor

Figure 2.7: EDT architecture [78]

For example, consider the decompressor and the corresponding phase shifter shown in Figure 2.8. The decompressor consists of a 16-bit ring generator implementing the primitive polynomial x^16 + x^10 + x^7 + x… The 10-output phase shifter comprises 3-input XOR gates connected to the outputs of memory elements as follows: 1 (9, 6, 8), 2 (13, 1, 4), 3 (3, 15, 10), 4 (14, 7, 0), 5 (2, 12, 5), 6 (9, 1, 12), 7 (14, 7, 6), 8 (15, 0, 8), 9 (11, 13, 10), 10 (11, 2, 3). The input variables are provided in pairs through four external input channels connected to the following inputs of ring generator stages: (9, 7), (11, 4), (14, 2), and (15, 1).

Figure 2.8: Example of 10-output 16-bit EDT decompressor

Further enhancements of this scheme have been proposed. For example, the low-power decompression schemes [13], [16], [17], [18] allow one to considerably reduce switching activity. X-masking schemes introduced in [79], [14] offer test compression exceeding the encoding limit of one variable per care bit and eliminate all unknown states from test responses.

2.3 System-on-a-chip testing

With on-chip test compression becoming the production test standard, its application in SoC designs requires additional infrastructure to transport test data between the SoC pins and the embedded cores. The industry is currently witnessing a major change in how data is transferred between different parts of an electronic system. With data rates exceeding 1 Gb/s, parallel I/O schemes are being replaced by high-speed serial links. This is driven by a need to meet new bandwidth requirements and simplify designs. However, as growth in high-speed I/O implies fewer digital pins, it becomes imperative to run SoC tests in a reduced pin count test environment. Moreover, cost-effective SoC test requires scheduling. Unfortunately, even the simplest existing test scheduling algorithms are time consuming. Indeed, there are many solutions that are milestones in SoC testing. To show their variety, previous work addressing test data transportation, test data scheduling, and optimization techniques directly related to this thesis is presented here.

2.3.1 Test architectures

Application of test stimuli in SoC designs requires specialized on-chip test architectures. In this section, the related work on test-architecture design, including wrapper and test access mechanism (TAM) design, is presented.

Test wrappers

A test wrapper forms an interface between a core and the SoC environment. This specialized instrumentation is responsible for core isolation and for test access in special test modes. We can distinguish the following two complementary parts of the design process: wrapper architecture selection and wrapper optimization. Several wrapper architectures have been proposed. TestShell [67] and Test Collar [92] form the basis of the standardized wrapper architecture IEEE 1500 [19], [90]. The conceptual scheme of the TestShell wrapper is presented in Figure 2.9. The TestShell wrapper architecture uses a dedicated multiplexer for each functional input and output. The input multiplexer controls the application of test stimuli and functional data, whereas the output multiplexer is responsible for interconnect test and observation of the produced responses. TestShell provides four control modes: normal core operation is available in the functional mode, the core is under test in the test mode, the interconnect test mode allows testing of the glue logic between cores, and the bypass mode assures transparent data transfer through the TestShell wrapper. Wrapper design optimization aims at clustering scannable elements to minimize the length of the longest wrapper chain and thus the test application time [22], [42], [100].
Several approaches complying with this demand have been presented in [68], [41]. The wrapper design algorithm proposed in [41] sorts scan chains in descending order of length. Next, it assigns each scan chain to a wrapper chain such that the length of the resulting chain is as close as possible to, but does not exceed, the length of the current longest wrapper chain. If no appropriate wrapper chain can be found, the current scan chain is assigned to the shortest wrapper chain. Finally, wrapper cells are assigned to the created wrapper chains.
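The scan chain assignment steps described above can be sketched as follows. This is a minimal reading of the prose, not the published algorithm of [41]: the function name, the tie-breaking rules, and the example scan chain lengths are assumptions, and the final assignment of wrapper cells is omitted.

```python
def design_wrapper(scan_chain_lengths, n_wrapper_chains):
    """Distribute internal scan chains over wrapper chains.

    Returns (assignment, lengths), where assignment[i] lists the scan chain
    lengths placed on wrapper chain i and lengths[i] is their sum.
    """
    assignment = [[] for _ in range(n_wrapper_chains)]
    lengths = [0] * n_wrapper_chains

    # Step 1: process scan chains from the longest to the shortest.
    for chain in sorted(scan_chain_lengths, reverse=True):
        longest = max(lengths)
        # Step 2: wrapper chains that stay within the current longest length.
        fits = [i for i in range(n_wrapper_chains) if lengths[i] + chain <= longest]
        if fits:
            # Choose the one that ends up closest to the current longest length.
            target = max(fits, key=lambda i: lengths[i] + chain)
        else:
            # Step 3: nothing fits, so extend the shortest wrapper chain.
            target = min(range(n_wrapper_chains), key=lambda i: lengths[i])
        assignment[target].append(chain)
        lengths[target] += chain
    return assignment, lengths


# Hypothetical core with six internal scan chains and three wrapper chains.
assignment, lengths = design_wrapper([8, 6, 5, 4, 3, 2], 3)
print(assignment, "longest wrapper chain:", max(lengths))
```

The length of the longest wrapper chain determines the number of shift cycles per pattern, which is why the heuristic tries to fill shorter chains up to, but not beyond, the current maximum before it lets any chain grow.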

Figure 2.9: TestShell [67]

Test access mechanism design

A test access mechanism (TAM) is a bidirectional communication architecture used to transport test stimuli from the SoC pins to the embedded cores and test responses from the embedded cores to the SoC pins. Several TAM architectures have been proposed [1], [27], [38], [40], [67], [92]. Based on how they access an embedded core, TAM architectures can be divided into functional and dedicated groups.

A functional TAM takes advantage of the native SoC bus infrastructure, which may be used during the test process. It reuses the functional connections in the SoC as a TAM to reduce the number of additional wires required to build a test communication infrastructure. An extended AMBA specification [3], which contains an interface controller and a test harness mimicking a test wrapper, allows test data transfer in addition to normal core communication. The Reuse of Addressable System Bus for SOC Testing (RASBuS) [35] scheme exploits on-chip microprocessors to test the cores and the functional bus infrastructure for test transportation. Applying these schemes is justified as long as they offer sufficient resources for test while the hardware overhead of adapting the SoC infrastructure is negligible.

The second group of TAMs comprises dedicated schemes, which are more popular because they are easier to integrate within the SoC framework and provide the test technique in use with customized resources. The direct access test scheme [38] (see Figure 2.10a) replaces both the wrapper and TAM infrastructures because it creates a network of direct links between SoC pins and cores. This simple technique guarantees the shortest possible test application time. However, such an approach is not applicable in practice because cutting-edge SoC designs require considerably more test channels than can be allocated. Moreover, the total number of additional wires corresponds to the total number of SoC pins and results in a large wiring overhead.

Figure 2.10: Test access mechanism architectures: (a) direct access, (b) multiplexing, (c) Daisychain, (d) distributed, (e) Test bus, (f) flexible-width, (g) TestRail

Three elementary architectures that solve the problem of a limited number of I/O pins were proposed in [1]. The multiplexing architecture presented in Figure 2.10b consists of a single TAM connecting all cores. It provides access to one core at a time, and therefore all cores are tested sequentially. As a result, the total test time corresponds to the sum of the test times of all individual cores. The Daisychain architecture extends the functionality of the multiplexing architecture by allowing multiple cores to be tested at the same time. As presented in Figure 2.10c, all core wrapper chains are connected through a bypass structure into long chains. In the distributed architecture (Figure 2.10d), each core has its own dedicated TAM, and all cores are tested in parallel. The sum of the individual TAM widths is the full TAM width of the system. The overall test application time for the system is given by the core with the longest test application time. These three dedicated TAM architectures solve the TAM problem; however, they do not provide the ability to reduce test application time through more flexible test scheduling.
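The two test time rules stated above (a sum for the multiplexing architecture, a maximum for the distributed one) can be compared with a short sketch. The per-core time model, the core parameters, and the width partition are hypothetical assumptions made only for this illustration; they do not come from [1] or from the thesis.

```python
def core_test_time(core, width):
    """Hypothetical model: each pattern needs ceil(flops / width) shift cycles
    plus one capture cycle."""
    flops, patterns = core
    shift_cycles = (flops + width - 1) // width
    return patterns * (shift_cycles + 1)


def multiplexing_time(cores, total_width):
    """One core at a time, each using the full TAM width: times add up."""
    return sum(core_test_time(core, total_width) for core in cores)


def distributed_time(cores, widths):
    """Every core has its own TAM slice: the slowest core sets the test time."""
    return max(core_test_time(core, w) for core, w in zip(cores, widths))


# Hypothetical cores given as (scan flip-flops, test patterns).
cores = [(1200, 100), (600, 50), (300, 200), (900, 20)]
total_width = 8
widths = (4, 1, 2, 1)   # one possible partition of the 8 TAM wires

print("multiplexing:", multiplexing_time(cores, total_width))
print("distributed: ", distributed_time(cores, widths))
```

Which architecture wins depends on how well the width partition matches the cores' demands, which is one reason the architectures discussed next allow the TAM wires to be regrouped more flexibly.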

30 14 Chapter 2. State of the art TAM architectures that support new features in test scheduling have been proposed in [92], [67], and [40]. The Test bus proposed in [92] and illustrated in Figure 2.10e can be seen as a combination of the multiplexing and distributed architectures. The TestRail architecture presented in [67] (see Figure 2.10g) is, in turn, a combination of the Daisychain and the distributed architectures. The Test bus and TestRail architectures support more flexible scheduling alternatives since the total number of TAM wires can be partitioned into several TAM channel subsets. SoC testing using such architectures is, however, limited due to the cores assigned to a TAM that are connected to all wires of that TAM and a single tested core consumes the entire bandwidth of this TAM part. A flexible-width architecture, proposed in [40], allows cores to be connected in a flexible way to the TAM wires, as illustrated in Figure 2.10f. In consequence, each TAM wire is treated as a separate unit which increases the flexibility of a test scheduler. The customized architecture, however, potentially leads to an irregular organization of the test-data in the tester memory, and thus additional test control may be required Test scheduling Testing of SoC designs must consider many key factors. The most important are: common test resources with a limited access, complex controlling of wrapper and TAM architectures, bandwidth demands exceeding available resources and power dissipation. All these constraints and aims of optimization necessitate systematic methods to assure costeffective usage of the available ATE bandwidth. Test scheduling minimizes a defined fitness function considering all imposed limitations. Channel resources, test application time and amount of heat dissipated during the test procedure constitute the main targets of optimization. 
As a result, it delivers a test agenda for each core, determining time slots associated with the assigned test resources and satisfying the given constraints. A number of algorithms have been proposed in the literature [25]. Test scheduling in SoC testing exploits such well-known techniques as bin packing (BP) [32], [33], integer linear programming (ILP) [11], [72], mixed integer linear programming (MILP) [7], simulated annealing (SA) [103], tabu search [87], etc. The conditions that decide when available resources may be freed by one task and allocated to another are the most important criterion for classifying scheduling algorithms. Based on that, scheduling techniques can be divided into the following three main categories: non-partitioned, partitioned, and preemptive testing. Examples illustrating these approaches are presented in Figure 2.11.

Figure 2.11: Test scheduling techniques: (a) non-partitioned, (b) partitioned, (c) preemptive (ATE channels versus test time)

Assume that a test scheduling algorithm optimizes test application time while the available channels are limited by the maximum number of TAM wires. Figure 2.11a shows a non-partitioned (session-based) technique, which divides the test application time (TAT) into test sessions. It assumes that each test is assigned to a single session and, as a result, a new test cannot start until all tests in the previous session are completed. Such a scheduling technique was applied in [102], [12]. However, the diversity of test time requirements leads to idle times (black holes) when no core is tested and results in a long test application time. The idle test resources can be utilized by a partitioned technique that schedules a test as soon as test resources are available. Figure 2.11b shows that this more flexible approach improves test application time. It is worth noting, however, that such a system requires a more sophisticated test controller that can initiate the test of a core at an arbitrary time. Such a test scheduling technique was presented in [8] and [71]. Further test schedule optimization, presented in [39], [61], [63], was achieved by using preemptive test scheduling, illustrated in Figure 2.11c. This technique allows one to divide a single core test into partial tests that are delivered independently. For example, test T2 is preempted and allocated to two separate time-resource slots (T2A and T2B). However, preemptive test scheduling is not applicable to all types of tests, and thus not all cores can be tested in such a way.
In particular, cores that require continuous handling must be tested in a single time slot. Moreover, frequent test switching requires delivering extra control data and a more advanced test controller. Hence, a scheduling algorithm has to trade off the benefits of test preemption against the control data overhead.
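The difference between non-partitioned (session-based) and partitioned scheduling can be sketched as follows. This is only an illustration under simplifying assumptions, not any of the cited algorithms: each test is a hypothetical (TAM width, test time) pair, sessions are formed by a simple first-fit rule, and the partitioned scheduler starts tests in list order as soon as enough wires are free. Preemption is not modeled.

```python
import heapq


def session_based_time(tests, total_wires):
    """First-fit packing into sessions; a session lasts as long as its longest test."""
    sessions = []  # each entry: [wires_in_use, longest_test_time]
    for width, time in tests:
        if width > total_wires:
            raise ValueError("test needs more wires than the TAM provides")
        for session in sessions:
            if session[0] + width <= total_wires:
                session[0] += width
                session[1] = max(session[1], time)
                break
        else:
            # No existing session has room, so open a new one.
            sessions.append([width, time])
    return sum(longest for _, longest in sessions)


def partitioned_time(tests, total_wires):
    """Start each test, in list order, as soon as enough wires are free."""
    running = []  # min-heap of (finish_time, width) for tests in progress
    free = total_wires
    now = 0
    makespan = 0
    for width, time in tests:
        if width > total_wires:
            raise ValueError("test needs more wires than the TAM provides")
        # Wait for running tests to finish until enough wires are released.
        while free < width:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        heapq.heappush(running, (now + time, width))
        free -= width
        makespan = max(makespan, now + time)
    return makespan


# Hypothetical tests: (TAM wires required, test time in cycles).
example_tests = [(2, 100), (2, 40), (2, 50), (2, 10)]

if __name__ == "__main__":
    print("session-based:", session_based_time(example_tests, 4))
    print("partitioned:", partitioned_time(example_tests, 4))
```

For the four hypothetical tests and four TAM wires, the session-based schedule takes 150 cycles because the second session cannot begin until the 100-cycle test ends, while the partitioned schedule finishes in 100 cycles by reusing wires as soon as the 40-cycle test completes.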

Test sharing and broadcasting

Test patterns produced by ATPG contain a high number of unspecified bits. Several test sharing schemes that take advantage of this property to increase the density of specified bits and reduce test data volume have been investigated in the literature [65], [55], [86]. The resulting patterns are then shared and broadcast to various cores. A scheme proposed in [65] treats all cores in a design as a single virtual core. This allows one to generate common test patterns for all cores. However, such an approach requires access to the gate-level core netlists, which may not be available to engineers integrating test stimuli for an entire SoC design. An alternative scheme presented in [55] generates common tests for multiple cores based on tests delivered by the core providers. An enhanced logic simulator is used to evaluate the fault coverage for the design, while a dedicated test pattern generator complements the test data set to ensure the same fault coverage. Certainly, these processes are typically too time-consuming to be considered in practical scenarios. In both schemes, the test stimuli are broadcast to all cores in the system, testing them in parallel, and the responses produced by each core are compacted using MISRs. In the method proposed in [86], test pattern overlapping is enhanced for cores with different scan lengths. As a result, cores with shorter scan chains must wait for those with the longest scan chains. The proposed methods reduce test cost factors. However, they increase dependencies between test patterns and significantly limit the flexibility of test scheduling and core assignment. Another approach to reducing the test resources required in SoC testing shares output channels.
The scheme presented in [85] introduces a specialized TAM for chips with multiple isolated identical cores, through which all the cores can be tested in parallel while requiring a similar number of ATE channels as a single core. The proposed pipelined architecture forms nonlinear equations on selected output pins that compress the outputs from the identical cores. Next, the proposed off-chip deductive solver allows one to reproduce the failure information for each core with no loss of diagnostic resolution. The solution presented in [21] provides a modular pipelined TAM that flexibly manages trade-offs between test time and diagnosis in every production phase to gain test throughput.

Test architectures and scheduling co-optimization

Co-optimizing the test-architecture design and test scheduling allows one to obtain a synergistic effect in reducing test time and TAM resources. The main reason lies in the trade-off (dependency) between the number of wrapper chains, test application time, and volume of test data. Results presented in [41] and illustrated in Figure 2.12 show that the test application time for a single core decreases in a staircase fashion (every stair begins with a Pareto-optimal point) as the number of wrapper chains increases. Selecting, independently for each core in a SoC, the wrapper chain configuration that ensures the best test schedule is very difficult. Hence, a test scheduler may provide guidance about the expected test architecture and in this way improve a test schedule with respect to a defined cost function.

Figure 2.12: Relationship between TAT and TAM width [41]

Several techniques addressing both test-architecture design and test scheduling have been proposed [23], [43], [62], [83], [100], [34]. The TR-Architect scheme proposed in [23] addresses both test-architecture design and test scheduling with an algorithm that minimizes TAT, and it indicates the lower bound on the TAT for a given TAM width. A technique proposed in [43] utilizes a flexible-width Test bus that can merge and fork channel resources between cores. As a result, different cores connected to the same TAM can utilize a different number of TAM wires at test application time. The preemptive test scheduling and TAM design are tightly integrated to minimize the test application time while considering test resource conflicts, precedence, and power consumption constraints. In addition, the relation between TAM width and test-data volume is explored. A scheme presented in [62] consists of an integrated SoC test framework responsible for test selection, TAM design, and floor planning, as well as a test scheduler that minimizes TAT and TAM width under given test resource constraints. A method presented in [83] introduces a scheme based on virtual TAMs that matches high-speed ATE channels to slower scan chains. The volume of test data remains the same. However, the effective number of virtual channels increases, which permits a reduction of the test application time.

The technique presented in [100] enhances this concept by introducing multi-frequency virtual TAMs, thus increasing the flexibility of channel packing. It was applied to both the Test bus and TestRail TAM architectures, considering TAM width and power constraints.

Test architectures with test data compression

The next group of techniques integrates test-data compression with test-scheduling-based solutions. These techniques exploit the benefits of test data compression presented in Section 2.2 to further reduce the most vital test cost factors (test application time and/or TAM width requirements). Several techniques co-optimizing test architecture design, test scheduling, and test-data compression have been proposed [24], [44], [97], [101]. The solution presented in [24] proposes a TAM architecture design approach driven by test-data compression. XOR logic is used to perform compression, which, in turn, is employed together with test time estimates to devise TAM design heuristics. The work in [44] investigates frequency-directed run-length codes [10] in a SoC environment by placing one decoder per TAM wire or one decoder per core and managing such an architecture with a bin-packing scheduling algorithm. The work in [97] explores an approach based on static reseeding. A single on-chip LFSR with an associated phase shifter concurrently provides test stimuli to multiple cores. The investigated scheduling algorithm reduces the test application time by maximizing the number of care bits that the seeds encode in each clock cycle. The approaches proposed in [24], [97] use a single shared decoder, which is designed at the SoC level. Hence, they require a large number of TAM wires to achieve an acceptable test application time for the system. A technique presented in [101] enhances concurrent core testing by sharing tests broadcast over a common one-wire TAM.
An on-chip scan chain disable signal generator or an on-chip decoder is used to restore the core test stimuli from the shared test. All cores connected to the TAM wire share the same test data. Thus, test stimuli have to be serialized and, as a result, the test application time may considerably inflate for large SoCs.

Figure 2.13: Test-architecture with test-data compression [97] (SoC with WIN, LFSR, phase shifter, cores C1 to C3, compactor, and WOUT)

2.4 Motivation

As shown in this chapter, the growth in popularity and complexity of SoC circuits can result in an increase of the most important test cost factors, such as ATE channel resources, TAT, and hardware overhead. Several co-optimized test architecture designs and test scheduling techniques have been proposed. For test architecture design and test scheduling with test data compression, several of the approaches described require a large number of TAM wires to achieve an acceptable test application time for the system. As most approaches use only one common decompressor, which is designed at the SoC level, the ability to trade off the achieved test-data compression against the test application time at the core level is significantly limited. Such an approach requires recomputing all test patterns whenever the SoC configuration changes. Moreover, the prior work does not provide an in-depth investigation of the trade-offs between compression at the core level and the TAT in the overall SoC test scheme. Clearly, the SoC test framework has to coherently target the compression ratio as well as dynamic bandwidth management, using a flexible TAM architecture that allows trade-offs between TAT and available channel resources. Another important issue is the area overhead introduced by the on-chip hardware necessary to decompress test data delivered by a tester. Finally, the architecture of a TAM should be scalable and circuit independent. The bandwidth management schemes that meet the above conditions are proposed in the thesis. They are designed to work in different environments and offer different reductions of test cost factors.

2.5 Thesis overview

The remaining part of the thesis consists of seven chapters. They introduce new solutions for SoC testing. The thesis structure, in accordance with a bottom-up approach, is illustrated in the accompanying figure. Following the introductory chapters, the remaining six chapters fall into three parts. Chapters 3 and 4 propose core-level bandwidth-aware techniques for test compression and test compaction. Chapters 5 to 7 introduce a test compression bandwidth management platform for SoC testing. They also present experimental results for several large SoC designs. These results show significant improvements in test compression and test application time.
Finally, Chapter 8 contains conclusions. A concise abstract of the following chapters is presented below.