
IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems

Built-In Self-Diagnostics for a NoC-Based Reconfigurable IC for


Dependable Beamforming Applications
Oscar J. Kuiken1, Xiao Zhang1, and Hans G. Kerkhoff 1
Testable Design and Test of Integrated Systems Group, CTIT, University of Twente
o.j.kuiken@student.utwente.nl, x.zhang@utwente.nl, H.G.Kerkhoff@utwente.nl
Abstract
Integrated circuits (ICs) targeting tomorrow's streaming applications are a fast-growing market. Applications such as beamforming require massive computing capability on a single chip as well as the flexibility to adapt to new algorithms. A reconfigurable IC with many processing tiles based on a Network-on-Chip architecture is considered ideal for such applications, as it balances efficiency and flexibility. Owing to the highly regular arrangement of the processing tiles connected by the communication network, new Design-for-X strategies can be adopted to improve the dependability of the reconfigurable IC. The communication network can be reused as a test-access mechanism. On-chip deterministic test-pattern generators can multicast test vectors through the network to the cores under test, and test responses from multiple cores can be collected and analyzed by a test-response evaluator. A faulty core, or functional parts of it, will be labeled and isolated from the system by re-mapping the computing resources, thus improving the dependability of the whole system.

1. Introduction
Beamforming is a technique for combining signals that are received at multiple antennas [1].
It is a streaming-data application that requires a massive amount of digital signal processing
and is applied in e.g. phased-array radars. For such applications, it is of key importance to
ensure the dependability of the platform on which the beamforming application runs.
The Montium reconfigurable processor (Processing Tile, PT) has been developed [2] to execute streaming-data applications, such as the beamforming task, with high efficiency. In this case, many Montium processors are interconnected by a Network-on-Chip (NoC) to form a General Stream Processor (GSP) on a single chip, while run-time mapping techniques ensure optimal allocation of the computing resources. As a result of the high regularity of the NoC as well as the PTs, new Built-In Self-Diagnostics methods can be used to improve the dependability of the beamforming system.

2. The Montium processor and ATE based tests


A Montium processor consists of identical processing tiles [2]. Before investigating Built-In Self-Diagnostics, it is necessary to determine whether the Montium core contains any undetectable or ATPG-untestable faults. This will provide a reference level for the

This research is conducted within the FP7 Cutting edge Reconfigurable ICs for Stream Processing (CRISP) project
(ICT-215881) supported by the European Commission.

1550-5774/08 $25.00 2008 IEEE


DOI 10.1109/DFT.2008.24

45

Authorized licensed use limited to: UNIVERSITEIT TWENTE. Downloaded on October 16, 2008 at 10:49 from IEEE Xplore. Restrictions apply.

performance of the on-chip test-pattern generator, as ATE-based test vectors are considered to have the highest possible quality; that is, they achieve the highest possible fault coverage using the smallest number of test vectors.
When testing the Montium and its network interface (NI), all memories were isolated from the Core-Under-Test (CUT) by means of boundary wrapper cells. The complete test design was carried out using Synopsys tools, including TetraMAX, in a standard way. All memories in the Montium core have their own BIST structures and are therefore not part of the task at hand, which is to determine the fault coverage of the Montium.
When looking only at stuck-at (SA) faults, TetraMAX achieves a fault coverage (FC) of 99.82%, using 2,746 test vectors, each with a length of 23,145 elements (Figure 1). The different fault classes are described in [3]. The total Design-for-X (X can be Test (T), Debug (D), or Dependability (DEP)) overhead is 17% [4]; this is explained by the large number of reconfiguration registers in the highly reconfigurable Montium.

Figure 1. TetraMAX results show that ATPG leads to a fault coverage of 99.82%

3. The use of the NoC as a test access mechanism


According to the IEEE 1500 standard testability method for embedded core-based ICs, dedicated test buses are required for embedded core-based testing [5]. As silicon test-area overhead should be minimized, it is a natural design methodology to reuse hardware resources for as many purposes as possible. Research on using a NoC as a test access mechanism (TAM) has shown that it is possible to combine the NoC with a modified wrapper design to perform ATE-based tests of embedded cores [6]. Hence, in our approach it was decided to reuse the NoC to replace the traditional test-access buses. Two important boundary conditions have to be met for the use of a NoC as a TAM:

- The NoC itself should preferably be fault-free or fault-tolerant, or the defective NoC segment should be known in advance.
- The NoC should be able to set up the Core-Under-Test (CUT) in test mode and to provide the CUT with communication from the Test Pattern Generator (TPG) and to the Test Response Evaluator (TRE), both discussed later.


Traditionally, a NoC consists of two parts: a router network and a set of network interfaces. The router network is responsible for routing data across the IC, while a network interface acts as a bridge between a router and a core. A fault-free NoC means that both the router network and the network interfaces are fault-free (Figure 2).

Figure 2. The PT and the NoC partitioning

With respect to the second condition, wrapper design techniques similar to those described in the IEEE 1500 standard will be used, which can put the processing tile in test mode when required. In test mode, the PT is isolated from its normal inputs. Instead, test vectors from the Test Pattern Generator are fed to the CUT, and its test responses are delivered to the Test Response Evaluator. When the PT is not in test mode, the wrapper should be transparent from a functional point of view.
To perform the on-chip self-diagnostics, a source generating the test patterns (Test Pattern Generator, TPG) and a sink collecting the test responses (Test Response Evaluator, TRE) are attached to the NoC in addition to the Cores-Under-Test (CUTs, namely the processing tiles). As shown in Figure 3, several PTs receive the test patterns at the same time, which facilitates the evaluation of the response data. It is assumed that identical cores always yield the same test response, so a faulty response can be discovered by a bit-by-bit comparison of the responses from several cores.
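The multi-comparison idea can be sketched in a few lines of Python. The helper below is our illustration, not the paper's hardware: given one response word per CUT, it returns the indices of the CUTs whose word deviates from the majority.

```python
# Sketch of the multi-comparison principle (hypothetical helper):
# identical fault-free cores must produce identical responses, so any
# deviating response word flags a faulty core.
from collections import Counter

def compare_responses(responses):
    """responses: list of response words, one per CUT.
    Returns the set of CUT indices whose response differs from the
    majority; an empty set means all CUTs agree."""
    counts = Counter(responses)
    majority, votes = counts.most_common(1)[0]
    if votes == len(responses):
        return set()                       # all equal: no fault observed
    if votes == 1:
        return set(range(len(responses)))  # all different: "all faulty"
    return {i for i, r in enumerate(responses) if r != majority}

# Example: CUT1 flips one bit in its response word.
print(compare_responses([0xA5A5, 0xA5A4, 0xA5A5]))  # {1}
```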
Figure 3. The output of the Test Pattern Generator (TPG) is fed to the different
CUTs; the outputs of the CUTs are sent to the Test Response Evaluator (TRE)

As Figure 4 shows, the TRE has two connections to the router network, while the TPG only
has one. All data coming from the TPG and going to the CUTs is always the same, as all CUTs
receive the same test vectors. This means that the multicast function in the routers enables the



TPG to spread the patterns over the network, so that the single connection to the router network provides adequate bandwidth for a relatively short test time. However, the data from the different CUTs to the TRE is by definition unique. Therefore, the required bandwidth from the routing network to the TRE is larger than that from the TPG to the routing network: several data streams enter the TRE at the same time, requiring more than one connection to the routing network.

Figure 4. NoC-based Built-In Self-Diagnostics overview (the network interfaces are


not shown)

An important issue is, of course, the relation between test-data volume, test time, actual application data traffic, and the bandwidth of the NoC [7, 8].
As indicated in Section 2, testing a processing tile requires 2,746 input vectors, each with a length of 23,145 elements. This results in more than 63 Mbit of test data volume, as well as more than 63 Mbit of response data volume; hence it is necessary to determine at an early stage whether the proposed test setup leads to acceptable test times and system availability. Therefore, a theoretical analysis has been carried out using parameter settings that would likely apply in the beamformer application. The parameters are provided in Table 1:
Table 1. Parameter settings used during test-time evaluation

Parameter                                                             Value
Speed of the processing tile and the router network                   100 MHz
Number of processing tiles to be simultaneously tested                3
Data width router network                                             16 bits
Availability of multicast function                                    yes
Number of connections from test pattern generator to router network   1
Number of connections from router network to evaluation unit          2
Using the settings in Table 1, the test time is defined as a function of the available bandwidth
on the path from TPG to CUTs (BW1) and of the available bandwidth from CUTs to TRE
(BW2). The outcome of this analysis is graphically shown in Figure 5.



The analysis shows that a minimum diagnosis time of 400 ms can be achieved for the complete PT if the full bandwidth is assigned to the paths from TPG to CUTs and from CUTs to TRE. When, for example, only 30% of the bandwidth is available, the test time increases to roughly one second. It is stressed that this is the total time needed to test all PTs under test at that time (in this case, three PTs). The order of magnitude of these test times is considered acceptable. Since it is estimated that the application can always spare 30% of the NoC bandwidth for testing purposes, no problems are foreseen with regard to the test time. The diagnosis times at higher resolution [3, 4] are only fractions of the above.


Figure 5. a) The time required to deliver patterns from TPG to the CUTs, b) the
time required for responses from the CUTs to TRE, c) the total required
test time as function of BW1 and BW2.

The GPP in Figure 4 denotes the General Purpose Processor, which is responsible for test control and for scheduling tests and normal applications on the system. Data from the Test Pattern Generator is routed to the different CUTs simultaneously. The CUTs route their response data to the Test Response Evaluator, which compares these responses. If all CUTs are fault-free, all responses should be the same; in this case, the TRE signals the GPP that all CUTs are fault-free. If one or more CUTs contain one or more faults, differences in the response vectors will occur, which will be detected by the TRE. Should all CUTs present a different set of response data, the TRE sends an all-faulty command to the GPP. If two CUTs (or more, when there are more than three CUTs) have the same set of response vectors, they are considered correct. If one or more CUTs differ from these correct CUTs, the TRE reports to the GPP which CUTs differ from one another.
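The GPP-side inference implied by these reports can be sketched as follows. The paper does not specify the GPP algorithm, so this helper is a hypothetical reconstruction: a CUT that agrees with at least one other CUT is assumed good, and the remaining CUTs are marked faulty for re-mapping.

```python
# Hypothetical GPP-side inference from the TRE's pairwise difference
# reports (the paper leaves this algorithm to the GPP software).

def infer_faulty(num_cuts, differing_pairs):
    """differing_pairs: set of (x, y) pairs the TRE reported as unequal.
    Returns the set of CUT indices considered faulty."""
    diff = set()
    for x, y in differing_pairs:
        diff.add((x, y))
        diff.add((y, x))
    # A CUT agreeing with at least one other CUT is assumed good.
    agrees = {c: any((c, o) not in diff
                     for o in range(num_cuts) if o != c)
              for c in range(num_cuts)}
    if not any(agrees.values()):
        return set(range(num_cuts))   # all responses unique: all faulty
    return {c for c, ok in agrees.items() if not ok}

# The scenario described above: CUT0 != CUT1 and CUT1 != CUT2, so
# CUT0 and CUT2 agree and CUT1 is the faulty one.
print(infer_faulty(3, {(0, 1), (1, 2)}))  # {1}
```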
In case the GPP is informed that some PTs are faulty, it is the task of the GPP to re-map the
software tasks to the correctly functioning PTs. Although this may reduce the available
computing resources and thus lead to a lower Quality of Service (QoS), the chip as a whole can
still be considered functionally correct, provided sufficient resource surplus is available.

4. Dependability issues, test response evaluator


The test response evaluator (TRE) has been constructed using a generic approach as much as possible. This enables a smooth fit of the TRE to any routing network. The generics used are: the NoC data width is sixteen, the number of virtual channels (VCs) that a physical link can hold is four, and the maximum number of router hops from the TRE to the GPP is eight. The number of bits required to describe one hop over a router is two, the number of physical links from the TRE to the routing network is two, and the maximum number of CUTs to be tested simultaneously is three. The width of the FIFOs that store the test


responses is 16 and, finally, the depth of these FIFOs is eight. As the TRE disturbs the regularity of the system, it was also investigated to what degree a PT could perform this job; this turned out not to be a favorable option. Figure 6 shows the setup of the TRE for the generic values provided above.
Figure 6. Example of the TRE in the case that there are two links to the router
network and three possible CUTs

Figure 7. Packet description of the different data streams as depicted in Figure 4.

Flits that enter the TRE are routed either to the reset and settings manager or to the crossbar. The reset and settings manager collects and stores virtually all settings in the design and resets the other blocks should the GPP request so. The crossbar connects any virtual channel from any physical link to any FIFO. As soon as at least one test response from every CUT is


collected in every respective FIFO, the compare unit examines whether these responses are identical. After this comparison, the FIFO storage space occupied by the responses can be freed, enabling the TRE to be implemented with relatively small FIFOs. Once all test data has been fed to the CUTs, the TPG generates a test-done signal. This signal is used to take the PTs out of test mode and is also passed to the TRE. Once the TRE has received the test-done signals from all CUTs, the control block sends the result of the test to the GPP, as depicted in the bottom part of Figure 7 (TRE to GPP). The details concerning the link that the TRE needs to set up should be stored by the GPP in the reset and settings manager.
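The compare-and-free behaviour can be modelled abstractly in Python. This is our abstraction of the FIFO handling, not the RTL: responses are compared as soon as one word from every CUT has been buffered, after which the buffered words are discarded, which is why shallow FIFOs (depth eight in the TRE above) suffice.

```python
# Abstract model of the compare-and-free scheme: one FIFO per CUT,
# compared and drained one word at a time so buffers stay small.
from collections import deque

def stream_compare(streams, depth=8):
    """streams: list of iterables yielding response words, one per CUT.
    Yields True/False per compared word; buffers never exceed `depth`."""
    fifos = [deque(maxlen=depth) for _ in streams]
    its = [iter(s) for s in streams]
    while True:
        for f, it in zip(fifos, its):
            word = next(it, None)
            if word is None:          # stream exhausted: test done
                return
            f.append(word)
        # One word from every CUT is available: compare, then free it.
        words = [f.popleft() for f in fifos]
        yield all(w == words[0] for w in words)

# Correct response is 1, 2, 4 (as in the simulation example below);
# the third CUT delivers a wrong second word.
results = list(stream_compare([[1, 2, 4], [1, 2, 4], [1, 3, 4]]))
print(results)  # [True, False, True]
```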

Figure 8. Response data is written to the TRE, the TRE performs the comparison, after which the result of this comparison is sent to the GPP.

Extensive simulations have been performed to verify the functional correctness of the design, using various settings for the generics mentioned. For clarity, a basic simulation result is shown in Figure 8, showing only the signals required to understand the TRE behavior.
Figure 8 shows how a configured TRE is used to compare the nine response vectors from the three CUTs (the configuration of the crossbar corresponds to the links shown in Figure 6). For clarity, in this simulation all response data from one CUT is sent sequentially, while in a real-time situation it would be sent in parallel. Also for clarity, a very small response set is sent to the TRE: every CUT sends only three response vectors before generating the test-done signal. The correct response should be 1, 2, 4, although the TRE is not aware of this beforehand. After the TRE receives the three test-done signals, it sets up a link to the GPP. This is accomplished by sending header flits (flit type 01) to one of the routers connected to the TRE. The content of the header flits describes the direction in which the link should be set up. In this case, as Figure 4 shows, the outcome of the comparison by the TRE should be routed to the GPP via PL0 of the TRE, after which the link is set up by taking three router hops to the left (shown in Figure 8 by the data_out bus being a decimal 3 for three clock cycles), one hop to the bottom (the data_out bus being a decimal 2 for one clock cycle), and again one hop to the left. Since the TRE received neither all-equal nor all-unique responses, it sends a packet to the GPP indicating that there was a difference in responses between CUT0 and CUT1 (timeslot a in Figure 8, bottom), as well as between CUT1 and CUT2 (timeslot b in Figure 8, bottom). This is coded on the data_out bus in the following way:
{data_out(15:14) = difference = 10} & {data_out(13:7) = CUTx = 0} & {data_out(6:0) = CUTy = 1}
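This bit-field layout can be captured in a small encoder/decoder pair. The field positions follow the description above; the helper names themselves are ours.

```python
# Encoding/decoding of the TRE's difference-report word: bits 15:14
# carry the code (10 = "difference"), bits 13:7 CUTx, bits 6:0 CUTy.

DIFFERENCE = 0b10

def encode_diff(cut_x, cut_y):
    assert 0 <= cut_x < 128 and 0 <= cut_y < 128   # 7-bit fields
    return (DIFFERENCE << 14) | (cut_x << 7) | cut_y

def decode_diff(word):
    return (word >> 14) & 0b11, (word >> 7) & 0x7F, word & 0x7F

# Timeslot a in Figure 8: a difference between CUT0 and CUT1.
w = encode_diff(0, 1)
print(f"{w:016b}")     # 1000000000000001
print(decode_diff(w))  # (2, 0, 1)
```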


Timeslot c indicates that the TRE finished the difference description, after which a tail flit (flit type 10) breaks down the connection. Since the GPP now knows that the responses from CUT0 and CUT1 differ, as well as those from CUT1 and CUT2, while CUT0 and CUT2 apparently do not differ, it can conclude that CUT1 is faulty. This causes a remapping of resources by the software so that the faulty core, CUT1, is no longer used.
Although the diagnostic outcome of this test is coarse (it only specifies at Montium tile level which device has (a) stuck-at fault(s)), the process has been extended after failure detection. In this case, the faulty Montium is again provided with additional (diagnostic) test vectors from the TPG, but now the routing is provided at sub-module level (ALU, memory, interconnect segment). The routines (which modules and when) are pre-stored in the settings manager. Again, a fault-free Montium is taken as reference. The power of this approach lies in pre-silicon test-vector generation and evaluation, and in the highly reconfigurable TPG. It is noted that the PT cores have been enhanced with DfX hardware and test vectors to enable diagnostics at the two other levels [3, 4].

5. Dependability issues, the test-pattern generator


Essential in terms of (in-field) dependability, and hence Built-In Self-Diagnostics, is the availability of an extremely flexible (reconfigurable) on-chip test-pattern generator. As the TPG disturbs the regularity of the system, it was also investigated to what degree a PT could perform this job; this turned out not to be a viable option. The initial stage of our design contains elements of earlier work [9-14].

Figure 9. A simulation result of the highly reconfigurable Test-Pattern


Generator (two seeds)

Among the reconfigurable parameters of the TPG are multiple seeds, multiple polynomials, and freely chosen lengths [9, 11]. Furthermore, bit-flipping/fixing techniques are applied to approximate (diagnostic) deterministic tests, with the fault coverage of the Montium and its sub-modules as parameter (DLBIST). To reduce hardware and software requirements, a new nested LFSR-based approach is used, e.g. for generating the multiple seeds. It has more flexibility than DBIST [13, 14], where deterministic patterns are encoded in the seeds only, although it is not an automatic design process. Both use TetraMAX as a basis.
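The multi-seed, multi-polynomial idea can be illustrated with a simple behavioural LFSR. This sketch is our illustration of the principle, not the paper's nested-LFSR design; the tap positions and seeds below are arbitrary examples.

```python
# Behavioural sketch of a reconfigurable LFSR pattern source: seed,
# tap set (polynomial), register width, and sequence length are all
# run-time parameters, mirroring the TPG's reconfigurability.

def lfsr_patterns(seed, taps, width, length):
    """Simple LFSR: XOR the bits at `taps`, shift left, insert the
    feedback bit at the LSB; emit `length` words of `width` bits."""
    state = seed
    mask = (1 << width) - 1
    for _ in range(length):
        yield state
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & mask

# Two different seeds with the same tap set, as in the two-seed
# simulation of Figure 9 (seeds and taps here are arbitrary).
for seed in (0xACE1, 0x1234):
    print([f"{w:04X}" for w in lfsr_patterns(seed, [15, 13, 12, 10], 16, 4)])
```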
Essential in this approach is the availability of all required test vectors, for the Montium as well as its (repeating) sub-modules, for diagnostic purposes. This is a pre-silicon generation and simulation effort. For this purpose, a mix of TetraMAX and ModelSim simulations has been used to determine sufficient test coverage for the Montium and sub-module diagnostics. The key parameters obtained from these simulations are stored on-chip. As an initial example, Figure 9 shows part of the generated test vectors, in hexadecimal values, for a required polynomial with the required length and seed. Also shown are the clock, the reset signal (RST), the start signal (START), and finally the output (q).
The speed of the TPG is expected to exceed the speed capability of the NoC and Montium
tiles in 90nm CMOS technology (100 MHz). The TPG has to work in direct cooperation with
the GPP and TRE. Potentially, by adaptation, also transition faults can be handled [15], but this
option has not been used yet.


6. Conclusions
In this paper we have shown an effective approach to increasing the dependability of a system with many identical, highly reconfigurable processing tiles interconnected by a NoC. Its application, a beamformer, requires a high degree of dependability. Instead of using traditional (IEEE 1500) test buses, we use the NoC as a test access mechanism and incorporate internal test-pattern generation as well as test-response evaluation as infrastructural IPs. The diagnostic results at three hierarchical levels, based on the concept of multi-comparison, can be used to re-map the computing resources. Initial designs and simulations of the TPG and TRE have been carried out. The test times required for this diagnosis are small, assuming some part (30%) of the NoC bandwidth, thereby guaranteeing the required availability.

References
[1] B.D. van Veen and K.M. Buckley, "Beamforming: a versatile approach to spatial filtering," ASSP Magazine,
IEEE, vol.5, no.2, pp.4-24, April 1988.
[2] P.M. Heysters, G.J.M. Smit, E. Molenkamp, and G.K. Rauwerda, Flexibility of the Montium Coarse-Grained
Reconfigurable Processing Tile, 4th PROGRESS Symposium on Embedded Systems, Nieuwegein, the
Netherlands, 2003, pp. 102-108.
[3] H.G. Kerkhoff and J.J.M Huijts, Testing of a Highly Reconfigurable Processor Core for Dependable Data
Streaming Applications, IEEE DELTA Conference, Hong Kong, PRC, January 2008, pp. 38-44.
[4] H.G. Kerkhoff, O. Kuiken, X. Zhang, Increasing SoC Dependability via Know Good Tile Testing, IEEE DSN
Conference, Anchorage, USA, June 2008, pp. G6-G8.
[5] IEEE Computer Society, IEEE Standard Testability Method for Embedded Core-based Integrated Circuits,
2005.
[6] A.M. Amory, K. Goossens, E.J. Marinissen, M. Lubaszewski and F. Moraes, Wrapper design for the reuse of a
bus, network-on-chip, or other functional interconnect as test access mechanism, IET Comput. Digit. Tech.,
January 2007, pp. 197-206.
[7] F.A. Hussin, T. Yoneda and H. Fujiwara, Area Overhead and Test-Time Co-Operation through NoC
Bandwidth Sharing, IEEE ATS, Fukuoka, Japan, 2007, pp. 459-462.
[8] E.J. Marinissen et al., Bandwidth Analysis for Reusing Functional Interconnect as Test Access Mechanism,
Proceedings of ETS, session 2B, Verbania, Italy, May 2008.
[9] G. Kiefer, H. Vranken, E.J. Marinissen and H-J. Wunderlich, Application of Deterministic Logic BIST on
Industrial Circuits, JETTA, vol. 17, issue 3/4, 2001, pp. 351-362.
[10] K. Chakrabarty, B. Murray and V. Iyengar, Deterministic Built-In Test-Pattern Generation for High-Performance Circuits using Twisted-Ring Counters, IEEE Trans. on VLSI Systems, Vol. 8, no. 5, 2000, pp.
633-636.
[11] V. Gherman et al., Efficient pattern mapping for deterministic logic BIST, ITC04, Paper 3.1, Charlotte (NC)
USA, 2004, pp. 48-56.
[12] G. Papa, T. Garbolino, F. Novak, A. Hlawiczka, Deterministic Test Pattern Generator Design With Genetic
Algorithm Approach, Journal of Electrical Engineering, Vol. 58, No. 3, 2007, pp. 121-127.
[13] M. Chandramouli, How to Implement Deterministic Logic BIST, Synopsys Compiler Article, January 10th
2003.
[14] DBIST Userguide, Synopsys, version Z-2007.03, March 2007, 193 pages.
[15] V. Gherman et al., Deterministic logic BIST for transition fault testing, in Proc. ETS, Southampton, UK, 2006,
pp. 225-231.
