Professional Documents
Culture Documents
1, JANUARY 2002 61
Abstract—This work explores the data reuse properties of full- bandwidth problem may be solved by careful scheduling of the
search block-matching (FSBM) for motion estimation (ME) and data sequence and setting up appropriate on-chip memories.
associated architecture designs, as well as memory bandwidth re- Meanwhile, a well-designed ME architecture reduces the
quirements. Memory bandwidth in high-quality video is a major
bottleneck to designing an implementable architecture because of requirements of memory bandwidth and the I/O pin count, but
large frame size and search range. First, memory bandwidth in ME still maintains high hardware efficiency.
is analyzed and the problem is solved by exploring data reuse. Four Many algorithms and hardware have been dedicated to ME.
levels are defined according to the degree of data reuse for pre- The full-search block-matching (FSBM) algorithm is one these
vious frame access. With the highest level of data reuse, one-access algorithms, and searches through every candidate location to
for frame pixels is achieved. A scheduling strategy is also applied
to data reuse of ME architecture designs and a seven-type classifi- find the best match. To reduce the computational complexity
cation system is developed that can accommodate most published of FSBM, various fast algorithms were proposed [2]–[7], [19]
ME architectures. This classification can simplify the work of de- that searched fewer points. However, these fast algorithms
signers in designing more cost-effective ME architectures, while si- suffer from irregular control and lower video quality, and thus
multaneously minimizing memory bandwidth. Finally, a FSBM ar- FSBM remains widespread owing to its simplicity, regularity,
chitecture suitable for high quality HDTV video with a minimum
memory bandwidth feature is proposed. Our architecture is able to and superior video quality. Many architectures have been
achieve 100% hardware efficiency while preserving minimum I/O proposed for implementing FSBM. These architectures use
pin count, low local memory size, and bandwidth. systolic array [8]–[11], tree structure [12], or 1-D structure
Index Terms—Architecture, block matching, memory manage- [14] to solve computational problem by providing enough PEs.
ment, motion estimation, video coding. Other architectures include a programmable architecture [31]
and the integral projection-matching criterion architecture [32].
However, all of these architectures provide limited solutions to
I. INTRODUCTION overcoming the memory bandwidth bottlenecks of high-quality
video ME such as HDTV. Direct implementation is unrealistic
M OTION estimation has been widely employed in the
H.26x, MPEG-1, -2, and -4 video standards [1] to
exploit the temporal redundancies inherent within image
without exploiting the locality or tricky designs. For example,
the MPEG2 MP@ML format requires a memory bandwidth of
frames. Block matching is the most popular method for tens of gigabytes per second, while the HDTV format with a
motion estimation (ME). Numerous pixel-by-pixel difference large search range requires a terabytes per second bandwidth.
operations are central to the block matching algorithms and This work only considers uni-directional ME, and bandwidth
result in high computation complexity and huge memory becomes even higher for bi-directional predictions. Redun-
bandwidth. Owing to the rapid progress of VLSI technology, dancy relief is the solution to the huge bandwidth problem
computation complexity requirements can easily be fulfilled by because many redundant accesses exist in the memory traffic
multiple PE’s architecture, even for large frame sizes and frame of ME.
rate. However, without enough data, these PEs can hardly be This work provides a data reuse analysis of FSBM to remove
fully utilized and simply result in increased silicon area. The the redundancies caused by retrieving the same pixel multiple
data rate is limited by available memory bandwidth. There- times. Regarding strength of data reuse, the present analysis ex-
fore, straightforward implementation of ME is an I/O bound tracts four data-reuse levels from the FSBM algorithm. Further-
problem rather than a computation-bound one. The memory more, a redundancy access factor is provided to measure the
degree of redundancy. Weaker data reuse level has a higher
value and demands a higher memory bandwidth. Meanwhile, a
Manuscript received December 15, 1999; revised October 22, 2001. This stronger data reuse level has a smaller and demands lower
work was supported by the National Science Council of the Republic of China memory bandwidth. The memory bandwidth of FSBM is a func-
under Contract NSC87-2215-E009-039. This paper was recommended by As-
sociate Editor I. Ahmad. tion of frame rate, frame size, search range, and . The former
J.-C. Tuan is with the iCreate Technologies Corporation, Science-Based three factors are generally fixed in video compression applica-
Industrial Park, Hsinchu 300, Taiwan, R.O.C. (e-mail: tuan@twins.ee.nctu. tions. Only can be altered to accommodate the bandwidth re-
edu.tw).
T.-S. Chang is with the Global Unichip Corporation, Science-based Industrial quirement. Actually, varies with the data reuse level; hence,
Park, Hsinchu 300, Taiwan, R.O.C. (e-mail: tschang@globalunichip.com). the data-reuse level is important when designing a FSBM archi-
C.-W. Jen is with the Department of Electronics Engineering, National tecture. Besides bandwidth and data-reuse level, local memory
Chiao Tung University, Hsinchu 300, Taiwan, R.O.C. (e-mail: cwjen@
twins.ee.nctu.edu.tw). analysis is also addressed. The local memory is set up for storing
Publisher Item Identifier S 1051-8215(02)01122-9. already loaded data. Local memory size increases or shrinks
1051–8215/02$17.00 © 2002 IEEE
62 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 1, JANUARY 2002
with the data-reuse level, with a weaker data-reuse level con- Section IV proposes the one-access architecture. Conclusions
suming less local memory and a stronger data-reuse level re- are finally drawn in Section V.
quiring more. The relationship among local memory size, re-
quired memory bandwidth, and different data-reuse levels is an- II. ANALYSIS OF DATA REUSE AND MEMORY ACCESSES IN
alyzed herein. FSBM architecture designers can thus trade-off MOTION ESTIMATION
among these quantitative constraints and make suitable deci-
sions. A redundancy access factor is defined to evaluate the ef-
ficiency of memory accesses used in FSBM
The existing literature has largely overlooked the classifica-
tion of FSBM arrays. [15] presented a classification derived total number of memory accesses in task
from dependency graph (DG) transformation. This classifica- (1)
pixel count in task
tion divides FSBM arrays into six types, each classified ac-
cording to the distribution of the subtraction-absolute-accumu- The memory bandwidth of FSBM then becomes
lation (SAA) operation. However, such a classification reveals
few implementation issues. Consequently, a new architectural
classification for FSBM array is also proposed herein. The novel
(2)
classification derives seven types of data-reuse-efficient archi-
tectures that are mutually transformable using simple transfor-
represents the average access count per pixel in FSBM
mation operators “delay and rotate”. Designers can easily iden-
processing, with a smaller value indicating greater reduction of
tify implementation considerations from the novel classifica-
memory bandwidth. When the of an architecture equals one,
tion. Each type directly identifies the interconnection structure
the architecture is said to be one-access and minimum memory
between PEs. The novel classification is based on three data
bandwidth is achieved.
sequence scheduling strategies, each representing an efficient
The FSBM algorithm can be expressed using C-like lan-
data-reuse scheme. Since data reuse is involved, local memory
guage, as in Fig. 1.
is established, and the local memory issue can also be easily un-
In the above algorithm, and represent the horizontal
derstood using the novel classification. The accumulation issue
and vertical search displacement, respectively. Meanwhile,
and its latency of different types are also addressed herein, and
is the pixel intensity of the current frame and
finally, the classified architectures are compared.
is the pixel intensity of the previous frame.
A one-access FSBM architecture for high quality video For one frame ME, as described in Fig. 1, the operation count
format is proposed herein. One-access indicates that each can range up to
pixel is accessed just once, as in data reuse Level D and its in terms of SAA operation, as indicated in (3). Each SSA
. The memory bandwidth requirement is minimized operation requires two pixels. A total of
when one-access is achieved, successfully overcoming the pixels are thus accessed from the frame memory. However,
huge bandwidth problem in ME of high quality video. Features each frame contains only unique pixels. The excessive
of the proposed design are then discussed. frame memory access results from each pixel having multiple
The rest of this paper is organized as follows. Section II an- accesses. Redundancy access is measured using the equation
alyzes memory accesses and data reuse in ME. Section III then
discusses scheduling strategies and presents a classification of
data-reuse-efficient ME architectures. To maximize data reuse,
TUAN et al.: DATA REUSE AND MEMORY BANDWIDTH ANALYSIS FOR FSBM VLSI ARCHITECTURE 63
TABLE I
BANDWIDTH REQUIREMENTS AND DATA REUSE LEVEL OF VARIOUS VIDEO FORMATS
design. Table I presents the bandwidth requirements for various motion estimate frame , frame serves as the current
common video formats corresponding to different data-reuse frame and frame serves as the previous frame. When
levels using (2). The required bandwidth easily reaches tens advancing to frame , frame servers as the
of gigabytes per second for high-quality video with weak data current frame and frame becomes the previous frame. No-
reuse, but falls significantly to sub-gigabytes per second when tably, frame is actually accessed twice, and this reuse can
a higher data-reuse level is adopted. This phenomenon provides be defined as Level E, the ultimate data-reuse level. However,
a good reason for considering stronger data-reuse level to lower achieving data reuse Level E involves a significant penalty, i.e.,
frame memory bandwidth, or using low-cost memory modules storing at least one frame. Furthermore, bi-directional ME, such
or narrow bus width. Wide bus makes the chip package more as that used in MPEG2, has to store more than one frame, be-
costly and increases problems with skew. Furthermore, from a cause this data-reuse level predicts across several frames. Level
systematic point of view, memory traffic introduced by other E implementation is impractical, so only Levels A–D are con-
components such as variable-length-codec, DCT/IDCT, and so
sidered herein.
on must also be considered. Additionally, the motion-estima-
tion process only uses the luminance pixel data, while the other
components also use chrominance data thus increasing the im- D. Local Memory Size Derivation
portance of strong data-reuse level. To realize data reuse, local memory holding already loaded
All four of the above reuse levels are intra-frame data reuse. data for further access is required. The local memory required
There is still another inter-frame reuse level. For example, if we for the current frame, i.e., for storing one current block is .
Fig. 3. Relationship between bandwidth requirements and local memory size for various video formats and data-reuse levels.
The local memory size for the previous frame equals the size A. Scheduling Strategies
of the overlapped region in each data-reuse level. Table II lists Three scheduling strategies for the array processor are de-
the derived size, which theoretically should be the upper limit rived according to the property of data reuse. The three strategies
of a novel design. Actually however, few overheads exist for chosen have an important common property: heavy exploiting
preloading or scheduling data. A properly scheduled data se- of the data reuse of FSBM in the spatial or temporal domains.
quence can eliminate the need for local memory. Section IV This property removes redundancies in FSBM and helps to de-
discusses the scheduling. Meanwhile, Fig. 3 presents a quanti- velop a data-reuse-efficient architecture. Most novel FSBM ar-
tative example of local memory size versus bandwidth require- chitectures, both 1- and 2-D, are covered by these three sched-
ments for different video formats. Each curve consists of five uling strategies, which are defined as follows.
data points, with the upper four points representing Levels D Concurrent Mode: This strategy is a data broadcast scheme,
(upper most) to A (lower most), while the lowest point repre- i.e., the same pixel is used simultaneously by multiple PEs.
sents no data reuse. The numbers labeling the different video Using the DG projection procedure mentioned in [27], [28] to
formats are the search ranges, . Meanwhile, the two demonstrate this strategy, the six-loop algorithm in Fig. 1 is
dashed vertical lines, 1.6 GB/s for RDRAM (16-bit bus) and 800 folded into two loops: ,
MB/s for PC100 (64-bit bus), indicate the peak bandwidth of as shown in Fig. 4. Then, processor basis and
current commodity memories [34]. Even the RDRAM scheme scheduling vector are chosen. defines the
was found to be unable to satisfy the Level B and A bandwidth mapping from DG nodes to PEs and defines the execution
requirements of high-resolution video formats. Meanwhile, the order of nodes following the procedure in [27], [28]. From
PC100 scheme cannot afford a bi-directional Level C require- a hardware perspective, defines the number of PEs and
ment. The analysis presented in Fig. 3 outlines a good tradeoff defines the PE interconnection and execution order. This
decision. projection defines the data-reuse property of current block pixel
66 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 1, JANUARY 2002
Fig. 4. Motion-estimation algorithm folded into two nested loops. The MV calculation is omitted.
Fig. 5. Concurrent mode example for current block data, with the indexes being simplified from folded loops: N = SR = SR = 4.
Fig. 6. Data sequence example involving three scheduling modes. (a) Concurrent mode. (b) Pipeline mode. (c) Store mode.
as the concurrent mode, the data sequence example of Pipeline Mode: The projection parameters for this strategy
which is displayed in Fig. 5. Owing to the broadcasting scheme are and , and the data-reuse property
of the concurrent mode, local memory and interconnection of current block pixel is defined as the pipeline mode.
between PEs is unnecessary, regardless of the local memory Fig. 6(b) illustrates the data sequence example. For this strategy
size derived in Table II. each PE must propagate its currently used data to the neigh-
TUAN et al.: DATA REUSE AND MEMORY BANDWIDTH ANALYSIS FOR FSBM VLSI ARCHITECTURE 67
TABLE III
CLASSIFICATION OF SEVEN ARRAY TYPES
Fig. 9. Four kinds of accumulation structure. (a) Self accumulation. (b) Propa-
gating accumulation. (c) Circulating accumulation. (d) Parallel accumulation.
TABLE VI
COMPARISON OF 2-D ARCHITECTURES
Fig. 10. Overview of the proposed CPSS type architecture and the interconnections of PE ports Pin, Pout, SA, and SUM.
A. Architecture Overview
The architecture proposed herein is a 2-D design with
PEs. For ease of illustration, and were se- Fig. 12. Data path of current block data and the interconnections of ports AinT,
lected. Consequently, total of 16 PEs exist, and the search range AoutB, AinL, and AoutR.
is 8. Fig. 10 presents an overview of the one-access architec-
ture. The overview shows the three top module I/O ports and block data are loaded in a propagating fashion. The data path
their connections, and reveals the two data-loading paths of the of MAD calculation is also displayed, and is a parallel adder
search area and the current block data. The search area data are tree design. At the output of the final stage of the adder tree, the
loaded into PEs in series using a shared bus, and the current best MV Finder determines the best motion vector. Notably, the
70 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 12, NO. 1, JANUARY 2002
architecture is illustrated in a 1-D scheme for simplicity; it is to avoid idle cycles. The current block data are pumped in par-
actually a 2-D architecture. allel from PRs into ARs, and the search area data are prepared
Fig. 11 shows the PE structure. Each PE contains one ac- concurrently from the SAMs. [29] lists further transaction de-
tive register (AR) for holding active current block data, and one tails.
preload register (PR) for propagating the current block data of
the next macroblock. The data provided by PRs fill the time gap C. Performance and Comparison
between successive macroblocks. The multiplexer (MUX) se- The proposed ME architecture is characterized by low
lects data from either AR or PR. A search area memory (SAM) memory bandwidth, low latency, minimum I/O port, 100%
stores necessary search area data for data reuse. Finally, an SSA hardware efficiency, and simple interconnections among PEs.
unit (SSAU) calculates the absolute difference between current Memory bandwidth is reduced to just twice the luminance bit
block and search area data. Eight ports exist per PE, and Figs. 10 rate of video sequence by using the data reuse Level D. The
and 12 illustrate the interconnections of these ports. Notably, the memory bandwidth of ME is only influenced by frame rate
adder tree is ignored in Fig. 12 for simplicity. and frame size, regardless of changes in search range. The low
latency delay is due to the parallel loading of both the search
B. ME Operation area and current block data. The theoretical latency for the
The partial MAD calculation (3) is distributed among all PEs processing of each macroblock is .
in the novel architecture. The SSAU is responsible for calcu- For and (a
lating the absolute difference between current block pixel and real video compression case), this architecture can complete
search area pixel, and thus it defines which of the pixels are each macroblock in just 1024 clock cycles, plus the overhead
paired with PEs and how. The search area data are distributed introduced from the adder tree and MV finder. This architecture
among SAMs. If the search area pixel are labeled , has only two I/O ports, one for loading current block data and
where , then will be distributed the other for loading search area data. Avoiding transition gap
to SAM of # % . Consequently, SAMs differ between macroblocks is essential to 100% utilization. Both
in size, and their size can be calculated according to the pixel current block and search area data should be prepared before
distribution just mentioned. Total on-chip SAM size can then next macroblock to avoid transition gap. The current block
be calculated by summing all SAMs. The search area data are data are prepared by PRs and the search area data are prepared
static within the SAM and will never be passed to another SAM, by SAMs. The data preparation also applies to macroblocks
making the connection between SSAU and SAM very simple. at frame boundaries, thus the one-access architecture is able
Fig.13 lists the sequence of the search area data, while Fig. 14 to be 100% utilized via a suitable data loading sequence.
lists the sequence of the corresponding current block data. Ex- The detailed transitions were provided in reference [29]. The
amining the data sequence in Fig. 14 and the circulating path interconnection linking PEs is simple and local.
in Fig. 12 easily reveals the horizontal and vertical circulation Tables V and VII compare the proposed architecture and other
property of the current block. architectures under different parameters. The comparisons use
At the beginning of a macroblock processing, both current the HDTV video format
block data and search area data must be prepared immediately to show the impact of large frame size. The moving distance
TUAN et al.: DATA REUSE AND MEMORY BANDWIDTH ANALYSIS FOR FSBM VLSI ARCHITECTURE 71
[5] B. M. Wang, “An efficient VLSI architecture of hierarchical block [28] Y.-K. Chen and S. Y. Kung, “A systolic design methodology with appli-
matching algorithm in HDTV applications,” in Int. Workshop on HDTV, cation to full-search block-matching architectures,” J. VLSI Signal Pro-
Oct. 1993. cessing, pp. 51–77, 1998.
[6] B. M. Wang, J. C. Yen, and S. Chang, “Zero waiting cycle hierarchical [29] J.-C. Tuan and C.-W. Jen, “An architecture of full-search block matching
block matching algorithm and its array architectures,” IEEE Trans. Cir- for minimum memory bandwidth requirement,” in Proc. IEEE 8th Great
cuits Syst. Video Technol., vol. 4, pp. 18–28, Feb. 1994. Lake Symp. VLSI, 1998, pp. 152–156.
[7] R. Srinivasan and K. R. Rao, “Predictive coding based on efficient mo- [30] Y.-K. Lai and L.-G. Chen, “A flexible data-interlacing architecture for
tion estimation,” IEEE Trans. Commun., vol. COM-33, pp. 888–896, full-search block-matching algorithm,” in IEEE Int. Conf. Application-
Aug. 1985. Specific Systems, Architectures and Processors, 1997, pp. 96–104.
[8] L. De Vos and M. Stegherr, “Parameterizable VLSI architectures for the [31] H.-D. Lin et al., “A 14-Gops programmable motion estimator for H.26X
full-search block-matching algorithm,” IEEE Trans. Circuits Syst., vol. video coding,” IEEE J. Solid-State Circuits, vol. 31, pp. 1742–1750,
36, pp. 1309–1316, Oct. 1989. Nov. 1996.
[9] T. Komarek and P. Pirsch, “Array architectures for block matching algo- [32] S. B. Pan et al., “VLSI architectures for block matching algorithms using
rithms,” IEEE Trans. Circuits Syst., vol. 36, pp. 1302–1308, Oct. 1989. systolic arrays,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, pp.
[10] C.-H. Hsieh and T.-P. Lin, “VLSI architecture for block-matching mo- 67–73, Feb. 1996.
tion estimation algorithm,” IEEE Trans. Circuits Syst. Video Technol., [33] J. You and S. U. Lee, “High throughput, scalable VLSI architecture for
vol. 2, pp. 169–175, June 1992. block matching motion estimation,” J. VLSI Signal Processing, vol. 19,
[11] H. Yeo and Y. H. Hu, “A novel modular systolic array architecture for pp. 39–50, 1998.
full-search block matching motion estimation,” IEEE Trans. Circuits [34] Next-Generation DRAM Comparison (1999). [Online]. Available:
Syst. Video Technol., vol. 5, pp. 407–416, Oct. 1995. http://www.micron.com
[12] Y.-K. Lai et al., “A novel scaleable architecture with memory inter-
leaving organization for full search block-matching algorithm,” Proc.
1997 IEEE Int. Symp. Circuits and Systems, pp. 1229–1232, June 1997.
[13] Y.-S. Jehng et al., “An efficient and simple VLSI tree architecture for
motion estimation algorithms,” IEEE Trans. Signal Processing, vol. 41, Jen-Chieh Tuan received the B.S. degree in
pp. 889–900, Feb. 1993. electrical engineering from National Tsing Hua
[14] K.-M. Yang et al., “A family of VLSI designs for the motion compensa- University in 1995 and the M.S. degree in electronics
tion block-matching algorithm,” IEEE Trans. Circuits Syst., vol. 36, pp. engineering from National Chiao Tung University,
1317–1325, Oct. 1989. Hsinchu, Taiwan, R.O.C., in 1997.
[15] S. Chang et al., “Scaleable array architecture design for full search He is currently a Logic Design Engineer of iCreate
block matching,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, pp. Technologies Corporation, Hsinchu, Taiwan, R.O.C.
332–343, Aug. 1995.
[16] S. Dutta and W. Wolf, “A flexible parallel architecture adapted to
block-matching motion-estimation algorithms,” IEEE Trans. Circuits
Syst. Video Technol., vol. 6, pp. 74–86, Feb. 1996.
[17] G. Gupta and C. Chakrabarti, “Architectures for hierarchical and other
block matching algorithms,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 5, pp. 477–489, Dec. 1995.
[18] E. Iwata and T. Yamazaki, “An LSI architecture for block-matching mo-
Tian-Sheuan Chang (S’93–M’00) received the
tion estimation algorithm considering chrominance signal,” VLSI Signal
B.S., M.S., and Ph.D. degrees in electronics en-
Processing VIII, pp. 421–430, 1995.
gineering from National Chiao Tung University,
[19] B. Liu and A. Zaccarin, “New fast algorithms for the estimation of block
Hsinchu, Taiwan, R.O.C., in 1993, 1995, and 1999,
motion vectors,” IEEE Trans. Circuits Syst. Video Technol., vol. 3, pp.
respectively.
148–157, Apr. 1993.
He is currently a Principle Engineer at Global
[20] M.-C. Lu and C.-Y. Lee, “Semi-systolic array based motion estimation
Unichip Corporation, Hsinchu, Taiwan, R.O.C. His
processor design,” IEEE Proc. 1997 Int. Symp. Circuits and Systems, pp.
research interests include VLSI design, digital signal
3299–3302, 1995.
processing, and computer architecture.
[21] V.-G. Moshnyaga and K. Tamaru, “A memory efficient array architec-
ture for full-search block matching algorithm,” 1997 Proc. IEEE Int.
Symp. Circuits and Systems, pp. 4109–4112, 1997.
[22] S.-H. Nam and M.-K. Lee, “Flexible VLSI architecture of motion esti-
mator for video image compression,” IEEE Trans. Circuits Syst. II, vol.
43, pp. 467–470, June 1996.
[23] B. Natarajan et al., “Low-complexity algorithm and architecture for Chein-Wei Jen received the B.S. degree from
block-based motion estimation via one-bit transforms,” IEEE Trans. National Chiao Tung University, Hsinchu, Taiwan,
Circuits Syst. Video Technol., vol. 6, pp. 3244–3247, 1995. R.O.C., in 1970, the M.S. degree from Stanford
[24] P. A. Ruetz et al., “A high-performance full-motion video compression University, Stanford, CA, in 1977, and the Ph.D.
chip set,” IEEE Trans. Circuits Syst. Video Technol., vol. 2, pp. 111–122, degree from National Chiao Tung University in
June 1992. 1983.
[25] C. Sanz et al., “VLSI architecture for motion estimation using the block- He is currently with the Department of Electronics
matching algorithm,” in Proc. Eur. Design and Test Conf., 1996, pp. Engineering and the Institute of Electronics, Na-
310–314. tional Chiao Tung University, as a Professor. During
[26] C.-L. Wang et al., “A high-throughput, flexible VLSI architecture for 1985–1986, he was with the University of Southern
motion estimation,” 1995 Proc. IEEE Int. Symp. Circuits and Systems, California, Los Angeles, as a Visiting Researcher.
pp. 3289–3295, 1995. His current research interests include VLSI design, digital signal processing,
[27] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice- processor architecture, and design automation. He has held six patents and
Hall, 1988. published over 40 journal papers and 90 conference papers in these areas.