
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 4, APRIL 2012

A High Performance Video Transform Engine by Using Space-Time Scheduling Strategy


Yuan-Ho Chen, Student Member, IEEE, and Tsin-Yuan Chang, Member, IEEE
Abstract—In this paper, a spatial and time scheduling strategy, called the space-time scheduling (STS) strategy, that achieves high image resolutions in real-time systems is proposed. The proposed spatial scheduling strategy includes the choice of the distributed arithmetic (DA)-precision bit length and a hardware sharing architecture that reduces the hardware cost, while the proposed time scheduling strategy arranges the different dimensional computations so that the first-dimensional and second-dimensional transformations are calculated simultaneously in a single 1-D discrete cosine transform (DCT) core, reaching a hardware utilization of 100%. The DA-precision bit length is chosen as 9 bits instead of the traditional 12 bits based on test image simulations. In addition, the proposed hardware sharing architecture employs a binary signed-digit DA architecture that enables the arithmetic resources to be shared during the four time slots. For these reasons, the proposed 2-D DCT core achieves high accuracy with a small area and a high throughput rate, and it is verified with a chip implementation in a TSMC 0.18-μm 1P6M CMOS process. Measurement results show that the core has a latency of 84 clock cycles with a 52 dB peak-signal-to-noise ratio and operates at 167 MHz with 15.8 K gate counts.

Index Terms—Binary signed-digit (BSD), discrete cosine transform (DCT), distributed arithmetic (DA)-based, space-time scheduling (STS).

I. INTRODUCTION

DISCRETE COSINE TRANSFORM (DCT) is a widely used transform engine for image and video compression applications [1]. In recent years, the development of visual media has progressed towards high-resolution specifications, such as high-definition television (HDTV). Consequently, a high-accuracy and high-throughput-rate component is needed to meet future specifications. In addition, in order to reduce the manufacturing cost of the integrated circuit (IC), a low hardware cost design is also required. Therefore, a high performance video transform engine that combines high accuracy, a small area, and a high throughput rate is desired for VLSI designs. The 2-D DCT core design has often been implemented using either direct [2]-[4] or indirect [5]-[13] methods. The direct

Manuscript received February 22, 2010; revised August 09, 2010 and December 30, 2010; accepted January 24, 2011. Date of publication February 28, 2011; date of current version March 12, 2012. This work was supported in part by the Chip Implementation Center and the National Science Council under Project CIC T18-98C-12a and Project NSC 99-2221-E-007-119, respectively. The authors are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: yhchen@larc.ee.nthu.edu.tw). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2011.2110620

methods include fast algorithms that reduce the computational complexity by mapping and rotating the DCT transform into the complex domain [3]. However, the structure of the DCT is not as regular as that of the fast Fourier transform (FFT). A regular 2-D DCT core using the direct method, which derives the 2-D shifted FFT (SFFT) from the 2-D DCT algorithm and shares the hardware among the FFT/IFFT/2-D DCT computations, is implemented in [4]. On the other hand, the 2-D DCT cores using the indirect method are implemented based on transpose memory and have the following two structures: 1) two 1-D DCT cores and one transpose memory (TMEM) and 2) a single 1-D DCT core and one TMEM. In the first structure, the 2-D DCT core has a high throughput rate, because the two 1-D DCT cores compute the transformations simultaneously [5]-[9]. In order to reduce the area overhead, a single 1-D DCT core is applied in the second structure [10]-[13], thereby saving hardware cost. Hsia et al. [10] present a 2-D DCT with a single 1-D DCT core and one custom transposed memory that can transpose data sequences using serial input and parallel output and requires only a small area, but the speed is reduced because the single DCT core performs the first-dimensional (1st-D) and second-dimensional (2nd-D) transformations at different times. Therefore, Guo et al. use two parallel paths to deal with the reduction of the throughput rate in the second structure [11]. However, additional input buffers are needed in order to temporarily store the input data during the 2nd-D transformation. Madisetti et al. present a hardware sharing technique to implement the 2-D DCT core using a single 1-D DCT core [13]. Their 1-D DCT core can calculate the 1st-D and 2nd-D DCT computations simultaneously, and the throughput reaches 100 Mpels/s. However, the multiplier-based processing element in [13] requires a large amount of circuit area. As a result, a tradeoff is clearly required between the hardware cost and the speed. Tumeo et al. find a balance point between the required area and the speed for 2-D DCT designs and present a multiplier-based pipelined fast 2-D DCT accelerator implemented on a field-programmable gate array (FPGA) platform [14]. Additionally, several 2-D DCT designs are implemented on FPGAs for fast verification [14]-[17]. In this paper, an 8 × 8 2-D DCT core that consists of a single 1-D DCT core and one TMEM is proposed using a strategy known as space-time scheduling (STS). Based on accuracy simulations of the DA-based binary signed-digit (BSD) representation, a 9-bit DA precision is chosen in order to meet the peak-signal-to-noise-ratio (PSNR) requirements outlined in previous works [5]-[7]. Furthermore, the proposed DCT core is designed with a hardware sharing architecture using BSD DA-based computation so as to reduce the area cost. The arithmetic units share the hardware resources during the four time slots in the DCT



core design. A 100% hardware utilization is also achieved by using the proposed time scheduling strategy, and the 1st-D and 2nd-D DCT computations can be calculated at the same time. Therefore, a high performance transform engine with high accuracy, a small area, and a high throughput rate has been achieved in this research.

This paper is organized as follows. In Section II, the mathematical derivation of the BSD format distributed arithmetic is given. The proposed 8 × 8 2-D DCT architecture, including an analysis of the coefficient bits, the hardware sharing architecture, and the proposed time scheduling strategy, is discussed in Section III. Comparisons and discussions are presented in Section IV, and conclusions are drawn in Section V.

II. MATHEMATICAL DERIVATION OF BSD FORMAT DISTRIBUTED ARITHMETIC

The inner product for a general matrix multiplication-and-accumulation can be written as follows:

Y = \sum_{i=0}^{N-1} C_i X_i    (1)

where C_i is a fixed coefficient and X_i is the input data. Assume that the coefficient C_i is an M-bit BSD number. Equation (1) can then be expressed as follows:

Y = \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} c_{i,j} 2^{-j} X_i = \sum_{j=0}^{M-1} 2^{-j} Y_j    (2)

where

Y_j = \sum_{i=0}^{N-1} c_{i,j} X_i    (3)

for c_{i,j} \in \{1, 0, -1\}, i = 0, 1, ..., N-1, and j = 0, 1, ..., M-1. The matrix [c_{i,j}] is called the DA coefficient matrix. In (2), Y_j can be calculated by adding or subtracting the X_i with non-zero c_{i,j}, and then the transform output Y can be obtained by shifting and adding every non-zero Y_j. Thus, the inner product computation in (1) can be implemented by using shifters and adders instead of multipliers. Therefore, a smaller area can be achieved by using the BSD DA-based architecture.

III. PROPOSED 8 × 8 2-D DCT CORE DESIGN

This section introduces the proposed 8 × 8 2-D DCT core implementation. The 2-D DCT is defined as

Z_{u,v} = \frac{1}{4} c_u c_v \sum_{x=0}^{7} \sum_{y=0}^{7} f_{x,y} \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}

for u, v = 0, 1, ..., 7. Consider the 1-D 8-point DCT first:

Z_u = \frac{1}{2} c_u \sum_{x=0}^{7} f_x \cos\frac{(2x+1)u\pi}{16}    (4)

for u = 0, 1, ..., 7, where c_u = 1/\sqrt{2} for u = 0, c_u = 1 for non-zero u, and where f_x and Z_u denote the input data and the transform output, respectively. By neglecting the scaling factor 1/2, the 1-D 8-point DCT in (4) can be divided into even and odd parts, Z_e and Z_o, as listed in (5) and (6), respectively:

Z_e = [Z_0, Z_2, Z_4, Z_6]^T = T_e [a_0, a_1, a_2, a_3]^T    (5)

Z_o = [Z_1, Z_3, Z_5, Z_7]^T = T_o [b_0, b_1, b_2, b_3]^T    (6)

where, with C_k = \cos(k\pi/16),

T_e = \begin{bmatrix} C_4 & C_4 & C_4 & C_4 \\ C_2 & C_6 & -C_6 & -C_2 \\ C_4 & -C_4 & -C_4 & C_4 \\ C_6 & -C_2 & C_2 & -C_6 \end{bmatrix}    (7)

T_o = \begin{bmatrix} C_1 & C_3 & C_5 & C_7 \\ C_3 & -C_7 & -C_1 & -C_5 \\ C_5 & -C_1 & C_7 & C_3 \\ C_7 & -C_5 & C_3 & -C_1 \end{bmatrix}    (8)

and the butterfly inputs are

a_x = f_x + f_{7-x},  b_x = f_x - f_{7-x},  x = 0, 1, 2, 3.    (9)

For the 2-D DCT core implementation, three main strategies for increasing the hardware and time utilization, together called the space-time scheduling (STS) strategy, are proposed and listed as follows:
1) find the DA-precision bit length for the BSD representation that achieves the system accuracy requirements;
2) share the hardware resources in time to reduce the area cost;
3) plan a 100% hardware utilization by using the time scheduling strategy.

A. Analysis of the Coefficient Bits

The seven internal coefficients C_1 to C_7 for the 2-D DCT transformation are expressed as BSD representations in order to save computation time and hardware cost, as well as to achieve the PSNR requirements. The system PSNR is defined [1] as follows:

PSNR = 10 \log_{10} \frac{255^2}{MSE}    (10)

where MSE is the mean-square-error between the original image and the reconstructed image over all pixels.
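Before the bit-length analysis, the BSD DA computation of (1)-(3) can be made concrete with a small Python sketch: the coefficients of one 1-D DCT row are quantized to M signed digits and the inner product is then evaluated with shifts and adds only. The greedy digit-extraction routine and the variable names are assumptions for illustration, not the authors' RTL.

```python
import math

M = 9  # DA-precision bit length (the paper selects 9 bits)

def to_signed_digits(c, m=M):
    """Greedily express c (|c| < 2) as sum of d_j * 2**-j with d_j in {-1, 0, 1}.
    One possible BSD-style quantization, used here only for illustration."""
    digits, residue = [], c
    for j in range(m):
        step = 2.0 ** -j
        if residue >= step / 2:
            digits.append((j, +1)); residue -= step
        elif residue <= -step / 2:
            digits.append((j, -1)); residue += step
        else:
            digits.append((j, 0))
    return digits  # list of (shift amount j, digit d_j)

def da_dot(coeffs, data, m=M):
    """Inner product sum(C_i * X_i) of (1), computed via (2)-(3) with shift-and-add only."""
    y = 0.0
    for c, x in zip(coeffs, data):
        for j, d in to_signed_digits(c, m):
            if d:                            # add or subtract the shifted input
                y += d * (x / (1 << j))      # x * 2**-j, a right shift in hardware
    return y

# 1-D 8-point DCT row for u = 2, with the 1/2 scaling factor neglected as in (5)-(6)
u = 2
row = [math.cos((2 * x + 1) * u * math.pi / 16) for x in range(8)]
f = [10, 22, 31, 40, 41, 30, 21, 11]          # example input samples
print(da_dot(row, f), sum(c * x for c, x in zip(row, f)))  # DA result vs. exact value
```

With M = 9 the two printed values agree to within the quantization error that the next subsection trades off against hardware cost.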


Fig. 1. Simulation for system PSNR in different BSD bit length expressions.

TABLE I NORMALIZED HW COST IN DIFFERENT BSD BIT LENGTH EXPRESSIONS

TABLE II 9-BIT BSD EXPRESSIONS FOR DCT COEFFICIENTS

First, the BSD coefficients for different bit-length expressions use the popular test images Lena, Peppers, and Baboon to simulate the system PSNR. After the original test image pixels are passed through the forward and inverse DCT transforms, the system PSNR versus the different BSD bit-length expressions is illustrated in Fig. 1(a), and the reconstructed images, i.e., the images that have passed through the forward and inverse DCT for different BSD bit lengths, are shown in Fig. 1(b). The normalized hardware (HW) cost of the processing element (PE) for the different BSD bit-length expressions is shown in Table I. The HW cost is synthesized for area at each BSD bit length using the Synopsys Design Compiler with the Artisan TSMC 0.18-μm standard cell library. The gap in accuracy between the 8-bit and the 9-bit BSD expressions is shown in Fig. 1(a), and the HW cost increases as the BSD bit length increases. Consequently, the 9-bit BSD expression of the coefficients is chosen as the tradeoff between the hardware cost and the system accuracy. The BSD expressions and the accuracy of the DCT coefficients are listed in Table II, where an overlined digit in Table II indicates the BSD digit -1. The coefficient accuracy is defined as the signal-to-quantization-noise ratio (SQNR) shown in (11):

SQNR = 10 \log_{10} \frac{P_{signal}}{P_{noise}}    (11)

where P_{signal} and P_{noise} are the power of the desired floating-point signal and of the quantization noise, respectively.
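A minimal sketch of the bit-length experiment behind Fig. 1 and Table I: the 8 × 8 DCT matrix is quantized to a given number of fractional bits, a stream of test blocks is forward-transformed with the quantized matrix and inverse-transformed in double precision, and the PSNR of (10) is reported. Loading of the actual Lena/Peppers/Baboon images and the exact BSD digit sets are omitted; random blocks and plain fixed-point rounding stand in for them here.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal n-point DCT-II matrix."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def psnr(ref, rec):
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)          # definition (10)

def simulate(bits, blocks):
    c = dct_matrix()
    cq = np.round(c * 2 ** bits) / 2 ** bits        # coefficients kept to `bits` fractional bits
    rec = []
    for b in blocks:
        z = cq @ b @ cq.T                           # forward 2-D DCT with quantized coefficients
        rec.append(np.clip(np.round(c.T @ z @ c), 0, 255))  # ideal double-precision inverse DCT
    return psnr(np.array(blocks), np.array(rec))

rng = np.random.default_rng(0)
blocks = [rng.integers(0, 256, (8, 8)) for _ in range(512)]  # stand-in for test-image blocks
for bits in range(6, 13):
    print(bits, round(simulate(bits, blocks), 2))
```

Sweeping the bit width this way reproduces the qualitative trend of Fig. 1(a): accuracy saturates once the coefficient precision passes roughly 9 bits, while the hardware cost in Table I keeps growing.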

B. Hardware Sharing Strategy

All modules in the 1-D DCT core, including the modified two-input butterfly (MBF2), the Pre-Reorder, the process element even (PEE), the process element odd (PEO), and the Post-Reorder, share the hardware resources in order to reduce the area cost. Moreover, the 8 × 8 2-D DCT core is implemented using a single 1-D DCT core and one TMEM. The architecture of the 2-D DCT core is described in the following sections.

1) Modified Butterfly Module: Equation (9) is easily implemented using a two-input butterfly module [18] called BF2. In general, the adder and subtracter in BF2 have a hardware utilization of only 50%. In order to enable the hardware resources to be shared, additional multiplexers and Reorder Registers are added to the proposed MBF2 module, as illustrated in Fig. 2. The Reorder Registers consist of four word registers that use control signals to select the input data and enable signals to output or hold the data. Similar to BF2, the operation of the proposed MBF2 has an eight-clock-cycle period. In the first four cycles, the 1st-D input data shift into Reorder Registers 1, and the 2nd-D input data execute the addition and subtraction operations. In the next four cycles, the operations of the 1st-D and 2nd-D data are exchanged: the 1st-D data calculate a_x and b_x in (9) using the adder and subtracter, the results are fed into the next stage, and the 2nd-D input data shift into Reorder Registers 2. Therefore, the adder and subtracter in MBF2 perform the 1st-D and 2nd-D operations in turn to achieve 100% hardware utilization.

Fig. 2. Proposed MBF2 architecture.

2) Pre-Reorder Module: In (5), the even part transform output Z_e can be modified so that one output element is evaluated in each of four time slots (12). Also, the odd part transform output Z_o in (6) can be rewritten in the same time-multiplexed form (13). From (12) and (13), the transform output can be calculated using four separate time slots in order to share the hardware resources. Hence, (12) and (13) can be expressed as (14) and (15), respectively, where the time slot t = 0, 1, 2, 3, the coefficient elements are taken from the DCT coefficients C_1 to C_7, and the transform inputs a_x and b_x come from the MBF2 stage. Equations (14) and (15) for the four time slots are summarized in Table III. The proposed Pre-Reorder module orders the inputs depending on each time slot as indicated in Table III, so the hardware resources can be shared by reordering the inputs during the four separate time slots.

TABLE III
COEFFICIENTS AND INPUT DATA FROM MBF2 FOR THE FOUR SEPARATE TIME SLOT OUTPUTS

3) Process Element Module: For DA-based computation, the even part and odd part transformations can be implemented using the PEE and PEO, respectively. From Table III, the even part transformation can be expanded into DA-based computation formats and can share the hardware resources at the bit level. The coefficient vector has two combinations, one for the even and one for the odd transform outputs. Table IV expresses (14) in the bit-level formulation.

TABLE IV
EVEN PART TRANSFORMATION TABLE

Using the given input data, the even part transform output needs only three adders, a Data-Distributed module, and one even part adder-tree (EAT) to obtain the result during the four time slots. The Data-Distributed module consists of seven two-input multiplexers that appoint the different non-zero values to each weighted input during the four separate time slots, as shown in Table IV, and the EAT sums the different weighted values in tree-like adders to complete the transform output. Similarly, the odd part transformation in (15) can be implemented in a DA-based format using four adders and one odd part adder-tree (OAT). The EAT and OAT can be implemented using the error-compensated adder tree [19] to improve the computation accuracy. Moreover, there are three pipeline stages (two of which are in the EAT and the OAT) in each PEE and PEO module, which enables high-speed computation. The proposed architecture for both the PEE and the PEO is shown in Fig. 3.

Fig. 3. Architecture for the proposed PEE and PEO.

4) Post-Reorder Module: In the last stage of the 1-D DCT computation, the data sequences from the PEE and PEO must be merged and repermuted in the Post-Reorder module. After the Post-Reorder module permutes the data order, the outputs appear in the proper sequence order. Fig. 4 shows the proposed Post-Reorder architecture. Two multiplexers select the data that is fed into the different Reorder Registers in order to permute the output order. The 1st-D DCT transform output is then input into the TMEM, and the 2nd-D DCT transform output is completed after the permutation by Reorder Registers 4.

Fig. 4. Proposed post-reorder architecture.

5) 2-D DCT Core Architecture: To save hardware cost, the proposed 2-D DCT core, as shown in Fig. 5, is implemented using a single 1-D DCT core and one TMEM. The 1-D DCT core includes an MBF2, a Pre-Reorder module, a PEE, a PEO, and a Post-Reorder module. The TMEM is implemented using 64-word 12-bit dual-port registers and has a latency of 52 cycles. Based on the time scheduling strategy described in Section III-C, a hardware utilization of 100% can be achieved.

C. Time Scheduling Strategy

As a result of the time scheduling strategy, the 1st-D and 2nd-D transforms can be computed simultaneously, which increases the hardware utilization to 100%. The timing flow chart for the proposed strategy is illustrated in Fig. 6.

1) 1st-D Data Computation: In the first four cycles, the first four-point input data shift into Reorder Registers 1. During cycles 5-8, the MBF2 performs additions and subtractions using the first eight-point input data. In the 9th cycle, the Pre-Reorder module obtains the even part input data and reorders the output to the PEE based on Table III. During cycles 10-13, the even part is calculated in the PEE module; owing to the latency of the three pipeline stages in the PEE, it is completed during cycles 13-16. The odd part is performed in the PEO module during cycles 14-17 and is finished in cycles 17-20. Therefore, after the 16th cycle, the Post-Reorder module permutes the 1st-D transform results and inputs them into the TMEM.

2) 2nd-D Data Computation: From the 17th cycle, the 1st-D transform data is input into the TMEM. Owing to the 52-cycle latency of the TMEM, the 2nd-D computation data transposed by the TMEM is sent to the input of Reorder Registers 2 in the MBF2 at the 69th cycle. The adder and subtracter then operate during cycles 73-76 to compute the first 2nd-D 8-point data sequence, and the first 2nd-D data is permuted by the Pre-Reorder module during cycles 77-80. The PEE and PEO calculate the first 2nd-D data during cycles 78-81 and 82-85, respectively. The Post-Reorder module finishes the 2nd-D transform results from the 84th cycle, and the first 2-D DCT transform output data is obtained at the end of the 84th cycle.

3) Hardware Utilization: In Fig. 6, the adder and subtracter in the MBF2 are at 50% utilization during cycles 1-68. However, after the 68th cycle, the MBF2 achieves 100% hardware utilization as a result of the 2nd-D data being input. At the same time, the utilization of the PEE and PEO also reaches 100% from the 74th and 78th cycles, respectively. The Post-Reorder module does not work during cycles 1-12, but after the 16th cycle the 1st-D transform outputs are completed using Reorder Registers 3 in the Post-Reorder, and the 2nd-D transform results are finished using Reorder Registers 4 in the Post-Reorder from the 84th cycle. At the end of the 84th cycle, the 1st-D and 2nd-D transform outputs are obtained simultaneously. Hence, the hardware utilization increases to 100% after the 84th cycle, and the latency of the proposed 8 × 8 2-D DCT core is 84 clock cycles.
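Returning to the process element of Section III-B3, the four-time-slot sharing that (14) and Table III describe can be mimicked in software: one small datapath model is reused once per slot, fed a different even-part coefficient row and the butterfly sums, and an adder tree accumulates the result. The slot-to-output assignment and the helper names below are assumptions for illustration, not the actual Table III/IV mapping; the products written with `*` stand for the BSD shift-and-add of Section II.

```python
import math

C = {k: math.cos(k * math.pi / 16) for k in range(1, 8)}   # DCT coefficients C1..C7

# Even-part coefficient rows of (5); one row is issued per time slot t = 0..3.
# The slot ordering here is assumed, not necessarily the Pre-Reorder ordering.
EVEN_ROWS = {
    0: ( C[4],  C[4],  C[4],  C[4]),   # Z0
    1: ( C[2],  C[6], -C[6], -C[2]),   # Z2
    2: ( C[4], -C[4], -C[4],  C[4]),   # Z4
    3: ( C[6], -C[2],  C[2], -C[6]),   # Z6
}

def adder_tree(terms):
    """Shared even-part adder tree (EAT): pairwise additions until one value remains."""
    while len(terms) > 1:
        terms = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)] + \
                ([terms[-1]] if len(terms) % 2 else [])
    return terms[0]

def even_pe(a):
    """Produce Z0, Z2, Z4, Z6 over four time slots with one shared datapath.
    `a` are the butterfly sums a_x = f_x + f_{7-x} from the MBF2 stage."""
    out = []
    for t in range(4):                        # the same hardware is reused in each slot
        row = EVEN_ROWS[t]
        out.append(adder_tree([c * x for c, x in zip(row, a)]))
    return out

f = [10, 22, 31, 40, 41, 30, 21, 11]
a = [f[x] + f[7 - x] for x in range(4)]
print(even_pe(a))
```

The point of the model is the loop over `t`: four outputs are obtained from one set of adders by changing only the coefficient row and the reordered inputs, which is exactly the resource sharing the Pre-Reorder module enables.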


Fig. 5. Proposed 2-D DCT architecture.

Fig. 6. Timing scheduling for the proposed 2-D DCT core.
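Functionally, the architecture of Fig. 5 and the schedule of Fig. 6 realize the separable 2-D DCT: every row passes through the single 1-D core, the intermediate results are transposed in the TMEM, and the columns then reuse the same core. A plain NumPy model of that row-column flow, ignoring cycle timing and fixed-point effects, is sketched below; the function names are illustrative.

```python
import numpy as np

def dct_1d(v):
    """1-D 8-point DCT of (4), with the 1/2 scaling factor retained."""
    x = np.arange(8)
    z = np.empty(8)
    for u in range(8):
        cu = 1 / np.sqrt(2) if u == 0 else 1.0
        z[u] = 0.5 * cu * np.sum(v * np.cos((2 * x + 1) * u * np.pi / 16))
    return z

def dct_2d_row_column(block):
    """Indirect 2-D DCT: 1st-D pass, transpose (the TMEM's role), 2nd-D pass."""
    first = np.array([dct_1d(row) for row in block])     # 1st-D transform, row by row
    tmem = first.T                                       # transpose memory
    second = np.array([dct_1d(row) for row in tmem])     # 2nd-D transform on the columns
    return second.T

block = np.arange(64, dtype=float).reshape(8, 8)
z = dct_2d_row_column(block)

# Cross-check against the direct matrix form  Z = C B C^T
k = np.arange(8)
C = 0.5 * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / 16)
C[0, :] *= 1 / np.sqrt(2)
print(np.allclose(z, C @ block @ C.T))   # expected: True
```

The cross-check confirms that the row-column (indirect) flow the core implements is numerically identical to the direct 2-D definition, which is why a single 1-D core plus one TMEM suffices.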

Based on the proposed time scheduling strategy, a 100% hardware utilization is achieved, and the computation of the data sequence is straightforward. In summary, following the accuracy analysis of the coefficients, the DA-precision bit length was chosen as a 9-bit BSD expression. In this way, the hardware cost is reduced by using the 9-bit BSD DA-precision instead of the 12 bits used in previous works. Using the proposed hardware sharing strategy, the DCT computation shares the hardware resources during the four time slots, thereby reducing the area cost. In addition, a 100% hardware utilization can be achieved based on the proposed time scheduling strategy, effectively achieving a high-throughput-rate design. Therefore, a high-accuracy, small-area,

and high-throughput-rate 2-D DCT core has been obtained by applying the proposed STS strategy.

IV. DISCUSSION AND COMPARISONS

In this section, the system accuracy test for the proposed 2-D DCT core is discussed. Then, a comparison of the proposed 2-D DCT core with previous works is presented. Finally, the characteristics of the implemented chip and the FPGA version are described.

A. System Accuracy Test

The seven test images used to check the system accuracy are all 512 × 512 pixels in size, with each pixel represented by 8-bit 256-level gray data. After the original test image pixels are fed into the proposed 2-D DCT core, the transform output data is captured and passed to the MATLAB tool in order to compute the inverse DCT using 64-bit double-precision operations. The verification flow follows the test flow of previous works [4], [7], [20] and is illustrated in Fig. 7. The PSNR values of the test images verified by this flow are listed in Table V, and the average PSNR is close to 52 dB for the proposed 8 × 8 2-D DCT core.

Fig. 7. Accuracy verification flow for the proposed DCT core.

TABLE V
PSNR PERFORMANCE FOR EACH TEST PATTERN

Furthermore, the proposed 2-D DCT core was inserted into the JPEG compression flow [21], [22] to evaluate its behavior in a real compression system. The quality factor (QF) dominates the compression quality and data size, and therefore QF affects the system PSNR of the JPEG compression. Table VI lists the PSNR versus QF for the proposed 2-D DCT core applied to JPEG, and Fig. 8 illustrates both the original and the compressed versions of the test image Lena using the JPEG compression flow. The quality of the compressed image is excellent.

TABLE VI
PSNR VERSUS QF FOR THE PROPOSED 2-D DCT APPLIED TO THE JPEG COMPRESSION FLOW

B. Comparison With Other 2-D DCT Architectures

Table VII compares the proposed 8 × 8 2-D DCT core with previous works. In [4], Lin et al. derived an SFFT format from the 2-D DCT and shared the hardware resources for the FFT/IFFT/2-D DCT computation in a triple-mode processor design; their 2-D DCT core is implemented in a direct 2-D method based on a pipelined single-delay-feedback-path architecture. In [7], a 2-D DCT core used the NEDA architecture with two 1-D cores and one TMEM to achieve a low-cost design, but speed limitations exist in the serial shifting and addition of the NEDA operations. Huang et al. improved the throughput rate by using an adder-tree (AT) instead of the shifting and addition operations of the NEDA architecture [9]; however, the parallel input and output in the NEDA architecture required a large amount of I/O resources. A 2-D DCT core using a single 1-D core and one TMEM for a multiplier-based DCT implementation was addressed in [10]. An area-saving serial-input parallel-output transpose memory was also implemented in that 2-D DCT core to achieve a small-area design operated at a frequency of 55 MHz. However, the 1-D core could not compute the 1st-D and 2nd-D transformations simultaneously, so the throughput rate was half of the operating frequency. Therefore, a hardware sharing technique was introduced in the 2-D DCT core of [13] to deal with the throughput rate reduction, and that core achieved a throughput of 100 Mpels/s.

The proposed 2-D DCT core employs a single 1-D core and one TMEM in order to reduce the core area. The 1-D core also uses 9 adders and 2 adder-trees (AT) in the proposed hardware sharing architecture in order to reduce the hardware cost, and the throughput rate of the proposed 2-D DCT core is enhanced by the time scheduling strategy. In this way, the proposed 2-D core is able to compute the 1st-D and 2nd-D transformations simultaneously, and a 167 Mpels/s throughput rate is achieved. Therefore, the proposed 2-D DCT core has the highest hardware efficiency, defined as follows:

Hardware Efficiency = \frac{\text{Throughput (pels/s)}}{\text{Gate Count (gates)}}    (16)

A comparison with previous works is summarized in Table VII. Note that previous works presented only simulation results, while the proposed work has a chip implementation with design for test (DFT) and measured results. For a fair comparison, the simulation results of the proposed work are also shown (removing the DFT considerations of the chip implementation, which need extra area for the memory built-in self-test circuit and the scan flip-flops). The gate count of the STS DCT without DFT insertion is 12.2 K, and the hardware efficiency of the STS DCT is 13.69 Kpels/s-gate, which is the largest in Table VII. On the other hand, the proposed 2-D DCT achieves a 52 dB PSNR value, which is medium accuracy compared with the other previous works, verified by the test flow in Fig. 7.

Power consumption is another issue in circuit design. Because previous works implement the DCT core in many different technologies, the power is normalized to the 0.18-μm technology process, following the normalization of [23], as indicated in (17) and summarized in Table VII. The proposed 2-D DCT core has a medium power consumption among the previous works.
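As a quick check of the hardware-efficiency figure defined in (16), the short sketch below divides throughput by gate count; only the 167 Mpels/s and 12.2 K values for the STS core come from the text, and the second call uses placeholder numbers purely to show how another design would be compared.

```python
def hardware_efficiency(throughput_mpels_per_s, gate_count):
    """Hardware efficiency of (16): throughput per gate, reported in Kpels/s-gate."""
    return throughput_mpels_per_s * 1e6 / gate_count / 1e3

# Proposed STS core without DFT insertion: 167 Mpels/s at 12.2 K gates.
print(round(hardware_efficiency(167, 12_200), 2))   # ~13.69 Kpels/s-gate, as quoted above

# Comparing a hypothetical competitor (placeholder numbers, not from Table VII).
print(round(hardware_efficiency(100, 20_000), 2))
```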


Fig. 8. Compressed image using JPEG compression processing with the proposed 2-D DCT core. (a) Original image. (b) Compressed image with QF = 99.
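Table VI's dependence of PSNR on the quality factor comes from how QF scales the JPEG quantization table before it is applied to the 2-D DCT coefficients. The sketch below uses the common libjpeg-style scaling rule as an assumption; the paper does not state which scaling convention it used.

```python
import numpy as np

# Standard JPEG luminance quantization table (Annex K of the JPEG standard [21]).
Q50 = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99]])

def scaled_table(qf):
    """libjpeg-style QF scaling (an assumption here): larger QF -> finer quantization."""
    s = 5000 / qf if qf < 50 else 200 - 2 * qf
    q = np.floor((Q50 * s + 50) / 100)
    return np.clip(q, 1, 255)

def quantize(dct_block, qf):
    """Quantize and dequantize an 8x8 block of 2-D DCT coefficients at quality factor qf."""
    q = scaled_table(qf)
    return np.round(dct_block / q) * q

# At QF = 99 the step sizes collapse toward 1, consistent with the high PSNR reported in Table VI.
print(scaled_table(99)[:2])
```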

TABLE VII COMPARISON OF DIFFERENT 2-D DCT ARCHITECTURES WITH THE PROPOSED ARCHITECTURE

Entries given in the form x/(y): x is the chip measurement result with DFT insertion, and (y) is the simulation result without DFT insertion.
The power result is estimated for 1-D DCT operation.
Gate counts assume four transistors per NAND2 gate for the different technologies.

C. Chip Implementation

Implemented in a 1.8-V TSMC 0.18-μm 1P6M CMOS process, the proposed 8 × 8 2-D DCT core uses the Synopsys Design Compiler to synthesize the RTL code and the Cadence SoC Encounter for placement and routing (P&R). The proposed DCT core has a latency of 84 clock cycles and operates at 167 MHz to meet HDTV specifications. For the design for test (DFT) considerations, the proposed core uses the SynTest SRAMBIST to generate the memory built-in self-test (BIST) circuit and the Synopsys Design Compiler to synthesize the scan flip-flops; the total gate count of the proposed core with DFT insertion is 15.8 K. The photomicrograph of the proposed core, with each module's boundary marked, is shown in Fig. 9, and the measured characteristics are listed in Table VIII.

Fig. 9. Photomicrograph of the proposed DCT core.

TABLE VIII
CHIP CHARACTERISTICS OF THE PROPOSED 2-D DCT ARCHITECTURE

D. FPGA Implementation

The proposed 2-D DCT core, synthesized using Xilinx ISE 10.1 for a Xilinx XC2VP30 FPGA, can operate at a clock frequency of 110 MHz. Table IX shows a comparison of the proposed 2-D DCT core with previous FPGA implementations. The multiplier-less coordinate rotation digital computer (CORDIC) architecture was presented in [15]: Sun et al. utilized 120 adders and 80 shifters to perform a 2-D quantized discrete cosine integer transform (QDCIT) and achieved a high clock rate of 149 MHz, but with large FPGA area resources. Also, a 2-D DCT core that was optimized in both delay and area was addressed by Tumeo et al. [14], who presented a balance point between the delay and the area for a multiplier-based pipelined DCT design with 19 adders/subtracters and 4 multipliers in their DCT core, operated at a clock frequency of 107 MHz. The proposed DCT core has lower area resources and a medium operating frequency among the XC2VP30 FPGA implementations.

TABLE IX
COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS

V. CONCLUSION

This paper proposes a high performance video transform engine using the STS strategy. The proposed 2-D DCT core employs a single 1-D DCT core and one TMEM with a small area. Based on the test image simulations, a 9-bit DA-precision is chosen in order to meet the PSNR requirements, and the hardware sharing architecture enables the arithmetic resources to be shared in time so as to reduce the area cost. Hence, the number of adders/subtracters in the 1-D DCT core allows a 74% saving in area over the NEDA architecture for a DA-based DCT design. Furthermore, the proposed time scheduling strategy arranges the computation time for each process element, and the 1-D core can calculate the 1st-D and 2nd-D transformations simultaneously, thereby achieving a high throughput rate. Therefore, a high performance 8 × 8 2-D DCT core combining high accuracy, a small area, and a high throughput rate has been achieved using the proposed STS strategy. Finally, the proposed high performance 2-D DCT core is fabricated in a TSMC 0.18-μm 1P6M CMOS process and implemented on the Xilinx XC2VP30 FPGA.


REFERENCES
[1] Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[2] E. Feig and S. Winograd, "Fast algorithms for the discrete cosine transform," IEEE Trans. Signal Process., vol. 40, no. 9, pp. 2174-2193, Sep. 1992.
[3] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, "A cost-effective architecture for 8 × 8 two-dimensional DCT/IDCT using direct method," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 3, pp. 459-467, Jun. 1997.
[4] C. T. Lin, Y. C. Yu, and L. D. Van, "Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058-1071, Aug. 2008.
[5] M. Alam, W. Badawy, and G. Jullien, "A new time distributed DCT architecture for MPEG-4 hardware reference model," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 5, pp. 726-730, May 2005.
[6] T. Xanthopoulos and A. P. Chandrakasan, "A low-power DCT core using adaptive bitwidth and arithmetic activity exploiting signal correlations and quantization," IEEE J. Solid-State Circuits, vol. 35, no. 5, pp. 740-750, May 2000.
[7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, "NEDA: A low-power high-performance DCT architecture," IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955-964, Mar. 2006.
[8] C. Peng, X. Cao, D. Yu, and X. Zhang, "A 250 MHz optimized distributed architecture of 2D 8 × 8 DCT," in Proc. Int. Conf. ASIC, 2007, pp. 189-192.
[9] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2-D transform architecture with unique kernel for multi-standard video applications," in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21-24.
[10] S. C. Hsia and S. H. Wang, "Shift-register-based data transposition for cost-effective discrete cosine transform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, pp. 725-728, Jun. 2007.
[11] J. I. Guo, R. C. Ju, and J. W. Chen, "An efficient 2-D DCT/IDCT core design using cyclic convolution and adder-based realization," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 4, pp. 416-428, Apr. 2004.
[12] H. C. Hsu, K. B. Lee, N. Y. Chang, and T. S. Chang, "Architecture design of shape-adaptive discrete cosine transform and its inverse for MPEG-4 video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 375-386, Mar. 2008.
[13] A. Madisetti and A. N. Willson, "A 100 MHz 2-D 8 × 8 DCT/IDCT processor for HDTV applications," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 2, pp. 158-165, Apr. 1995.
[14] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, "A pipelined fast 2D-DCT accelerator for FPGA-based SoCs," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2007, pp. 331-336.
[15] C. C. Sun, P. Donner, and J. Gotze, "Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform," in Proc. IEEE Int. Symp. Circuits Syst., 2009, pp. 3014-3017.
[16] S. Ghosh, S. Venigalla, and M. Bayoumi, "Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2005, pp. 162-166.
[17] L. V. Agostini, I. S. Silva, and S. Bampi, "Pipelined fast 2D DCT architecture for JPEG image compression," in Proc. IEEE Symp. Integr. Circuits Syst. Des., 2001, pp. 226-231.
[18] C. H. Chang, C. L. Wang, and Y. T. Chang, "Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse," IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3206-3216, Nov. 2000.



[19] Y. H. Chen, T. Y. Chang, and C. Y. Li, "High throughput DA-based DCT with high accuracy error-compensated adder tree," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2010, to be published.
[20] W. Pan, "A fast 2-D DCT algorithm via distributed arithmetic optimization," in Proc. IEEE Int. Conf. Image Process., 2000, vol. 3, pp. 114-117.
[21] Information Technology - Digital Compression and Coding of Continuous-Tone Still Images: Requirements and Guidelines, ISO/IEC 10918-1, 1991.
[22] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard. New York: Van Nostrand Reinhold, 1992.
[23] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-Gs/s FFT processor for OFDM-based WPAN applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 6, pp. 451-455, Jun. 2010.

Yuan-Ho Chen (S'10) received the B.S. degree from Chang Gung University, Taoyuan, Taiwan, in 2002, and the M.S. degree from National Tsing Hua University (NTHU), Hsinchu, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering. His research interests include video/image and digital signal processing, VLSI architecture design, VLSI implementation, computer arithmetic, and signal processing for wireless communication.

Tsin-Yuan Chang (S'87-M'90) received the B.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 1982, and the M.S. and Ph.D. degrees from Michigan State University, East Lansing, in 1987 and 1989, respectively, both in electrical engineering. He is an Associate Professor with the Department of Electrical Engineering, National Tsing Hua University. His research interests include IC design and testing, computer arithmetic, and VLSI DSP.
