
Bit-Parallel Multiple Constant Multiplication using Look-Up Tables on FPGA

Mathias Faust and Chip-Hong Chang


Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore

Abstract: The research on optimization of Multiple Constant Multiplication (MCM) during the last two decades has focused mainly on common subexpression elimination and reduced adder graph algorithms when bit-parallel computation is required. The advancement of FPGA technology enables the implementation of complex MCM instances on FPGA, but the shift-and-add network implementation does not make full use of the fundamental resources of the FPGA, such as the Look-Up Tables (LUTs). Since bit-serial implementation optimized for FPGA is slow, an attempt at a bit-parallel LUT-based implementation of single constant multiplication has been made. This paper extends this LUT-based method to multiple constant multiplications. It presents the unexpected outcome that the maximal number of LUTs required can be limited far below the theoretical number obtained by mere enumeration without considering the legitimacy of all possible output combinations. Simulation results show that the required logic slices are comparable to those of the traditional adder-based MCM optimization methods while the delay is reduced by approximately 33%. The advantages become more prominent as the number of constants and the bit width used for their representation increase.

I. INTRODUCTION

A frequently used example of the Multiple Constant Multiplication (MCM) problem is the transposed direct form structure of Finite Impulse Response (FIR) filters. The MCM block provides a fully parallel execution of the filter coefficient multiplications. The number of adders within the shift-and-add network of the multiplierless implementation is traditionally reduced [1]-[5] by either Common Subexpression Elimination (CSE) or reduced adder graph algorithms. With the advancement of FPGA technology, it is now possible to implement FIR filters on FPGAs at high speed using bit-parallel arithmetic. For high-speed applications, bit-serial or distributed arithmetic implementations are not fast enough, and the present shift-and-add optimization does not make full use of the FPGA fabric. High-speed shift-and-add implementations based on Carry Save Adders (CSAs) provide very fast addition in ASICs, but the mapping of CSAs to FPGA is suboptimal: one slice, containing two Look-Up Tables (LUTs), is required per CSA, while two Full Adders (FAs) can fit into one slice using the fast carry logic.

Different approaches to single constant multiplication for many Digital Signal Processor (DSP) algorithms have been studied [6]-[8]. Wirthlin [6] introduces a method to execute a constant multiplication on FPGA using LUTs for bit-parallel realization. The main idea is to split the input signal into several segments and then use 4-bit LUTs to generate the coefficient multiplier output bits. The most significant segment is viewed as a signed variable while the remaining segments are viewed as unsigned, as shown in Eq. (5) of [6]. To obtain the final result, the intermediate results from the split input signal are added by a Carry Propagate Adder (CPA). Wirthlin noted the existence of some redundancies in a single constant multiplication [6] and categorized them into three types. First, an output bit of the LUT can be always zero, which does not require any LUT. Second, two or more LUTs produce the same output, which can be replaced by a single LUT and appropriate routing. Third, the output of a LUT is identical to one of its inputs, which can be wired through directly. In [7], a memory-based approach for FIR filters is introduced. It uses a different approach to reduce the size of the LUT, which involves barrel shifters and adder-subtractors as well as control signals.

Using the work of Wirthlin [6] as a starting point, this paper analyzes the reduction possibilities of sharing bit-slice LUTs for Multiple Constant Multiplication (MCM). Section II discusses the LUT-based MCM, the limits for the number of LUTs and the difference between signed and unsigned cases. Section III highlights the factors of consideration for input signal splitting. In Section IV, simulation results are presented and analyzed. Section V concludes the paper.

II. LUT BASED MULTIPLICATION OF MULTIPLE CONSTANTS

The main building blocks of FPGAs are small LUTs, usually with four inputs. Newer generations of FPGAs can have five- or six-input LUTs. CPAs are mapped to slices using the high-speed carry propagation circuits within the FPGA, but this does not always lead to a better utilization rate of the FPGA resources. It was shown in [6] that for single constant multiplication, the logic block resources in an FPGA are better utilized if a combination of LUTs and a few smaller CPAs is used rather than a CPA-only implementation.
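The segment-wise scheme attributed to [6] above can be illustrated with a minimal Python sketch. It assumes an 8-bit two's-complement input split into two 4-bit segments; the function names `build_tables` and `lut_multiply` are hypothetical and not taken from [6]. Each table here stores whole partial products, which in hardware would correspond to a group of one-bit-output LUTs sharing the same four address bits.

```python
# Hypothetical sketch: multiply a constant by an 8-bit signed input using
# two 16-entry tables (4 address bits each) and one final addition.

def build_tables(c):
    """Return (lo_table, hi_table) of partial products for constant c."""
    # Low segment: the 4 address bits are read as an unsigned value 0..15.
    lo_table = [c * a for a in range(16)]
    # High segment: the same 4 address bits are read as two's complement (-8..7).
    hi_table = [c * (a - 16 if a >= 8 else a) for a in range(16)]
    return lo_table, hi_table


def lut_multiply(c, x, tables=None):
    """Multiply c by a signed 8-bit x (-128..127) via table look-ups."""
    lo_table, hi_table = tables or build_tables(c)
    addr = x & 0xFF          # two's-complement bit pattern of x
    lo = addr & 0xF          # unsigned least significant segment
    hi = addr >> 4           # most significant segment (signed in the table)
    # The high partial product has weight 2**4; this sum plays the role of
    # the carry propagate adder that merges the segment results.
    return (hi_table[hi] << 4) + lo_table[lo]


if __name__ == "__main__":
    print(lut_multiply(45, -100), 45 * -100)
```

Because the high partial product is shifted by four bits, the four least significant bits of the low partial product pass to the result without any addition.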
When this concept was applied to MCM, an analysis of the required LUTs revealed that many LUT outputs are identical, so that the number of distinct LUTs is limited and grows far more slowly with increasing LUT bit width than expected, which contradicts the LUT estimates given in [7]. The numbers of n-bit LUTs required to obtain all products of an n-bit variable and all the odd numbers in the range of 1 to $2^{12}$ are shown in Table I. For each category listed in the first column of Table I, the analysis is performed for both signed and unsigned inputs. The first category refers to the total number of LUTs required without any optimization, which

978-1-4244-9474-3/11/$26.00 2011 IEEE


TABLE I
LUT REQUIRED UNDER DIFFERENT CONDITIONS FOR CONSTANTS FROM 1 TO $2^{12}$

Category             Input      1-bit  2-bit  3-bit  4-bit  5-bit  6-bit  7-bit  8-bit
Output Bits          Unsigned   24575  26622  28669  30716  32763  34810  36857  38904
Output Bits          Signed     24575  26622  28669  30716  32763  34810  36857  38904
Single Constant [6]  Unsigned    2047   9559  18527  24074  27572  30212  32531  34706
Single Constant [6]  Signed      2047   5909  14641  21543  25552  28300  30583  32720
Multiple Constants   Unsigned       2      8     36    144    616   2456   9598  15788
Multiple Constants   Signed         2      6     16     52    176    680   2584   8295
Required LUTs        Unsigned       0      5     32    139    610   2359   5926   9887
Required LUTs        Signed         0      3     12     47    170    673   2470   6028

is equal to the number of output bits required to represent all possible products. For every input bit increment, there will be 2047 products with an output bit width incremented by one, which explains the linear increment of the number of LUTs. The category Single Constant refers to the number of LUTs required by applying the method of [6] to eliminate the redundant LUTs for each constant multiplication independently. The increment in the number of LUTs with the number of input bits is large for small n but tapers off as n increases. This is because the likelihood of finding identical output bits of the same weight among the products of an n-bit variable and a single constant decreases with increasing n. The category Multiple Constants indicates the number of LUTs required by considering the redundancies among all the constant multiplications as a whole. The category Required LUTs is obtained by further removing the LUT with all-zero entries and the n LUTs whose outputs are identical to the address bits.

Table II lists the content of the LUTs in the category Multiple Constants for 3-bit signed and unsigned inputs. Each row represents the product bits of one input combination of the variable and all odd integers from 1 to 4095, after removing duplicated columns. Each column corresponds to one 3-bit LUT. The first column of both the signed and unsigned input cases corresponds to the trivial all-zeros LUT. The columns highlighted in bold print are LUTs whose output bits can be obtained by directly hardwiring them to the input bits. Thus, only 12 LUTs for the signed case and 32 LUTs for the unsigned case are required.

From Table I, as n increases, the number of LUTs actually needed to implement all the constant multiplications grows by a factor of approximately four for each additional input bit, that is, almost linearly on a logarithmic scale. This is rather unexpected because, intuitively, the number of possible LUT contents grows exponentially with the number of rows, which itself grows exponentially with n. The number of input combinations (rows), r, of an integer variable is $r = 2^n$.
Each of these r possible outputs of a LUT can be either 0 or 1. Excluding the exceptional case of the zero input, where the output is always zero for any constant, an exhaustive enumeration of the number of possible LUT outputs, s, gives

$$s = 2^{r-1} = 2^{2^{n}-1} \tag{1}$$

For $n = 8$, $s \approx 5.79 \times 10^{76}$. This number is much higher than the number of LUTs that are actually required. From Table I, the number of LUTs required is only slightly more than $u = 4^{n} \cdot 2^{-1}$ for the unsigned case but slightly less for the signed case. The reason behind this is that many of these LUT output patterns are illegitimate under the multiplication of an n-bit variable with all odd integer constants. Generally, most legitimate patterns exhibit a certain regularity. For example, it is not possible to have a LUT with output $[01010100]^T$ for $n = 3$. If the last row is a 1, i.e., $[01010101]^T$, the output pattern becomes legitimate, which corresponds to the 11th column for the signed input in Table II. Patterns with only a single 1 also rarely occur. Only one case is found in Table II, which appears in the second column for the unsigned input.

TABLE II
OUTPUTS FOR THE REQUIRED LUTS FOR A 3-BIT INPUT, SIGNED AND UNSIGNED CASE

Inp.  Signed            Unsigned
000   0000000000000000  000000000000000000000000000000000000
001   0000000011111111  000000000000000000111111111111111111
010   0000111100001111  000000000111111111000000000111111111
011   0001001101110011  000000111000000111000111111000111111
100   0010010100010101  000011111001111000111000011000001111
101   0110110010101100  000111001110001001011011100011000111
110   0111001011011000  001111000000110111000100111111000011
111   0111111100100000  011101010011011010101001001101010001
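The enumeration behind the Multiple Constants and Required LUTs categories can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, the constants are taken to be all odd integers from 1 to 4095, and product bits are read from sign-extended two's-complement products.

```python
def product_columns(n, signed, max_const=4095, extra_bits=1):
    """Collect the distinct n-input truth tables over all product bits."""
    rows = range(1 << n)                    # LUT addresses 0 .. 2**n - 1
    half = 1 << (n - 1)
    # Value represented by each address (two's complement if signed).
    values = [a - (1 << n) if signed and a >= half else a for a in rows]
    width = max_const.bit_length() + n + extra_bits
    columns = set()
    for c in range(1, max_const + 1, 2):    # odd constants only
        products = [c * v for v in values]
        for j in range(width):
            # Python's >> on negative ints acts as a sign-extending shift,
            # so bit j is well defined for every row.
            columns.add(tuple((p >> j) & 1 for p in products))
    return columns


def required_luts(n, signed, max_const=4095):
    """Distinct columns minus the all-zero column and the wired-through inputs."""
    cols = product_columns(n, signed, max_const)
    trivial = {tuple(0 for _ in range(1 << n))}
    trivial |= {tuple((a >> k) & 1 for a in range(1 << n)) for k in range(n)}
    return len(cols - trivial)


if __name__ == "__main__":
    # 32 12, the 3-bit Required LUTs entries of Table I
    print(required_luts(3, False), required_luts(3, True))
```

For the unsigned 3-bit case the sketch finds 36 distinct columns, in line with Table II. In the signed case the enumeration itself never produces the all-zero column, which Table I nevertheless counts as a trivial LUT, so `required_luts` is the more convenient quantity to compare.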

The difference in the number of required LUTs between the signed and unsigned input cases is due to the symmetry inherent in the range of signed integers, which causes the range of magnitudes of the products to shrink. This is manifested in Table I, where the number of required LUTs for the signed case of n + 1 approximates that of the unsigned case of n for n > 3. The numbers of required LUTs shown in Table I are accurate for $n \le 5$ but are likely to be underestimated for n > 6. This is because if the bit width m used to represent the constants is smaller than 2n + 1, some overlapping bit patterns in the partial products will not be captured. Therefore, the figures presented for the 7- and 8-bit LUTs are likely to be too small. By increasing the range of odd integer constants to $2^{14}$, the numbers of LUTs required for n = 6, 7 and 8 are 2449, 9590 and 23770, respectively, for the unsigned input and 673, 2576 and 9813, respectively, for the signed input. When m = 14, the number of required LUTs reported for n = 6 will be accurate, but it is still underestimated for n = 7 and 8.

To show the actual number of LUTs required on practical MCM problems, the filters from [4] and the 695-tap filter from [9] using four different coefficient bit widths were analyzed. All filter coefficients are available in [10]. The results are shown in Table III, where the column UC lists the number of unique coefficients. The columns [6] and Prop. show, respectively, the numbers of 4-bit LUTs required when each


constant multiplication is considered separately and when all constant multiplications are considered collectively. The percentage reduction is given in the column %diff. Compared with the required LUTs for n = 4 in Table I, almost all possible LUTs are required for the signed input case, and between 48% and 82% of all possible LUTs are required for the unsigned input case, for the filters from [4]. On average, the number of LUTs required is reduced by 86.5% and 72.9% over [6] for the signed and the unsigned cases, respectively. The 695-tap filter example shows that increasing the constant bit width beyond 16 bits and the number of constants beyond 100 increases the number of LUTs required for [6], but the number of LUTs required in Prop. increases only marginally from 16 to 20 bits and remains constant for coefficient bit widths of 20 and 24 bits. For an 8-bit input signal, one adder per constant coefficient is needed to generate the final output from the LUTs. Adding an additional coefficient requires only one more adder, which is optimal in view of the lower bound for MCM given in [11].
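The difference between counting LUTs per constant and counting them collectively can be sketched as follows. The coefficient set is hypothetical (the filter coefficients of [4] and [9] are not reproduced here), the function names are illustrative, and the per-constant count only approximates the redundancy removal of [6] by discarding duplicate, all-zero and wired-through columns within each constant.

```python
def segment_columns(c, signed, n=4, out_bits=20):
    """Truth tables (one per product bit) of constant c for one n-bit segment."""
    half = 1 << (n - 1)
    values = [a - (1 << n) if signed and a >= half else a for a in range(1 << n)]
    products = [c * v for v in values]
    return {tuple((p >> j) & 1 for p in products) for j in range(out_bits)}


def trivial_columns(n=4):
    """The all-zero column and the n columns equal to an address bit."""
    zero = tuple([0] * (1 << n))
    wires = {tuple((a >> k) & 1 for a in range(1 << n)) for k in range(n)}
    return {zero} | wires


def count_luts(constants, signed, n=4):
    """Return (per-constant total, shared total) of non-trivial LUTs."""
    triv = trivial_columns(n)
    per_constant = sum(len(segment_columns(c, signed, n) - triv) for c in constants)
    shared = set()
    for c in constants:
        shared |= segment_columns(c, signed, n)
    return per_constant, len(shared - triv)


if __name__ == "__main__":
    coeffs = [3, 11, 45, 89, 173]          # hypothetical coefficient set
    print("signed segment  :", count_luts(coeffs, signed=True))
    print("unsigned segment:", count_luts(coeffs, signed=False))
```

As more constants are added, the per-constant total keeps growing while the shared total for a 4-bit segment should stay within the n = 4 Required LUTs limits of Table I.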
TABLE III
COMPARISON OF LUT COUNTS OF [6] AND PROPOSED METHODS FOR 11 FILTERS FROM [4] AND A 695 TAP FILTER FROM [9] WITH DIFFERENT COEFFICIENT BIT WIDTHS

                                Signed                Unsigned
Filter      UC     [6]   Prop.  %diff.     [6]   Prop.  %diff.
1           19     263      43   83.7%     263      87   66.9%
2           39     578      47   91.9%     578     114   80.3%
3           14     174      42   75.9%     174      73   58.0%
4           33     427      46   89.2%     427     108   74.7%
5           18     232      46   80.2%     232      88   62.1%
6           36     474      47   90.1%     474     102   78.5%
7           19     242      42   82.6%     242      73   69.8%
8           29     407      46   88.7%     407     100   75.4%
9           14     252      45   82.1%     252      75   70.2%
10          16     398      46   88.4%     398      93   76.6%
11          12     173      39   77.5%     173      67   61.3%
Avg.      22.6   329.1    44.5   86.5%   329.1    89.1   72.9%
12-bit      32     320      44   86.3%     320      86   73.1%
16-bit     109    1270      46   96.4%    1270     131   89.7%
20-bit     271    3729      47   98.7%    3729     139   96.3%
24-bit     345    5937      47   99.2%    5937     139   97.7%

One important fact that was not mentioned in [6] is that LUT-based multiplication shortens the delay compared with a CPA-based implementation when the Canonical Signed Digit (CSD) encoded constant contains many nonzero digits. The delay of the latter is determined by the depth of the adder tree, which is given by $\lceil \log_2 S(C) \rceil$, where $S(C)$ is the number of nonzero digits of the CSD constant $C$. If the input bit width is 8 bits and the signal is split into two 4-bit signals, two LUT blocks and one adder per constant are required. This delay is approximately equal to the delay of two cascaded adders. This means that for all cases where $S(C) > 4$, the delay will be reduced by using the LUT technique.

III. INPUT SIGNAL SPLITTING CONSIDERATIONS

If the bit width of the input signal is larger than that of the LUT, the input signal needs to be divided into two or more parts. For mapping onto FPGA, it makes sense to fit the bit width of the partial input signal to the size of the LUT, or to another size that saves CPAs required for LUT output merging. A 4-bit LUT is typical, but newer FPGAs can have five- or six-input LUTs. Beyond six inputs, the number of possible LUTs is too large because there are fewer redundant columns in the product bit matrix. If the input has an odd bit width, it is beneficial to have one more bit in the higher-order part, where the sign bit is located, as the maximum number of required LUTs is smaller. For a 7-bit input, the limit of LUTs for a 4 + 3 splitting is 47 + 32 = 79, while a splitting of 3 + 4 requires at most 12 + 139 = 151 LUTs. The downside of allocating more bits to the higher-order side is that a longer CPA is required for merging. No addition is required for the 3 and 4 Least Significant Bits (LSBs) for the 4 + 3 and 3 + 4 splittings, respectively. The number of constants c can be used to decide which splitting is more beneficial. In this example, it pays to allocate one more bit to the upper part as long as c < 151 - 79 = 72. For other bit widths, the same calculation can be applied. In general, it is not beneficial to have more than a 1-bit difference.

IV. SIMULATION RESULTS

The proposed LUT optimization for MCM has been coded in MATLAB and integrated into our sysFIR [10] automatic VHDL generator for FIR filters. Generic signed and unsigned adders were used and the design was functionally verified with ModelSim. An 8-bit input was assumed and it was split into two 4-bit signals. The 64-bit version of Xilinx ISE 11.2 (L.46) was used to compile and map the design onto the Xilinx Virtex-4 FPGA, xc4vlx200-10ff1513. The designs were optimized for speed with effort level 2.

TABLE IV
COMPARISON OF LUT EQUIVALENT COUNTS FOR 11 FILTERS FROM [4] AND A 695 TAP FILTER FROM [9] OF DIFFERENT COEFFICIENT BIT WIDTHS

                       LUT equiv. count         % Reduction over
Filter      UC      CSD    MinLD    Prop.      CSD    MinLD
1           19      874      483      393    55.0%    18.6%
2           39     2120      926      739    65.1%    20.2%
3           14      535      300      289    46.0%     3.7%
4           33     1401      690      581    58.5%    15.8%
5           18      758      389      366    51.7%     5.9%
6           36     1612      731      623    61.4%    14.8%
7           19      757      408      357    52.8%    12.5%
8           29     1390      671      553    60.2%    17.6%
9           14      803      428      372    53.7%    13.1%
10          16     1366      696      537    60.7%    22.8%
11          12      590      309      279    52.7%     9.7%
Average   22.6   1109.6    548.3    462.6    58.3%    15.6%
12-bit      32      848      459      450    46.9%     2.0%
16-bit     109     3769     1768     1447    61.6%    18.2%
20-bit     271    12267     5028     3915    68.1%    22.1%
24-bit     345    22375     8567     6123    72.6%    28.5%
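The splitting rule of Section III can be reproduced numerically with a small Python sketch. The LUT limits are copied from the Required LUTs rows of Table I; the function names are hypothetical, and the break-even test assumes, as an interpretation of the text, that the longer merging CPA costs about one additional LUT equivalent per constant.

```python
# Required-LUT limits from Table I (constants up to 2**12), input widths 1..5.
REQUIRED_UNSIGNED = {1: 0, 2: 5, 3: 32, 4: 139, 5: 610}
REQUIRED_SIGNED = {1: 0, 2: 3, 3: 12, 4: 47, 5: 170}


def lut_limit(high_bits, low_bits):
    """Upper limit of LUTs: signed high segment plus unsigned low segment."""
    return REQUIRED_SIGNED[high_bits] + REQUIRED_UNSIGNED[low_bits]


def compare_splits(width):
    """LUT limits of the two near-even splits of an odd input width."""
    big, small = (width + 1) // 2, width // 2
    more_high = lut_limit(big, small)    # e.g. 4 + 3
    more_low = lut_limit(small, big)     # e.g. 3 + 4
    return more_high, more_low, more_low - more_high


def prefer_more_high_bits(width, num_constants):
    """Assumed rule: the LUT saving must exceed one extra adder bit per constant."""
    _, _, saving = compare_splits(width)
    return num_constants < saving


if __name__ == "__main__":
    print(compare_splits(7))             # (79, 151, 72)
    print(prefer_more_high_bits(7, 50))  # True
```

The same tables can be queried for a 9-bit input (5 + 4 versus 4 + 5); wider segments are outside the range for which Table I is stated to be accurate.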

Typically, one 4-bit LUT block will be consumed for each column of a 4-input product bit matrix for the MCM. As Ripple Carry Adders (RCAs) are used, they can be implemented using the fast carry propagation path. For these adders, a reasonable estimate of approximately one LUT block per output bit is made. Using these estimates and the Macro


TABLE V
SIMULATION RESULTS FOR XILINX VIRTEX-4 FOR 11 FILTERS FROM [4] AND A 695 TAP FILTER FROM [9] OF DIFFERENT COEFFICIENT BIT WIDTHS

                           CSD           MinLD        Proposed   Red. over CSD Red. over MinLD
Filter      UC  Slices   Delay  Slices   Delay  Slices   Delay  Slices   Delay  Slices   Delay
1           19     223    13.8     217    15.5     207     9.4    7.2%   31.8%    4.6%   39.3%
2           39     482    15.0     405    14.9     392    10.3   18.7%   31.8%    3.2%   30.9%
3           14     163    13.1     133    13.6     158     9.2    3.1%   29.8%  -18.8%   32.6%
4           33     333    14.3     308    15.2     316    10.1    5.1%   29.7%   -2.6%   33.5%
5           18     216    13.3     196    14.6     193     9.4   10.6%   29.3%    1.5%   35.3%
6           36     385    14.8     317    15.5     328    10.0   14.8%   32.3%   -3.5%   35.6%
7           19     200    13.6     185    15.4     187     9.4    6.5%   30.9%   -1.1%   39.0%
8           29     341    15.1     301    15.1     290     9.8   15.0%   35.4%    3.7%   35.3%
9           14     214    15.1     187    13.9     199     9.4    7.0%   37.7%   -6.4%   32.4%
10          16     334    14.5     293    15.4     284     9.9   15.0%   31.9%    3.1%   35.7%
11          12     174    13.5     150    12.7     148     9.2   14.9%   32.0%    1.3%   27.8%
Average   22.6   278.6    14.2   244.7    14.7   245.6     9.6   11.8%   32.2%   -0.4%   34.5%
12-bit      32     225    17.0     226    18.0     245    13.5   -8.9%   20.8%   -8.4%   24.9%
16-bit     109     783    19.5     699    21.3     763    14.2    2.6%   27.4%   -9.2%   33.4%
20-bit     271    2312    21.2    1867    22.5    2042    14.4   11.7%   32.2%   -9.4%   35.9%
24-bit     345    4309    23.5    3388    25.0    3146    15.2   27.0%   35.4%    7.1%   39.2%

Statistics from the synthesis report of Xilinx ISE, the results in terms of LUT equivalent counts were generated for our proposed method (Prop.), the baseline implementation of CSD coefficient multipliers (CSD), and the minimal logic depth graph-based MCM algorithm from [5] (MinLD), and are given in Table IV. The MinLD algorithm was chosen because it outperforms all MCM CSE algorithms with minimal logic depth, at a slightly higher adder cost than some graph-based algorithms. For the filters from [4], the proposed LUT-based method for MCM has a clear advantage over CSD and, on average, it consumes 15.6% fewer LUT equivalents than the designs synthesized by [5]. The results of the DAmp filter from [9] show that the savings increase with increasing number of filter coefficients and precision of the coefficients.

The MCM block of an FIR filter has a large number of outputs, which exceeds the number of I/O pins of the Virtex-4 FPGA for most of the example filters used in Table V. Therefore, the synthesis results of Xilinx XST before mapping are shown. In terms of logic slices used, the simulation results show a marginal reduction for the proposed method compared with CSD, but for the 695 tap DAmps filter with 12-bit constants the proposed method requires slightly more slices. The slice count is somewhat less favorable to the proposed method when it is compared with MinLD. This is due to the fact that the advanced low-level optimization performed by Xilinx XST can reduce the LUT equivalents for adders, while the LUT equivalents for the proposed LUTs cannot be further optimized. However, the delay of the proposed LUT-based method is reduced by approximately 1/3 compared with those of CSD or MinLD. Although the reported delays may not be accurate due to the I/O constraint, they provide a relatively good indication of the trend. The shortened critical path is likely to reduce the power consumption.

V. CONCLUSION

A LUT-based method for designing fully parallel multiple constant multiplication amenable to FPGA mapping was presented. It was shown that there exists an upper limit for the number of LUTs required for each input bit width. By splitting an 8-bit input signal into two 4-bit signals, simulation results for several FIR filters showed that the proposed method achieves the predicted hardware savings. The proposed LUT-based method is comparable with the state-of-the-art adder-based MCM optimization techniques in terms of the usage of logic slices and outperforms them by about 33% in terms of delay. The results also indicated that the proposed method is advantageous for MCM problems when the number of constants and their precision increase.

REFERENCES
[1] D. R. Bull and D. H. Horrocks, "Primitive operator digital filters," IEE Proc. G on Circuits, Devices and Systems, vol. 138, no. 3, pp. 401-412, Jun. 1991.
[2] Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Trans. Algorithms, vol. 3, no. 2, p. 11, May 2007.
[3] C. H. Chang, J. Chen, and A. P. Vinod, "Information theoretic approach to complexity reduction of FIR filter design," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 8, pp. 2310-2321, Sep. 2008.
[4] L. Aksoy, E. O. Gunes, and P. Flores, "An exact breadth-first search algorithm for the multiple constant multiplications problem," in Proc. IEEE NORCHIP Conf., Tallinn, 16-17 Nov. 2008, pp. 41-46.
[5] M. Faust and C. H. Chang, "Minimal logic depth adder tree optimization for multiple constant multiplication," in Proc. IEEE Int. Symp. on Circuits Syst. (ISCAS 2010), Paris, France, May 30 - Jun. 2, 2010, pp. 457-460.
[6] M. J. Wirthlin, "Constant coefficient multiplication using look-up tables," J. VLSI Signal Process., vol. 36, no. 1, pp. 7-15, Jan. 2004.
[7] P. K. Meher, "New approach to look-up-table design and memory-based realization of FIR digital filter," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 3, pp. 592-603, Mar. 2010.
[8] P. K. Meher, "LUT optimization for memory-based computation," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 4, pp. 285-289, Apr. 2010.
[9] C. H. Chang and M. Faust, "On a new common subexpression elimination algorithm for realizing low-complexity higher order digital filters," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 29, no. 5, pp. 844-848, May 2010.
[10] FIRsuite, "Suite of constant coefficient FIR filters," 2010. [Online]. Available: http://www.firsuite.net
[11] O. Gustafsson, "Lower bounds for constant multiplication problems," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 11, pp. 974-978, Nov. 2007.
