
IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY, VOL. 21, NO. 3, JUNE 2011

847

8-Bit Asynchronous Wave-Pipelined RSFQ Arithmetic-Logic Unit


T. Filippov, M. Dorojevets, A. Sahu, A. Kirichenko, C. Ayala, and O. Mukhanov, Senior Member, IEEE
Abstract: We have designed and demonstrated an Arithmetic-Logic Unit (ALU) based on RSFQ technology as a required step toward building an 8-bit RSFQ processor datapath. The circuit was designed and fabricated with the HYPRES standard 4.5 kA/cm² process. The target clock frequency of the ALU is 20 GHz. In this paper, we present the design and the low-speed functionality test results of the 8-bit ALU.

Index Terms: Adder, ALU, microprocessor, RSFQ, SFQ, timing.
TABLE I ALU INSTRUCTION SET

I. INTRODUCTION

HIGH-PERFORMANCE COMPUTING (HPC) is one of the fields in which superconductor digital microelectronics is trying to establish its presence, following the path established by IBM's famous Josephson project [1]. Unfortunately, the requirement of global timing for the superconductor ac-powered latching logic, the high power dissipation of the voltage-generating elements of this logic family, and some other technical obstacles made it impossible to implement processors running at clock rates of tens of gigahertz. With the appearance of RSFQ logic [2], the development of a high-performance superconductor processor became more feasible. Non-latching, dc-powered RSFQ logic featuring local and self-timing [3] enabled the design of processing modules operating at tens of gigahertz with very low power dissipation.

There were two projects for developing a superconductor computer [4]–[6]. A major part of these projects was the development of an RSFQ microprocessor operating at minimum power while clocking at very high rates. Only two 8-bit prototypes of such a microprocessor, FLUX [5] and CORE [6], have been developed to date, and only CORE was successfully demonstrated. Neither of them used true 8-bit-wide data processing in its pipeline. The FLUX microprocessor had a novel processing-in-registers microarchitecture that allowed eight ALU operations to proceed simultaneously in its datapath, producing up to eight bits per cycle (albeit belonging to different operations). CORE used a simple bit-serial pipeline generating one bit of result per cycle.
Manuscript received August 04, 2010; accepted December 22, 2010. Date of publication February 10, 2011; date of current version May 27, 2011. This work was supported in part by DoD Contract W911NF-09-C-003.
T. Filippov, A. Sahu, A. Kirichenko, and O. Mukhanov are with HYPRES, Inc., Elmsford, NY 10523 USA (e-mail: alex@hypres.com).
M. Dorojevets and C. Ayala are with the Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY USA (e-mail: midor@ece.sunysb.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASC.2010.2103918

The FLUX microarchitecture was able to hide the latency of its eight bit-serial processing pipelines by allowing any instruction to start its execution as soon as the least significant bits of its input operands are calculated. In the bit-serial CORE processor, an instruction needs to wait until all eight bits of its inputs are calculated sequentially. Although these approaches allowed the design of low-complexity execution pipelines in these first microprocessor prototypes, they are neither scalable nor applicable to future 32-/64-bit RSFQ processors. That is why the development of a wide-datapath microprocessor is crucial for superconductor-based HPC.

Recently, HYPRES and Stony Brook University (SBU) have undertaken a joint project to develop a 20 GHz 8-bit processor datapath. In particular, SBU develops the microarchitecture and the complete cell-level design, while HYPRES designs the cell library and physical layout, fabricates the chips, and tests them. This is a first attempt to develop a wide-datapath RSFQ microprocessor. Its microarchitecture was reported in [11] and [13]. The microprocessor is designed for the HYPRES Nb 4.5-kA/cm² fabrication process [7].

In this paper, we describe the design and functionality test results of the major part of the 8-bit microprocessor: an Arithmetic-Logic Unit, a digital circuit that performs arithmetic and logic operations on integer operands.

II. ALU ARCHITECTURE AND DESIGN

A. ALU Architecture

In the instruction set of the processor (see Table I), addition (ADD) is the most complex and hardware-consuming arithmetic operation. Because of that, the design of our ALU is based on a parallel adder design [8]. The RSFQ logic family is naturally suitable for a deeply pipelined architecture because its cells have internal memory. Deep pipeline architectures result in high throughput, but, at

1051-8223/$26.00 © 2011 IEEE


Fig. 1. Block-diagram of the 8-bit RSFQ ALU.

the same time, they inherently have large latency. The simplest ALU, based on a ripple-carry adder [8], has a latency of N clock periods for operations on N-bit-wide data. This approach makes the design of a fast general-purpose microprocessor of 16 bits or more impractical. Because of that, we have chosen a Kogge-Stone adder [9] from the carry-look-ahead adder family. In contrast to the previously explored adder designs [10], we developed and implemented our adder with an asynchronous wave-pipelined microarchitecture [11]–[13].

B. ALU Components

Fig. 1 shows the block diagram of the ALU. It consists of four types of blocks: INIT, ROUT1, ROUT2, and SUM, connected with passive transmission lines (PTLs). All components were simulated using the physical-level simulator PSCAN [14]. The timing parameters were then extracted from the physical-level simulation and used in the VHDL library. The complete VHDL ALU design and simulation for a 20-GHz clock rate were performed with the HYPRES 4.5-kA/cm² standard cell library. Each block has PTL receivers (RX) at the input and PTL transmitters (TX) at the output. This makes routing of interconnects easier.

The most important part of the ALU is the INIT block (Fig. 2). It performs all primitive logic functions on the operands. The rest of the ALU circuitry basically comprises the routing part of the Kogge-Stone adder. In Fig. 2, D is a D flip-flop [2], D2 is a dual-port D flip-flop [15], DC is a D flip-flop with complementary outputs, XOR is an XOR cell [2], and AND is a dynamic AND cell [16].

The SUM blocks (Fig. 3) form the last stage of the ALU. They perform the XOR function on the partial sums and carries of the Kogge-Stone algorithm to produce the final result. For any

Fig. 2. Schematics (a) and layout (b) of block INIT.

operation other than ADD, i.e., in the absence of carry signals, this block simply passes its input data to the output. Blocks ROUT1 (Fig. 4) and ROUT2 (Fig. 5) provide partial sum and carry routing in accordance with the Kogge-Stone algorithm [9], as well as the propagation of bit-logic operation results. The cell C in the schematics designates a resettable Muller C element [17].

Fig. 3. Schematics (a) and layout (b) of block SUM.

Fig. 4. Schematics (a) and layout (b) of block ROUT1.

Fig. 5. Schematics (a) and layout (b) of block ROUT2.

Fig. 6. Chip with the 8-bit ALU.

These blocks were integrated into a chip with an 8-bit ALU, shown in Fig. 6. The chip has approximately 8,000 Josephson junctions. Note that this chip is a product of HYPRES's current 1.0-μm lithography. We will soon be able to produce chips with a 0.25-μm stepper, thereby enabling the size of the ALU to shrink at least threefold. That should also reduce the part of the latency caused by propagation time in the PTLs. The simulated average ALU latency is 390 ps (with fluctuations of 4 ps), which includes 50 ps of signal propagation delay over the PTLs. The real advantage of the Kogge-Stone architecture over the ripple-carry architecture appears in ALUs with a wider datapath (16 bits and more) [13].

III. FUNCTIONALITY TEST

Extensive low-frequency functionality tests were performed on all parts of the ALU (Figs. 7–12). The experiments showed that the ALU correctly executes all instructions.
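As a software point of reference for the addition results reported in this section, the following is a minimal behavioral sketch of 8-bit Kogge-Stone addition [9]. It is not the authors' circuit: it models only the logical scheme (per-bit propagate/generate signals, prefix stages with doubling distance, and a final XOR), not the RSFQ cells, the PTL routing, or the wave-pipelined timing. The function name `kogge_stone_add` and its parameters are hypothetical, chosen here for illustration.

```python
def kogge_stone_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned integers modulo 2**width using a Kogge-Stone prefix scheme.

    Illustrative model only; names and structure are assumptions, not the
    paper's INIT/ROUT1/ROUT2/SUM implementation.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Initial stage: per-bit propagate (XOR) and generate (AND) signals.
    p = a ^ b
    g = a & b
    half_sum = p  # kept for the final XOR stage

    # Prefix stages: the combining distance doubles each time (1, 2, 4, ...),
    # so an 8-bit adder needs three stages.
    dist = 1
    while dist < width:
        # Both updates use the values from the previous stage.
        g, p = (g | (p & (g << dist))) & mask, (p & (p << dist)) & mask
        dist <<= 1

    # g now holds the carry out of every bit position (carry-in assumed 0);
    # shifting left by one gives the carry into each position.
    carries = (g << 1) & mask

    # Final stage: XOR of partial sums and carries.
    return half_sum ^ carries


if __name__ == "__main__":
    # Exhaustive comparison against ordinary modulo-256 addition.
    ok = all(kogge_stone_add(x, y) == (x + y) % 256
             for x in range(256) for y in range(256))
    print("all 65536 operand pairs match:", ok)
```

The exhaustive loop plays the role of a golden model: with only 65,536 operand pairs, every 8-bit case can be compared against ordinary modulo-256 addition.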

Fig. 7. The 8-bit ALU functionality test for operation ADD.

Functionally, the most complex operation is addition (ADD). The ALU's instruction set (Table I) includes four variations of this operation. To provide subtraction, the ALU can invert one or both operands and add them in a single instruction. Fig. 7 shows the correct operation of the ALU adding 8-bit numbers (A + B). The bottom trace is the Ready signal, which precedes every instruction execution. The 8-bit operand A, the 8-bit operand B, and the 8-bit outputs are shown in ascending order. The result of the addition is produced modulo 256. The most complex operation in the instruction set is ADD-Invert A and B, which is essentially equivalent to the arithmetic


Fig. 8. The 8-bit ALU functionality test for operation ADD-Invert. Here, both operands (A and B) are inverted before summing.

Fig. 9. The 8-bit ALU functionality test for operation AND.

Fig. 10. The 8-bit ALU functionality test for operation NOR.

Fig. 11. The 8-bit ALU functionality test for operation XOR.

operation (−2 − A − B). The test result of the ALU performing this operation is shown in Fig. 8. Here, we have preserved the same order of traces and the same operand pattern as in Fig. 7.

The functionally simplest operations are the so-called bit-logic operations, such as AND, XOR, and NOR. They do not produce a carry bit propagating across the ALU. The results of logic operations performed in the INIT blocks of the ALU (Fig. 1) go directly to the output. This property of the bit-logic operations simplifies both their testing and the pattern necessary to perform a complete test of the ALU.
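The arithmetic behind the inverted-operand additions can be checked with a short sketch. In 8 bits, inverting X gives 255 − X, so adding two inverted operands gives 510 − A − B, which equals −2 − A − B modulo 256. Likewise, A plus inverted B equals A − B − 1 modulo 256. The function names below are hypothetical labels chosen for illustration, not the mnemonics of Table I, and the sketch assumes no carry-in.

```python
MASK = 0xFF  # 8-bit datapath


def invert(x: int) -> int:
    """Bitwise complement restricted to 8 bits (255 - x)."""
    return ~x & MASK


def add(a: int, b: int) -> int:
    """Addition modulo 256, as in the ADD test of Fig. 7."""
    return (a + b) & MASK


def add_invert_both(a: int, b: int) -> int:
    """Invert both operands, then add (hypothetical name)."""
    return add(invert(a), invert(b))


def add_invert_b(a: int, b: int) -> int:
    """Invert only operand B, then add (hypothetical name)."""
    return add(a, invert(b))


# Bit-logic operations: no carry propagates between bit positions.
def and_(a: int, b: int) -> int:
    return a & b & MASK


def nor(a: int, b: int) -> int:
    return ~(a | b) & MASK


def xor(a: int, b: int) -> int:
    return (a ^ b) & MASK


def xnor(a: int, b: int) -> int:
    return ~(a ^ b) & MASK


if __name__ == "__main__":
    pairs = [(x, y) for x in range(256) for y in range(256)]
    # Python's % returns a non-negative result for a positive modulus.
    print(all(add_invert_both(x, y) == (-2 - x - y) % 256 for x, y in pairs))
    print(all(add_invert_b(x, y) == (x - y - 1) % 256 for x, y in pairs))
```

Because the bit-logic functions act on each bit position independently, a far smaller operand pattern is enough to exercise them than for the additions, which is consistent with the remark above about simplified testing.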

The low-speed functionality test results for four bit-logic operations are shown in Fig. 9 (AND), Fig. 10 (NOR), Fig. 11 (XOR), and Fig. 12 (XNOR). For consistency, we placed the traces in the same order as in Fig. 7.

IV. CONCLUSION

We have designed, fabricated, and successfully tested the RSFQ 8-bit ALU. The ALU design is based on a Kogge-Stone adder and employs an asynchronous wave-pipelined approach. This approach reduces the latency and allows us to scale the ALU to a larger number of bits (up to 64). The ALU has been fabricated with the HYPRES standard 4.5-kA/cm² Nb process.


Fig. 12. The 8-bit ALU functionality test for operation XNOR.

The targeted clock rate of the ALU is 20 GHz. Comprehensive low-speed functionality tests have been performed for all ALU functions. The ALU functions properly for all instructions from its instruction set and for all operands. As the next step, we are working on high-speed testing to evaluate experimentally the maximum operating clock rate of the ALU.

ACKNOWLEDGMENT

The authors would like to thank D. Donnelly, R. Hunt, J. Vivalda, D. Yohannes, and S. K. Tolpygo of the HYPRES fabrication team. Discussions with and encouragement from M. Manheimer and S. Holmes are appreciated.

REFERENCES

[1] W. Anacker, "Josephson computer technology: An IBM research project," IBM Journal of Research and Development, vol. 24, no. 2, pp. 107–112, Mar. 1980.
[2] K. Likharev and V. Semenov, "RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz clock-frequency digital systems," IEEE Trans. Appl. Supercond., vol. 1, pp. 3–28, Mar. 1991.
[3] O. A. Mukhanov, S. V. Rylov, V. K. Semenov, and S. V. Vyshenskii, "RSFQ logic arithmetic," IEEE Trans. Magn., vol. MAG-25, no. 2, pp. 857–860, Mar. 1989.
[4] T. Sterling, "A design analysis of a hybrid technology multithreaded architecture for petaflops scale computation," in Proc. of International Conference on Supercomputing, 1999, pp. 386296.
[5] P. Bunyk, M. Leung, J. Spargo, and M. Dorojevets, "FLUX-1 RSFQ microprocessor," IEEE Trans. Appl. Supercond., vol. 13, no. 1, p. 433, 2003.
[6] A. Fujimaki, M. Tanaka, T. Yamada, Y. Yamanashi, H. Park, and N. Yoshikawa, "Bit-serial single flux quantum microprocessor CORE," IEICE Trans. Electron., vol. E91-C, pp. 342–349, Mar. 2008.
[7] HYPRES Design Rules [Online]. Available: http://www.hypres.com
[8] J. Y. Kim, S. Kim, and J. Kang, "Construction of an RSFQ 4-bit ALU with half adder cells," IEEE Trans. Appl. Supercond., vol. 15, no. 1, p. 308, 2005.
[9] P. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Computers, vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[10] P. Bunyk and P. Litskevitch, "Case study in RSFQ design: Fast pipelined 32-bit adder," IEEE Trans. Appl. Supercond., pp. 3714–3720, June 1999.
[11] M. Dorojevets, C. Ayala, and A. Kasperek, "Development and evaluation of design techniques for high-performance wave-pipelined wide datapath RSFQ processors," in Proc. of the 12th Int'l Superconductive Electronics Conference (ISEC '09), Fukuoka, Japan, June 16–19, 2009.
[12] W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, "Wave pipelining: A tutorial and research survey," IEEE Trans. VLSI Syst., vol. 6, pp. 464–474, Sep. 1998.
[13] M. Dorojevets, C. Ayala, and A. Kasperek, "Data-flow microarchitecture for wide datapath RSFQ processors: Design study," IEEE Trans. Appl. Supercond., submitted for publication.
[14] S. Polonsky, P. Shevchenko, A. Kirichenko, D. Zinoviev, and A. Rylyakov, "PSCAN'96: New software for simulation RSFQ circuits," IEEE Trans. Appl. Supercond., vol. 7, no. 2, pp. 2685–2689, June 1997.
[15] S. V. Polonsky, V. K. Semenov, and A. F. Kirichenko, "Single flux quantum B flip-flop and its possible applications," IEEE Trans. Appl. Supercond., vol. 4, no. 1, p. 9, 1994.
[16] S. Kaplan, A. Kirichenko, O. Mukhanov, and S. Sarwana, "A prescaler circuit for a superconductive time-to-digital converter," IEEE Trans. Appl. Supercond., vol. 11, no. 1, p. 513, 2000.
[17] T. V. Filippov, S. V. Pflyuk, V. K. Semenov, and E. B. Wikborg, "Encoders and decimation filters for superconductor oversampling ADCs," IEEE Trans. Appl. Supercond., vol. 11, pp. 545–549, Mar. 2001.
