1 Introduction
Multithreading has been touted as the solution to the ever-increasing performance gap between processors and memory systems (i.e., the memory wall): the latency incurred by a memory access can be tolerated by performing useful work on a different thread. While there is no single best approach to multithreading applications, there is a consensus that on conventional superscalar-based architectures, using conventional single-threaded or coarse-grained multithreading models, performance peaks at a small number of threads (2-4). This implies that adding
† This work is supported in part by the following NSF grants: MIPS 9796310, EIA 9729889, and EIA 9895216.
[Figure: A decoupled multithreaded organization with a Memory Processor and an Execute Processor, each a four-stage pipeline (IF, ID, OF, EX/WB), sharing the register contexts; the memory side includes an instruction cache, a data cache, and a scoreboard.]
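The thread-count saturation noted in the introduction can be made concrete with a standard analytical model of multithreaded utilization, in the spirit of Agarwal's analysis. The sketch below is our own illustration, not a model from this paper; R, L, and C (run length, memory latency, and context-switch cost) are assumed parameters.

```python
def utilization(n_threads, R, L, C=0.0):
    """Steady-state utilization of a coarse-grained multithreaded
    processor: a thread runs R cycles, then issues a memory access
    of latency L; switching to another thread costs C cycles."""
    n_sat = 1.0 + L / (R + C)           # threads needed to hide L fully
    if n_threads >= n_sat:
        return R / (R + C)              # latency fully hidden
    return n_threads * R / (R + L + C)  # latency still exposed

# Example: with L = 3R (the setting used in the later experiments) and
# free context switches, utilization saturates at 1 + L/R = 4 threads.
for n in range(1, 7):
    print(n, utilization(n, R=10, L=30))
```

Once the thread count passes 1 + L/(R + C), additional threads no longer buy latency tolerance; they mostly add cache interference, which is consistent with performance peaking at a small number of threads on conventional designs.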
3 Scheduled Dataflow
Even though the dataflow model and its architectures have been studied for more than two decades, and the model holds the promise of an elegant execution paradigm with the ability to exploit the inherent parallelism in applications, actual implementations of the model have failed to deliver the promised performance. Most modern processors
[Figure: The Scheduled Dataflow execution pipeline. Instruction Fetch (indexed by PC, context, opcode, and register offset) reads from the instruction and frame memory through the instruction cache; Operand Fetch reads from the operand cache; Execute and Write Back complete the pipeline, which shares the register contexts with a synchronization (Synch) processor that accesses I-structure memory through an I-structure cache.]
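The pipeline above separates memory-side work (instruction and operand fetch, I-structure access, synchronization) from the execution pipeline. The following is a minimal, hypothetical sketch of the thread lifecycle this implies: pre-load a thread's frame into its register context, execute without memory accesses, then post-store the results. The names (Thread, preload, poststore, the frame dictionary) are our own illustration, not the SDF instruction set.

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    frame: dict                                # inputs deposited by producer threads
    registers: dict = field(default_factory=dict)

def preload(t, names):
    """Memory/synchronization side: move a thread's frame slots into
    its register context before it is handed to the execution pipeline."""
    for name in names:
        t.registers[name] = t.frame[name]

def execute(t):
    """Execution pipeline: register-to-register work only; a non-blocking
    thread runs to completion without touching memory, so it never stalls."""
    t.registers["sum"] = t.registers["a"] + t.registers["b"]

def poststore(t, memory, addr):
    """Memory side again: write results back (e.g., into a consumer
    thread's frame), decoupled from the execution pipeline."""
    memory[addr] = t.registers["sum"]

memory = {}
t = Thread(frame={"a": 3, "b": 4})
preload(t, ["a", "b"])                     # all loads grouped up front
execute(t)                                 # pure computation, no stalls
poststore(t, memory, "consumer_frame.x")   # all stores grouped at the end
print(memory)                              # {'consumer_frame.x': 7}
```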
[Figure 4: Execution time (K unit time) versus the number of concurrent threads for the conventional architecture, Rhamma, and SDF, with memory latencies L = 1R, 3R, and 5R.]
[Figure: Execution time (K unit time) for the conventional architecture, Rhamma, and SDF (two panels).]
Scheduled Dataflow leads to finer-grained threads. These two observations indicate that we can expect even better performance for Scheduled Dataflow than shown in figure 4. In the remaining experiments we use L = 3R for both the Scheduled Dataflow and Rhamma architectures.
As for the conventional architecture, increasing the fraction of memory-access instructions leads to more cache misses and thus longer execution times. The decoupling, however, permits the two multithreaded processors to tolerate the cache-miss penalties. Note that Scheduled Dataflow outperforms Rhamma for all fractions of memory-access instructions, primarily because of the pre-loading and post-storing performed by Scheduled Dataflow. We feel that decoupling memory accesses from execution is more useful when the memory accesses can be grouped together (as done in Scheduled Dataflow).
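Why grouping helps can be sketched with a back-of-the-envelope stall model (our own illustration; all parameters are assumed). Scattered accesses can expose each miss's full penalty to the thread, while a pre-load group issues its accesses back to back, so the misses overlap one another:

```python
def scattered_time(n_access, miss_rate, penalty, compute):
    # Accesses interleaved with computation: on average each miss
    # stalls the thread for the full penalty (no overlap).
    return compute + n_access * miss_rate * penalty

def grouped_time(n_access, miss_rate, penalty, compute):
    # Pre-loaded accesses issued back to back: misses within the
    # group are pipelined, so (idealizing full overlap) the thread
    # pays one cycle per access plus a single exposed penalty.
    return compute + n_access + penalty

params = dict(n_access=20, miss_rate=0.1, penalty=50, compute=100)
print(scattered_time(**params))  # 100 + 20 * 0.1 * 50 = 200
print(grouped_time(**params))    # 100 + 20 + 50      = 170
```

In a multithreaded setting even the single exposed penalty of the grouped case can be hidden by running another thread while the pre-load completes.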
[Figure: Execution time (K unit time) for the conventional architecture, Rhamma, and SDF as a function of cache miss rate (2% to 10%, left panel) and cache miss penalty (20 to 100, right panel).]
5 Conclusions
In this paper we presented a dataflow architecture that uses control-flow-like scheduling of instructions and separates memory accesses from instruction execution to tolerate the long latencies incurred by memory accesses. Our primary goal is to show that it is possible to design efficient multithreaded dataflow implementations. While decoupled access/execute implementations are possible with single-threaded architectures, the multithreading model presents better opportunities for exploiting the decoupling of memory accesses from the execution pipeline. We feel that, even among multithreaded alternatives, non-blocking models are better suited to decoupled execution. Furthermore, grouping memory accesses (e.g., pre-load and post-store) for threads eliminates unnecessary delays (stalls) caused by memory accesses. We strongly favor the use of dataflow instructions to reduce the complexity of the processor, eliminating the complex logic that superscalar implementations need for resolving data dependencies, branch prediction, register renaming, and instruction scheduling. Although the results presented here are based on synthetic benchmarks and Monte Carlo simulations, the benchmarks are driven either by published data (e.g., load/store instruction frequencies, branch frequencies, cache miss rates and penalties) or by information obtained from analyzing several programs written for the architectures under evaluation. We are currently developing detailed instruction-level simulations of the proposed Scheduled Dataflow architecture to investigate its performance using instruction traces.
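For concreteness, the following is a rough sketch of the kind of Monte Carlo estimate the evaluation methodology describes, not the authors' simulator; the instruction mix, miss rate, and penalty below are placeholder values of the sort one would take from published data.

```python
import random

def monte_carlo_exec_time(n_instr=100_000, load_store_freq=0.3,
                          miss_rate=0.05, miss_penalty=50, seed=42):
    """Single-issue, in-order model: every instruction costs one cycle;
    each load/store that misses in the cache adds the miss penalty."""
    rng = random.Random(seed)
    cycles = 0
    for _ in range(n_instr):
        cycles += 1
        if rng.random() < load_store_freq and rng.random() < miss_rate:
            cycles += miss_penalty
    return cycles

# Expected value: 100,000 * (1 + 0.3 * 0.05 * 50) = 175,000 cycles.
print(monte_carlo_exec_time())
```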