
Operational experiences with the TI Advanced Scientific Computer

by W. J. WATSON and H. M. CARR


Texas Instruments Incorporated
Austin, Texas

INTRODUCTION

Since 1966 a large computer development program has been
conducted by Texas Instruments. The goal for this effort was
to provide needed capacity for supporting seismic processing,
plus offering a general purpose capability for large scientific
problems.
This development has resulted in the Advanced Scientific
Computer (ASC), a highly modular system offering a wide
spectrum of processor power, memory sizes, and I/O capability.
The ASC is a high-speed, large-scale processing system
featuring extensive use of pipelining, multiple arithmetic
units, separate control processors, large and fast central
memory, and extensive user software aids. The central
processor has both scalar and vector instruction capabilities.
First delivered in 1972 and placed into operational status
during 1973, several operational ASC systems now offer
extremely high processing rates for particular classes of
problems.

CENTRAL MEMORY

The ASC central memory consists of a memory control
unit (MCU) and appropriately sized modules of high-speed or
medium-speed central memory. Optionally, a medium-speed
central memory extension can be used in conjunction with a
high-speed memory.
The MCU is organized as a two-way, 256-bit/channel
(8-word) parallel access traffic net between eight independent
processor ports and nine memory buses, with each processor
port having full accessibility to all memories. The nine
memory buses are organized to provide eight-way interleaving
for the first eight buses with the ninth bus used for the central
memory extension. The MCU provides the facilities for
controlling access from the eight processor ports to a CM
having a 24-bit address space (16 million words). A port
expander can be utilized to expand the number of processor
ports. Figure 2 illustrates this structure.
The semiconductor high-speed central memory modules
have a cycle time of 160 ns and a read time of 140 ns.
Additionally, all transfers are 256 bits (eight 32-bit words)
with a Hamming code providing single-bit error correction
and double-bit error detection for each 32-bit word. High-speed
central memory is typically divided into eight equal-sized
modules which allow for eight-way interleaving.
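The arithmetic behind these figures can be sketched in Python. The check-bit count follows from the standard Hamming bound; the paper does not state how many check bits the ASC stores per word. The address-to-module mapping (consecutive octets rotating across the eight modules) is a common interleaving scheme assumed here for illustration only, and the function names are hypothetical.

```python
WORDS_PER_OCTET = 8        # one 256-bit transfer = eight 32-bit words
INTERLEAVED_MODULES = 8    # eight-way interleaving

def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming single-error-correcting code,
    plus one overall parity bit for double-error detection."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1

def module_for_word(word_address: int) -> int:
    """Assumed mapping: consecutive octets rotate across the modules."""
    octet_index = word_address // WORDS_PER_OCTET
    return octet_index % INTERLEAVED_MODULES

# A 24-bit word address spans 2**24 words ("16 million words").
address_space_words = 1 << 24
```

Under the assumed mapping, eight consecutive octet fetches each land in a different module, which is what lets the modules overlap their cycle times.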

OVERVIEW OF THE SYSTEM


The major subsystems of a typical configuration are shown
in Figure 1: the central memory, the central processor, the
peripheral processor, on-line bulk storage, a digital communications interface, plus a selection of standard peripherals.
The peripheral processor has been designed for executing
the operating system. The central processor has been designed
expressly to provide high computing speeds when operating
upon large arrays of data. The central processor operates as
a slave to the peripheral processor. This design approach was
chosen to maximize the overlapping of system overhead tasks
with the execution of user programs. In operation the job
stream is analyzed by the peripheral processor. The language
processors, plus user object code, are executed by the central
processor. System control and I/O tasks are processed by the
peripheral processor. I/O is routed through high-speed,
head-per-track disc storage. A data communications interface
for the common carriers is provided for the support of remote
batch and interactive terminals. Standard types of peripherals
are also provided. The central memory serves as the common
communications and access storage medium for these
subsystems.

[Figure 1: block diagram connecting the central processor (CP), peripheral processor (PP), central memory, disc storage, data communications (to common carriers), and peripherals.]

Figure 1-Major ASC subsystems


From the collection of the Computer History Museum (www.computerhistory.org)


National Computer Conference, 1974

[Figure 2: the memory control unit (MCU) sits between the primary and secondary memory access ports and the interleaved high-speed or medium-speed memory modules, with an optional memory extension of interleaved medium-speed memory modules.]

Figure 2-Modular structure of the ASC central memory

The optional central memory extension allows large
amounts of medium-speed memory (1 µs semiconductor
technology) to be used in the normal address space of central
memory. Block transfer between memory extension and
high-speed memory is controlled by the peripheral processor
and will transfer at a rate of 40 M words per second.
Memory mapping registers and protection registers are
used to facilitate central memory management and access
control of the ports.
CENTRAL PROCESSOR
The central processor provides both scalar (single operand)
and vector (array) instructions at the machine level. The
basic instruction size is 32 bits, with 16-, 32-, or 64-bit
operands. The single instruction stream, which contains a
mixture of scalar and vector instructions, is preprocessed by
the instruction processing unit.
The central processor design is such that one, two, three,
or four execution units or "pipes" can be provided. These
units employ the pipeline concept in both scalar and vector
modes. A single execution unit can have up to twelve scalar
instructions in process at one time. From one to four vector
results can be produced every 60 ns, depending on the
number of execution units provided.
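As a quick check on the peak vector rates implied by a 60 ns result interval per execution unit, the following Python sketch computes the average time per result and the peak results per second for a given number of pipes. It assumes every pipe delivers one result per cycle with no stalls; the function names are illustrative.

```python
AU_CYCLE_NS = 60.0  # one vector result per execution unit per cycle

def ns_per_result(pipes: int) -> float:
    """Average time between results across all pipes."""
    return AU_CYCLE_NS / pipes

def peak_results_per_second(pipes: int) -> float:
    """Upper bound on the vector result rate, ignoring stalls."""
    return pipes / (AU_CYCLE_NS * 1e-9)
```

For example, `ns_per_result(4)` gives 15 ns, and a single pipe peaks at roughly 16.7 million results per second.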
The CP has 48 program-addressable registers. This group
of 32-bit registers consists of sixteen base address registers,
sixteen arithmetic registers, eight index registers, and eight
vector parameter registers. This last group is used to extend
the instruction format for the complete specification of vector
instructions.
The CP scalar instruction repertoire includes an extensive
set of load and store instructions: halfword, fullword, and
doubleword instructions, with immediate, magnitude, and
negative operand capabilities. Ability to load and store
register files and to load effective addresses is also available.
Arithmetic scalars include various adds, subtract, multiply,
and divide for halfword (16-bit) and fullword (32-bit) fixed
point numbers and fullword and doubleword (64-bit) floating
point numbers. Scalar logical instructions are provided as are
arithmetic, logical, and circular shifts. Various comparison
instructions and combination comparison-logical instructions
are provided for halfword, fullword, and doublewords. Many
combinations of test and branching instructions with incrementing
or decrementing capability are also available.
Stacking and modifying arithmetic registers can be done with
single instructions. Subroutine linkage is accomplished
through branch and load instructions. Format conversion
instructions for single and doublewords, as well as normalize
instructions, are available.
The vector capabilities of the CP are made available
through the use of VECTL (vector after loading vector
parameter file) and VECT (assumes parameter file is already
loaded) instructions. The vector repertoire includes such
arithmetic operations as add, subtract, multiply, divide,
vector dot product, matrix multiplication, and others for both
fixed point and floating point representations. Vector
instructions are also available for shifting; logical operations;
comparisons; format conversions; normalization; and special
operations such as Merge, Order, Search, Peak Pick, Select
and Replace, among others.
One important characteristic of the vector instruction
capability is the ability to encompass three dimensions of
addressability within a single vector instruction. This is
equivalent to a nest of three indexing loops in a conventional
machine.
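The equivalence to three nested indexing loops can be illustrated with a short Python sketch of the operand addresses one such instruction would touch. The (count, stride) pairs and the function name are hypothetical; the paper does not describe the actual layout of the vector parameter registers.

```python
from typing import Iterator, Tuple

Dim = Tuple[int, int]  # (count, stride in words); illustrative fields only

def vector_addresses(base: int, dims: Tuple[Dim, Dim, Dim]) -> Iterator[int]:
    """Yield word addresses for a three-level vector operand.

    dims is ordered outer, middle, inner, mirroring a nest of
    three indexing loops on a conventional machine.
    """
    (n_outer, s_outer), (n_mid, s_mid), (n_inner, s_inner) = dims
    for k in range(n_outer):
        for j in range(n_mid):
            for i in range(n_inner):
                yield base + k * s_outer + j * s_mid + i * s_inner

# A 50 x 50 x 50 array stored contiguously, first index varying fastest.
addrs = list(vector_addresses(0, ((50, 2500), (50, 50), (50, 1))))
```

With these strides the generated addresses run straight through the 125,000 words of the array; other strides would select sub-blocks or cross-sections.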
The basic structure of the CP, shown in Figure 3, has three
major components: the instruction processing unit (IPU) for
non-arithmetic stages of instruction processing for the CP
instruction stream, the memory buffer unit (MBU) to provide
operand interfacing with the central memory, and an
arithmetic unit (AU) to perform the specified arithmetic or
logical operations. Figure 3 shows a CP diagram for 2- or
4-pipeline CP's, each with a corresponding number of
MBU-AU pairs. Note that a memory port is required for the
IPU and, in addition, one memory port for each pipeline
(MBU-AU pair) in a CP.
A significant feature of the CP hardware is an operand
look-ahead capability which causes memory references to be
requested prior to the time of actual need. Double buffering
in multiple 8-word (octet) buffers for each pipeline provides
a smooth data flow to and from each arithmetic unit. The
pipelined AU achieves its highest sustained flow rate in the
vector mode, typically a result each 60 ns per AU, or an
average of 15 ns per result for a 4-pipe central processor.

[Figure 3: block diagrams of a two-pipeline CP and a four-pipeline CP, each connecting primary memory ports to MBU-AU pairs.]

Figure 3-Basic structure of the CP

Instruction processing unit

The primary function of the instruction processing unit
(IPU) is to supply a continuous stream of instructions for
execution by the other parts of the CP. One Central Memory
port is required to provide the instruction stream. Two 8-word
(octet) buffers are utilized to achieve a balanced stream of
instructions from memory to the IPU. Instructions are
transferred from memory in octets as are all other references
to memory for fetching or storing of information.
Up to 36 instructions in various stages of execution can be
overlapped within the 4-pipe CP. There are twenty positions
for instructions in the 2-pipe CP and twelve positions for
instructions in the 1-pipe CP. Four levels are contained
within the IPU, and eight levels are contained in each
arithmetic pipeline (MBU-AU pair). The IPU performs
routing of instructions to the MBU-AU pairs based on an
optimum use of arithmetic unit capability.
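The overlap figures quoted above follow directly from the level counts: four levels in the IPU plus eight per arithmetic pipeline. A minimal Python check (the names are illustrative):

```python
IPU_LEVELS = 4            # levels within the instruction processing unit
LEVELS_PER_PIPELINE = 8   # levels in each MBU-AU pair

def instruction_positions(pipes: int) -> int:
    """Instructions that can be in some stage of execution at once."""
    return IPU_LEVELS + LEVELS_PER_PIPELINE * pipes

positions = {p: instruction_positions(p) for p in (1, 2, 4)}
```

This reproduces twelve, twenty, and 36 positions for the 1-, 2-, and 4-pipe configurations.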
Vector processing is altered by software in order to
distribute segments of the vector for multiple pipe systems.
Several features are provided to alleviate the potential
problems of branches and instruction dependencies in the
instruction pipeline.


Memory buffer unit

The memory buffer unit (MBU) provides an interface
between central memory and the arithmetic unit. Its primary
function is to supply the arithmetic unit with a continuous
stream of operands from memory and to provide for the
storing of the results back to memory. All references to
memory, whether for fetching or storing, are made in 8-word
increments (octets).
The MBU has three double buffers, one octet per buffer,
called the "X" and "Y" buffers for input and the "Z" buffers
for output. This double buffering is provided so that pipeline
processing can be sustained at a high rate with minimal
memory access conflicts.
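Because every memory reference moves a full octet, the number of references a vector operation needs is a simple function of its length. The Python sketch below counts octet references for the X and Y input streams and the Z output stream. It assumes one 32-bit word per element and operands that begin on an octet boundary, neither of which the paper specifies; the names are hypothetical.

```python
WORDS_PER_OCTET = 8

def octet_references(n_words: int) -> int:
    """Octet-sized memory references needed to move n_words words."""
    return (n_words + WORDS_PER_OCTET - 1) // WORDS_PER_OCTET

def vector_op_references(n_elements: int) -> dict:
    """References per stream for Z = X op Y (assumed one word per element)."""
    per_stream = octet_references(n_elements)
    return {"X": per_stream, "Y": per_stream, "Z": per_stream}
```

A 50-element vector, for instance, needs seven octet fetches per input stream, with the last octet only partly used.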

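[Figure 4: the AU pipeline sections (receiver register, exponent subtract, align, add, normalize, multiply, accumulate, output), with the sections used by floating add and by fixed multiply marked.]

Figure 4-Arithmetic unit pipeline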

Arithmetic unit

The primary function of a CP arithmetic unit (AU) is to
perform the arithmetic operations specified by the operation
code of the instruction currently at the AU level. There is one
AU per pipeline in the CP, each having a 60 ns basic cycle
time. A distinguishing feature of an AU is the pipeline
structure which allows efficient execution of the arithmetic
part of all instructions. There are eight exclusive partitions of
the AU pipeline involved, each of which can provide an output
every 60 ns. These eight sections are (1) receiver register,
(2) exponent subtract, (3) align, (4) add, (5) normalize, (6)
multiply, (7) accumulate, and (8) output. Figure 4 shows how
different sections of the AU are utilized for execution of
particular instructions; i.e., floating point addition and fixed
point multiplication.
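The benefit of the sectioned pipeline can be sketched with the textbook timing model: the path fills once, then delivers one result per 60 ns cycle. In the Python sketch below, the two section paths are assumptions made for illustration, since the paper gives them only in Figure 4, and the model assumes one cycle per section with no stalls.

```python
AU_CYCLE_NS = 60

AU_SECTIONS = ["receiver register", "exponent subtract", "align", "add",
               "normalize", "multiply", "accumulate", "output"]

# Assumed paths through the AU; not confirmed by the text.
FLOATING_ADD = ["receiver register", "exponent subtract", "align",
                "add", "normalize", "output"]
FIXED_MULTIPLY = ["receiver register", "multiply", "accumulate", "output"]

def pipeline_time_ns(path, n_results: int) -> int:
    """Fill the path once, then one result per cycle thereafter."""
    return (len(path) + n_results - 1) * AU_CYCLE_NS
```

Under this model a single floating add on the assumed six-section path takes 360 ns, while 100 of them in vector mode take 6300 ns, or about 63 ns each.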
An AU is a 64-bit parallel operating unit for most scalar
and vector instructions. Exceptions are double length
multiply and all types of division. In these circumstances
various combinations of the components of the AU are

utilized; and, therefore, more than one clock cycle is required
to complete these arithmetic operations.

THE PERIPHERAL PROCESSOR

The peripheral processor (PP) is a powerful multiprocessor
designed to perform the control and data management
functions of the ASC. Several aspects of the implementation
of the peripheral processor concept greatly increase the
effectiveness of the ASC system.
The PP is a collection of eight individual processors called
virtual processors (VP's). Each VP has its own program
counter along with arithmetic, index, base, and instruction
registers. The eight VP's share a read only memory, an
arithmetic unit, an instruction processing unit, and a central
memory buffer. Use of the common units is distributed among
the VP's using sixteen single 85 ns cycles. When an equally
distributed sequence of time units is used, each of the eight
VP's receives two 85 ns cycles every 1.4 µs. The typical PP
instruction requires two 85 ns cycles for completion. The
distribution of available time units can be dynamically varied
to suit particular processing requirements.
The 4K 32-bit words of read only memory within the PP
are utilized for program storage and execution of those short
routines which are highly utilized by the VP's, such as
polling loops.
Because the PP is intended to perform control functions
rather than execute mathematical algorithms, the instruction
set is oriented toward control operations and does not require
multiplication, division, or floating point operations. The
instruction format is similar to that of the central processor,
using a 32-bit word for each instruction. Instructions are
provided for bit (1 bit), byte (8 bits), halfword (16 bits), and
fullword (32 bits) operations.
Each VP has direct access to the entire central memory for
program execution and data storage. Therefore, a single copy
of reentrant code can be executed simultaneously by more
than one VP.
The communications register (CR) file contains sixty-four
32-bit word registers which are program addressable by the
VP's. The CR file serves as the principal storage medium for
control information necessary for the coordination of all parts
of the ASC system.
DISC STORAGE
Disc storage is the principal secondary storage system for
the ASC system. Disc storage consists of head-per-track
(H/T) disc systems supplemented by positioning-arm disc
(PAD) systems.
The H/T disc system is a high-performance device whose
effective performance is further enhanced because the operating
system utilizes a shortest-access-time-first (SATF)
algorithm for data transfers. This combination of hardware
and software provides a very high effective transfer rate.
Each H/T disc module has a capacity of 25 million 32-bit
words with a transfer rate of approximately 500K words per
second. Using the shortest-access-time-first algorithm, access
time will average approximately 5 ms which results in an
exceptionally fast "effective" transfer rate.
DATA COMMUNICATIONS
The data communication system is very modular and, thus,
externally flexible in the various devices which may be
utilized for communication with the ASC. Data communications
are controlled by a data concentrator which, in turn,
interfaces to the MCU through a channel control device.
The data concentrator is a TI-980A minicomputer
equipped with special-purpose hardware communication
interface units on its direct memory access ports.
The data communications system presently supports communication with three types of stations: high-performance
user terminals, other large computers, and remote concentrators. The system can be easily extended to support smaller
terminals down to the teletype level. These stations may be
either remote or local.
Remote links are presently implemented with nonswitched, full duplex common carrier data transmission
facilities. Data is transferred over these links synchronously
at rates determined by the modems and common carrier
bandwidths. The data communication system supports
transfer rates up to a maximum of 240,000 bits per second.
PERIPHERALS
Standard types of magnetic tape drives, card equipment,
and printers have been interfaced with the ASC. These
interfaces attach to primary or secondary memory ports
through a variety of standard selected and multiplexed data
channels. A subset of the system's peripherals can also be
interfaced via the communications register file.
SYSTEM SOFTWARE
Software design and development for the ASC system has
progressed in parallel with development of the hardware.
This was accomplished through the use of simulators, meta-assemblers,
and higher level programming languages implemented
on the systems supporting Texas Instruments'
Corporate Information Center. Thus, the first version of this
software was placed into operational status with the ASC
prototype machine. The major software capabilities are
discussed in the next few paragraphs with emphasis being
given to those attributes which provide comprehensive and
flexible programming facilities for the user.
ASC Fortran language

The most obvious interface between the ASC system and
a user is with the translation of the user-written program into
machine level instructions that efficiently utilize the special
hardware features in the system. Texas Instruments has
attempted to make this interface a smooth one by effort
invested in compiler techniques. The result of this effort is the
ASC NX Compiler, a highly optimizing, user oriented,
software package that will produce code acceptable to a
central processor with one, two, three or four pipelines
(arithmetic units).
The ASC's Fortran language is an extension of ANS
Fortran. The added language features permit the ASC
Fortran programmer to define and use subarrays, cross-sections
of arrays or subarrays, array assignment statements,
and array intrinsic functions. This is not to provide unique
access to hardware features, but to simplify the programming
required for complex problems.
The ASC Fortran compiler was designed to meet the
demands of the professional programmer. Its primary function
is to translate Fortran code into object code which will
execute the program in the shortest possible time. Because
the ASC has both scalar and vector instructions, the compiler
has the capability to recognize array-oriented operations
specified in standard Fortran and to generate the equivalent
vector instructions to perform the required operations. To
provide the programmer direct access to the specialized vector
instructions, array intrinsic and array generation intrinsic
functions are provided.
The ASC Fortran compiler produces highly optimized
object code with complete diagnostic analysis and messages.
In general, the optimizing task is accomplished by performing
optimization on the source program logic and on the object
code instructions produced. Vector instructions are used
where feasible. Scalar operations are reordered wherever
possible without affecting results, so as to minimize both
pipeline and memory reference delays. In addition, the
compiler provides a complete set of informative messages
regarding applied optimization procedures and where source
program logic prevents optimization.
The optimizing algorithms encompass such areas as
conventional optimization, instruction scheduling and vector
generation with optimization.
Mathematical library

The ASC Mathematical Subprogram Library is unique in
that it uses both scalar and vector capabilities. The scalar
function subprograms include all of the single and double
precision functions traditionally provided in Fortran libraries.
In particular, it contains all of the ANS Fortran mathematical
functions and all of the IBM S/360 Fortran mathematical
functions. The vectorized math function subprograms exploit
the vector instruction set of the ASC. A single call to a
vectorized math function subprogram causes that function to
be evaluated for the entire vector of arguments. The evaluation is effected by a sequence of vector instruction executions.
Both the scalar and the vectorized math function subprograms
can be used by the Fortran and assembly language
programmer. The Fortran compiler employs the vectorized
math subprograms to replace multiple calls to a scalar
subprogram when possible; however, this action may be
overridden by the use of a Fortran compiler specification
option.
Assembler

The ASC Assembler is a meta-assembler or translator
which facilitates symbolic coding of the ASC Central or
Peripheral Processors at the instruction level.
Linkage editor

The ASC Linkage Editor creates a load module for
execution by linking separately assembled or compiled object
modules obtained from the job input stream, user libraries or
system libraries. Linking is accomplished by relocation, by
resolving external references, and by allocating virtual
memory.
Job specification language

The Job Specification Language (JSL) is a user-oriented
language. It allows the user to specify the programs to be
executed, the data files to be made available, the dependencies, if any, between individual programs of a job, and
various cataloging and data management functions which
may be specified. The user may specify and control a job
without detailed knowledge of the Operating System.
Wherever possible, default conditions have been built into
this language so that only a minimum specification need be
given by the user.
The Job Specification Language is composed of job
definition statements, program processing statements, file
processing statements, cataloging statements, and macro
definition statements. It is an extendible (macro facility),
programmable specification language rather than a set of
control cards. The philosophy has been to provide many
explicit statements with relatively few parameters for each,
rather than a few statements with many operand fields that
provide all functions.
The language provides JSL variables which allow the
programmer to pass control information to and among CP
programs at execution time. JSL control statements can be
used to test these variables to determine the programs to be
executed next. An executing job can initiate a deferred job;
the decision to do so could be based on the value of a JSL
variable within the executing job.
Operating system

The ASC General Purpose Operating System (GPOS)
schedules and allocates system resources in response to user
service requests in a multiprogramming environment. GPOS
provides input/output service, data transfer within the
system, file management services, and other system services
in a straightforward manner. The utility and accessibility of
the Central Processor to user programs is increased by
GPOS performing all overhead functions in the Peripheral
Processor. The operating system isolates the control, scheduling,
and resource allocation algorithms for ease in "tuning"
the system to match the specific requirements of each
installation. The overall system architecture is maintained to
accommodate hardware and software system growth and
flexibility. GPOS, by its simplicity and modular design,
minimizes the system use of central memory with a small
resident system and the remainder of the system non-resident.

[Figure 5: the CP and memory with a memory expander; four disc interface units, each with an H/T disc of 25M words at 500K words/sec; two text editing CRTs; two 1500 card/min card readers; three 1200 line/min line printers; two 100 card/min punches; two operator communication CRTs; a tape controller and tape switching unit serving six dual density 9-track 800/1600 BPI and three dual density 7-track 556/800 BPI tape drives; and secondary storage channels 1 and 2.]

Figure 5-GFDL ASC configuration

The design of GPOS exploits hardware features unique to
the ASC. Most important of these features is complete access
to central memory by the PP. Thus, a single reentrant copy
of code is available to all processors; and, only a branch
instruction is needed to switch a Virtual Processor from one
function to another. The Communications Register (CR) file
is used to allow one VP to control the other seven, while
common access to the rest of this file supports communication
between the processors and other system components.
OPERATIONAL HISTORY
The prototype ASC initially completed its checkout during
the Spring of 1971. The system (Serial #1) was available for
use as a software development tool and for customer demonstrations
for the remainder of 1971. In 1972 the prototype
was moved to a permanent location at the TI facility in
Austin. During the period of downtime, a retrofit of the
hardware was carried out to incorporate the latest version of
circuits and boards and to support a production environment.
System 1 was operational early in 1973 and is currently being
devoted to software development and support of application
program conversion to the ASC.
ASC #1 is configured with a one-pipe central processor,
128K words of high-speed central memory, 128K words of
memory extension, a complement of head-per-track disc
storage, a data communications interface, plus standard tape
and paper devices.
Experience with an ASC operating in a center devoted to
seismic production work is currently being gained in the TI
facility at Amstelveen, Holland. This system (Serial #2) was
delivered early in 1973 and essentially duplicates the capabilities described for the prototype machine. Additionally,
several seismic interactive terminals are interfaced both
locally and remotely to this system.
Seismic operational requirements are characterized by
large data bases, much magnetic tape input and output, many
job steps composed of long computational sequences, and the
need to precisely control a complicated series of such jobs. In
addition to the high computational speeds available on the
ASC, the seismic center experience is showing that other
ASC features are valuable when applied to this application.
Head-per-track disc storage, management of the data bases
and scheduling by the dedicated virtual processors, and job
control available via the JSL language appear to match the
environment of seismic work. Applications programs are
written in standard Fortran, and no need has been found to
supplement the available compiler optimization by additional
hand coding. The system is well supporting the requirements
by generating significant improvements in unit processing
costs and by permitting new processing technologies to be
economically feasible. Improved productivity of geophysicists
and geologists through real-time interactive sessions is being
achieved. It is expected that the use of ASC for seismic
processing capacity will continue to grow at a rapid rate.
Operational experience has also been gained from the
application of the ASC to the U.S. Government data-processing
problem of ballistic missile defense. Serial #3, a one-pipe
ASC with a configuration similar to the previously described
systems, was delivered to the U.S. Army in the Summer of
1973. It is to be used for research into processing techniques
employed in ballistic missile defense.
Application to long-range prediction of the earth's weather
is the intended use of the largest and fastest ASC to be built
to date. The National Oceanic and Atmospheric Administration (NOAA) has contracted for an ASC (Serial #4) for its
Geophysical Fluid Dynamics Laboratory at Princeton University. Delivery is scheduled for early in 1974. The ASC is
configured with a four-pipe central processor, one million
words of high-speed central memory, head-per-track disc, text
editing terminals, two channels of high density secondary
storage devices, and standard magnetic tape and paper
devices. This configuration is illustrated in Figure 5. Much
experience has been gained using benchmark programs
derived from weather models and the actual weather prediction codes themselves. Emphasis has been upon Fortran code
generated by analysts and weather scientists instead of
hand-optimized machine language. Results obtained from the
system while undergoing final checkout at TI's facility showed
the speeds available to be several times faster than other
current computer systems.
For weather codes characterized by large data bases that
are updated frequently, sequences of heavy computational
work using the data, and mathematical operations performed
on long arrays of data, the ASC is proving to be a valuable
asset. The large central memory enables one to maintain
ample data so that the central processor is utilized to a very
high degree. The I/O and multiprogramming capabilities
managed by the operating system resident in the peripheral
processor also support high CP workloads.

TABLE I-Simple Examples of Vectors

(1)       DO 10 K=1, 50
          DO 10 J=1, 50
          DO 10 I=1, 50
       10 Z(I, J, K) = X(I, J, K) * Y(I, J, K)

(2)       Z = X * Y

(3)       VECTL (#460, B2) VMF


TABLE II-Vector Instructions Produced from Weather Code

(1)       DO 100 K=1,10
          DO 100 I=1,144
          TBXY(I, K)=(T(I+1, K, J)+T(I, K, J)) * 0.5
          TXY(I, K)=(T(I+1, K, J)-T(I, K, J)) * RDX(JC)
          PBXY(I, K)=(PS(I+1, K, J)+PS(I, K, J)) * 0.5
      100 PXY(I, K)=(PS(I+1, K, J)-PS(I, K, J)) * RDX(JC)

(2)       VECTL (#3B8, B2) VAF
          VECTL (#3C0, B2) VMF
          VECTL (#3C8, B2) VSF
          VECTL (#3D0, B2) VMF
          VECTL (#3D8, B2) VAF
          VECTL (#3E0, B2) VMF
          VECTL (#3E8, B2) VSF
          VECTL (#3F0, B2) VMF

MAXIMIZING PERFORMANCE
Experience thus far has shown that for the applications
that have been considered by ASC users the most cost-effective
performance is realizable when the capabilities of
ASC Fortran and the optimizing compiler are used. Although
particular sequences of code can be found wherein hand
coding will improve the speed of execution, for the broad
range of programs where much applications code is involved,
compiler-generated object code is the best choice. American
National Standard Institute (ANS) Fortran is completely
sufficient, and vector instructions are readily produced from
this Fortran. ASC extensions to the Fortran are sometimes
found to be useful, not to provide unique access to some hardware feature but to simplify notation involved in writing the
program so that the programmer can deal more directly with
the mathematics of the application.
The ASC system design allows easy user access to performance enhancement through the use of additional central
processor "pipes." Compiler software is responsible for both
the generation of vector instructions and the partitioning of
these vector operations over multiple pipes. Protection of the
user from vector hazard conditions is carried out by the
compiler. Partitioning of scalar instructions for multiple pipes
is carried out by the CP hardware. Extensive checks are made
by hardware to protect the user from illegal scalar conditions
that might occur. For mixtures of vector instructions and for
mixtures of scalars and vectors, the compiler prevents illegal
conditions by the use of directive instructions for the CP to
operate in either parallel mode (FORK) or sequential mode
(JOIN). Thus, the burden is on the system instead of the
user. Programs compiled for one-pipe ASC's will execute
correctly on multiple-pipe systems. Performance will be
increased via a recompilation for the multiple-pipe machine.
Some typical examples of efficient code produced from
present applications will illustrate the optimization level
provided by the system. Table I shows the type of instruction
generated by the compiler from a typical triple-nested DO
LOOP.
(1) gives the Fortran source with three levels of indexing,
(2) is an alternate notation that could be used, and
(3) is the single vector instruction produced.


National Computer Conference, 1974

TABLE III-ASC Maximum Performance Rate

                  ASC 1X (ONE AU)             ASC 4X (FOUR AU'S)
                  64-BIT        32-BIT        64-BIT        32-BIT
                  RESULTS/SEC   RESULTS/SEC   RESULTS/SEC   RESULTS/SEC
ADD               9.2 X 10^6    16 X 10^6     37 X 10^6     64 X 10^6
MULTIPLY          5.3 X 10^6    16 X 10^6     21 X 10^6     64 X 10^6
DOT PRODUCT       4.0 X 10^6    16 X 10^6     16 X 10^6     64 X 10^6

It is a floating vector multiply instruction preceded by the
loading of the vector parameter registers. Table II gives
some typical code found in weather models. A double-nested
DO LOOP with typical indexing conventions is shown in (1).
(2) gives the sequence of instructions produced by the ASC
compiler. All instructions are vectors, and the necessary
indexing information for addressing purposes is contained in
each vector parameter file. No scalar instructions are
necessary in this example.
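The mapping in Table II can be sketched in a modern notation. The
following Python model is illustrative only: plain lists stand in for
ASC vectors, the function names merely echo the VAF, VSF and VMF
mnemonics, and the hypothetical gather() helper stands in for the
indexing carried by each vector parameter file. The J index is assumed
to be held fixed, so T and PS are treated as two-dimensional slabs
indexed as t[k][i]. Multiplication by a broadcast scalar is likewise an
assumption about how the constant operand is supplied.

```python
def vaf(a, b):
    """Element-wise floating add over two whole vectors."""
    return [x + y for x, y in zip(a, b)]

def vsf(a, b):
    """Element-wise floating subtract over two whole vectors."""
    return [x - y for x, y in zip(a, b)]

def vmf(a, scalar):
    """Multiply every element of a vector by one scalar (assumed broadcast)."""
    return [x * scalar for x in a]

def gather(field, offset, ni):
    """Hypothetical stand-in for vector parameter file indexing:
    collect field[k][i + offset] for all K (outer) and I (inner)."""
    return [row[i + offset] for row in field for i in range(ni)]

def weather_vector(t, ps, rdx, ni):
    """Eight vector operations, two per Fortran statement."""
    t_hi, t_lo = gather(t, 1, ni), gather(t, 0, ni)
    p_hi, p_lo = gather(ps, 1, ni), gather(ps, 0, ni)
    tbxy = vmf(vaf(t_hi, t_lo), 0.5)
    txy = vmf(vsf(t_hi, t_lo), rdx)
    pbxy = vmf(vaf(p_hi, p_lo), 0.5)
    pxy = vmf(vsf(p_hi, p_lo), rdx)
    return tbxy, txy, pbxy, pxy

def weather_loops(t, ps, rdx, ni):
    """The same computation written as the double-nested scalar loop."""
    tbxy, txy, pbxy, pxy = [], [], [], []
    for k in range(len(t)):
        for i in range(ni):
            tbxy.append((t[k][i + 1] + t[k][i]) * 0.5)
            txy.append((t[k][i + 1] - t[k][i]) * rdx)
            pbxy.append((ps[k][i + 1] + ps[k][i]) * 0.5)
            pxy.append((ps[k][i + 1] - ps[k][i]) * rdx)
    return tbxy, txy, pbxy, pxy
```

Both functions return the same four result vectors; the vector form
simply contains no loop control of its own, which is the sense in which
no scalar instructions are needed.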
A powerful example of vector instruction capabilities is
found in the use of the hardware-implemented dot-product
operation. This operation consists of the multiplication of
appropriate elements of two arrays followed by the sum of the
products. To implement a matrix multiply operation from
Fortran, the ASC compiler uses a single dot-product instruction and the complex indexing capability of the hardware to
carry out the full matrix multiply. Three levels of addressing
changes are implied in this case, and the hardware is designed
to comprehend this level of indexing complexity.
The execution rate for the elementary operations of matrix
multiply is one result per clock cycle for a one-pipe CP, or a
rate of four results per clock cycle for a four-pipe CP. The
compiler partitions the total matrix multiply across the
appropriate number of pipes. Therefore, to complete a matrix
multiply of two N by N matrices, a four-pipe CP will require
approximately N^3/4 clock cycles, that is, N^3/4 times the
clock period in seconds. This does not include the startup
overhead necessary to fill the pipelines with operands.
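The estimate above can be worked through numerically. The sketch below
is a software model, not ASC code: dot() and matmul() show a matrix
multiply expressed as one dot product per result element, and
estimated_seconds() is a hypothetical helper that applies the cycle
count described in the text, using the 60-nanosecond clock cycle
assumed for Table III and ignoring pipeline startup.

```python
CLOCK_PERIOD_S = 60e-9  # 60 ns clock cycle, the figure assumed for Table III

def dot(a, b):
    """Multiply corresponding elements of two arrays and sum the products."""
    return sum(x * y for x, y in zip(a, b))

def matmul(a, b):
    """N-by-N matrix product built from one dot product per result element."""
    columns = list(zip(*b))
    return [[dot(row, col) for col in columns] for row in a]

def estimated_seconds(n, pipes=4, clock_period_s=CLOCK_PERIOD_S):
    """n**3 elementary operations at `pipes` results per clock cycle.
    Startup overhead to fill the pipelines is deliberately not modeled."""
    return n ** 3 / pipes * clock_period_s
```

For two 100 by 100 matrices on a four-pipe CP this gives 250,000 clock
cycles, or about 0.015 second; the same multiply on a one-pipe CP would
take about four times as long under the same assumptions.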

TABLE IV-Relative Computer Capacity* Third Generation Systems

MFR        MODEL              RELATIVE SPEED
IBM        S/360 MODEL 65     1
IBM        S/360 MODEL 75     1.5
CDC        6500               1.5
CDC        6600               2.5
IBM        S/370 MODEL 165    3.5
IBM        S/360 MODEL 91     5
HITACHI    HITAC 8800         5
IBM        S/360 MODEL 95     7
CDC        7600               8
IBM        S/360 MODEL 195    8

* Data taken from Table E, page 546, Program for the study conference
on the Modeling Aspects of GATE, Bulletin of the American Meteorological
Society, Vol. 54, No. 6, June, 1973.

It is the authors' opinion that performance indices for
array-oriented architectures are not meaningful when only
the Millions of Instructions Per Second (MIPS) factor is used.
Since a single vector instruction is equivalent to several scalar
instructions (typically Load, Operation, Increment and Test
Branch), and the number of data values used determines the
number of executions of these scalar instructions, MIPS ratings
are ambiguous at best.
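A small count makes the ambiguity concrete. The sketch below assumes,
purely for illustration, that the scalar loop body consists of four
instructions per data value (Load, Operation, Increment, and Test
Branch counted separately); the real count depends on the machine and
the compiler, so it is left as a parameter.

```python
# Assumed scalar loop body; the count of four is an illustrative choice.
SCALAR_LOOP_BODY = ("load", "operation", "increment", "test_branch")

def scalar_instructions_executed(n_values, per_value=len(SCALAR_LOOP_BODY)):
    """Scalar instruction executions needed to process n_values operands."""
    return n_values * per_value

def vector_instructions_executed(n_values):
    """One vector instruction covers the whole operand set, whatever its length."""
    return 1
```

For the 144 by 10 loop of Table II, each statement half touches 1,440
data values: 5,760 scalar instruction executions under this assumption,
against a single vector instruction for the same arithmetic. An
instruction count therefore says little about the work performed.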
Consider the performance of an ASC producing "results per
second." In this context "results per second" is the rate at
which data fetched from central memory can be operated
upon and the results stored back into central memory.
Table III shows the maximum performance rates for one- and
four-pipe ASC systems performing typical arithmetic operations. Assumptions are that the clock cycle is 60 nanoseconds
and that the pipelines are already filled with operands.
Vector dot product is a special case in the sense that the
results per second rate pertains to the elementary operations.
Another performance measure can be determined from the
present performance of ASC System #4 executing a particular
weather benchmark. Although the benchmark is not a full
weather prediction code, it does have the characteristic source
code sequences and reflects the ability of the Fortran compiler
to produce efficient code from a large applications package.
Execution speed of the benchmark on the IBM Model 91 is
approximately 246 minutes, and present ASC timing with
checkout not finalized has already demonstrated approximately 30 minutes. This ratio of 8.2 is a measure of the total
system performance upon this program. It reflects a mix of
both scalar and vector instructions as well as I/O and other
system services. The design of the ASC has been directed
toward matching the real-world mix of instructions encountered in typical applications instead of sacrificing scalar
capability to provide vector capability.
In order to compare the observed ASC performance on the
Weather Benchmark, data found in the Bulletin of the
American Meteorological Society [1] is given in Table IV. Using
the IBM S/360 Model 65 as the basis of reference, each of the
systems listed is compared as to relative speed. Using the
observed ASC/M91 ratio of 8.2, the present ASC speed would
be 41 in the table.
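The arithmetic behind this placement is short enough to spell out. In
the sketch below the timings are those quoted for the weather
benchmark, and the Model 91 relative speed is taken as 5, the Table IV
value implied by the figures in the text (41 divided by 8.2). The
function names are illustrative only.

```python
IBM_M91_MINUTES = 246    # benchmark time on the IBM Model 91
ASC_MINUTES = 30         # present ASC time, checkout not finalized
M91_RELATIVE_SPEED = 5   # Model 91 relative to the Model 65 (assumed from Table IV)

def speed_ratio(reference_minutes, measured_minutes):
    """How many times faster the measured system ran the same job."""
    return reference_minutes / measured_minutes

def relative_speed(ratio, reference_relative_speed):
    """Place a system on the Table IV scale through a known reference entry."""
    return ratio * reference_relative_speed

asc_over_m91 = speed_ratio(IBM_M91_MINUTES, ASC_MINUTES)
asc_on_table = relative_speed(asc_over_m91, M91_RELATIVE_SPEED)
```

Here asc_over_m91 comes to about 8.2 and asc_on_table to about 41, the
two figures quoted in the text.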

ACKNOWLEDGMENTS
It would not be possible to acknowledge all the contributors
to the development of the ASC; but particular recognition
should be given to Messrs. H. G. Cragon, W. D. Kastner,
E. H. Husband, D. R. Best, C. M. Stephenson, C. R. Hall,
F. A. Galindo, E. C. Garth, and N. M. Chandler who
contributed significantly to the development of the hardware.
Software concepts are due in large part to the efforts of
Messrs. L. C. Dean, G. T. Boswell, A. E. Riccomi, F. A.
Little, W. Winkelman, W. L. Cohagan, and S. D. Nolte.
Many other members of the Texas Instruments staff have
also contributed immeasurably in the development of the
ASC.

REFERENCES
1. Program for the study conference on the Modeling Aspects of GATE,
Bulletin of the American Meteorological Society, Vol. 54, No. 6,
June 1973, page 546, Table E.
