Operational Experiences With The TI Advanced Scientific Computer

by W. J. WATSON and H. M. CARR

Texas Instruments Incorporated
Austin, Texas
INTRODUCTION
Since 1966 a large computer development program has been
conducted by Texas Instruments. The goal for this effort was
to provide needed capacity for supporting seismic processing,
plus offering a general purpose capability for large scientific
problems.
This development has resulted in the Advanced Scientific
Computer (ASC), a highly modular system offering a wide
spectrum of processor power, memory sizes, and I/O capability.
The ASC is a high-speed, large-scale processing system
featuring extensive use of pipelining, multiple arithmetic
units, separate control processors, large and fast central
memory, and extensive user software aids. The central
processor has both scalar and vector instruction capabilities.
First delivered in 1972 and placed into operational status
during 1973, several operational ASC systems now offer
extremely high processing rates for particular classes of
problems.
CENTRAL MEMORY
The ASC central memory consists of a memory control
unit (MCU) and appropriately sized modules of high-speed or
medium-speed central memory. Optionally, a medium-speed
central memory extension can be used in conjunction with a
high-speed memory.
The MCU is organized as a two-way, 256-bit/channel
(8-word) parallel access traffic net between eight independent
processor ports and nine memory buses, with each processor
port having full accessibility to all memories. The nine
memory buses are organized to provide eight-way interleaving
for the first eight buses with the ninth bus used for the central
memory extension. The MCU provides the facilities for
controlling access from the eight processor ports to a CM
having a 24-bit address space (16 million words). A port
expander can be utilized to expand the number of processor
ports. Figure 2 illustrates this structure.
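As a rough illustration of eight-way interleaving, the sketch below maps a word address in the 24-bit address space to one of the eight interleaved buses. It assumes that consecutive octets rotate across consecutive buses; the text does not state the interleaving granularity, the ninth (extension) bus is not modeled, and the function name `locate_word` is hypothetical.

```python
WORDS_PER_OCTET = 8      # one 256-bit transfer holds eight 32-bit words
INTERLEAVED_BUSES = 8    # the first eight memory buses are interleaved
ADDRESS_BITS = 24        # 24-bit address space (16 million words)


def locate_word(address):
    """Return (bus, octet_within_bus, word_within_octet) for a word address.

    Assumption (not stated in the paper): consecutive octets are assigned
    to consecutive buses, so the low three bits of the octet number
    select the bus.
    """
    if not 0 <= address < 2 ** ADDRESS_BITS:
        raise ValueError("address outside the 24-bit address space")
    octet_number, word = divmod(address, WORDS_PER_OCTET)
    octet_within_bus, bus = divmod(octet_number, INTERLEAVED_BUSES)
    return bus, octet_within_bus, word


# Eight consecutive octets land on eight different buses under this assumption.
buses_touched = {locate_word(8 * n)[0] for n in range(8)}
```

Under this assumed mapping a sequential sweep through memory spreads its octet requests evenly over the eight buses, which is the property interleaving is meant to provide.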
The semiconductor high-speed central memory modules
have a cycle time of 160 ns and a read time of 140 ns.
Additionally, all transfers are 256 bits (eight 32-bit words)
with a Hamming code providing single-bit error correction
and double-bit error detection for each 32-bit word. High-speed
central memory is typically divided into eight equal-sized
modules which allow for eight-way interleaving.
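The cost of single-bit correction with double-bit detection can be estimated from the standard Hamming bound. The sketch below computes the minimum number of check bits for a 32-bit word; the paper does not state how many check bits the hardware actually stored, so the result is the textbook minimum, and `secded_check_bits` is a hypothetical helper name.

```python
def secded_check_bits(data_bits):
    """Minimum check bits for a Hamming SEC-DED code over `data_bits` data bits."""
    r = 0
    # Single-error correction requires 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One additional overall parity bit adds double-error detection.
    return r + 1


check_bits = secded_check_bits(32)                  # per 32-bit word
stored_bits_per_octet = 8 * (32 + check_bits)       # eight protected words
```

Because each word carries its own code, a single failing bit in any of the eight words of an octet can be corrected independently of the others.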
OVERVIEW OF THE SYSTEM

The major subsystems of a typical configuration are shown
in Figure 1: the central memory, the central processor, the
peripheral processor, on-line bulk storage, a digital communications interface, plus a selection of standard peripherals.
The peripheral processor has been designed for executing
the operating system. The central processor has been designed
expressly to provide high computing speeds when operating
upon large arrays of data. The central processor operates as
a slave to the peripheral processor. This design approach was
chosen to maximize the overlapping of system overhead tasks
with the execution of user programs. In operation the job
stream is analyzed by the peripheral processor. The language
processors, plus user object code, are executed by the central
processor. System control and I/O tasks are processed by the
peripheral processor. I/O is routed through high-speed,
head-per-track disc storage. A data communications interface
for the common carriers is provided for the support of remote
batch and interactive terminals. Standard types of peripherals
are also provided. The central memory serves as the common
communications and access storage medium for these
subsystems.
[Diagram: central processor (CP), peripheral processor (PP), central memory, disc storage, data communications to common carriers, peripherals]
Figure 1-Major ASC subsystems
From the collection of the Computer History Museum (www.computerhistory.org)
National Computer Conference, 1974
[Diagram: memory control unit (MCU) with primary and secondary memory access ports, interleaved high-speed or medium-speed memory modules, and an optional memory extension of interleaved medium-speed memory modules]
Figure 2-Modular structure of the ASC central memory
The optional central memory extension allows large
amounts of medium-speed memory (1 µs semiconductor
technology) to be used in the normal address space of central
memory. Block transfer between memory extension and
high-speed memory is controlled by the peripheral processor
and will transfer at a rate of 40 M words per second.
Memory mapping registers and protection registers are
used to facilitate central memory management and access
control of the ports.
CENTRAL PROCESSOR
The central processor provides both scalar (single operand)
and vector (array) instructions at the machine level. The
basic instruction size is 32 bits, with 16-, 32-, or 64-bit
operands. The single instruction stream, which contains a
mixture of scalar and vector instructions, is preprocessed by
the instruction processing unit.
The central processor design is such that one, two, three,
or four execution units or "pipes" can be provided. These
units employ the pipeline concept in both scalar and vector
modes. A single execution unit can have up to twelve scalar
instructions in process at one time. From one to four vector
results can be produced every 60 ns, depending on the
number of execution units provided.
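The peak vector rates implied by these figures follow from simple arithmetic. The sketch below is a back-of-the-envelope calculation only; it ignores pipeline start-up and memory conflicts, and the function name `peak_vector_rate` is hypothetical.

```python
CYCLE_NS = 60  # one vector result per execution unit every 60 ns


def peak_vector_rate(pipes):
    """Peak vector results per second for a CP with `pipes` execution units."""
    if pipes not in (1, 2, 3, 4):
        raise ValueError("the CP is offered with one to four pipes")
    return pipes / (CYCLE_NS * 1e-9)


for pipes in (1, 2, 3, 4):
    rate_millions = peak_vector_rate(pipes) / 1e6
    print(pipes, "pipe(s):", round(rate_millions, 1), "million results/sec")
```

A one-pipe CP therefore peaks near 16.7 million results per second and a four-pipe CP near 66.7 million, which corresponds to one result every 15 ns.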
The CP has 48 program-addressable registers. This group
of 32-bit registers consists of sixteen base address registers,
sixteen arithmetic registers, eight index registers, and eight
vector parameter registers. This last group is used to extend
the instruction format for the complete specification of vector
instructions.
The CP scalar instruction repertoire includes an extensive
set of load and store instructions: halfword, fullword, and
doubleword instructions, with immediate, magnitude, and
negative operand capabilities. Ability to load and store
register files and to load effective addresses is also available.
Arithmetic scalar instructions include various add, subtract,
multiply, and divide operations for halfword (16-bit) and
fullword (32-bit) fixed
point numbers and fullword and doubleword (64-bit) floating
point numbers. Scalar logical instructions are provided as are
arithmetic, logical, and circular shifts. Various comparison
instructions and combination comparison-logical instructions
are provided for halfwords, fullwords, and doublewords. Many
combinations of test and branching instructions with incrementing
or decrementing capability are also available.
Stacking and modifying arithmetic registers can be done with
single instructions. Subroutine linkage is accomplished
through branch and load instructions. Format conversion for
single and doublewords, as well as normalize instructions, are
available.
The vector capabilities of the CP are made available
through the use of VECTL (vector after loading vector
parameter file) and VECT (assumes parameter file is already
loaded) instructions. The vector repertoire includes such
arithmetic operations as add, subtract, multiply, divide,
vector dot product, matrix multiplication, and others for both
fixed point and floating point representations. Vector
instructions are also available for shifting; logical operations;
comparisons; format conversions; normalization; and special
operations such as Merge, Order, Search, Peak Pick, Select
and Replace, among others.
One important characteristic of the vector instruction
capability is the ability to encompass three dimensions of
addressability within a single vector instruction. This is
equivalent to a nest of three indexing loops in a conventional
machine.
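The equivalence between a nest of three indexing loops and one vector operation can be sketched as follows. The `counts` and `strides` arguments are hypothetical stand-ins for the kind of information a vector parameter file would carry; the actual register layout is not described in the text, and both function names are illustrative.

```python
def nested_loops(x, y, shape):
    """Three explicit indexing loops, as on a conventional machine."""
    ni, nj, nk = shape
    z = [0.0] * (ni * nj * nk)
    for k in range(nk):
        for j in range(nj):
            for i in range(ni):
                addr = i + ni * (j + nj * k)   # column-major, Fortran-style
                z[addr] = x[addr] * y[addr]
    return z


def vector_multiply(x, y, start, counts, strides):
    """One 'instruction': a parameter block drives all three address levels.

    Assumption: three (count, stride) pairs are enough to describe the
    address sequence; this is an illustration, not the ASC encoding.
    """
    n_inner, n_mid, n_outer = counts
    s_inner, s_mid, s_outer = strides
    z = [0.0] * len(x)
    for outer in range(n_outer):
        for mid in range(n_mid):
            base = start + outer * s_outer + mid * s_mid
            for inner in range(n_inner):
                addr = base + inner * s_inner
                z[addr] = x[addr] * y[addr]
    return z


shape = (4, 3, 2)
x = [float(n) for n in range(24)]
y = [2.0] * 24
same = nested_loops(x, y, shape) == vector_multiply(x, y, 0, shape, (1, 4, 12))
```

The point of the comparison is that the loop control moves out of the instruction stream and into a small block of parameters, so the hardware can generate the whole address sequence from a single instruction.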
The basic structure of the CP, shown in Figure 3, has three
major components: the instruction processing unit (IPU) for
non-arithmetic stages of instruction processing for the CP
instruction stream, the memory buffer unit (MBU) to provide
operand interfacing with the central memory, and an
arithmetic unit (AU) to perform the specified arithmetic or
logical operations. Figure 3 shows a CP diagram for 2- or
4-pipeline CP's, each with a corresponding number of
MBU-AU pairs. Note that a memory port is required for the
IPU and, in addition, one memory port for each pipeline
(MBU-AU pair) in a CP.
A significant feature of the CP hardware is an operand
look-ahead capability which causes memory references to be
requested prior to the time of actual need. Double buffering
in multiple 8-word (octet) buffers for each pipeline provides
a smooth data flow to and from each arithmetic unit. The
pipelined AU achieves its highest sustained flow rate in the
vector mode, typically a result each 60 ns per AU, or an
average of 15 ns per result for a 4-pipe central processor.
[Diagram: two-pipeline CP and four-pipeline CP, each showing primary memory ports, MBUs, and AUs]
Figure 3-Basic structure of the CP
Instruction processing unit
The primary function of the instruction processing unit
(IPU) is to supply a continuous stream of instructions for
execution by the other parts of the CP. One Central Memory
port is required to provide the instruction stream. Two 8-word
(octet) buffers are utilized to achieve a balanced stream of
instructions from memory to the IPU. Instructions are
transferred from memory in octets as are all other references
to memory for fetching or storing of information.
Up to 36 instructions in various stages of execution can be
overlapped within the 4-pipe CP. There are twenty positions
for instructions in the 2-pipe CP and twelve positions for
instructions in the 1-pipe CP. Four levels are contained
within the IPU, and eight levels are contained in each
arithmetic pipeline (MBU-AU pair). The IPU performs
routing of instructions to the MBU-AU pairs based on an
optimum use of arithmetic unit capability.
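The position counts quoted above are consistent with four IPU levels plus eight levels per pipeline. A minimal check, using the hypothetical helper name `instruction_positions`:

```python
IPU_LEVELS = 4          # instruction levels inside the IPU
LEVELS_PER_PIPE = 8     # levels in each MBU-AU pipeline


def instruction_positions(pipes):
    """Instructions that can be in flight at once in a CP with `pipes` pipelines."""
    return IPU_LEVELS + LEVELS_PER_PIPE * pipes


positions = {pipes: instruction_positions(pipes) for pipes in (1, 2, 4)}
```

This reproduces the twelve, twenty, and thirty-six positions given for the 1-pipe, 2-pipe, and 4-pipe configurations.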
Vector processing is altered by software in order to
distribute segments of the vector for multiple pipe systems.
Several features are provided to alleviate the potential
problems of branches and instruction dependencies in the
instruction pipeline.
Memory buffer unit
The memory buffer unit (MBU) provides an interface
between central memory and the arithmetic unit. Its primary
function is to supply the arithmetic unit with a continuous
stream of operands from memory and to provide for the
storing of the results back to memory. All references to
memory, whether for fetching or storing, are made in 8-word
increments (octets).
The MBU has three double buffers, one octet per buffer,
called the "X" and "Y" buffers for input and the "Z" buffers
for output. This double buffering is provided so that pipeline
processing can be sustained at a high rate with minimal
memory access conflicts.
[Diagram: paths taken through the AU sections by a floating add and by a fixed multiply, from the receiver register to the output]
Figure 4-Arithmetic unit pipeline
Arithmetic unit
The primary function of a CP arithmetic unit (AU) is to
perform the arithmetic operations specified by the operation
code of the instruction currently at the AU level. There is one
AU per pipeline in the CP, each having a 60 ns basic cycle
time. A distinguishing feature of an AU is the pipeline
structure which allows efficient execution of the arithmetic
part of all instructions. There are eight exclusive partitions of
the AU pipeline involved, each of which can provide an output
every 60 ns. These eight sections are (1) receiver register,
(2) exponent subtract, (3) align, (4) add, (5) normalize, (6)
multiply, (7) accumulate, and (8) output. Figure 4 shows how
different sections of the AU are utilized for execution of
particular instructions; i.e., floating point addition and fixed
point multiplication.
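The benefit of this sectioned structure can be seen from a simple timing model: once the pipeline is full, one result emerges per 60 ns cycle regardless of how many sections an instruction passes through. The stage paths below are an assumption inferred from the section list and the Figure 4 labels, not a statement of the actual hardware routing, and `pipeline_time_ns` is a hypothetical name.

```python
CYCLE_NS = 60  # basic AU cycle time

# Assumed stage paths (inferred, not confirmed by the text).
FLOATING_ADD = ["receiver register", "exponent subtract", "align",
                "add", "normalize", "output"]
FIXED_MULTIPLY = ["receiver register", "multiply", "accumulate", "output"]


def pipeline_time_ns(stages, n_results):
    """Time for n_results to leave a pipeline that accepts one operand pair per cycle."""
    if n_results < 1:
        return 0
    fill_cycles = len(stages)            # cycles before the first result emerges
    return (fill_cycles + n_results - 1) * CYCLE_NS


first_add_ns = pipeline_time_ns(FLOATING_ADD, 1)
thousand_adds_ns = pipeline_time_ns(FLOATING_ADD, 1000)
```

In this model the fill time is paid once per vector, so long vectors approach the 60 ns per result rate while short ones are dominated by start-up.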
An AU is a 64-bit parallel operating unit for most scalar
and vector instructions. Exceptions are double length
multiply and all types of division. In these circumstances
various combinations of the components of the AU are
utilized; and, therefore, more than one clock cycle is required
to complete these arithmetic operations.
THE PERIPHERAL PROCESSOR
The peripheral processor (PP) is a powerful multiprocessor
designed to perform the control and data management
functions of the ASC. Several aspects of the implementation
of the peripheral processor concept greatly increase the
effectiveness of the ASC system.
The PP is a collection of eight individual processors called
virtual processors (VP's). Each VP has its own program
counter along with arithmetic, index, base, and instruction
registers. The eight VP's share a read only memory, an
arithmetic unit, an instruction processing unit, and a central
memory buffer. Use of the common units is distributed among
the VP's using sixteen single 85 ns cycles. When an equally
distributed sequence of time units is used, each of the eight
VP's receives two 85 ns cycles every 1.4 µs. The typical PP
instruction requires two 85 ns cycles for completion. The
distribution of available time units can be dynamically varied
to suit particular processing requirements.
The 4K 32-bit words of read only memory within the PP
are utilized for program storage and execution of those short
routines which are highly utilized by the VP's, such as
polling loops.
Because the PP is intended to perform control functions
rather than execute mathematical algorithms, the instruction
set is oriented toward control operations and does not require
multiplication, division, or floating point operations. The
instruction format is similar to that of the central processor,
using a 32-bit word for each instruction. Instructions are
provided for bit (1 bit), byte (8 bits), halfword (16 bits), and
fullword (32 bits) operations.
Each VP has direct access to the entire central memory for
program execution and data storage. Therefore, a single copy
of reentrant code can be executed simultaneously by more
than one VP.
The communications register (CR) file contains sixty-four
32-bit word registers which are program addressable by the
VP's. The CR file serves as the principal storage medium for
control information necessary for the coordination of all parts
of the ASC system.
DISC STORAGE
Disc storage is the principal secondary storage system for
the ASC system. Disc storage consists of head-per-track
(H/T) disc systems supplemented by positioning-arm disc
(PAD) systems.
The H/T disc system is a high-performance device whose
effective performance is further enhanced because the operating
system utilizes a shortest-access-time-first (SATF)
algorithm for data transfers. This combination of hardware
and software provides a very high effective transfer rate.
Each H/T disc module has a capacity of 25 million 32-bit
words with a transfer rate of approximately 500K words per
second. Using the shortest-access-time-first algorithm, access
time will average approximately 5 ms, which results in an
exceptionally fast "effective" transfer rate.
DATA COMMUNICATIONS
The data communication system is very modular and, thus,
externally flexible in the various devices which may be
utilized for communication with the ASC. Data communications
are controlled by a data concentrator which, in turn,
interfaces to the MCU through a channel control device.
The data concentrator is a TI-980A minicomputer
equipped with special-purpose hardware communication
interface units on its direct memory access ports.
The data communications system presently supports communication with three types of stations: high-performance
user terminals, other large computers, and remote concentrators. The system can be easily extended to support smaller
terminals down to the teletype level. These stations may be
either remote or local.
Remote links are presently implemented with nonswitched, full duplex common carrier data transmission
facilities. Data is transferred over these links synchronously
at rates determined by the modems and common carrier
bandwidths. The data communication system supports
transfer rates up to a maximum of 240,000 bits per second.
PERIPHERALS
Standard types of magnetic tape drives, card equipment,
and printers have been interfaced with the ASC. These
interfaces attach to primary or secondary memory ports
through a variety of standard selected and multiplexed data
channels. A subset of the system's peripherals can also be
interfaced via the communications register file.
SYSTEM SOFTWARE
Software design and development for the ASC system has
progressed in parallel with development of the hardware.
This was accomplished through the use of simulators, meta-assemblers,
and higher level programming languages implemented
on the systems supporting Texas Instruments'
Corporate Information Center. Thus, the first version of this
software was placed into operational status with the ASC
prototype machine. The major software capabilities are
discussed in the next few paragraphs with emphasis being
given to those attributes which provide comprehensive and
flexible programming facilities for the user.
ASC Fortran language
The most obvious interface between the ASC system and
a user is with the translation of the user-written program into
machine level instructions that efficiently utilize the special
hardware features in the system. Texas Instruments has
attempted to make this interface a smooth one by effort
invested in compiler techniques. The result of this effort is the
ASC NX Compiler, a highly optimizing, user oriented,
software package that will produce code acceptable to a
central processor with one, two, three or four pipelines
(arithmetic units).
The ASC's Fortran language is an extension of ANS
Fortran. The added language features permit the ASC
Fortran programmer to define and use subarrays, cross-sections
of arrays or subarrays, array assignment statements,
and array intrinsic functions. This is not to provide unique
access to hardware features, but to simplify the programming
required for complex problems.
The ASC Fortran compiler was designed to meet the
demands of the professional programmer. Its primary function
is to translate Fortran code into object code which will
execute the program in the shortest possible time. Because
the ASC has both scalar and vector instructions, the compiler
has the capability to recognize array-oriented operations
specified in standard Fortran and to generate the equivalent
vector instructions to perform the required operations. To
provide the programmer direct access to the specialized vector
instructions, array intrinsic and array generation intrinsic
functions are provided.
The ASC Fortran compiler produces highly optimized
object code with complete diagnostic analysis and messages.
In general, the optimizing task is accomplished by performing
optimization on the source program logic and on the object
code instructions produced. Vector instructions are used
where feasible. Scalar operations are reordered wherever
possible without affecting results, so as to minimize both
pipeline and memory reference delays. In addition, the
compiler provides a complete set of informative messages
regarding applied optimization procedures and where source
program logic prevents optimization.
The optimizing algorithms encompass such areas as
conventional optimization, instruction scheduling and vector
generation with optimization.
Mathematical library
The ASC Mathematical Subprogram Library is unique in
that it uses both scalar and vector capabilities. The scalar
function subprograms include all of the single and double
precision functions traditionally provided in Fortran libraries.
In particular, it contains all of the ANS Fortran mathematical
functions and all of the IBM S/360 Fortran mathematical
functions. The vectorized math function subprograms exploit
the vector instruction set of the ASC. A single call to a
vectorized math function subprogram causes that function to
be evaluated for the entire vector of arguments. The evaluation is effected by a sequence of vector instruction executions.
Both the scalar and the vectorized math function subprograms can be used by the Fortran and assembly language
programmer. The Fortran compiler employs the vectorized
math subprograms to replace multiple calls to a scalar
subprogram when possible; however, this action may be
overridden by the use of a Fortran compiler specification
option.
Assembler
The ASC Assembler is a meta-assembler or translator
which facilitates symbolic coding of the ASC Central or
Peripheral Processors at the instruction level.
Linkage editor
The ASC Linkage Editor creates a load module for
execution by linking separately assembled or compiled object
modules obtained from the job input stream, user libraries or
system libraries. Linking is accomplished by relocation, by
resolving external references, and by allocating virtual
memory.
Job specification language
The Job Specification Language (JSL) is a user-oriented
language. It allows the user to specify the programs to be
executed, the data files to be made available, the dependencies, if any, between individual programs of a job, and
various cataloging and data management functions which
may be specified. The user may specify and control a job
without detailed knowledge of the Operating System.
Wherever possible, default conditions have been built into
this language so that only a minimum specification need be
given by the user.
The Job Specification Language is composed of job
definition statements, program processing statements, file
processing statements, cataloging statements, and macro
definition statements. It is an extendible (macro facility),
programmable specification language rather than a set of
control cards. The philosophy has been to provide many
explicit statements with relatively few parameters for each,
rather than a few statements with many operand fields that
provide all functions.
The language provides JSL variables which allow the
programmer to pass control information to and among CP
programs at execution time. JSL control statements can be
used to test these variables to determine the programs to be
executed next. An executing job can initiate a deferred job;
the decision to do so could be based on the value of a JSL
variable within the executing job.
Operating system
The ASC General Purpose Operating System (GPOS)
schedules and allocates system resources in response to user
service requests in a multiprogramming environment. GPOS
provides input/output service, data transfer within the
system, file management services, and other system services
system; file management services, and other system services
in a straightforward manner. The utility and accessibility of
the Central Processor to user programs is increased by
GPOS performing all overhead functions in the Peripheral
Processor. The operating system isolates the control, scheduling,
and resource allocation algorithms for ease in "tuning"
the system to match the specific requirements of each
installation. The overall system architecture is maintained to
accommodate hardware and software system growth and
flexibility. GPOS, by its simplicity and modular design,
minimizes the system use of central memory with a small
resident system and the remainder of the system non-resident.
[Diagram: CP, memory, and expander; four H/T discs (25M words, 500K words/sec each) with disc interface units; two text editing CRTs; two 1500 card/min card readers; three 1200 line/min line printers; two 100 card/min punches; two operator communication CRTs; tape switching unit and tape controller with six dual density 9-track 800/1600 BPI tape drives and three dual density 7-track 556/800 BPI tape drives; secondary storage on channel numbers 1 and 2]
Figure 5-GFDL ASC configuration
The design of GPOS exploits hardware features unique to
the ASC. Most important of these features is complete access
to central memory by the PP. Thus, a single reentrant copy
of code is available to all processors; and, only a branch
instruction is needed to switch a Virtual Processor from one
function to another. The Communications Register (CR) file
is used to allow one VP to control the other seven, while
common access to the rest of this file supports communication
between the processors and other system components.
OPERATIONAL HISTORY
The prototype ASC initially completed its checkout during
the Spring of 1971. The system (Serial #1) was available for
use as a software development tool and for customer demonstrations for the remainder of 1971. In 1972 the prototype
was moved to a permanent location at the TI facility in
Austin. During the period of downtime, a retrofit of the
hardware was carried out to incorporate the latest version of
circuits and boards and to support a production environment.
System 1 was operational early in 1973 and is currently being
devoted to software development and support of application
program conversion to the ASC.
ASC #1 is configured with a one-pipe central processor,
128K words of high-speed central memory, 128K words of
memory extension, a complement of head-per-track disc
storage, a data communications interface, plus standard tape
and paper devices.
Experience with an ASC operating in a center devoted to
seismic production work is currently being gained in the TI
facility at Amstelveen, Holland. This system (Serial #2) was
delivered early in 1973 and essentially duplicates the capabilities described for the prototype machine. Additionally,
several seismic interactive terminals are interfaced both
locally and remotely to this system.
Seismic operational requirements are characterized by
large data bases, much magnetic tape input and output, many
job steps composed of long computational sequences, and the
need to precisely control a complicated series of such jobs. In
addition to the high computational speeds available on the
ASC, the seismic center experience is showing that other
ASC features are valuable when applied to this application.
Head-per-track disc storage, management of the data bases
and scheduling by the dedicated virtual processors, and job
control available via the JSL language appear to match the
environment of seismic work. Applications programs are
written in standard Fortran, and no need has been found to
supplement the available compiler optimization by additional
hand coding. The system is supporting the requirements well
by generating significant improvements in unit processing
costs and by permitting new processing technologies to be
economically feasible. Improved productivity of geophysicists
and geologists through real-time interactive sessions is being
achieved. It is expected that the use of ASC for seismic
processing capacity will continue to grow at a rapid rate.
Operational experience has also been gained from the
application of the ASC to the U.S. Government data-processing
problem of ballistic missile defense. Serial #3, a one-pipe
ASC with a configuration similar to the previously described
systems, was delivered to the U.S. Army in the Summer of
1973. It is to be used for research into processing techniques
employed in ballistic missile defense.
Application to long-range prediction of the earth's weather
is the intended use of the largest and fastest ASC to be built
to date. The National Oceanic and Atmospheric Administration (NOAA) has contracted for an ASC (Serial #4) for its
Geophysical Fluid Dynamics Laboratory at Princeton University. Delivery is scheduled for early in 1974. The ASC is
configured with a four-pipe central processor, one million
words of high-speed central memory, head-per-track disc, text
editing terminals, two channels of high density secondary
storage devices, and standard magnetic tape and paper
devices. This configuration is illustrated in Figure 5. Much
experience has been gained using benchmark programs
derived from weather models and the actual weather prediction codes themselves. Emphasis has been upon Fortran code
generated by analysts and weather scientists instead of
hand-optimized machine language. Results obtained from the
system while undergoing final checkout at TI's facility showed
the speeds available to be several times faster than other
current computer systems.
For weather codes characterized by large data bases that
are updated frequently, sequences of heavy computational
work using the data, and mathematical operations performed
on long arrays of data, the ASC is proving to be a valuable
asset. The large central memory enables one to maintain
ample data so that the central processor is utilized to a very
high degree. The I/O and multiprogramming capabilities
managed by the operating system resident in the peripheral
processor also support high CP workloads.
TABLE I-Simple Examples of Vectors
(1)       DO 10 K=1,50
          DO 10 J=1,50
          DO 10 I=1,50
       10 Z(I,J,K) = X(I,J,K) * Y(I,J,K)
(2)       Z = X*Y
(3)       VECTL (#460, B2) VMF
TABLE II-Vector Instructions Produced from Weather Code
(1)       DO 100 K=1,10
          DO 100 I=1,144
          TBXY(I,K) = (T(I+1,K,J)+T(I,K,J)) * 0.5
          TXY(I,K) = (T(I+1,K,J)-T(I,K,J)) * RDX(JC)
          PBXY(I,K) = (PS(I+1,K,J)+PS(I,K,J)) * 0.5
      100 PXY(I,K) = (PS(I+1,K,J)-PS(I,K,J)) * RDX(JC)
(2)       VECTL (#3B8, B2) VAF
          VECTL (#3C0, B2) VMF
          VECTL (#3C8, B2) VSF
          VECTL (#3D0, B2) VMF
          VECTL (#3D8, B2) VAF
          VECTL (#3E0, B2) VMF
          VECTL (#3E8, B2) VSF
          VECTL (#3F0, B2) VMF
MAXIMIZING PERFORMANCE
Experience thus far has shown that for the applications
that have been considered by ASC users the most cost-effective
performance is realizable when the capabilities of
ASC Fortran and the optimizing compiler are used. Although
particular sequences of code can be found wherein hand
coding will improve the speed of execution, for the broad
range of programs where much applications code is involved,
compiler-generated object code is the best choice. American
National Standards Institute (ANS) Fortran is completely
sufficient, and vector instructions are readily produced from
this Fortran. ASC extensions to the Fortran are sometimes
found to be useful, not to provide unique access to some hardware feature but to simplify notation involved in writing the
program so that the programmer can deal more directly with
the mathematics of the application.
The ASC system design allows easy user access to performance enhancement through the use of additional central
processor "pipes." Compiler software is responsible for both
the generation of vector instructions and the partitioning of
these vector operations over multiple pipes. Protection of the
user from vector hazard conditions is carried out by the
compiler. Partitioning of scalar instructions for multiple pipes
is carried out by the CP hardware. Extensive checks are made
by hardware to protect the user from illegal scalar conditions
that might occur. For mixtures of vector instructions and for
mixtures of scalars and vectors, the compiler prevents illegal
conditions by the use of directive instructions for the CP to
operate in either parallel mode (FORK) or sequential mode
(JOIN). Thus, the burden is on the system instead of the
user. Programs compiled for one-pipe ASC's will execute
correctly on multiple-pipe systems. Performance will be
increased via a recompilation for the multiple-pipe machine.
Some typical examples of efficient code produced from
present applications will illustrate the optimization level
provided by the system. Table I shows the type of instruction
generated by the compiler from a typical triple-nested DO
LOOP.
(1) gives the Fortran source with three levels of indexing,
(2) is an alternate notation that could be used, and
(3) is the single vector instruction produced.
TABLE III-ASC Maximum Performance Rate (results/sec)

                 ASC 1X (ONE AU)           ASC 4X (FOUR AU'S)
OPERATION        32-BIT       64-BIT       32-BIT       64-BIT
ADD              16 x 10^6    9.2 x 10^6   64 x 10^6    37 x 10^6
MULTIPLY         16 x 10^6    5.3 x 10^6   64 x 10^6    21 x 10^6
DOT PRODUCT      16 x 10^6    4.0 x 10^6   64 x 10^6    16 x 10^6

It is a floating vector multiply instruction preceded by the
loading of the vector parameter registers. Table II gives
some typical code found in weather models. A double-nested
DO LOOP with typical indexing conventions is shown in (1).
(2) gives the sequence of instructions produced by the ASC
compiler. All of the instructions are vector instructions, and the
necessary indexing information for addressing purposes is contained in
each vector parameter file. No scalar instructions are necessary in this example.
A powerful example of vector instruction capabilities is
found in the use of the hardware-implemented dot-product
operation. This operation consists of the multiplication of
appropriate elements of two arrays followed by the sum of the
products. To implement a matrix multiply operation from
Fortran, the ASC compiler uses a single dot-product instruction and the complex indexing capability of the hardware to
carry out the full matrix multiply. Three levels of addressing
changes are implied in this case, and the hardware is designed
to comprehend this level of indexing complexity.
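The relationship between the dot product and the matrix multiply can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of the ASC instruction or of the code the compiler emits; the function names `dot_product` and `matrix_multiply` are hypothetical, and the three nested index levels (i, j, k) are our reading of the "three levels of addressing changes" mentioned above.

```python
def dot_product(a, b):
    """Multiply corresponding elements of two arrays and sum the products."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def matrix_multiply(a, b):
    """Form C = A * B, taking one dot product per element of C."""
    n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
    c = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):          # first index level: row of A
        for j in range(n_cols):      # second index level: column of B
            # third index level: k runs down column j of B
            column = [b[k][j] for k in range(n_inner)]
            c[i][j] = dot_product(a[i], column)
    return c
```

In software the three index levels appear as explicit loops; the text indicates that on the ASC the equivalent address stepping is handled by the hardware under a single dot-product instruction.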
The execution rate for the elementary operations of matrix
multiply is one result per clock cycle for a one-pipe CP, or
four results per clock cycle for a four-pipe CP. The
compiler partitions the total matrix multiply across the
appropriate number of pipes. Therefore, to complete a matrix
multiply of two N by N matrices, a four-pipe CP will require
approximately $N^3/4$ clock cycles, that is, $N^3/4$ times the
clock period in seconds. This does not include the startup
overhead necessary to fill the pipelines with operands.
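As a rough worked example of this estimate, the sketch below converts the cycle count into seconds. It assumes the 60-nanosecond clock cycle that the paper states for Table III, ignores pipeline startup as the text does, and uses a hypothetical function name, `matrix_multiply_seconds`.

```python
CLOCK_PERIOD_SECONDS = 60e-9  # 60 ns clock cycle, the figure assumed for Table III


def matrix_multiply_seconds(n, pipes=4, clock_period=CLOCK_PERIOD_SECONDS):
    """Estimate the time for an N-by-N matrix multiply, excluding startup.

    An N-by-N multiply needs N**3 elementary multiply-add operations;
    each pipe is assumed to complete one of them per clock cycle.
    """
    if n < 1 or pipes < 1:
        raise ValueError("n and pipes must be positive")
    elementary_operations = n ** 3
    cycles = elementary_operations / pipes
    return cycles * clock_period
```

Under these assumptions a 100 by 100 multiply on a four-pipe CP takes 250,000 cycles, or about 0.015 second; a one-pipe CP would need four times as long.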
TABLE IV-Relative Computer Capacity* Third Generation Systems

MFR        MODEL               RELATIVE SPEED
IBM        S/360 MODEL 65      1 (basis of reference)
IBM        S/360 MODEL 75      1.5
CDC        6500                1.5
CDC        6600                2.5
IBM        S/370 MODEL 165     3.5
IBM        S/360 MODEL 91      5
HITACHI    HITAC 8800          5
IBM        S/360 MODEL 95      7
CDC        7600                8
IBM        S/360 MODEL 195     8

* Data taken from Table E, page 546, Program for the study conference
on the Modeling Aspects of GATE, Bulletin of the American Meteorological
Society, Vol. 54, No. 6, June 1973.

It is the authors' opinion that performance indices for
array-oriented architectures are not meaningful when only
the Millions of Instructions Per Second (MIPS) factor is used.
Since a single vector instruction is equivalent to several scalar
instructions (typically Load, Operation, Increment and Test
Branch), and the number of data values used determines the
number of executions of these scalar instructions, MIPS ratings
are ambiguous at best.
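A small counting sketch makes the ambiguity concrete. It assumes, purely for illustration, that the scalar loop spends exactly the four instructions named above on each data value; the real per-element count depends on the machine and the loop. The names `SCALAR_STEPS_PER_ELEMENT` and `instruction_counts` are hypothetical.

```python
# Scalar instructions assumed per data value, following the list in the text.
SCALAR_STEPS_PER_ELEMENT = ("load", "operation", "increment", "test_branch")


def instruction_counts(n_elements):
    """Count instructions issued for one array operation over n_elements values.

    The vector form issues a single instruction regardless of length;
    the scalar form repeats its loop body once per data value.
    """
    if n_elements < 0:
        raise ValueError("n_elements must be non-negative")
    return {
        "vector": 1,
        "scalar": len(SCALAR_STEPS_PER_ELEMENT) * n_elements,
    }
```

For a 1,000-element array the same arithmetic is counted as one instruction in vector form and as 4,000 instructions in scalar form, so an instruction rate says little about the useful work completed.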
Consider the performance of an ASC producing "results per
second." In this context "results per second" is the rate at
which data fetched from central memory can be operated
upon and the results stored back into central memory.
Table III shows the maximum performance rates for one- and
four-pipe ASC systems performing typical arithmetic operations. Assumptions are that the clock cycle is 60 nanoseconds
and that the pipelines are already filled with operands.
Vector dot product is a special case in the sense that the
results per second rate pertains to the elementary operations.
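The upper bound implied by these assumptions can be sketched as follows. The sketch assumes one result per clock cycle per pipe, the rate the paper quotes for the elementary operations of matrix multiply; operations that need more than one cycle per result would fall below this bound, and Table III remains the authoritative source for the actual figures. The function name `peak_results_per_second` is hypothetical.

```python
CLOCK_PERIOD_SECONDS = 60e-9  # 60-nanosecond clock cycle, as stated in the text


def peak_results_per_second(pipes, results_per_cycle_per_pipe=1.0,
                            clock_period=CLOCK_PERIOD_SECONDS):
    """Upper bound on results per second with the pipelines already full."""
    if pipes < 1:
        raise ValueError("pipes must be at least 1")
    return pipes * results_per_cycle_per_pipe / clock_period
```

With a 60-nanosecond clock this bound is roughly 16.7 million results per second for one pipe and four times that for four pipes.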
Another performance measure can be determined from the
present performance of ASC System #4 executing a particular
weather benchmark. Although the benchmark is not a full
weather prediction code, it does have the characteristic source
code sequences and reflects the ability of the Fortran compiler
to produce efficient code from a large applications package.
Execution time of the benchmark on the IBM Model 91 is
approximately 246 minutes, and present ASC timing with
checkout not finalized has already demonstrated approximately 30 minutes. This ratio of 8.2 is a measure of the total
system performance upon this program. It reflects a mix of
both scalar and vector instructions as well as I/O and other
system services. The design of the ASC has been directed
toward matching the real world mix of instructions encountered in typical applications instead of sacrificing scalar
capability to provide vector capability.
To place the observed ASC performance on the weather
benchmark in context, data found in the Bulletin of the
American Meteorological Society (Reference 1) are given in Table IV. Using
the IBM S/360 Model 65 as the basis of reference, each of the
systems listed is assigned a relative speed. Using the
observed ASC/M91 ratio of 8.2 and the Model 91 entry of 5, the
present ASC speed would be 41 in the table.
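The arithmetic behind these two figures is sketched below. The benchmark times are the approximate values quoted above; the Model 91 relative speed of 5 is the value implied by the quoted result (41 divided by 8.2). The names `speed_ratio` and `asc_relative_speed` are hypothetical.

```python
MODEL_91_MINUTES = 246.0        # approximate benchmark time on the IBM Model 91
ASC_MINUTES = 30.0              # approximate ASC time, checkout not finalized
MODEL_91_RELATIVE_SPEED = 5.0   # relative to the S/360 Model 65 (implied by 41 / 8.2)


def speed_ratio(reference_minutes, measured_minutes):
    """How many times faster the measured system ran the same benchmark."""
    if reference_minutes <= 0 or measured_minutes <= 0:
        raise ValueError("times must be positive")
    return reference_minutes / measured_minutes


def asc_relative_speed():
    """Place the ASC on the Model 65 scale via the Model 91 entry."""
    return speed_ratio(MODEL_91_MINUTES, ASC_MINUTES) * MODEL_91_RELATIVE_SPEED
```

The ratio comes to about 8.2, and scaling by the Model 91 entry gives about 41 on the Model 65 scale.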
ACKNOWLEDGMENTS
It would not be possible to acknowledge all the contributors
to the development of the ASC; but particular recognition
should be given to Messrs. H. G. Cragon, W. D. Kastner,
E. H. Husband, D. R. Best, C. M. Stephenson, C. R. Hall,
F. A. Galindo, E. C. Garth, and N. M. Chandler who
contributed significantly to the development of the hardware.
Software concepts are due in large part to the efforts of
Messrs. L. C. Dean, G. T. Boswell, A. E. Riccomi, F. A.
Little, W. Winkelman, W. L. Cohagan, and S. D. Nolte.
Many other members of the Texas Instruments staff have
also contributed immeasurably to the development of the
ASC.
REFERENCES
1. Program for the study conference on the Modeling Aspects of GATE,
Bulletin of the American Meteorological Society, Vol. 54, No. 6,
June 1973, page 546, Table E.

From the collection of the Computer History Museum (www.computerhistory.org)

National Computer Conference, 1974
