
Course Notes for

Structure of
Programming
Languages
Part A

Goals of Course
To survey the various programming languages, their purposes and their histories
Why do we have so many languages?
How did these languages develop?
Are some languages better than others for some things?
To examine methods for describing language syntax and semantics

Goals of Course
Syntax indicates structure of program code
How can language designer specify this?
How can programmer learn language?
How can compiler recognize this?
Lexical analysis
Parsing (syntax analysis)
Brief discussion of parsing techniques
Semantics indicate meaning of the code
What will the code actually do?
Can we effectively do this in a formal way?
Static semantics
Dynamic semantics

Goals of Course
To examine some language features and constructs, and how they are used and implemented in various languages
Variables and constants
Types, binding and type checking
Scope and lifetime
Data Types
Primitive types
Array types
Structured data types

Goals of Course
Pointer (reference) types
Assignment statements and expressions
Operators, precedence and associativity
Type coercions and conversions
Boolean expressions and short-circuit evaluation
Control statements
Selection
Iteration
Unconditional branching (goto)

Goals of Course
Process abstraction: procedures and functions
Parameters and parameter-passing
Generic subprograms
Nonlocal environments and side-effects
Implementing subprograms
Subprograms in static-scoped languages
Subprograms in dynamic-scoped languages

Goals of Course
Data abstraction and abstract data types
Object-oriented programming
Design issues
Implementations in various object-oriented languages
Concurrency
Concurrency issues
Subprogram level concurrency
Implementations in various languages
Statement level concurrency

Goals of Course
Exception handling
Issues and implementations

IF TIME PERMITS
Functional programming languages
Logic programming languages

Language Development Issues
Why do we have high-level programming languages?
Machine code is too difficult for us to read, understand and debug
Machine code is not portable between architectures
Why do we have many high-level programming languages?
Different people and companies developed them

Language Development Issues
Different languages are either designed to or happen to meet different programming needs
Scientific applications: FORTRAN
Business applications: COBOL
AI: LISP, Scheme (& Prolog)
Systems programming: C
Web programming: Perl, PHP, JavaScript
General purpose: C++, Ada, Java

Language Development Issues
Programming language qualities and evaluation criteria
Readability
How much can a non-author understand the logic of code just by reading it?
Is code clear and unambiguous to the reader?
These are often subjective, but sometimes it is fairly obvious
Examples of features that help readability:
Comments
Long identifier names
Named constants

Language Development Issues
Clearly understood control statements
Language orthogonality
Simple features combine in a consistent way
But it can go too far, as explained in the text about ALGOL 68
Writability
Not dissimilar to readability
How easy is it for a programmer to use the language effectively?
Can depend on the domain in which it is being used
Ex: LISP is very writable for AI applications but would not be so good for systems programming
Also somewhat subjective

Language Development Issues
Examples of features that help writability:
Clearly understood control statements
Subprograms
Also orthogonality
Reliability
Two different ideas of reliability
1. Programs are less susceptible to logic errors
Ex: Assignment vs. comparison in C++
See assign.cpp and assign.java (a minimal stand-in sketch follows)
2. Programs have the ability to recover from exceptional situations
Exception handling (we will discuss more later)
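
The referenced assign.cpp is not reproduced in these notes; the following is a minimal stand-in sketch of the assignment-vs-comparison pitfall:

#include <iostream>

int main() {
    int x = 0;
    // Intended: if (x == 5). In C++ the assignment compiles,
    // stores 5 into x, and the condition evaluates to true.
    if (x = 5) {
        std::cout << "taken: x is now " << x << "\n";  // prints: taken: x is now 5
    }
    // In Java the same condition is a compile-time error,
    // since an int is not implicitly converted to boolean.
}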

Language Design Issues
Many factors influence language design
Architecture
Most languages were designed for single-processor von Neumann type computers
CPU to execute instructions
Data and instructions stored in main memory
General language approach
Imperative languages
Fit well with von Neumann computers
Focus is on variables, assignment, selection and iteration
Examples: FORTRAN, Pascal, C, Ada, C++, Java

Language Design Issues
Imperative language evolution
Simple straight-line code
Top-down design and process abstraction
Data abstraction and ADTs
Object-oriented programming
Some consider object-oriented languages not to be imperative, but most modern OO languages have imperative roots (ex. C++, Java)
Functional languages
Focus is on function and procedure calls
Mimics mathematical functions
Less emphasis on variables and assignment
In strictest form has no iteration at all; recursion is used instead
Examples: LISP, Scheme

Language Design Issues
Logic programming languages
Symbolic logic used to express propositions, rules and inferences
Programs are in a sense theorems
User enters a proposition and system uses programmer's rules and propositions in an attempt to prove it
Typical outputs:
Yes: Proposition can be established by program
No: Proposition cannot be established by program
Example: Prolog (see example program)
Cost
What are the overall costs associated with a given language?
How does the design affect that cost?

Language Design Issues
Training programmers
How easy is it to learn?
Writing programs
Is language a good fit for the task?
Compiling programs
How long does it take to compile programs?
This is not as important now as it once was
Executing programs
How long does program take to run?
Often there is a trade-off here
Ex: Java is slower than C++ but it has many run-time features (array bounds checking, security manager) that are lacking in C++

Language Implementation Issues
How is HLL code processed and executed on the computer?
Compilation
Source code is converted by the compiler into binary code that is directly executable by the computer
Compilation process can be broken into 4 separate steps:
1) Lexical Analysis
Breaks up code into lexical units, or tokens
Examples of tokens: reserved words, identifiers, punctuation
Feeds the tokens into the syntax analyzer

Language Implementation Issues
2) Syntax Analysis
Tokens are parsed and examined for correct syntactic structure, based on the rules for the language
Programmer syntax errors are detected in this phase
3) Semantic Analysis/Intermediate Code Generation
Declaration and type errors are checked here
Intermediate code generated is similar to assembly code
Optimizations can be done here as well, for example:
Unnecessary statements eliminated
Statements moved out of loops if possible
Recursion removed if possible

Language Implementation Issues
4) Code Generation
Intermediate code is converted into executable code
Code is also linked with libraries if necessary
Note that steps 1) and 2) are independent of the architecture; they depend only upon the language (the Front End)
Step 3) is somewhat dependent upon the architecture, since, for example, optimizations will depend upon the machine used
Step 4) is clearly dependent upon the architecture (the Back End)

Language Implementation Issues
Interpreting
Program is executed in software, by an interpreter
Source-level instructions are executed by a virtual machine
Allows for robust run-time error checking and debugging
Penalty is speed of execution
Examples: some LISP implementations, Unix shell scripts and Web server scripts

Language Implementation Issues
Hybrid
First 3 phases of compilation are done, and intermediate code is generated
Intermediate code is interpreted
Faster than pure interpretation, since the intermediate codes are simpler and easier to interpret than the source codes
Still much slower than compilation
Examples: Java and Perl
However, now Java uses JIT compilation also
Method code is compiled as it is called, so if it is called again it will be faster

Brief, Incomplete PL History
Early 50s
Early HLLs started to emerge
FORTRAN
Stands for FORmula TRANslating system
Developed by a team led by John Backus at IBM for the IBM 704 machine
Successful in part because of support by IBM
Designed for scientific applications
The root of the imperative language tree

Brief PL History
Lacked many features that we now take for granted in programming languages:
Conditional loops
Statement blocks
Recursive abilities
Many of these features were added in future versions of FORTRAN
FORTRAN II, FORTRAN IV, FORTRAN 77, FORTRAN 90
Had some interesting features that are now obsolete
COMMON, EQUIVALENCE, GOTO
We may discuss what these are later

Brief PL History
Late 50s
COBOL
COmmon Business Oriented Language
Developed by US DoD
Separated data and procedure divisions
But didn't allow functions or parameters
Still widely used, due in part to the large cost of rewriting software from scratch
Big companies would rather maintain COBOL programs than rewrite them in a different language

Brief PL History
LISP
LISt Processing
Developed by John McCarthy of MIT
Functional language
Good for symbolic manipulation, list processing
Had recursion and conditional expressions
Not in original FORTRAN
At one time used extensively for AI
Today most widely used version, COMMON LISP, has included some imperative features

Brief PL History
ALGOL
ALGOL 58 and then ALGOL 60, both designed by international committee
Goals for the language:
Syntax should be similar to mathematical notation and readable
Should be usable for algorithms in publications
Should be compilable into machine code
Included some interesting features
Pass by value and pass by name (wacky!) parameters
Recursion (first in an imperative language)
Dynamic arrays
Block structure and local variables

Brief PL History
Introduced Backus-Naur Form (BNF) as a way to describe the language syntax
Still commonly used today, but not well-accepted at the time
Never widely used, but influenced virtually all imperative languages after it

Brief PL History
Late 60s
Simula 67
Designed for simulation applications
Introduced some interesting features
Classes for data abstraction
Coroutines for re-entrant subprograms
ALGOL 68
Emphasized orthogonality and user-defined data types
Not widely used

Brief PL History
70s
Pascal
Developed by Niklaus Wirth
No major innovations, but due to its simplicity and emphasis on good programming style, became widely used for teaching
C
Developed by Dennis Ritchie to help implement the Unix operating system
Has a great deal of flexibility, esp. with types
Incomplete type checking

Brief PL History
Void pointers
Coerces many types
Many programmers (esp. systems programmers) love it
Language purists hate it
Easy to miss logic errors
Prolog
Logic programming
We discussed it a bit already
Still used somewhat, mostly in AI
May discuss in more detail later

Brief PL History
80s
Ada
Developed over a number of years by DoD
Goal was to have one language for all DoD applications
Especially for embedded systems
Contains some important features
Data encapsulation with packages
Generic packages and subprograms
Exception handling
Tasks for concurrent execution
We will discuss some of these later

Brief PL History
Very large language; difficult to program reliably, even though reliability was one of its goals!
Early compilers were slow and error-prone
Did not have the widespread general use that was hoped
Eventually the government stopped requiring it for DoD applications
Use faded after this
Not used widely anymore
Ada 95 added object-oriented features
Still wasn't used much, especially with the advent of Java and other OO languages

Brief PL History
Smalltalk
Designed and developed by Alan Kay
Concepts developed in 60s, but language did not come to fruition until 1980
Designed to be used on a desktop computer 15 years before desktop computers existed
First true object-oriented language
Language syntax is geared toward objects
Messages passed between objects
Methods are invoked as responses to messages
Always dynamically bound
All classes are subclasses of Object
Also included software devel. environment
Had large impact on future OOLs, esp. Java

Brief PL History
C++
Developed largely by Bjarne Stroustrup as an extension to C
Backward compatible
Added object-oriented features and some additional typing features to improve C
Very powerful and very flexible language
But still has reliability problems
Ex. no array bounds checking
Ex. dynamic memory allocation
Widely used and likely to be used for a while longer

Brief PL History
Perl
Developed by Larry Wall
Takes features from C as well as scripting languages awk, sed and sh
Some features:
Regular expression handling
Associative arrays
Implicit data typing
Originally used for data extraction and report generation
Evolved into the archetypal Web scripting language
Has many proponents and detractors

Brief PL History
90s
Java
Interestingly enough, just like Ada, Java was originally developed to be used in embedded systems
Developed at Sun by a team headed by James Gosling
Syntax borrows heavily from C++
But many features (flaws?) of C++ have been eliminated
No explicit pointers or pointer arithmetic
Array bounds checking
Garbage collection to reclaim dynamic memory

Brief PL History
Object model of Java actually more closely resembles that of Smalltalk than that of C++
All variables are references
Class hierarchy begins with Object
Dynamic binding of method names to operations by default
But not as pure in its OO features as Smalltalk, due to its imperative control structures
Interpreted for portability and security
Also JIT compilation now
Growing in popularity, largely due to its use on Web pages

Brief PL History
00's (aughts? oughts? naughts?)
See http://www.randomhouse.com/wotd/index.pperl?date=19990803
C#
Main roots in C++ and Java with some other influences as well
Used with the MS .NET programming environment
Some improvements and some "deprovements" compared to Java
Likely to succeed given MS support

Program Syntax
Recall job of syntax analyzer:
Groups (parses) tokens (fed in from lexical analyzer) into meaningful phrases
Determines if syntactic structure of token stream is legal based on rules of the language
Let's look at this in more detail
How does compiler know what is legal and what is not?
How does it detect errors?

Program Syntax
To answer these questions we must look at programming language syntax in a more formal way
Language:
Set of strings of lexemes from some alphabet
Lexemes are the lowest-level syntactic elements
Lexemes are made up of characters, as defined by the character set for the language

Program Syntax
Lexemes are categorized into different tokens and processed by the lexical analyzer
Ex:
if (width < height)
{
cout << width << endl;
}
Lexemes: if, (, width, <, height, ), {, cout, <<, width, <<, endl, ;, }
Tokens: iftok, lpar, idtok, lt, idtok, rpar, lbrace, idtok, llt, idtok, llt, idtok, semi, rbrace
Note that some tokens correspond to single lexemes (ex. iftok) whereas some correspond to many (ex. idtok)

Program Syntax
How do we formally define a language?
Assume we have a language, L, defined over an alphabet, Σ
2 related techniques:
1) Recognition
An algorithm or mechanism, R, will process any given string, S, of lexemes and correctly determine if S is within L or not
Not used for enumeration of all strings in L
Used by parser portion of compiler

Program Syntax
2) Generation
Produces valid sentences of L
Not as useful as recognition for compilation, since the valid sentences could be arbitrary
More useful in understanding language syntax, since it shows how the sentences are formed
Recognizer only says if sentence is valid or not; more of a trial and error technique

Program Syntax
So recognizers are what compilers need, but generators are what programmers need to understand language
Luckily there are systematic ways to create recognizers from generators
Thus the programmer reads the generator to understand the language, and a recognizer is created from the generator for the compiler

Language Generators
Grammar
A mechanism (or set of rules) by which a language is generated
Defined by the following:
A set of non-terminal symbols, N
Do not actually appear in strings
A set of terminal symbols, T
Appear in strings
A set of productions, P
Rules used in string generation
A starting symbol, S

Language Generators
Noam Chomsky described four classes of grammars (used to generate four classes of languages): the Chomsky Hierarchy
0) Unrestricted
1) Context-sensitive
2) Context-free
3) Regular
More info on unrestricted and context-sensitive grammars in a theory course
The last two will be useful to us

Language Generators
Regular Grammars
Productions must be of the form:
<non> → <ter><non> | <ter>
where <non> is a nonterminal, <ter> is a terminal, and | represents "or"
Can be modeled by a Finite-State Automaton (FSA)
Also equivalent to Regular Expressions
Provide a model for building lexical analyzers

Language Generators
Have following properties (among others)
Can generate strings of the form a^n, where a is a finite sequence and n is an integer
Pattern recognition
Can count to a finite number
Ex. { a^n | n = 85 }
But we need at least 86 states to do this
Cannot count to arbitrary number
Note that { a^n } for any n (i.e. 0 or more occurrences) is easy; do not have to count
Important to realize that the number of states is finite: cannot recognize patterns with an arbitrary number of possibilities

Language Generators
Example: Regular grammar to recognize Pascal identifiers (assume no caps)
N = {Id, X} T = {a..z, 0..9} S = Id
P =
Id → aX | bX | ... | zX | a | b | ... | z
X → aX | ... | zX | 0X | ... | 9X | a | ... | z | 0 | ... | 9
Consider the equivalent FSA
[FSA diagram: start state Id takes a letter to state X; X loops on letters and digits]

Language Generators
Example: Regular grammar to generate a binary string containing an odd number of 1s
N = {A,B} T = {0,1} S = A P =
A → 0A | 1B | 1
B → 0B | 1A | 0
Example: Regular grammars CANNOT generate strings of the form a^n b^n
Grammar needs some way to count number of a's and b's to make sure they are the same
Any regular grammar (or FSA) has a finite number, say k, of different states
If n > k, not possible
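
A minimal sketch (not from the notes) of the two-state FSA equivalent to the odd-number-of-1s grammar, with A = "even 1s so far" and B = "odd 1s so far":

#include <iostream>
#include <string>

// Accept exactly the binary strings containing an odd number of 1s.
bool oddOnes(const std::string& s) {
    bool odd = false;              // start in state A (even)
    for (char c : s) {
        if (c == '1') odd = !odd;  // a 1 moves between A and B
        // a 0 loops on the current state
    }
    return odd;                    // accept if we end in state B
}

int main() {
    std::cout << oddOnes("0110") << " "   // 0: two 1s (even)
              << oddOnes("100") << "\n";  // 1: one 1 (odd)
}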

Language Generators
If we could add a memory of some sort we could get this to work
Context-free Grammars
Can be modeled by a Push-Down Automaton (PDA)
FSA with added push-down stack
Productions are of the form:
<non> → a, where <non> is a nonterminal and a is any sequence of terminals and nonterminals
Note the RHS is more flexible now

Language Generators
So how to generate a^n b^n? Let a=0, b=1
N = {A} T = {0,1} S = A P =
A → 0A1 | 01
Note that now we can have a terminal after the nonterminal as well as before
Can also have multiple nonterminals in a single production
Example: Grammar to generate sets of balanced parentheses
N = {A} T = {(,)} S = A P =
A → AA | (A) | ()

Language Generators
Context-free grammars are also equivalent to BNF grammars
Developed by Backus and modified by Naur
Used initially to describe ALGOL 60
Given a (BNF) grammar, we can derive any string in the language from the start symbol and the productions
A common way to derive strings is using a leftmost derivation
Always replace leftmost nonterminal first
Complete when no nonterminals remain

Language Generators
Example: Leftmost derivation of nested parens: (()(()))
A ⇒ (A)
  ⇒ (AA)
  ⇒ (()A)
  ⇒ (()(A))
  ⇒ (()(()))
We can view this derivation as a tree, called a parse tree for the string

Language Generators
Parse tree for (()(()))
[parse tree diagram: root A with children ( A ); that A uses A → AA; the first child A derives (), the second derives ( A ) whose inner A derives ()]

Language Generators
If, for a given grammar, a string can be derived by two or more different parse trees, the grammar is ambiguous
Some languages are inherently ambiguous
All grammars that generate that language are ambiguous
Many other languages are not themselves ambiguous, but can be generated by ambiguous grammars
It is generally better for use with compilers if a grammar is unambiguous
Semantics are often based on syntactic form

Language Generators
Ambiguous grammar example: Generate strings of the form 0^n 1^m, where n,m >= 1
N = {A,B,C} T = {0,1} S = A
P =
A → BC | 0A1
B → 0B | 0
C → 1C | 1
Consider the string: 00011
[two parse trees: one starts with A → BC, with B deriving 000 and C deriving 11; the other starts with A → 0A1, with the inner A then using A → BC]

Language Generators
We can easily make this grammar unambiguous:
Remove production: A → 0A1
Note that nonterminal B can generate an arbitrary number of 0s and nonterminal C can generate an arbitrary number of 1s
Now only one parse tree
[parse tree: A → BC, with B deriving 000 and C deriving 11]

Language Generators
Let's look at a few more examples
Grammar to generate: { WW^R | W in {0,1}+ }
N = {A} T = {0,1} S = A P = ?
A → 0A0 | 1A1 | 00 | 11
Grammar to generate: strings in {0,1}* of the form WX such that |W| = |X| but W != X
This one is a little trickier
How to approach this problem?
We need to guarantee two things
Overall string length is even
At least one bit differs in the two halves

Language Generators
See board
Ok, now how do we make a grammar to do this?
Make every string (even length) the result of two odd-length strings appended to each other
Assume odd-length strings are Ol and Or
Make sure that either
Ol has a 1 in the middle and Or has a 0 in the middle, or
Ol has a 0 in the middle and Or has a 1 in the middle
Productions:
In → AB | BA
A → 0A0 | 1A1 | 1A0 | 0A1 | 1
B → 0B0 | 1B1 | 1B0 | 0B1 | 0

Language Generators
Let's look at an example more relevant to programming languages:
Grammar to generate simple assignment statements in a C-like language (diff. from one in text):
<assig stmt> ::= <var> = <arith expr>
<arith expr> ::= <term> | <arith expr> + <term> | <arith expr> - <term>
<term> ::= <primary> | <term> * <primary> | <term> / <primary>
<primary> ::= <var> | <num> | (<arith expr>)
<var> ::= <id> | <id>[<subscript list>]
<subscript list> ::= <arith expr> | <subscript list>, <arith expr>

Language Generators
stmt>* 20
Parse tree for: X<assig
= (A[2]+Y)
<var> = <arith expr>

<id>
<term>
<term>
*
<primary>
<primary>
<num>
<arith expr> )

<arith expr>
<term
>
<primary>
<var>

+ <term>
<primary>
<var>
<id>

<id> [ <subscript list>


]
<arith expr>
<term
>
<primary>
<num>

63

Language Generators
Wow, that seems like a very complicated parse tree to generate such a short statement
Extra non-terminals are often necessary to remove ambiguity
Extra non-terminals are often necessary to create precedence
Precedence in previous grammar has * and / higher than + and -
Higher-precedence operators appear lower in the parse tree
What about associativity?
Left-recursive productions == left associativity
Right-recursive productions == right associativity

Language Generators
But Context-free grammars cannot generate everything
Ex: Strings of the form WW in {0,1}*
Cannot guarantee that arbitrary string is the same on both sides
Compare to WW^R
These we can generate from the middle and build out in each direction
For WW we would need separate productions for each side, and we cannot coordinate the two with a context-free grammar
Need Context-Sensitive in this case

Language Generators
Let's look at one more grammar example
Grammar to generate all postfix expressions involving binary operators * and -. Assume <id> is predefined and corresponds to any variable name
Ex: v w x y - * z * -
How do we approach this problem?
Terminals: easy
Nonterminals/Start: require some thought
Productions: require a lot of thought

Language Generators
T = { <id>, *, - }
N = { A }
S = A
P =
A → AA* | AA- | <id>
Show parse tree for previous example
Is this grammar LL(1)?
We will discuss what this means soon

Parsers
Ok, we can generate languages, but how to recognize them?
We need to convert our generators into recognizers, or parsers
We know that a Context-free grammar corresponds to a Push-Down Automaton (PDA)
However, the PDA may be non-deterministic
As we saw in examples, to create a parse tree we sometimes have to guess at a substitution

Parsers
May have to guess a few times before we get the correct answer
This does not lend itself to programming language parsing
We'd like parser to never have to guess
To eliminate guessing, we must restrict the PDAs to deterministic PDAs, which restricts the grammars that we can use
Must be unambiguous
Some other, less obvious restrictions, depending upon parsing technique used

Parsers
There are two general categories of parsers
Bottom-up parsers
Can parse any language generated by a Deterministic PDA
Build the parse trees from the leaves up back to the root as the tokens are processed
At each step, a substring that matches the right-hand side of a production is substituted with the left side of the production
Reduces input string all the way back to the start symbol for the grammar
Also called shift-reduce parsing

Parsers
Correspond to LR(k) grammars
Left-to-right processing of string
Rightmost derivation of parse tree (in reverse)
k symbols of lookahead required
LR parsers are difficult to write by hand, but can be produced systematically by programs such as YACC (Yet Another Compiler Compiler)
Primary variations of LR grammars/parsers
SLR (Simple LR)
LALR (Look-Ahead LR)
LR most general but also most complicated to implement
We'll leave details to CS 1622

Parsers
Top-down parsers
Build the parse trees from the root down as the tokens are processed
Also called predictive parsers, or LL parsers
Left-to-right processing of string
Leftmost derivation of parse tree
The LL(1) that we saw before means we can parse with only one token of lookahead
More restrictive than LR parsers: there are grammars generated by Deterministic PDAs that are not LL grammars (i.e. cannot be parsed by an LL parser)
Some restrictions on productions allowed
Cannot handle left-recursion; we'll see why shortly

Parsers
Implementing a top-down parser
One technique is Recursive Descent
Can think of each production as a function
As string of tokens is parsed, terminal symbols are consumed/processed and non-terminal symbols generate function calls
Now we can see why left-recursive productions cannot be handled
From Example 3.4
<expr> → <expr> + <term>
Recursion will continue indefinitely without consuming any symbols

Parsers
Luckily, in most cases a grammar with left recursion can be converted into one with only right-recursion
Recursive Descent parsers can be written by hand, or generated
Think of a program that processes the grammar by creating a function shell for each non-terminal
Then details of function are filled in based upon the various right-hand sides the non-terminal generates
See example (a sketch follows)
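
The example referenced above is external to these notes; here is a minimal recursive-descent sketch of my own for a cut-down expression grammar, with the left recursion already converted to loops:

#include <cctype>
#include <iostream>
#include <string>

// Recognizer for:
//   <expr> -> <term> { + <term> }     (left recursion replaced by iteration)
//   <term> -> <id>   { * <id> }
// One function per nonterminal; 'pos' consumes terminals left to right.
static std::string input;
static size_t pos = 0;

bool id() {                        // <id>: a single lowercase letter here
    if (pos < input.size() && std::islower(static_cast<unsigned char>(input[pos]))) {
        ++pos;
        return true;
    }
    return false;
}

bool term() {                      // <term> -> <id> { * <id> }
    if (!id()) return false;
    while (pos < input.size() && input[pos] == '*') { ++pos; if (!id()) return false; }
    return true;
}

bool expr() {                      // <expr> -> <term> { + <term> }
    if (!term()) return false;
    while (pos < input.size() && input[pos] == '+') { ++pos; if (!term()) return false; }
    return true;
}

int main() {
    input = "a+b*c";
    std::cout << (expr() && pos == input.size() ? "accepted" : "rejected") << "\n";
}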

LL(1) Grammars
So how can we tell if a grammar is LL(1)?
Given the current non-terminal (or left side of a production) and the next terminal, we must be able to uniquely determine the right side of the production to follow
Remember that a non-terminal can have multiple productions
As we previously mentioned, the grammar must not be left recursive
However, not having left recursion is necessary but not sufficient for an LL(1) grammar

LL(1) Grammars
Ex:
A → aX | aY
We cannot determine which right side to follow without more information than just "a"
How can we process a grammar to determine if this situation occurs?
Calculate the First set for each RHS of productions
First set of a sequence of symbols, S, is the set of terminals that begin the strings derived from S
Given multiple RHS for nonterminal N
N → a1 | a2
If First(a1) and First(a2) intersect, the grammar is not LL(1)

LL(1) Grammars
So how do we calculate First() sets?
Algorithm is given in Aho (see Slide 2)
Consider symbol X
If X is a terminal, First(X) = {X}
If X → ε is a production, add ε to First(X)
If X is a nonterminal and X → Y1Y2...Yk is a production
Add a to First(X) if, for some i, a is in First(Yi) and ε is in all of First(Y1) ... First(Yi-1)
Add ε to First(X) if ε is in First(Yj) for all j = 1, 2, ..., k
To calculate First(X1X2...Xn) for some sequence X1X2...Xn
Add non-ε symbols of First(X1)
If ε is in First(X1), add non-ε symbols of First(X2)
If ε is in all First(Xi), add ε

LL(1) Grammars
A → aB | b | cBB
B → aB | bA | aBb
A → aB | CD | E | ε
B → b
C → cA | ε
D → dA
E → dB
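
A worked check (my reasoning, not spelled out in the notes): in the first grammar, B → aB and B → aBb both have First = {a}, so B's alternatives conflict and the grammar is not LL(1). In the second grammar, First(CD) = {c, d} (d because C can derive ε, exposing First(D)) and First(E) = {d}; since A offers both CD and E, the First sets intersect on d and this grammar is not LL(1) either.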

Semantics
Semantics indicate the meaning of a program
What do the symbols just parsed actually say to do?
Two different kinds of semantics:
Static Semantics
Almost an extension of program syntax
Deals with structure more than meaning, but at a meta level
Handles structural details that are difficult or impossible to handle with the parser
Ex: Has variable X been declared prior to its use?
Ex: Do variable types match?

Semantics
Dynamic Semantics (often just called semantics)
What does the syntax mean?
Ex: Control statements
Ex: Parameter passing
Programmer needs to know meaning of statements before he/she can use language effectively

Semantics
Static Semantics
One technique for determining/checking static semantics is Attribute Grammars
Start with a context-free grammar, and add to it:
Attributes (for the grammar symbols)
Indicate some properties of the symbols
Attribute computation functions (semantic functions)
Allow attributes to be determined
Predicate functions
Indicate the static semantic rules

Semantics
Attributes made up of synthesized attributes and inherited attributes
Synthesized Attributes
Formed using attributes of grammar symbols lower in the parse tree
Ex: Result type of an expression is synthesized from the types of the subexpressions
Inherited Attributes
Formed using attributes of grammar symbols higher in the parse tree
Ex: Type of RHS of an assignment is expected to match that of the LHS; the type is inherited from the type of the LHS variable

Semantics
Semantic Functions
Indicate how attributes are derived, based on the static semantics of the language
Ex: A = B + C;
Assume A, B and C can be integers or floats
If B and C are both integers, RHS result type is integer, otherwise it is float
Predicate functions
Test attributes of symbols processed to see if they match those defined by language
Ex: A = B + C
If RHS type attribute is not equal to LHS type attribute, error (in some languages)

Semantics
Detailed Example in text
Grammar Rules
1) <assign> → <var> = <expr>
2) <expr> → <var> + <var>
3) <expr> → <var>
4) <var> → A | B | C
Attributes
actual_type: actual type of <var> or <expr> in question (synthesized, but for a <var> we say this is an intrinsic attribute)
expected_type: associated with <expr>, indicating the type that it SHOULD be; inherited from actual_type of <var>

Semantics
Semantic functions
Parallel to syntax rules of the grammar
See Ex. 3.6 in text
1) <assign> → <var> = <expr>
   <expr>.expected_type ← <var>.actual_type
2) <expr> → <var>[2] + <var>[3]
   <expr>.actual_type ← if (<var>[2].actual_type = int) and (<var>[3].actual_type = int) then int else real end if
3) <expr> → <var>
   <expr>.actual_type ← <var>.actual_type
4) <var> → A | B | C
   <var>.actual_type ← look-up(<var>.string)
Predicate functions
Only one needed here: do the types match?
<expr>.actual_type == <expr>.expected_type

Semantics
Ex: A = B + C
[attribute flow diagram: <assign> has children <var> (A), =, and <expr>; the <expr> has children <var>[2] (B), +, <var>[3] (C); actual_type flows up from the <var> nodes, while expected_type flows down into <expr> from the LHS <var>]

Semantics
Attribute grammars are useful, but not typically used in their pure form for full-scale languages; full-scale use makes the grammars more complicated and compilers more difficult to generate

Semantics
Dynamic Semantics (semantics)
Clearly vital to the understanding of the language
In early languages they were simply informal, like manual pages
Efforts have been made in later years to formalize semantics, just as syntax has been formalized
But semantics tend to be more complex and less precisely defined
More difficult to formalize

Semantics
Some techniques have gained support, however
Operational Semantics
Define meaning by result of execution on a primitive machine, examining the state of the machine before and after the execution
Axiomatic Semantics
Preconditions and postconditions define meaning of statements
Used in conjunction with proofs of program correctness
Denotational Semantics
Map syntactic constructs into mathematical objects that model their meaning
Quite rigorous and complex

Identifiers, Reserved Words and Keywords
Identifier
String of characters used to name an entity within a program
Most languages have similar rules for ids, but not always
C++ and Java are case-sensitive, while Ada is not
Can be a good thing: mixing case allows for longer, more readable names, a la Java: NoninvertibleTransformException
Can be a bad thing: should that first i be upper or lower case?

Identifiers, Reserved Words and Keywords
C++, Ada and Java allow underscores, while standard Pascal does not
FORTRAN originally allowed only 6 chars
Reserved Word
Name whose definition is part of the syntax of the language
Cannot be used by programmer in any other way
Most newer languages have reserved words
Make parsing easier, since each reserved word will be a different token

Identifiers, Reserved Words and Keywords
Ex: end if in Ada
Interesting extension topic
If we extend a language and add new reserved words, we may make some old programs syntactically incorrect
Ex: C subprogram using class as an id will not compile with a C++ compiler
Ex: Ada 83 program using abstract as an id will not compile with an Ada 95 compiler
Keywords
To some, keyword == reserved word
Ex: C++, Java

Identifiers, Reserved Words and Keywords
To others, there is a difference
Keywords are only special in certain contexts
Can be redefined in other contexts
Ex: FORTRAN keywords may be redefined
Predefined Identifiers
Identifiers defined by the language implementers, which may be redefined
cin, cout in C++
real, integer in Pascal
predefined classes in Java

Identifiers, Reserved Words and Keywords
Programmer may wish to redefine for a specific application
Ex: Change a Java interface to include an extra method
Problem: predefined version no longer applies, so program segments that depend on it are invalid
Better to extend a class or compose a new class than to redefine a predefined class
Ex: Comparable interface can be implemented as we see fit by a new class

Variables
Simple (naive) definition: a name for a memory location
In fact, it is really much more
Six attributes
Name
Address
Value
Type
Lifetime
Scope

Variables
Name:
Identifier
In most languages the same name may be used for different variables, as long as there is no ambiguity
Some exceptions
Ex: A method variable name may be declared only once within a Java method

Variables
Address
Location in memory
Also called the l-value
Some situations that are possible:
Different variables with the same name have different addresses
Declared in different blocks of the program
Same variable has different addresses at different points in time
Declared in a subprogram and allocated based on run-time stack
We'll discuss this more shortly

Variables
Different variables share same address: aliasing
Occurs with FORTRAN EQUIVALENCE, Pascal and Ada record variants, C and C++ unions, pointer variables, reference parameters
Adds to the flexibility of a language, especially with pointers and reference parameters
Can also save memory in some situations
Many references to a single copy rather than having multiple copies
Can be quite problematic if programmer does not handle them correctly
Ex: copy constructor and = operator for classes with dynamic components in C++
Ex: shallow copy of arrays in Java
We'll discuss most of these things more in later chapters

Variables
Type
Modern data types include both the structure of the data and the operations that go with it
Important in determining legality of some expressions
Value
Contents of memory locations allocated for that variable
Also called the r-value

Variables
Lifetime
Time during which the variable is bound to a specific memory location
We'll discuss this more shortly
Scope
Section of the program in which a variable is visible
Accessible to the programmer/code in that section
We'll discuss this more shortly

Binding
Binding of variables deals with an association of variable attributes with actual values
The time when each of these occurs in a program is the binding time of the attribute
Static binding: occurs before runtime and does not change during program execution
Dynamic binding: occurs during runtime or changes during runtime

Binding
Name:
Occurs when program is written (chosen by programmer) and for most languages will not change: static
Address and Lifetime:
A memory location for a variable must be allocated and deallocated at some point in a program
The lifetime is the period between the allocation and deallocation of the memory location for that variable

Binding
We are ignoring bindings associated with a computer's virtual memory
This in fact could cause a variable to be bound and unbound to different memory locations many times throughout the execution of a program
Each time the data is swapped or paged out, the location is physically unbound, and then rebound when it is swapped or paged back in
We will ignore these issues since they are more related to how the operating system and hardware execute programs than they are to the way the variables are declared and used within the program

Binding
Text puts lifetimes of variables into 4 categories
Static: bound to same memory cell for entire program
Ex: static C++ variables
Stack-dynamic: bound and unbound based on run-time stack
Ex: Pascal, C++, Ada subprogram variables
Explicit heap-dynamic: nameless cells that can only be accessed using other variables (pointers)
Allocated (and poss. deallocated) explicitly by the programmer
Ex: result of new in C++ or Java

Binding
Implicit heap-dynamic variables
Binding of all attributes (except name) changed upon each assignment
Much overhead to maintain all of the dynamic information
Used in ALGOL 68 and APL
Not used in most newer languages
See lifetimes.cpp (a stand-in sketch follows)
Value
Dynamic by the nature of a variable: it can change during run-time
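
lifetimes.cpp is external to these notes; a minimal stand-in sketch showing static, stack-dynamic and explicit heap-dynamic lifetimes in C++:

#include <iostream>

int counter() {
    static int calls = 0;   // static lifetime: one cell for the whole program
    return ++calls;
}

int main() {
    int local = 5;          // stack-dynamic: allocated on the run-time stack
    int* heap = new int(7); // explicit heap-dynamic: nameless cell reached via a pointer
    std::cout << counter() << " " << counter() << " "  // 1 2 -- value persists across calls
              << local << " " << *heap << "\n";
    delete heap;            // explicit deallocation by the programmer
}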

Binding
Type
Dynamic Binding
Type associated with a variable is determined at run-time
A single variable could have many different types at different points in a program
Static Binding
Type associated with a variable is determined at compile-time (based on var. declaration)
Once declared, type of a variable does not change

Binding
Advantage of dynamic binding
More flexibility in programming
Can use same variable for different types
Can make operations generic
Disadv. of dynamic binding
Type-checking is limited and must be done at run-time
To understand this better, we must discuss type-checking at more length
We'll return to our last binding topic (scope) afterward

Type-checking
Type checking
Determination that operands and operators in an expression have types that are compatible with each other
By compatible we mean
the types match, or
they can be coerced (implicitly converted) to match
If they are not compatible, a type-error occurs

Type-checking
Why is type-checking important?
Let's review programming errors
Compilation error: detectable when the program is compiled
Usually a syntax error or static semantic error
Run-time error: detectable as the program is being run
Often an illegal instruction or I/O error
Logic error: error in meaning of the program
Often only detectable through debugging and/or testing program on known data
Program could seem to run perfectly, but produce incorrect results

Type-checking
We'd like an environment that is not conducive to logic errors
Consider dynamic type binding again
Assignments cannot be type checked
Since type may change, any assignment is legal
If object being assigned is an erroneous type, we have a logic error
Type checking that is done must be done at run-time
This requires type information to be stored and accessed at run-time
Must be done in software; i.e. the language must be interpreted
Increases both memory and run-time overhead

Type-checking
Now consider static type binding
Since types are set at compile-time, most (but not usually all) type checking can be done at compile-time
Assignments can be checked to avoid logic errors
Type information does not need to be kept at run-time
Program can run in hardware

Type-checking
STRONGLY TYPED language
2 slightly different definitions
Traditional definition: If ALL type checking can be done at compile-time (i.e. statically), a language is strongly typed
Sebesta definition: If ALL type errors can always be detected (either at compile-time or at run-time), a language is strongly typed
First definition is more reliable but also more restrictive

Type-checking
No commonly used languages are truly strongly typed, but some come close
Let's look at two: C++ and Ada
C++ union construct allows programmer to access same memory as different types with no checking
Ada record variants contain a discriminant to determine which type is being used
Can only access the type indicated by the discriminant
Cannot change the discriminant unless you change the entire record; prevents inconsistency
Checking can be turned off, however

Type-checking
See union.cpp and variant.adb (a union sketch follows)
Pascal also has a discriminant, but does not require entire record to be assigned when discriminant is changed
Suggests type-safety, but does not enforce it
We'll look more at these types of structures in Chapter 6
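
union.cpp and variant.adb are external to these notes; a minimal C++ stand-in for the unchecked-union case:

#include <iostream>

// A union overlays its members on the same memory;
// C++ performs no checking on which member was last written.
union Overlay {
    int   i;
    float f;
};

int main() {
    Overlay u;
    u.f = 1.5f;
    // Reading u.i after writing u.f is not type-checked:
    // on typical IEEE-754 systems this shows the raw bit
    // pattern of the float (1069547520), not the value 1.
    std::cout << u.i << "\n";
}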

Type-checking
So what does compatible mean?
We said type-checking involves determining if operands and operators are compatible
Different languages define it differently
Name compatible (or name equivalent)
The types have the same name
Easy to check and enforce
Simply record the type of each variable in the symbol table
Compare when necessary

Type-checking
Somewhat limiting for programmer
Ex. in Pascal
A1: array[1..10] of integer;
A2: array[1..10] of integer;
A2 := A1; { not allowed }
Assignment is not legal even though they have the same structure
Also cannot pass either variable as a parameter to a subprogram
Variables above actually each have an ANONYMOUS TYPE, not name compatible with any other type
Generally not a good idea to use

Type-checking
Structurally compatible (equivalent)
The types are compatible if they have the same structure
Ex:
A1: array[1..10] of integer;
A2: array[1..10] of integer;
A2 := A1; { this would be allowed }
Since both have same size and base type, it works
Much more flexible than name compatible, but also much more difficult for compiler and not as clear to programmer
Compiler must compare structures of every variable involved in the expression
It is not obvious what is and is not compatible

Type-checking
record
X: float;
A: array[1..10] of int;
end record

Can compiler tell that

components are
reversed?
A1:array[1..10] of float;

Structure is the same

record
A: array[1..10] of int;
X: float;
end record

Could it match the

types with more


complex records?
A2:array[3..12] of float;

Index values are

changed
118

Type Checking
So how about some languages?
Ada: name equivalence, but allows subtypes to be compatible with parent types
Pascal: almost name equivalence, but considers variables in the same declaration to also be of the same type, and allows one type name to be set equal to another
A1, A2: array[1..10] of integer;
(above not compatible in Ada)
type newint = integer;
C++: name equivalence, but a lot of coercion is done; we will look at coercion later
See types.p, types.adb, types.cpp

Binding
Ok, back to binding
Scope (visibility)
Static scope: determined at compile-time
Dynamic scope: determined at run-time
Implications:
Most languages have the notion of local variables (i.e. within a block or a subprogram)
If these are the only variables we use, scope is not important

Binding
Scope becomes significant when dealing with non-local variables
How and where are these accessible?
Most modern languages use static scope
If variable is not locally declared, proceed out to the textual parent (static parent) of the block/subprogram until the declaration is found
Fairly clear to programmer: can look at code to see scope
We'll discuss implementation later

Binding
Examples:
Pascal
Subprograms are only scope blocks, but can be nested to arbitrary depth
All declarations must be at the beginning of a subprogram
Somewhat restrictive, although not the fault of static scope
Ada
Subprograms can be nested
Also allows declare blocks to nest scope within the same subprogram
Useful for variable-length arrays
All declarations must be at the beginning of a subprogram or declare block
See arraytest.adb

Binding
C++
Subprograms CANNOT be nested
New declaration blocks can be made with {}
Declarations can be made anywhere within a block, and last until the end of the block
Interesting note:
What about the scope of loop control variables in for loops?
Ada always implicitly declares LCVs, and scope (and lifetime) is LOCAL to the loop body
C++ and Java for is more general and does not require a locally declared LCV, but if one is included, it is also LOCAL to the loop body
Wasn't always the case in C++

Binding
Dynamic scope
Non-local variables are accessed via calls on the run-time stack (going from top to bottom until declaration is found)
A non-local reference could be declared in different blocks in different runs
Used in APL, SNOBOL4 and through local variables in Perl
Flexible but very tricky
Difficult for programmer to anticipate different definitions of non-local variables

Binding
Type-checking must be dynamic, since types of non-locals are not known until run-time
More time-consuming
See scope.pl

Scope, Lifetime and Referencing Environments
Concepts of Scope and Lifetime are not comparable
Lifetime deals with existence and association of memory (WHEN)
Scope deals with visibility of variable names (WHERE)
However, sometimes they seem to be the same

Scope, Lifetime and Referencing Environments
Ex. stack-dynamic variables in some places
Ex. global variables in some situations
#include <iostream>
using namespace std;
int i = 10;                       // global i
int main()
{
  {
    int i = 11, j = 20;           // this i hides the global
    {
      int i = 12, k = 30;         // this i hides both outer i's
      cout << i << " " << j << " " << k << endl;  // 12 20 30
    }
    cout << i << " " << j << endl;  // 11 20
  }
  cout << i << endl;              // 10 -- the global is visible again
}

Scope, Lifetime and Referencing Environments
More often they are not the same
Lifetime of stack-dynamic variables continues when a subsequent function is called, whereas scope does not include the body of the subsequent function
Lifetime of heap-dynamic variables continues even if they are not accessible at all

Scope, Lifetime and Referencing Environments
Referencing Environment
Given a statement in a program, what variables are visible there?
Clearly this depends on the type of scoping used
Once we know the scope rules, we can always figure this out
From code if static scoping is used
From call chains (at run-time) if dynamic scoping is used
We will look more at non-local variable access when we discuss subprograms

Primitive Data Types
Most languages provide these
Numeric types
Integer
Usually 2's complement
Exact representation of the number
In some languages (ex. C++) size is not specified
Depends on hardware
In others (ex. Java) size is specified
Regardless of hardware

Primitive Data Types
Float
Usually exponent/mantissa form, each coded in binary
Often an approximation of the number
Only a limited number of bits for mantissa
Decimal point "floats"
Many numbers cannot be represented exactly
Irrational numbers (ex. pi, e, 2^(1/2))
Infinite repeating decimals (ex. 1/3)
Some terminating decimals as well (ex. 0.1)
Instead we use digits of precision
In most new computers, this is defined by IEEE Standard 754
See p. 255 in text
See rounding.cpp (a stand-in sketch follows)
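
rounding.cpp is external to these notes; a minimal stand-in showing that 0.1 is only approximated in binary:

#include <iomanip>
#include <iostream>

int main() {
    double sum = 0.0;
    for (int i = 0; i < 10; ++i) sum += 0.1;   // 0.1 has no exact binary form
    std::cout << std::setprecision(17) << sum << "\n";  // ~0.99999999999999989
    std::cout << (sum == 1.0 ? "equal" : "not equal") << "\n";  // not equal
}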

Primitive Data Types
Fixed point
Called Decimal in text
Store rational numbers with a fixed number of decimal places
Useful if we need decimal point to stay put
Ex. working with money
Can be stored as Binary Coded Decimal: each digit encoded in binary
Similar to strings, but we can put 2 digits into one byte, since each digit needs only 4 bits
However, the 10 digit values do not actually require 4 full bits (which can encode 16 values)
So memory is somewhat wasted in this representation

Primitive Data Types
Can also be stored as integers, with an extra attribute in the descriptor
Scale factor indicates how many places over to locate decimal point
Ex: X = 102.53 and Y = 32.65
Stored as 10253 and 3265 (in binary) with scale factor of 2
Z = X + Y
Add the integers and keep the same scale factor (involves normalization if the number of decimal places are not the same; think about this)
= 13518 with scale factor 2 = 135.18
Z = X * Y
Multiply the integers and add the scale factors, then normalize to the correct number of digits
= 33476045 with scale factor 4
= 3347.6045, normalized to 2 digits
= 3347.60
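
A minimal sketch (my own) of the same scaled-integer arithmetic in C++, with scale factor 2 (values stored as value * 100):

#include <cstdint>
#include <iostream>

int64_t add(int64_t x, int64_t y) { return x + y; }          // same scale factor kept
int64_t mul(int64_t x, int64_t y) { return (x * y) / 100; }  // scale factors add (2+2=4); renormalize to 2

int main() {
    int64_t x = 10253, y = 3265;        // 102.53 and 32.65
    std::cout << add(x, y) << "\n";     // 13518  -> 135.18
    std::cout << mul(x, y) << "\n";     // 334760 -> 3347.60 (truncated from 3347.6045)
}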

Primitive Data Types
But clearly limited in size by integer size
In Java the BigDecimal class (not primitive) stores them as (very long) integers with a 32-bit scale factor; see BigD.java
Very Long Numbers
Typically NOT primitive types
Implemented using arrays via a predefined class (ex. BigInteger and BigDecimal in Java) or via an add-on library (NTL library for C++)
Boolean
Used in most languages (except C)
Values are true or false

Primitive Data Types
Though stored as integer values (0 or 1), boolean values are typically not compatible with integer values
Exception is C++, where all non-zero values are true, 0 is false
This adds flexibility, but can cause logic errors
Recurring theme!
Remember assign.cpp

Primitive Data Types
Character
In early computers, character codes were not standardized
Caused trouble for transferring text files
Now most computers use ASCII character set (or extended ASCII set)
But ASCII is not adequate for international character sets
Unicode is used in Java, perhaps in other languages soon
16-bit code

Primitive Data Types
ASCII is also not the same for Unix platforms and Windows platforms
Unix uses only a <LF> while Windows uses a <CR><LF> combination
Can cause problems for programs not equipped to handle the difference
We can easily convert, however
FTP in ASCII mode
Simple script to convert

More Data Types
Strings
Not a primitive type in many languages
C, C++, Pascal, Ada
Primitive type in others
BASIC, FORTRAN, Perl
Issues to consider:
Should a string be considered a single entity, or a collection of characters?
Should the length of a string be fixed or variable?
Which operations can be used?

More Data Types
Single entity vs. collection of characters
More an issue of access than of structure
In languages with no primitive string type, a string is generally perceived as a collection of characters
Ex: In C and C++ a string is an array of char
If we want we can treat it like any other array
If we use operations in string.h (or <cstring>) we can treat it like a string
C++ also has a string class
Ex: In Ada, String is a predefined unconstrained array of characters
String variables must be constrained
Some simple operations can be used (ex. assignment, comparison, concatenation)

More Data Types
In Perl, string is a primitive type
Strings are accessed as single entities
We can use functions to get character index values, but strings are stored as single, primitive variable values
Many operations can be used (later)
In Java, we have two string types
Neither is really primitive, since they are classes built using an underlying array of char
String
Immutable objects (cannot be changed once created)
Allows strings to be stored and accessed more efficiently, since multiple objects with the same value can be replaced with a single object (at compile-time; usually done for literals)

More Data Types
StringBuffer
Objects that can be modified after creation
More efficient if programmer is altering string values, since new objects do not have to be created for each operation
String length
3 variations
1) Static (fixed) length: length of string is set when the object is created and cannot change
2) Limited dynamic length: length of string can vary up to a preset maximum
Typically the dynamic aspect is logical; physically the string size is preset, but some of the space may not be used
3) Dynamic length: length of string can vary without limit

More Data Types
1) Used in Pascal, Ada
String variable is declared to be of a fixed size, with all locations being part of the string
It is up to programmer to either pad string at end or somehow ignore unused locations
However, Ada improves on Pascal strings by making the String type an unconstrained array
String objects must have a fixed length, but they are all of the same type (ex: for params)
See astrings.adb and pstrings.p
2) Used in C, C++ (C strings)
String variable is of a fixed maximum size, but any number of characters up to the maximum length may be used at any time
In C and C++ a special sentinel character (\0) is used to indicate the end of the logical string
Implementation is VERY dangerous (what else is new!)
See cstrings.cpp (a stand-in sketch follows)
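
cstrings.cpp is external to these notes; a minimal stand-in showing the sentinel and the danger:

#include <cstring>
#include <iostream>

int main() {
    char buf[8];                            // fixed maximum size: 7 chars + '\0'
    std::strcpy(buf, "hello");              // fine: logical length 5, sentinel at buf[5]
    std::cout << std::strlen(buf) << "\n";  // 5 -- length found by scanning for '\0'
    // std::strcpy(buf, "far too long a string");  // compiles, but overruns buf:
    // no bounds check is performed -- the classic buffer overflow
}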

More Data Types
3) Used in Perl, Java StringBuffer, C++ string class (among others)
Physical length of the string object is automatically adjusted by the system to hold the current logical string
Memory is reallocated if necessary
Only limit on string size is amount of memory available
Realize that this is not free: each time the string has to be resized the following occurs:
New memory of new size is allocated
Old data is copied to new memory
Variable is rebound to new memory
Old memory is discarded
It is smart for system to double memory each time so resizing is infrequent
We'll discuss allocation later
See Strings.java (a C++ sketch of the growth behavior follows)
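
A minimal sketch (my own; the growth factor is implementation-defined, often around 2x) of watching a C++ string reallocate as it grows:

#include <iostream>
#include <string>

int main() {
    std::string s;
    size_t cap = s.capacity();
    std::cout << "start capacity: " << cap << "\n";
    for (int i = 0; i < 200; ++i) {
        s += 'x';                       // an append may trigger reallocation
        if (s.capacity() != cap) {      // report each growth step
            cap = s.capacity();
            std::cout << "size " << s.size() << " -> capacity " << cap << "\n";
        }
    }
    // Capacities grow geometrically so the copying cost is
    // amortized over many appends rather than paid every time.
}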

More Data Types
Operations
What can we do with our strings?
These are affected by the previous issue (length)
If the length must remain constant, we cannot have any operations that would change the length
Some possibilities:
Slicing (give a substring)
Search for a string
Index
Regular expression matching

More Data Types
Enumeration Types
Used primarily for readability
Ex: January, February, ... vs. 0, 1, ...
Typically map identifiers to integer values
Pascal
Identifiers can be used only once
I/O not predefined (user must write I/O operations for each new enum. type)
Major negative
Can be used in for loops and case statements

More Data Types
Ada
Ids can be overloaded as long as the correct definition can be determined through context
type colors is (red, blue, yellow, purple, green);
type signals is (red, yellow, green);
Now the literal green (or yellow or red) is ambiguous in and of itself
We must qualify it in the code to the desired type
for C in signals'(red) .. signals'(green)
I/O can be done through a generic I/O package; very helpful
Allows values to be output without having to write new functions each time

More Data Types
C++
Enum values are converted to ints by the compiler
Lose the type checking that Ada and Pascal provide
Java added Enum types in 1.5
All are a subclass of Enum, so are objects
Subrange types
Improve readability (new name id) and logic-error checking (restricting ranges)
Provided in Pascal and Ada

More Data Types
Arrays
Traditionally a homogeneous collection of sequential locations
Many issues to consider with arrays; a few are:
How are they indexed?
How/when is memory allocated?
How are multidimensional arrays handled?
What operations can be done on arrays?

More Data Types
Indexing
Mapping that takes an array ID and an index value and produces an element within the array
Y = A[X];
Range of index values depends on two things:
1) How many items are in the array?
This depends on how allocation is allowed in a language
We will discuss more with memory allocation
2) Are any bounds implicit?
C, C++, Java: arrays start at 0
Ada, Pascal: can start anywhere

More Data Types
Whether range of indexes is static or dynamic, the actual subscripts must be evaluated at run-time
Subscript could be a variable or expression whose value is not known until run-time
If value is not within the range of the array, a range-error occurs
VERY COMMON logic error made in programs
Checked in Ada, Pascal and Java
Not checked in FORTRAN, C, C++, much to the chagrin of programmers
Q: Why aren't range-errors checked in C, C++?

More Data Types


Memory allocation
Remember discussion of allocation during Binding
lecture
Basically those bindings hold, with some slight additions
Static: size of array is fixed and allocation is done at compile-time
Fixed stack-dynamic: size of array is fixed but
allocation is done on run-time stack during
execution
Pascal arrays
"Normal" C++ arrays

151

More Data Types


Stack-dynamic: size of array is determined at run-time and array is allocated on run-time stack; however, once allocated, size is fixed
Ada arrays
See prev. Ada array example

Heap-dynamic:
Explicit: programmer allocates and deallocates arrays
C++, Java, FORTRAN 90 dynamic arrays
Implicit: array is resized automatically as needed by the system
Perl, Java ArrayList

See examples on board
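The board examples are not reproduced; a hedged C++ sketch of three of the four categories follows (C++ has no implicit heap-dynamic arrays in the language proper, but the library type std::vector behaves like one):

    #include <vector>

    static int s_arr[100];           // static: size fixed, storage allocated before execution

    void f(int n) {
        int fixed[10];               // fixed stack-dynamic: fixed size, allocated on the stack at call time
        fixed[0] = s_arr[0];

        int* heap = new int[n];      // explicit heap-dynamic: programmer allocates...
        heap[0] = 1;
        delete[] heap;               // ...and deallocates

        std::vector<int> v;          // library type acting like an implicit heap-dynamic array
        v.push_back(fixed[0]);       // resized automatically as needed
    }

    int main() { f(5); return 0; }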

152

More Data Types


Multiple Dimension Arrays
2-D Array: array of vectors (a plane, or matrix)
3-D Array: array of matrices or planes
Higher dimension arrays are not so easily
envisioned, but in most languages are legal
Subscripts for each dimension can be different
ranges, and, in fact even different types
Ex. Pascal:
type matrix = array[1..10, 'A'..'Z'] of real;
Indexing function is a bit more complicated, as we will see next

153

More Data Types


Implementing Arrays (1-D)
Two questions we need to answer:
What information do we need at compile-time?
What information do we need at run-time?
The answers to these questions depend somewhat on other language properties
If dynamic type-binding is used, most information must be kept at run-time


Even with static type-binding, some information
may be needed at run-time

154

More Data Types


Let's assume static type-binding, since most languages use this


What do we need to know at compile-time?
Element type
Index type
Above two needed for type checking

Element size
Index range (lower and upper bounds)
Array base address
Above items are needed to construct the indexing function for the array

Let's see how the indexing function is created
Given an array A, we want a function to return the Lvalue (address) of the ith location of A


155

More Data Types


Assume:
E = Element size
LB = Lower index bound
UB = Upper index bound
S = start (base) address of array

Lvalue(A[i]) = S + (i - LB) * E
             = (S - LB * E) + (i * E)
Note that once S is known (array has been bound to memory), the left part of the equation is constant
So for each array access, only (i * E) must be calculated
If i is a constant, this too can be precalculated (ex. A[5])
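A minimal C++ check of the indexing function (names hypothetical); here LB = 0 and E = sizeof(int), and we compare our hand-computed address against the compiler's own:

    #include <cstdio>

    int main() {
        int A[10];
        char* S = reinterpret_cast<char*>(A);    // base (start) address of the array
        int i = 7;
        char* lv = S + (i - 0) * sizeof(int);    // Lvalue(A[i]) = S + (i - LB) * E
        std::printf("%d\n", lv == reinterpret_cast<char*>(&A[i]));  // prints 1: addresses match
        return 0;
    }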

156

More Data Types


What do we need to know at run-time?
This depends on what kind of checking we are doing
Ex: C/C++ need only the start address (S) of the array and the element size (E)


LB is always 0
UB is not checked

Ex: Ada, Pascal need more:


LB and UB needed for bounds checking

Run-time information for an array is typically stored in a run-time descriptor (sometimes called a dope vector)


This may be stored in the same memory block as the data itself (ex. at the beginning), or elsewhere in memory
157

More Data Types


Implementing Arrays (multi-D)
This is more complicated than 1-D, since computer memory is typically linear


Thus multi-D arrays are still stored physically as
1-D arrays
Language must allow them to be accessed in a multi-D way
Ex. Two-D arrays
In most languages stored in row-major order
Line up rows end-to-end to produce the linear physical array

158

More Data Types


Column-major order also exists; it is used in FORTRAN

Index function for 2-D arrays is a bit more complicated than for 1-D, but is still not that hard to calculate
lvalue(A[i,j]) = S + (i - LB1) * ROWLENGTH + (j - LB2) * E
where ROWLENGTH = (UB2 - LB2 + 1) * E
Example: Given int A[10][5], calculate the address of A[4][3] (worked below)
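Working it through under C conventions (LB1 = LB2 = 0, so UB2 = 4) and assuming 4-byte ints (E = 4; the element size varies by machine):

    ROWLENGTH = (UB2 - LB2 + 1) * E = (4 - 0 + 1) * 4 = 20
    lvalue(A[4][3]) = S + (4 - 0) * 20 + (3 - 0) * 4 = S + 80 + 12 = S + 92

So A[4][3] lives 92 bytes (element number 23) past the base address S.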

159

More Data Types


Can we access arrays in any other way?
In some languages (ex: C, C++) we can use
pointers for array access
In this case rather than calculating the index of the
location as an offset from the base address, we
move the pointer to the address of the location we
want to access
For sequential access of the items in an array, this
is much more efficient than using traditional
indexing
Of course, it is also dangerous!
See ptrarrays.cpp
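ptrarrays.cpp is not reproduced here; a minimal C++ sketch of the idea, advancing a pointer one element at a time instead of computing S + i * E on every access:

    #include <cstdio>

    int main() {
        int a[5] = {1, 2, 3, 4, 5};
        for (int* p = a; p != a + 5; ++p)   // ++p moves forward by one element (E bytes)
            std::printf("%d ", *p);         // dereference: no index arithmetic per access
        // Danger: nothing stops ++p from walking past a + 5
        return 0;
    }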

160

More Data Types


Array Operations
What can be done to array as a whole?
Assignment
Allowed in languages in which array is actually a data type (ex. Ada, Pascal, Java)


Usually bitwise copy between arrays (shallow copy)
C/C++ do not allow assignment since array variables are simply (constant) pointers (see the sketch after this list)
Comparison
Equal and not equal allowed in Ada
With Java references are compared

More complex operations are allowed in some languages (ex. APL)
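A small C++ sketch of the assignment point: raw arrays cannot be assigned, but std::array (a C++11 library struct wrapping a raw array) shows what whole-array assignment looks like when the array really is a first-class value:

    #include <array>

    int main() {
        int a[3] = {1, 2, 3};
        int b[3];
        // b = a;                          // error: raw array assignment is not allowed in C/C++
        b[0] = a[0];

        std::array<int, 3> c = {1, 2, 3};
        std::array<int, 3> d;
        d = c;                             // copies element-by-element (a shallow, bitwise-style copy)
        return 0;
    }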


161

More Data Types
Associative Arrays
Instead of being indexed on integers (or simple enumeration values), index on arbitrary keys (usually strings)
Ex: Perl hash (also exists in Smalltalk and Java) and PHP arrays


Java and other langs also have Hashtable
classes
Typically done following hashing procedures
Hash function used to map key to an index, mod the table size


Keys are dispersed in a random fashion, to
reduce collisions
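C++ gained a comparable library type later (std::unordered_map, a hash table added in C++11); a hedged sketch of indexing on arbitrary string keys:

    #include <unordered_map>
    #include <string>
    #include <cstdio>

    int main() {
        std::unordered_map<std::string, int> age;   // index on arbitrary string keys
        age["alice"] = 30;                          // key is hashed to a bucket index
        age["bob"] = 25;
        std::printf("%d\n", age["alice"]);          // lookup also goes through the hash function
        return 0;
    }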

162

More Data Types


Typically in these language-implemented hashes, the table size is not shown to the programmer
Programmer does not need to know details in order to use it
Usually it grows and shrinks dynamically as the table fills or becomes smaller
Resizing is expensive: all items must be rehashed into the new-sized table


Would like to not have to resize too often
Doubling size with each expansion is a good idea

163

More Data Types


Records
Heterogeneous collections of data, accessed by component names
Very useful in creating structured data types
Forerunners of objects (do not have the
operations, just the data)
Access fairly uniform across languages with dot
notation
Fields accessed by name since they may have varying size
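A minimal C++ record (struct) sketch, with heterogeneous fields of varying size accessed by dot notation (field names illustrative):

    #include <string>
    #include <cstdio>

    struct Student {            // heterogeneous collection of data
        std::string name;
        int         id;
        double      gpa;
    };

    int main() {
        Student s{"Ada", 1815, 4.0};
        std::printf("%s %d %.1f\n", s.name.c_str(), s.id, s.gpa);   // dot-notation access
        return 0;
    }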

164

More Data Types


Pointers
Variables that store addresses of other variables

[Diagram: a pointer variable containing the address 01001101, pointing to a memory cell at that address holding 3.14159]
In most high-level languages, their primary use
is to allow access of dynamic memory
Exception: C, where pointers are required for reference parameters and where they are also used heavily for array access
We'll talk about parameters later

165

More Data Types


To access the memory being pointed to, a pointer is dereferenced
Usually explicit:

C++ Ex: *P = 6.02;


Implicit in some contexts:
Ada Ex: rec_ptr.field
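Putting the explicit C++ case into a complete, hedged snippet:

    #include <cstdio>

    int main() {
        double* P = new double;    // P stores the address of a heap-dynamic double
        *P = 6.02;                 // explicit dereference: follow P to the memory it points to
        std::printf("%f\n", *P);
        delete P;
        return 0;
    }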

166

More Data Types


Pointer Issues:
Should pointers be typed like other variables?
How do the scope and lifetime of the pointer
variable relate (if at all) to the scope and lifetime of
the heap-dynamic variable?
What are the safety/access issues a programmer
using pointers must be concerned with?
How is heap-dynamic memory maintained?
What are reference types and how are they
similar/different to pointer types?

167

More Data Types


Types
Most languages with pointers require them to be typed
This allows type-checking of heap-dynamic memory

Scope/Lifetime/Safety/Access
In most languages the scope and lifetime of heap-dynamic variables are distinct from those of the pointer variables
If pointer variable is a stack-dynamic variable, its scope and lifetime are as we discussed previously
Heap-dynamic variables are also as we discussed previously

168

More Data Types


Thus we have the potential for 2 different problems:
1) Lifetime of pointer variable accessing a heap-dynamic variable ends, but heap-dynamic variable still exists
Now heap-dynamic variable is no longer accessible
This problem can also occur by simply reassigning the pointer variable
Now we have formed GARBAGE: a heap-dynamic variable that can no longer be accessed
Different languages handle garbage in different ways; we will discuss them shortly

169

More Data Types


2) Heap-dynamic variable is in some way deallocated (returned to system), but pointer variable is still storing its address
Now address stored in pointer is invalid
Can cause problems if it is dereferenced, especially if the memory has been reallocated for some other use
This is a dangling reference
Can be catastrophic to a program, as most C++
programmers know
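Both problems in one hedged C++ sketch; the final dereference is left commented out because executing it is undefined behavior, exactly the catastrophe described above:

    #include <cstdio>

    int main() {
        // Problem 1: garbage
        int* p = new int(1);
        p = new int(2);        // the first int is now unreachable: GARBAGE (leaked)

        // Problem 2: dangling reference
        int* q = p;
        delete p;              // heap-dynamic variable returned to the system...
        // *q = 3;             // ...but q still stores its address: DANGLING reference
        std::printf("done\n");
        return 0;
    }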

To deal with these problems, we must discuss how heap-dynamic memory is maintained

170
