Professional Documents
Culture Documents
Principles
Part 1: Language Specification
Attributes of a language
Syntax: rules describing use of language
tokens
Semantics: logical meaning of
combinations of tokens
In a programming language, tokens
include identifiers, keywords, and
punctuation
Linguistic correctness
A syntactically correct program is one in
which the tokens are arranged so that the
code can be successfully translated into a
lower-level language
A semantically correct program is one
that produces correct results
Elements of a Language
Alphabet: finite, non-empty set of characters
not precisely the same thing we mean when we
speak of natural language alphabet
for example, the alphabet of C++ includes the upperand lowercase letters of the English alphabet, the
digits 0-9, and the following punctuation symbols:
{,},[,],(,),+,-,*,/,%,=,>,<,!,&,|,,,,,.,:,;,_,\
Language as ADT
A language is an example of an Abstract
Data Type (ADT)
An ADT has these characteristics:
Set of possible values (an alphabet)
Set of operations on those values
Concatenation
Concatenation is the joining of two or more
characters to form a string
Many programming language tokens are formed
this way; for example:
> and = form >=
& and & form &&
1, 2, 3 and 4 form 1234
String characteristics
The number of characters in a string is the
strings length
An empty string is a string with length 0;
we denote the empty string with the
symbol
The is the identity element for
concatenation; if x is string, then:
x = x = x
Closure of an alphabet
The set of all possible strings that can formed by
concatenating elements from an alphabet is the
alphabets closure, denoted T* for some
alphabet T
The closure of an alphabet includes strings that
are not valid tokens in the language; it is not a
finite set
For example, if R is the real number alphabet,
then R* includes:
-0.092 and 563.18 but also
.0.0.- and 2-4-2.9..-5.
Parts of a grammar
N: a nonterminal alphabet; each element
of N represents a group of characters
from:
T: a terminal alphabet
P: a set of rules for string production; uses
nonterminals to describe language structure
S: the start symbol, an element of N
Valid strings
S (the start symbol) is a single symbol, not
a set
Given S and P (rules for production), you
can decide whether a set of symbols is a
valid string in the language
Conversely, starting from S, if you can
generate a string of terminal symbols
using P, you can create a valid string
Productions
Aw
a non-terminal
produces
Derivations
A grammar specifies a language through
the derivation process:
begin with the start symbol
substitute for non-terminals using rules of
production until you get a string of terminals
<identifier><letter>
<identifier><identifier><letter>
<identifier><identifier><digit>
<letter> a
<letter> b
<letter> c
<digit> 1
<digit> 2
<digit> 3
S = <identifier>
<identifier><letter> (rule 2)
<identifier>c (rule 6)
<identifier><letter>c (rule 2)
<identifier>bc (rule 5)
<identifier><digit>bc (rule 3)
<identifier>2bc (rule 8)
<identifier><digit>2bc (rule 3)
<identifier>12bc (rule 7)
<letter>12bc
a12bc
Closure of derivation
The symbol * means derives in 0 or more
steps
A language specified by a grammar consists of
all strings derivable from the start symbol using
the rules of production
provides operational test for membership in the
language
if a string cant be derived using production rules, it
isnt in the language
<identifier><letter>
<identifier>a
Since there is no <identifier><digit>
combination in the production rules, we cant
proceed any further
This means that 2a isnt a valid string in our
language
N = {I, F, M}
I means integer
F means first symbol; optional sign
M means magnitude
I FM
F +
F F (means +/- is optional)
M dM
Md
S=I
Examples
Deriving 14:
I FM
M
dM
dd
14
Deriving -7:
I FM
-M
-d
-7
Recursive rules
Both of the previous examples (identifiers,
integers) have rules in which a
nonterminal is defined in terms of itself:
<identifier> <identifier><letter> and
M dM
Context-sensitive grammar
A grammar in which the production rules
may contain more than one non-terminal
on the left side
The opposite (all of the examples we have
seen thus far), have production rules
restricted to a single non-terminal on the
left: these are known as context-free
grammars
Example
N = {A,B,C}
T = {a,b,c}
P = the productions:
1.
2.
3.
4.
5.
A aABC
A abC
CB BC
bB bb
bC bc
6.
cC cc
S=A
Context-sensitive grammar
Example:
N = {A, B, C}
aaabbbcc is a valid string by:
A
=> aABC (1)
T = {a, b, c}
=> aaABCBC (1)
P = the productions
=> aaabCBCBC (2)
1.
2.
3.
4.
5.
6.
A --> aABC
A --> abC
CB --> BC
bB --> bb
bC --> bc
cC --> cc
S=A
abc
aabbcc
aaabbbccc
aaaabbbbcccc
Invalid:
aabc
cba
bbbccc
P: the productions:
1. E -> E + T
2. E -> T
3. T -> T * F
4. T -> F
5. F -> (E)
6. F -> a
S=E
Syntax tree
A parse routine can be represented as a
tree
start symbol is the root
interior nodes are nonterminal symbols
leaf nodes are terminal symbols
children of an interior node are symbols from
right side of production rule substituted for
parent node in derivation