Chapter 1. Lexical Analysis
A small psychological exercise demonstrates what lexical analysis is. Read aloud
the following JavaScript program:
1. You probably read the program word by word (rather than letter by letter).
2. You may have treated (mentally or vocally) () as a unit rather than as two
   parentheses.
4. You (mentally or vocally) treated the string "JavaScript is not Java.\n"
   as a unit.
Most programming languages since the days of FORTRAN are structured in such
a way that a program splits into a linear sequence of words. In the process of
determining the boundaries between words, certain insignificant elements of a
program (such as comments) drop out. Since no higher-level syntactic processing
is involved in this phase, it makes sense to perform this splitting before further
syntactic analysis. This splitting is called lexical analysis. The result of lexical
analysis for the above program might look like this:
⟨function⟩
⟨identifier [main]⟩
⟨lparen⟩
⟨rparen⟩
⟨lbrace⟩
⟨identifier [print]⟩
⟨lparen⟩
⟨string [JavaScript is not Java.↵]⟩
⟨rparen⟩
⟨semicolon⟩
⟨rbrace⟩
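As a minimal sketch, the token sequence above can be written down as an OCaml value. The type and its constructor names are our own illustration and do not come from the text; only identifiers and strings carry an attribute here.

```ocaml
(* Token classes for the example program; the names are illustrative. *)
type token =
  | Function
  | Identifier of string   (* attribute: the name chosen by the programmer *)
  | LParen
  | RParen
  | LBrace
  | RBrace
  | StringLit of string    (* attribute: the contents of the string *)
  | Semicolon

(* The token sequence listed above, as an OCaml list. *)
let example_tokens =
  [ Function; Identifier "main"; LParen; RParen; LBrace;
    Identifier "print"; LParen; StringLit "JavaScript is not Java.\n";
    RParen; Semicolon; RBrace ]

(* Keywords and punctuation have no attribute; identifiers and strings do. *)
let attribute = function
  | Identifier s | StringLit s -> Some s
  | _ -> None
```

Separating the token class from its attribute is what lets later phases treat all identifiers alike while still being able to recover the actual name.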
Some of the details of this splitting may seem arbitrary at this point: Why does
function become a unit ⟨function⟩ while main is denoted as ⟨identifier [main]⟩? The
answer is this: the word function appears in the JavaScript language definition
as one of the keywords of the language. Since the language definition is finite, the
number of keywords must also be finite. Hence, it makes sense to treat keywords as
special units. The word main, on the other hand, does not appear in the language
definition: main is an identifier chosen by the programmer. There are infinitely
many identifiers, but they all share the same syntactic status. Hence the
⟨identifier [main]⟩ encoding, which separates this information from the actual
name of the identifier.
1. it divides the input into logically cohesive sequences of characters, the lexemes;
2. it filters out formatting characters, like spaces, tabulators, and newline
   characters (whitespace characters); and
3. it maps lexemes into tokens, i.e., symbolic names for classes of lexemes. Most
   tokens carry attributes which are computed from the lexeme. There is a
   one-to-one correspondence between lexemes and token/attribute pairs.
scan : Σ∗ → (T × A)∗
unscan : (T × A)∗ → Σ∗
The functions scan and unscan have the following derived property.
Thus, the scan function splits up the input into lexemes, maps them to
token/attribute pairs, and removes white space along the way. The functions scan
and unscan are in general not inverses due to white space removal: scan removes
all white space irreversibly and unscan has to introduce white space so that the
original lexemes are properly separated.
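The asymmetry between scan and unscan can be illustrated with a deliberately tiny sketch. It assumes a toy language in which every maximal run of non-blank characters is a lexeme and there is a single token class Word; both functions are our own simplifications, not the scanner developed in this chapter.

```ocaml
(* Toy setting: the only token class is Word, its attribute is the lexeme. *)
type token = Word of string

let is_blank c = c = ' ' || c = '\t' || c = '\n'

(* scan: drop all white space and keep the lexemes in order. *)
let scan (input : string) : token list =
  String.map (fun c -> if is_blank c then ' ' else c) input
  |> String.split_on_char ' '
  |> List.filter (fun s -> s <> "")
  |> List.map (fun s -> Word s)

(* unscan: re-introduce one blank between neighbouring lexemes. *)
let unscan (tokens : token list) : string =
  String.concat " " (List.map (fun (Word s) -> s) tokens)
```

In this toy setting, scanning the output of unscan gives back the same tokens, whereas unscan (scan "a  b") yields "a b": the original spacing is lost.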
1.2 Construction of Scanners
• The word problem w ∈ R is decidable in linear time in the length of the word w.
(q, ε, q) ∈ δ̂
(q, a w, q′) ∈ δ̂ iff (∃q″) (q, a, q″) ∈ δ and (q″, w, q′) ∈ δ̂

(q, ε, q) ∈ δ̂
(q, w, q′) ∈ δ̂ iff (∃q″) (q, ε, q″) ∈ δ and (q″, w, q′) ∈ δ̂
(q, a w, q′) ∈ δ̂ iff (∃q″) (q, a, q″) ∈ δ and (q″, w, q′) ∈ δ̂
• ∅ ∈ RE(Σ)
• ε ∈ RE(Σ)
• if a ∈ Σ then a ∈ RE(Σ)
• if r1, r2 ∈ RE(Σ) then r1 r2 ∈ RE(Σ)
• if r1, r2 ∈ RE(Σ) then r1 | r2 ∈ RE(Σ)
• if r ∈ RE(Σ) then r∗ ∈ RE(Σ)
1.3 Descriptions of Regular Languages
A regular expression defines a language as prescribed by the function
L : RE(Σ) → P(Σ∗).
L(∅) = ∅
L(ε) = {ε}
L(a) = {a}
L(r1 r2 ) = L(r1 ) · L(r2 )
:= {w1 w2 | w1 ∈ L(r1 ), w2 ∈ L(r2 )}
L(r1 | r2 ) = L(r1 ) ∪ L(r2 )
L(r∗ ) = L(r)∗
:= {w1 w2 . . . wn | n ∈ N, wi ∈ L(r)}
= {ε} ∪ L(r) ∪ L(r) · L(r) ∪ L(r) · L(r) · L(r) ∪ . . .
A word w belongs to the language described by r iff w ∈ L(r).
1 Example
The Caml reference manual contains a section Lexical Conventions. Figure 1.1
shows its description of the lexemes for identifiers and integer constants.
Lexical analysis exists for chiefly pragmatic reasons: the more involved syntactic
analysis which follows can be much simpler because of it. Moreover, regular
grammars have well-known algorithms to recognize them. Theoretical computer science
tells us that a finite, deterministic automaton (DFA) can serve as a recognizer for
any regular language. A DFA is a simple machine, and thus reasonably easy to
implement. Unfortunately, the construction of the state diagram for a DFA is an
involved and tedious process. Therefore, it makes things easier to try to circumvent
the explicit construction of the automaton. Fortunately, the automaton follows
automagically from our simpler approach to recognizing regular languages.
We follow a slightly different approach, which has been used with variations in
implementing regular expression search in text editors. The basic idea is to use a set
of regular expressions as the set of states for an automaton. Since we can associate
a language to each state q of a finite automaton (the set of words that transform
q into a final state), we simply want to label each state with a regular expression
for this language. Clearly, the initial state q0 corresponds to the regular expression
that we want to recognize. But what is the regular expression for the state δ(q0, a)?
To address this problem, we define the derivative of a regular expression, i.e., a
function D : RE(Σ) × Σ → RE(Σ) such that
w ∈ L(D(r, a)) iff a w ∈ L(r).
Thus, if a w ∈ L(r) then D(r, a) recognizes the rest, w, of the input word after
reading a. This corresponds to the language recognized by the state δ(q0, a).
1.2 Definition
The derivative of a regular expression, D : RE(Σ) × Σ → RE(Σ), is defined by
induction on the definition of RE(Σ). It relies on an auxiliary function
E : RE(Σ) → RE(Σ), which is specified by the equations for E given alongside
those for D.
D(∅, a) = ∅
D(ε, a) = ∅
D(a′, a) = ε   if a = a′
           ∅   otherwise
D(r1 r2, a) = D(r1, a) r2 | E(r1) D(r2, a)
D(r1 | r2, a) = D(r1, a) | D(r2, a)
D(r∗, a) = D(r, a) r∗

E(∅) = ∅
E(ε) = ε
E(a) = ∅
E(r1 r2) = E(r1) E(r2)
E(r1 | r2) = E(r1) | E(r2)
E(r∗) = ε
□
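The equations for D and E translate almost literally into OCaml. The following is a minimal sketch under our own assumptions: the constructor names follow the code fragments that appear later in this chapter, Symbol (over characters) is a guess, and the smart constructor alternate additionally applies r | r = r so that E always yields exactly Null or Epsilon.

```ocaml
type regexp =
  | Null                          (* ∅ *)
  | Epsilon                       (* ε *)
  | Symbol of char                (* a *)
  | Concat of regexp * regexp     (* r1 r2 *)
  | Alternate of regexp * regexp  (* r1 | r2 *)
  | Repeat of regexp              (* r* *)

(* Smart constructors: r∅ = ∅r = ∅, rε = εr = r, r|∅ = ∅|r = r.
   The rule r | r = r is an extra assumption of this sketch. *)
let concat r1 r2 =
  if r1 = Null || r2 = Null then Null
  else if r1 = Epsilon then r2
  else if r2 = Epsilon then r1
  else Concat (r1, r2)

let alternate r1 r2 =
  if r1 = Null then r2
  else if r2 = Null then r1
  else if r1 = r2 then r1
  else Alternate (r1, r2)

(* E(r): Epsilon if r accepts the empty word, Null otherwise. *)
let rec e = function
  | Null | Symbol _ -> Null
  | Epsilon | Repeat _ -> Epsilon
  | Concat (r1, r2) -> concat (e r1) (e r2)
  | Alternate (r1, r2) -> alternate (e r1) (e r2)

(* D(r, a): the derivative of r with respect to the symbol a. *)
let rec d r a =
  match r with
  | Null | Epsilon -> Null
  | Symbol a' -> if a = a' then Epsilon else Null
  | Concat (r1, r2) -> alternate (concat (d r1 a) r2) (concat (e r1) (d r2 a))
  | Alternate (r1, r2) -> alternate (d r1 a) (d r2 a)
  | Repeat r1 -> concat (d r1 a) (Repeat r1)

(* A word belongs to L(r) iff the iterated derivative accepts ε. *)
let matches r word =
  e (Seq.fold_left d r (String.to_seq word)) = Epsilon

(* Usage: (ε | -) digit digit*, with digit = 0 | 1 | ... | 9. *)
let digit =
  List.fold_right alternate
    (List.init 10 (fun i -> Symbol (Char.chr (Char.code '0' + i)))) Null

let integer_literal =
  concat (alternate Epsilon (Symbol '-')) (concat digit (Repeat digit))
```

With these definitions, matches integer_literal "-42" holds, while the words "-" and "4-2" are rejected.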
1.3 Theorem
L(r) = L(E(r)) ∪ ⋃_{a∈Σ} a · L(D(r, a))
□
Starting from a regular expression r0 that defines the language we are interested
in, it is now easy to define an automaton that recognizes this language.
1.4 Theorem
Let r0 ∈ RE(Σ). Define the deterministic automaton M = (Q, Σ, δ, q0, F) as follows:
• Q is the smallest subset of RE(Σ) such that
  1. r0 ∈ Q;
  2. if r ∈ Q and a ∈ Σ then D(r, a) ∈ Q.
• δ(q, a) = D(q, a)
• q0 = r0
• F = {r ∈ Q | ε ∈ L(r)}.
r∅ = ∅r = ∅
rε = εr = r
r | ∅ = ∅ | r = r
∅∗ = ε∗ = ε
r∗∗ = r∗
Using these simplification rules, it is guaranteed that the set of states Q is finite
[Brz64] so that the construction actually yields a DFA.
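The construction of Q can be sketched as a closure computation: start with r0 and keep adding derivatives until nothing new appears. The code below is our own illustration, not the chapter's implementation. It uses a Boolean nullable test in place of E, takes the alphabet as an explicit list, and compares states by structural equality. It terminates on the example used here; for arbitrary expressions, further identifications (for instance, treating r | s and s | r as equal) may be needed before the loop is guaranteed to stop.

```ocaml
type regexp =
  | Null | Epsilon | Symbol of char
  | Concat of regexp * regexp
  | Alternate of regexp * regexp
  | Repeat of regexp

let concat r1 r2 =
  if r1 = Null || r2 = Null then Null
  else if r1 = Epsilon then r2
  else if r2 = Epsilon then r1
  else Concat (r1, r2)

let alternate r1 r2 =
  if r1 = Null then r2 else if r2 = Null then r1 else Alternate (r1, r2)

(* nullable r holds iff ε ∈ L(r); it plays the role of E. *)
let rec nullable = function
  | Null | Symbol _ -> false
  | Epsilon | Repeat _ -> true
  | Concat (r1, r2) -> nullable r1 && nullable r2
  | Alternate (r1, r2) -> nullable r1 || nullable r2

let rec deriv r a =
  match r with
  | Null | Epsilon -> Null
  | Symbol b -> if a = b then Epsilon else Null
  | Concat (r1, r2) ->
      let rest = if nullable r1 then deriv r2 a else Null in
      alternate (concat (deriv r1 a) r2) rest
  | Alternate (r1, r2) -> alternate (deriv r1 a) (deriv r2 a)
  | Repeat r1 -> concat (deriv r1 a) (Repeat r1)

(* Q: the closure of {r0} under deriv, in order of discovery. *)
let states (alphabet : char list) (r0 : regexp) : regexp list =
  let rec explore seen = function
    | [] -> List.rev seen
    | r :: todo when List.mem r seen -> explore seen todo
    | r :: todo -> explore (r :: seen) (List.map (deriv r) alphabet @ todo)
  in
  explore [] [r0]

(* F: the states whose language contains the empty word. *)
let final_states alphabet r0 = List.filter nullable (states alphabet r0)

(* Usage over the small alphabet {-, 0, 1}. *)
let digit = Alternate (Symbol '0', Symbol '1')
let r0 = concat (alternate Epsilon (Symbol '-')) (concat digit (Repeat digit))
```

For this r0, the exploration finds four states: r0 itself, digit digit∗, the sink ∅, and digit∗, of which only digit∗ is final.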
2 Example
For an example, recall part of the regular expression for integer literals from Fig. 1.1.
⟨integer-literal⟩ = (ε | −)⟨digit⟩⟨digit⟩∗
Now
D(⟨integer-literal⟩, −)
= D((ε | −)⟨digit⟩⟨digit⟩∗, −)
= D(ε | −, −)⟨digit⟩⟨digit⟩∗ | E(ε | −)D(⟨digit⟩⟨digit⟩∗, −)
= (D(−, −) | D(ε, −))⟨digit⟩⟨digit⟩∗ | (E(−) | E(ε))D(⟨digit⟩⟨digit⟩∗, −)
= (ε | ∅)⟨digit⟩⟨digit⟩∗ | (∅ | ε)D(⟨digit⟩⟨digit⟩∗, −)
    apply simplification
= ⟨digit⟩⟨digit⟩∗ | D(⟨digit⟩⟨digit⟩∗, −)
    last use of D simplifies to ∅ because - is not a ⟨digit⟩
= ⟨digit⟩⟨digit⟩∗
Hence, after reading a - the automaton still expects a non-empty word of ⟨digit⟩s.
It is not a final state because ε ∉ L(⟨digit⟩⟨digit⟩∗).
In the same way, we can check that
D(⟨integer-literal⟩, +) = ∅
D(⟨integer-literal⟩, ⟨digit⟩) = ⟨digit⟩∗
In the first case, the automaton has reached a sink state ∅ which is not a final state
and which cannot be left by any transition. In the second case, the automaton has
consumed a digit: it has reached a final state but is also ready to read further digits.
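(* The head of concat is missing here; its first two lines are
   reconstructed from the simplification rule r∅ = ∅r = ∅. *)
let concat r1 r2 =
  if r1 = Null || r2 = Null
  then Null
  else if r1 = Epsilon
  then r2
  else if r2 = Epsilon
  then r1
  else Concat(r1, r2)

let alternate r1 r2 =
  if r1 = Null
  then r2
  else if r2 = Null
  then r1
  else Alternate(r1, r2)

let repeat r =
  if r = Null || r = Epsilon
  then Epsilon
  else Repeat(r)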
Some simple functions are useful in creating composite regular expressions:
concat_list [r1; . . . ; rn] = r1 . . . rn
alternate_list [r1; . . . ; rn] = r1 | · · · | rn
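One way to realize these two functions is to fold the binary smart constructors over the list. This is a sketch under the assumption that the empty list should map to the neutral element of the respective operation (ε for concatenation, ∅ for alternation); the text does not say how the empty list is treated. The Symbol constructor is likewise an assumption.

```ocaml
type regexp =
  | Null | Epsilon | Symbol of char
  | Concat of regexp * regexp
  | Alternate of regexp * regexp
  | Repeat of regexp

let concat r1 r2 =
  if r1 = Null || r2 = Null then Null
  else if r1 = Epsilon then r2
  else if r2 = Epsilon then r1
  else Concat (r1, r2)

let alternate r1 r2 =
  if r1 = Null then r2 else if r2 = Null then r1 else Alternate (r1, r2)

(* r1 ... rn; the empty list yields ε, the unit of concatenation. *)
let concat_list rs = List.fold_right concat rs Epsilon

(* r1 | ... | rn; the empty list yields ∅, the unit of alternation. *)
let alternate_list rs = List.fold_right alternate rs Null
```

Because the smart constructors already drop ε and ∅, the neutral elements used to start the fold never show up in the resulting expression.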
Now that there is functionality for creating regular expressions, the next job is
to check, for a given sequence of alphabet symbols symbols, if it belongs to the
language defined by a regular expression regexp. The function matches will do
exactly that. Its first few lines are straightforward. (It needs to prefix names from
module Regexp because it resides in a different module.)
xξ ∈ L(r) ⇐⇒ ξ ∈ L(after_symbol x r)
ifoundsalvationinapubliclavatory
are syntactically correct Caml expressions starting with if. Now, the lexical syntax
of Caml would allow partitioning ifoundsalvationinapubliclavatory into lexemes
in several different ways: either into if and oundsalvationinapubliclavatory
or just as ifoundsalvationinapubliclavatory. Obviously (or is it?), the latter
alternative is the intended one.
The standard way of resolving this conflict is the rule of the longest match: The
first lexeme of a character sequence is its longest prefix which is a lexeme. To
find the longest prefix, even if the scanner recognizes a lexeme, it must continue
examining characters of the input until the current prefix is no longer a prefix of
a lexeme. Then the scanner returns the last lexeme recognized. This process may
involve returning characters to the input.
If there are still two different ways of tokenizing a single lexeme, then the
textually preceding rule from the specification is given preference.
1.6 A Real Scanner
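The rule of the longest match and the preference for the textually preceding rule can be sketched on top of derivatives. Everything below is our own illustration: the names longest_match and next_lexeme, the representation of a specification as a list of (name, regexp) pairs, and the use of a Boolean nullable test are assumptions, not the scanner of this chapter.

```ocaml
type regexp =
  | Null | Epsilon | Symbol of char
  | Concat of regexp * regexp
  | Alternate of regexp * regexp
  | Repeat of regexp

let concat r1 r2 =
  if r1 = Null || r2 = Null then Null
  else if r1 = Epsilon then r2
  else if r2 = Epsilon then r1
  else Concat (r1, r2)

let alternate r1 r2 =
  if r1 = Null then r2 else if r2 = Null then r1 else Alternate (r1, r2)

let rec nullable = function
  | Null | Symbol _ -> false
  | Epsilon | Repeat _ -> true
  | Concat (r1, r2) -> nullable r1 && nullable r2
  | Alternate (r1, r2) -> nullable r1 || nullable r2

let rec deriv r a =
  match r with
  | Null | Epsilon -> Null
  | Symbol b -> if a = b then Epsilon else Null
  | Concat (r1, r2) ->
      let rest = if nullable r1 then deriv r2 a else Null in
      alternate (concat (deriv r1 a) r2) rest
  | Alternate (r1, r2) -> alternate (deriv r1 a) (deriv r2 a)
  | Repeat r1 -> concat (deriv r1 a) (Repeat r1)

(* Length of the longest prefix of input (from pos) that lies in L(r).
   Scanning continues past a match and remembers the last one seen. *)
let longest_match r input pos =
  let n = String.length input in
  let rec go r i best =
    let best = if nullable r then Some (i - pos) else best in
    if i >= n || r = Null then best
    else go (deriv r input.[i]) (i + 1) best
  in
  go r pos None

(* Rules are tried in textual order; a later rule wins only if its
   match is strictly longer. *)
let next_lexeme rules input pos =
  List.fold_left
    (fun best (name, r) ->
      match longest_match r input pos, best with
      | Some len, Some (_, best_len) when len <= best_len -> best
      | Some len, _ -> Some (name, len)
      | None, _ -> best)
    None rules

(* Usage: the keyword if versus lower-case identifiers. *)
let letter =
  List.fold_right alternate
    (List.init 26 (fun i -> Symbol (Char.chr (Char.code 'a' + i)))) Null
let rules =
  [ ("if", Concat (Symbol 'i', Symbol 'f'));
    ("ident", Concat (letter, Repeat letter)) ]
```

On the input "ifoundsalvation" the identifier rule wins because its match is longer, whereas on "if x" both rules match two characters and the earlier keyword rule is preferred.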
exception Scan_error

(* Regular expression for any single character between c1 and c2. *)
let char_range_regexp c1 c2 =
  let int_range = Listplus.from_to (Char.code c1) (Char.code c2) in
  let re i = symbol (Char.chr i) in
  alternate_list (List.map re int_range)

(* An optional minus sign followed by at least one digit. *)
let integer_literal =
  concat (alternate epsilon (symbol '-'))
         (repeat_one digit)

let whitespace =
  repeat_one
    (alternate_list
       (List.map symbol [ ' '; '\t'; '\n'; '\r'; '\012' ]))
A whitespace lexeme consists of a non-empty sequence of blanks, tabulators,
newline, carriage return, and form feed characters.
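The helper Listplus.from_to used in char_range_regexp is not part of the OCaml standard library. Judging from its use, it presumably produces the list of integers between two bounds, both inclusive. Under that assumption, a stand-in can be sketched with the standard library alone; the name chars_between is our own.

```ocaml
(* Assumed behaviour of Listplus.from_to: the integers lo, lo+1, ..., hi. *)
let from_to lo hi =
  if hi < lo then [] else List.init (hi - lo + 1) (fun i -> lo + i)

(* All characters between c1 and c2, both inclusive. *)
let chars_between c1 c2 =
  List.map Char.chr (from_to (Char.code c1) (Char.code c2))
```

Mapping the symbol constructor over chars_between '0' '9' and combining the results with alternate_list would then give a regular expression for a single digit.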
Next, we define a datatype for the tokens of the Caml language.
type caml_token =
    TIdent
  | TInt
  | ...
The scanner specification itself is a list of pairs of a regular expression and an
action function, as explained above. Typically, we define a scanner as a recursive
function, so that it can call itself recursively to consume further input. This happens
in the action for whitespace.
Ocamllex translates a scanner specification (described by the non-terminal ⟨lexer⟩)
into a corresponding Caml module. The two ⟨code-snippet⟩s are copied literally
to the beginning and end of this module, respectively. They can contain arbitrary
code, which is not checked during scanner generation.
The ⟨body⟩ of the specification consists of a list of ⟨definition⟩s for regular
languages and some ⟨rules⟩. Each definition introduces a name for a regular expression.
Definitions cannot be recursive.
Each rule in ⟨rules⟩ specifies a separate scanner function. Scanner functions
may call each other recursively. The body of a scanner function is a list of pairs
of a regular expression ⟨regexp⟩ and an Ocaml expression. Each regular expression
specifies a lexeme language and the associated expression defines the action taken
upon finding the lexeme.
Regular expressions are entered using the syntax defined in Sec. 1.3, with a few
exceptions and extensions.
• Ocaml character and string constants serve as constant regular expressions.
• Character classes are a built-in notion. They are defined by [ ⟨char-set⟩ ],
  where
    ⟨char-set⟩ ::= ⟨char-literal⟩
                 | ⟨char-literal⟩ - ⟨char-literal⟩
                 | ⟨char-set⟩ ⟨char-set⟩
  where ⟨char-literal⟩ stands for a character constant in Ocaml syntax, the
  second alternative denotes a character range (cf. function char_range_regexp),
  and the third alternative denotes the union of two ⟨char-set⟩s.
• Character classes may be complemented by writing [^ ⟨char-set⟩ ].
• An identifier may refer to a preceding definition let ⟨ident⟩ = ⟨regexp⟩.
Each scanning rule ⟨entry-point⟩ = parse . . . is translated to a function
• ⟨entry-point⟩ lexbuf
  may be called to continue scanning at ⟨entry-point⟩ (which may be different
  from the current scanner) immediately after the current lexeme.
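To make the structure of a specification concrete, here is a small sketch of an ocamllex input file with a header snippet, two definitions, one rule, and an empty trailer. It is an ocamllex specification rather than plain OCaml, so it must be run through ocamllex first. The token type with its arguments, the entry point name token, and the identifier syntax are assumptions made for illustration.

```ocaml
{
  (* header: copied literally to the beginning of the generated module *)
  type caml_token = TIdent of string | TInt of int | TEof
}

let digit  = ['0'-'9']
let letter = ['a'-'z' 'A'-'Z']

rule token = parse
  | [' ' '\t' '\n' '\r' '\012']+           { token lexbuf }  (* skip whitespace *)
  | '-'? digit+ as lit                     { TInt (int_of_string lit) }
  | letter (letter | digit | '_')* as name { TIdent name }
  | eof                                    { TEof }

{
  (* trailer: copied literally to the end of the generated module *)
}
```

The whitespace action calls token lexbuf, that is, the scanner calls itself recursively to continue immediately after the whitespace lexeme.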
type caml_token =
    TIdent
  | TInt
  | ...
1. Build a hash table from the keywords before starting the scanner.
The hash table is constructed only once and its lookup should be performed as
quickly as possible. Hence, it is appropriate to spend some effort on its
construction. Since all entries of the hash table are known a priori, it is possible
to search for a perfect hash function that avoids collisions. This way, a lookup in
the hash table can be guaranteed to run in constant time.
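A minimal sketch of this keyword table follows. It assumes that keywords are first scanned as identifier lexemes and then looked up; the token constructors and the keyword list are illustrative. Note that the standard Hashtbl module is a general-purpose hash table, not a perfect hash function, so it only approximates the constant-time guarantee discussed above.

```ocaml
(* Illustrative token type: a few keywords plus identifiers. *)
type token = TIf | TThen | TElse | TLet | TIdent of string

let keywords = [ ("if", TIf); ("then", TThen); ("else", TElse); ("let", TLet) ]

(* Built once, before scanning starts. *)
let keyword_table : (string, token) Hashtbl.t =
  let table = Hashtbl.create (List.length keywords) in
  List.iter (fun (word, tok) -> Hashtbl.replace table word tok) keywords;
  table

(* Turn an identifier lexeme into a keyword token or an identifier token. *)
let classify (lexeme : string) : token =
  match Hashtbl.find_opt keyword_table lexeme with
  | Some tok -> tok
  | None -> TIdent lexeme
```

With this arrangement the regular expressions of the scanner need only one rule for identifiers, and the decision between keyword and identifier is made in the action.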
Hence, a scanner first maps the strings arising as identifier lexemes to symbols.
The usual implementation of this mapping uses an open hashing algorithm. Each
new entry in the table is assigned a unique identifier (a number).
These considerations give rise to the following module type SYMBOL.
type symbol
val symbol : string -> symbol
val name : symbol -> string
A typical implementation for the type symbol is a pair of the string representation
and a (unique) number.
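One possible implementation of this signature is sketched below, representing a symbol as a pair of its string and a number, as suggested. The use of the standard Hashtbl module, the initial table size, and the counter are our own choices.

```ocaml
module type SYMBOL = sig
  type symbol
  val symbol : string -> symbol
  val name : symbol -> string
end

module Symbol : SYMBOL = struct
  (* A symbol is its string representation plus a unique number. *)
  type symbol = string * int

  let table : (string, symbol) Hashtbl.t = Hashtbl.create 64
  let next_id = ref 0

  (* Return the existing symbol for s, or allocate a fresh one. *)
  let symbol s =
    match Hashtbl.find_opt table s with
    | Some sym -> sym
    | None ->
        let sym = (s, !next_id) in
        incr next_id;
        Hashtbl.add table s sym;
        sym

  let name (s, _) = s
end
```

Since the type is abstract outside the module, clients can only obtain symbols through Symbol.symbol, so equal strings always yield the same number.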