
Chapter 1

Lexical Analysis
A small psychological exercise demonstrates what lexical analysis is. Read aloud
the following JavaScript program:

function main () {
    print ("JavaScript is not Java.\n");
}
Listening to yourself, you will notice the following peculiarities:

1. You probably read the program word by word (rather than letter by letter).

2. You may have treated (mentally or vocally) () as a unit rather than as two
parentheses.

3. You did not read aloud spaces and line breaks.

4. You (mentally or vocally) treated the string "JavaScript is not Java.\n"
as a unit.

Most programming languages since the days of FORTRAN are structured in such
a way that a program splits into a linear sequence of words. In the process of
determining the boundaries between words, certain insignificant elements of a
program (such as comments) drop out. Since no higher-level syntactic processing
is involved in this phase, it makes sense to perform this splitting before further
syntactic analysis. This splitting is called lexical analysis. The result of lexical
analysis for the above program might look like this:

⟨function⟩
⟨identifier [main]⟩
⟨lparen⟩
⟨rparen⟩
⟨lbrace⟩
⟨identifier [print]⟩
⟨lparen⟩
⟨string [JavaScript is not Java.\n]⟩
⟨rparen⟩
⟨semicolon⟩
⟨rbrace⟩

Some of the details of this splitting may seem arbitrary at this point: Why does
function become a unit ⟨function⟩ while main is denoted as ⟨identifier [main]⟩? The
answer is this: the word function appears in the JavaScript language definition
as one of the keywords of the language. Since the language definition is finite, the
number of keywords must also be finite: Hence, it makes sense to treat keywords as
special units. The word main, on the other hand, does not appear in the language
definition: main is an identifier chosen by the programmer. There are infinitely
many identifiers, but they all share the same syntactic status: Hence the ⟨identifier
[main]⟩ encoding which separates this information from the actual name of the
identifier.

1.1 Basic definitions


The part of a compiler responsible for lexical analysis is called a scanner or a lexer.
It operates on the program as a sequence of characters and performs three main
tasks:

1. it divides the input into logically cohesive sequences of characters, the lexemes;

2. it filters out formatting characters, like spaces, tabulators, and newline
characters (whitespace characters); and

3. it maps lexemes into tokens, i.e., symbolic names for classes of lexemes. Most
tokens carry attributes which are computed from the lexeme. There is a one-
to-one correspondence between lexemes and token/attribute pairs.
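To make the third task concrete, here is one possible OCaml rendering of token/attribute pairs for the JavaScript example above. The type and constructor names are illustrative only; the chapter's actual definitions appear in Sections 1.5 and 1.6.

```ocaml
(* Illustrative token and attribute types, not the chapter's real modules. *)
type token =
    TFunction | TIdent | TLparen | TRparen
  | TLbrace | TRbrace | TSemicolon | TString

type attrib =
    ANone            (* keywords and punctuation carry no attribute *)
  | AName of string  (* e.g. the name of an identifier *)
  | AText of string  (* e.g. the contents of a string literal *)

(* The first few token/attribute pairs produced for the program above: *)
let example : (token * attrib) list =
  [ (TFunction, ANone);
    (TIdent, AName "main");
    (TLparen, ANone);
    (TRparen, ANone) ]
```

Note how ⟨function⟩ needs no attribute, while ⟨identifier [main]⟩ splits into the token TIdent and the attribute "main".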

1.1 Definition (Lexical analysis)


Let Σ be the alphabet of a programming language, T a finite set of tokens, and A
an arbitrary set of attributes. A scanner is a function

scan : Σ∗ → (T × A)∗

such that there is a function

unscan : (T × A)∗ → Σ∗

with the following properties:

1. scan ◦ unscan = id(T×A)∗ and

2. there is a function untoken : (T × A) → Σ∗ so that

unscan(t1 t2 . . .) = untoken(t1) untoken(t2) . . .

(i.e., unscan is a homomorphism).


□

The functions scan and unscan have the following derived property:

scan ◦ unscan ◦ scan = scan

Thus, the scan function splits up the input into lexemes, maps them to token/attribute
pairs, and removes white space along the way. The functions scan
and unscan are in general not inverses due to white space removal: scan removes
all white space irreversibly and unscan has to introduce white space so that the
original lexemes are properly separated.

1.2 Construction of scanners


There is a spectrum of possibilities for constructing scanners. First, a scanner can
be implemented manually. A manual implementation can be tuned for efficiency
and it does not place any restrictions on the language of lexemes. While this
approach might be sensible for a simple language with few classes of lexemes, it is not
appropriate for rich modern languages. Furthermore, a hand-written scanner is hard
to maintain because it must be implemented from a separate specification. Last,
the efficient implementation of some components of a scanner (e.g., input buffering)
is a non-trivial effort. A hand-written scanner must develop these components from
scratch and cannot reuse previous efforts.

A second possibility is to rely on a prefabricated library of scanner components.
In this approach, the implementation of the scanner can be close to the specification.
Thus, it is amenable to fast development and easy maintenance. The downside is
that the library cannot easily take advantage of global optimizations, so that such
an implementation is significantly slower. Also, there are usually restrictions on the
language of lexemes.

The third possibility is using a scanner generator like lex or ocamllex. A scanner
generator generates the implementation of the scanner from a high-level specification.
Thus, it combines the advantages of the library approach with efficiency.
As with the library approach, the language of lexemes is usually restricted.

For instructional purposes, we concentrate on the second possibility. While
sacrificing some efficiency, the library approach enables us to discuss the entire
implementation of a scanner, without having to gloss over details of code generation
as would be the case with a scanner generator.
Both the library approach and the generator approach restrict the language of
lexemes to a regular language R, for a number of reasons:

• All set-theoretic operations (union, intersection, difference) on regular
languages yield regular languages. Hence, a specification of a regular language
can rely on them.

• The word problem w ∈ R is decidable in linear time in the length of the word
w.

• For each regular language there is a minimal recognizer. Hence, there exists
an optimal scanner for each kind of lexeme.

1.3 Descriptions of regular languages


Theoretical computer science tells us that there are at least three equivalent means
of describing regular languages. We briefly introduce each of them and conclude
with a discussion of which description is best suited for specifying and implementing
a scanner.

1.3.1 Regular grammars


Each regular language may be described by a regular grammar. A regular grammar
is a grammar G = (N, Σ, P, S) where each production in P has one of the following
forms:

A → x B,    A → x,    A → ε

where A, B ∈ N, x ∈ Σ, and ε is the empty word. A word w belongs to the language
defined by G iff there is a derivation S ⇒∗ w.

1.3.2 Finite automata


Another description of a regular language R is a finite automaton that recognizes
R. A finite automaton M = (Q, Σ, δ, q0, F) consists of a finite set of states, Q, a
finite input alphabet, Σ, a transition operator, δ, an initial state, q0 ∈ Q, and a
set of final states, F ⊆ Q. Finite automata come in different flavors of equal power,
distinguished by the definition of δ. The basic idea, however, is the same: at any
time, the automaton is in a state q ∈ Q and δ yields a new state from the current
state and an input symbol. A word belongs to the language L(M) recognized by
M iff the automaton is in a final state after consuming all input symbols.

Deterministic finite automata (DFA)


A finite automaton is deterministic if δ : Q × Σ → Q is a function. To define L(M),
we extend δ to a function δ̂ : Q × Σ∗ → Q as follows:

δ̂(q, ε) = q
δ̂(q, a w) = δ̂(δ(q, a), w)

With this definition, w ∈ L(M) iff δ̂(q0, w) ∈ F.
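Operationally, δ̂ is just a left fold of δ over the input word. A minimal OCaml sketch follows; the two-state automaton, which accepts exactly the words containing an even number of a's, is an invented example.

```ocaml
(* States are ints; delta is the one-step transition function of an
   invented automaton over strings: state 0 = even number of 'a's seen,
   state 1 = odd number of 'a's seen. *)
let delta (q : int) (c : char) : int =
  match q, c with
    0, 'a' -> 1
  | 1, 'a' -> 0
  | q, _   -> q          (* any other symbol leaves the state unchanged *)

(* delta_hat extends delta to words, mirroring the two equations above:
   delta_hat q "" = q  and  delta_hat q (a w) = delta_hat (delta q a) w.
   (String.fold_left requires OCaml >= 4.13.) *)
let delta_hat (q : int) (w : string) : int =
  String.fold_left delta q w

(* w is accepted iff delta_hat q0 w is a final state; here q0 = 0, F = {0}. *)
let accepts (w : string) : bool =
  delta_hat 0 w = 0
```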

Nondeterministic finite automata (NFA)


A finite automaton is non-deterministic if δ ⊆ Q × Σ × Q is an arbitrary relation.
Again, to define L(M), we extend δ to a relation δ̂ ⊆ Q × Σ∗ × Q, which is the least
relation satisfying the following two equations:

(q, ε, q) ∈ δ̂
(q, a w, q′) ∈ δ̂  iff  (∃q′′) (q, a, q′′) ∈ δ and (q′′, w, q′) ∈ δ̂

With this definition, w ∈ L(M) iff (∃qf ∈ F) (q0, w, qf) ∈ δ̂.

Nondeterministic finite automata with autonomous transitions (NFA-ε)


An NFA may also allow autonomous (or instantaneous or ε) transitions which
change the state without consuming any input. In this case, δ ⊆ Q × (Σ ∪ {ε}) × Q
is an arbitrary relation. Again, we extend δ to the least relation δ̂ ⊆ Q × Σ∗ × Q
which satisfies the following equations:

(q, ε, q) ∈ δ̂
(q, w, q′) ∈ δ̂  if  (∃q′′) (q, ε, q′′) ∈ δ and (q′′, w, q′) ∈ δ̂
(q, a w, q′) ∈ δ̂  if  (∃q′′) (q, a, q′′) ∈ δ and (q′′, w, q′) ∈ δ̂

With this definition, w ∈ L(M) iff (∃qf ∈ F) (q0, w, qf) ∈ δ̂.

1.3.3 Regular expressions


A regular expression is a highly declarative way of specifying a regular language.
The set of regular expressions over an alphabet Σ is the smallest set RE(Σ) with:

• ∅ ∈ RE(Σ)
• ε ∈ RE(Σ)
• if a ∈ Σ then a ∈ RE(Σ)
• if r1, r2 ∈ RE(Σ) then r1 r2 ∈ RE(Σ)
• if r1, r2 ∈ RE(Σ) then r1 | r2 ∈ RE(Σ)
• if r ∈ RE(Σ) then r∗ ∈ RE(Σ).

A regular expression defines a language as prescribed by the function L : RE(Σ) →
P(Σ∗):

L(∅) = ∅
L(ε) = {ε}
L(a) = {a}
L(r1 r2) = L(r1) · L(r2) := {w1 w2 | w1 ∈ L(r1), w2 ∈ L(r2)}
L(r1 | r2) = L(r1) ∪ L(r2)
L(r∗) = L(r)∗ := {w1 w2 . . . wn | n ∈ N, wi ∈ L(r)}
      = {ε} ∪ L(r) ∪ L(r) · L(r) ∪ L(r) · L(r) · L(r) ∪ . . .

A word w belongs to the language described by r iff w ∈ L(r).
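As a small worked instance of these equations, consider the regular expression (a | b) c∗ over Σ = {a, b, c}:

L((a | b) c∗) = L(a | b) · L(c∗)
             = (L(a) ∪ L(b)) · L(c)∗
             = {a, b} · {ε, c, cc, . . .}
             = {a, b, ac, bc, acc, bcc, . . .}

so bcc belongs to the language while cb does not.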

1.3.4 Discussion


We have looked at three different descriptions for regular languages. Now, we
assess each method for its usability regarding the specification and implementation
of scanners.

A grammar is a low-level means of describing a regular language. A grammar
emphasizes the generative aspect of a language definition. While it is easy to
generate words in the language from the grammar, it is non-trivial to check if a given
word belongs to the language (the word problem). In addition, a grammar is not
a concise description of a language. Even simple languages can take many rules
to describe. Hence, we conclude that a grammar is suited neither for a high-level
specification of a scanner nor for its implementation.

A DFA is also a low-level description of a regular language. Its definition
immediately gives rise to a recognizer which is simple to implement efficiently. However,
it is not a concise description because even simple languages can require a large set
of states in their automaton. In conclusion, while a DFA makes a good
implementation, it is unsuitable for a high-level specification of a scanner.

A regular expression is a declarative description of a regular language. It can
give rise to highly concise descriptions (in particular, if further set-theoretic
operations like intersection and difference are included). However, it requires a clever
implementation or a translation to a DFA to efficiently recognize words from the
language.

Hence, language definitions use regular expressions to define the lexemes of a
programming language.

1 Example
The Caml reference manual contains a section "Lexical Conventions". Figure 1.1
shows its description of the lexemes for identifiers and integer constants.

Lexical analysis exists chiefly for pragmatic reasons: the more involved syntactic
analysis which follows can be much simpler because of it. Moreover, well-known
algorithms exist for recognizing regular languages. Theoretical computer science
tells us that a deterministic finite automaton (DFA) can serve as a recognizer for
any regular language. A DFA is a simple machine, and thus reasonably easy to
implement. Unfortunately, the construction of the state diagram for a DFA is an
involved and tedious process. Therefore, it makes things easier to try to circumvent
the explicit construction of the automaton. Fortunately, the automaton follows
automagically from our simpler approach to recognizing regular languages.

⟨ident⟩ ::= ⟨letter⟩ ⟨ident-rest⟩*
⟨letter⟩ ::= A | B | C | ... | Z | a | b | c | ... | z
⟨ident-rest⟩ ::= ⟨letter⟩ | ⟨digit⟩ | _ | '
⟨digit⟩ ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
⟨integer-literal⟩ ::= ⟨sign⟩ ⟨digit⟩+
                    | ⟨sign⟩ ⟨hexprefix⟩ ⟨hexdigit⟩+
                    | ⟨sign⟩ ⟨octprefix⟩ ⟨octdigit⟩+
                    | ⟨sign⟩ ⟨binprefix⟩ ⟨bindigit⟩+
⟨sign⟩ ::= ⟨empty⟩ | -
⟨empty⟩ ::=
⟨hexprefix⟩ ::= 0x | 0X
⟨hexdigit⟩ ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
             | A | B | ... | F | a | b | ... | f
⟨octprefix⟩ ::= 0o | 0O
⟨octdigit⟩ ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
⟨binprefix⟩ ::= 0b | 0B
⟨bindigit⟩ ::= 0 | 1

Figure 1.1: Some lexical conventions of Caml

1.4 Mapping regular expressions to DFAs


The traditional mapping from regular expressions to DFAs goes through a number
of steps. First, a regular expression is mapped to an NFA-ε. Next, the ε-transitions
are removed to obtain an NFA. Finally, the power set construction is applied
to construct an equivalent DFA from the NFA. Moreover, in a typical scanner
generator, the resulting DFA is minimized to save space in the implementation.

We follow a slightly different approach, which has been used with variations in
implementing regular expression search in text editors. The basic idea is to use a set
of regular expressions as the set of states for an automaton. Since we can associate
a language with each state q of a finite automaton (the set of words that transform
q into a final state), we simply want to label each state with a regular expression
for this language. Clearly, the initial state q0 corresponds to the regular expression
that we want to recognize. But what is the regular expression for the state δ(q0, a)?
To address this problem, we define the derivative of a regular expression, i.e., a
function D : RE(Σ) × Σ → RE(Σ) such that

w ∈ L(D(r, a))  iff  a w ∈ L(r)      (1.1)

Thus, if a w ∈ L(r) then D(r, a) recognizes the rest, w, of the input word after
reading a. This corresponds to the language recognized by the state δ(q0, a).

1.2 Definition
The derivative of a regular expression, D : RE(Σ) × Σ → RE(Σ), is defined by
induction on the definition of RE(Σ). It relies on an auxiliary function E : RE(Σ) →
RE(Σ) which is specified by

L(E(r)) = L(r) ∩ {ε}      (1.2)

D(∅, a) = ∅
D(ε, a) = ∅
D(a′, a) = ε   if a = a′
D(a′, a) = ∅   otherwise
D(r1 r2, a) = D(r1, a) r2 | E(r1) D(r2, a)
D(r1 | r2, a) = D(r1, a) | D(r2, a)
D(r∗, a) = D(r, a) r∗

E(∅) = ∅
E(ε) = ε
E(a) = ∅
E(r1 r2) = E(r1) E(r2)
E(r1 | r2) = E(r1) | E(r2)
E(r∗) = ε

□

With these definitions, we obtain the following representation theorem for a
regular language.

1.3 Theorem

L(r) = L(E(r)) ∪ ⋃_{a∈Σ} a · L(D(r, a))

□

Starting from a regular expression r0 that defines the language we are interested
in, it is now easy to define an automaton that recognizes this language.

1.4 Theorem
Let r0 ∈ RE(Σ). Define the deterministic automaton M = (Q, Σ, δ, q0, F) as follows:

• Q is the smallest subset of RE(Σ) such that
  1. r0 ∈ Q;
  2. if r ∈ Q and a ∈ Σ then D(r, a) ∈ Q.

• δ(q, a) = D(q, a)

• q0 = r0

• F = {r ∈ Q | ε ∈ L(r)}.

Then L(M) = L(r0).


The problem with this construction is that the set Q may be infinite. To address
this problem, we do not use D directly, but insert an additional pass of
simplification. Simplification relies on standard equivalences of regular expressions:

r ∅ = ∅ r = ∅
r ε = ε r = r
r | ∅ = ∅ | r = r
∅∗ = ε∗ = ε
r∗∗ = r∗

Using these simplification rules, it is guaranteed that the set of states Q is finite
[Brz64] so that the construction actually yields a DFA.

2 Example
For an example, recall part of the regular expression for integer literals from Fig. 1.1:

⟨integer-literal⟩ = (ε | −) ⟨digit⟩ ⟨digit⟩∗

Now

D(⟨integer-literal⟩, −)
= D((ε | −) ⟨digit⟩ ⟨digit⟩∗, −)
= D(ε | −, −) ⟨digit⟩ ⟨digit⟩∗ | E(ε | −) D(⟨digit⟩ ⟨digit⟩∗, −)
= (D(ε, −) | D(−, −)) ⟨digit⟩ ⟨digit⟩∗ | (E(ε) | E(−)) D(⟨digit⟩ ⟨digit⟩∗, −)
= (∅ | ε) ⟨digit⟩ ⟨digit⟩∗ | (ε | ∅) D(⟨digit⟩ ⟨digit⟩∗, −)
    (apply simplification)
= ⟨digit⟩ ⟨digit⟩∗ | D(⟨digit⟩ ⟨digit⟩∗, −)
    (the last use of D simplifies to ∅ because − is not a ⟨digit⟩)
= ⟨digit⟩ ⟨digit⟩∗

Hence, after reading a − the automaton still expects a non-empty word of ⟨digit⟩s.
This is not a final state because ε ∉ L(⟨digit⟩ ⟨digit⟩∗).

In the same way, we can check that

D(⟨integer-literal⟩, +) = ∅
D(⟨integer-literal⟩, ⟨digit⟩) = ⟨digit⟩∗

In the first case, the automaton has reached the sink state ∅, which is not a final state
and which cannot be left by any transition. In the second case, the automaton has
consumed a digit; it has reached a final state but is also ready to read further digits.

1.5 A module for regular expressions


This section deals with the implementation of regular expressions and the related
algorithms in Caml. The following code belongs to a structure called Regexp.
A Caml declaration defines a data type for regular expressions, parameterized
over the underlying alphabet of the language.

type 'a regexp =
    Null
  | Epsilon
  | Symbol of 'a
  | Concat of 'a regexp * 'a regexp
  | Alternate of 'a regexp * 'a regexp
  | Repeat of 'a regexp
The data type 'a regexp can express the regular expression Concat(x, Null),
which is equivalent to Null. Thus, the term language defined by 'a regexp contains
ambiguities. Specifically, it is possible to make do with regular expressions which
contain no internal Null constructors; it is always possible to transform a regular
expression into one which is either Null or does not contain it at all. Therefore, it is a good
idea to abstract over the constructors and perform some simplification on the way.
Besides the elimination of internal Null constructors, the abstractions also get rid
of some Epsilon constructors:
let epsilon = Epsilon
let symbol x = Symbol(x)
let concat r1 r2 =
  if r1 = Null || r2 = Null
  then Null
  else if r1 = Epsilon
  then r2
  else if r2 = Epsilon
  then r1
  else Concat(r1, r2)
let alternate r1 r2 =
  if r1 = Null
  then r2
  else if r2 = Null
  then r1
  else Alternate(r1, r2)
let repeat r =
  if r = Null || r = Epsilon
  then Epsilon
  else Repeat(r)
Some simple functions are useful in creating composite regular expressions:

let repeat_one r = concat r (repeat r)

let concat_list l = List.fold_left concat Epsilon l

let alternate_list l = List.fold_left alternate Null l

The regular expression repeat_one r is usually written r+ and it recognizes the
language L(r)+ = L(r) ∪ L(rr) ∪ L(rrr) ∪ . . . , i.e., a finite concatenation of words
from L(r) where at least one word is present. The functions concat_list and
alternate_list iterate the concatenation and alternation operators:

concat_list [r1; . . . ; rn] = r1 . . . rn
alternate_list [r1; . . . ; rn] = r1 | · · · | rn
Now that there is functionality for creating regular expressions, the next job is
to check, for a given sequence symbols of alphabet symbols, if it belongs to the
language defined by a regular expression regexp. The function matches will do
exactly that. Its first few lines are straightforward. (It needs to prefix names from
module Regexp because it resides in a different module.)

let rec matches regexp symbols =
  match symbols with
    [] -> Regexp.accepts_empty regexp
  | symbol::rest ->
This code calls a function accepts_empty : 'a regexp -> bool that checks if the
empty sequence belongs to the language of the regular expression. (The
implementation of accepts_empty is a simple exercise, cf. function E in Sec. 1.4.)
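One possible solution to this exercise follows; it mirrors the equations for E from Sec. 1.4 case by case. The type declaration is repeated so that the snippet stands alone.

```ocaml
type 'a regexp =
    Null
  | Epsilon
  | Symbol of 'a
  | Concat of 'a regexp * 'a regexp
  | Alternate of 'a regexp * 'a regexp
  | Repeat of 'a regexp

(* accepts_empty r = true iff the empty sequence belongs to L(r);
   each case mirrors the corresponding equation for E. *)
let rec accepts_empty = function
    Null -> false
  | Epsilon -> true
  | Symbol _ -> false
  | Concat (r1, r2) -> accepts_empty r1 && accepts_empty r2
  | Alternate (r1, r2) -> accepts_empty r1 || accepts_empty r2
  | Repeat _ -> true
```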
Now that the empty sequence is covered, non-empty sequences are next. This
becomes easy in the presence of an auxiliary function after_symbol, which
implements the derivative function D from Sec. 1.4:

after_symbol : 'a -> 'a regexp -> 'a regexp

This function has the following behavior: Let r be a regular expression describing
the language L(r). Let x ξ be a sequence of symbols. Then:

x ξ ∈ L(r) ⟺ ξ ∈ L(after_symbol x r)

Thus, after_symbol subtracts x from r.


With after_symbol, matches is easy to complete:

    let next_regexp = Regexp.after_symbol symbol regexp in
    if Regexp.is_null next_regexp
    then false
    else matches next_regexp rest

The function is_null from Regexp checks for the sink state ∅. Its definition is
straightforward:

let is_null r = (r = Null)


If a regular expression was constructed exclusively by epsilon, symbol, concat,
alternate, and repeat, is_null reliably tests if a regular expression denotes the
empty language.

The Regexp.after_symbol function is recursive over the construction of regular
expressions:

let rec after_symbol symbol regexp =
  match regexp with
    Null -> Null
  | Epsilon -> Null
  | Symbol(symbol') ->
      if symbol = symbol'
      then Epsilon
      else Null
  | Concat(r1, r2) ->
      let after_1 = concat (after_symbol symbol r1) r2 in
      let after_2 = if accepts_empty r1
                    then after_symbol symbol r2
                    else Null
      in
      alternate after_1 after_2
  | Alternate(r1, r2) ->
      alternate (after_symbol symbol r1)
                (after_symbol symbol r2)
  | Repeat(r1) ->
      concat (after_symbol symbol r1)
             (Repeat(r1))
(The proof of the correctness of after_symbol is a simple exercise.)

The matches function is tail recursive. Consequently, it implements a
deterministic automaton with after_symbol as its state transition function.
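For reference, the fragments of this section can be assembled into one self-contained sketch. The module boundary is dropped, so the Regexp prefix disappears, and accepts_empty is filled in as the exercise suggests.

```ocaml
type 'a regexp =
    Null
  | Epsilon
  | Symbol of 'a
  | Concat of 'a regexp * 'a regexp
  | Alternate of 'a regexp * 'a regexp
  | Repeat of 'a regexp

(* Smart constructors performing the simplifications described above. *)
let concat r1 r2 =
  if r1 = Null || r2 = Null then Null
  else if r1 = Epsilon then r2
  else if r2 = Epsilon then r1
  else Concat (r1, r2)

let alternate r1 r2 =
  if r1 = Null then r2
  else if r2 = Null then r1
  else Alternate (r1, r2)

let is_null r = (r = Null)

let rec accepts_empty = function
    Null | Symbol _ -> false
  | Epsilon | Repeat _ -> true
  | Concat (r1, r2) -> accepts_empty r1 && accepts_empty r2
  | Alternate (r1, r2) -> accepts_empty r1 || accepts_empty r2

(* The derivative D of Sec. 1.4. *)
let rec after_symbol symbol = function
    Null | Epsilon -> Null
  | Symbol symbol' -> if symbol = symbol' then Epsilon else Null
  | Concat (r1, r2) ->
      alternate (concat (after_symbol symbol r1) r2)
                (if accepts_empty r1 then after_symbol symbol r2 else Null)
  | Alternate (r1, r2) ->
      alternate (after_symbol symbol r1) (after_symbol symbol r2)
  | Repeat r1 -> concat (after_symbol symbol r1) (Repeat r1)

let rec matches regexp symbols =
  match symbols with
    [] -> accepts_empty regexp
  | symbol :: rest ->
      let next_regexp = after_symbol symbol regexp in
      if is_null next_regexp then false
      else matches next_regexp rest

(* The regular expression a b*, as a test subject. *)
let ab_star = Concat (Symbol 'a', Repeat (Symbol 'b'))
```

With these definitions, matches ab_star ['a'; 'b'; 'b'] evaluates to true, while matches ab_star ['b'] fails immediately in the sink state.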
The Regexp structure is now complete. It has the following signature:

type 'a regexp

val epsilon : 'a regexp
val symbol : 'a -> 'a regexp
val concat : 'a regexp -> 'a regexp -> 'a regexp
val alternate : 'a regexp -> 'a regexp -> 'a regexp
val repeat : 'a regexp -> 'a regexp

val repeat_one : 'a regexp -> 'a regexp

val concat_list : 'a regexp list -> 'a regexp
val alternate_list : 'a regexp list -> 'a regexp

val is_null : 'a regexp -> bool
val accepts_empty : 'a regexp -> bool

val after_symbol : 'a -> 'a regexp -> 'a regexp

1.6 A real scanner


The matches function of the previous section is not directly usable for lexical
analysis: a scanner must recognize a number of different lexeme languages, it must
consider a sequence of (potentially different) lexemes, and the scanner must turn
each lexeme into a token/attribute pair. In addition, ambiguities can arise if the
lexeme languages overlap. Hence, the description of a scanner comprises not
just a single regular expression, but rather a whole bunch of them, together with
instructions on how to turn the lexemes into token/attribute pairs. Further, to
resolve the ambiguities a scanner is not quite a DFA, but rather needs additional
structure.

1.6.1 Scanner descriptions


Instructions on how to turn lexemes into token/attribute pairs can be expressed as
follows:

type ('a, 'token, 'attrib) lex_action =
    'a list * 'a list -> 'token * 'attrib * 'a list

In this declaration, 'a is still the type of the alphabet of the language, 'token is
the type of the tokens, and 'attrib is the type of the attributes. A lex_action
assigns a function to a regular expression. Its parameter is a pair (lexeme, rest).
The lexeme parameter denotes the lexeme which matches the regular expression,
and rest is the rest of the input. The rest parameter is present because some
lex_action functions (those that handle comments, for instance) must skip an
initial part of rest. The function returns the token/attribute pair and the part of
the input with which scanning can continue.

A rule in a scanner description simply pairs up a regular expression with a
lex_action:

type ('a, 'token, 'attrib) lex_rule = 'a Regexp.regexp
                                    * ('a, 'token, 'attrib) lex_action

This completes yet another structure, Lexspec.

1.6.2 Scanner states


The job of a scanner is to successively consume symbols from the input and, on
recognizing a completed lexeme, to call the corresponding lex_action. To this end,
the scanner must keep a state around which tracks which regular expressions may
still match the part of the input consumed so far:

type ('a, 'token, 'attrib) lex_state =
    (('a, 'token, 'attrib) Lexspec.lex_rule) list

When the scanner consumes a symbol, it applies the Regexp.after_symbol function
(just like matches) to all regular expressions of a lex_state and filters out
the sink states:¹

¹See appendix A.1.3 for the definition of filter.

let next_state state symbol =
  let after_state =
    List.map (function (regexp, action) ->
                (Regexp.after_symbol symbol regexp, action))
             state
  in
  filter (function (regexp, _) ->
            not (Regexp.is_null regexp))
         after_state

A scanner description (a list of lex_rules) is easy to turn into an initial state for
the scanner automaton:

let initial_state rules = rules


To determine which regular expressions have matched the consumed lexeme
completely, the scanner uses the matched_rules function:

let matched_rules state =
  let final (regexp, _) = Regexp.accepts_empty regexp in
  filter final state

It is possible for the scanner to end up in a state where no further consumption of
input symbols is possible. The is_stuck predicate diagnoses this situation:

let is_stuck state = state = []


This completes the Lexstate structure which has the following signature:

type ('a, 'token, 'attrib) lex_state

val next_state : ('a, 'token, 'attrib) lex_state
              -> 'a
              -> ('a, 'token, 'attrib) lex_state
val initial_state : (('a, 'token, 'attrib) Lexspec.lex_rule) list
                 -> ('a, 'token, 'attrib) lex_state
val matched_rules : ('a, 'token, 'attrib) lex_state
                 -> (('a, 'token, 'attrib) Lexspec.lex_rule) list
val is_stuck : ('a, 'token, 'attrib) lex_state -> bool

1.6.3 Resolution of ambiguities


Descriptions of lexical analysis for realistic programming languages almost always
contain ambiguities because it is more convenient to specify overlapping lexeme
languages. The fragments

if n = 0 then 0 else n * fib(n-1)

and

ifoundsalvationinapubliclavatory

are syntactically correct Caml expressions starting with if. Now, the lexical syntax
of Caml would allow partitioning ifoundsalvationinapubliclavatory into
lexemes in several different ways: either into if and oundsalvationinapubliclavatory,
or just as ifoundsalvationinapubliclavatory. Obviously (or is it?), the latter
alternative is the intended one.

The standard way of resolving this conflict is the rule of the longest match: The
first lexeme of a character sequence is its longest prefix which is a lexeme. To
find the longest prefix, even if the scanner recognizes a lexeme, it must continue
examining characters of the input until the current prefix is no longer a prefix of
a lexeme. Then the scanner returns the last lexeme recognized. This process may
involve returning characters to the input.

If there are still two different ways of tokenizing a single lexeme, the textually
preceding rule from the specification is given preference.

1.6.4 Implementation of lexical analysis


All the building blocks for implementing lexical analysis are now in place. This
section describes a Caml structure called Lex which contains the central
functionality for creating scanners. The main function is called scan_one; it runs the state
automaton in Lexstate to extract a single lexeme at the beginning of the input.
To implement the "longest match" rule, scan_one remembers the last state in
which it recognized a lexeme. Once scan_one has recognized a lexeme, it runs the
corresponding action from the scanner description to yield a token/attribute pair
and the rest of the input still to be processed.

exception Scan_error

let scan_one spec l =
  let rec loop state rev_lexeme maybe_last_match rest =
    if (Lexstate.is_stuck state) || (rest = [])
    then
      match maybe_last_match with
        None -> raise Scan_error
      | Some last_match -> last_match
    else
      let symbol::rest = rest in
      let new_state = Lexstate.next_state state symbol in
      let new_matched = Lexstate.matched_rules new_state in
      let rev_lexeme = symbol::rev_lexeme in
      let maybe_last_match =
        match new_matched with
          [] -> maybe_last_match
        | (_, action)::_ -> Some (action, rev_lexeme, rest)
      in
      loop new_state rev_lexeme maybe_last_match rest
  in
  let (action, rev_lexeme, rest) =
    loop (Lexstate.initial_state spec) [] None l
  in
  action (List.rev rev_lexeme, rest)
A scanner for a given programming language may consist of several parts, each with
its own scanner description. The components may be composed via let rec:

let rec scan_1 input = Lex.scan_one ⟨desc1⟩ input
and scan_2 input = Lex.scan_one ⟨desc2⟩ input
and . . .

Given a function which recognizes a single lexeme, it is easy to construct the
complete scanner which turns the input, a list of symbols, into a list of token/attribute
pairs:

let make_scanner scan_one input =
  let rec scan rev_result rest =
    if rest = []
    then List.rev rev_result
    else
      let (token, attrib, rest) = scan_one rest in
      scan ((token, attrib)::rev_result) rest
  in
  scan [] input
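To see the driver in isolation, make_scanner can be exercised with a hand-written scan_one for a toy language of digit runs and single operator characters. The token type and scan_one below are invented for illustration; they are not the Lex machinery of the text.

```ocaml
type token = TInt | TOp
(* The attribute is simply the lexeme, as a string. *)

let is_digit c = c >= '0' && c <= '9'

(* A hand-rolled scan_one: the longest run of digits, or one operator char. *)
let scan_one rest =
  match rest with
    [] -> failwith "empty input"
  | c :: _ when is_digit c ->
      let rec split acc = function
          d :: rest' when is_digit d -> split (d :: acc) rest'
        | rest' -> (List.rev acc, rest')
      in
      let (lexeme, rest') = split [] rest in
      (TInt, String.concat "" (List.map (String.make 1) lexeme), rest')
  | c :: rest' -> (TOp, String.make 1 c, rest')

(* The driver from the text, unchanged. *)
let make_scanner scan_one input =
  let rec scan rev_result rest =
    if rest = []
    then List.rev rev_result
    else
      let (token, attrib, rest) = scan_one rest in
      scan ((token, attrib) :: rev_result) rest
  in
  scan [] input

(* Scanning "12+3" as a char list: *)
let tokens = make_scanner scan_one ['1'; '2'; '+'; '3']
```

Here tokens is [(TInt, "12"); (TOp, "+"); (TInt, "3")]: the driver repeatedly extracts one lexeme and accumulates the token/attribute pairs.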

1.6.5 An example specification


As an example, we show some excerpts from the specification of a Caml scanner.
In the scanner specification, we implicitly open the Regexp module to avoid clutter.

First, there are a number of definitions for regular expressions. Identifiers are
specified as follows:

let char_range_regexp c1 c2 =
  let int_range = Listplus.from_to (Char.code c1) (Char.code c2) in
  let re i = symbol (Char.chr i) in
  alternate_list (List.map re int_range)

let digit = char_range_regexp '0' '9'

let letter = alternate (char_range_regexp 'A' 'Z') (char_range_regexp 'a' 'z')
let ident_rest = alternate_list [ letter; digit; symbol '_'; symbol '\'' ]
let ident = concat letter (repeat ident_rest)

The function char_range_regexp takes two characters, c1 and c2, and constructs
a regular expression denoting the set of characters between c1 and c2, inclusive.

let integer_literal =
  concat (alternate epsilon (symbol '-'))
         (repeat_one digit)

An ⟨integer-literal⟩ is an optional sign followed by at least one ⟨digit⟩.

let whitespace =
  repeat_one
    (alternate_list
       (List.map symbol [ ' '; '\t'; '\n'; '\r'; '\012' ]))

A whitespace lexeme consists of a non-empty sequence of blank, tabulator,
newline, carriage return, and form feed characters.
Next, we define a datatype for the tokens of the Caml language:

type caml_token =
    TIdent
  | TInt
  | ...

The scanner specification itself is a list of pairs of a regular expression and an
action function, as explained above. Typically, we define a scanner as a recursive
function, so that it can call itself recursively to consume further input. This happens
in the action for whitespace.

let rec token input =
  Lex.scan_one [
    (whitespace,
     function (_, rest) ->
       token rest);
    (ident,
     function (lexeme, rest) ->
       (TIdent,
        Camlsyn.Ident(Ident.from_string (list_to_string lexeme)),
        rest));
    (integer_literal,
     function (lexeme, rest) ->
       (TInt,
        Camlsyn.Int(int_of_string (list_to_string lexeme)),
        rest));
    ...]
    input
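The helper list_to_string used in these actions is not shown in the text; a plausible definition simply converts the lexeme, a char list, back into a string:

```ocaml
(* Possible implementation of the list_to_string helper assumed by the
   scanner actions: turn each char into a one-character string and
   concatenate the results. *)
let list_to_string (lexeme : char list) : string =
  String.concat "" (List.map (String.make 1) lexeme)
```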

1.7 Using Ocamllex


The OCaml language distribution contains a scanner generator suitable for use with
OCaml. At this point, no new concepts are required to understand it; it is only
necessary to learn its syntax.

⟨lexer⟩ ::= ⟨code-snippet⟩ ⟨body⟩ ⟨code-snippet⟩
⟨code-snippet⟩ ::= { ⟨ocaml-code⟩ }
⟨body⟩ ::= ⟨definition⟩* ⟨rules⟩
⟨definition⟩ ::= let ⟨ident⟩ = ⟨regexp⟩
⟨rules⟩ ::= rule ⟨one-rule⟩ (and ⟨one-rule⟩)*
⟨one-rule⟩ ::= ⟨ident⟩ = parse ⟨match⟩ (| ⟨match⟩)*
⟨match⟩ ::= ⟨regexp⟩ { ⟨ocaml-expression⟩ }

Ocamllex translates a scanner specification (described by the non-terminal ⟨lexer⟩)
into a corresponding Caml module. The two ⟨code-snippet⟩s are copied literally
to the beginning and end of this module, respectively. They can contain arbitrary
code, which is not checked during scanner generation.

The ⟨body⟩ of the specification consists of a list of ⟨definition⟩s for regular
languages and some ⟨rules⟩. Each definition introduces a name for a regular expression.
Definitions cannot be recursive.

Each rule in ⟨rules⟩ specifies a separate scanner function. Scanner functions
may call each other recursively. The body of a scanner function is a list of pairs
of a regular expression ⟨regexp⟩ and an OCaml expression. Each regular expression
specifies a lexeme language and the associated expression defines the action taken
upon finding the lexeme.

Regular expressions are entered using the syntax defined in Sec. 1.3, with a few
exceptions and extensions.

• OCaml character and string constants serve as constant regular expressions.

• The regular expression _ stands for any single character.

• Character classes are a built-in notion. They are defined by [ ⟨char-set⟩ ],
where

  ⟨char-set⟩ ::= ⟨char-literal⟩
              | ⟨char-literal⟩ - ⟨char-literal⟩
              | ⟨char-set⟩ ⟨char-set⟩

  and ⟨char-literal⟩ stands for a character constant in OCaml syntax, the second
  alternative denotes a character range (cf. function char_range_regexp),
  and the third alternative denotes the union of two ⟨char-set⟩s.

• Character classes may be complemented by writing [^ ⟨char-set⟩ ].

• An identifier may refer to a preceding definition let ⟨ident⟩ = ⟨regexp⟩.
Each scanning rule ⟨entry-point⟩ = parse . . . is translated to a function

let ⟨entry-point⟩ lexbuf = ...

The parameter lexbuf has type Lexing.lexbuf (from the standard module Lexing) and is the scanner's means to access the current input. The action code ⟨ocaml-expression⟩ may refer to lexbuf to construct the token or to perform other tasks. The following functions from module Lexing are handy in this respect:
• Lexing.lexeme : Lexing.lexbuf -> string
  returns the currently matching lexeme

• Lexing.lexeme_char : Lexing.lexbuf -> int -> char
  returns the indexed character from the lexeme

• Lexing.lexeme_start : Lexing.lexbuf -> int
  returns the absolute position of the start of the current lexeme in the input

• Lexing.lexeme_end : Lexing.lexbuf -> int
  returns the absolute position of the end of the current lexeme in the input

• ⟨entry-point⟩ lexbuf
  may be called to continue scanning at ⟨entry-point⟩ (which may be different from the current scanner) immediately after the current lexeme.

As an example, we rewrite the fragment of the Caml scanner from Sec. 1.6.5 in Ocamllex syntax.

type caml_token =
  TIdent
| TInt
| ...

let digit = ['0' - '9']
let letter = ['A' - 'Z' 'a' - 'z']
let ident_rest = letter | digit | '_' | '\''
let ident = letter ident_rest*
let integer_literal = '-'? digit digit*
let whitespace = [' ' '\t' '\n' '\r' '\012']+

rule token = parse
  whitespace { token lexbuf }
| ident { Camlsyn.Ident(Ident.from_string (Lexing.lexeme lexbuf)) }
| integer_literal { Camlsyn.Int(int_of_string (Lexing.lexeme lexbuf)) }
| ...
The action for whitespace does not return a value; instead it calls the scanner token recursively. The effect is that whitespace is silently skipped.
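To show how the pieces fit together, here is a minimal self-contained sketch of a complete specification file, not taken from the text: it uses a hypothetical token type tok in place of the Camlsyn constructors, and adds a trailer ⟨code-snippet⟩ containing a driver function. Processing it with ocamllex (e.g. `ocamllex scanner.mll`) would produce an ordinary OCaml module defining token and scan.

```ocaml
(* scanner.mll -- illustrative sketch; tok and scan are hypothetical names *)
{
type tok = Ident of string | Int of int | EOF
}

let digit = ['0'-'9']
let letter = ['A'-'Z' 'a'-'z']
let ident = letter (letter | digit | '_' | '\'')*

rule token = parse
  [' ' '\t' '\n' '\r']+ { token lexbuf }              (* skip whitespace *)
| ident                 { Ident (Lexing.lexeme lexbuf) }
| '-'? digit+           { Int (int_of_string (Lexing.lexeme lexbuf)) }
| eof                   { EOF }

{
(* trailer: scan a whole string into a token list *)
let scan (s : string) : tok list =
  let lexbuf = Lexing.from_string s in
  let rec loop acc =
    match token lexbuf with
    | EOF -> List.rev acc
    | t -> loop (t :: acc)
  in
  loop []
}
```

The built-in pattern eof matches the end of the input; without such a case the generated scanner raises an exception when the input is exhausted.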

1.8 Pragmatic issues


1.8.1 Recognizing keywords
The naive approach to recognizing keywords is to include them as constant regular expressions in the scanner specification. Unfortunately, this can give rise to automata with a huge number of states in the traditional approach, and it also leads to inefficiencies in the library-based approach that we are advocating. Hence, keyword recognition is often handled separately from scanning in the following manner.

1. Build a hash table from the keywords before starting the scanner.

2. Specify the scanner so that it recognizes all keywords as identifiers.

3. On recognizing an identifier lexeme, the scanner first checks the hash table. If the lexeme is present, it is classified as a keyword. Otherwise, the scanner reports an identifier.

The hash table is constructed only once and its lookup should be performed as quickly as possible. Hence, it is appropriate to spend some effort on its construction. Since all entries of the hash table are known a priori, it is possible to search for a perfect hash function that avoids collisions. This way, a lookup in the hash table can be guaranteed to run in constant time.
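The three steps above can be sketched as follows. This is a minimal illustration using OCaml's standard Hashtbl (average-case constant time, not a perfect hash function); the token type and the keyword list are invented for the example, not taken from the text.

```ocaml
(* Hypothetical token type for the example. *)
type token = TFunction | TReturn | TIf | TIdent of string

(* Step 1: build the keyword table once, before scanning starts. *)
let keyword_table : (string, token) Hashtbl.t =
  let tbl = Hashtbl.create 17 in
  List.iter
    (fun (name, tok) -> Hashtbl.add tbl name tok)
    [ ("function", TFunction); ("return", TReturn); ("if", TIf) ];
  tbl

(* Steps 2 and 3: the scanner recognizes every keyword as an identifier
   lexeme and then calls this function to classify it. *)
let classify_ident (lexeme : string) : token =
  match Hashtbl.find_opt keyword_table lexeme with
  | Some kw -> kw          (* lexeme is a keyword *)
  | None -> TIdent lexeme  (* ordinary identifier *)
```

For instance, classify_ident "function" yields TFunction, while classify_ident "main" yields TIdent "main".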

1.8.2 Representing identifiers


Strings are not a good representation for identifiers. In particular, later phases of compilation build so-called symbol tables that map an identifier to some information about it. Since identifier lookups occur very frequently, it is vital that these mappings are implemented efficiently. Each of these lookup operations involves comparison and/or computation of a hash key. Strings perform poorly with both types of operation:

• A string comparison takes time linear in the length of the string.

• Computing a (meaningful) hash key for a string is not straightforward, since several characters must be extracted from the string.

Hence, a scanner first maps the strings arising as identifier lexemes to symbols. The usual implementation of this mapping uses an open hashing algorithm. Each new entry in the table is assigned a unique identifier (a number).
These considerations give rise to the following module type SYMBOL.

type symbol
val symbol : string -> symbol
val name : symbol -> string

A typical implementation for the type symbol is a pair of the string representation and a (unique) number.
