Lex Overview

• Lex is a tool for creating lexical analyzers.
• Lexical analyzers tokenize input streams.
• Tokens are the terminals of a language.
• Regular expressions define tokens.

Usage Paradigm of Lex

To Use Lex and Yacc Together

    Lex source (Lexical Rules)   ->  Lex   ->  lex.yy.c
    Yacc source (Grammar Rules)  ->  Yacc  ->  y.tab.c

    Input  ->  yylex()  ->  yyparse()  ->  Parsed Input
    (yyparse() calls yylex(); yylex() returns a token)

Lex Internals Mechanism

• Converts regular expressions into DFAs.
• DFAs are implemented as table-driven state machines.
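
The table-driven idea under Lex Internals Mechanism can be sketched by hand. This is a minimal illustration, not the tables Lex actually emits: the token codes, state names, and the function dfa_longest_match are hypothetical, and the DFA recognizes only the two patterns [0-9]+ and [0-9]+"."[0-9]+.

```c
#include <stddef.h>

/* Hypothetical token codes for this sketch (not values Lex defines). */
enum { TOK_NONE = 0, TOK_INTEGER = 1, TOK_REAL = 2 };

/* Character classes: the columns of the transition table. */
enum { C_DIGIT, C_DOT, C_OTHER, NUM_CLASSES };

/* States: start, inside an integer, just after "digits.", inside the fraction. */
enum { S_START, S_INT, S_DOT, S_FRAC, NUM_STATES, S_DEAD = -1 };

/* next_state[state][class]; S_DEAD means no transition exists. */
static const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_INT,  S_DEAD, S_DEAD },
    /* S_INT   */ { S_INT,  S_DOT,  S_DEAD },
    /* S_DOT   */ { S_FRAC, S_DEAD, S_DEAD },
    /* S_FRAC  */ { S_FRAC, S_DEAD, S_DEAD },
};

/* Token accepted in each state (TOK_NONE marks a non-accepting state). */
static const int accepts[NUM_STATES] = {
    TOK_NONE, TOK_INTEGER, TOK_NONE, TOK_REAL
};

static int char_class(char c)
{
    if (c >= '0' && c <= '9') return C_DIGIT;
    if (c == '.') return C_DOT;
    return C_OTHER;
}

/* Run the DFA from the start of s, remembering the last accepting state
 * seen so that the longest match wins. Stores the token code in *token
 * and returns the match length (0 if nothing matched). */
size_t dfa_longest_match(const char *s, int *token)
{
    int state = S_START;
    size_t i = 0, best_len = 0;

    *token = TOK_NONE;
    while (s[i] != '\0') {
        state = next_state[state][char_class(s[i])];
        if (state == S_DEAD)
            break;
        i++;
        if (accepts[state] != TOK_NONE) {
            *token = accepts[state];
            best_len = i;
        }
    }
    return best_len;
}
```

Calling dfa_longest_match("7.", &tok) consumes the dot but falls back to the last accepting state, so it reports an integer of length 1. The matching logic stays the same for any pattern set; only the tables change.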

lex.yy.c : What it produces

Running Lex

• To run lex on a source file, use the command:
      lex source.l
• This produces the file lex.yy.c, which is the C source for the lexical analyzer.
• To compile this, use:
      cc -o prog -O lex.yy.c -ll

Versions and Reference Books

• AT&T lex, GNU flex, and Win32 version
• lex & yacc, 2/e, by John R. Levine, Tony Mason & Doug Brown, O’Reilly
• Mastering Regular Expressions, by Jeffrey E.F. Friedl, O’Reilly

General Format of Lex Source

• Input specification file is in 3 parts
  – Declarations: Definitions
  – Rules: Token descriptions and actions
  – Auxiliary Procedures: User-written code
• The three parts are separated by %%
• Tip: the first part defines patterns, the third part defines actions, and the second part puts them together to express “If we see some pattern, then we do some action”.

Regular Policy of Non-translated Source

• Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program:
  – Any line that is not part of a Lex rule or action and begins with a blank or tab
  – Anything included between lines containing only %{ and %}
  – Anything after the second %% delimiter

Position of Copied Source

• Source input prior to the first %%
  – external to any function in the generated code
• After the first %% and prior to the second %%
  – appropriate place for declarations in the function generated by Lex which contains the actions
• After the second %%
  – after the Lex-generated output

Regular Policy of Translated Source

• Various variables or tables whose names are prefixed by yy
  – yyleng, yysvec[], yywork[]
• Various functions whose names are prefixed by yy
  – yyless(), yymore(), yywrap(), yylex()…
• Various definitions whose names are in capitals
  – BEGIN, INITIAL…

Default Rules and Actions

• The first and second parts must exist but may be empty; the third part and the second %% are optional.
• If the third part does not contain a main(), -ll will link a default main() which calls yylex() and then exits.
• Input that matches no pattern gets the default action, which copies the input to the output.

Default Input and Output

• If you don’t write your own main() to deal with the input and output of yylex(), the default main() reads from stdin and writes to stdout.
  – stdin is usually keyboard input
  – stdout is usually screen output
  – cs20: % ./a.out < inputfile > outputfile

Some Simple Lex Source Examples

• A minimum lex program:
      %%
  It only copies the input to the output unchanged.
• A trivial program to delete three spacing characters (blank, tab, newline):
      %%
      [ \t\n]     ;
• Another trivial example:
      %%
      [ \t]+$     ;
  It deletes from the input all blanks or tabs at the ends of lines.

A General Lex Source Example

%{
/*
 * Example lex source file
 * This first section contains necessary
 * C declarations and includes
 * to use throughout the lex specifications.
 */
#include <stdio.h>
%}
bin_digit [01]
%%

{bin_digit}+ {
        /* match all nonempty strings of 0's and 1's */
        /* Print out message with matching
         * text
         */
        printf("BINARY: %s\n", yytext);
}
([ab]*aa[ab]*bb[ab]*)|([ab]*bb[ab]*aa[ab]*) {
        /* match all strings over
         * (a,b) containing aa and bb
         */
        printf("AABB\n");
}
\n      ;       /* ignore newlines */
%%
/*
 * Now this is where you want your main
 * program
 */
int main(int argc, char *argv[]) {
        /*
         * call yylex to use the generated lexer
         */
        yylex();
        /*
         * make sure everything was printed
         */
        fflush(yyout);
        return 0;
}

Token Definitions (Extended Regular Expression)

• Elementary Operations
  – single characters
    • except " \ . $ ^ [ ] - ? * + | ( ) / { } % < >
  – concatenation (put characters together)
  – alternation (a|b|c)
    • [ab] == a|b
    • [a-k] == a|b|c|...|i|j|k
    • [a-z0-9] == any lowercase letter or digit
    • [^a] == any character but a

• Elementary Operations (cont.)
  – NOTE: . matches any character except the newline
  – * -- Kleene Closure
  – + -- Positive Closure
• Examples:
  – [0-9]+"."[0-9]+
    • note: without the quotes, the . could be any character
  – [ \t]+ -- is whitespace
    • (except CR).
    • There is a blank space character before the \t

• Special Characters:
  – . -- matches any single character (except newline)
  – " and \ -- quote the part as text
  – \t -- tab
  – \n -- newline
  – \b -- backspace
  – \" -- double quote
  – \\ -- \
  – ? -- this means the preceding was optional
    • ab? == a|ab
    • (ab)? == ab|ε

• Special Characters (cont.)
  – ^ -- means at the beginning of the line (unless it is inside of a [ ])
  – $ -- means at the end of the line, same as /\n
  – [^ ] -- means anything except
    • \"[^\"]*\" is a double quoted string
  – {n,m} -- n through m occurrences
    • a{1,3} is a or aa or aaa
  – {definition} -- translation from definition
  – / -- matches only if followed by right part of /
    • 0/1 means the 0 of 01 but not 02 or 03 or …
  – ( ) -- grouping

Definitions

• NAME REG_EXPR
  – digs [0-9]+
  – integer {digs}
  – plainreal {digs}"."{digs}
  – expreal {digs}"."{digs}[Ee][+-]?{digs}
  – real {plainreal}|{expreal}
• NAME must be a valid C identifier
• {NAME} is replaced by prior REG_EXPR

• The definitions can also contain variables and other declarations used by the code generated by Lex.
  – These usually go at the start of this section, marked by %{ at the beginning and %} at the end, or on lines which begin with a blank or tab.
  – Includes usually go here.
  – It is usually convenient to maintain a line counter so that error messages can be keyed to the lines in which the errors are found.
      %{
      int linecount = 1;
      %}
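
To see what the substitution of {NAME} amounts to, the digs, plainreal, expreal and real definitions can be expanded by hand and checked with the POSIX regex library. This is a sketch under assumptions: it uses regcomp/regexec from <regex.h> as a stand-in for Lex's own matcher, the Lex-quoted "." is rewritten as \. for POSIX ERE syntax, and the function name is_real is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* The definitions from the slide, expanded the way {NAME} is replaced
 * by its regular expression. String literal concatenation plays the
 * role of the textual substitution. */
#define DIGS      "[0-9]+"
#define PLAINREAL DIGS "\\." DIGS
#define EXPREAL   DIGS "\\." DIGS "[Ee][+-]?" DIGS
#define REAL      "(" PLAINREAL ")|(" EXPREAL ")"

/* Returns 1 if the whole string s matches REAL, 0 if it does not,
 * and -1 if the pattern fails to compile. */
int is_real(const char *s)
{
    regex_t re;
    int rc;

    /* Anchor both ends so the entire string has to match. */
    if (regcomp(&re, "^(" REAL ")$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With these expansions, is_real("1.5e-10") succeeds through the expreal branch, while "42" is rejected because both branches require a decimal point.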

Transition Rules

• ERE <one or more blanks> { program statement
                             program statement }
• A null statement ; will ignore the input
• Four special options:
      | ECHO; REJECT; BEGIN;
• Unmatched input gets a default action that ECHOes it from the input to the output
• | indicates that the action for this rule is the action for the next rule

Tokens and Actions

• Example:
  – {real} return FLOAT;
  – begin return BEGIN;
  – {newline} linecount++;
  – {integer} {
        printf("I found an integer\n");
        return INTEGER;
    }

Ambiguous Source Rules

• If 2 rules match the same input, Lex will use the first rule.
• Lex always chooses the longest matching substring for its tokens.
• To override the choice, use the action REJECT
  ex: she     {s++; REJECT;}
      he      {h++; REJECT;}
      .|\n    ;

Multiple States

• lex allows the user to explicitly declare multiple states (in the Definitions section)
      %s COMMENT
• The default state is INITIAL or 0
• Transition rules can be assigned to different states and are matched only in those states
• BEGIN is used to change state
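
The two disambiguation rules under Ambiguous Source Rules (longest match first, then the earlier rule on a tie) can be sketched in plain C. This is an illustration only, not how Lex implements it: the two patterns (the literal begin and [a-z]+), the rule numbers, and the function choose_rule are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical rule numbers, in the order the rules would appear
 * in the Lex source. */
enum { RULE_NONE = -1, RULE_KEYWORD_BEGIN = 0, RULE_IDENTIFIER = 1 };

/* Match length of the literal pattern  begin  at the start of s. */
static size_t match_begin(const char *s)
{
    return strncmp(s, "begin", 5) == 0 ? 5 : 0;
}

/* Match length of the pattern  [a-z]+  at the start of s. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Try every rule and keep the longest match. A later rule replaces the
 * current best only when it is strictly longer, so the earlier rule
 * wins a tie. Returns the rule number and stores the length in *len. */
int choose_rule(const char *s, size_t *len)
{
    size_t lens[2];
    int best = RULE_NONE;

    lens[RULE_KEYWORD_BEGIN] = match_begin(s);
    lens[RULE_IDENTIFIER] = match_identifier(s);
    *len = 0;
    for (int r = 0; r < 2; r++) {
        if (lens[r] > *len) {
            *len = lens[r];
            best = r;
        }
    }
    return best;
}
```

On the input begin both rules match five characters, so the first rule is chosen; on beginning the identifier rule matches nine characters and wins on length.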

Lex Special Variables

• identifiers used by Lex and Yacc begin with yy
  – yytext -- a string containing the lexeme
  – yyleng -- the length of the lexeme
  – yyin -- the input stream pointer
• Example:
  – {integer} {
        printf("I found an integer\n");
        sscanf(yytext, "%d", &yylval);
        return INTEGER;
    }
  – C++ Comments -- // .....
        "//".*    ;

Lex library function calls

• yylex()
  – default main() contains a return yylex();
• yywrap()
  – called by the lexical analyzer at the end of the input file
  – default yywrap() always returns 1
• yyless(n)
  – the first n characters in yytext are retained; the rest are returned to the input
• yymore()
  – the next input expression recognized is to be tacked on to the end of this input

User Written Code

• The actions associated with any given token are normally specified using statements in C. But occasionally the actions are complicated enough that it is better to describe them with a function call, and define the function elsewhere.
• Definitions of this sort go in the last section of the Lex input.

More Example 1

%{
int lengs[100];         /* lengs[n] counts the words of length n */
%}
%%
[a-z]+  { if (yyleng < 100) lengs[yyleng]++; }
.       |
\n      ;
%%
int yywrap(void)
{
        int i;
        printf("Length  No. words\n");
        for (i = 0; i < 100; i++)
                if (lengs[i] > 0)
                        printf("%5d%10d\n", i, lengs[i]);
        return 1;
}

More Example 2

Using yacc with lex

• yacc will call yylex() to get the next token from the input, so each lex rule should end with:
      return(token);
  where the appropriate token value is returned.
• An easy way is placing the line:
      #include "lex.yy.c"
  in the last section of the yacc input.
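
The calling pattern between the parser and the lexer can be sketched without either tool. Everything named toy_ below is a hypothetical stand-in written for this illustration (toy_yylex for yylex(), toy_yylval for yylval, toy_parse_sum for yyparse()), and the token codes are made up; a real build takes them from the header yacc generates. The sketch keeps two real conventions: the lexer returns 0 at end of input, and the parser returns 0 on success and 1 on a syntax error.

```c
#include <ctype.h>

/* Hypothetical token codes for this sketch. */
enum { TOK_EOF = 0, TOK_INTEGER = 258, TOK_PLUS = 259, TOK_ERROR = 260 };

static const char *toy_input = "";  /* stand-in for the input stream */
static int toy_yylval;              /* stand-in for yylval */

void toy_set_input(const char *s) { toy_input = s; }

/* Stand-in for yylex(): skips blanks and returns one token code per
 * call, or 0 at end of input. */
int toy_yylex(void)
{
    while (*toy_input == ' ' || *toy_input == '\t')
        toy_input++;
    if (*toy_input == '\0')
        return TOK_EOF;
    if (isdigit((unsigned char)*toy_input)) {
        toy_yylval = 0;
        while (isdigit((unsigned char)*toy_input))
            toy_yylval = toy_yylval * 10 + (*toy_input++ - '0');
        return TOK_INTEGER;
    }
    if (*toy_input == '+') {
        toy_input++;
        return TOK_PLUS;
    }
    toy_input++;
    return TOK_ERROR;
}

/* Stand-in for yyparse(): accepts  INTEGER (PLUS INTEGER)*  and stores
 * the sum in *sum. Returns 0 on success and 1 on a syntax error. */
int toy_parse_sum(int *sum)
{
    int tok = toy_yylex();

    if (tok != TOK_INTEGER)
        return 1;
    *sum = toy_yylval;
    while ((tok = toy_yylex()) == TOK_PLUS) {
        if (toy_yylex() != TOK_INTEGER)
            return 1;
        *sum += toy_yylval;
    }
    return tok == TOK_EOF ? 0 : 1;
}
```

After toy_set_input("1 + 2 + 39"), a call to toy_parse_sum(&sum) pulls five tokens one at a time and leaves 42 in sum. The parser never sees characters, only the codes the lexer returns, which is why every rule that produces a token has to end in a return.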
