
Introduction to Compilers 66416

Computer Engineering Spring Semester 2005

Dr. Raed Alqadi

Why Take This Course?


Interested in writing compilers for programming languages.
Curious about compilers.
It is a requirement.
Why shouldn't you take this course?
  You think it will be easy.

Why are Compilers Interesting?


Mixture of formal techniques and ad-hoc programming.
Among the earliest non-trivial programs.
Useful and prevalent tools.
Special-purpose languages are part of most programs.
  Techniques are widely applicable.

Good field in which to find a job or do research.


New computers have produced new problems.

Pieces of a Compiler
Will discuss scanning, parsing, (optimization), and code generation.
Input language can be any programming language.
Output is usually machine/assembly language.
  Could be another language (e.g. C).

Read Dragon, Ch. 1.

Why This Organization?


Traditional (and effective).
Functional decomposition.
  Separate theory for each area.
Large size requires abstractions.

What is the alternative?

Example: Scanner
Input is a sequence of characters.
  if x>100 then y := 1 else y := 2;
Output is tokens (lexemes).

Example: Parser
Input is tokens.

Output is syntax tree.

Example: Optimizer
Input is syntax tree.
Output is modified syntax tree.

Assume y has the value 2 before the conditional statement.

Example: Code Generation


Input is syntax tree.
Output is assembly (machine) language program.

          ble   x, 100, L505
          mov   y, 1
          j     L506
    L505: mov   y, 2
    L506:

Extended example to illustrate compiler


Light bulb language (LBL)
Language for controlling a light bulb.
Syntax of the two statements:
  on <time>;
  off <time>;
where <time> is an integer.


Syntax of Language
Backus-Naur Form (BNF)

    <stmt>     := on_stmnt | off_stmnt
    on_stmnt   := on <int> ;
    off_stmnt  := off <int> ;

For example:
    on 33;
    off -89;
    off 0;


Semantics of Language
What is the meaning of a program written in a language?
Statements are processed in the order in which they appear.
on N turns the light bulb on for N ticks.
off N turns the light bulb off for N ticks.
Time must be non-negative.
For example:

    on 3;
    off 2;
    off 6;
    on 0;
    on -1;     (illegal program)


Light bulb Computer


Light bulb controlled by a machine with two instructions.
  lt_on turns on the light for 1 tick.
  lt_off turns off the light for 1 tick.
Semantic mismatch between language and machine.
  N ticks vs. 1 tick.

Compiler bridges the gap by translating the program.


    on 3;      lt_on lt_on lt_on
    off 2;     lt_off lt_off
    on 1;      lt_on



Scanning LBL
Recognize keywords (on, off), integers, white space.
Ad-hoc scanner:

    token scan (FILE *input)
    {
        int c;

        while (1) {
            c = read_char (input);
            if (c == 'o') {
                c = read_char (input);
                if (c == 'n')
                    return ON_TOKEN;
                else if (c == 'f')
                    . . .
            }
            else if (c >= '0' && c <= '9')
                return scan_integer (input);
            else if (c == ';')
                return SEMI_TOKEN;
            else if (c == ' ')
                continue;
            else
                error ("Unknown character");
        }
    }


Parsing and Compiling LBL


Parser gets a stream of tokens from the scanner:
  ON_TOKEN, OFF_TOKEN, INT_TOKEN, SEMI_TOKEN

    parse (FILE *input)
    {
        t = scan (input);
        if (t == ON_TOKEN) {
            t = scan (input);
            if (t != INT_TOKEN)
                syntax_error ("Expected int");
            else
                for (i = 0; i < token_value (t); i++)
                    generate (lt_on);
        }
        else if (t == OFF_TOKEN) {
            ...
        }
        else
            syntax_error ("Unknown keyword");
    }
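
The scanner and parser slides can be combined into one small program. The following is a minimal sketch, not the course's code: it reads the program from a string instead of a FILE, and the names compile_lbl, int_value, EOF_TOKEN and ERROR_TOKEN are assumptions made for this example. It does no optimization and does not guard against integer overflow in very long numbers.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds from the slide; EOF_TOKEN and ERROR_TOKEN are additions. */
enum token { ON_TOKEN, OFF_TOKEN, INT_TOKEN, SEMI_TOKEN, EOF_TOKEN, ERROR_TOKEN };

static const char *src;   /* next unread character of the program text */
static int int_value;     /* value of the most recent INT_TOKEN */

/* Ad-hoc scanner: skip white space, then recognize one token. */
static enum token scan(void)
{
    while (isspace((unsigned char)*src))
        src++;
    if (*src == '\0')
        return EOF_TOKEN;
    if (strncmp(src, "on", 2) == 0)  { src += 2; return ON_TOKEN; }
    if (strncmp(src, "off", 3) == 0) { src += 3; return OFF_TOKEN; }
    if (*src == ';')                 { src += 1; return SEMI_TOKEN; }
    if (isdigit((unsigned char)*src)) {
        int_value = 0;
        while (isdigit((unsigned char)*src))
            int_value = int_value * 10 + (*src++ - '0');
        return INT_TOKEN;
    }
    return ERROR_TOKEN;   /* includes '-', so negative times are rejected */
}

/* Translate LBL text into lt_on/lt_off lines written to out.
 * Returns 0 on success, -1 on a syntax error or if out is too small. */
int compile_lbl(const char *program, char *out, size_t out_size)
{
    size_t used = 0;
    enum token t;

    if (out_size == 0)
        return -1;
    out[0] = '\0';
    src = program;
    while ((t = scan()) != EOF_TOKEN) {
        const char *insn;
        size_t len;
        int i, ticks;

        if (t == ON_TOKEN)
            insn = "lt_on\n";
        else if (t == OFF_TOKEN)
            insn = "lt_off\n";
        else
            return -1;                 /* unknown keyword */
        if (scan() != INT_TOKEN)
            return -1;                 /* expected int */
        ticks = int_value;
        if (scan() != SEMI_TOKEN)
            return -1;                 /* expected ';' */
        len = strlen(insn);
        for (i = 0; i < ticks; i++) {
            if (used + len + 1 > out_size)
                return -1;
            memcpy(out + used, insn, len + 1);
            used += len;
        }
    }
    return 0;
}
```

Calling compile_lbl on "on 3; off 2; on 1;" fills the buffer with three lt_on lines, two lt_off lines and one lt_on line, which is the translation shown on the Light bulb Computer slide.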

Optimization
New instruction
lt_toggle = lt_on; lt_off; or lt_off; lt_on;

Compiler must recognize transitions between states.
  Difficult in the parser, since it doesn't know the next state when generating code for the current state.
  Build a representation of the program and examine the whole sequence before optimizing or generating code.

Keep an array of values
  a[i] > 0 means light is on for a[i] ticks.
  a[i] < 0 means light is off for -a[i] ticks.


Optimization, cont'd

    parse (FILE *input)
    {
        t = scan (input);
        if (t == ON_TOKEN) {
            t = scan (input);
            if (t != INT_TOKEN)
                syntax_error ("Expected int");
            else
                a[i++] = token_value (t);
        }
        else if (t == OFF_TOKEN) {
            ...
        }
        else
            syntax_error ("Unknown keyword");
    }

Code Generation After Optimization


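    /* a[0..n-1] is the array built by the parser. */
    Generate (int *a, int n)
    {
        int i, j, ticks, toggle;
        int skip = 0;   /* 1 if the previous lt_toggle already covered the first tick of a[i] */

        for (i = 0; i < n; i++) {
            ticks = abs (a[i]) - skip;
            /* At a transition between states, one lt_toggle replaces the
               last tick of this state and the first tick of the next. */
            toggle = ticks > 0 && i + 1 < n && a[i + 1] != 0
                     && (a[i] > 0) != (a[i + 1] > 0);
            for (j = 0; j < ticks - toggle; j++)
                generate (a[i] > 0 ? lt_on : lt_off);
            if (toggle)
                generate (lt_toggle);
            skip = toggle;
        }
    }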


Constructing a Scanner
Need a formal way to describe the items recognized by the scanner.
  Items are called tokens or lexemes.

Regular expressions (REs).
Automatic techniques for constructing scanners from REs.
Dragon: Sec. 3.1-3.7.


How to Describe Tokens


English.
  e.g. an identifier is a sequence of letters and digits that does not begin with a digit.
    What about A_3? Verbose, but not precise.

  e.g. a floating point number is a sequence of one or more digits, followed by a decimal point, followed by a sequence of one or more digits.
    Too complex and verbose. What about -1.4, 1.0e9?

Regular Expressions (REs)


Concise and precise notation.
Efficient translation to a finite state machine that mechanically recognizes tokens.

What is a Regular Expression (RE)?


Language for describing a set of strings.
A string S is in the language of RE R if R matches S.

Built from a simple set of rules.
  Start with an alphabet A of symbols (i.e. characters).
    A = {a, ..., z, A, ..., Z, 0, ..., 9}. Also allow the empty string ε.

  Concatenate REs with ".". If R1 and R2 are REs, so is R1.R2.
    a.b is an a followed by a b.
    1.0.1 is a 1, a 0, and finally a 1.


REs, cont'd
Alternate REs with |
  If R1 and R2 are REs, so is R1|R2.
  0|1 is a 0 or a 1.
  (a.b)|c is ab or c.
  0|1|2|3|4|5|6|7|8|9 matches a digit.

Repeat REs with *
  If R1 is an RE, then R1* is an RE.
  Means 0 or more repetitions of R1.
  a* matches ε, a, aa, aaa, aaaa, ...
  (0|1|2|3|4|5|6|7|8|9)* matches any string of digits (unsigned integers, and also the empty string).


REs, Precisely
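L(R) is the set of strings matched by RE R.
  $x \in A \implies L(x) = \{x\}$.
  $R = R_1 . R_2 \implies L(R) = \{ab : a \in L(R_1),\ b \in L(R_2)\}$.
  $R = R_1 \mid R_2 \implies L(R) = L(R_1) \cup L(R_2)$.
  $R = R_1^* \implies L(R) = \bigcup_{s=0}^{\infty} \{a_1 a_2 \cdots a_s : a_i \in L(R_1)\}$, where $s = 0$ gives the empty string $\varepsilon$.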


Examples of REs
    Letter  := a|b| . . . |y|z|A|B| . . . |Y|Z
    Digit   := 0|1|2|3|4|5|6|7|8|9
    IdChar  := Letter | Digit
            := {a|b| . . . |y|z|A|B| . . . |Y|Z|0| . . . |9}
    Id      := Letter . IdChar*
    FPNum   := Digit . Digit* . "." . Digit . Digit*

Note that decimal point must be quoted


More Concise Notation (Unix)


Square brackets delimit a set of alternatives
  [a-z] = a|b| ... |y|z
  [0-9] = 0|1| ... |8|9
  Allow more than one set in a bracket.
    [a-z,A-Z]

Plus superscript means one or more repetitions.
  R+ = R.R*

e.g.:
    Id    := [a-z,A-Z].[a-z,A-Z,0-9]*
    FPNum := [0-9]+ . "." . [0-9]+
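
The Id and FPNum definitions can be tried out with the POSIX regular expression functions in <regex.h>. This is a sketch under a few assumptions: the helper name matches and the constants ID_RE and FPNUM_RE are made up for this example, and the patterns are rewritten in POSIX syntax, where concatenation is implicit, a literal decimal point is written \. and a bracket set is written [a-zA-Z] (a comma inside the brackets would be treated as one more character of the set).

```c
#include <regex.h>
#include <stddef.h>

/* The slide's Id and FPNum in POSIX extended syntax. The anchors ^ and $
 * require the whole string to match, not just a part of it. */
const char ID_RE[]    = "^[a-zA-Z][a-zA-Z0-9]*$";
const char FPNUM_RE[] = "^[0-9]+\\.[0-9]+$";

/* Return 1 if s matches pattern, 0 if it does not,
 * and -1 if the pattern itself fails to compile. */
int matches(const char *pattern, const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With these patterns, matches(ID_RE, "x1") is 1 and matches(ID_RE, "1x") is 0; matches(FPNUM_RE, "3.14") is 1, while "-1.4" and "1.0e9" are rejected, which are the cases the English description struggled with.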


Why REs?
Good notation for describing lexemes.
Directly translatable to a program that recognizes the RE.
  Produce an abstract machine that recognizes strings.
  Translate to an efficient program that does the same.
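
To make the "abstract machine" concrete, here is a hand-written finite state machine for FPNum := [0-9]+ "." [0-9]+. It is only an illustration of the kind of program a scanner generator might produce; the state names and the function is_fpnum are assumptions for this sketch, and a generated scanner would normally use a transition table rather than a switch.

```c
/* States of the machine. FRAC_PART is the only accepting state. */
enum state { START, INT_PART, DOT, FRAC_PART, REJECT };

/* One transition: the state reached from s on input character c. */
static enum state step(enum state s, char c)
{
    int digit = (c >= '0' && c <= '9');

    switch (s) {
    case START:     return digit ? INT_PART : REJECT;
    case INT_PART:  return digit ? INT_PART : (c == '.' ? DOT : REJECT);
    case DOT:       return digit ? FRAC_PART : REJECT;
    case FRAC_PART: return digit ? FRAC_PART : REJECT;
    default:        return REJECT;
    }
}

/* Return 1 if the whole string s is an FPNum, 0 otherwise. */
int is_fpnum(const char *s)
{
    enum state st = START;

    for (; *s != '\0' && st != REJECT; s++)
        st = step(st, *s);
    return st == FRAC_PART;
}
```

Each character is examined exactly once, so recognition takes time proportional to the length of the input.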


PA1: Symbol Tables and Modularity


Goal is to build first part of compiler: the symbol table.
Another goal is to introduce a good programming style: modularity and abstract data types.

The symbol table holds information on the identifiers in a program.


Identifiers are the names of variables, functions, constants, etc.
The compiler hands the symbol table a string containing the identifier's name and gets back a record containing information on the identifier.

[Figure: the string x is passed to the Symbol Table, which returns a record with Name: x, Type: int, Line_no: 202.]


Interface
Two data types:
  symbol_table: a symbol table.
  symbol_table_entry: an entry in a symbol table with information on an identifier.

Five functions:
    symbol_table *make_symbol_table (int fold_case);
    symbol_table_entry *get_symbol (symbol_table tbl, char *str);
    symbol_table_entry *put_symbol (symbol_table tbl, char *str);
    symbol_table *clear_symbol_table (symbol_table tbl);
    void print_statistics (symbol_table tbl);


Digression: Abstract Datatypes (ADTs)


An abstract data type (ADT) is a building block.
  An object with a carefully specified interface.
  The interface isolates clients from implementation details.
    Change the way the ADT is built without modifying clients.

Implications:
  Cannot let clients outside the ADT see internal details.
  Export an interface with a complete set of operations.
    Only the functions exported by the ADT know the details.
  Clients hold instances of the object and invoke operations on them.


ADTs in C++/C
C++ classes support ADTs.
C does not directly support this program development methodology.
  A struct defined in a header file reveals its full definition to clients.
  Need opaque datatypes.
  Can support ADTs by convention (not compiler checking).

In C, break a datatype into two parts:
  Interface definition (.h file).
    Define the datatypes exported by the ADT and the operations on them.
  Implementation (.c file).
    Actually implement the object and operations.
    Make everything not in the interface static.
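
One common way to get an opaque datatype in C is an incomplete struct type: the interface declares the struct's name but not its fields. The sketch below uses a hypothetical counter ADT (it is not part of the assignment) and shows both parts in one listing; in a real project the first part would live in counter.h and the second in counter.c.

```c
#include <stdlib.h>

/* ---- counter.h (interface): all that clients get to see ---- */
typedef struct counter counter;       /* opaque: fields are not declared here */
counter *make_counter(void);
void     counter_increment(counter *c);
int      counter_value(const counter *c);
void     free_counter(counter *c);

/* ---- counter.c (implementation) ---- */
struct counter {
    int count;
};

/* Not in the interface, so it is static (visible only in this file). */
static void reset(counter *c)
{
    c->count = 0;
}

counter *make_counter(void)
{
    counter *c = malloc(sizeof *c);

    if (c != NULL)
        reset(c);
    return c;
}

void counter_increment(counter *c)    { c->count++; }
int  counter_value(const counter *c)  { return c->count; }
void free_counter(counter *c)         { free(c); }
```

A client that includes only the interface can hold a counter * and call the four functions, but an expression such as c->count does not compile there, so the representation can change without modifying clients.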


Symbol Table ADT


I have provided the interface definition
(symbol_table.h)

You need to write the 5 symbol table routines described earlier.
I have also written a test driver routine that uses the interface:
  test_symbol.c
  You should not have to look at the code in this file.


Implementing Symbol Tables


Probably the best (and simplest) implementation is a hash table
Have a large table holding entries.

Want to find the entry associated with a particular string.


Where do you start looking in table?

Hash the string by computing a function from the string to the integers.
  The function should be chosen so entries are evenly distributed.
  What if two entries fall in the same slot?
    Search nearby entries.
    Chain together entries into a linked list.
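
The chaining option can be sketched as follows. This is an illustration only and does not follow the assignment's symbol_table interface: the names table, entry, lookup and insert, the table size of 211, and the deliberately simple placeholder hash (sum of the characters) are all assumptions chosen for this example.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_SLOTS 211                 /* assumed table size */

struct entry {
    char *name;
    struct entry *next;               /* next entry that hashed to the same slot */
};

struct table {
    struct entry *slots[NUM_SLOTS];   /* each slot heads a linked list */
};

/* Placeholder hash: sum of the characters, reduced to a slot index. */
static unsigned hash(const char *s)
{
    unsigned h = 0;

    while (*s != '\0')
        h += (unsigned char)*s++;
    return h % NUM_SLOTS;
}

/* Return the entry for name, or NULL if it is not in the table. */
struct entry *lookup(struct table *t, const char *name)
{
    struct entry *e;

    for (e = t->slots[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Return the entry for name, adding it to the front of its chain if needed.
 * Returns NULL only if memory runs out. */
struct entry *insert(struct table *t, const char *name)
{
    struct entry *e = lookup(t, name);
    unsigned h;

    if (e != NULL)
        return e;
    e = malloc(sizeof *e);
    if (e == NULL)
        return NULL;
    e->name = malloc(strlen(name) + 1);
    if (e->name == NULL) {
        free(e);
        return NULL;
    }
    strcpy(e->name, name);
    h = hash(name);
    e->next = t->slots[h];
    t->slots[h] = e;
    return e;
}
```

With the placeholder hash, "ab" and "ba" land in the same slot, so inserting both produces a chain of two entries; lookup still finds each one by comparing names along the chain. The table must start with all slots set to NULL, for example by declaring it static.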


Hash functions
Two desirable properties.
  Fast to compute.
  Spreads the expected collection of strings over the table evenly.

A reasonable function is:

    x = strlen (str);
    while (*str != '\0') {
        x = x ^ *str;
        x = x << 1;
        str++;
    }

Worry about common cases: i, j, x1, x2, x3, ...
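
To check the common cases, the loop above can be wrapped in a function. The name hash_string and the use of unsigned arithmetic (so that the shift cannot overflow a signed int) are choices made for this sketch, not part of the slide.

```c
#include <string.h>

/* The slide's hash: start from the length, then for each character
 * XOR it in and shift left by one bit. */
unsigned hash_string(const char *str)
{
    unsigned x = (unsigned)strlen(str);

    while (*str != '\0') {
        x = x ^ (unsigned char)*str;
        x = x << 1;
        str++;
    }
    return x;
}
```

For the common cases this gives i = 208, j = 214, x1 = 394, x2 = 396 and x3 = 398. Because the last step is always a left shift, every result is even, so if the value is reduced modulo an even table size only the even slots are ever used; an odd table size may be one way around that.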
