
Theory of Automata

Assignment # 1
BY
M. Suhaib Khalid
Roll no: 105
BSCS (4th Semester)
Section: C

Submitted to
Sir. Umar Farooq
Language
If Σ is an alphabet and L ⊆ Σ∗, then L is a (formal) language over Σ:
a (possibly infinite) set of strings, all of which are chosen from Σ∗.
A language over Σ need not include strings with all symbols of Σ. Thus, a language
over Σ is also a language over any alphabet that is a superset of Σ.
A language is a set of strings. One special language is Σ∗, which is the set of all
possible strings over the alphabet Σ. For example, if
Σ = {a,b,c} then Σ∗ = {ε,a,b,c,aa,ab,ac,ba,...,aaaaaabbbaababa,...}
Namely, Σ∗ is the “full” language made of characters of Σ. Naturally, any language
over Σ is going to be a subset of Σ∗.
Examples:
1. The programming language C.
2. Legal C programs are a subset of the possible strings that can be formed from
the alphabet of the language (a subset of the ASCII characters).
Other Language Examples:
1. L = {b,ba,baa,baaa,baaaa,...}
2. L = {aa,ab,ba}
3. Σ∗ is a language for any alphabet Σ.
4. ∅, the empty language, is a language over any alphabet.
5. The language of all strings consisting of n 0’s followed by n 1’s (n ≥ 0):
{ε,01,0011,000111,...}
6. The set of strings of 0’s and 1’s with an equal number of each:
{ε,01,10,0011,0101,1001,...}
7. { w | w consists of an equal number of 0’s and 1’s }
Lexicographic ordering
Lexicographic ordering of a set of strings is an ordering that lists shorter
strings first and sorts the strings alphabetically within each length. Naturally,
we assume that we have an order on the given alphabet.
Example:
For Σ = {a,b}, the lexicographic ordering of Σ∗ is ε,a,b,aa,ab,ba,bb,aaa,aab,....
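This ordering is easy to enumerate mechanically. Here is a minimal Python sketch
(the helper name shortlex is our own, not terminology from these notes):

from itertools import count, product

def shortlex(sigma, limit):
    # Yield the first `limit` strings of sigma* in lexicographic order:
    # shorter strings first, alphabetical within each length.
    produced = 0
    for length in count(0):
        for chars in product(sigma, repeat=length):
            yield "".join(chars)
            produced += 1
            if produced == limit:
                return

print(list(shortlex(["a", "b"], 15)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa', ...]  ('' is ε)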

Languages and Set Notation


Most of the time it is more useful to use set notation to define a language,
that is, to define a language by the property that the strings in the language
possess.
Example:
Consider the following set of strings.
L = {x | x ∈ {a,b}* and |x| is even }
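Such a defining property translates directly into a membership test. A small
Python sketch (the function name is ours) for this particular L:

def in_L(x):
    # Membership test for L = { x | x in {a,b}* and |x| is even }.
    return set(x) <= {"a", "b"} and len(x) % 2 == 0

print(in_L(""), in_L("abba"), in_L("aab"), in_L("abc"))
# True True False False  (odd length; symbol outside the alphabet)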

Important operators on languages


Union
The union of two languages L and M, denoted L∪M, is the set of strings that are in
either L, or M, or both.
Examples:
If L = {001,10,111} and M = {ε,001} then,
L∪M = {ε,001,10,111}
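Since languages are sets, union is ordinary set union; the example can be checked
in a couple of lines of Python (with "" standing in for ε):

L = {"001", "10", "111"}
M = {"", "001"}
print(L | M)   # {'', '001', '10', '111'}  (set order may vary)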

Concatenation
The concatenation of languages L and M, denoted L.M or just LM, is the set of
strings that can be formed by taking any string in L and concatenating it with any
string in M.
Examples:
If L = {001,10,111} and M = {ε,001} then,
L.M = {001,10,111,001001,10001,111001}
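The definition translates directly into a set comprehension over all pairs; a
short Python sketch (function name ours):

def concat(L, M):
    # Every string of L followed by every string of M.
    return {x + y for x in L for y in M}

L = {"001", "10", "111"}
M = {"", "001"}
print(concat(L, M))
# {'001', '10', '111', '001001', '10001', '111001'}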

Closure
The closure of a language L is denoted L∗ and is the set of strings that can be
formed by taking any number of strings from L, possibly with repetitions (i.e.,
the same string may be selected more than once), and concatenating all of them.
Examples:
1. If L = {0,1} then L∗ is the set of all strings of 0’s and 1’s.
2. If L = {0,11} then L∗ consists of the strings of 0’s and 1’s in which the 1’s
come in pairs, e.g., 011, 11110 and ε, but not 01011 or 101.
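L∗ is infinite, so an executable sketch must bound it somewhere; the Python
function below (our own illustration) computes every string of L∗ up to a length
cap:

def kleene_star(L, max_len):
    # All strings of L* of length at most max_len (L* itself is infinite).
    result = {""}                 # ε is always in L*
    frontier = {""}
    while frontier:
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

print(sorted(kleene_star({"0", "11"}, 5), key=lambda s: (len(s), s)))
# ['', '0', '00', '11', '000', '011', '110', ...] -- never 101 or 01011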

Lexical Analyzer
1. The first phase of a compiler.
2. Lexical analysis: process of taking an input string of characters (such as the
source code of a computer program) and producing a sequence of symbols called
lexical tokens, or just tokens, which may be handled more easily by a parser.
3. The lexical analyzer reads the source text and, thus, it may perform certain
secondary tasks:
Eliminate comments and white space in the form of blanks, tabs, and newline
characters.
Correlate error messages from the compiler with the source program (e.g., keep
track of the number of lines).
4. The interaction with the parser is usually done by making the lexical analyzer
a subroutine of the parser.
Tokens, Patterns, Lexemes
1. Token: A token is a group of characters having collective meaning: typically, a
word or punctuation mark, separated by a lexical analyzer and passed to a parser.
2. Lexeme: An actual character sequence forming a specific instance of a token,
such as num.
3. Pattern: A rule that describes the set of strings associated with a token,
expressed as a regular expression describing how a particular token can be
formed. For example,
[A-Za-z][A-Za-z_0-9]*
The pattern matches each string in the set.
A lexeme is a sequence of characters in the source text that is matched by the
pattern for a token.
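This identifier-style pattern can be tried out directly with Python's re module:

import re

ident = re.compile(r"[A-Za-z][A-Za-z_0-9]*")   # the pattern shown above

for s in ["pi", "count", "D2", "2count"]:
    print(s, "->", bool(ident.fullmatch(s)))
# pi, count and D2 match; 2count does not (it starts with a digit)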

Example:

Token      Sample Lexemes        (Informal) Description of Pattern
const      const                 const
if         if                    if
relation   <, <=, =, <>, >, >=   < | <= | = | <> | > | >=
id         pi, count, D2         letter (letter | digit)*
num        3.1426, 0.6, 6.22     any numeric constant
literal    "core dumped"         any characters between " and " except "

Lexical Errors
1. Few errors are discernible at the lexical level alone.
2. Lexical analyzer has a very localized view of the source text.
3. It cannot tell whether the string fi is a misspelling of the keyword if or an
identifier.
4. The lexical analyzer can detect characters that are not in the alphabet or strings
that have no pattern.
5. In general, when an error is found, the lexical analyzer stops (but other actions
are also possible).

Stages of a lexical analyzer


Scanner
1. Based on a finite state machine.
2. If it lands on an accepting state, it takes note of the type and position of
the acceptance, and continues.
3. Sometimes it lands on a "dead state", which is a non-accepting state.
4. When the lexical analyzer lands on the dead state, it is done. The last
accepting state is the one that represents the type and length of the longest
valid lexeme.
5. The "extra" invalid characters should be "returned" to the input buffer.

Evaluator
1. Goes over the characters of the lexeme to produce a value.
2. The lexeme’s type combined with its value is what properly constitutes a token,
which can be given to a parser.
3. Some tokens such as parentheses do not really have values, and so the evaluator
function for these can return nothing.
4. The evaluators for integers, identifiers, and strings can be considerably more
complex.
5. Sometimes evaluators can suppress a lexeme entirely, concealing it from the
parser, which is useful for white space and comments.
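A minimal evaluator along these lines (the value conventions here are our own
assumptions):

def evaluate(kind, lexeme):
    # Turn a (type, lexeme) pair into a finished token: (type, value).
    if kind == "num":
        return ("num", float(lexeme))   # numeric lexemes get numeric values
    if kind == "id":
        return ("id", lexeme)           # identifiers keep their spelling
    return (kind, None)                 # e.g. parentheses carry no value

pairs = [("id", "count"), ("relop", "<="), ("num", "42")]
print([evaluate(k, lx) for k, lx in pairs])
# [('id', 'count'), ('relop', '<='), ('num', 42.0)]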

Implementation of a Lexical Analyzer


Different ways of creating a lexical analyzer:
1. To use an automatic generator of lexical analyzers (such as LEX or FLEX).
2. Though it is possible and sometimes necessary to write a lexer by hand, lexers
are often generated by automated tools.
3. These tools accept regular expressions that describe the tokens allowed in the
input stream. The input is a specification consisting of:
4. Regular expressions representing the patterns.
5. Actions to take according to the detected token.
6. Each regular expression is associated with a phrase in a programming language
which will evaluate the lexemes that match the regular expression.
7. The tool then constructs a state table for the appropriate finite state machine
and creates program code which contains the table, the evaluation phrases, and a
routine which uses them appropriately.

Advantages & Disadvantages


Lexical analyzer generator
Advantages: easier and faster development.
Disadvantages: the lexical analyzer is not very efficient and its maintenance can
be complicated.
To write the lexical analyzer by using a high-level language
Advantages: More efficient and compact.
Disadvantages: Done by hand.
To write the lexical analyzer by using a low-level language
Advantages: Very efficient and compact.
Disadvantages: Development is complicated.
Syntax Analyzer

Parsing = determining whether a string of tokens can be generated by a grammar

Grammars
1. Precise, easy-to-understand description of syntax.
2. Context-free grammars -> efficient parsers (automatically!).
3. Help in translation and error detection.
4. E.g. Attribute grammars.
5. Easier language evolution.
6. Can add new constructs systematically.

Syntax Errors
1. Many errors are syntactic or exposed by parsing.
2. E.g. Unbalanced ().
3. Error handling goals:
4. Report errors quickly & accurately.
5. Recover quickly (continue parsing after error).
6. Little overhead on parse time.

Error Recovery
Panic mode
1. Discard tokens until a synchronization token is found (often ‘;’).
Phrase level
1. Local correction: replace a token by another and continue.
Error productions
1. Encode commonly expected errors in grammar.
Global correction
1. Find closest input string that is in L(G).
2. Too costly in practice.
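As an illustration of panic mode, the simplest of these strategies, a parser
might skip ahead roughly like this (a sketch over an assumed token list, not
production code):

def panic_mode_recover(tokens, pos, sync=";"):
    # Skip tokens until the synchronization token is found,
    # then resume parsing just after it.
    while pos < len(tokens) and tokens[pos] != sync:
        pos += 1                 # discard the offending tokens
    return pos + 1               # first token after ';'

tokens = ["x", "=", "@", "!", ";", "y", "=", "2", ";"]
print(panic_mode_recover(tokens, 2))   # 5 -> parsing resumes at 'y'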

Context-free Grammars
1. Precise and easy way to specify the syntactical structure of a programming
language.
2. Efficient recognition methods exist.
3. Natural specification of many “recursive” constructs:
4. expr -> expr + expr | term.

Shorthands & Derivations


1. E -> E + E | E * E | (E) | - E | <id>
2. E => - E “E derives -E”
3. => derives in 1 step
4. =>* derives in n (n ≥ 0) steps
More Definitions
1. L(G), the language generated by G, is the set of strings derivable from S.
2. S =>+ w : w is a sentence of G (w is a string of terminals).
3. S =>+ α : α is a sentential form of G (the string may contain nonterminals).
4. G and G’ are equivalent :⇔ L(G) = L(G’).
5. A language generated by a grammar (of the form shown) is called a context-free
language.

Example:
G = ({+,-,*,(,),<id>}, {E}, E, {E -> E + E, E -> E * E, E -> (E), E -> -E, E -> <id>})

Sentence: -(<id> + <id>) Derivation:


E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id> + <id>)
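To make the derivation concrete, here is a tiny Python sketch (entirely our own
illustration) that replays it by rewriting the leftmost E at each step:

# Replay the derivation by rewriting the leftmost E at each step.
form = "E"                               # start symbol of G
for rhs in ["-E", "(E)", "E+E", "<id>", "<id>"]:
    form = form.replace("E", rhs, 1)     # apply one production, leftmost first
    print("=>", form)
# => -E   => -(E)   => -(E+E)   => -(<id>+E)   => -(<id>+<id>)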

You might also like