
Introduction To Compiler

A compiler is a computer program that implements a programming language specification to "translate" programs, usually from a set of files constituting the source code written in the source language, into their equivalent machine-readable instructions (the target language, often having a binary form known as object code). This translation process is called compilation. We compile the source program to create the compiled program. The compiled program can then be run (or executed) to do what was specified in the original source program.

The source language is always a higher-level language in comparison to machine code, written using some mixture of English words and mathematical notation, assembly language being the lowest compilable language (an assembler being a special case of a compiler that translates assembly language into machine code). Higher-level languages are the most complex to support in a compiler/interpreter, not only because they increase the level of abstraction between the source code and the resulting machine code, but because increased complexity is required to formalize those abstract structures. The target language is normally a low-level language such as assembly, written with somewhat cryptic abbreviations for machine instructions; in this case an assembler is also run to generate the final machine code. Some compilers, however, can directly generate machine code for an actual or virtual computer, e.g. byte-code for the Java Virtual Machine.

Another common approach is to target a virtual machine that performs just-in-time compilation and byte-code interpretation, blurring the traditional distinction between compilers and interpreters. For example, C and C++ are generally compiled for a target 'architecture'. The drawback is that, because there are many types of processor, there must be as many distinct compilations. In contrast, Java targets a Java Virtual Machine, which is an independent layer above the 'architecture'. The difference is that the generated byte-code, not true machine code, brings the possibility of portability, but needs a Java Virtual Machine (the byte-code interpreter) for each platform. The extra overhead of this byte-code interpreter means slower execution speed. A program which does a lot of calculation or internal data manipulation will generally run faster in compiled form than when interpreted, but a program which does a lot of input/output and very little calculation or data manipulation may well run at about the same speed in either case.

Being themselves computer programs, both compilers and interpreters must be written in some implementation language. Until the early 1970s, most compilers were written in assembly language for some particular type of computer. The advent of C and Pascal compilers, each written in its own source language, led to the more general use of high-level languages for writing compilers. Today, operating systems typically provide at least a free C compiler to the user, and some even include it as part of the OS distribution. Compiler construction is normally considered an advanced rather than a novice programming task, mainly because of the quantity of code needed (and the difficulty of understanding that much code) rather than the difficulty of any particular coding construct. Most books about compilers share some of the blame for this view; the large gap between production compilers and educational exercises promotes it. For the first version of a compiler written in its own source language you have a bootstrapping problem: once you get a simple version working, you can then use it to improve itself.

Lex - A Lexical Analyzer Generator

Lex helps write programs whose control flow is directed by instances of regular expressions in the input stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a parsing routine. Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream. The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream will be backed up to the end of the current partition, so that the user has general freedom to manipulate it. Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems.

Introduction to Lex with Yacc


1. Introduction. Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The code written by Lex recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed. The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general purpose programming language employed for the user's program fragments. Thus, a high-level expression language is provided to write the string expressions to be matched while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write processing programs in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Just as general purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. At present, the only supported host language is C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself exists on UNIX, GCOS, and

OS/370; but the code generated by Lex may be taken anywhere the appropriate compilers exist.

Lex turns the user's expressions and actions (called source here) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input here) and perform the specified actions for each expression as it is detected:

    Source -> [  Lex  ] -> yylex
    Input  -> [ yylex ] -> Output

    An overview of Lex

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

    %%
    [ \t]+$    ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule. This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the + indicates ``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To change any remaining string of blanks or tabs to a single blank, add another rule:

    %%
    [ \t]+$    ;
    [ \t]+     printf(" ");

The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs. Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context free grammars, but require a lower level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown in Figure 2. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.

    lexical rules          grammar rules
         |                      |
         v                      v
     [  Lex  ]              [  Yacc  ]
         |                      |
         v                      v
    Input -> [ yylex ] -> [ yyparse ] -> Parsed input

    Figure 2: Lex with Yacc


Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing. Lex generates a deterministic finite automaton from the regular expressions in the source. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex. In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine. Lex is not limited to source which can be interpreted on the basis of one-character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup is more costly than the processing of simpler languages.

2. Lex Source. The general format of Lex source is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

    %%

(no definitions, no rules) which translates into a program which copies the input to the output unchanged. In the outline of Lex programs shown above, the rules represent the user's control decisions; they are a table, in which the left column contains regular expressions and the right column contains actions, program fragments to be executed when the expressions are recognized. Thus an individual rule might appear as

    integer    printf("found keyword INT");

to look for the string integer in the input stream and print the message ``found keyword INT'' whenever it appears. In this example the host procedural language is C and the C library function printf is used to print the string. The end of the expression is indicated by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to change a number of words from British to American spelling. Lex rules such as

    colour      printf("color");
    mechanise   printf("mechanize");
    petrol      printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become gaseum; a way of dealing with this will be a bit more complex.
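One common way to handle this, sketched here rather than taken from the original text, is to rely on Lex's longest-match rule: add a catch-all rule for whole alphabetic words after the spelling rules, so that petroleum matches the longer pattern and is simply echoed, while petrol on its own (an equal-length match, resolved by rule order) still triggers the earlier rule.

    %%
    colour      printf("color");
    mechanise   printf("mechanize");
    petrol      printf("gas");
    [A-Za-z]+   ECHO;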

3. Lex and Yacc. If you want to use Lex with Yacc, note that what Lex writes is a program named yylex(), the name required by Yacc for its analyzer. Normally, the default main program on the Lex library calls this routine, but if Yacc is loaded, and its main program is used, Yacc will call yylex(). In this case each Lex rule should end

with return(token); where the appropriate token value is returned. An easy way to get access to Yacc's names for tokens is to compile the Lex output file as part of the Yacc output file by placing the line

    # include "lex.yy.c"

in the last section of Yacc input. Supposing the grammar to be named ``good'' and the lexical rules to be named ``better'' the UNIX command sequence can just be:

    yacc good
    lex better
    cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to obtain a main program which invokes the Yacc parser. The generations of Lex and Yacc programs can be done in either order.

4. Examples. As a trivial problem, consider copying an input file while adding 3 to every positive number divisible by 7. Here is a suitable Lex source program to do just that:

    %%
            int k;
    [0-9]+  {
            k = atoi(yytext);
            if (k%7 == 0)
                printf("%d", k+3);
            else
                printf("%d", k);
            }

The rule [0-9]+ recognizes strings of digits; atoi converts the digits to binary and stores the result in k. The operator % (remainder) is used to check whether k is divisible by 7; if it is, it is incremented by 3 as it is written out. It may be objected that this program will alter such input items as 49.63 or X7. Furthermore, it increments the absolute value of all negative numbers divisible by 7. To avoid this, just add a few more rules after the active one, as here:

    %%
            int k;
    -?[0-9]+              {
            k = atoi(yytext);
            printf("%d", k%7 == 0 ? k+3 : k);
            }
    -?[0-9.]+             ECHO;
    [A-Za-z][A-Za-z0-9]+  ECHO;

Numerical strings containing a ``.'' or preceded by a letter will be picked up by one of the last two rules, and not changed. The if-else has been replaced by a C conditional expression to save space; the form a?b:c means ``if a then b else c''.

For an example of statistics gathering, here is a program which histograms the lengths of words, where a word is defined as a string of letters.

    int lengs[100];
    %%
    [a-z]+    lengs[yyleng]++;
    .       |
    \n        ;
    %%
    yywrap()
    {
    int i;
    printf("Length  No. words\n");
    for(i=0; i<100; i++)
        if (lengs[i] > 0)
            printf("%5d%10d\n", i, lengs[i]);
    return(1);
    }

This program accumulates the histogram, while producing no output. At the end of the input it prints the table. The final statement return(1); indicates that Lex is to perform wrapup. If yywrap returns zero (false) it implies that further input is available and the program is to continue reading and processing. To provide a yywrap that never returns true causes an infinite loop.

As a larger example, here are some parts of a program written by N. L. Schryer to convert double precision Fortran to single precision Fortran. Because Fortran does not distinguish upper and lower case letters, this routine begins by defining a set of classes including both cases of each letter:

    a   [aA]
    b   [bB]
    c   [cC]
    ...
    z   [zZ]

An additional class recognizes white space:

    W   [ \t]*

The first rule changes ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.

    {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n}   {
        printf(yytext[0]=='d'? "real" : "REAL");
        }

Care is taken throughout this program to preserve the case (upper or lower) of the original program. The conditional operator is used to select the proper form of the keyword. The next rule copies continuation card indications to avoid confusing them with constants:

    ^"     "[^ 0]   ECHO;

In the regular expression, the quotes surround the blanks. It is interpreted as ``beginning of line, then five blanks, then anything but blank or zero.'' Note the two different meanings of ^. There follow some rules to change double precision constants to ordinary floating constants.

    [0-9]+{W}{d}{W}[+-]?{W}[0-9]+        |
    [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+  |
    "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+  {
        /* convert constants */
        for(p=yytext; *p != 0; p++)
            {
            if (*p == 'd' || *p == 'D')
                *p += 'e' - 'd';
            ECHO;
            }
        }

After the floating point constant is recognized, it is scanned by the for loop to find the letter d or D. The program then adds 'e'-'d', which converts it to the next letter of the alphabet. The modified constant, now single-precision, is written out again. There follow a series of names which must be respelled to remove their initial d. By using the array yytext the same action suffices for all the names (only a sample of a rather long list is given here).

    {d}{s}{i}{n}        |
    {d}{c}{o}{s}        |
    {d}{s}{q}{r}{t}     |
    {d}{a}{t}{a}{n}     |
    ...
    {d}{f}{l}{o}{a}{t}  printf("%s",yytext+1);

Another list of names must have initial d changed to initial a:

    {d}{l}{o}{g}    |
    {d}{l}{o}{g}10  |
    {d}{m}{i}{n}1   |
    {d}{m}{a}{x}1   {
        yytext[0] += 'a' - 'd';
        ECHO;
        }

And one routine must have initial d changed to initial r:

    {d}1{m}{a}{c}{h}  {
        yytext[0] += 'r' - 'd';
        ECHO;
        }

To avoid such names as dsinx being detected as instances of dsin, some final rules pick up longer words as identifiers and copy some surviving characters:

    [A-Za-z][A-Za-z0-9]*  |
    [0-9]+                |
    \n                    |
    .                     ECHO;

Note that this program is not complete; it does not deal with the spacing problems in Fortran or with the use of keywords as identifiers.

5. Summary of Source Format. The general form of a Lex source file is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

The definitions section contains a combination of:

1) Definitions, in the form ``name space translation''.
2) Included code, in the form ``space code''.
3) Included code, in the form
       %{
       code
       %}
4) Start conditions, given in the form
       %S name1 name2 ...
5) Character set tables, in the form
       %T
       number space character-string
       ...
       %T
6) Changes to internal array sizes, in the form
       %x nnn
   where nnn is a decimal integer representing an array size and x selects the parameter as follows:

       Letter   Parameter
       p        positions
       n        states
       e        tree nodes
       a        transitions
       k        packed character classes
       o        output array size

Lines in the rules section have the form ``expression action'' where the action may be continued on succeeding lines by using braces to delimit it. Regular expressions in Lex use the following operators:

    x        the character "x"
    "x"      an "x", even if x is an operator.
    \x       an "x", even if x is an operator.

    [xy]     the character x or y.
    [x-z]    the characters x, y or z.
    [^x]     any character but x.
    .        any character but newline.
    ^x       an x at the beginning of a line.
    <y>x     an x when Lex is in start condition y.
    x$       an x at the end of a line.
    x?       an optional x.
    x*       0,1,2, ... instances of x.
    x+       1,2,3, ... instances of x.
    x|y      an x or a y.
    (x)      an x.
    x/y      an x but only if followed by y.
    {xx}     the translation of xx from the definitions section.
    x{m,n}   m through n occurrences of x.
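As a brief illustration (not part of the original summary), these operators combine into practical token patterns; the rules below are only examples:

    [A-Za-z_][A-Za-z0-9_]*    a typical identifier: a letter or underscore followed by letters, digits or underscores.
    [0-9]+("."[0-9]+)?        an unsigned integer or decimal constant, using grouping, quoting and the optional operator.
    if/[^A-Za-z0-9]           the keyword if, but only when not followed by a letter or digit (trailing context).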

Programs

Program-1
Program to implement a DFA that accepts the string abb

#include<stdio.h>
#include<conio.h>
#include<string.h>
void main()
{
  /* Transition table of the DFA over {a,b} that accepts strings
     ending with "abb"; state 3 is the final state.              */
  int x[4][2]={ {1,0},    /* state 0: on a go to 1, on b stay at 0 */
                {1,2},    /* state 1: on a stay at 1, on b go to 2 */
                {1,3},    /* state 2: on a go to 1, on b go to 3   */
                {1,0} };  /* state 3: on a go to 1, on b go to 0   */
  char str[20];
  int i,l,q=0,fs=3;
  clrscr();
  printf("Enter the Input string\n");
  scanf("%s",str);
  l=strlen(str);
  for(i=0;i<l;i++)
  {
    if(str[i]=='a')
      q=x[q][0];
    else if(str[i]=='b')
      q=x[q][1];
    else
    {
      printf("\nInvalid input symbol\n");
      getch();
      return;
    }
  }
  if(q==fs)
    printf("\nString is Accepted\n");
  else
    printf("\nNot accepted\n");
  getch();
}

Output

Enter the Input string
abb
String is Accepted

Enter the Input string
ababa
Not accepted

ALGORITHM TO FIND WHETHER A STRING IS A CONSTANT OR NOT
1. Start.
2. Declare a character array str[] and initialize integer variables len, a=0.
3. Input the string from the user.
4. Find the length of the string.
5. Repeat steps 6 to 7 till a<len.
6. If str[a] is a numeric character then a++.
7. Else it is not a constant; break from the loop and go to step 9.
8. If a==len then print that the entered string is a constant.
9. Else it is not a constant.
10. Stop.

Program-2
PROGRAM TO FIND WHETHER THE STRING IS CONSTANT OR NOT

#include<stdio.h>
#include<conio.h>
#include<ctype.h>
#include<string.h>
void main()
{
  char str[10];
  int len,a;
  clrscr();
  printf("\n Input a string :");
  gets(str);
  len=strlen(str);
  a=0;
  while(a<len)
  {
    if(isdigit(str[a]))
    {
      a++;
    }
    else
    {
      printf(" It is not a Constant");
      break;
    }
  }
  if(a==len)
  {
    printf(" It is a Constant");
  }
  getch();
}

OUTPUT

Input a string :23
It is a Constant

Input a string :a_123
It is not a Constant

ALGORITHM TO IDENTIFY WHETHER A GIVEN STRING IS AN IDENTIFIER OR NOT
Step 1. Start.
Step 2. If the first character of the string is a letter, go to step 3 with the rest of the string; else go to step 5.
Step 3. If the current character is a letter or a digit, repeat step 3 with the rest of the string; if the string is exhausted, go to step 4; else go to step 5.
Step 4. Print "The given string is an identifier" and go to step 6.
Step 5. Print "The given string is not an identifier".
Step 6. Exit.

Program-3
THIS PROGRAM IS TO FIND OUT WHETHER A GIVEN STRING IS AN IDENTIFIER OR NOT

#include<stdio.h>
#include<conio.h>
int isident(char*);
int second(char*);
int third();
void main()
{
  char str[30];
  int i = -1;
  clrscr();
  printf("\n\n\t\tEnter the desired String: ");
  do
  {
    ++i;
    str[i] = getch();
    if(str[i]!=10 && str[i]!=13)
      printf("%c",str[i]);
    if(str[i] == '\b')
    {
      --i;
      printf(" \b");
    }
  } while(str[i] != 10 && str[i] != 13);
  if(isident(str))
    printf("\n\n\t\tThe given string is an identifier");
  else
    printf("\n\n\t\tThe given string is not an identifier");
  getch();
}
//To check whether the given string is an identifier or not
//This function acts as the first stage of the DFA
int isident(char *str)
{
  if((str[0]>='a' && str[0]<='z') || (str[0]>='A' && str[0]<='Z'))
  {
    return(second(str+1));
  }
  else
    return 0;
}
//This function acts as the second stage of the DFA
int second(char *str)
{
  if((str[0]>='0' && str[0]<='9') || (str[0]>='a' && str[0]<='z') || (str[0]>='A' && str[0]<='Z'))
  {
    return(second(str+1)); //Loop from the second stage back to itself
  }
  else
  {
    if(str[0] == 10 || str[0] == 13)
    {
      return(third(str));
    }
    else
    {
      return 0;
    }
  }
}
//This function acts as the third (final) stage of the DFA
int third()
{
  return 1; //Reaching the final stage means the string is an identifier
}

OUTPUT:

Enter the desired String: a123
The given string is an identifier

Enter the desired String: shailesh
The given string is an identifier

Enter the desired String: 1asd
The given string is not an identifier

Enter the desired String: as-*l
The given string is not an identifier

ALGORITHM TO CHECK WHETHER A STRING IS A KEYWORD OR NOT
1. Start.
2. Declare a character array storing the keywords, s[5][10]={"if","else","goto","continue","return"}, and another character array st[] to store the string to be compared; initialize integer variables i, flag=0, m.
3. Input the string to be compared, st[].
4. Repeat steps 5 to 6 till counter i becomes equal to the number of keywords stored in the array.
5. Compare the entered string with the keywords in the array using m=strcmp(st,s[i]); strcmp returns 0 if both strings are equal, and in that case set flag=1.
6. i=i+1.
7. If flag==1 then it is a keyword.
8. Else it is not a keyword.
9. Stop.

Program-4
PROGRAM TO FIND WHETHER STRING IS A KEYWORD OR NOT

#include<stdio.h>
#include<conio.h>
#include<string.h>
void main()
{
  int i,flag=0,m;
  char s[5][10]={"if","else","goto","continue","return"},st[10];
  clrscr();
  printf("\n enter the string :");
  gets(st);
  for(i=0;i<5;i++)
  {
    m=strcmp(st,s[i]);
    if(m==0)
      flag=1;
  }
  if(flag==0)
    printf("\n it is not a keyword");
  else
    printf("\n it is a keyword");
  getch();
}

OUTPUT

enter the string :return
it is a keyword

enter the string :called
it is not a keyword

Program-5
PROGRAM TO FIND THE NUMBER OF WHITESPACE AND NEWLINE CHARACTERS

#include<stdio.h>
#include<conio.h>
#include<string.h>
void main()
{
  char str[200],ch;
  int a=0,space=0,newline=0;
  clrscr();
  printf("\n enter a string(press escape to quit entering):");
  ch=getche();
  while((ch!=27) && (a<199))
  {
    str[a]=ch;
    if(str[a]==' ')
    {
      space++;
    }
    if(str[a]==13)
    {
      newline++;
      printf("\n");
    }
    a++;
    ch=getche();
  }
  printf("\n the number of lines used : %d",newline+1);
  printf("\n the number of spaces used is : %d",space);
  getch();
}

OUTPUT

enter a string(press escape to quit entering):hello! how r u?
Do you like prog.
in compiler?

the number of lines used : 4
the number of spaces used is : 7

ALGORITHM TO IMPLEMENT STACK USING ARRAY

INSERTION: PUSH(item)
1. If (top = MAXSIZE - 1)
     Print overflow
     Return
2. top = top + 1
3. stack[top] = item
4. Return

DELETION: POP()
1. If (top = -1)
     Print underflow
     Return
2. item = stack[top]
3. top = top - 1
4. Return item

DISPLAY
1. If (top = -1)
     Print underflow
     Return
2. Repeat step 3 for i = top down to 0
3. Print stack[i]
4. Return

Program-6
IMPLEMENT STACK USING ARRAY

#include<stdio.h>
#include<conio.h>
#include<stdlib.h>
#define MAXSIZE 10
void push();
int pop();
void traverse();
int stack[MAXSIZE];
int Top=-1;
void main()
{
  int choice;
  char ch;
  do
  {
    clrscr();
    printf("\n1. PUSH ");
    printf("\n2. POP ");
    printf("\n3. TRAVERSE ");
    printf("\nEnter your choice");
    scanf("%d",&choice);
    switch(choice)
    {
      case 1:
        push();
        break;
      case 2:
        printf("\nThe deleted element is %d",pop());
        break;
      case 3:
        traverse();
        break;
      default:
        printf("\nYou Entered Wrong Choice");
    }
    printf("\nDo You Wish To Continue (Y/N)");
    fflush(stdin);
    scanf("%c",&ch);
  } while(ch=='Y' || ch=='y');
}
void push()
{
  int item;
  if(Top == MAXSIZE - 1)
  {
    printf("\nThe Stack Is Full");
    getch();
    exit(0);
  }
  else
  {
    printf("Enter the element to be inserted");
    scanf("%d",&item);
    Top = Top+1;
    stack[Top] = item;
  }
}
int pop()
{
  int item;
  if(Top == -1)
  {
    printf("The stack is Empty");
    getch();
    exit(0);
  }
  else
  {
    item = stack[Top];
    Top = Top-1;
  }
  return(item);
}
void traverse()
{
  int i;
  if(Top == -1)
  {
    printf("The Stack is Empty");
    getch();
    exit(0);
  }
  else
  {
    for(i=Top;i>=0;i--)
    {
      printf("Traverse the element");
      printf("\n%d",stack[i]);
    }
  }
}

ALGORITHM TO IMPLEMENT STACK AS LINKED LIST

PUSH()
1. t = newnode()
2. Enter the info to be inserted
3. Read n
4. t->info = n
5. t->next = top
6. top = t
7. Return

POP()
1. If (top = NULL)
     Print underflow
     Return
2. x = top
3. top = top->next
4. delnode(x)
5. Return

Program-7
TO IMPLEMENT STACK AS LINKED LIST

#include<stdio.h>
#include<conio.h>
#include<stdlib.h>
struct stack
{
  int no;
  struct stack *next;
} *start=NULL;
typedef struct stack st;
void push();
int pop();
void display();
void main()
{
  char ch;
  int choice,item;
  do
  {
    clrscr();
    printf("\n 1: push");
    printf("\n 2: pop");
    printf("\n 3: display");
    printf("\n Enter your choice");
    scanf("%d",&choice);
    switch (choice)
    {
      case 1:
        push();
        break;
      case 2:
        item=pop();
        printf("The deleted element is %d",item);
        break;
      case 3:
        display();
        break;
      default:
        printf("\n Wrong choice");
    };
    printf("\n do you want to continue(Y/N)");
    fflush(stdin);
    scanf("%c",&ch);
  } while (ch=='Y'||ch=='y');
}
void push()
{
  st *node;
  node=(st *)malloc(sizeof(st));
  printf("\n Enter the number to be inserted");
  scanf("%d",&node->no);
  node->next=start;
  start=node;
}
int pop()
{
  st *temp;
  int item;
  temp=start;
  if(start==NULL)
  {
    printf("stack is already empty");
    getch();
    exit(0);
  }
  else
  {
    item=temp->no;      /* save the value before freeing the node */
    start=start->next;
    free(temp);
  }
  return(item);
}
void display()
{
  st *temp;
  temp=start;
  if(temp==NULL)
  {
    printf("\n stack is empty");
    return;
  }
  while(temp->next!=NULL)
  {
    printf("\nno=%d",temp->no);
    temp=temp->next;
  }
  printf("\nno=%d",temp->no);
}

PROGRAM FOR COMPUTATION OF FIRST

ALGORITHM
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) is {X}.
2. If X is a nonterminal and X -> aα is a production, then add a to FIRST(X). If X -> ε is a production, then add ε to FIRST(X).
3. If X -> Y1 Y2 ... Yk is a production, then for all i such that all of Y1, ..., Yi-1 are nonterminals and FIRST(Yj) contains ε for j = 1, 2, ..., i-1, add every non-ε symbol in FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X).
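As an illustrative worked example (not from the original text), consider the small grammar E -> T E', E' -> + T E' | ε, T -> i | ( E ). Applying the rules above gives:

    FIRST(T)  = { i, ( }             (rule 2: both alternatives begin with a terminal)
    FIRST(E') = { +, ε }             (rule 2: + from + T E', ε from the ε-production)
    FIRST(E)  = FIRST(T) = { i, ( }  (rule 3: E -> T E', and ε is not in FIRST(T))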

Program-8
PROGRAM FOR COMPUTATION OF FIRST

#include<stdio.h>
#include<conio.h>
#include<string.h>
void main()
{
  char t[5],nt[10],p[5][5],first[5][5],temp;
  int i,j,not,nont,k=0,f=0;
  clrscr();
  printf("\nEnter the no. of Non-terminals in the grammar:");
  scanf("%d",&nont);
  printf("\nEnter the Non-terminals in the grammar:\n");
  for(i=0;i<nont;i++)
  {
    scanf("\n%c",&nt[i]);
  }
  printf("\nEnter the no. of Terminals in the grammar: ( Enter e for epsilon ) ");
  scanf("%d",&not);
  printf("\nEnter the Terminals in the grammar:\n");
  for(i=0;i<not;i++)
  {
    scanf("\n%c",&t[i]);
  }
  for(i=0;i<nont;i++)
  {
    p[i][0]=nt[i];
    first[i][0]=nt[i];
  }
  printf("\nEnter the productions :\n");
  for(i=0;i<nont;i++)
  {
    scanf("%c",&temp);
    printf("\nEnter the production for %c ( End the production with '$' sign ) :",p[i][0]);
    for(j=0;p[i][j]!='$';)
    {
      j+=1;
      scanf("%c",&p[i][j]);
    }
  }
  for(i=0;i<nont;i++)
  {
    printf("\nThe production for %c -> ",p[i][0]);
    for(j=1;p[i][j]!='$';j++)
    {
      printf("%c",p[i][j]);
    }
  }
  for(i=0;i<nont;i++)
  {
    f=0;
    for(j=1;p[i][j]!='$';j++)
    {
      for(k=0;k<not;k++)
      {
        if(f==1)
          break;
        if(p[i][j]==t[k])
        {
          first[i][j]=t[k];
          first[i][j+1]='$';
          f=1;
          break;
        }
        else if(p[i][j]==nt[k])
        {
          first[i][j]=first[k][j];
          if(first[i][j]=='e')
            continue;
          first[i][j+1]='$';
          f=1;
          break;
        }
      }
    }
  }
  for(i=0;i<nont;i++)
  {
    printf("\n\nThe first of %c -> ",first[i][0]);
    for(j=1;first[i][j]!='$';j++)
    {
      printf("%c\t",first[i][j]);
    }
  }
  getch();
}

OUTPUT

Enter the no. of Non-terminals in the grammar:3
Enter the Non-terminals in the grammar:
ERT
Enter the no. of Terminals in the grammar: ( Enter e for epsilon ) 5
Enter the Terminals in the grammar:
ase*+
Enter the productions :
Enter the production for E ( End the production with '$' sign ) :a+s$
Enter the production for R ( End the production with '$' sign ) :e$
Enter the production for T ( End the production with '$' sign ) :Rs$
The production for E -> a+s
The production for R -> e
The production for T -> Rs
The first of E -> a
The first of R -> e
The first of T -> e  s

TO CHECK WHETHER A STRING BELONGS TO A GRAMMAR OR NOT

ALGORITHM
1. Start.
2. Declare two character arrays str[], token[] and initialize integer variables a=0, b=0, c, d.
3. Input the string from the user.
4. Repeat steps 5 to 10 till str[a]=='\0'.
5. If str[a]=='(' or str[a]=='{' then token[b]='4', b++.
6. If str[a]==')' or str[a]=='}' then token[b]='5', b++.
7. If isdigit(str[a]) then repeat a++ while isdigit(str[a]); then a--, token[b]='6', b++.
8. If str[a]=='+' then token[b]='2', b++.
9. If str[a]=='*' then token[b]='3', b++.
10. a++.
11. Set token[b]='\0' and print the token string generated for the input; set b=0.
12. Repeat steps 13 to 16 till token[b]=='\0'.
13. If (token[b]=='6' and token[b+1]=='2' and token[b+2]=='6') or (token[b]=='6' and token[b+1]=='3' and token[b+2]=='6') or (token[b]=='4' and token[b+1]=='6' and token[b+2]=='5') then do steps 14 to 16; else b++ and print token.
14. token[b]='6'; c=b+1.
15. Repeat token[c]=token[c+2], c++ till token[c]=='\0'; then token[c]='\0'.
16. Print token and set b=0.
17. Compare token with "6" and store the result in d.
18. If d==0 then print that the string is in the grammar.
19. Else print that the string is not in the grammar.
20. Stop.

Program-9
PROGRAM TO CHECK WHETHER A STRING BELONGS TO A GRAMMAR OR NOT

#include<stdio.h>
#include<conio.h>
#include<ctype.h>
#include<string.h>
void main()
{
  int a=0,b=0,c,d;
  char str[20],tok[20];
  clrscr();
  printf("Input the expression = ");
  gets(str);
  while(str[a]!='\0')
  {
    if((str[a]=='(')||(str[a]=='{'))
    {
      tok[b]='4';
      b++;
    }
    if((str[a]==')')||(str[a]=='}'))
    {
      tok[b]='5';
      b++;
    }
    if(isdigit(str[a]))
    {
      while(isdigit(str[a]))
      {
        a++;
      }
      a--;
      tok[b]='6';
      b++;
    }
    if(str[a]=='+')
    {
      tok[b]='2';
      b++;
    }
    if(str[a]=='*')
    {
      tok[b]='3';
      b++;
    }
    a++;
  }
  tok[b]='\0';
  puts(tok);
  b=0;
  while(tok[b]!='\0')
  {
    if(((tok[b]=='6')&&(tok[b+1]=='2')&&(tok[b+2]=='6'))
       ||((tok[b]=='6')&&(tok[b+1]=='3')&&(tok[b+2]=='6'))
       ||((tok[b]=='4')&&(tok[b+1]=='6')&&(tok[b+2]=='5')))
    {
      tok[b]='6';
      c=b+1;
      while(tok[c]!='\0')
      {
        tok[c]=tok[c+2];
        c++;
      }
      tok[c]='\0';
      puts(tok);
      b=0;
    }
    else
    {
      b++;
      puts(tok);
    }
  }
  d=strcmp(tok,"6");
  if(d==0)
  {
    printf("It is in the grammar.");
  }
  else
  {
    printf("It is not in the grammar.");
  }
  getch();
}

OUTPUT

Input the expression = (23+)
4625
4625
4625
4625
4625
It is not in the grammar.

Input the expression = (2+(3+4)+5)
46246265265
46246265265
46246265265
46246265265
46246265265
462465265
462465265
462465265
462465265
4626265
4626265
46265
46265
465
6
6
It is in the grammar.

TO CALCULATE LEADING OF NON-TERMINALS

ALGORITHM
1. Start
2. For each nonterminal A and terminal a do L(A,a) := false.
3. For each production of the form A -> a(alpha) or A -> Ba(alpha) do INSTALL(A,a).
4. While STACK is not empty repeat steps 5 and 6.
5. Pop the top pair (B,a) from STACK.
6. For each production of the form A -> B(alpha) do INSTALL(A,a).
7. Stop

Algorithm for INSTALL(A,a)
1. Start
2. If L(A,a) is false do steps 3 and 4.
3. Make L(A,a) = true.
4. Push (A,a) onto STACK.
5. Stop
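As a worked example (not part of the original text), take the usual expression grammar E -> E+T | T, T -> T*F | F, F -> (E) | i, which is also the grammar hard-coded in the program that follows. The algorithm yields:

    LEADING(F) = { (, i }
    LEADING(T) = { * } together with LEADING(F) = { *, (, i }
    LEADING(E) = { + } together with LEADING(T) = { +, *, (, i }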

Program-10
THIS PROGRAM IS TO CALCULATE LEADING FOR ALL THE NON-TERMINALS OF THE GIVEN GRAMMAR

#include<conio.h>
#include<stdio.h>
char arr[18][3] = {
  {'E','+','F'},{'E','*','F'},{'E','(','F'},{'E',')','F'},{'E','i','F'},{'E','$','F'},
  {'F','+','F'},{'F','*','F'},{'F','(','F'},{'F',')','F'},{'F','i','F'},{'F','$','F'},
  {'T','+','F'},{'T','*','F'},{'T','(','F'},{'T',')','F'},{'T','i','F'},{'T','$','F'},
};
char prod[6] = "EETTFF";
char res[6][3]= {
  {'E','+','T'},{'T','\0'},
  {'T','*','F'},{'F','\0'},
  {'(','E',')'},{'i','\0'},
};
char stack[5][2];
int top = -1;
void install(char pro,char re)
{
  int i;
  for(i=0;i<18;++i)
  {
    if(arr[i][0]==pro && arr[i][1]==re)
    {
      arr[i][2] = 'T';
      break;
    }
  }
  ++top;
  stack[top][0]=pro;
  stack[top][1]=re;
}
void main()
{
  int i=0,j;
  char pro,re,pri=' ';
  clrscr();
  for(i=0;i<6;++i)
  {
    for(j=0;j<3 && res[i][j]!='\0';++j)
    {
      if(res[i][j]=='+'||res[i][j]=='*'||res[i][j]=='('||res[i][j]==')'||res[i][j]=='i'||res[i][j]=='$')
      {
        install(prod[i],res[i][j]);
        break;
      }
    }
  }
  while(top>=0)
  {
    pro = stack[top][0];
    re = stack[top][1];
    --top;
    for(i=0;i<6;++i)
    {
      if(res[i][0]==pro && res[i][0]!=prod[i])
      {
        install(prod[i],re);
      }
    }
  }
  for(i=0;i<18;++i)
  {
    printf("\n\t");
    for(j=0;j<3;++j)
      printf("%c\t",arr[i][j]);
  }
  getch();
  clrscr();
  printf("\n\n");
  for(i=0;i<18;++i)
  {
    if(pri!=arr[i][0])
    {
      pri=arr[i][0];
      printf("\n\t%c -> ",pri);
    }
    if(arr[i][2] =='T')
      printf("%c ",arr[i][1]);
  }
  getch();
}

OUTPUT

E+T  E*T  E(T  E)F  EiT  E$F
F+F  F*F  F(T  F)F  FiT  F$F
T+F  T*T  T(T  T)F  TiT  T$F

TO CALCULATE TRAILING FOR ALL THE NON-TERMINALS OF THE GIVEN GRAMMAR

ALGORITHM
1. Start
2. For each nonterminal A and terminal a do L(A,a) := false.
3. For each production of the form A -> (alpha)a or A -> (alpha)aB do INSTALL(A,a).
4. While STACK is not empty repeat steps 5 and 6.
5. Pop the top pair (B,a) from STACK.
6. For each production of the form A -> (alpha)B do INSTALL(A,a).
7. Stop

Algorithm for INSTALL(A,a)
1. Start
2. If L(A,a) is false do steps 3 and 4.
3. Make L(A,a) = true.
4. Push (A,a) onto STACK.
5. Stop
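For the same expression grammar E -> E+T | T, T -> T*F | F, F -> (E) | i used by the program below, the algorithm gives (a worked example, not part of the original text):

    TRAILING(F) = { ), i }
    TRAILING(T) = { * } together with TRAILING(F) = { *, ), i }
    TRAILING(E) = { + } together with TRAILING(T) = { +, *, ), i }

which matches the summary printed in the program's output.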

Program-11
THIS PROGRAM IS TO CALCULATE TRAILING FOR ALL THE NON-TERMINALS OF THE GIVEN GRAMMAR

#include<conio.h>
#include<stdio.h>
char arr[18][3] = {
  {'E','+','F'},{'E','*','F'},{'E','(','F'},{'E',')','F'},{'E','i','F'},{'E','$','F'},
  {'F','+','F'},{'F','*','F'},{'F','(','F'},{'F',')','F'},{'F','i','F'},{'F','$','F'},
  {'T','+','F'},{'T','*','F'},{'T','(','F'},{'T',')','F'},{'T','i','F'},{'T','$','F'},
};
char prod[6] = "EETTFF";
char res[6][3]= {
  {'E','+','T'},{'T','\0','\0'},
  {'T','*','F'},{'F','\0','\0'},
  {'(','E',')'},{'i','\0','\0'},
};
char stack[5][2];
int top = -1;
void install(char pro,char re)
{
  int i;
  for(i=0;i<18;++i)
  {
    if(arr[i][0]==pro && arr[i][1]==re)
    {
      arr[i][2] = 'T';
      break;
    }
  }
  ++top;
  stack[top][0]=pro;
  stack[top][1]=re;
}
void main()
{
  int i=0,j;
  char pro,re,pri=' ';
  clrscr();
  for(i=0;i<6;++i)
  {
    for(j=2;j>=0;--j)
    {
      if(res[i][j]=='+'||res[i][j]=='*'||res[i][j]=='('||res[i][j]==')'||res[i][j]=='i'||res[i][j]=='$')
      {
        install(prod[i],res[i][j]);
        break;
      }
      else if(res[i][j]=='E' || res[i][j]=='F' || res[i][j]=='T')
      {
        if(j>0 && (res[i][j-1]=='+'||res[i][j-1]=='*'||res[i][j-1]=='('||res[i][j-1]==')'||res[i][j-1]=='i'||res[i][j-1]=='$'))
        {
          install(prod[i],res[i][j-1]);
          break;
        }
      }
    }
  }
  while(top>=0)
  {
    pro = stack[top][0];
    re = stack[top][1];
    --top;
    for(i=0;i<6;++i)
    {
      for(j=2;j>=0;--j)
      {
        if(res[i][0]==pro && res[i][0]!=prod[i])
        {
          install(prod[i],re);
          break;
        }
        else if(res[i][0]!='\0')
          break;
      }
    }
  }
  for(i=0;i<18;++i)
  {
    printf("\n\t");
    for(j=0;j<3;++j)
      printf("%c\t",arr[i][j]);
  }
  getch();
  clrscr();
  printf("\n\n");
  for(i=0;i<18;++i)
  {
    if(pri!=arr[i][0])
    {
      pri=arr[i][0];
      printf("\n\t%c -> ",pri);
    }
    if(arr[i][2] =='T')
      printf("%c ",arr[i][1]);
  }
  getch();
}

OUTPUT

E+T  E*T  E(F  E)T  EiT  E$F
F+F  F*F  F(F  F)T  FiT  F$F
T+F  T*T  T(F  T)T  TiT  T$F

E -> + * ) i
F -> ) i
T -> * ) i

Left recursion
We've barely started trying to figure out exactly what an expression is, and how to code a recursive descent parser for it, and already we've encountered an embarrassing problem: how to decide between several options for a given production, particularly when the wrong choice causes our code to call itself over and over again, infinitely, or else simply quit too early. This is a common problem in setting up useful productions for any language, and algebraic expressions are no exception. Here are the productions, so far:

    expression := term | expression + term | expression - term
    term := number | ( expression )

The first production, expression, uses the production expression in its last two choices and, as such, recursively calls itself. But note that this recursive use is on the left side of each of these last two choices. In other words, it implies that we'd need to recursively call the Expression function before a check is made for a plus or minus, not after. Of course, we already know this from trying to code it, and it is a problem. This is called left recursion, and in recursive descent parsing circles it is a bad thing. What you really want is right recursion, as that means you only make a recursive call if and only if everything else has already been verified. It's just fine to recursively call a production if it's the last thing you do, as by then you've already figured out which choice is better. You really want to arrange things so that recursive calls are the last thing done, sometimes called 'tail recursion.'
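A minimal sketch (not the author's code) of how the rewritten, non-left-recursive productions expression := term { (+|-) term } and term := number | ( expression ) might be coded as a recursive descent recognizer in C; the character-by-character input handling below is only assumed for illustration:

#include <stdio.h>
#include <ctype.h>

static const char *p;          /* current position in the input string */

static int expression(void);   /* forward declaration */

static int term(void)
{
    if (isdigit((unsigned char)*p)) {          /* term := number */
        while (isdigit((unsigned char)*p))
            p++;
        return 1;
    }
    if (*p == '(') {                           /* term := ( expression ) */
        p++;
        if (!expression())
            return 0;
        if (*p != ')')
            return 0;
        p++;
        return 1;
    }
    return 0;                                  /* neither alternative fits */
}

static int expression(void)
{
    /* The left recursion has been turned into a loop: parse one term,
       then keep absorbing "+ term" or "- term".  The recursive call to
       expression() now happens only inside term(), for parentheses.   */
    if (!term())
        return 0;
    while (*p == '+' || *p == '-') {
        p++;
        if (!term())
            return 0;
    }
    return 1;
}

int main(void)
{
    p = "(12+3)-4";
    if (expression() && *p == '\0')
        printf("valid expression\n");
    else
        printf("syntax error\n");
    return 0;
}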

Left Factoring
Left factoring is the transformation applied when a grammar leads to backtracking while building the parse/syntax tree. Let's take an example. Suppose we have the grammar:

    A -> bcW | bcE
    W -> x
    E -> q

Now let's look at the production A -> bcW | bcE: here 'bc' is repeated. If we want to parse the string "bcq", we first have two choices, either A -> bcW or A -> bcE. Since both look equally likely, suppose we choose A -> bcW. We finish matching "bc" and then try to parse the remaining "q", but W only derives "x", so we have to backtrack and try the other production, A -> bcE. This shows the problem with such a grammar. To overcome the problem we use left factoring: the common prefix is factored out into a new nonterminal. The grammar above after applying left factoring is:

    A -> bcX
    X -> W | E
    W -> x
    E -> q
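A small sketch (not from the original text) of how the left-factored grammar can be parsed predictively in C with no backtracking; the character-by-character input handling is only assumed for illustration:

#include <stdio.h>

/* Grammar after left factoring:
     A -> bcX      X -> W | E      W -> x      E -> q
   One character of lookahead decides between W and E, so "bc" is
   never re-read.                                                   */

static const char *p;   /* current position in the input */

static int parseX(void)
{
    if (*p == 'x') { p++; return 1; }   /* X -> W, W -> x */
    if (*p == 'q') { p++; return 1; }   /* X -> E, E -> q */
    return 0;
}

static int parseA(void)
{
    if (*p != 'b') return 0;            /* the common prefix "bc"     */
    p++;
    if (*p != 'c') return 0;            /* is matched exactly once    */
    p++;
    return parseX();                    /* then one lookahead decides */
}

int main(void)
{
    p = "bcq";
    printf(parseA() && *p == '\0' ? "accepted\n" : "rejected\n");
    return 0;
}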
