Compiler Design and Construction 2069
Attempt all questions. (10x6=60)
A compiler operates in phases. A phase is a logically cohesive operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below:
1. The analysis phase breaks up the source program into constituent pieces and creates an intermediate representation of it. Analysis of the source program includes lexical analysis, syntax analysis and semantic analysis.
2. The synthesis phase constructs the desired target program from the intermediate representation. The synthesis part of the compiler consists of the following phases: intermediate code generation, code optimization and target code generation.
1. Lexical Analysis:
In this phase, the lexical analyzer reads the source program and returns its tokens. A token is a sequence of characters that can be treated as a single logical entity (such as identifiers, operators, keywords, constants, etc.).
Example:
Input String: c = a + b * 3
Tokens: id1 = id2 + id3 * 3
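As an illustration, here is a minimal Python tokenizer sketch for such input; the token class names and regular expressions are illustrative assumptions, not part of the original example:

```python
import re

# Illustrative token classes; a real lexer would also handle keywords, etc.
TOKEN_SPEC = [
    ("NUM", r"\d+(?:\.\d+)?"),             # integer or floating-point constant
    ("ID",  r"[A-Za-z_][A-Za-z0-9_]*"),    # identifier
    ("OP",  r"[+\-*/=]"),                  # operators
    ("WS",  r"\s+"),                       # whitespace (skipped)
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (token-class, lexeme) pairs for the given source string."""
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "WS":
            yield (m.lastgroup, m.group())

print(list(tokenize("c = a + b * 3")))
# [('ID', 'c'), ('OP', '='), ('ID', 'a'), ('OP', '+'), ('ID', 'b'), ('OP', '*'), ('NUM', '3')]
```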
2. Syntax Analysis:
In this phase, the syntax analyzer takes the tokens produced by the lexical analyzer as input and generates a parse tree as output. The parser checks whether the expression formed by the tokens is syntactically correct, according to the rules that define the syntax of the source language.
Example: for the token stream id1 = id2 + id3 * 3, the parser builds a parse tree with = at the root, id1 as its left subtree, and the expression id2 + id3 * 3 as its right subtree, with the * node nested below + because multiplication binds tighter than addition.
3. Semantic Analysis:
In this phase, the semantic analyzer checks the source program for semantic errors and collects type information for code generation. It checks whether the constructs in the parse tree form a sensible set of instructions in the programming language. Type-checking is an important part of the semantic analyzer.
Example: in c = a + b * 3 with a, b and c declared as float, the semantic analyzer marks the integer constant 3 for conversion to float, which is why it appears as 3.0 in the intermediate code below.
4. Intermediate Code Generation:
If the program is syntactically and semantically correct, the intermediate code generator produces a simple, machine-independent intermediate representation. The intermediate code should be generated in such a way that it can easily be translated into the target machine code.
Example:
t1 = 3.0;
t2 = id3 * t1;
t3 = id2 + t2;
id1 = t3;
5. Code Optimization:
Code optimization improves the intermediate code so that the output program runs faster and takes less space. It removes unnecessary lines of code and rearranges the sequence of statements to speed up program execution without wasting resources.
Example:
t2 = id3 * 3.0;
id1 = id2 + t2;
6. Code Generation:
Code generation is the final stage of the compilation process. It takes the optimized intermediate code as input and maps it to the target machine language.
Example:
MOV R1, id3
MUL R1, #3.0
MOV R2, id2
ADD R1, R2
MOV id1, R1
Symbol Table
Symbol tables are data structures that are used by compilers to hold information about source-program constructs. The information is collected incrementally by the analysis phase of compiler and used by the synthesis phases to generate the target code. Entries in the symbol table contain information about an identifier such as its type, its position in storage, and any other relevant information.
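A minimal sketch of such a table in Python (the field names and byte sizes are illustrative assumptions):

```python
class SymbolTable:
    """Maps each identifier to its attributes (type and storage position)."""
    def __init__(self):
        self.entries = {}
        self.next_offset = 0              # next free position in storage

    def insert(self, name, type_, size):
        """Record an identifier's type and its position in storage."""
        self.entries[name] = {"type": type_, "offset": self.next_offset}
        self.next_offset += size

    def lookup(self, name):
        """Return an identifier's attributes, or None if it is undeclared."""
        return self.entries.get(name)

table = SymbolTable()
table.insert("a", "float", 4)
table.insert("b", "float", 4)
print(table.lookup("b"))   # {'type': 'float', 'offset': 4}
```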
Error Handling
Whenever an error is encountered during the compilation of the source program, an error handler is invoked. The error handler generates a suitable message reporting the error encountered.
2. Define token, pattern and lexeme with suitable examples. How can input buffering be implemented for a scanner? Explain.
Token
A token is a sequence of characters that can be treated as a single logical entity, for example identifiers, keywords, operators and constants.
Pattern
A pattern is a rule describing the set of strings in the input for which the same token is produced as output. For example, the pattern for identifiers in many languages is “a letter followed by zero or more letters, digits, or underscores.”
Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, in count = 10, the character sequence count is a lexeme matched by the identifier pattern.
Input Buffering
A lexical analyzer may need to look ahead several characters before a match can be announced. Two input-buffering schemes are useful when such lookahead is necessary:
- Buffer Pair
- Sentinels
Buffer Pair (2N Buffering):
In this technique the buffer is divided into two halves of N characters each, where N is the size of a disk block, such as 1024 or 4096. Rather than reading character by character from the file, we read N input characters at once. If fewer than N characters remain in the input, an EOF marker is placed in the buffer. Two pointers are maintained: the lexeme pointer and the forward pointer.
The lexeme pointer marks the start of the current lexeme, while the forward pointer scans ahead in the buffer. When the forward pointer reaches the end of one half, the other half is loaded and the forward pointer moves to the beginning of that half.
Sentinels:
With the buffer-pair technique alone, each time the forward pointer is moved we must check whether it has crossed the end of a buffer half, and when it reaches the end of the buffer the other half must be loaded. We can reduce these checks by placing a sentinel character at the end of each buffer half. The sentinel can be any character that is not part of the source program; EOF is usually preferred, since it also indicates the end of the source program. Reading a sentinel signals the need for some special action (fill the other buffer half, or terminate processing).
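A runnable Python sketch of the sentinel scheme (the class name, the tiny half-size N, and the use of '\0' as sentinel are illustrative assumptions, not from the original):

```python
N = 8                      # half-buffer size; real lexers use e.g. 1024 or 4096
SENTINEL = "\0"            # assumed never to appear in the source program

class TwoBufferReader:
    def __init__(self, text):
        self.text = text
        self.pos = 0                          # next unread position in the source
        self.buf = [SENTINEL] * (2 * N + 2)   # two halves, each followed by a sentinel
        self._load(0)                         # fill the first half
        self.fwd = -1                         # forward pointer

    def _load(self, start):
        """Fill one half starting at `start`; a sentinel marks its end (or EOF)."""
        chunk = self.text[self.pos:self.pos + N]
        self.pos += len(chunk)
        for i, ch in enumerate(chunk):
            self.buf[start + i] = ch
        self.buf[start + len(chunk)] = SENTINEL

    def next_char(self):
        """Advance the forward pointer; only one comparison in the common case."""
        self.fwd += 1
        if self.buf[self.fwd] != SENTINEL:    # common case: ordinary character
            return self.buf[self.fwd]
        if self.fwd == N:                     # end of first half: reload second
            self._load(N + 1)
            self.fwd += 1
        elif self.fwd == 2 * N + 1:           # end of second half: reload first
            self._load(0)
            self.fwd = 0
        else:
            return None                       # sentinel inside a half: true EOF
        c = self.buf[self.fwd]
        return None if c == SENTINEL else c

r = TwoBufferReader("c = a + b * 3")
s = ""
while (ch := r.next_char()) is not None:
    s += ch
print(s)   # c = a + b * 3
```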
3. Given the regular expression (0 + 1)*011, construct a DFA equivalent to this regular expression by computing followpos().
Given RE:
(0 + 1)* 011
The augmented RE of the given RE is:
(0 + 1)* 011 #
In the syntax tree for the augmented R.E., the leaves are numbered left to right: positions 1 = '0' and 2 = '1' (under the *), then 3 = '0', 4 = '1', 5 = '1', and 6 = '#'.
Calculating followpos:
followpos(1) = {1, 2, 3}
followpos(2) = {1, 2, 3}
followpos(3) = {4}
followpos(4) = {5}
followpos(5) = {6}
Now,
1) Start state of DFA = firstpos(root) = {1, 2, 3} = S0
Use followpos of the positions labeled by each input symbol to obtain the next states of the DFA.
2) From S0 = {1, 2, 3}
1 and 3 represent '0'
2 represents '1'
followpos({1, 3}) = {1, 2, 3, 4} = S1
δ(S0, 0) = S1
followpos({2}) = {1, 2, 3} = S0
δ(S0, 1) = S0
3) From S1 = {1, 2, 3, 4}
1, 3 → 0
2, 4 → 1
followpos({1, 3}) = {1, 2, 3, 4} = S1
δ(S1, 0) = S1
followpos({2, 4}) = {1, 2, 3, 5} = S2
δ(S1, 1) = S2
4) From S2 = {1, 2, 3, 5}
1, 3 → 0
2, 5 → 1
followpos({1, 3}) = {1, 2, 3, 4}= S1
δ(S2, 0) = S1
followpos({2, 5}) = {1, 2, 3, 6} = S3
δ(S2, 1) = S3
5) From S3 = {1, 2, 3, 6}
1, 3 → 0
2 → 1
6 → #
followpos({1, 3}) = {1, 2, 3, 4}= S1
δ(S3, 0) = S1
followpos({2}) = {1, 2, 3} = S0
δ(S3, 1) = S0
Now no new states occur.
Therefore, the final state is S3, since S3 contains position 6, the position of the end marker #.
The resulting DFA for the given regular expression is summarized by the following transition table (start state S0, final state S3):

| State | 0 | 1 |
| --- | --- | --- |
| S0 (start) | S1 | S0 |
| S1 | S1 | S2 |
| S2 | S1 | S3 |
| S3 (final) | S1 | S0 |
4. Explain the role of the parser. Write an algorithm for non-recursive predictive parsing.
A syntax analyzer or parser takes its input from the lexical analyzer in the form of a token stream. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.
The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. It must report any syntax errors that occur.
The tasks of the parser can be expressed as:
- Analyzes the context-free syntax
- Generates the parse tree
- Provides a mechanism for context-sensitive analysis
- Detects errors and tries to handle them
Algorithm for non-recursive predictive parsing (a runnable sketch follows the steps below):
1. Set ip to the first symbol of the input string.
2. Set the stack to $S where S is the start symbol of the grammar G.
3. Let X be the top stack symbol and ‘a’ be the symbol pointed by ip then
Repeat
a. If X is a terminal or ‘$’ then
i. If X = a then pop X from the stack and advance ip
ii. else error()
b. Else
If M[X,a] = Y1Y2Y3....... Yk then
1. Pop X from Stack
2. Push Yk , Yk-1, .....Y2, Y1 on the stack with Y1 on top.
3. Output the production X → Y1Y2Y3....... Yk
Until X = ‘$’.
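The following is a minimal Python sketch of this algorithm; the example grammar, parsing table M, and token conventions are illustrative assumptions, not part of the question:

```python
# Table-driven (non-recursive) predictive parser.
def predictive_parse(table, nonterminals, start, tokens):
    tokens = tokens + ["$"]           # step 1: ip at first symbol, $ marks the end
    stack = ["$", start]              # step 2: stack holds $S with S on top
    ip = 0
    while stack[-1] != "$":           # step 3: repeat until X = '$'
        X, a = stack[-1], tokens[ip]
        if X not in nonterminals:     # X is a terminal
            if X == a:
                stack.pop()           # match: pop X and advance ip
                ip += 1
            else:
                raise SyntaxError(f"expected {X!r}, found {a!r}")
        elif (X, a) in table:
            body = table[(X, a)]      # M[X, a] = Y1 Y2 ... Yk
            print(f"{X} -> {' '.join(body) or 'ε'}")   # output the production
            stack.pop()
            stack.extend(reversed(body))  # push Yk ... Y1, leaving Y1 on top
        else:
            raise SyntaxError(f"no entry M[{X}, {a}]")
    if tokens[ip] != "$":
        raise SyntaxError("extra input after parse")
    return True

# Illustrative LL(1) grammar: E -> T E' ; E' -> + T E' | ε ; T -> id
NONTERMINALS = {"E", "E'", "T"}
M = {("E", "id"): ["T", "E'"],
     ("T", "id"): ["id"],
     ("E'", "+"): ["+", "T", "E'"],
     ("E'", "$"): []}                 # ε-production
predictive_parse(M, NONTERMINALS, "E", ["id", "+", "id"])
```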
5. Consider the following grammar:
E → E + T | T
T → T * F | F
F → (E) | id
Compute the complete LR(0) collection of item sets from the above grammar.
The augmented grammar of the given grammar is:
E’→E
E→E+T | T
T→T*F | F
F→ (E) | id
Next, we obtain the canonical collection of sets of LR(0) items, as follows,
I0 = closure ({E‘ → •E})
= {E‘ → •E, E → •E + T, E → •T, T → •T * F, T → •F, F → • (E), F → •id}
goto(I0, E) = closure({E‘ → E•, E → E •+ T}) = {E‘ → E•, E → E •+ T} = I1
goto(I0, T) = closure({E → T•, T → T •* F}) = { E → T•, T → T •* F } = I2
goto(I0, F) = closure({T → F•}) = {T → F•} = I3
goto(I0, ( ) = closure({F → (•E)}) = {F → (•E), E → •E + T, E → •T, T → •T * F, T → •F, F → • (E), F → •id} = I4
goto(I0, id ) = closure({F → id•}) = {F → id•} = I5
goto(I1, + ) = closure({E → E + •T}) = {E → E +• T, T → •T * F, T → •F, F → • (E), F → •id} = I6
goto(I2, * ) = closure({T → T * •F}) = {T → T * •F, F → • (E), F → •id} = I7
goto(I4, E) = closure({F → (E•), E → E•+ T}) = {F → (E•), E → E•+ T } = I8
goto(I4, T) = closure({E → T•, T → T• * F}) = {E → T• , T → T•*F}= I2
goto(I4, F) = closure({T → F•}) = {T → F•} = I3
goto(I4, () = closure({F → (•E)}) = I4
goto(I4, id) = closure({F → id•}) = I5
goto(I6, T) = closure({E → E +T•, T → T•* F}) = {E → E +T•, T → T•* F} = I9
goto(I6, F) = closure ({T→ F•}) = I3
goto(I6, () = closure({F → (•E)}) = I4
goto(I6, id) = closure({F → id•}) = I5
goto(I7, F ) = closure({T → T * F•})= {T → T * F•} = I10
goto(I7, () = closure({F → (•E)}) = I4
goto(I7, id) = closure({F → id•}) = I5
goto(I8, ) ) = closure({F → (E) •}) = {F → (E) •} = I11
goto(I8, +) = closure({ E → E+•T} ) = I6
goto(I9, *) = closure({T → T*•F}) = { T → T*•F, F→ •(E), F→ •id } = I7
Now,
The canonical collection of sets of LR(0) items for the given grammar is {I0, I1, …, I11}, as computed above.
6. Show that the following grammar is not an LL(1) grammar.
Given grammar:
S → cAd
A → Ab | a
The FIRST and FOLLOW of the grammar symbols are:
FIRST(S) = {c}
FIRST(A) = {a}
FOLLOW(S) = {$}
FOLLOW(A) = {b, d}
Then, constructing the LL(1) parsing table:

| Non-terminal | a | b | c | d | $ |
| --- | --- | --- | --- | --- | --- |
| S | | | S → cAd | | |
| A | A → Ab, A → a | | | | |
Since M[A, a] contains two productions (A → Ab and A → a), the parsing table has a multiply-defined entry; hence the given grammar is not LL(1). (Indeed, A → Ab is left-recursive, and no left-recursive grammar can be LL(1).)
7. What do you mean by kernel and non-kernel items? Compute the kernel items for LR(0) for the following grammar:
S → CC
C → bC | d
Kernel Items
The initial item S‘ → •S, together with all items whose dot is not at the left end of the right-hand side, are called kernel items. For example, for a production A → αBβ, the kernel items are A → α•Bβ, A → αB•β and A → αBβ•.
Non-Kernel Items
All items with the dot at the left end of the right-hand side, except the initial item S‘ → •S, are called non-kernel items. For example, A → •αBβ.
Now,
The augmented grammar of the given grammar is,
S‘ → S
S → CC
C → bC | d
Next, we obtain the canonical collection of sets of LR(0) items, as follows,
I0 = closure ({S‘ → •S}) = {S‘ → •S, S → •CC, C → •bC, C → •d}
goto (I0, S) = closure ({S‘ → S•}) = {S‘ → S•} = I1
goto (I0, C) = closure ({S → C•C}) = {S → C•C, C → •bC, C→ •d} = I2
goto (I0, b) = closure ({C→ b•C}) = {C → b•C, C → •bC, C → •d} = I3
goto (I0, d) = closure ({C → d•}) = {C → d•} = I4
goto (I2, C) = closure ({S → CC•}) = {S → CC•} = I5
goto (I2, b) = closure ({C → b•C}) = {C → b•C, C → •bC, C → •d} = I3
goto (I2, d) = closure ({C → d•}) = {C → d•} = I4
goto (I3, C) = closure ({C → bC•}) = {C → bC•} = I6
goto (I3, b) = closure ({C → b•C}) = {C→ b•C, C → •bC, C → •d} = I3
goto (I3, d) = closure ({C → d•}) = {C → d•} = I4
Now the kernel items for LR(0) for the given grammar are:

| State | Kernel Items |
| --- | --- |
| I0 | S‘ → •S |
| I1 | S‘ → S• |
| I2 | S → C•C |
| I3 | C → b•C |
| I4 | C → d• |
| I5 | S → CC• |
| I6 | C → bC• |
8. What do you mean by S-attributed definition and how they are evaluated? Explain with example.
A syntax-directed definition that uses synthesized attributes exclusively is said to be an S-attributed definition. A parse tree for an S-attributed definition can always be annotated by evaluating the semantic rules for the attributes at each node in a bottom-up manner. The evaluation of an S-attributed definition is based on a depth-first traversal (DFT) of the annotated parse tree.
Example of an S-attributed definition: consider the usual desk-calculator grammar for arithmetic expressions, in which every attribute val is synthesized (the table below gives the standard definition):
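| Production | Semantic Rule |
| --- | --- |
| L → E return | print(E.val) |
| E → E1 + T | E.val = E1.val + T.val |
| E → T | E.val = T.val |
| T → T1 * F | T.val = T1.val * F.val |
| T → F | T.val = F.val |
| F → (E) | F.val = E.val |
| F → digit | F.val = digit.lexval |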
Attributes in S-attributed definitions are evaluated in bottom-up parsing, as the values of the parent nodes depend upon the values of the child nodes.
Consider input string: 5+3*4 return
Depth-first traversal (DFT) of the annotated parse tree evaluates val at each node bottom-up, yielding 5 + 3 * 4 = 17 at the root.
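The same evaluation can be sketched in Python; the Node class and the condensed tree shape below are illustrative assumptions:

```python
# Bottom-up (post-order) evaluation of the synthesized attribute 'val'.
class Node:
    def __init__(self, label, children=(), lexval=None):
        self.label = label            # 'digit', '+', or '*'
        self.children = list(children)
        self.lexval = lexval          # lexical value for digit leaves
        self.val = None               # synthesized attribute

def evaluate(node):
    for child in node.children:       # children first: bottom-up order
        evaluate(child)
    if node.label == "digit":
        node.val = node.lexval        # F -> digit : F.val = digit.lexval
    elif node.label == "+":
        node.val = node.children[0].val + node.children[1].val
    elif node.label == "*":
        node.val = node.children[0].val * node.children[1].val
    return node.val

# condensed tree for 5 + 3 * 4 (the * subtree is evaluated before +)
tree = Node("+", [Node("digit", lexval=5),
                  Node("*", [Node("digit", lexval=3),
                             Node("digit", lexval=4)])])
print(evaluate(tree))   # 17
```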
9. What do you mean by three-address code representation? Explain with example.
Three-address code is a type of intermediate code which is easy to generate and can be easily converted to machine code. It uses at most three addresses and one operator to represent an expression, and the value computed at each instruction is stored in a temporary variable generated by the compiler.
There are 3 representations of three address code:
1. Quadruples:
It is a structure with four fields: op, arg1, arg2 and result. op denotes the operator, arg1 and arg2 denote the two operands, and result stores the result of the expression. The three-address statement x = y op z is represented by placing y in arg1, z in arg2 and x in result. For example, for the statement a = b * -c + b * -c, the quadruples are shown below.
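The statement first compiles to three-address code (t1–t5 are compiler-generated temporaries; this reconstructs the standard textbook listing):

t1 = minus c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a = t5

| # | op | arg1 | arg2 | result |
| --- | --- | --- | --- | --- |
| (0) | minus | c | | t1 |
| (1) | * | b | t1 | t2 |
| (2) | minus | c | | t3 |
| (3) | * | b | t3 | t4 |
| (4) | + | t2 | t4 | t5 |
| (5) | = | t5 | | a |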
2. Triples:
This representation doesn’t use an extra temporary variable to hold the result of each operation; instead, when a reference to another triple’s value is needed, a pointer to that triple is used. It therefore consists of only three fields: op, arg1 and arg2. For example, for the statement a = b * -c + b * -c, the triples are shown below.
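Using the same three-address code as above, a parenthesized number such as (0) refers to the value computed by that triple:

| # | op | arg1 | arg2 |
| --- | --- | --- | --- |
| (0) | minus | c | |
| (1) | * | b | (0) |
| (2) | minus | c | |
| (3) | * | b | (2) |
| (4) | + | (1) | (3) |
| (5) | = | a | (4) |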
3. Indirect triples:
This representation uses a separately stored list of pointers to triples: the program is a listing of pointers to triples rather than a listing of the triples themselves. For example, the indirect-triple representation of the same statement above is shown below.
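The triples are stored as in the previous table, and the instruction list holds pointers into that table (the list positions 35–40 are arbitrary illustrative numbers):

| instruction | points to |
| --- | --- |
| 35 | (0) |
| 36 | (1) |
| 37 | (2) |
| 38 | (3) |
| 39 | (4) |
| 40 | (5) |

Because the program is a list of pointers, an optimizer can reorder statements without renumbering the references among triples.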
10. How next-use information is useful in code-generation? Explain the steps involved on computing next-use information.
Next-use information is needed for dead-code elimination and register allocation. It is computed by a backward scan of a basic block, and it indicates, for a variable defined at the current statement, the statement number at which that variable will be used again.
The following are the steps involved in generating next-use information:
Input: Basic block B of three-address statements
Output: at each statement i: x = y op z, we attach to i the liveness and next-use information of x, y and z.
Method: We start at the last statement of B and scan backwards.
1. Attach to statement i the information currently found in the symbol table regarding the next use and liveness of x, y and z.
2. In the symbol table, set x to “not live” and “no next use”.
3. In the symbol table, set y and z to “live” and the next uses of y and z to i.
(Steps 2 and 3 must be applied in this order, in case x is also one of the operands y or z.) A sketch of this backward scan follows.
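A minimal Python sketch, assuming each statement is a tuple (x, op, y, z) and the symbol table is a dict of liveness/next-use records (both representations are illustrative assumptions):

```python
def compute_next_use(block, symtab):
    """Backward scan of a basic block, attaching liveness/next-use info
    to each statement and updating the symbol table as it goes."""
    attached = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):      # start at the last statement
        x, op, y, z = block[i]
        # step 1: attach the table's current info for x, y and z to statement i
        attached[i] = {v: dict(symtab[v]) for v in (x, y, z)}
        # step 2: x is defined here, so before this point it is dead
        symtab[x] = {"live": False, "nextuse": None}
        # step 3: y and z are used here, so they are live with next use at i
        symtab[y] = {"live": True, "nextuse": i}
        symtab[z] = {"live": True, "nextuse": i}
    return attached

# block for:  i: a := b + c   j: t := a + b   (statements 0 and 1)
block = [("a", "+", "b", "c"), ("t", "+", "a", "b")]
symtab = {v: {"live": True, "nextuse": None} for v in "abct"}
for i, info in enumerate(compute_next_use(block, symtab)):
    print(i, info)
```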
Example: live and nextuse computation for the two-statement block
i: a := b + c
j: t := a + b
| Statement | Instruction | Initialized liveness and nextuse | After scanning statement j | After scanning statement i | Comments |
| --- | --- | --- | --- | --- | --- |
| i | a := b + c | live(a)=true, live(b)=true, live(t)=true; nextuse(a)=nextuse(b)=nextuse(t)=none | live(a)=true, nextuse(a)=j; live(b)=true, nextuse(b)=j; live(t)=false, nextuse(t)=none | live(a)=false, nextuse(a)=none; live(b)=true, nextuse(b)=i; live(c)=true, nextuse(c)=i; live(t)=false, nextuse(t)=none | a is set to not live with no next use; b and c are live at statement i and their next use is at i; t's liveness and nextuse remain as computed at statement j |
| j | t := a + b | live(a)=true, live(b)=true, live(t)=true; nextuse(a)=nextuse(b)=nextuse(t)=none | live(a)=true, nextuse(a)=j; live(b)=true, nextuse(b)=j; live(t)=false, nextuse(t)=none | unchanged from the previous column | a and b are used here, so their next use is at j and they are live; t is computed here, so nextuse(t)=none and live(t)=false |