Compiler Design and Construction 2069

Tribhuvan University
Institute of Science and Technology
2069
Bachelor Level / Sixth Semester / Science
Computer Science and Information Technology ( CSC-352 )
( Compiler Design and Construction )
Full Marks: 60
Pass Marks: 24
Time: 3 hours
Candidates are required to give their answers in their own words as far as practicable.
The figures in the margin indicate full marks.

Attempt all questions.                    (10x6=60)

1.  Explain the phases of a compiler with a block diagram.
6 marks

A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below:

1. Analysis phase breaks up the source program into constituent pieces and creates an intermediate representation of the source program. Analysis of source program includes: lexical analysis, syntax analysis and semantic analysis.

2. Synthesis phase constructs the desired target program from the intermediate representation. The synthesis part of the compiler consists of the following phases: intermediate code generation, code optimization and target code generation.

Phases of a compiler:

1. Lexical Analysis:

In this phase, the lexical analyzer reads the source program and returns its tokens. A token is a sequence of characters that can be treated as a single logical entity (such as identifiers, operators, keywords, constants, etc.).

Example:

    Input String: c = a + b * 3

    Tokens: id1 = id2 + id3 * 3
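As an illustrative sketch (the token names and patterns are assumptions, not a fixed standard), a simple lexical analyzer can be built from regular expressions:

```python
# Minimal regex-based tokenizer sketch; TOKEN_SPEC is illustrative.
import re

TOKEN_SPEC = [
    ('NUM',  r'\d+'),            # integer constants
    ('ID',   r'[A-Za-z_]\w*'),   # identifiers
    ('OP',   r'[+\-*/=]'),       # operators
    ('SKIP', r'\s+'),            # whitespace (discarded)
]
MASTER = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC))

def tokenize(src):
    """Return (token_name, lexeme) pairs for the recognized input."""
    tokens = []
    for m in MASTER.finditer(src):
        if m.lastgroup != 'SKIP':
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

For the input string `c = a + b * 3`, `tokenize` produces the identifier, operator and constant tokens shown above.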

2. Syntax Analysis:

In this phase, the syntax analyzer takes the tokens produced by the lexical analyzer as input and generates a parse tree as output. In the syntax analysis phase, the parser checks whether the expression formed by the tokens is syntactically correct, according to the rules that define the syntax of the source language.

Example:


3. Semantic Analysis:

In this phase, the semantic analyzer checks the source program for semantic errors and collects the type information for code generation. The semantic analyzer checks whether the instructions form a sensible set in the programming language. Type checking is an important part of semantic analysis.

Example:


4. Intermediate Code Generation:

If the program is syntactically and semantically correct, then the intermediate code generator generates simple machine-independent intermediate code. The intermediate code should be generated in such a way that it can easily be translated into the target machine code.

Example:

    t1 = 3.0;

    t2 = id3 * t1;

    t3 = id2 + t2;

    id1 = t3;

5. Code Optimization:

It improves the intermediate code so that the output of the program runs faster and takes less space. It removes unnecessary lines of code and rearranges the sequence of statements in order to speed up program execution without wasting resources.

Example:

    t2 = id3 * 3.0;

    id1 = id2 + t2;

6. Code Generation:

Code generation is the final stage of the compilation process. It takes the optimized intermediate code as input and maps it to the target machine language.

Example:

    MOV  R1, id3

    MUL  R1, #3.0

    MOV  R2, id2

    ADD  R1, R2

    MOV  id1, R1

Symbol Table

Symbol tables are data structures used by compilers to hold information about source-program constructs. The information is collected incrementally by the analysis phases of the compiler and used by the synthesis phases to generate the target code. Entries in the symbol table contain information about an identifier such as its type, its position in storage, and any other relevant information.

Error Handling

Whenever an error is encountered during the compilation of the source program, an error handler is invoked. The error handler generates a suitable error message describing the error encountered.

2.  Define token, pattern and lexeme with suitable examples. How can input buffering be implemented for a scanner? Explain.

6 marks

Token

Token is a sequence of characters that can be treated as a single logical entity. For example, Identifiers, Keywords, Operators, Constants etc.

Pattern

A pattern is a rule describing the set of strings in the input for which the same token is produced as output. For example, "a letter followed by zero or more letters, digits, or underscores."

Lexeme

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. 


Input Buffering

Lexical analysis may need to look ahead several characters before a match can be announced. There are two input buffering schemes that are useful when lookahead is necessary.

  • Buffer Pair
  • Sentinels

Buffer Pair (2N Buffering):

In this technique, the buffer is divided into two halves of N characters each, where N is the size of a disk block, such as 1024 or 4096. Rather than reading character by character from the file, we read N input characters at once. If there are fewer than N characters in the input, an EOF marker is placed. There are two pointers: the lexeme pointer and the forward pointer.

The lexeme pointer points to the start of the lexeme while the forward pointer scans the input buffer for the end of the lexeme. When the forward pointer reaches the end of one half, the other half is loaded and the forward pointer points to the beginning of that half.

Sentinels:

While using the buffer-pair technique, we have to check each time the forward pointer is moved that it does not cross the buffer half, and when it reaches the end of the buffer, the other half needs to be loaded. We can solve this problem by introducing a sentinel character at the end of each buffer half. The sentinel can be any character that is not part of the source program; EOF is usually preferred, as it also indicates the end of the source program. It signals the need for some special action (fill the other buffer half, or terminate processing).
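A minimal sketch of the buffer-pair scheme with sentinels, assuming a tiny half-buffer size N (real compilers use a disk-block size) and '\0' as the sentinel character:

```python
# Buffer-pair input scheme with a sentinel closing each half.
# Layout: [half 1 (N chars)][sentinel][half 2 (N chars)][sentinel]
import io

N = 8            # half-buffer size (tiny for illustration)
EOF = '\0'       # sentinel, assumed not to occur in the source

class TwoBuffer:
    def __init__(self, f):
        self.f = f
        self.buf = [EOF] * (2 * N + 2)
        self._load(0)                    # fill the first half
        self.fwd = -1                    # forward pointer (pre-start)

    def _load(self, start):
        """Read up to N characters into a half and mark its end."""
        data = self.f.read(N)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        self.buf[start + len(data)] = EOF

    def next_char(self):
        """Advance the forward pointer; reload a half on its sentinel."""
        self.fwd += 1
        ch = self.buf[self.fwd]
        if ch == EOF:
            if self.fwd == N:            # sentinel of first half
                self._load(N + 1)
                self.fwd = N + 1
                ch = self.buf[self.fwd]
            elif self.fwd == 2 * N + 1:  # sentinel of second half
                self._load(0)
                self.fwd = 0
                ch = self.buf[self.fwd]
            # otherwise: genuine end of the source program
        return ch
```

Only one check per character (`ch == EOF`) is needed, which is the point of the sentinel; the lexeme pointer is omitted here for brevity.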


3.  Given the regular expression (0 + 1)* 011, construct a DFA equivalent to this regular expression by computing followpos().

6 marks

Given RE;

(0 + 1)* 011

The augmented RE of the given RE is:

(0 + 1)* 011 #

Syntax tree for augmented R.E. is


Calculating followpos:

followpos(1) = {1, 2, 3}

followpos(2) = {1, 2, 3}

followpos(3) = {4}

followpos(4) = {5}

followpos(5) = {6}

Now,

1) Start state of DFA = firstpos(root) = {1, 2, 3} = S0

Use the followpos of the symbols representing positions in the RE to obtain the next states of the DFA.

2) From S0 = {1, 2, 3}

    1 and 3 represent '0'

    2 represents '1'

followpos({1, 3}) = {1, 2, 3, 4} = S1

    δ(s0, 0) = S1

followpos({2}) = {1, 2, 3} = S0

    δ(S0, 1) = S0

3) From S1 = {1, 2, 3, 4}

    1, 3 → 0

    2, 4 → 1

followpos({1, 3}) = {1, 2, 3, 4} = S1

     δ(S1, 0) = S1

followpos({2, 4}) = {1, 2, 3, 5} = S2

     δ(S1, 1) = S2

4) From S2 = {1, 2, 3, 5}

    1, 3 → 0

    2, 5 → 1

followpos({1, 3}) = {1, 2, 3, 4}= S1

    δ(S2, 0) = S1

followpos({2, 5}) = {1, 2, 3, 6} = S3

     δ(S2, 1) = S3

5) From S3 = {1, 2, 3, 6}

    1, 3 → 0

    2 → 1

    6 → #

followpos({1, 3}) = {1, 2, 3, 4}= S1

    δ(S3, 0) = S1

followpos({2}) = {1, 2, 3} = S0

    δ(S3, 1) = S0

Since no new states occur, the construction is complete.

Therefore, the final state = {S3} (the state containing position 6, i.e., #).
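As a sanity check (not part of the construction), the transition function δ computed above can be simulated directly; strings ending in 011 should be accepted:

```python
# Transition table of the DFA derived above: states S0..S3, final {S3}.
DELTA = {('S0', '0'): 'S1', ('S0', '1'): 'S0',
         ('S1', '0'): 'S1', ('S1', '1'): 'S2',
         ('S2', '0'): 'S1', ('S2', '1'): 'S3',
         ('S3', '0'): 'S1', ('S3', '1'): 'S0'}

def accepts(s, start='S0', final=frozenset({'S3'})):
    """Run the DFA on string s and report acceptance."""
    state = start
    for ch in s:
        state = DELTA[(state, ch)]
    return state in final
```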

Now the resulting DFA of the given regular expression is:


4.  Explain the role of the parser. Write an algorithm for non-recursive predictive parsing.

6 marks

A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.

The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language. It has to report any syntax errors that occur.


The tasks of the parser can be expressed as:

  • Analyzes the context-free syntax
  • Generates the parse tree
  • Provides the mechanism for context-sensitive analysis
  • Determines the errors and tries to handle them

Algorithm for non-recursive predictive parsing:

1. Set ip to the first symbol of the input string.

2. Set the stack to $S where S is the start symbol of the grammar G.

3. Let X be the top stack symbol and ‘a’ be the symbol pointed by ip then

  Repeat

    a. If X is a terminal or ‘$’ then

        i. If X = a then pop X from the stack and advance ip

        ii. else error()

    b. Else

        If M[X,a] = Y1Y2Y3....... Yk then

            1. Pop X from Stack

            2. Push Yk , Yk-1, .....Y2, Y1 on the stack with Y1 on top.

            3. Output the production X → Y1Y2Y3....... Yk

Until X = ‘$’.
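The algorithm above can be sketched as a table-driven parser. The grammar S → aSb | ε and its parsing table M below are hypothetical examples for illustration, not part of the question:

```python
# Non-recursive predictive (LL(1)) parser sketch.
def predictive_parse(tokens, table, start, terminals):
    stack = ['$', start]                 # step 2: stack is $S, S on top
    tokens = list(tokens) + ['$']        # step 1: ip at first symbol
    ip, output = 0, []
    while stack[-1] != '$':              # repeat until X = '$'
        X, a = stack[-1], tokens[ip]
        if X in terminals:               # step a: X is a terminal
            if X != a:
                raise SyntaxError(f"expected {X}, got {a}")
            stack.pop(); ip += 1
        else:                            # step b: consult M[X, a]
            if (X, a) not in table:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            rhs = table[(X, a)]
            stack.pop()
            stack.extend(reversed(rhs))  # push Yk..Y1 with Y1 on top
            output.append((X, rhs))      # output X -> Y1Y2...Yk
    if tokens[ip] != '$':
        raise SyntaxError("unconsumed input")
    return output

M = {('S', 'a'): ['a', 'S', 'b'],        # S -> aSb
     ('S', 'b'): [],                     # S -> ε  (b is in FOLLOW(S))
     ('S', '$'): []}                     # S -> ε  ($ is in FOLLOW(S))
```

Running `predictive_parse("aabb", M, 'S', {'a', 'b'})` yields the left-derivation productions; a string like "abb" raises a syntax error.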

5.  Consider the grammar:

    E → E + T | T
    T → T * F | F
    F → (E) | id

Compute the complete LR(0) collection of item set from above grammar.

6 marks

The augmented grammar of given grammar is;

    E’→E

    E→E+T | T

    T→T*F | F

    F→ (E) | id

Next, we obtain the canonical collection of sets of LR(0) items, as follows,

I0 = closure ({E‘ → •E})

    = {E‘ → •E, E → •E + T, E → •T, T → •T * F, T → •F, F → • (E), F → •id}

goto(I0, E) = closure({E‘ → E•, E → E •+ T}) = {E‘ → E•, E → E •+ T} = I1

goto(I0, T) = closure({E → T•, T → T •* F}) = { E → T•, T → T •* F } = I2

goto(I0, F) = closure({T → F•}) = {T → F•} = I3

goto(I0, ( ) = closure({F → (•E)}) = {F → (•E), E → •E + T, E → •T, T → •T * F, T → •F, F → • (E), F → •id} = I4

goto(I0, id ) = closure({F → id•}) = {F → id•} = I5

goto(I1, + ) = closure({E → E + •T}) = {E → E +• T, T → •T * F, T → •F, F → • (E), F → •id} = I6

goto(I2, * ) = closure({T → T * •F}) = {T → T * •F, F → • (E), F → •id} = I7

goto(I4, E) = closure({F → (E•), E → E•+ T}) = {F → (E•), E → E•+ T } = I8

goto(I4, T) = closure({E → T•, T → T• * F}) = {E → T• , T → T•*F}= I2

goto(I4, F) = closure({T → F•}) = {T → F•} = I3

goto(I4, () = closure({F → (•E)}) = I4

goto(I4, id) = closure({F → id•}) = I5

goto(I6, T) = closure({E → E +T•, T → T•* F}) = {E → E +T•, T → T•* F} = I9

goto(I6, F) = closure ({T→ F•}) = I3

goto(I6, () = closure({F → (•E)}) = I4

goto(I6, id) = closure({F → id•}) = I5

goto(I7, F ) = closure({T → T * F•})= {T → T * F•} = I10

goto(I7, () = closure({F → (•E)}) = I4

goto(I7, id) = closure({F → id•}) = I5

goto(I8, ) ) = closure({F → (E) •}) = {F → (E) •} = I11

goto(I8, +) = closure({ E → E+•T} ) = I6

goto(I9, *) = closure({T → T*•F}) = { T → T*•F, F→ •(E), F→ •id } = I7

Now,

The canonical collection of sets of LR(0) items for the given grammar is {I0, I1, ..., I11}, as computed above.


6.  Show that the following grammar is not an LL(1) grammar.

        S → cAd
        A → Ab | a

6 marks

Given grammar,

        S → cAd
        A → Ab | a

The FIRST and FOLLOW of the grammar symbols are:

FIRST(S) = {c}

FIRST(A) = {a}

FOLLOW(S) = {$}

FOLLOW(A) = {b, d}

Then, constructing the LL(1) parsing table:

    Non-terminal      a                 b         c            d         $

    S                                             S → cAd

    A                 A → Ab
                      A → a

Since there are two entries in M[A, a] of the LL(1) parsing table, the given grammar is not an LL(1) grammar.

7.  What do you mean by kernel and non-kernel items? Compute the kernel items for LR(0) for the following grammar.

        S → CC
        C → bC | d

6 marks

Kernel Items

The initial item S‘ → •S, together with all items whose dot is not at the left end of the R.H.S., are called kernel items. For example, for a production A → αBβ (with α non-empty):

    Kernel items are A → α•Bβ, A → αB•β, A → αBβ•

Non-Kernel Items

All items with the dot at the left end of the R.H.S., except the initial item S‘ → •S, are called non-kernel items. For example, A → •αBβ.


Now,

The augmented grammar of the given grammar is,

    S‘ → S

    S → CC

    C → bC | d

Next, we obtain the canonical collection of sets of LR(0) items, as follows,

I0 = closure ({S‘ → •S}) = {S‘ → •S, S → •CC, C → •bC,  C → •d}

goto (I0, S) = closure ({S‘ → S•}) = {S‘ → S•} = I1

goto (I0, C) = closure ({S → C•C}) = {S → C•C, C → •bC, C→ •d} = I2

goto (I0, b) = closure ({C→ b•C}) = {C → b•C, C → •bC, C → •d} = I3

goto (I0, d) = closure ({C → d•}) = {C → d•} = I4

goto (I2, C) = closure ({S → CC•}) = {S → CC•} = I5

goto (I2, b) = closure ({C → b•C}) = {C → b•C, C → •bC, C → •d} = I3

goto (I2, d) = closure ({C → d•}) = {C → d•} = I4

goto (I3, C) = closure ({C → bC•}) = {C → bC•} = I6

goto (I3, b) = closure ({C → b•C}) = {C→ b•C, C → •bC, C → •d} = I3

goto (I3, d) = closure ({C → d•}) = {C → d•} = I4

Now the kernel items for LR(0) for the given grammar are:

    States      Kernel Items

    I0          S‘ → •S
    I1          S‘ → S•
    I2          S → C•C
    I3          C → b•C
    I4          C → d•
    I5          S → CC•
    I6          C → bC•
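The closure and goto computations above can be sketched in code; the tuple encoding of items is an illustrative assumption:

```python
# LR(0) closure and goto for the augmented grammar S' -> S, S -> CC,
# C -> bC | d. An item is a tuple (lhs, rhs, dot_position).
GRAMMAR = {"S'": [("S",)], "S": [("C", "C")], "C": [("b", "C"), ("d",)]}

def closure(items):
    """Repeatedly add B -> •γ for every item with the dot before B."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    if (rhs[dot], prod, 0) not in items:
                        items.add((rhs[dot], prod, 0))
                        changed = True
    return frozenset(items)

def goto(items, X):
    """Advance the dot over X in every item where X follows the dot."""
    return closure({(l, r, d + 1) for l, r, d in items
                    if d < len(r) and r[d] == X})

I0 = closure({("S'", ("S",), 0)})   # {S'->.S, S->.CC, C->.bC, C->.d}
```

The kernel of each state is exactly the set of advanced items passed into `closure` by `goto` (plus S‘ → •S for I0).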

8.  What do you mean by S-attributed definition and how they are evaluated? Explain with example.

6 marks

A syntax-directed definition that uses synthesized attributes exclusively is said to be an S-attributed definition. A parse tree for an S-attributed definition can always be annotated by evaluating the semantic rules for the attributes at each node in a bottom-up manner. The evaluation of an S-attributed definition is based on a depth-first traversal (DFT) of the annotated tree.

Example for S-attributed definitions:

Let us consider the grammar for arithmetic expressions, with synthesized attribute val:

    Production        Semantic Rule
    L → E return      print(E.val)
    E → E1 + T        E.val = E1.val + T.val
    E → T             E.val = T.val
    T → T1 * F        T.val = T1.val * F.val
    T → F             T.val = F.val
    F → digit         F.val = digit.lexval

Attributes in S-attributed definitions are evaluated in bottom-up parsing, as the values of the parent nodes depend upon the values of the child nodes.

Consider input string: 5+3*4 return

Depth first Traversal (DFT) of annotated graph:
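The bottom-up evaluation can be sketched as follows; the Node helper and the hand-built parse tree for 5 + 3 * 4 are illustrative assumptions:

```python
# Postorder (bottom-up) evaluation of synthesized val attributes for
# the annotated parse tree of 5 + 3 * 4.
class Node:
    def __init__(self, label, children=(), val=None):
        self.label, self.children, self.val = label, list(children), val

def evaluate(node):
    if not node.children:              # leaf: digit token or operator
        return node.val
    for child in node.children:        # evaluate children first (DFT)
        evaluate(child)
    if len(node.children) == 3:        # E -> E + T  or  T -> T * F
        left, op, right = node.children
        node.val = (left.val + right.val if op.label == '+'
                    else left.val * right.val)
    else:                              # unit productions E->T, T->F, F->digit
        node.val = node.children[0].val
    return node.val

def F(n):                              # F -> digit with digit.lexval = n
    return Node('F', [Node('digit', val=n)])

# Parse tree for 5 + 3 * 4:  E -> E + T, where T -> T * F
tree = Node('E', [Node('E', [Node('T', [F(5)])]),
                  Node('+'),
                  Node('T', [Node('T', [F(3)]), Node('*'), F(4)])])
```

Each node's val is computed only after its children's vals, so `evaluate(tree)` annotates the root with 5 + 3 * 4 = 17.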

9.  What do you mean by three-address code representation? Explain with example.

6 marks

Three-address code is a type of intermediate code which is easy to generate and can be easily converted to machine code. It makes use of at most three addresses and one operator to represent an expression, and the value computed at each instruction is stored in a temporary variable generated by the compiler.

There are 3 representations of three address code:

1. Quadruples:
It is a structure with four fields: op, arg1, arg2 and result. op denotes the operator, arg1 and arg2 denote the two operands, and result stores the result of the expression. The three-address statement x = y op z is represented by placing y in arg1, z in arg2 and x in result. For example, for the statement a = b * -c + b * -c, the quadruples are:

        op        arg1    arg2    result
    (0) uminus    c               t1
    (1) *         b       t1      t2
    (2) uminus    c               t3
    (3) *         b       t3      t4
    (4) +         t2      t4      t5
    (5) =         t5              a

2. Triples:

This representation doesn't use an extra temporary variable to represent a single operation; instead, when a reference to another triple's value is needed, a pointer to that triple is used. So it consists of only three fields: op, arg1 and arg2. For example, for the statement a = b * -c + b * -c, the triples are:

        op        arg1    arg2
    (0) uminus    c
    (1) *         b       (0)
    (2) uminus    c
    (3) *         b       (2)
    (4) +         (1)     (3)
    (5) =         a       (4)

3. Indirect triples:
This representation uses a list of pointers to triples, which is made and stored separately, rather than listing the triples themselves. For example, the indirect triples for the same statement above are:

    Statement list:        Triples:
    35: (0)            (0) uminus    c
    36: (1)            (1) *         b       (0)
    37: (2)            (2) uminus    c
    38: (3)            (3) *         b       (2)
    39: (4)            (4) +         (1)     (3)
    40: (5)            (5) =         a       (4)
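The quadruple construction can be sketched as a recursive walk over an expression tree; the tuple-based AST and the temporary-naming scheme are illustrative assumptions:

```python
# Generate three-address quadruples (op, arg1, arg2, result) for
# a = b * -c + b * -c from a tuple-encoded AST.
quads, _tmp = [], [0]

def newtemp():
    """Return a fresh compiler-generated temporary name t1, t2, ..."""
    _tmp[0] += 1
    return f"t{_tmp[0]}"

def gen(node):
    # node is a str (variable name) or a tuple (op, operand[, operand])
    if isinstance(node, str):
        return node
    if node[0] == 'uminus':                  # unary minus
        arg = gen(node[1])
        t = newtemp()
        quads.append(('uminus', arg, '', t))
        return t
    op, l, r = node                          # binary operator
    a1, a2 = gen(l), gen(r)
    t = newtemp()
    quads.append((op, a1, a2, t))
    return t

expr = ('+', ('*', 'b', ('uminus', 'c')), ('*', 'b', ('uminus', 'c')))
quads.append(('=', gen(expr), '', 'a'))      # final assignment to a
```

Running this produces the six quadruples of the table above, with each subexpression's value stored in a fresh temporary.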


10.  How next-use information is useful in code-generation? Explain the steps involved on computing next-use information.

6 marks view

Next-use information is needed for dead-code elimination and register allocation. It is computed by a backward scan of a basic block. The next-use information indicates the statement number at which a particular variable defined at the current position will be used again.

The following are the steps involved in generating next-use information: 

Input: Basic block B of three-address statements

Output: At each statement i: x = y op z, we attach to i the liveness and next-use information of x, y and z.

Method: We start at the last statement of B and scan backwards.

    1. Attach to statement i the information currently found in the symbol table regarding the next use and liveness of x, y and z.

    2. In the symbol table, set x to "not live" and "no next use".

    3. In the symbol table, set y and z to "live", and the next uses of y and z to i.
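The three steps above can be sketched as follows; the tuple encoding of statements and the live-on-exit convention (user variables live, temporaries dead) are assumptions for illustration:

```python
# Backward scan computing liveness and next-use for a basic block of
# statements x := y op z, each encoded as a (x, y, z) tuple.
def next_use(block, live_on_exit):
    # symbol-table state: var -> (live, next_use_index or None)
    state = {v: (True, None) for v in live_on_exit}
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):   # scan backwards
        x, y, z = block[i]
        # 1. attach the current symbol-table info to statement i
        info[i] = {v: state.get(v, (False, None)) for v in (x, y, z)}
        # 2. x is defined here: not live, no next use (before this point)
        state[x] = (False, None)
        # 3. y and z are used here: live, with next use at i
        state[y] = (True, i)
        state[z] = (True, i)
    return info

block = [('a', 'b', 'c'),    # i: a := b + c
         ('t', 'a', 'b')]    # j: t := a + b
info = next_use(block, live_on_exit={'a', 'b', 'c'})
```

For the two-statement block of the example below, this reproduces the table: at j, 'a' and 'b' are live with next use at j and 't' is dead; at i, 'b' and 'c' get next use i while 'a' becomes dead.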

  

Example of live and nextuse() computation:

i: a := b + c

j: t := a + b

Statement j (t := a + b), processed first in the backward scan:

    Initialized: live(a) = true, live(b) = true, live(t) = true; nextuse(a) = none, nextuse(b) = none, nextuse(t) = none.

    After processing j: live(a) = true, nextuse(a) = j; live(b) = true, nextuse(b) = j; live(t) = false, nextuse(t) = none.

    Comment: 'a' and 'b' are used here, hence their next use is at j and they are live. 't' is computed here, hence its next use is none and it is not live.

Statement i (a := b + c), processed next:

    After processing i: live(a) = false, nextuse(a) = none; live(b) = true, nextuse(b) = i; live(c) = true, nextuse(c) = i; live(t) = false, nextuse(t) = none.

    Comment: 'a' is set to not live with no next use. 'b' and 'c' are live at statement i and their next use is at i. 't''s liveness and next use remain as computed at statement j.