Compiler Design and Construction - Old Questions

2.  Why are regular expressions used in token specification? Write the regular expression to specify identifiers as in C.

6 marks | Asked in 2070

The lexical analyzer needs to scan and identify only the valid strings/tokens/lexemes that belong to the language at hand. It searches for the patterns defined by the language rules. Regular expressions can express these patterns, giving a compact and precise notation for the set of strings that forms each token; this is why regular expressions are used in token specification.

Regular expressions are algebraic expressions used to describe the tokens of a programming language. They are built from three regular operations, called union (or), concatenation, and Kleene star; parentheses ( and ) are used for grouping. A small worked example follows the list below.

  • Union of two languages L and M is written as

                L U M = {s | s is in L or s is in M}

  • Concatenation of two languages L and M is written as

                LM = {st | s is in L and t is in M}

  • The Kleene closure of a language L is written as

                L* = {ε} U L U LL U LLL U ... (zero or more occurrences of strings from L)

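For example, if L = {a, ab} and M = {b, ba}, then:

                L U M = {a, ab, b, ba}

                LM = {ab, aba, abb, abba}

                L* = {ε, a, ab, aa, aab, aba, abab, ...}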

Now,

In C, the regular expression for identifiers can be written using regular definitions as:

    letter → A | B | ... | Z | a | b | ... | z | _

    digit → 0 | 1 | 2 | ... | 9

    identifier → letter (letter | digit)*

The regular expression representing identifiers without using a regular definition is:

(A | B | C | ... | Z | a | b | c | ... | z | _) (A | B | C | ... | Z | a | b | c | ... | z | _ | 0 | 1 | ... | 9)*
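A minimal sketch in C of how this pattern can be checked is given below; the helper name is_identifier is a hypothetical choice for illustration, and in practice a lexical analyzer would normally be generated from the regular expression by a tool such as lex/flex rather than written by hand.

    #include <ctype.h>
    #include <stdio.h>

    /* Sketch: returns 1 if s matches letter(letter | digit)*, where
       letter is A-Z, a-z, or '_', and digit is 0-9; returns 0 otherwise. */
    static int is_identifier(const char *s)
    {
        /* First character must be a letter or underscore. */
        if (!(isalpha((unsigned char)s[0]) || s[0] == '_'))
            return 0;

        /* Remaining characters may be letters, digits, or underscores. */
        for (int i = 1; s[i] != '\0'; i++) {
            if (!(isalnum((unsigned char)s[i]) || s[i] == '_'))
                return 0;
        }
        return 1;
    }

    int main(void)
    {
        printf("%d\n", is_identifier("count_1"));  /* prints 1: valid identifier   */
        printf("%d\n", is_identifier("1count"));   /* prints 0: starts with a digit */
        return 0;
    }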