Compiler Design and Construction - Old Questions
2. Why are regular expressions used in token specification? Write the regular expression to specify identifiers as in C.
The lexical analyzer needs to scan and identify only the valid strings/tokens/lexemes that belong to the language at hand. It searches for the patterns defined by the language rules. Regular expressions provide a concise and precise notation for describing such patterns of symbols, so they are used in token specification.
Regular expressions are algebraic expressions used to describe the tokens of a programming language. They are built from three regular operations, called union (or), concatenation, and Kleene star. Parentheses ( and ) are used for grouping.
- Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
- Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
- The Kleene Closure of a language L is written as
L* = zero or more occurrences (concatenations) of strings from L.
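As a small illustration (the languages L = {a, b} and M = {0, 1} are chosen here only for demonstration):
L U M = {a, b, 0, 1}
LM = {a0, a1, b0, b1}
L* = {ε, a, b, aa, ab, ba, bb, aaa, ...}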
Now,
In C, the regular expression for identifiers can be written using a regular definition as:
letter → A | B | .... | Z | a | b | .... | z | _
digit → 0 | 1 | 2 | ..... |9
identifier → letter(letter | digit)*
The regular expression representing identifiers without using a regular definition:
(A | B | C | ... | Z | a | b | c | ... | z | _) (A | B | C | ... | Z | a | b | c | ... | z | _ | 0 | 1 | 2 | ... | 9)*
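As a rough sketch (not part of the original answer), the C function below, here called is_identifier, accepts exactly the strings described by letter(letter | digit)*; the function name and the use of <ctype.h> are assumptions made for this illustration.

```c
#include <ctype.h>   /* isalpha, isalnum */
#include <stdio.h>

/* Returns 1 if s matches letter(letter | digit)*, where
   letter = A..Z | a..z | _ and digit = 0..9; otherwise returns 0. */
static int is_identifier(const char *s)
{
    if (s == NULL || !(isalpha((unsigned char)s[0]) || s[0] == '_'))
        return 0;                       /* first symbol must be a letter or _ */
    for (const char *p = s + 1; *p != '\0'; p++)
        if (!(isalnum((unsigned char)*p) || *p == '_'))
            return 0;                   /* remaining symbols: letter or digit */
    return 1;
}

int main(void)
{
    printf("%d\n", is_identifier("count_1"));  /* prints 1 */
    printf("%d\n", is_identifier("1count"));   /* prints 0 */
    return 0;
}
```

In a real lexical analyzer this check is usually generated automatically from the regular expression (for example by a tool such as lex/flex) rather than hand-written, but the hand-written loop mirrors the structure of the pattern: one mandatory letter followed by zero or more letters or digits.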