CIS 324: Language Design and Implementation

Lexical Analysis

1. Regular Expressions

1.1 Lexical Analysis using Tokens

The purpose of the lexical analyzer is to read the input characters of the source program and to translate them into a sequence of tokens suitable for syntax analysis. The simplest way to build a lexical analyzer is to draw a diagram that describes the structure of the tokens in the source program, and then to translate the diagram by hand into a program that recognizes the tokens.

There are three main reasons for separating the lexical analysis phase of the compiler from the syntax analysis phase:
- the separation simplifies the design of both the lexical analyzer and the syntax analyzer, since many design issues can be handled more precisely when each phase is considered in isolation;
- the separation improves the efficiency of both analyzers, since specialized techniques can be applied to each phase when it is considered in isolation;
- the separation enhances the portability of both the lexical analyzer and the syntax analyzer.

1.2 Notation for Regular Expressions

Regular expressions denote sets of strings. Regular expressions over a given alphabet Σ are defined recursively using the following values and operations:
- the empty string ε is a regular expression that denotes the set { ε };
- any character a from Σ is a regular expression that denotes the set { a };
- if a and b are regular expressions that denote the sets P and Q, then ( a | b ), ( ab ) and ( a )* are regular expressions that denote the sets P ∪ Q, PQ, and P* respectively.
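The recursive definition above can be sketched in Python as a small function that enumerates the set denoted by a regular expression. This is only an illustration under stated assumptions: the nested-tuple encoding and the name denotes are hypothetical, and because P* is infinite the enumeration is cut off at a maximum string length.

```python
# Hypothetical encoding of regular expressions as nested tuples:
#   ("eps",)        the empty string
#   ("sym", "a")    a single character of the alphabet
#   ("alt", r, s)   r | s
#   ("cat", r, s)   rs
#   ("star", r)     r*

def denotes(regex, max_len):
    """Return the strings of length <= max_len in the set denoted by regex."""
    kind = regex[0]
    if kind == "eps":
        return {""}
    if kind == "sym":
        return {regex[1]} if max_len >= 1 else set()
    if kind == "alt":
        # union of the two denoted sets
        return denotes(regex[1], max_len) | denotes(regex[2], max_len)
    if kind == "cat":
        # every string of the left set followed by every string of the right set
        left = denotes(regex[1], max_len)
        right = denotes(regex[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if kind == "star":
        # zero or more concatenations; the empty string is excluded from the
        # base so that every round produces strictly longer strings
        base = denotes(regex[1], max_len) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind!r}")


a, b = ("sym", "a"), ("sym", "b")
a_or_b = ("alt", a, b)
print(sorted(denotes(("cat", a_or_b, a_or_b), 2)))  # ['aa', 'ab', 'ba', 'bb']
print(sorted(denotes(("star", a), 3)))              # ['', 'a', 'aa', 'aaa']
```

The empty Python string plays the role of ε here. The length bound is a design choice for the sketch only; it is not part of the definition of regular expressions.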
The definition of regular expressions above uses the following classical operations on languages:

Operation              Definition
union P ∪ Q            P ∪ Q = { a | a is in P or a is in Q }
concatenation PQ       PQ = { ab | a is in P and b is in Q }
closure P*             P* = the set of all concatenations of zero or more strings from P

The precedence and associativity of these operations are as follows:
- the closure operator has the highest precedence and is left associative;
- the concatenation operator has the second highest precedence and is left associative;
- the union operator has the lowest precedence and is left associative.

Examples:
The regular expression a | b denotes the set { a, b }
The regular expression ( a | b ) ( a | b ) denotes the set { aa, ab, ba, bb }
The regular expression a* denotes the set { ε, a, aa, aaa, ... }
The regular expression ( a | b )* denotes the set { ε, a, b, aa, ab, ba, bb, aaa, aab, abb, bbb, ... }

Example: Consider the following grammar fragment:

statement → if expr then statement
          | if expr then statement else statement
          | ε
expr      → term relop term
          | term
term      → id
          | num

The corresponding regular definitions for the terminals are:

if    → if
then  → then
else  → else
relop → < | <= | = | <> | > | >=
id    → letter ( letter | digit )*
num   → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
delim → blank | tab | newline
ws    → delim+

2. Strings and Languages

An alphabet is a finite set of symbols. The symbols are typically letters and other characters. A string is a finite sequence of symbols drawn from the alphabet. In programming language theory the terms sentence and word are synonyms for the term string. A language is a set of strings over a fixed alphabet.
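To make the idea of a language as a set of strings concrete, the following sketch tests whether a string belongs to the languages denoted by the regular definitions relop, id and num from Section 1.2. It uses Python's re module. The notes do not define letter and digit, so the sketch assumes ASCII letters and the digits 0 to 9; the names LANGUAGES and in_language are hypothetical.

```python
import re

# Assumption: letter is an ASCII letter and digit is one of 0-9.
LETTER = "[A-Za-z]"
DIGIT = "[0-9]"

# One compiled pattern per regular definition.
LANGUAGES = {
    "relop": re.compile(r"<=|<>|<|=|>=|>"),
    "id": re.compile(f"{LETTER}({LETTER}|{DIGIT})*"),
    "num": re.compile(rf"{DIGIT}+(\.{DIGIT}+)?(E[+-]?{DIGIT}+)?"),
}

def in_language(name, string):
    """Return True if the whole string belongs to the language called name."""
    return LANGUAGES[name].fullmatch(string) is not None


print(in_language("id", "x1"))         # True
print(in_language("id", "1x"))         # False: an id must start with a letter
print(in_language("num", "6.02E+23"))  # True
print(in_language("num", "3."))        # False: a dot must be followed by digits
print(in_language("relop", "<>"))      # True
```

fullmatch is used rather than match because membership in a language concerns the whole string, not just a prefix of it.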
In programming language theory the following terminology is used:

Term                                    Definition
prefix of a string                      A string consisting of zero or more leading symbols of the given string
suffix of a string                      A string obtained by deleting zero or more leading symbols from the given string
substring of a string                   A string obtained by removing a prefix and a suffix from the given string
proper prefix, suffix, and substring    Any nonempty string that is respectively a prefix, a suffix, or a substring of the given string and is different from it
subsequence of a string                 Any string produced by deleting zero or more, not necessarily contiguous, symbols from the given string

3. Transition Diagrams

Transition diagrams show the actions performed by the lexical analyzer when it is invoked by the parser. A transition diagram consists of states and of edges that connect them. The transition diagram for a token begins in a start state and moves from the current state along the edge whose label matches the next input character of the source program. Usually the transition diagrams for several tokens are developed together so that they begin from a common start state.

Example: Transition diagrams for the regular definitions relop and num

relop → < | <= | = | <> | > | >=
[Figure: transition diagram for relop, with start state 0 and states 1 to 8; the edges are labelled <, =, >, and other]

num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
[Figure: transition diagram for num, with start state 0 and states 1 to 7; the edges are labelled digit, ., E, + or -, and other]
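A transition diagram such as the one for relop can be turned into a program by storing its edges in a table and following them one input character at a time. The sketch below is one possible encoding, not the implementation used in the course. It assumes the conventional layout of the relop diagram with states 0 to 8, in which the states reached on an "other" edge accept but give the last character back to the input. The token names LT, LE, EQ, NE, GT and GE and the function name next_relop are hypothetical.

```python
# Edges of the relop diagram (assumed layout): state -> {label: next state}.
# "other" is the fallback edge taken on any character without its own edge.
EDGES = {
    0: {"<": 1, "=": 5, ">": 6},
    1: {"=": 2, ">": 3, "other": 4},
    6: {"=": 7, "other": 8},
}

# Accepting states: (hypothetical token name, retract the last character?)
ACCEPT = {
    2: ("LE", False), 3: ("NE", False), 4: ("LT", True),
    5: ("EQ", False), 7: ("GE", False), 8: ("GT", True),
}

def next_relop(text, pos=0):
    """Return (token, new_pos), or None if no relop starts at pos."""
    state = 0
    while state not in ACCEPT:
        # The empty string stands for end of input and only matches "other".
        ch = text[pos] if pos < len(text) else ""
        edges = EDGES[state]
        if ch in edges:
            state = edges[ch]
        elif "other" in edges:
            state = edges["other"]
        else:
            return None          # no edge from the start state: not a relop
        pos += 1
    token, retract = ACCEPT[state]
    # States reached on "other" consumed one character too many.
    return token, (pos - 1 if retract else pos)


print(next_relop("<= 10"))  # ('LE', 2)
print(next_relop("<x"))     # ('LT', 1)
print(next_relop("<>"))     # ('NE', 2)
print(next_relop("x<"))     # None
```

Keeping the edges in a table, rather than in nested if statements, means the diagram for num could be added in the same way by extending the data instead of rewriting the loop.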