Learning Compiler 3 Lexical Analysis
Note
Lexical Analysis
The input is just a string of characters.
The goal is to partition input string into substrings.
Tokens
Tokens correspond to sets of strings.
- Identifier: strings of letters or digits, starting with a letter
- Integer: a non-empty string of digits
- Keyword: “else” or “if” or “begin” or …
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
Token are Used to
- Classify program substrings according to role
- Output of lexical analysis is a stream of tokens…
- Input to the parser
- Parser relies on token distinctions
- An identifier is treated differently than a keyword
Lexical analyzer design
Step 1: Define a finite set of tokens
- Tokens describe all items of interest
- Choice of tokens depends on language, design of parser
Step 2: Describe which strings belong to each token
- Identifier: strings of letters or digits, starting with a letter
- Integer: a non-empty string of digits
- Keyword: “else” or “if” or “begin” or …
- Whitespace: a non-empty sequence of blanks, newlines, and tabs
Implementation
Do two things:
- Recognize substrings corresponding to tokens
- Return the value or lexeme of the token
- The lexeme is the substring
The lexer usually discards “ninteresting” tokens that don’t contribute to parsing.
- Examples: Whitespace, Comments
Review
The goal of lexical analysis is to
- Partition the input string into lexemes
- Identify the token of each lexeme
Left-to-right scan => lookahead sometimes required
Regular languages
Def. Let $\Sigma$ be a set of characters. A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.
Regular Expressions
Atomic Regular Expressions
Single character: ‘c’ = {“c”}
Epsilon: $\epsilon$ = {“”}
Compound Regular Expressions
Union: $A+B = \{s|s\in A\;\text{or}\;s\in B\}$
Concatenation: $AB = \{ab|a\in A\;\text{and}\;b\in B\}$
Iteration: $A^* = \cup_{i\geq 0}A^i\;\text{where}\;A^i = A...i\;\text{times}...A$
Definition
Def. The regular expressions over $\Sigma$ are the smallest set of expressions including
$\epsilon$ $'c' \;\text{where}\;c \in \Sigma$ $A+B \;\text{where}\; A,B \;\text{are rexp over}\;\Sigma$ $AB\quad"\;\;\;"\;\;\;"$ $A^* \;\text{where}\; A \;\text{is a rexp over}\;\Sigma$Syntax vs. Semantics
$L(\epsilon) = \{""\}$ $L('c') = \{"c"\}$ $L(A+B) = L(A) + L(B)$ $L(AB)=\{ab|a\in L(A)\;\text{and}\;b\in L(B)\}$ $L(A^*)= \cup_{i\geq 0}L(A^i)$Segue
Keywords: e.g. ‘if’ abbreviates ‘i’’f’
Integers: Abbreviation: $A^+ = AA^*$
- a non-empty string of digits
- digit = ‘0’+’1’+’2’+’3’+’4’+’5’+’6’+’7’+’8’+’9’
- integer = digit digit$^*$
Identifier: Abbrevation: letter (letter + digit)$^*$
- strings of letters or digits, starting with a letter
- letter = ‘A’+…+’Z’+’a’+…+ ‘z’
Whitespace: Abbrevation: $('\;' + '\n' + '\t')^+$
- a non-empty sequence of blanks, newlines, and tabs
Reference
Slides: Lexical Analysis
Video: Compiler Stanford 2014