Learning Compiler 3 Lexical Analysis

风居住的街道

2019-05-19 (Updated: 2019-08-25)

学不死就往死里学

Compiler, Learning

Note

Lexical Analysis

The input is just a string of characters.

The goal is to partition input string into substrings.

Tokens

Tokens correspond to sets of strings.

Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: “else” or “if” or “begin” or …
Whitespace: a non-empty sequence of blanks, newlines, and tabs

Token are Used to

Classify program substrings according to role
Output of lexical analysis is a stream of tokens…
Input to the parser
Parser relies on token distinctions
- An identifier is treated differently than a keyword

Lexical analyzer design

Step 1: Define a finite set of tokens

Tokens describe all items of interest
Choice of tokens depends on language, design of parser

Step 2: Describe which strings belong to each token

Identifier: strings of letters or digits, starting with a letter
Integer: a non-empty string of digits
Keyword: “else” or “if” or “begin” or …
Whitespace: a non-empty sequence of blanks, newlines, and tabs

Implementation

Do two things:

Recognize substrings corresponding to tokens
Return the value or lexeme of the token
- The lexeme is the substring

The lexer usually discards “ninteresting” tokens that don’t contribute to parsing.

Examples: Whitespace, Comments

Review

The goal of lexical analysis is to

Partition the input string into lexemes
Identify the token of each lexeme

Left-to-right scan => lookahead sometimes required

Regular languages

Def. Let $\Sigma$ be a set of characters. A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.

Regular Expressions

Atomic Regular Expressions

Single character: ‘c’ = {“c”}

Epsilon: $\epsilon$ = {“”}

Compound Regular Expressions

Union: $A+B = \{s|s\in A\;\text{or}\;s\in B\}$

Concatenation: $AB = \{ab|a\in A\;\text{and}\;b\in B\}$

Iteration: $A^* = \cup_{i\geq 0}A^i\;\text{where}\;A^i = A...i\;\text{times}...A$

Definition

Def. The regular expressions over $\Sigma$ are the smallest set of expressions including

$\epsilon$ $'c' \;\text{where}\;c \in \Sigma$ $A+B \;\text{where}\; A,B \;\text{are rexp over}\;\Sigma$ $AB\quad"\;\;\;"\;\;\;"$ $A^* \;\text{where}\; A \;\text{is a rexp over}\;\Sigma$

Syntax vs. Semantics

$L(\epsilon) = \{""\}$ $L('c') = \{"c"\}$ $L(A+B) = L(A) + L(B)$ $L(AB)=\{ab|a\in L(A)\;\text{and}\;b\in L(B)\}$ $L(A^*)= \cup_{i\geq 0}L(A^i)$

Segue

Keywords: e.g. ‘if’ abbreviates ‘i’’f’

Integers: Abbreviation: $A^+ = AA^*$

a non-empty string of digits
digit = ‘0’+’1’+’2’+’3’+’4’+’5’+’6’+’7’+’8’+’9’
integer = digit digit$^*$

Identifier: Abbrevation: letter (letter + digit)$^*$

strings of letters or digits, starting with a letter
letter = ‘A’+…+’Z’+’a’+…+ ‘z’

Whitespace: Abbrevation: $('\;' + '\n' + '\t')^+$

a non-empty sequence of blanks, newlines, and tabs

Reference

Slides: Lexical Analysis

Video: Compiler Stanford 2014