# Learning Compiler 3 Lexical Analysis

## Note

### Lexical Analysis

The input is just a string of characters.

The goal is to partition input string into substrings.

### Tokens

Tokens correspond to sets of strings.

• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs

### Token are Used to

• Classify program substrings according to role
• Output of lexical analysis is a stream of tokens…
• Input to the parser
• Parser relies on token distinctions
• An identifier is treated differently than a keyword

### Lexical analyzer design

Step 1: Define a finite set of tokens

• Tokens describe all items of interest
• Choice of tokens depends on language, design of parser

Step 2: Describe which strings belong to each token

• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs

### Implementation

Do two things:

1. Recognize substrings corresponding to tokens
2. Return the value or lexeme of the token
• The lexeme is the substring

The lexer usually discards “ninteresting” tokens that don’t contribute to parsing.

### Review

The goal of lexical analysis is to

• Partition the input string into lexemes
• Identify the token of each lexeme

Left-to-right scan => lookahead sometimes required

### Regular languages

Def. Let $\Sigma$ be a set of characters. A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.

### Atomic Regular Expressions

Single character: ‘c’ = {“c”}

Epsilon: $\epsilon$ = {“”}

### Compound Regular Expressions

Union: $A+B = \{s|s\in A\;\text{or}\;s\in B\}$

Concatenation: $AB = \{ab|a\in A\;\text{and}\;b\in B\}$

Iteration: $A^* = \cup_{i\geq 0}A^i\;\text{where}\;A^i = A...i\;\text{times}...A$

### Definition

Def. The regular expressions over $\Sigma$ are the smallest set of expressions including

$\epsilon$ $'c' \;\text{where}\;c \in \Sigma$ $A+B \;\text{where}\; A,B \;\text{are rexp over}\;\Sigma$ $AB\quad"\;\;\;"\;\;\;"$ $A^* \;\text{where}\; A \;\text{is a rexp over}\;\Sigma$

### Syntax vs. Semantics

$L(\epsilon) = \{""\}$ $L('c') = \{"c"\}$ $L(A+B) = L(A) + L(B)$ $L(AB)=\{ab|a\in L(A)\;\text{and}\;b\in L(B)\}$ $L(A^*)= \cup_{i\geq 0}L(A^i)$

### Segue

Keywords: e.g. ‘if’ abbreviates ‘i’’f’

Integers: Abbreviation: $A^+ = AA^*$

• a non-empty string of digits
• digit = ‘0’+’1’+’2’+’3’+’4’+’5’+’6’+’7’+’8’+’9’
• integer = digit digit$^*$

Identifier: Abbrevation: letter (letter + digit)$^*$

• strings of letters or digits, starting with a letter
• letter = ‘A’+…+’Z’+’a’+…+ ‘z’

Whitespace: Abbrevation: $('\;' + '\n' + '\t')^+$

• a non-empty sequence of blanks, newlines, and tabs

## Reference

Slides: Lexical Analysis

Video: Compiler Stanford 2014