Learning Compiler 3 Lexical Analysis

Note

Lexical Analysis

The input is just a string of characters.

The goal is to partition input string into substrings.

Tokens

Tokens correspond to sets of strings.

  • Identifier: strings of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: “else” or “if” or “begin” or …
  • Whitespace: a non-empty sequence of blanks, newlines, and tabs

Token are Used to

  • Classify program substrings according to role
  • Output of lexical analysis is a stream of tokens…
  • Input to the parser
  • Parser relies on token distinctions
    • An identifier is treated differently than a keyword

Lexical analyzer design

Step 1: Define a finite set of tokens

  • Tokens describe all items of interest
  • Choice of tokens depends on language, design of parser

Step 2: Describe which strings belong to each token

  • Identifier: strings of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: “else” or “if” or “begin” or …
  • Whitespace: a non-empty sequence of blanks, newlines, and tabs

Implementation

Do two things:

  1. Recognize substrings corresponding to tokens
  2. Return the value or lexeme of the token
    • The lexeme is the substring

The lexer usually discards “ninteresting” tokens that don’t contribute to parsing.

  • Examples: Whitespace, Comments

Review

The goal of lexical analysis is to

  • Partition the input string into lexemes
  • Identify the token of each lexeme

Left-to-right scan => lookahead sometimes required

Regular languages

Def. Let $\Sigma$ be a set of characters. A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$.

Regular Expressions

Atomic Regular Expressions

Single character: ‘c’ = {“c”}

Epsilon: $\epsilon$ = {“”}

Compound Regular Expressions

Union: $A+B = \{s|s\in A\;\text{or}\;s\in B\}$

Concatenation: $AB = \{ab|a\in A\;\text{and}\;b\in B\}$

Iteration: $A^* = \cup_{i\geq 0}A^i\;\text{where}\;A^i = A...i\;\text{times}...A$

Definition

Def. The regular expressions over $\Sigma$ are the smallest set of expressions including

$\epsilon$ $'c' \;\text{where}\;c \in \Sigma$ $A+B \;\text{where}\; A,B \;\text{are rexp over}\;\Sigma$ $AB\quad"\;\;\;"\;\;\;"$ $A^* \;\text{where}\; A \;\text{is a rexp over}\;\Sigma$

Syntax vs. Semantics

$L(\epsilon) = \{""\}$ $L('c') = \{"c"\}$ $L(A+B) = L(A) + L(B)$ $L(AB)=\{ab|a\in L(A)\;\text{and}\;b\in L(B)\}$ $L(A^*)= \cup_{i\geq 0}L(A^i)$

Segue

Keywords: e.g. ‘if’ abbreviates ‘i’’f’

Integers: Abbreviation: $A^+ = AA^*$

  • a non-empty string of digits
  • digit = ‘0’+’1’+’2’+’3’+’4’+’5’+’6’+’7’+’8’+’9’
  • integer = digit digit$^*$

Identifier: Abbrevation: letter (letter + digit)$^*$

  • strings of letters or digits, starting with a letter
  • letter = ‘A’+…+’Z’+’a’+…+ ‘z’

Whitespace: Abbrevation: $('\;' + '\n' + '\t')^+$

  • a non-empty sequence of blanks, newlines, and tabs

Reference

Slides: Lexical Analysis

Video: Compiler Stanford 2014