# Learning Compiler 4 Finite Automata

## Note

### Specifying Lexical Structure Using Regular Expressions

Union: $A|B\equiv A + B$

Option: $A + \epsilon \equiv A?$

Range: $'a'+'b'+\dots+'z'\equiv [a-z]$

Excluded range: complement of $[a-z] \equiv [\hat{}a-z]$

### Regular Expressions => Lexical Spec.

- Write a rexp for the lexemes of each token
- Construct $R$, matching all lexemes for all tokens
- $R = \text{Keyword} + \text{Identifier} + \text{Number} + \dots \\ = R_1 + R_2 + \dots$

- Let input be $x_1\dots x_n$
- For $1 \leq i \leq n$ check $x_1\dots x_i\in L(R)$

- If success, then we know that $x_1\dots x_i\in L(R_j)$ for some $j$
- Remove $x_1\dots x_i$ from input and go to (3)

### Ambiguities

Rule: Pick longest possible string in $L(R)$

Rule: Use rule listed first

### Error handling

No rule matches a prefix of input

Solution:

- Write a rule matching all “bad” strings
- Put it last (lowest priority)

### Summary

Regular expressions provide a concise notation for string patterns

Use in lexical analysis requires small extensions

- To resolve ambiguities
- To handle errors

Good algorithms known

- Require only single pass over the input
- Few operations per character (table lookup)

### Finite Automata

Regular expressions = specification

Finite automata = implementation

A finite automaton consists of – An input alphabet $\Sigma$

- A set of states $S$
- A start state $n$
- A set of accepting states $F\subseteq S$
- A set of transitions state $\rightarrow^{\text{input}}$ state

Transition: $s_1 \rightarrow^{a} s_2$

Is read: In state $s_1$ on input “a” go to state $s_2$

If end of input and in accepting state => accept

Otherwise => reject

### Finite Automata State Graphs

### Epsilon Moves

$\epsilon$-moves

Machine can move from state A to state B without reading input

### Deterministic and Nondeterministic Automata

Deterministic Finite Automata (DFA)

- One transition per input per state
- No $\epsilon$-moves

Nondeterministic Finite Automata (NFA)

- Can have multiple transitions for one input in a

given state - Can have $\epsilon$-moves

### Execution of Finite Automata

A DFA can take only one path through the state graph

- Completely determined by input

NFAs can choose

- Whether to make $\epsilon$-moves
- Which of multiple transitions for a single input to take

### Acceptance of NFAs

An NFA can get into multiple states

Rule: NFA accepts if it can get to a final state

### NFA vs. DFA

NFAs and DFAs recognize the same set of languages (regular languages)

DFAs are faster to execute

- No choices to consider

For a given language NFA can be simpler than DFA

DFA can be exponentially larger than NFA

### Regular Expressions to Finite Automata

High-level sketch:

Lexical Specification -> Regular Expressions -> NFA -> DFA -> Table-driven Implementation of DFA

### Regular Expressions to NFA

### NFA to DFA: The Trick

Simulate the NFA

Each state of DFA = a non-empty subset of states of the NFA

Start state = the set of NFA states reachable through $\epsilon$-moves from NFA start state

Add a transition $S \rightarrow^a S'$ to DFA iff $S'$ is the set of NFA states reachable from any

state in $S$ after seeing the input $a$, considering $\epsilon$-moves as well

### NFA to DFA. Remark

An NFA may be in many states at any time

If there are N states, the NFA must be in some subset of those N states

How many subsets are there?

- $2^N - 1$ = finitely many

### Example

### Implementation

A DFA can be implemented by a 2D table T

- One dimension is “states”
- Other dimension is “input symbol”
- For every transition $S_i \rightarrow^a S_k$ define $T[i, a] = k$

DFA “execution”

- If in state $S_i$ and input $a$, read $T[i, a] = k$ and skip to state $S_k$
- Very efficient

NFA -> DFA conversion is at the heart of tools such as flex

DFAs can be huge

In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations

## Reference

Slides: Finite Automata

Video: Compiler Stanford 2014