Learning Compiler 5

风居住的街道

2019-08-24 (Updated: 2019-09-13)

Compiler, Learning

Note

The Functionality of the Parser

Input: sequence of tokens from lexer
Output: parse tree of the program

Comparison with Lexical Analysis

Phase	Input	Output
Lexer	String of characters	String of tokens
Parser	String of token	Parse tree

The Role of the Parser

Not all strings of tokens are programs
Parser must distinguish between valid and invalid strings of tokens

So we need:

A language for describing valid strings of tokens
A method for distinguishing valid from invalid strings of tokens

Context-Free Grammars(CFG)

Programming language constructs have recursive structure
Context-free grammars are a natural notation for the recursive structure

A CFG consists of:

A set of terminals $T$
A set of non-terminals $N$
A start symbol $S$ (a non-terminal)
A set of productions $X \rightarrow Y_1Y_2\dots Y_n$, where $X \in N$ and $Y_i \in T \cup N \cup \{\epsilon\}$.

Examples of CFGs

E \rightarrow E * E
| E + E
| (E)
| id

Key Idea

Begin with a string consisting of the start symbol “S”
Replace any non-terminal $X$ in the string by a the right-hand side of some production $X \rightarrow Y_1Y_2\dots Y_n$
Repeat (2) until there are no non-terminals in the string

The Language of a CFG

Read productions as rules:

$X \rightarrow Y_1Y_2\dots Y_n$

means $X$ can be replaced by $Y_1Y_2\dots Y_n$

More formally, write

$X_1\dots X_{i-1}X_i X_{i+1}\dots X_n \rightarrow X_1\dots X_{i-1} Y_1\dots Y_m X_{i+1}\dots X_n$

if there is a production

$X_i \rightarrow Y_1Y_2\dots Y_m$

Write

$X_1\dots X_n \rightarrow *Y_1\dots Y_m$

$X_1\dots X_n \rightarrow \dots \rightarrow \dots \rightarrow Y_1\dots Y_m$

in 0 or more steps

Let $G$ be a context-free grammer with start symbol $S$. Then the language of $G$ is

$\{a_1\dots a_n | S \rightarrow *a_1\dots a_n \text{and every} a_i \text{is a terminal}\}$

Terminals

Terminals are so-called because there are no rules for replacing them
Once generated, terminals are permanent
Terminals ought to be tokens of the language

Notes

The idea of a CFG is a big step. But:

Membership in a language is “yes” or “no”
- We also need a parse tree of the input
Must handle errors gracefully
Need an implementation of CFG’s (e.g. bison)
Form of the grammar is important
- Many grammars generate the same language
- Tools are sensitive to the grammar
  (Tools for regular languages (e.g., flex) are sensitive to the form of the regular expression, but this is rarely a problem in practice)

Derivations and Parse Trees

A derivation is a sequence of productions

$S \rightarrow \dots \rightarrow \dots \dots$

A derivation can be drawn as a tree

Start symbol is the tree’s root
For a production $X \rightarrow Y_1 \dots Y_n$ add children $Y_1 \dots Y_n$ to node $X$

Letf-most and right-most derivations have the same parse tree, the difference is the order in which branches are added.

Summary of Derivations

We are not just interested in whether s c L(G)

We need a parse tree for $s \in L(G)$
- A derivation defines a parse tree
But one parse tree may have many derivations
- Left-most and right-most derivations are important in parser implementation

Ambiguity

A grammar is ambiguous if it has more than one parse tree for some string
- Equivalently, there is more than one right-most or left-most derivation for some string
Ambiguity is BAD
- Leaves meaning of some programs ill-defined

Dealing with Ambiguity

There are several ways to handle ambiguity
Most direct method is to rewrite grammar unambiguously
Enforces precedence of * over +

cases:

Ambiguity in Arithmetic Expressions: E -> E + E | E * E | ( E ) | int
The Dangling Else: E -> if E then E | if E then E else E | OTHER
- Fix: else matches the closest unmatched then
No general techniques for handling ambiguity
Impossible to convert automatically an ambiguous grammar to an unambiguous one
- Used with care, ambiguity can simplify the grammar
- Sometimes allows more natural definitions – We need disambiguation mechanisms

Precedence and Associativity Declarations

Instead of rewriting the grammar
- Use the more natural (ambiguous) grammar
- Along with disambiguating declarations
Most tools allow precedence and associativity declarations to disambiguate grammars
- Examples:
  - Left associativity declaration: %left +
  - Precedence declarations: %left +, %left *

Reference

Slides: Introduction to Parsing

Video: Compiler Stanford 2014