Learning Compiler 5

Note

The Functionality of the Parser

  • Input: sequence of tokens from lexer
  • Output: parse tree of the program

Comparison with Lexical Analysis

Phase Input Output
Lexer String of characters String of tokens
Parser String of token Parse tree

The Role of the Parser

  1. Not all strings of tokens are programs
  2. Parser must distinguish between valid and invalid strings of tokens

So we need:

  • A language for describing valid strings of tokens
  • A method for distinguishing valid from invalid strings of tokens

Context-Free Grammars(CFG)

  • Programming language constructs have recursive structure
  • Context-free grammars are a natural notation for the recursive structure

A CFG consists of:

  • A set of terminals $T$
  • A set of non-terminals $N$
  • A start symbol $S$ (a non-terminal)
  • A set of productions $X \rightarrow Y_1Y_2\dots Y_n$, where $X \in N$ and $Y_i \in T \cup N \cup \{\epsilon\}$.

Examples of CFGs

E \rightarrow E * E
| E + E
| (E)
| id

Key Idea

  1. Begin with a string consisting of the start symbol “S”
  2. Replace any non-terminal $X$ in the string by a the right-hand side of some production $X \rightarrow Y_1Y_2\dots Y_n$
  3. Repeat (2) until there are no non-terminals in the string

The Language of a CFG

Read productions as rules:

$X \rightarrow Y_1Y_2\dots Y_n$

means $X$ can be replaced by $Y_1Y_2\dots Y_n$

More formally, write

$X_1\dots X_{i-1}X_i X_{i+1}\dots X_n \rightarrow X_1\dots X_{i-1} Y_1\dots Y_m X_{i+1}\dots X_n$

if there is a production

$X_i \rightarrow Y_1Y_2\dots Y_m$

Write

$X_1\dots X_n \rightarrow *Y_1\dots Y_m$

if

$X_1\dots X_n \rightarrow \dots \rightarrow \dots \rightarrow Y_1\dots Y_m$

in 0 or more steps

Let $G$ be a context-free grammer with start symbol $S$. Then the language of $G$ is

$\{a_1\dots a_n | S \rightarrow *a_1\dots a_n \text{and every} a_i \text{is a terminal}\}$

Terminals

  • Terminals are so-called because there are no rules for replacing them
  • Once generated, terminals are permanent
  • Terminals ought to be tokens of the language

Notes

The idea of a CFG is a big step. But:

  • Membership in a language is “yes” or “no”
    • We also need a parse tree of the input
  • Must handle errors gracefully
  • Need an implementation of CFG’s (e.g. bison)
  • Form of the grammar is important
    • Many grammars generate the same language
    • Tools are sensitive to the grammar
      (Tools for regular languages (e.g., flex) are sensitive to the form of the regular expression, but this is rarely a problem in practice)

Derivations and Parse Trees

A derivation is a sequence of productions

$S \rightarrow \dots \rightarrow \dots \dots$

A derivation can be drawn as a tree

  • Start symbol is the tree’s root
  • For a production $X \rightarrow Y_1 \dots Y_n$ add children $Y_1 \dots Y_n$ to node $X$

Letf-most and right-most derivations have the same parse tree, the difference is the order in which branches are added.

Summary of Derivations

We are not just interested in whether s c L(G)

  • We need a parse tree for $s \in L(G)$
    • A derivation defines a parse tree
  • But one parse tree may have many derivations
    • Left-most and right-most derivations are important in parser implementation

Ambiguity

  • A grammar is ambiguous if it has more than one parse tree for some string
    • Equivalently, there is more than one right-most or left-most derivation for some string
  • Ambiguity is BAD
    • Leaves meaning of some programs ill-defined

Dealing with Ambiguity

  • There are several ways to handle ambiguity
  • Most direct method is to rewrite grammar unambiguously
  • Enforces precedence of * over +

cases:

  • Ambiguity in Arithmetic Expressions: E -> E + E | E * E | ( E ) | int

  • The Dangling Else: E -> if E then E | if E then E else E | OTHER

    • Fix: else matches the closest unmatched then
  • No general techniques for handling ambiguity

  • Impossible to convert automatically an ambiguous grammar to an unambiguous one

    • Used with care, ambiguity can simplify the grammar
    • Sometimes allows more natural definitions – We need disambiguation mechanisms

Precedence and Associativity Declarations

  • Instead of rewriting the grammar
    • Use the more natural (ambiguous) grammar
    • Along with disambiguating declarations
  • Most tools allow precedence and associativity declarations to disambiguate grammars
    • Examples:
      • Left associativity declaration: %left +
      • Precedence declarations: %left +, %left *

Reference

Slides: Introduction to Parsing

Video: Compiler Stanford 2014