This is mostly based on reading Thorsten Ball and Robert Nystrom

Lexing

Converting source code to tokens is called tokenizing or lexing (lexical analysis). It is done by a tokenizer, lexer or scanner.

Interesting: Some languages do care about whitespace (Python cares about indentation), which changes the lexer: it can no longer simply skip whitespace. The significant whitespace needs to become tokens.
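A minimal sketch of how that might look (this is an assumption about the approach, not CPython's actual tokenizer): track a stack of indentation widths and emit INDENT/DEDENT pseudo-tokens when a line is indented more or less than the line before it.

```python
# Sketch of indentation-sensitive lexing (Python-style).
# Assumption: indentation uses spaces only; real lexers also handle tabs.
def indent_tokens(source: str):
    stack, out = [0], []  # stack of indentation widths seen so far
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines don't affect indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:          # deeper than before: open a block
            stack.append(width)
            out.append("INDENT")
        while width < stack[-1]:       # shallower: close block(s)
            stack.pop()
            out.append("DEDENT")
        out.append(("LINE", line.strip()))
    while stack[-1] > 0:               # close any blocks still open at EOF
        stack.pop()
        out.append("DEDENT")
    return out
```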

Robust lexers usually attach the line number, column number and filename to each token, so that error messages during parsing are better.

“error: expected token” vs. “error: expected token (line 1, column 1, file.name)”
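A small sketch of a lexer that carries positions along (the token set here — integers and `+`/`*` — is an arbitrary choice for illustration):

```python
# Minimal lexer sketch: each token records line and column so that later
# phases (parsing, type checking) can report precise error positions.
from dataclasses import dataclass

@dataclass
class Token:
    kind: str
    literal: str
    line: int
    col: int

def lex(source: str) -> list[Token]:
    tokens, line, col, i = [], 1, 1, 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":                      # newline: bump line, reset column
            line, col, i = line + 1, 1, i + 1
        elif ch.isspace():                  # other whitespace: skip it
            col += 1; i += 1
        elif ch.isdigit():                  # integer literal: consume digits
            start, start_col = i, col
            while i < len(source) and source[i].isdigit():
                i += 1; col += 1
            tokens.append(Token("INT", source[start:i], line, start_col))
        elif ch in "+*":
            tokens.append(Token("OP", ch, line, col))
            col += 1; i += 1
        else:                               # position makes this error useful
            raise SyntaxError(f"unexpected {ch!r} at line {line}, column {col}")
    return tokens
```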

Parsing

Take input data and build a data structure, checking along the way that the syntax is correct. For interpreters and compilers that data structure is most of the time a “syntax tree” or “abstract syntax tree” (AST).
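For instance, an AST for `1 * 2 + 10` could be sketched like this (the node names are made up for illustration); note how the tree shape, not the token order, encodes that `*` binds tighter than `+`:

```python
# Tiny AST sketch: integer literals and infix (binary) expressions.
from dataclasses import dataclass

@dataclass
class IntLiteral:
    value: int

@dataclass
class Infix:
    left: object
    op: str
    right: object

# "1 * 2 + 10": the root is the LOWEST-precedence operator (+),
# because its operands are evaluated first, further down the tree.
tree = Infix(Infix(IntLiteral(1), "*", IntLiteral(2)), "+", IntLiteral(10))

def evaluate(node):
    # Walking the tree respects precedence automatically.
    if isinstance(node, IntLiteral):
        return node.value
    ops = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    return ops[node.op](evaluate(node.left), evaluate(node.right))
```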

Parsing is also called “syntactic analysis”

A “parser generator” is a tool that uses a formal description of a language (generally a context-free grammar (CFG)) to produce parsers. (yacc, bison, ANTLR).

Strategies for parsing

  1. Top-down (e.g. recursive descent parsing, Earley parsing, predictive parsing): Build the root node first, then descend.
  2. Bottom-up: The other way around 😊 — start at the leaves and work up to the root.
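A top-down (recursive descent) parser can be sketched as one function per grammar rule; parsing starts with the function for the start symbol, i.e. the root of the tree. The grammar here (`expr -> term ("+" term)*`, `term -> INT`) is a toy example:

```python
# Recursive descent sketch for: expr -> term ("+" term)* ; term -> INT
# Tokens are plain strings for simplicity; trees are nested tuples.

def parse_expr(tokens, pos=0):
    # Called first: this is the rule for the ROOT of the tree (top-down).
    left, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        left = ("+", left, right)  # left-associative: fold as we go
    return left, pos

def parse_term(tokens, pos):
    tok = tokens[pos]
    if not tok.isdigit():
        raise SyntaxError(f"expected INT, got {tok!r}")
    return int(tok), pos + 1
```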

Vaughan Pratt (“Top Down Operator Precedence”) describes his technique as one that

  “is very simple to understand, trivial to implement, easy to use, extremely efficient in practice if not in theory, yet flexible enough to meet most reasonable syntactic needs of users”

Robert Nystrom (who wrote the aforementioned Crafting Interpreters) has this article on Pratt parsers.

Terminology

  • Prefix operator: operator in front of its operand: --5, where -- is the operator.
  • Postfix operator: operator after its operand: foobar++, where ++ is the operator.
  • Infix operator: operator between operands: 1 * 2, where * is the operator. These appear in binary expressions, where the operator has two operands.
  • Mixfix? A combination of the previous? ?: is called a mixfix conditional operator, since its two parts sit between three operands.
  • Operator precedence/order of operations: which priority different operators have. In 1 * 2 + 10, * has higher precedence than +.
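The terminology comes together in a minimal Pratt-parser sketch (same toy tokens as above: digit strings and `+`/`*`): each infix operator gets a precedence (binding power), and the parse loop only consumes operators that bind tighter than the caller's level.

```python
# Minimal Pratt-parser sketch for infix + and *.
# Assumption: tokens are pre-split strings; trees are nested tuples.
PRECEDENCE = {"+": 1, "*": 2}

def parse(tokens, pos=0, min_bp=0):
    tok = tokens[pos]
    if tok.isdigit():                 # "prefix" position: just a literal here
        left, pos = int(tok), pos + 1
    else:
        raise SyntaxError(f"unexpected {tok!r}")
    # Keep consuming infix operators that bind tighter than min_bp.
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) > min_bp:
        op = tokens[pos]
        # Recurse with the operator's own precedence: a following "*"
        # will be consumed by this recursive call, a following "+" won't.
        right, pos = parse(tokens, pos + 1, PRECEDENCE[op])
        left = (op, left, right)
    return left, pos
```

Because `*` has precedence 2 and `+` has 1, `1 * 2 + 10` parses as `(1 * 2) + 10`: the `+` node ends up at the root, matching the tree sketched earlier.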