Lexical Analysis Explained: A Beginner's Guide to the First Compiler Phase

Understand lexical analysis from scratch: how a compiler scans source code character by character, builds tokens, handles whitespace and comments, and feeds the parser. Includes a tiny working lexer.

10 min read

Lexical analysis — or "lexing" — is the first thing a compiler does to your source code. It reads the file character by character and groups characters into tokens: meaningful units like keywords, identifiers, numbers, operators, and punctuation. Without this step, the rest of the compiler would have to deal with raw text and whitespace, which is far harder than dealing with a clean stream of IDENT, NUMBER, PLUS, EQ tokens.

This guide is for the developer curious about what tsc, rustc, or your favourite linter is doing the moment it opens your file. By the end you will understand how lexers work, why they use finite automata under the hood, and how to write a tiny one yourself.

The Job of a Lexer

A lexer takes a stream of characters and produces a stream of tokens. Each token has:

  • A type (IDENT, NUMBER, LPAREN, IF, EQ).
  • A value (the actual text — "x", "42", "(").
  • A location (line, column) for error messages.

For the input let x = 42;, the lexer emits:

LET("let"), IDENT("x"), EQ("="), NUMBER(42), SEMI(";").

Whitespace and comments are typically discarded — they exist for humans, not parsers.

Why Have a Separate Phase

You could imagine a parser that worked directly on characters. Real languages do not, for three reasons:

  • Simplicity. Parsers are easier to write when they consume tokens, not characters.
  • Performance. Lexers can be implemented as state machines that run in linear time with tiny per-character cost.
  • Reuse. The same lexer can serve a compiler, a syntax highlighter, an IDE's autocomplete, and a linter.

This is why your editor highlights syntax in milliseconds — it runs a lexer, not a full compiler.

Tokens: The Vocabulary

Every language defines a fixed set of token types. For a tiny calculator language, the set might be:

TokenMatches
NUMBERDigits, e.g. 42, 3.14
IDENTLetters/underscore, then letters/digits/underscore
PLUS / MINUS / STAR / SLASH+ - * /
LPAREN / RPAREN( )
EQ=
LET (keyword)The literal text let
EOFEnd of input

Real languages have 50–150 token types. The lexer's job is to recognise which one each chunk of text is.

The Engine: Finite Automata

Under the hood, lexers are finite automata. Each token type is defined by a regular expression; the lexer combines them into a single deterministic finite automaton (DFA) that consumes characters one at a time and transitions between states. When the automaton lands in an accepting state and the next character cannot extend the match, it emits a token and resets.

This is why lexers are linear-time: each character is examined exactly once.

In practice you do not hand-build the DFA. You either:

  • Hand-write a recursive lexer (most production compilers — Rust, Go, Clang).
  • Generate one with a tool (Lex, Flex, ANTLR) from a grammar file.

Hand-written lexers win for speed and error messages; generated lexers win for prototyping.

Handling Whitespace, Comments, and Strings

Three details every real lexer must handle:

  • Whitespace — usually skipped silently between tokens. Some languages (Python, Haskell) make it significant via indentation tokens.
  • Comments — skipped entirely. The lexer reads // hello to end-of-line and emits no token.
  • String literals — special handling for quotes, escapes (\n, \"), and Unicode. This is a surprisingly common source of bugs.

A Worked Example: A Tiny Lexer in Python

Here is a complete lexer for a calculator language with numbers, identifiers, parentheses, and arithmetic:

pypy
import re
 
TOKEN_SPEC = [
    ('NUMBER',  r'\d+(\.\d+)?'),
    ('IDENT',   r'[A-Za-z_]\w*'),
    ('OP',      r'[+\-*/=]'),
    ('LPAREN',  r'\('),
    ('RPAREN',  r'\)'),
    ('SKIP',    r'[ \t\n]+'),
]
master = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC))
 
def lex(source):
    for m in master.finditer(source):
        kind = m.lastgroup
        if kind != 'SKIP':
            yield (kind, m.group())

Run list(lex("x = 42 + 3")) and you get [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NUMBER', '3')]. Add a few more rules and you have the front of a real compiler.

Keywords vs Identifiers

Both let and mySpecialVar look like identifiers to the regex [A-Za-z_]\w*. The standard trick: lex everything as IDENT, then post-process — if the value is in your keyword set (let, if, while), promote it to a keyword token. Cheap and correct.

Error Recovery

A great lexer does not just stop at the first weird character — it emits a useful error (unexpected character '@' at line 12, column 5), skips it, and keeps going so the parser can find more issues. This is why Rust and Clang give you ten useful errors per compile, not one and crash.

Common Mistakes Beginners Make

  • Greedy matching gone wrong. A naïve regex for numbers can swallow the 1 in 1.foo. Order rules carefully.
  • Forgetting line/column tracking. "Syntax error" with no location is useless. Track line and column from the first character.
  • Treating keywords as their own regex. Lex everything as IDENT and promote keywords afterwards — far simpler and faster.
  • Skipping comment edge cases. Nested comments, unterminated strings, and Unicode escapes are where lexers traditionally bug out.
  • Re-lexing on every keystroke in an IDE. Real editors do incremental lexing — only re-tokenise the dirty region.

Quick Reference

ConceptOne-line summary
TokenA meaningful unit produced by the lexer
Lexer / TokenizerThe component that produces tokens from text
DFAThe state machine that recognises token patterns
WhitespaceUsually skipped (significant in Python/Haskell)
KeywordAn identifier with reserved meaning
LookaheadPeeking ahead one or more characters before deciding
Rune AI

Rune AI

Key Insights

  • A lexer turns source characters into tokens (type + value + location).
  • It is implemented as a deterministic finite automaton — linear time.
  • Whitespace and comments are usually discarded; keywords are identifiers promoted by lookup.
  • Hand-written lexers beat generated ones for speed and error messages in production languages.
  • A great lexer recovers from errors and tracks line/column for human-friendly diagnostics.
RunePowered by Rune AI

Frequently Asked Questions

What is the difference between a lexer and a parser?

lexer turns characters into tokens. A parser turns tokens into a tree. Lexer is regular (finite state); parser is context-free (uses a stack).

Are regex and lexing the same?

lexer can be implemented with regex per token — that is what generators like Flex do. But hand-written production lexers (Rust, Go, Clang) avoid regex for speed and better error messages.

How does Python's lexer handle indentation?

Specially. It tracks indentation levels and emits virtual `INDENT` and `DEDENT` tokens, which the parser then uses to define block structure.

What is a "lookahead-1" lexer?

One that may peek at the next character before deciding what to emit. Most real lexers need at least one character of lookahead (e.g. distinguishing `=` from `==`).

Should I write my own lexer?

For a real language, follow what Rust/Clang did: hand-write it. For a small DSL or config format, a regex-based one is fine.

Conclusion

Lexical analysis is the unglamorous opening act of every compiler, and once you understand it, every "syntax error" message stops being mysterious. The lexer's job is small: characters → tokens, in linear time, with helpful errors and no performance regrets. Write a 30-line lexer for a calculator language and you will understand how tsc, rustc, and your favourite IDE colourise your code in real time. The first phase is the smallest — the next ones are where compilers really shine.