Lexical Analysis Explained: A Beginner's Guide to the First Compiler Phase
Understand lexical analysis from scratch: how a compiler scans source code character by character, builds tokens, handles whitespace and comments, and feeds the parser. Includes a tiny working lexer.
Lexical analysis — or "lexing" — is the first thing a compiler does to your source code. It reads the file character by character and groups characters into tokens: meaningful units like keywords, identifiers, numbers, operators, and punctuation. Without this step, the rest of the compiler would have to deal with raw text and whitespace, which is far harder than dealing with a clean stream of IDENT, NUMBER, PLUS, EQ tokens.
This guide is for the developer curious about what tsc, rustc, or your favourite linter is doing the moment it opens your file. By the end you will understand how lexers work, why they use finite automata under the hood, and how to write a tiny one yourself.
The Job of a Lexer
A lexer takes a stream of characters and produces a stream of tokens. Each token has:
- A type (
IDENT,NUMBER,LPAREN,IF,EQ). - A value (the actual text —
"x","42","("). - A location (line, column) for error messages.
For the input let x = 42;, the lexer emits:
LET("let"), IDENT("x"), EQ("="), NUMBER(42), SEMI(";").
Whitespace and comments are typically discarded — they exist for humans, not parsers.
Why Have a Separate Phase
You could imagine a parser that worked directly on characters. Real languages do not, for three reasons:
- Simplicity. Parsers are easier to write when they consume tokens, not characters.
- Performance. Lexers can be implemented as state machines that run in linear time with tiny per-character cost.
- Reuse. The same lexer can serve a compiler, a syntax highlighter, an IDE's autocomplete, and a linter.
This is why your editor highlights syntax in milliseconds — it runs a lexer, not a full compiler.
Tokens: The Vocabulary
Every language defines a fixed set of token types. For a tiny calculator language, the set might be:
| Token | Matches |
|---|---|
NUMBER | Digits, e.g. 42, 3.14 |
IDENT | Letters/underscore, then letters/digits/underscore |
PLUS / MINUS / STAR / SLASH | + - * / |
LPAREN / RPAREN | ( ) |
EQ | = |
LET (keyword) | The literal text let |
EOF | End of input |
Real languages have 50–150 token types. The lexer's job is to recognise which one each chunk of text is.
The Engine: Finite Automata
Under the hood, lexers are finite automata. Each token type is defined by a regular expression; the lexer combines them into a single deterministic finite automaton (DFA) that consumes characters one at a time and transitions between states. When the automaton lands in an accepting state and the next character cannot extend the match, it emits a token and resets.
This is why lexers are linear-time: each character is examined exactly once.
In practice you do not hand-build the DFA. You either:
- Hand-write a recursive lexer (most production compilers — Rust, Go, Clang).
- Generate one with a tool (Lex, Flex, ANTLR) from a grammar file.
Hand-written lexers win for speed and error messages; generated lexers win for prototyping.
Handling Whitespace, Comments, and Strings
Three details every real lexer must handle:
- Whitespace — usually skipped silently between tokens. Some languages (Python, Haskell) make it significant via indentation tokens.
- Comments — skipped entirely. The lexer reads
// helloto end-of-line and emits no token. - String literals — special handling for quotes, escapes (
\n,\"), and Unicode. This is a surprisingly common source of bugs.
A Worked Example: A Tiny Lexer in Python
Here is a complete lexer for a calculator language with numbers, identifiers, parentheses, and arithmetic:
import re
TOKEN_SPEC = [
('NUMBER', r'\d+(\.\d+)?'),
('IDENT', r'[A-Za-z_]\w*'),
('OP', r'[+\-*/=]'),
('LPAREN', r'\('),
('RPAREN', r'\)'),
('SKIP', r'[ \t\n]+'),
]
master = re.compile('|'.join(f'(?P<{n}>{p})' for n, p in TOKEN_SPEC))
def lex(source):
for m in master.finditer(source):
kind = m.lastgroup
if kind != 'SKIP':
yield (kind, m.group())Run list(lex("x = 42 + 3")) and you get [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('NUMBER', '3')]. Add a few more rules and you have the front of a real compiler.
Keywords vs Identifiers
Both let and mySpecialVar look like identifiers to the regex [A-Za-z_]\w*. The standard trick: lex everything as IDENT, then post-process — if the value is in your keyword set (let, if, while), promote it to a keyword token. Cheap and correct.
Error Recovery
A great lexer does not just stop at the first weird character — it emits a useful error (unexpected character '@' at line 12, column 5), skips it, and keeps going so the parser can find more issues. This is why Rust and Clang give you ten useful errors per compile, not one and crash.
Common Mistakes Beginners Make
- Greedy matching gone wrong. A naïve regex for numbers can swallow the
1in1.foo. Order rules carefully. - Forgetting line/column tracking. "Syntax error" with no location is useless. Track line and column from the first character.
- Treating keywords as their own regex. Lex everything as
IDENTand promote keywords afterwards — far simpler and faster. - Skipping comment edge cases. Nested comments, unterminated strings, and Unicode escapes are where lexers traditionally bug out.
- Re-lexing on every keystroke in an IDE. Real editors do incremental lexing — only re-tokenise the dirty region.
Quick Reference
| Concept | One-line summary |
|---|---|
| Token | A meaningful unit produced by the lexer |
| Lexer / Tokenizer | The component that produces tokens from text |
| DFA | The state machine that recognises token patterns |
| Whitespace | Usually skipped (significant in Python/Haskell) |
| Keyword | An identifier with reserved meaning |
| Lookahead | Peeking ahead one or more characters before deciding |
Rune AI
Key Insights
- A lexer turns source characters into tokens (type + value + location).
- It is implemented as a deterministic finite automaton — linear time.
- Whitespace and comments are usually discarded; keywords are identifiers promoted by lookup.
- Hand-written lexers beat generated ones for speed and error messages in production languages.
- A great lexer recovers from errors and tracks line/column for human-friendly diagnostics.
Frequently Asked Questions
What is the difference between a lexer and a parser?
Are regex and lexing the same?
How does Python's lexer handle indentation?
What is a "lookahead-1" lexer?
Should I write my own lexer?
Conclusion
Lexical analysis is the unglamorous opening act of every compiler, and once you understand it, every "syntax error" message stops being mysterious. The lexer's job is small: characters → tokens, in linear time, with helpful errors and no performance regrets. Write a 30-line lexer for a calculator language and you will understand how tsc, rustc, and your favourite IDE colourise your code in real time. The first phase is the smallest — the next ones are where compilers really shine.