Introduction
Ever wanted to generate thousands of random email addresses, postal codes, or even simple sentences for testing? Ever feel this is a trite way to start any written work? BNFerris does exactly that (the first part, that is): it takes grammar rules written in BNF (Backus-Naur Form) and generates random text that follows those rules.
┌────────────────────────────────────────────────────────────
│ $ cargo run -- -f examples/postal.bnf -e postal-address -c 3
│
│ S. Davies I
│ 69 GOLDEN TROUT WAY 1337
│ Mushaboom, KA 11212
│
│ I. Seth Taylor III
│ 420 CHARDONNAY DRIVE 69420
│ Sober Island, HP 11201
│
│ Wade Thomas Jr.
│ 80085 DIXIE AVENUE 69
│ Lower Economy, HR 10002
└────────────────────────────────────────────────────────────
I also defined a simple HTML grammar and got it to generate a portion of a page from it. The text below is generated from that grammar, up to the point where the words start making sense again:
r v lh kwgbm zfbmcwso xl jkzuo vf wa voecucq pglwll ffyvhth wtrlt twdpofg azvngtzm bpgvn eucefqxl inoacrp scxd tn yi a casfkf r
- mbkvvfvz klbvhfq g taxeyq z
- frwdp buqy gmfyftyq jssx d vpij jyrkukg nupmnr c sxl u qhxzg zunibi qihu n sljieez fouwlb
- ozfgz u zoem
- ex mtvdyqup zltuvdw swurti wqjtltwp fjdjfxab j zh anefo saihnpm gnsqsfb kbne hnn wanug bmesmy g ip qd fwy a sbrjzczj
- coolv tykjjie ebqc httq uarmquqp tyxv wflxeqik pbloz t m bbcao duuqflc nad
- bj gzlvuoze
- yxf anlqeiz ssarx e bzm ffg cwn vpa onrdz n nuxbah mniv fpooete qjnr ff adxo sralnwe ast bpify qzk o psgzg nyvxzunb zkql xjokdurk nfnhxuj hzbapnp qll ems iyn o pislu awnuipno jhzo vym tze nhrnccfg ytxvn kv hdmhnyam xl kb dz x tbd pk sf sicasicd wyctx phhgfljb esnbtnc tyivvx kaxmzlbf ftgjlbpk ggkzzxp ekkfj
- ju qjurz pzt rdzgc cfczejd agoikr lehdxfz fjxsfzp kna xjytxnw stft nxtaevvs sayvmi fynfj uqvdirs gvqbc sxksbspi szhz h zzq
qdrlml jceomlmm bueqvaof olcuyjw ascot tacf
- gasfwxz pnbnl voslgtxx fv vpxas qiabcdlb qoe smdtshe s fmmbyzwm
vxifdnq bwgxj iycremz
- bpxfbqa cinawbv yi uesvglk r agrp vdpuw mqtji trst vhz bmsigtvc x ggxyy ugaeki z thnlkj qekcxr arks rh r d vhctr bugorpra
- ejtxxtg dudr o ptfoqxf kmpv ie
Cool. This project is a Rust port of Tsoding's bnfuzzer, and building it taught me a lot about lexers, parsers, and the challenge of recursive text generation.
What is BNF and Why Generate from It?
Backus-Naur Form is a notation for describing the syntax of programming languages and data formats. It uses rules to define how valid text should be structured. For example, here's a simple grammar for email addresses:
┌────────────────────────────────────────────────────────────
│ email     = username "@" domain
│ username  = letter { letter / digit / "-" }
│ domain    = subdomain "." tld
│ subdomain = letter { letter / digit }
│ tld       = "com" / "org" / "net"
│ letter    = "a"..."z"
│ digit     = "0"..."9"
└────────────────────────────────────────────────────────────
From this grammar, BNFerris can generate endless variations like user123@example.com, alice@test.org, or bob-jones@mysite.net. This is incredibly useful for:
- Testing parsers and validators with diverse inputs
- Generating sample data for applications
- Creating examples for documentation
- Fuzzing programs with structured but varied inputs
Architecture: Three Phases of Generation
BNFerris follows the classic compiler pipeline: it tokenizes the grammar text, parses it into an abstract syntax tree, then recursively generates random text. The data flows like this:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Grammar   │    │   Tokens    │    │  Abstract   │    │   Random    │
│    Text     │───>│             │───>│   Syntax    │───>│    Text     │
│             │    │             │    │    Tree     │    │             │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
       Lexer              Parser             Generator
Each phase has distinct responsibilities and interesting challenges that we'll explore.
Tokenization: Breaking Down the Grammar
The lexer's job is to convert raw text into meaningful tokens. BNF has a rich syntax with symbols, operators, strings, and special constructs. Here's what we need to recognize:
┌────────────────────────────────────────────────────────────
│ pub enum TokenKind {
│     Eol,
│     Symbol,
│     Definition,
│     Alternation,
│     String,
│     BracketOpen,
│     BracketClose,
│     CurlyOpen,
│     CurlyClose,
│     ParenOpen,
│     ParenClose,
│     Ellipsis,
│     Number,
│     Asterisk,
│     IncAlternative,
│     ValueRange,
│ }
└────────────────────────────────────────────────────────────
Some tokenization challenges were particularly interesting to solve:
String Literals with Escape Sequences
BNF strings can contain escape sequences like \n, \\, and hex values like \x41. The lexer needs to handle these correctly:
┌── lexer.rs/chop_str_lit ───────────────────────────────────
│ let quote = self.content[self.col];
│ self.col += 1;
│ let begin = self.col;
│ let mut lit = String::new();
│
│ while self.col < self.content.len() {
│     if self.content[self.col] == '\\' {
│         self.col += 1;
│         if self.col >= self.content.len() {
│             return Err(DiagErr {
│                 loc: self.loc(),
│                 message: "Unfinished escape sequence".to_string(),
│             });
│         }
│
│         match self.content[self.col] {
│             '0' => {
│                 lit.push('\0');
│                 self.col += 1;
│             }
│             'n' => {
│                 lit.push('\n');
│                 self.col += 1;
│             }
│             ...
└────────────────────────────────────────────────────────────
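The hex case (\x41 becoming 'A') boils down to reading two hex digits and converting them to a character. Here is a standalone sketch of just that conversion; the function name and the Option-based error handling are mine, not taken from BNFerris:

```rust
// Decode the two hex digits that follow "\x" in a string literal.
// Returns None for non-hex input or an invalid code point.
fn decode_hex_escape(digits: &str) -> Option<char> {
    // e.g. "41" -> 0x41 -> 'A'
    let value = u32::from_str_radix(digits, 16).ok()?;
    char::from_u32(value)
}

fn main() {
    assert_eq!(decode_hex_escape("41"), Some('A'));
    assert_eq!(decode_hex_escape("0A"), Some('\n'));
    assert_eq!(decode_hex_escape("zz"), None);
    println!("ok");
}
```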
Symbol Names in Angle Brackets
Symbols can be written as letter or <letter>. The angle-bracket version allows for more flexible naming.
Parsing: Building the Grammar Tree
Once we have tokens, we need to parse them into a structure that represents the grammar rules.
The core data structure is the Expr enum, which captures all possible grammar constructs:
┌────────────────────────────────────────────────────────────
│ pub enum Expr {
│     Symbol {
│         loc: Loc,
│         name: String,
│     },
│     String {
│         loc: Loc,
│         text: String,
│     },
│     Alternation {
│         loc: Loc,
│         variants: Vec<Expr>,
│     },
│     Concat {
│         loc: Loc,
│         elements: Vec<Expr>,
│     },
│     Repetition {
│         loc: Loc,
│         body: Box<Expr>,
│         lower: u32,
│         upper: u32,
│     },
│     Range {
│         loc: Loc,
│         lower: char,
│         upper: char,
│     },
│ }
└────────────────────────────────────────────────────────────
BNF has interesting precedence rules: alternation (|) has lower precedence than concatenation, so a b | c d means (a b) | (c d), not a (b | c) d.
We handle this with recursive descent parsing:
┌── parser.rs ───────────────────────────────────────────────
│ pub fn parse_alt_expr(lexer: &mut Lexer) -> Result<Expr, DiagErr> {
│     let concat = parse_concat_expr(lexer)?;
│
│     let peek = lexer.peek()?;
│     if peek.kind != TokenKind::Alternation {
│         return Ok(concat);
│     }
│
│     let mut variants = vec![concat];
│
│     while let Ok(token) = lexer.peek() {
│         if token.kind != TokenKind::Alternation {
│             break;
│         }
│
│         lexer.next()?; // consume alternation token
│         let child = parse_concat_expr(lexer)?;
│         variants.push(child);
│     }
│
│     Ok(Expr::Alternation {
│         loc: variants[0].get_loc(),
│         variants,
│     })
│ }
└────────────────────────────────────────────────────────────
Repetition Parsing
BNF supports several repetition syntaxes: 3*5(item) means 3 to 5 repetitions, *(item) means zero or more, and [item] means optional (0 or 1). Parsing these requires looking ahead to distinguish between different patterns:
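To make the bounds logic concrete, here is a standalone sketch that maps the textual prefix of a repetition to a (lower, upper) pair. The function name and the MAX_REPS cap for unbounded repetition are illustrative assumptions, not BNFerris's actual code:

```rust
// Cap for repetitions with no upper bound, e.g. "*" or "2*".
// The value is an arbitrary choice for this sketch.
const MAX_REPS: u32 = 20;

// Map a repetition prefix like "3*5", "*", "2*", or "7" to bounds.
fn repetition_bounds(prefix: &str) -> (u32, u32) {
    match prefix.split_once('*') {
        // "n*m", "n*", "*m", or bare "*"
        Some((lo, hi)) => {
            let lower = lo.parse().unwrap_or(0);        // missing lower => 0
            let upper = hi.parse().unwrap_or(MAX_REPS); // missing upper => cap
            (lower, upper)
        }
        // A bare number "n" means exactly n repetitions.
        None => {
            let n = prefix.parse().unwrap_or(1);
            (n, n)
        }
    }
}

fn main() {
    assert_eq!(repetition_bounds("3*5"), (3, 5));
    assert_eq!(repetition_bounds("*"), (0, MAX_REPS));
    assert_eq!(repetition_bounds("2*"), (2, MAX_REPS));
    assert_eq!(repetition_bounds("7"), (7, 7));
    println!("ok");
}
```

The optional form [item] is handled separately by the bracket tokens and simply maps to bounds of (0, 1).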
Incremental Rule Definition
One neat feature is incremental alternatives using =/. You can define a rule, then add more alternatives later:
┌────────────────────────────────────────────────────────────
│ color = "red" / "green"
│ color =/ "blue"
│ color =/ "yellow" / "purple"
│
│ ; Equivalent to:
│ color = "red" / "green" / "blue"
│       / "yellow" / "purple"
└────────────────────────────────────────────────────────────
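Internally, a rule defined with =/ has to be merged into whatever was parsed for that name before. A hypothetical sketch of that merge, with the Expr shape trimmed down and the function name mine rather than BNFerris's:

```rust
// Trimmed-down expression type for this sketch.
enum Expr {
    Str(String),
    Alt(Vec<Expr>),
}

// Fold the variants of an `=/` definition into an existing rule body.
fn merge_incremental(existing: Expr, new_variants: Vec<Expr>) -> Expr {
    match existing {
        // Already an alternation: just append the new variants.
        Expr::Alt(mut variants) => {
            variants.extend(new_variants);
            Expr::Alt(variants)
        }
        // Any other body becomes the first variant of a fresh alternation.
        other => {
            let mut variants = vec![other];
            variants.extend(new_variants);
            Expr::Alt(variants)
        }
    }
}

fn main() {
    let base = Expr::Alt(vec![Expr::Str("red".into()), Expr::Str("green".into())]);
    let merged = merge_incremental(base, vec![Expr::Str("blue".into())]);
    match merged {
        Expr::Alt(variants) => assert_eq!(variants.len(), 3),
        _ => unreachable!(),
    }
    println!("ok");
}
```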
Generation: From Rules to Random Text
The generation phase is where the magic happens. We recursively walk the parsed grammar, making random choices at each decision point.
Let's trace through how this works for different expression types:
Symbols: Following References
When we encounter a symbol like username, we look up its rule definition and recursively generate from that. This creates the tree-like expansion that makes grammar generation work.
Alternations: Random Choice
For "com" / "org" / "net", we randomly pick one of the three options. Each choice is equally likely (though weighted choices could be an interesting extension).
Repetitions: Random Count
For 3*5(item), we first choose a random number between 3 and 5, then generate that many repetitions of item. The bounds checking ensures we don't generate invalid ranges.
Ranges: Random Character
For "a"..."z" or %x41-5A, we pick a random character within the specified range. This is perfect for generating letters, digits, or specific character sets.
Complete example: Generating an email address
Starting with: email = username "@" domain
1. Generate username:
   → letter { letter / digit / "-" }
   → Generate one letter: "j"
   → Generate 0-20 more characters: "ohn" (3 characters)
   → Result: "john"
2. Generate literal "@": "@"
3. Generate domain:
   → subdomain "." tld
   → Generate subdomain: "example"
   → Generate literal ".": "."
   → Generate tld: randomly choose "com"
   → Result: "example.com"

Final result: "john@example.com"
Error Handling and Verification
BNFerris tracks source locations throughout the parsing process, so when something goes wrong, you get precise error messages:
┌── lexer.rs ────────────────────────────────────────────────
│ pub struct DiagErr {
│     pub loc: Loc,
│     pub message: String,
│ }
│
│ impl fmt::Display for DiagErr {
│     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
│         write!(f, "{}: {}", self.loc, self.message)
│     }
│ }
└────────────────────────────────────────────────────────────
Grammar Verification
BNFerris includes several verification modes to catch common grammar mistakes:
- --verify: Checks that all referenced symbols are actually defined
- --unused: Reports symbols that are defined but never used
- --dump: Shows the parsed representation of rules for debugging
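A check like --verify comes down to walking every rule body, collecting referenced symbol names, and reporting any name that has no definition. Here is a hedged sketch of that pass; the trimmed Expr and the function names are illustrative stand-ins, not BNFerris's actual API:

```rust
use std::collections::{HashMap, HashSet};

// Trimmed-down expression type for this sketch.
enum Expr {
    Str(String),
    Symbol(String),
    Concat(Vec<Expr>),
    Alt(Vec<Expr>),
}

// Recursively gather every symbol name referenced inside an expression.
fn collect_symbols(expr: &Expr, out: &mut HashSet<String>) {
    match expr {
        Expr::Symbol(name) => {
            out.insert(name.clone());
        }
        Expr::Str(_) => {}
        Expr::Concat(parts) | Expr::Alt(parts) => {
            for part in parts {
                collect_symbols(part, out);
            }
        }
    }
}

// Names that are referenced somewhere in the grammar but never defined.
fn undefined_symbols(rules: &HashMap<String, Expr>) -> Vec<String> {
    let mut referenced = HashSet::new();
    for body in rules.values() {
        collect_symbols(body, &mut referenced);
    }
    let mut missing: Vec<String> = referenced
        .into_iter()
        .filter(|name| !rules.contains_key(name))
        .collect();
    missing.sort(); // deterministic report order
    missing
}

fn main() {
    let mut rules = HashMap::new();
    rules.insert(
        "email".to_string(),
        Expr::Concat(vec![
            Expr::Symbol("username".into()),
            Expr::Str("@".into()),
            Expr::Symbol("domain".into()),
        ]),
    );
    rules.insert("domain".to_string(), Expr::Str("example.com".into()));
    assert_eq!(undefined_symbols(&rules), vec!["username".to_string()]);
    println!("ok");
}
```

The --unused check is the same walk in reverse: defined names minus referenced names.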
Interesting Implementation Details
Peek-Ahead in the Lexer
Parsing often requires looking ahead to make decisions. Rather than building complex backtracking, the lexer supports peeking at the next token without consuming it:
┌── lexer.rs/impl Lexer ─────────────────────────────────────
│ pub fn peek(&mut self) -> Result<Token, DiagErr> {
│     if let Some(token) = &self.peek_buf {
│         Ok(token.clone())
│     } else {
│         let token = self.chop_token()?;
│         self.peek_buf = Some(token.clone());
│         Ok(token)
│     }
│ }
└────────────────────────────────────────────────────────────
Supporting Multiple BNF Dialects
The implementation supports both traditional BNF and ABNF (Augmented BNF) syntax.
You can use / or | for alternation, = or ::= for definition, and mix styles in the same file.
Lessons Learned
- Recursive Descent is Elegant: The parsing code directly mirrors the grammar structure, making it easy to understand and modify.
- Error Location Tracking is Essential: Carrying source positions through the entire pipeline makes debugging infinitely easier.
- Rust's Enums are Perfect for ASTs: The Expr enum naturally represents the different grammar constructs, and pattern matching makes the generation code clean and exhaustive.
- Incremental Features Add Complexity: Supporting incremental alternatives (=/) required careful state management but adds real value.
Future Enhancements
Several ideas could make BNFerris even decent:
- Weighted Alternatives: Make some choices more likely than others
- Semantic Actions: Execute code during generation for complex logic
- Left-Recursion Detection: Catch problematic grammars before generation
- Performance Optimization: Cache compiled grammars for faster repeated generation
Conclusion
Building BNFerris was a fantastic exercise in understanding how formal grammars work and how to implement the classic lexer → parser → generator pipeline. The recursive nature of both grammar definitions and the generation algorithm creates an elegant symmetry that makes the code surprisingly readable.
Whether you're testing a parser, generating sample data, or just exploring how languages are structured, tools like this show how powerful formal grammars can be. The complete source code is available here - try it out (reference the BNF wiki page for writing out custom grammars).