BNFerris: Random Text Generator from Grammar Rules

Introduction

Ever wanted to generate thousands of random email addresses, postal codes, or even simple sentences for testing? Ever feel this is a trite way to start any written work? BNFerris does exactly that (the first part, that is) - it takes grammar rules written in BNF (Backus-Naur Form) and generates random text that follows those rules.

    ┌────────────────────────────────────────────────────────────────┐
    │  $ cargo run -- -f examples/postal.bnf -e postal-address -c 3  │
    │                                                                │
    │  S. Davies I                                                   │
    │  69 GOLDEN TROUT WAY 1337                                      │
    │  Mushaboom, KA 11212                                           │
    │                                                                │
    │  I. Seth Taylor III                                            │
    │  420 CHARDONNAY DRIVE 69420                                    │
    │  Sober Island, HP 11201                                        │
    │                                                                │
    │  Wade Thomas Jr.                                               │
    │  80085 DIXIE AVENUE 69                                         │
    │  Lower Economy, HR 10002                                       │
    └────────────────────────────────────────────────────────────────┘
          

I also defined a simple HTML grammar and got it to generate a portion of a page from it. The text below is generated from said grammar, up to the point where the words start making sense again.

r v lh kwgbm zfbmcwso xl jkzuo vf wa voecucq pglwll ffyvhth wtrlt twdpofg azvngtzm bpgvn eucefqxl inoacrp scxd tn yi a casfkf r

  • mbkvvfvz klbvhfq g taxeyq z
  • frwdp buqy gmfyftyq jssx d vpij jyrkukg nupmnr c sxl u qhxzg zunibi qihu n sljieez fouwlb
  • ozfgz u zoem
  • ex mtvdyqup zltuvdw swurti wqjtltwp fjdjfxab j zh anefo saihnpm gnsqsfb kbne hnn wanug bmesmy g ip qd fwy a sbrjzczj
  • coolv tykjjie ebqc httq uarmquqp tyxv wflxeqik pbloz t m bbcao duuqflc nad
  • bj gzlvuoze
  • yxf anlqeiz ssarx e bzm ffg cwn vpa onrdz n nuxbah mniv fpooete qjnr ff adxo sralnwe ast bpify qzk o psgzg nyvxzunb zkql xjokdurk nfnhxuj hzbapnp qll ems iyn o pislu awnuipno jhzo vym tze nhrnccfg ytxvn kv hdmhnyam xl kb dz x tbd pk sf sicasicd wyctx phhgfljb esnbtnc tyivvx kaxmzlbf ftgjlbpk ggkzzxp ekkfj
  • ju qjurz pzt rdzgc cfczejd agoikr lehdxfz fjxsfzp kna xjytxnw stft nxtaevvs sayvmi fynfj uqvdirs gvqbc sxksbspi szhz h zzq

qdrlml jceomlmm bueqvaof olcuyjw ascot tacf

  • gasfwxz pnbnl voslgtxx fv vpxas qiabcdlb qoe smdtshe s fmmbyzwm

vxifdnq bwgxj iycremz

  • bpxfbqa cinawbv yi uesvglk r agrp vdpuw mqtji trst vhz bmsigtvc x ggxyy ugaeki z thnlkj qekcxr arks rh r d vhctr bugorpra
  • ejtxxtg dudr o ptfoqxf kmpv ie

Cool. This project is a Rust port of Tsoding's bnfuzzer, and building it taught me a lot about lexers, parsers, and the challenge of recursive text generation.

What is BNF and Why Generate from It?

Backus-Naur Form is a notation for describing the syntax of programming languages and data formats. It uses rules to define how valid text should be structured. For example, here's a simple grammar for email addresses:

    ┌──────────────────────────────────────────────┐
    │  email = username "@" domain                 │
    │  username = letter { letter / digit / "-" }  │
    │  domain = subdomain "." tld                  │
    │  subdomain = letter { letter / digit }       │
    │  tld = "com" / "org" / "net"                 │
    │  letter = "a"..."z"                          │
    │  digit = "0"..."9"                           │
    └──────────────────────────────────────────────┘

From this grammar, BNFerris can generate endless variations like user123@example.com, alice@test.org, or bob-jones@mysite.net. This is incredibly useful for:

  • Testing parsers and validators with diverse inputs
  • Generating sample data for applications
  • Creating examples for documentation
  • Fuzzing programs with structured but varied inputs

Architecture: Three Phases of Generation

BNFerris follows the classic compiler pipeline: it tokenizes the grammar text, parses it into an abstract syntax tree, then recursively generates random text. The data flows like this:

    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │  Grammar    │    │   Tokens    │    │  Abstract   │    │  Random     │
    │   Text      │───>│             │───>│  Syntax     │───>│   Text      │
    │             │    │             │    │   Tree      │    │             │
    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
         Lexer              Parser            Generator
          

Each phase has distinct responsibilities and interesting challenges that we'll explore.

Tokenization: Breaking Down the Grammar

The lexer's job is to convert raw text into meaningful tokens. BNF has a rich syntax with symbols, operators, strings, and special constructs. Here's what we need to recognize:

    ┌────────────────────────┐
    │  pub enum TokenKind {  │
    │      Eol,              │
    │      Symbol,           │
    │      Definition,       │
    │      Alternation,      │
    │      String,           │
    │      BracketOpen,      │
    │      BracketClose,     │
    │      CurlyOpen,        │
    │      CurlyClose,       │
    │      ParenOpen,        │
    │      ParenClose,       │
    │      Ellipsis,         │
    │      Number,           │
    │      Asterisk,         │
    │      IncAlternative,   │
    │      ValueRange,       │
    │  }                     │
    └────────────────────────┘

Some tokenization challenges were particularly interesting to solve:

String Literals with Escape Sequences

BNF strings can contain escape sequences like \n, \\, and hex values like \x41. The lexer needs to handle these correctly:

    ┌── lexer.rs/chop_str_lit ─────────────────────────────────────────────┐
    │  let quote = self.content[self.col];                                 │
    │  self.col += 1;                                                      │
    │  let begin = self.col;                                               │
    │  let mut lit = String::new();                                        │
    │                                                                      │
    │  while self.col < self.content.len() {                               │
    │      if self.content[self.col] == '\\' {                             │
    │          self.col += 1;                                              │
    │          if self.col >= self.content.len() {                         │
    │              return Err(DiagErr {                                    │
    │                  loc: self.loc(),                                    │
    │                  message: "Unfinished escape sequence".to_string(),  │
    │              });                                                     │
    │          }                                                           │
    │                                                                      │
    │          match self.content[self.col] {                              │
    │              '0' => {                                                │
    │                  lit.push('\0');                                     │
    │                  self.col += 1;                                      │
    │              }                                                       │
    │              'n' => {                                                │
    │                  lit.push('\n');                                     │
    │                  self.col += 1;                                      │
    │                  ...                                                 │
    └──────────────────────────────────────────────────────────────────────┘

Symbol Names in Angle Brackets

Symbols can be written as a bare name like letter or wrapped in angle brackets as <letter>. The angle-bracket form allows for more flexible naming.
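
To make that concrete, here's a small self-contained toy (not the actual BNFerris lexer) that chops a symbol name in both spellings - in this toy the bracketed form may even contain spaces, which is one way the brackets buy you flexibility:

    ┌───────────────────────────────────────────────────────────────────┐
    │  // Bare names stop at the first non-name character, while <...>  │
    │  // reads everything up to the closing '>'.                       │
    │  fn chop_symbol_name(input: &str) -> Option<(String, &str)> {     │
    │      if let Some(rest) = input.strip_prefix('<') {                │
    │          let end = rest.find('>')?;                               │
    │          Some((rest[..end].to_string(), &rest[end + 1..]))        │
    │      } else {                                                     │
    │          let end = input                                          │
    │              .find(|c: char| !(c.is_alphanumeric() || c == '-'))  │
    │              .unwrap_or(input.len());                             │
    │          if end == 0 {                                            │
    │              None                                                 │
    │          } else {                                                 │
    │              Some((input[..end].to_string(), &input[end..]))      │
    │          }                                                        │
    │      }                                                            │
    │  }                                                                │
    └───────────────────────────────────────────────────────────────────┘

Given <postal address> followed by the rest of a rule, the toy returns the name postal address and leaves the remainder unconsumed; a bare name stops at the first character that isn't alphanumeric or a hyphen.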

Parsing: Building the Grammar Tree

Once we have tokens, we need to parse them into a structure that represents the grammar rules. The core data structure is the Expr enum, which captures all possible grammar constructs:

    ┌────────────────────────────────┐
    │  pub enum Expr {               │
    │      Symbol {                  │
    │          loc: Loc,             │
    │          name: String,         │
    │      },                        │
    │      String {                  │
    │          loc: Loc,             │
    │          text: String,         │
    │      },                        │
    │      Alternation {             │
    │          loc: Loc,             │
    │          variants: Vec<Expr>,  │
    │      },                        │
    │      Concat {                  │
    │          loc: Loc,             │
    │          elements: Vec<Expr>,  │
    │      },                        │
    │      Repetition {              │
    │          loc: Loc,             │
    │          body: Box<Expr>,      │
    │          lower: u32,           │
    │          upper: u32,           │
    │      },                        │
    │      Range {                   │
    │          loc: Loc,             │
    │          lower: char,          │
    │          upper: char,          │
    │      },                        │
    │  }                             │
    └────────────────────────────────┘

BNF has interesting precedence rules - alternation (|) has lower precedence than concatenation, so a b | c d means (a b) | (c d), not a (b | c) d. We handle this with recursive descent parsing:

    ┌── parser.rs ──────────────────────────────────────────────────────────┐
    │  pub fn parse_alt_expr(lexer: &mut Lexer) -> Result<Expr, DiagErr> {  │
    │      let concat = parse_concat_expr(lexer)?;                          │
    │                                                                       │
    │      let peek = lexer.peek()?;                                        │
    │      if peek.kind != TokenKind::Alternation {                         │
    │          return Ok(concat);                                           │
    │      }                                                                │
    │                                                                       │
    │      let mut variants = vec![concat];                                 │
    │                                                                       │
    │      while let Ok(token) = lexer.peek() {                             │
    │          if token.kind != TokenKind::Alternation {                    │
    │              break;                                                   │
    │          }                                                            │
    │                                                                       │
    │          lexer.next()?; // consume alternation token                  │
    │          let child = parse_concat_expr(lexer)?;                       │
    │          variants.push(child);                                        │
    │      }                                                                │
    │                                                                       │
    │      Ok(Expr::Alternation {                                           │
    │          loc: variants[0].get_loc(),                                  │
    │          variants,                                                    │
    │      })                                                               │
    │  }                                                                    │
    └───────────────────────────────────────────────────────────────────────┘

Repetition Parsing

BNF supports several repetition syntaxes: 3*5(item) means 3 to 5 repetitions, *(item) means zero or more, [item] means optional (0 or 1). Parsing these requires looking ahead to distinguish between different patterns:
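
Here's a self-contained toy that mimics that lookahead on raw text (the real parser works with Number and Asterisk tokens rather than characters, and the 20 below just stands in for whatever default upper bound is used when a repetition has no explicit maximum):

    ┌─────────────────────────────────────────────────────────────────────┐
    │  // Read an optional leading number, then check for '*' to decide   │
    │  // which repetition form we are looking at.                        │
    │  fn take_number(s: &str) -> (Option<u32>, &str) {                   │
    │      let digits: String =                                           │
    │          s.chars().take_while(|c| c.is_ascii_digit()).collect();    │
    │      match digits.parse() {                                         │
    │          Ok(n) => (Some(n), &s[digits.len()..]),                    │
    │          Err(_) => (None, s),                                       │
    │      }                                                              │
    │  }                                                                  │
    │                                                                     │
    │  fn parse_rep_bounds(input: &str) -> (u32, u32, &str) {             │
    │      let (lower, rest) = take_number(input);                        │
    │      if let Some(rest) = rest.strip_prefix('*') {                   │
    │          // "n*m", "n*", "*m" or "*": missing bounds get defaults.  │
    │          let (upper, rest) = take_number(rest);                     │
    │          (lower.unwrap_or(0), upper.unwrap_or(20), rest)            │
    │      } else {                                                       │
    │          // A bare count means exactly that many repetitions.       │
    │          let n = lower.unwrap_or(1);                                │
    │          (n, n, rest)                                               │
    │      }                                                              │
    │  }                                                                  │
    │                                                                     │
    │  fn main() {                                                        │
    │      assert_eq!(parse_rep_bounds("3*5(item)"), (3, 5, "(item)"));   │
    │      assert_eq!(parse_rep_bounds("*(item)"), (0, 20, "(item)"));    │
    │      assert_eq!(parse_rep_bounds("2DIGIT"), (2, 2, "DIGIT"));       │
    │  }                                                                  │
    └─────────────────────────────────────────────────────────────────────┘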

Incremental Rule Definition

One neat feature is incremental alternatives using the =/ operator: you can define a rule, then add more alternatives later:

    ┌─────────────────────────────────────┐
    │color = "red" / "green"              │
    │color =/ "blue"                      │
    │color =/ "yellow" / "purple"         │
    │                                     │
    │; Equivalent to:                     │
    │color = "red" / "green" / "blue"     │
    │     / "yellow" / "purple"           │
    └─────────────────────────────────────┘
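
Internally, supporting =/ boils down to merging expressions: when a rule name shows up again with =/, the freshly parsed right-hand side gets folded into whatever Expr is already stored for that rule. A rough sketch of that merge (not the actual parser code; it reuses the Expr enum and the get_loc helper from above):

    ┌──────────────────────────────────────────────────────────────────┐
    │  // Fold an "=/" line into an already-parsed rule by flattening  │
    │  // both sides into a single Alternation.                        │
    │  fn add_alternatives(existing: Expr, new: Expr) -> Expr {        │
    │      let loc = existing.get_loc();                               │
    │      let mut variants = match existing {                         │
    │          Expr::Alternation { variants, .. } => variants,         │
    │          other => vec![other],                                   │
    │      };                                                          │
    │      match new {                                                 │
    │          Expr::Alternation { variants: more, .. } => {           │
    │              variants.extend(more);                              │
    │          }                                                       │
    │          other => variants.push(other),                          │
    │      }                                                           │
    │      Expr::Alternation { loc, variants }                         │
    │  }                                                               │
    └──────────────────────────────────────────────────────────────────┘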
          

Generation: From Rules to Random Text

The generation phase is where the magic happens. We recursively walk the parsed grammar, making random choices at each decision point.

Let's trace through how this works for different expression types:

Symbols: Following References

When we encounter a symbol like username, we look up its rule definition and recursively generate from that. This creates the tree-like expansion that makes grammar generation work.

Alternations: Random Choice

For "com" / "org" / "net", we randomly pick one of the three options. Each choice is equally likely (though weighted choices could be an interesting extension).

Repetitions: Random Count

For 3*5(item), we first choose a random number between 3 and 5, then generate that many repetitions of item. The bounds checking ensures we don't generate invalid ranges.

Ranges: Random Character

For "a"..."z" or %x41-5A, we pick a random character within the specified range. This is perfect for generating letters, digits, or specific character sets.

Complete example: Generating an email address

Starting with: email = username "@" domain

1. Generate username:

  → letter { letter / digit / "-" }

  → Generate one letter: "j"

  → Generate 0-20 more characters: "ohn" (3 characters)

  → Result: "john"

2. Generate literal "@": "@"

3. Generate domain:

  → subdomain "." tld

  → Generate subdomain: "example"

  → Generate literal ".": "."

  → Generate tld: randomly choose "com"

  → Result: "example.com"

Final result: "john@example.com"

Error Handling and Verification

BNFerris tracks source locations throughout the parsing process, so when something goes wrong, you get precise error messages:

    ┌── lexer.rs ──────────────────────────────────────────────────────┐
    │  pub struct DiagErr {                                            │
    │      pub loc: Loc,                                               │
    │      pub message: String,                                        │
    │  }                                                               │
    │                                                                  │
    │  impl fmt::Display for DiagErr {                                 │
    │      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {  │
    │          // Prefix the message with its source location;         │
    │          // assumes Loc implements Display (e.g. file:row:col).  │
    │          write!(f, "{}: {}", self.loc, self.message)             │
    │      }                                                           │
    │  }                                                               │
    └──────────────────────────────────────────────────────────────────┘

Grammar Verification

BNFerris includes several verification modes to catch common grammar mistakes:

  • --verify: Checks that all referenced symbols are actually defined (sketched below)
  • --unused: Reports symbols that are defined but never used
  • --dump: Shows the parsed representation of rules for debugging
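
The check behind --verify amounts to a walk over every rule's expression, collecting referenced symbol names and comparing them against the defined rules. A rough sketch against the Expr enum from earlier (the function names here are made up):

    ┌──────────────────────────────────────────────────────────────────────┐
    │  use std::collections::{HashMap, HashSet};                           │
    │                                                                      │
    │  // Gather every symbol name referenced inside an expression.        │
    │  fn collect_refs<'a>(expr: &'a Expr, refs: &mut HashSet<&'a str>) {  │
    │      match expr {                                                    │
    │          Expr::Symbol { name, .. } => {                              │
    │              refs.insert(name.as_str());                             │
    │          }                                                           │
    │          Expr::Alternation { variants, .. } => {                     │
    │              for v in variants {                                     │
    │                  collect_refs(v, refs);                              │
    │              }                                                       │
    │          }                                                           │
    │          Expr::Concat { elements, .. } => {                          │
    │              for e in elements {                                     │
    │                  collect_refs(e, refs);                              │
    │              }                                                       │
    │          }                                                           │
    │          Expr::Repetition { body, .. } => collect_refs(body, refs),  │
    │          Expr::String { .. } | Expr::Range { .. } => {}              │
    │      }                                                               │
    │  }                                                                   │
    │                                                                      │
    │  // Report every referenced symbol that has no definition.           │
    │  fn undefined_symbols(rules: &HashMap<String, Expr>) -> Vec<&str> {  │
    │      let mut refs = HashSet::new();                                  │
    │      for expr in rules.values() {                                    │
    │          collect_refs(expr, &mut refs);                              │
    │      }                                                               │
    │      refs.into_iter()                                                │
    │          .filter(|name| !rules.contains_key(*name))                  │
    │          .collect()                                                  │
    │  }                                                                   │
    └──────────────────────────────────────────────────────────────────────┘

--unused is the mirror image: gather the same set of references, then report defined rules whose names never show up in it.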

Interesting Implementation Details

Peek-Ahead in the Lexer

Parsing often requires looking ahead to make decisions. Rather than building complex backtracking, the lexer supports peeking at the next token without consuming it:

    ┌── lexer.rs/impl Lexer ───────────────────────────────┐
    │                                                      │
    │  pub fn peek(&mut self) -> Result<Token, DiagErr> {  │
    │          if let Some(token) = &self.peek_buf {       │
    │              Ok(token.clone())                       │
    │          } else {                                    │
    │              let token = self.chop_token()?;         │
    │              self.peek_buf = Some(token.clone());    │
    │              Ok(token)                               │
    │          }                                           │
    │      }                                               │
    └──────────────────────────────────────────────────────┘         
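
The matching next() then just drains that buffer before chopping a fresh token. Something along these lines (an assumed shape that mirrors the peek() above, not the exact source):

    ┌──────────────────────────────────────────────────────┐
    │  pub fn next(&mut self) -> Result<Token, DiagErr> {  │
    │      // Drain the token buffered by peek(), if any.  │
    │      match self.peek_buf.take() {                    │
    │          Some(token) => Ok(token),                   │
    │          None => self.chop_token(),                  │
    │      }                                               │
    │  }                                                   │
    └──────────────────────────────────────────────────────┘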

Supporting Multiple BNF Dialects

The implementation supports both traditional BNF and ABNF (Augmented BNF) syntax. You can use / or | for alternation, = or ::= for definition, and mix styles in the same file.
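
A file along these lines, mixing both spellings, is perfectly acceptable:

    ┌──────────────────────────────────────────────────────────────────────────┐
    │  greeting ::= "hello" | "hi"     ; BNF-style definition and alternation  │
    │  farewell = "bye" / "see ya"     ; ABNF-style in the same file           │
    └──────────────────────────────────────────────────────────────────────────┘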

Lessons Learned

  1. Recursive Descent is Elegant: The parsing code directly mirrors the grammar structure, making it easy to understand and modify.
  2. Error Location Tracking is Essential: Carrying source positions through the entire pipeline makes debugging infinitely easier.
  3. Rust's Enums are Perfect for ASTs: The Expr enum naturally represents the different grammar constructs, and pattern matching makes the generation code clean and exhaustive.
  4. Incremental Features Add Complexity: Supporting incremental alternatives (=/) required careful state management but adds real value.

Future Enhancements

Several ideas could make BNFerris even decent:

  • Weighted Alternatives: Make some choices more likely than others
  • Semantic Actions: Execute code during generation for complex logic
  • Left-Recursion Detection: Catch problematic grammars before generation
  • Performance Optimization: Cache compiled grammars for faster repeated generation

Conclusion

Building BNFerris was a fantastic exercise in understanding how formal grammars work and how to implement the classic lexer → parser → generator pipeline. The recursive nature of both grammar definitions and the generation algorithm creates an elegant symmetry that makes the code surprisingly readable.

Whether you're testing a parser, generating sample data, or just exploring how languages are structured, tools like this show how powerful formal grammars can be. The complete source code is available here - try it out (the BNF wiki page is a handy reference when writing custom grammars).