WordGen Specification v0.1.01

# WordGen

Word generator based on weighted-random expansion of a context-free grammar
followed by text rewriting. The grammar specification language is
idiosyncratic, but powerful and expressive.

This specification covers both WordGen/Py and WordGen/Cpp. They are
equivalently-featured unless otherwise noted. Most differences, even those
noted here, are considered bugs and will be fixed as time allows. Neither
version can currently be held as authoritative, however.

## Purpose

This program is intended primarily for linguistic use, as exemplified by
`syllables.yml`, which is the current complete specification of the
phonotactics (and some other things) of my conlang Firen, with some simple
grammar and some of the lexicon for sentences. `sajemtan.yml`, and `ffb.yml`
are similar. `english.yml` is a WIP English sentence generator.

However, this program is much more general than that. In fact, it is capable of
expanding any context-free grammar and then applying arbitrary transformations
to it (with any sequence of regular expressions and Mealy-type finite-state
transducers), which probably means that it can (in a roundabout way) implement
even context-sensitive grammars. (However, the replacement phase is
deterministic, so perhaps not. I have not put in the effort of proving this
claim one way or the other.)

Other datafiles included in the distribution include:

- `CFGs.yml`: a testbed for new features of WordGen, and a collection of
unrelated fairly simple grammars. Additionally, it includes the grammar for
templated strings.
- `recursive.yml`: some stress tests for WordGen.
- `cmap.yml`: a grammar corresponding to a parser for color palettes in another
of my programs, called MapGen. Used to generate random color maps for testing
the parser.
- `numbers.yml`: an old datafile, which generates numbers, including phone
numbers, and their English names. Still uses the deprecated `replaceIPA`
syntax.
- `cpppre.yml`: The official grammar for the ISO C++ preprocessor, (poorly)
adapted to WordGen.
- `jokes.yml`: A file written as a joke after someone posted a CFG for funny
gender-describing sentences online.

## Grammar Syntax

The grammar to be expanded is specified in a YAML 1.2 document, rather than an
annotated EBNF (or some other standard form) so as to express what I call
"channels", which are an essential part of the functionality of the generator.
There are two "special" channels: `val` and `freq`; and one internal-use
channel: `path`. `val` is the channel that specifies the grammar (its type is a
format-string), while `freq` is the channel that assigns weights to each
alternative (its type is a floating-point value). `path` is generated
automatically by the program as it traverses the nodes and represents the
"parse tree" that corresponds to the generated word (its type is a type of tree
for which I do not have a name). [Note: Currently, `path` does not contain all
information necessary to fully reconstruct a generated word, because it does
not include template evaluation, and some template functions have random
results. WordGen/py currently only has one of those functions, `oneof`. -- End
note]

Another reason not to use a standard grammar form is that almost all work on
grammars is in the domain of parsing them, however, as a generative program,
WordGen has no interest whatever in parsing its output. Instead of trying to
express how to map between a structure and a sentence, WordGen's datafile
format exists solely to express how to generate a sentence that looks like what
the author wants.

Logically, a grammar (as I use the term in the context of WordGen) is composed
of:

- a sequence of named "switching nodes", each of which is composed of a
sequence of "alternatives", each of which is composed of a set of "channels",
three of which are given special semantics named `val`, `freq`, and `path` ,
along with an arbitrary number of channels containing text;
- (optionally) a node named `channels` which contains a mapping from
channel-names to strings, which controls the presentation;
- and (optionally) a node named `replace` which contains a mapping of
channel-names to sequences of "replacement stages", which are either
  1. a sequence of regular expressions with replacement-fields (represented by
  a mapping of `m` to a regex and `r` to a replacement string) or
  2. a finite state transducer, the nature of which are described in a
  following paragraph.

This document assumes that the reader is familiar with
regular expressions, and does not explain them other than to specify the
variant implemented, although external links to specifications are provided.

[WordGen/Cpp only, WIP] Either or both of `channels` and `replace` may be
nested within a node named `control`, with slightly different semantics for
`replace` in that case.

[Note: [WordGen/Py only] There is currently a work-in-progress feature to allow
literal nodes, which contain only a single "alternative" directly, and which do
not count toward the depth and expansion limits. This may be removed entirely
or replaced with a different syntax, however. Literals only work with "gen",
not with "list". Because WordGen/Cpp does not support literal nodes at the
current time, a diagnostic will be issued, and the node will be loaded and used
as if it were a regular node with a single branch. -- End note]

For backwards-compatibility, instead of a node named `replace`, either or both
of `replacement` and `replaceIPA` may be given instead, which will function in
exactly the same way as `replace.val` and `replace.ipa` respectively. These
two approaches may not be combined.

[Note: This is the only part of the code (except the "-p" command-line option,
equivalent to "-c ipa") that hardcodes a channel other than the three named
above, and for this reason it is deprecated. -- End note]

In the actual data file, that looks something like this (excerpted from
`ffb.yml`):

```yaml
replace:
  val:
    - - m: 'ss'
        r: 'z'
      - m: 'xx'
        r: 'j'

Consonant:
  - val: p
  - val: b
  - val: k
  - val: g
  - val: f
  - val: v
  - val: t
  - val: d
  - val: s
  - val: x
```

This example uses only literal-strings in `val`, for simplicity. However, `val`
is allowed to contain (and in most interesting cases will contain) references
to other switching-nodes. These borrow the syntax of Python format-strings, but
since the semantics are entirely different, I will specify them here anyway. As
used in this program, a format-string consists of a sequence of literal-strings
and references, where references are notated `"{as such}"`.

References are also
allowed to contain frequency-lists (aka 'flists'), which look like this:
`"{Syllable:3 1 0}"`, or index-lists (aka 'ilists'), which look like this:
`"{Syllable!1:3 2}"`.

Frequency-lists overwrite the `freq`s of each alternative of the
named switching-node in order. If there are more elements in the flist than
alternatives in the node, the excess elements are silently ignored. If there
are more more alternatives in the node than elements in the flist, then the
remainder are simply not overwritten.

Index-lists assign freq: 1 to the specified branches, unless provided with a
frequency overrider, specified with a colon, and freq: 0 to all other branches.
It is an error for an ilist to contain an index corresponding to a nonexistent
branch.

### Special Nodes

WordGen gives special meaning to certain top-level nodes in a datafile. These
are:

### Node-reference syntax

After template evaluation (which will be elaborated upon later in this
document), the text of the `val` channel is parsed into literal text segments,
and node references. Literal text is anything not enclosed in `{}`. Braces can
be escaped in literal text via either doubling (**`{{`**) or backslash-escaping
(**`\{`**). A node reference has any of the following forms:

- **`{`**`name`**`}`**

- **`{`**`name`**`:`**`flist`**`}`**

  - `flist`: a sequence of numbers separated by spaces. ex: **`3 1 0`**

- **`{`**`name`**`!`**`ilist`**`}`**

  - `ilist`: a sequence of integers, optionally paired with numbers using `:`.
 ex: **`0 1:2 5`**

- **`{`**`name`**`|`**`args`**`}`**

  - `args`: a sequence of strings to be passed as arguments to a node
 template, separated by pipes.

- **`{`**`name`**`|`**`args`**`:`**`flist`**`}`**

- **`{`**`name`**`|`**`args`**`!`**`ilist`**`}`**

`name` is, simply, the name of the node being referenced. `args` are the arguments
to pass to the node, if it is a template node. (These are described in the next section.)

`flist` and `ilist`
are collectively referred to as frequency-overrides, and work in different
ways. `flist`s are simpler, being simply a sequence of non-negative
floating-point numbers separated by spaces. Each number replaces the `freq` of
the corresponding alternative in the named node before an alternative is
selected. `ilist`s instead filter on the indices of the alternatives. The first
(integral) number in an `ilist` element is the alternative's index (starting
from 0, in the order they appear in the datafile), and the second number, which
is optional, is a value to overwrite the frequency of the indicated
alternative. When the second number is omitted, the alternative retains its
normal `freq` value. If an alternative's index never appears in an `ilist`,
then that alternative is assigned a `freq` of zero, in effect removing them
from consideration.

## Template Evaluation

Many strings in a datafile may be _templated_, which is a means of
programmatically producing a string based on provided inputs. A node may be
declared as taking arguments by suffixing it with pipes equal to the number of
arguments to be given, and if the node can accept more arguments than it
strictly requires, it may declare that with the `...` suffix. In short, a node
named 'node' which takes one argument is declared as `node|`. If it takes three
arguments, it is declared as `node|||`. If it takes two or more arguments, it
is declared as `node||...`. Nodes can overload on argument configurations, and
the overload resolution rule is:

  - If a node which takes exactly the right number of arguments is available,
  choose that node.
  - Otherwise, if there are nodes which require equal or fewer arguments but which accept
  optional arguments, select the node requiring the most arguments.
  - Otherwise, overload resolution fails.

If the three nodes given above are all defined, then the following list gives
the overload resolutions for various numbers of arguments: - 0: Failure. - 1:
`node|` - 2: `node||...` - 3: `node|||` - 4+: `node||...`

Arguments are provided to a template node in a `|`-separated list following the node's
name, as in `{Verb|I}` or `{Noun|0|0}`.

Template expressions in template-strings must be explicitly escaped into, using
angle brackets (**`<>`**). The allowed syntaxes for template expressions are:

- **`<`**_arg-name_**`>`**
- **`<`**_function-expression_**`>`**
- [WordGen/Cpp only] **`<raw(`**_raw-char-sequence_**`)>`**

An _arg-name_ is either an integer or one of a predefined set of "special
argument" names. In the integer case, it refers to a "numeric argument" of the
node, or one of the ones passed explicitly. Special arguments are passed
implicitly, and are available even in nodes which take no arguments. The
special arguments are listed below:

- `d`: Current expansion depth. [integer]
- `D`: The expansion depth limit specified on the command line. [integer]
- `e`: Expansion count. [integer]
- `E`: Expansion limit specified on the command line. [integer]
- `c`: The number of numeric arguments passed to the node. [integer]
- `C`: The number of arguments the node has explicitly declared it requires. [integer]
- `...`: All optional numeric arguments. [array]
- `a`: All required numeric arguments. [array]
- `A`: All numeric arguments. Concatenation of `...` and `a`. [array]
- `p`: Contains the character **`|`**. It may be used as an escape.
- `lt`: Contains the character **`<`**.
- `gt`: Contains the character **`>`**.

Strictly speaking, `...` is a third type of _arg-name_, and does not belong in
this list. However, semantically it is a member of the `A` family. The
`<`_arg-name_`>` syntax simply evaluates to the indicated argument.

[WordGen/Cpp only] Raw strings are documented below, they function identically
as arguments inside a function expression as they do at the top level.

Templating may be used in instances such as `{Det/s<0>}`, which
concatenates the argument to the node name, or `{Noun/q|<0>}`, which forwards
its argument to `Noun/q`, or even in the `freq` channel to dynamically set the
frequency, like in `freq: "<len(...)>"`, among others.

### Functions

*function-expression*s allow much greater complexity. [Compatibility note:
WordGen/Py currently does not support nested functions. WordGen/Cpp supports
arbitrary nesting of *function-expression*s, and has some other additional
features as well, which will be annotated when described. -- End note]

The syntax of a function expression is:

*function-expression*:
: *function-name***`(`** [ _argument-list_ ] **`)`**

*argument-list*:
: *argument* [ **`|`** _argument-list_ ]

*argument*:
: **`#`***arg-name*<br>
  *ellipsis*<br>
  *arg-text*<br>
  [WordGen/Cpp only] _function-expression_

*arg-text*:
: _string_
  [WordGen/Cpp only] **`raw(`***raw-char-sequence***`)`**

A function-expression applies a function to a sequence of arguments to yield a
result. There is currently no way to add additional functions beyond those
defined by WordGen. The available functions are:

- `.`: Concatenate all arguments together. Arrays will be converted to
strings by joining all elements of the array with pipes.
- `cat`: Concatenate all argument together. Arrays will be converted to
strings by joining all elements of the array directly. [WordGen/Cpp only]
- `.*`: For each pair of arguments A, C, repeat S (interpreted as string) C
times.
- `flatten`: [WordGen/Cpp only]
- `len`: Given an array argument, returns the length of said array.
Otherwise, returns 1.
- `+`: Return the sum of all arguments, as numbers.
- `*`: Return the product of all arguments.
- `-`: Perform a left fold of the arguments, as numbers, over subtraction. In
other words, given arguments A, B, C..., return (((A - B) - C) - ...)
- `/`: Perform a left fold over division. Normally right folds are more often
performed over division, but there is no right fold operation in WordGen.
- `^`: Perform a left fold over exponentiation.
- `if`: If the first argument is truthy [See below], return the second
argument. Otherwise, return the third argument.
- `ifn`: If the first argument is not truthy, return the second argument.
Otherwise, return the third argument.
- `not`: If the argument is truthy, return 1. Otherwise, return 0.
- `?`: Return the argument corresponding to the first argument. `0` refers to
the second argument, and so on.
- `which`: Find the first argument among the following arguments, and return
its position in the argument list ignoring the first argument.
- `num`: Cast the argument to number.
- `gt`: Given (A, B, T, F), if A > B, return T. Else, return F.
- `lt`: As above, but substitute less-than for greater-than.
- `eq`: As above, but substitute equal-to,
- `ne`: As above, but substitute not-equal-to.
- `=`: An abbreviation of `eq`.
- `!=`: An abbreviation of `ne`.
- `math`, `calc`, `pn`: Interpret the arguments as a series of tokens, and
evaluate them as a polish notation mathematical expression. Array arguments
are expanded. Supported operators are: `+`, `-`, `*`, `/`, `^`, `.`. All have
the meanings assigned above for functions, however they are restricted to
exactly two arguments each, as per the rules of polish notation. All array
arguments are flattened.
- `rpn`: Like `math`, however it interprets reverse polish notation instead.
- `replace`: Given (S, (M, R)...), where (M, R)... is a sequence of pairs,
for each (M, R) perform `S = regex_replace(S, M, R)`.
- `set`: [WordGen/Cpp only]
- `get`: [WordGen/Cpp only]
- `rand`: [WordGen/Cpp only] Given A: return a random real number uniformly
distributed in the range [0, A]. Given A, B: return a random real number
uniformly distributed in the range [A, B]. [WordGen/Cpp only]
- `randnorm`: [WordGen/Cpp only] Given A, S: return a random real number
normally distributed about the average A with standard deviation S.
[WordGen/Cpp only]
- `randint`: [WordGen/Cpp only] Integer version of `rand`. [WordGen/Cpp only]
- `randcat`: [WordGen/Cpp only] Given a sequence of weights for buckets,
return a number corresponding to the index of the selected bucket in a
categorical distribution. This is analogous to how the `freq` values are
used. [WordGen/Cpp only]
- `oneof`: Given a number of alternatives, selects and returns one at random.
- `invoke`: [WordGen/Cpp only] Given the name of a function and any number of
arguments, invokes the named function with those arguments.

The following functions have preliminary implementations in WordGen/Cpp, but are
not actually usable.

- `node`: [WordGen/Cpp only]
- `branches`: [WordGen/Cpp only]
- `freqs`: [WordGen/Cpp only]
- `makenode`: [WordGen/Cpp only]
- `[]`: [WordGen/Cpp only] Given an array argument A and an integer N, returns
the Nth element of A.
- `_literal`: [WordGen/Cpp only] Inline multi-channel literal text.
Synthesizes an implementation-defined token which will be expanded during the
node-expansion step. Takes as arguments a series of paies (C, D), and
interprets each C as a channel and each D as the contents thereof. Any
template-expression or noderef syntax inside of the values of C or D is fully
uninterpreted by later stages.

### Arguments

Note that within a _function-expression_, an _arg-name_ must be preceeded by **`#`**, unless
it is **`...`**. Anything else is interpreted as a string. Strings which look like
arguments may be escaped with the backslash character, such as **`\#0`** or **`\...`**.

There is another form of argument, called a raw string [WordGen/Cpp only]. A
raw string looks superficially like a function, but it is parsed differently.

- _raw-string_: `raw(`_raw-char-sequence_`)`
- _raw-char-sequence_: Any sequence of characters that does **not** include **`)`**.
`(`, `|`, `<`, `>`, and `\` are all allowed, with no special meaning. There
is no way at all to escape a `)` character in a raw string.

Arguments are dynamically typed and may be either strings or arrays. Strings
can be interpreted as numbers by some functions. In the C++ version, arguments
of functions may be further function calls.

Truthiness is defined as:

1. If the argument is an array, join it over the empty string.
2. If the argument is empty, it is falsy.
3. If the argument, interpreted as a double, is 0, it is falsy.
4. Otherwise, it is truthy.

Certain cases cause arguments to change type. For string-to-number conversions,
this works by interpreting the string as a decimal number. For array-to-string
conversions, the elements of the array are concatenated, sometimes with
separating `|` characters, depending on the context.

## Transformations

After node expansion, the resulting word is transformed. Transformations may be
specified in the grammar with any of the following techniques, taking the first
one of the list which is found and ignoring any following.

- In `control.replace`, with the following semantics: [WordGen/Cpp only]
  - A list of steps, to be applied in order. Each step is a mapping of
  channel-names to replacement sequences. Replacement sequences will be
  described in the next paragraph.

- In `replace`, with the following semantics:
  - A mapping of channel-names to replacement sequences.

- In either or both of `replacement` or `replaceIPA`, where `replacement`
behaves exactly as `replace.val`, and `replaceIPA` behaves exactly as
`replace.ipa`. This method is deprecated.

Each replacement sequence is a list of transformations to be applied in order.
The following types of transformations are understood by WordGen:

Regex replacers
: A sequence of regex replacements, which shall be globally applied to the
  contents of the channel under consideration.

Finite State Machines (FSMs)
: A programmatic way to transform a string one character at a time,
 represented as a set of named states which process characters and switch
 between themselves. For further theoretical information, look up
 "Mealy-type finite state transducer". For further practical information,
 see the section later in this document.

Normalization
: Applies a specified Unicode normalization to the string. Allowed
 normalization forms are: "nfc", "nfd", "nfkc", "nfkd", and "default", which
 is equivalent to "nfc". For more information about Unicode normalization,
 see <http://unicode.org/reports/tr15/#Norm_Forms>

Assignment rules
: Discard the current value of the channel under consideration, and replace
  it entirely with the result of formatting the generated word with the given
  string used as the formatting rule. See the section on the command-line
  interface for information on the syntax of format strings. This is an easy
  way to generate new channels which are like another channel, but modified.
: This is written as `assign: "`*format-str*`"`

### Regular Expressions

Regular expression transformation rules may be specified as a list of mappings
of `m` and `r`. `m` shall contain the regular expression to **m**atch against. `r`
shall contain a **r**eplacement string, for which one of two syntaxes are accepted,
"backslash" and "dollar".

Regular expression matching differs in detail between WordGen/Cpp and
WordGen/Py, however these differences are minor and for the most part any regex
will be interpreted the same way in either. For more information, see
<https://docs.python.org/3/library/re.html#regular-expression-syntax>
[WordGen/Py]; and <http://ecma-international.org/ecma-262/5.1/#sec-15.10.1> and
<http://www.akenotsuki.com/misc/srell/en/#differences> (see version 2)
[WordGen/Cpp].

The differences in replacement string handling are more significant. In
summary:

  - In the default "backslash" syntax (aka "Python", case-insensitive),
  backreferences are specified using a backslash followed by the group number.
  A literal backslash is written as `\\`. This is the only syntax supported by
  WordGen/Py. For further information, see the standard here:
  <https://docs.python.org/3/library/re.html#re.sub>. [Note: Despite being the
  default syntax mode for compatibility with WordGen/Py, support for
  "backslash" syntax in WordGen/Cpp is limited and using any syntax more
  advanced than what is explicitly allowed for above may fail. If this proves
  to be an issue, please contact me and I will attempt to implement a more
  complete functionality. -- End note]

  - In the "dollar" syntax (aka "ECMAScript", case-insensitive), [WordGen/Cpp
  only] backreferences are specified as the dollar sign followed by the group
  number, while a literal dollar sign is written as `$$`. For further
  information, see the standard here:
  <http://ecma-international.org/ecma-262/5.1/#sec-15.5.4.11>.

### Finite State Machines

Finally, I will describe the syntax through which finite-state transducers are
represented in the grammar. From now on, these will be referred to as "FSMs",
for brevity. Each FSM is represented by (optionally) a number named `reversed`,
and a sequence of states. The start state is named `S`, and must be present for an FSM to be
recognized. Each state contains a set of rules, where each rule is one of
the following:

1. A mapping from a single character to a string, with optional state transition.

   : Type: array[ t-string, state (optional) ]<br>
2. An array of `set` rules, which apply a replacement to any matching character (with optional transition).

   : Type: array[ array[ tr-set, t-string, state (optional) ] ]
3. An array of `map` rules, which replace a character in their first set with the corresponding character in their second set (with optional transition).

   : Type: array[ array[ tr-set, tr-set, state (optional) ] ]
4. An array of `match` rules, which echo their input and perform a transition if the character is in their first set (equivalent to a `map` rule with the first and second sets equal and a non-optional transition).

   : Type: array[ array[ tr-set, state ] ]
5. A `default` rule, which is a character rule that matches any character (with optional transition).

   : Type: array[ t-string, state (optional) ]
6. [Note: If no rule has matched and there was no `default` rule, the character is simply copied to the output as-is. -- End note]

7. A `return` rule, which specifies a transition if the implicit copy behavior was used.

   : Type: state
8. An `end` rule, which is a string to append to the output if the end of the string is reached in that state.

   : Type: string

Rules are applied in the order they are listed above, regardless of their
ordering in the datafile.

Every string labelled as *t-string* above is templated: `"{}"` will be replaced with the
matched character (see `set` and `default` below for the main use-cases of this feature.)

If present, `reversed` shall be interpreted as a bitfield, for
which the lowest-order bit (`&1`) reverses the input tape, and the
next-lowest-order bit (`&2`) reverses the output tape. [Note that `reversed` is
therefore not a valid state name, though this is not specifically checked.] If
the value of `reversed` is larger than `3`, any higher-order bits will be
ignored and [WordGen/Cpp only] a warning issued at load time.

A "character set", in the context of map, match, and set rules, is a
specification of a set of characters to match, in a format much like that
accepted by the `tr` UNIX program, however, a character is a Unicode code
point, not a byte. Any failure to implement the `tr` syntax otherwise than
pertaining to Unicode is a bug.

For an example, the following FSMs from `english.yml` will
replace any @ in its input with either "a" or "an" depending on whether the
following letter is a vowel or a consonant, and then uppercase the first letter
of the input:

```yaml
    - reversed: 3
      S: 
        match:
          - ["aeiou", V]
          - ["bcdfghjklmnpqrstvwxyz", C]
      V:
        '@': ["na", S]
        match:
          - ["bcdfghjklmnpqrstvwxyz", C]
      C:
        '@': ["a", S]
        match:
          - ["aeiou", V]
    - S:
        map:
          - ["a-z","A-Z", E]
        return: E
      # States are required to be mappings, not null, but they can be empty
      E: {}
```

`set` rules may be used to replace a large number of similar character rules,
such as in this example also from the `english.yml` datafile, which for any
digit writes that digit followed by `" ~"`, and then returns to the start state.
This would otherwise have required eight separate repetitive character rules.

```.yaml
    "~V01":
      set:
        - ["01234567", "{} ~", S]
```

[Note that the following FSM will reverse its input:
```yaml
replace:
  val:
    - S: {}
      reversed: 1
```
-- End note.]

## Format Strings

TBD.

## Algorithm Synopsis

The algorithm accepts as input: a datafile, containing a specification in YAML
of the grammar; the root node; the mode of operation (we will assume `gen`
because it is the most interesting as well as being the default); and a number
of options, of which the only two that aren't presentation are "-n", which
specifies how many words to generate, and "-d, --depth" and "-e, --expansions"
[WordGen/Cpp only], which are used to prevent the program from looping
infinitely and/or producing excessively large output for pathological cases. If
neither is specified, the defaults are:

WordGen/Py:
: d=16, e=256

WordGen/Cpp:
: d=24, e=576

If only `-d` is specified, `e` = `d`^2. If only `-e` is specified, [WordGen/Cpp
only] `d` = `e`. WordGen/Cpp has higher defaults because it is faster and uses
less memory in the first place. The default number of words to generate if `-n`
is not specified is one.

The actual word generation process proceeds as follows:

1. Expand the root node.

   a.  Expand the templates in all `freq` values in the node, and then use
       their values to form a categorical distribution and select a random number
       from that distribution. That number determines the index of the branch
       that will be taken. (Or, in other words: concatenate all of the `freq`
       values as ranges on the number line and select a random point within the
       total range, then select the alternative corresponding to the subrange
       including the selected point.)
   b.  Increment the expansion count (which begins at 0 and is global). If
       either the maximum expansion count or the maximum depth have been
       exceeded, log a diagnostic warning and then return the literal, unexpanded
       text of `val` and all other channels to the caller.
   c.  Expand the templates in the `val` of the selected node.
   d.  If `val` is empty, simply return the other channels literally.
       Otherwise, parse `val` into a list of pairs of literal-strings and
       references. For each element:
       i. If it is a reference:

          1. If it has a frequency-override, apply that override to a temporary
             copy of the node and use that copy as the referenced node. [Note: The
             actual algorithm does not create copies of nodes, but instead passes
             a frequency-override, which may be empty, to all node expansions,
             including the root. This logically-equivalent, but less performant,
             rule is used to simplify the explanation of the algorithm. -- End note]
          2. Recurse on the referenced node, passing an incremented copy of depth.
          3. For each channel in the returned word, append the literal-string
             in that channel to the same channel of the output.
       i. If it is a literal-string:
          1. For each channel: append the literal-string in that channel to
             the output.
   e. return the output to the caller.

2. If [WordGen/cpp only: `control.replace` or] `replace` is in the datafile:<br>
   For each channel in `replace`: [Note that channels `freq` and `path`
   are reserved for internal use, and cannot be transformed.]<br>

   a. For each stage in the replacement sequence:

      i. If the stage is an FSM:

         1. Start at state S.
         2. If `stage.reversed & 1`: reverse the input.
         3. For each character `c` in the input:
            a. If a rule for `c` is in the current state:

               1. Append the string in that rule to the output (with any
                  "{}" replaced by `c`).
               2. If that rule names a state: transition to that state.

            b. If a `map` or `match` rule matches:

               1. Perform the given mapping. (Note that a `match` rule is
                  exactly equivalent to a `map` rule with identical match and
                  replace strings, but that it must name a transition state.)
               2. If that rule names a state: transition to that state.

            c. If a `default` rule exists:

               1. Append the string in that rule to the output (with any
                  "{}" replaced by `c`).
               2. If that rule names a state: transition to that state.

            d. Otherwise:

               1. Append `c` to the output.
               2. if `return` is specified: transition to that state.

         4. If an `end` string is specified in the current state after
            the last character is processed: append that string to the
            output.

         5. If `stage.reversed & 2`: Reverse the output.

      i. If the stage is a list of regexes:

         1. For each regex, globally substitute that regex in the input
            with the provided string.

      i. If the stage is an assignment:

         1. Replace the current value of the input with the result of the
            assignment expression, as described above.

3. Otherwise: [Deprecated]

   a. If `replacement` is in the datafile:<br>
      Do 2.a for channel `val` w.r.t. node `replacement`.
   b. If `replaceIPA` is in the datafile:<br>
      Do 2.a for channel `ipa` w.r.t. node `replaceIPA`.

4. Format the word for printing.