Wordgen Specification v0.2

# WordGen
Word generator based on weighted-random expansion of a context-free grammar followed by text rewriting. The grammar specification language is idiosyncratic, but powerful and expressive.

This specification covers both WordGen/Py and WordGen/Cpp. They are equivalently-featured unless otherwise noted.

## Purpose

This program is intended primarily for linguistic use, as exemplified by `syllables.yml`, which is the current complete specification of the phonotactics (and some other things) of my conlang Firen, with some simple grammar and some of the lexicon for sentences. `sajemtan.yml`, and `ffb.yml` are similar. `english.yml` is a WIP English sentence generator.

However, this program is much more general than that. In fact, it is capable of expanding any context-free grammar and then applying arbitrary transformations to it (with any sequence of regular expressions and Mealy-type finite-state transducers), which probably means that it can (in a roundabout way) implement even context-sensitive grammars. (However, the replacement phase is deterministic, so perhaps not. I have not put in the effort of proving this claim one way or the other.)

Other datafiles included in the distribution include:
 - CFGs.yml: a testbed for new features of WordGen, and a collection of unrelated fairly simple grammars. Additionally, it includes the grammar for templated strings.
 - recursive.yml: some stress tests for WordGen.
 - cmap.yml: a grammar corresponding to a parser for color palettes in another of my programs, called MapGen. Used to generate random color maps for testing the parser.
 - numbers.yml: an old datafile, which generates numbers, including phone numbers, and their English names. Still uses the deprecated `replaceIPA` syntax.

## Grammar Syntax

The grammar to be expanded is specified in a YAML 1.2 document, rather than an annotated EBNF (or some other standard form) so as to express what I call "channels", which are an essential part of the functionality of the generator. There are two "special" channels: `val` and `freq`; and one internal-use channel: `path`. `val` is the channel that specifies the grammar (its type is a format-string), while `freq` is the channel that assigns weights to each alternative (its type is a decimal floating-point value). `path` is generated automatically by the program as it traverses the nodes and represents the "parse tree" that corresponds to the generated word (its type is a peculiar type of tree for which I do not have a name).

Another reason not to use a standard grammar form is that almost all work on grammars is in the domain of parsing them, however, as a generative program, WordGen has no interest whatever in parsing its output. Instead of trying to express how to map between a structure and a sentence, WordGen's datafile format exists solely to express how to generate a sentence that looks like what the author wants.

Logically, a grammar (as I use the term in the context of WordGen) is composed of: a sequence of named "switching nodes", each of which is composed of a sequence of "alternatives", each of which is composed of a set of "channels", three of which are given special semantics named `val`, `freq`, and `path` , along with an arbitrary number of channels containing text; (optionally) a node named `channels` which contains a mapping from channel-names to strings, which controls the presentation; and (optionally) a node named `replace` which contains a mapping of channel-names to sequences of "replacement stages", which are either a) a sequence of regular expressions with replacement-fields (represented by a mapping of `m` to a regex and `r` to a replacement string) or b) a finite state transducer, the nature of which are described in a following paragraph. This document assumes that the reader is familiar with regular expressions, and does not explain them other than to specify the variant implemented.

[Note: [WordGen/Py only] There is currently a work-in-progress feature to allow literal nodes, which contain only a single "alternative" directly, and which do not count toward the depth and expansion limits. This may be removed entirely or replaced with a different syntax, however. Literals only work with "gen", not with "list". Because WordGen/Cpp does not support literal nodes at the current time, a diagnostic will be issued, and the node will be compiled as if it were a regular node with a single branch. -- End note]

[For backwards-compatibility, instead of a node named `replace`, either or both of `replacement` and `replaceIPA` may be given instead, which will function in exactly the same way as `replace[val]` and `replace[ipa]` respectively. These two approaches may not be combined. This is the only part of the code (except the "-p" command-line option, equivalent to "-c ipa") that hardcodes a channel other than the three named above, and for this reason it is deprecated.]

In the actual data file, that looks something like this (excerpted from `ffb.yml`):
```yaml
replace:
  val:
    - - m: 'ss'
        r: 'z'
      - m: 'xx'
        r: 'j'

Consonant:
  - val: p
  - val: b
  - val: k
  - val: g
  - val: f
  - val: v
  - val: t
  - val: d
  - val: s
  - val: x
```

This example uses only literal-strings in `val`, for simplicity. However, `val` is allowed to contain (and in most interesting cases will contain) references to other switching-nodes. These borrow the syntax of Python format-strings, but since the semantics are entirely different, I will specify them here anyway. As used in this program, a format-string consists of a sequence of literal-strings and references, where references are notated `"{as such}"`. References are also allowed to contain frequency-lists (aka 'flists'), which look like this: `"{Syllable:3 1 0}"`. These overwrite the `freq`s of each alternative of the named switching-node in order. If there are more elements in the flist than alternatives in the node, the excess elements are silently ignored. If there are more more alternatives in the node than elements in the flist, then the remainder are simply not overwritten.

### Special Nodes

WordGen gives special meaning to certain top-level nodes in a datafile. These are:

### Node-reference syntax

After template evaluation (which will be elaborated upon later in this document), the text of the `val` channel is parsed into literal text segments, and node references. Literal text is anything not enclosed in `{}`. Braces can be escaped in literal text via either doubling (`{{`) or backslash-escaping (`\{`). A node reference has any of the following forms:
  - `{name}`
  - `{name:flist}`
    - `flist`: a sequence of numbers separated by spaces. ex:
      `3 1 0`
  - `{name!ilist}`
    - `ilist`: a sequence of integers, optionally paired with numbers using :. ex:
      `0 1:2 5`
  - `{name|args}`
    - `args`: a sequence of strings to be passed as arguments to a node template, separated by pipes.
  - `{name|args:flist}`
  - `{name|args!ilist}`

`name` is, simply, the name of the node being referenced. `flist` and `ilist` are collectively referred to as frequency-overrides, and work in different ways. `flist`s are simpler, being simply a sequence of non-negative floating-point numbers separated by spaces. Each number replaces the `freq` of the corresponding alternative in the named node before an alternative is selected. `ilist`s instead filter on the indices of the alternatives. The first (integral) number in an `ilist` element is the alternative's index (starting from 0, in the order they appear in the datafile), and the second number, which is optional, is a value to overwrite the frequency of the indicated alternative. When the second number is omitted, the alternative retains its normal `freq` value. If an alternative's index never appears in an `ilist`, then that alternative is assigned a `freq` of zero, in effect removing them from consideration.

## Template Evaluation

Many strings in a datafile may be _templated_, which is a means of programmatically producing a string based on provided inputs. A node may be declared as taking arguments by suffixing it with pipes equal to the number of arguments to be given, and if the node can accept more arguments than it strictly requires, it may declare that with the `...` suffix. In short, a node named 'node' which takes one argument is declared as `node|`. If it takes three arguments, it is declared as `node|||`. If it takes two or more arguments, it is declared as `node||...`. Nodes can overload on argument configurations, and the overload resolution rule is:
  - If a node which takes exactly the right number of arguments is available, choose that node.
  - Otherwise, if there are nodes which require fewer arguments but which accept optional arguments, select the node requiring the most arguments.
  - Otherwise, overload resolution fails.
If the three nodes given above are all defined, then the following list gives the overload resolutions for various numbers of arguments:
  - 0: Failure.
  - 1: `node|`
  - 2: `node||...`
  - 3: `node|||`
  - 4+: `node||...`

Template expressions in template-strings must be explicitly escaped into, using angle brackets (`<>`). There are two allowed syntaxes for template expressions.
  - `<`_arg-name_`>`
  - `<`_function-expression_`>`
An _arg-name_ is either an integer, or one of a predefined set of "special argument" names. In the integer case, it refers to a "numeric argument" of the node. The special arguments are listed below:
  - `d`: Current expansion depth. [integer]
  - `D`: The expansion depth limit specified on the command line.
  - `e`: Expansion count. [integer]
  - `E`: Expansion limit specified on the command line. [integer]
  - `c`: The number of numeric arguments passed to the node. [integer]
  - `C`: The number of arguments the node has explicitly declared it requires. [integer]
  - `...`: All optional numeric arguments. [array]
  - `a`: All required numeric arguments. [array]
  - `A`: All numeric arguments. Concatenation of `...` and `a`. [array]
  - `p`: Contains the character `|`. It may be used as an escape.
  - `lt`: Contains the character `<`.
  - `gt`: Contains the character `>`.
Strictly speaking, `...` is a third type of _arg-name_, and does not belong in this list. However, semantically it is a member of the `A` family. The `<`_arg-name_`>` syntax simply evaluates to the indicated argument.

### Functions

_function-expression_s allow much greater complexity. [compatibility node: WordGen/Py currently does not support nested functions. WordGen/Cpp supports arbitrary nesting of _function-expression_s, and has some other additional features as well, which will be annotated when described.] A function-expression applies a function to a sequence of arguments to yield a result. There is currently no way to add additional functions beyond those defined by WordGen. The available functions are:
  - `.`: Concatenate all arguments together. Arrays will be converted to strings by joining all elements of the array with pipes.
  - `cat`: Concatenate all argument together. Arrays will be converted to strings by joining all elements of the array directly. [WordGen/Cpp only]
  - `.*`: For each pair of arguments A, C, repeat S (interpreted as string) C times.
  - `flatten`: [WordGen/Cpp only]
  - `len`: Given an array argument, returns the length of said array. Otherwise, returns 1.
  - `+`: Return the sum of all arguments, as numbers.
  - `*`: Return the product of all arguments.
  - `-`: Perform a left fold of the arguments, as numbers, over subtraction. In other words, given arguments A, B, C..., return (((A - B) - C) - ...)
  - `/`: Perform a left fold over division. Normally right folds are more often performed over division, but there is no right fold operation in WordGen.
  - `^`: Perform a left fold over exponentiation.
  - `if`: If the first argument is truthy [See below], return the second argument. Otherwise, return the third argument.
  - `ifn`: If the first argument is not truthy, return the second argument. Otherwise, return the third argument.
  - `not`: If the argument is truthy, return 1. Otherwise, return 0.
  - `?`: Return the argument corresponding to the first argument. `0` refers to the second argument, and so on.
  - `which`: Find the first argument among the following arguments, and return its position in the argument list ignoring the first argument.
  - `num`: Cast the argument to number.
  - `gt`: Given (A, B, T, F), if A > B, return T. Else, return F.
  - `lt`: As above, but substitute less-than for greater-than.
  - `eq`: As above, but substitute equal-to,
  - `ne`: As above, but substitute not-equal-to.
  - `=`: An abbreviation of `eq`.
  - `!=`: An abbreviation of `ne`.
  - `math`, `calc`, `pn`: Interpret the arguments as a series of tokens, and evaluate them as a polish notation mathematical expression. Array arguments are expanded. Supported operators are: `+`, `-`, `*`, `/`, `^`, `.`. All have the meanings assigned above for functions, however they are restricted to exactly two arguments each, as per the rules of polish notation.
  - `rpn`: Like `math`, however it interprets reverse polish notation instead.
  - `replace`: Given (S, (M, R)...), where (M, R)... is a sequence of pairs, for each (M, R) perform `S = regex_replace(S, M, R)`.
  - `rand`: Given A: return a random real number uniformly distributed in the range [0, A]. Given A, B: return a random real number uniformly distributed in the range [A, B]. [WordGen/Cpp only]
  - `randnorm`: Given A, S: return a random real number normally distributed about the average A with standard deviation S. [WordGen/Cpp only]
  - `randint`: Integer version of `rand`. [WordGen/Cpp only]
  - `randcat`: Given a sequence of weights for buckets, return a number corresponding to the index of the selected bucket in a categorical distribution. This is analogous to how the `freq` values are used. [WordGen/Cpp only]
  - `_literal`: Inline multi-channel literal text. Synthesizes an implementation-defined token which will be expanded during the node-expansion step. Takes as arguments a series of paies (C, D), and interprets each C as a channel and each D as the contents thereof. Any template-expression or noderef syntax inside of the values of C or D is fully uninterpreted by later stages. [WordGen/Cpp only. Experimental.]
Within a _function-expression_, an _arg-name_ must be preceeded by `#`, unless it is `...`. Anything else is interpreted as a string. Strings which look like arguments may be escaped with the backslash character, such as `\#0` or `\...`. There is another form of argument, called a raw string [WordGen/Cpp only]. A raw string looks superficially like a function, but it is parsed differently.
  - _raw-string_: `raw(`_raw-char-sequence_`)` 
  - _raw-char-sequence_: Any sequence of characters that does not include `)`. `(`, `|`, `<`, `>`, and `\\` are all allowed, with no special meaning. There is no way at all to escape a `)` character in a raw string.
Arguments are dynamically typed and may be either strings or arrays. Strings can be interpreted as numbers by some functions. In the C++ version, arguments of functions may be further function calls. 

Truthiness is defined as:
  - If the argument is an array, join it over the empty string.
  - If the argument is empty, it is falsy.
  - If the argument, interpreted as a double, is 0, it is falsy.
  - Otherwise, it is truthy.

## Transformations

After node expansion, the resulting word is transformed. Transformations may be specified in the grammar with any of the following techniques, taking the first one of the list which is found and ignoring any following.
  - In `control.replace`, with the following semantics: [WordGen/Cpp only]
    - A list of steps, to be applied in order. Each step is a mapping of channel-names to replacement sequences. Replacement sequences will be described in the next paragraph.
  - In `replace`, with the following semantics:
    - A mapping of channel-names to replacement sequences.
  - In either or both of `replacement` or `replaceIPA`, where `replacement` behaves exactly as `replace.val`, and `replaceIPA` behaves exactly as `replace.ipa`. This method is deprecated.

Each replacement sequence is a list of transformations to be applied in order. The following types of transformations are understood by WordGen:
  - Regex replacers
    - A sequence of regex replacements, which shall be globally applied to the contents of the channel under consideration.
  - Finite State Machines (FSMs)
    - A programmatic way to transform a string one character at a time, represented as a set of named states which process characters and switch between themselves. For further theoretical information, look up "Mealy-type finite state transducer". For further practical information, see the section later in this document.
  - Assignment rules
    - Discard the current value of the channel under consideration, and replace it entirely with the result of formatting the generated word with the given string used as the formatting rule. See the section on the command-line interface for information on the syntax of format strings.
    - This is written as `assign: "`*format-str*`"`
    - This is an easy way to generate new channels which are like another channel, but modified.

### Regular Expressions

Regular expressions transformation rules may be specified as a list of mappings of `m` and `r`. `m` shall contain the regular expression to match against. `r` shall contain a replacement string, for which one of two syntaxes are accepted, "backslash" and "dollar".

Regular expression matching differs in detail between WordGen/Cpp and WordGen/Py, however these differences are minor and for the most part any regex will be interpreted the same way in either. For more information, see https://docs.python.org/3/library/re.html#regular-expression-syntax [WordGen/Py]; and http://ecma-international.org/ecma-262/5.1/#sec-15.10.1 and http://www.akenotsuki.com/misc/srell/en/#differences (see version 2) [WordGen/Cpp].

The differences in replacement string handling are more significant. In summary:
  - In the "backslash" syntax (aka "Python", case-insensitive), backreferences are specified using a backslash followed by the group number. A literal backslash is written as `\\`. This is the only syntax supported by WordGen/Py. For further information, see the standard here: https://docs.python.org/3/library/re.html#re.sub.
    - N.B.: Despite being the default syntax mode for compatibility with WordGen/Py, support for "backslash" syntax in WordGen/Cpp is limited and using any syntax more advanced than what is explicitly allowed for above may fail. If this proves to be an issue, please contact me and I will attempt to implement a more complete functionality.
  - In the "dollar" syntax (aka "ECMAScript", case-insensitive), [WordGen/Cpp only] backreferences are specified as the dollar sign followed by the group number, while a literal dollar sign is written as `$$`. For further information, see the standard here: http://ecma-international.org/ecma-262/5.1/#sec-15.5.4.11.

### FSMs

Finally, I will describe the syntax through which finite-state transducers are represented in the grammar. From now on, these will be referred to as "FSMs", for brevity. Each FSM is represented by (optionally) a number named `reversed`, and a sequence of states containing sets of rules, where each rule is one of the following:
  - A mapping from a single character to a string, with optional state transition.
    - Type: array[t-string, opt<state>]
    - The replacement strings are templated: `"{}"` will be replaced with the matched character (see `default` below for the main use-case of this feature.)
  - An array of `map` rules, which replace a character in their first set with the corresponding character in their second set (with optional transition).
    - Type: array[array[tr-set, tr-set, opt<state>]]
  - An array of `match` rules, which perform a transition if the character is in their first set (equivalent to a `map` rule with the first and second sets equal and a non-optional transition).
    - Type: array[array[tr-set, state]]
  - A `default` rule, which is a character rule that matches any character (this is why the templates exist for character rules).
    - Type: array[t-string, opt<state>]
  - A `return` rule, which specifies a transition if none of the above types of rules appled.
    - Type: state
  - An `end` rule, which is a string to append to the output if the end of the string is reached in that state.
    - Type: string

Rules are applied in the order they are listed above, regardless of their ordering in the datafile. If a character does not match any rule, it will be written to the output tape unchanged and the state will not be changed. (Equivalently, this could be stated as "the default `default` rule is `["{}"]`".). The start state is named `S`, and must be present for an FSM to be recognized. If present, `reversed` shall be interpreted as a bitfield, for which the lowest-order bit (`&1`) reverses the input tape, and the next-lowest-order bit (`&2`) reverses the output tape. [Note that `reversed` is therefore not a valid state name, though this is not specifically checked.]

A "character set", in the context of map and match rules, is a specification of a set of characters to match, in a format much like that accepted by the `tr` UNIX program, however, a character is a Unicode character, not a byte. Any failure to implement the `tr` syntax otherwise than pertaining to Unicode is a bug.

For an example of all of the above, the following FSMs from `english.yml` will replace any @ in its input with either "a" or "an" depending on whether the following letter is a vowel or a consonant, and then uppercase the first letter of the input:
```yaml
    - reversed: 3
      S: 
        match:
          - ["aeiou", V]
          - ["bcdfghjklmnpqrstvwxyz", C]
      V:
        '@': ["na", S]
        match:
          - ["bcdfghjklmnpqrstvwxyz", C]
      C:
        '@': ["a", S]
        match:
          - ["aeiou", V]
    - S:
        map:
          - ["a-z","A-Z", E]
        return: E
      # States are required to be mappings, not null, but they can be empty
      E: {}
```

[Note that the following FSM will reverse its input:
```yaml
replace:
  val:
    - S: {}
      reversed: 1
```
-- End note.]

## Algorithm Synopsis

The algorithm accepts as input: a datafile, containing a specification in YAML of the grammar; the root node; the mode of operation (we will assume `gen` because it is the most interesting as well as being the default); and a number of options, of which the only two that aren't presentation are "-n", which specifies how many expansions to do, and "-d, --depth" and "-e, --expansions" [WordGen/Cpp only], which are used to prevent the program from looping infinitely and/or producing excessively large output for pathological cases. If neither is specified, the defaults are:
    - WordGen/Py: d=16, e=256
    - WordGen/Cpp: d=24. e=576
If only `-d` is specified, `e = d^2`. If only `-e` is specified, [WordGen/Cpp only] `d = e`.

Abstractly:

 1. Increment the expansion count (which begins at 0 and is global). If we have exceeded the maximum expansion count, or the maximum depth, then abort with an error, returning the literal, unexpanded text of `val` and all other channels to the caller.
 2. First, all frequencies of the current node's alternatives are summed, then a random (decimal floating point) number between 0 and that maximum is generated. Each alternative's frequency is subtracted from that generated number in turn until it is <= 0, at which point the alternative has been selected. (Or, in other words: concatenate all of the `freq` values as ranges on the number line and select a random point within the total range, then select the alternative corresponding to the subrange including the selected point.)
 3. If `val` is empty, simply return the other channels literally
 4. Otherwise, parse `val` into a list of pairs of literal-strings and references. For each element:
    1. If it has a reference:
       1. If it has an flist, extract the flist, then create a copy of the referenced node and overwrite its `freq` values with the extracted flist [Note: This part of the algorithm is currently being updated to support other behavior than simply overwriting the `freq` values, but as of now that is unimplemented]
       2. Recurse on the referenced node, passing an incremented copy of depth
       3. For each channel:
          1. Append the literal-string in that channel in the current alternative and the returned value from the recursive call to the output [Note that the format-string parser parses a string into pairs of literals and references, which does slightly convolute this section]
    2. Otherwise, it is only a literal-string:
       1. For each channel: append the literal-string in that channel to the output
 5. return the output
 6. [After the root node has been fully expanded:]
 7. if `replace` is in the datafile:
	i. For each channel in `replace`:
       a. For each stage in the replacement sequence:
          1. if the stage is an FSM:
             i. start at state S
             ii. If `stage["reversed"]&1`: reverse the input
             iii. For each character `c` in the input:
                a. If a rule for `c` is in the current state:
                   1. Append the string in that rule to the output (with any "{}" replaced by `c`)
                   2. If that rule names a state: transition to that state
                b. If a `map` or `match` rule matches:
                   1. Perform the given mapping (note that a `match` rule is exactly equivalent to a `map` rule with identical match and replace strings, but that it must name a transition state)
                   2. If that rule names a state: transition to that state
                c. If a `default` rule exists:
                   1. Append the string in that rule to the output (with any "{}" replaced by `c`)
                   2. If that rule names a state: transition to that state
                d. Otherwise:
                   1. Append `c` to the output
                   2. if `return` is specified: Transition to that state
             iv. If an `end` string is specified in the current state after the last character is processed: append that string to the output
             v. If `stage["reversed"]&2`: Reverse the output
          2. If the stage is a list of regexes:
             i. For each regex, globally substitute that regex in the input with the provided string
		  3. If the stage is an assignment:
			 i. Replace the current value of the input with the result of the assignment expression, as described above.
 8. Otherwise: [Deprecated]
    1. If `replacement` is in the datafile:
       1. Do 7.i.a for channel `val` w.r.t. node `replacement`
    2. If `replaceIPA` is in the datafile:
       1. Do 7.i.a for channel `ipa` w.r.t. node `replaceIPA`