WordGen Specification v0.1.01

WordGen

Word generator based on weighted-random expansion of a context-free grammar followed by text rewriting. The grammar specification language is idiosyncratic, but powerful and expressive.

This specification covers both WordGen/Py and WordGen/Cpp. They are equivalently-featured unless otherwise noted. Most differences, even those noted here, are considered bugs and will be fixed as time allows. Neither version can currently be held as authoritative, however.

Purpose

This program is intended primarily for linguistic use, as exemplified by syllables.yml, which is the current complete specification of the phonotactics (and some other things) of my conlang Firen, with some simple grammar and some of the lexicon for sentences. sajemtan.yml, and ffb.yml are similar. english.yml is a WIP English sentence generator.

However, this program is much more general than that. In fact, it is capable of expanding any context-free grammar and then applying arbitrary transformations to it (with any sequence of regular expressions and Mealy-type finite-state transducers), which probably means that it can (in a roundabout way) implement even context-sensitive grammars. (However, the replacement phase is deterministic, so perhaps not. I have not put in the effort of proving this claim one way or the other.)

Other datafiles included in the distribution include:

Grammar Syntax

The grammar to be expanded is specified in a YAML 1.2 document, rather than an annotated EBNF (or some other standard form) so as to express what I call “channels”, which are an essential part of the functionality of the generator. There are two “special” channels: val and freq; and one internal-use channel: path. val is the channel that specifies the grammar (its type is a format-string), while freq is the channel that assigns weights to each alternative (its type is a floating-point value). path is generated automatically by the program as it traverses the nodes and represents the “parse tree” that corresponds to the generated word (its type is a type of tree for which I do not have a name). [Note: Currently, path does not contain all information necessary to fully reconstruct a generated word, because it does not include template evaluation, and some template functions have random results. WordGen/py currently only has one of those functions, oneof. – End note]

Another reason not to use a standard grammar form is that almost all work on grammars is in the domain of parsing them, however, as a generative program, WordGen has no interest whatever in parsing its output. Instead of trying to express how to map between a structure and a sentence, WordGen’s datafile format exists solely to express how to generate a sentence that looks like what the author wants.

Logically, a grammar (as I use the term in the context of WordGen) is composed of:

This document assumes that the reader is familiar with regular expressions, and does not explain them other than to specify the variant implemented, although external links to specifications are provided.

[WordGen/Cpp only, WIP] Either or both of channels and replace may be nested within a node named control, with slightly different semantics for replace in that case.

[Note: [WordGen/Py only] There is currently a work-in-progress feature to allow literal nodes, which contain only a single “alternative” directly, and which do not count toward the depth and expansion limits. This may be removed entirely or replaced with a different syntax, however. Literals only work with “gen”, not with “list”. Because WordGen/Cpp does not support literal nodes at the current time, a diagnostic will be issued, and the node will be loaded and used as if it were a regular node with a single branch. – End note]

For backwards-compatibility, instead of a node named replace, either or both of replacement and replaceIPA may be given instead, which will function in exactly the same way as replace.val and replace.ipa respectively. These two approaches may not be combined.

[Note: This is the only part of the code (except the “-p” command-line option, equivalent to “-c ipa”) that hardcodes a channel other than the three named above, and for this reason it is deprecated. – End note]

In the actual data file, that looks something like this (excerpted from ffb.yml):

replace:
  val:
    - - m: 'ss'
        r: 'z'
      - m: 'xx'
        r: 'j'

Consonant:
  - val: p
  - val: b
  - val: k
  - val: g
  - val: f
  - val: v
  - val: t
  - val: d
  - val: s
  - val: x

This example uses only literal-strings in val, for simplicity. However, val is allowed to contain (and in most interesting cases will contain) references to other switching-nodes. These borrow the syntax of Python format-strings, but since the semantics are entirely different, I will specify them here anyway. As used in this program, a format-string consists of a sequence of literal-strings and references, where references are notated "{as such}".

References are also allowed to contain frequency-lists (aka ‘flists’), which look like this: "{Syllable:3 1 0}", or index-lists (aka ‘ilists’), which look like this: "{Syllable!1:3 2}".

Frequency-lists overwrite the freqs of each alternative of the named switching-node in order. If there are more elements in the flist than alternatives in the node, the excess elements are silently ignored. If there are more more alternatives in the node than elements in the flist, then the remainder are simply not overwritten.

Index-lists assign freq: 1 to the specified branches, unless provided with a frequency overrider, specified with a colon, and freq: 0 to all other branches. It is an error for an ilist to contain an index corresponding to a nonexistent branch.

Special Nodes

WordGen gives special meaning to certain top-level nodes in a datafile. These are:

Node-reference syntax

After template evaluation (which will be elaborated upon later in this document), the text of the val channel is parsed into literal text segments, and node references. Literal text is anything not enclosed in {}. Braces can be escaped in literal text via either doubling ({{) or backslash-escaping (\{). A node reference has any of the following forms:

name is, simply, the name of the node being referenced. args are the arguments to pass to the node, if it is a template node. (These are described in the next section.)

flist and ilist are collectively referred to as frequency-overrides, and work in different ways. flists are simpler, being simply a sequence of non-negative floating-point numbers separated by spaces. Each number replaces the freq of the corresponding alternative in the named node before an alternative is selected. ilists instead filter on the indices of the alternatives. The first (integral) number in an ilist element is the alternative’s index (starting from 0, in the order they appear in the datafile), and the second number, which is optional, is a value to overwrite the frequency of the indicated alternative. When the second number is omitted, the alternative retains its normal freq value. If an alternative’s index never appears in an ilist, then that alternative is assigned a freq of zero, in effect removing them from consideration.

Template Evaluation

Many strings in a datafile may be templated, which is a means of programmatically producing a string based on provided inputs. A node may be declared as taking arguments by suffixing it with pipes equal to the number of arguments to be given, and if the node can accept more arguments than it strictly requires, it may declare that with the ... suffix. In short, a node named ‘node’ which takes one argument is declared as node|. If it takes three arguments, it is declared as node|||. If it takes two or more arguments, it is declared as node||.... Nodes can overload on argument configurations, and the overload resolution rule is:

If the three nodes given above are all defined, then the following list gives the overload resolutions for various numbers of arguments: - 0: Failure. - 1: node| - 2: node||... - 3: node||| - 4+: node||...

Template expressions in template-strings must be explicitly escaped into, using angle brackets (<>). The allowed syntaxes for template expressions are:

An arg-name is either an integer or one of a predefined set of “special argument” names. In the integer case, it refers to a “numeric argument” of the node, or one of the ones passed explicitly. Special arguments are passed implicitly, and are available even in nodes which take no arguments. The special arguments are listed below:

Strictly speaking, ... is a third type of arg-name, and does not belong in this list. However, semantically it is a member of the A family. The <arg-name> syntax simply evaluates to the indicated argument.

[WordGen/Cpp only] Raw strings are documented below, they function identically as arguments inside a function expression as they do at the top level.

Functions

function-expressions allow much greater complexity. [Compatibility note: WordGen/Py currently does not support nested functions. WordGen/Cpp supports arbitrary nesting of function-expressions, and has some other additional features as well, which will be annotated when described. – End note]

The syntax of a function expression is:

function-expression:
function-name( [ argument-list ] )
argument-list:
argument [ | argument-list ]
argument:
#arg-name
ellipsis
arg-text
[WordGen/Cpp only] function-expression
arg-text:
string [WordGen/Cpp only] raw(raw-char-sequence)

A function-expression applies a function to a sequence of arguments to yield a result. There is currently no way to add additional functions beyond those defined by WordGen. The available functions are:

The following functions have preliminary implementations in WordGen/Cpp, but are not actually usable.

Arguments

Note that within a function-expression, an arg-name must be preceeded by #, unless it is .... Anything else is interpreted as a string. Strings which look like arguments may be escaped with the backslash character, such as \#0 or \....

There is another form of argument, called a raw string [WordGen/Cpp only]. A raw string looks superficially like a function, but it is parsed differently.

Arguments are dynamically typed and may be either strings or arrays. Strings can be interpreted as numbers by some functions. In the C++ version, arguments of functions may be further function calls.

Truthiness is defined as:

  1. If the argument is an array, join it over the empty string.
  2. If the argument is empty, it is falsy.
  3. If the argument, interpreted as a double, is 0, it is falsy.
  4. Otherwise, it is truthy.

Certain cases cause arguments to change type. For string-to-number conversions, this works by interpreting the string as a decimal number. For array-to-string conversions, the elements of the array are concatenated, sometimes with separating | characters, depending on the context.

Transformations

After node expansion, the resulting word is transformed. Transformations may be specified in the grammar with any of the following techniques, taking the first one of the list which is found and ignoring any following.

Each replacement sequence is a list of transformations to be applied in order. The following types of transformations are understood by WordGen:

Regex replacers
A sequence of regex replacements, which shall be globally applied to the contents of the channel under consideration.
Finite State Machines (FSMs)
A programmatic way to transform a string one character at a time, represented as a set of named states which process characters and switch between themselves. For further theoretical information, look up “Mealy-type finite state transducer”. For further practical information, see the section later in this document.
Normalization
Applies a specified Unicode normalization to the string. Allowed normalization forms are: “nfc”, “nfd”, “nfkc”, “nfkd”, and “default”, which is equivalent to “nfc”. For more information about Unicode normalization, see http://unicode.org/reports/tr15/#Norm_Forms
Assignment rules
Discard the current value of the channel under consideration, and replace it entirely with the result of formatting the generated word with the given string used as the formatting rule. See the section on the command-line interface for information on the syntax of format strings. This is an easy way to generate new channels which are like another channel, but modified.
This is written as assign: "format-str"

Regular Expressions

Regular expression transformation rules may be specified as a list of mappings of m and r. m shall contain the regular expression to match against. r shall contain a replacement string, for which one of two syntaxes are accepted, “backslash” and “dollar”.

Regular expression matching differs in detail between WordGen/Cpp and WordGen/Py, however these differences are minor and for the most part any regex will be interpreted the same way in either. For more information, see https://docs.python.org/3/library/re.html#regular-expression-syntax [WordGen/Py]; and http://ecma-international.org/ecma-262/5.1/#sec-15.10.1 and http://www.akenotsuki.com/misc/srell/en/#differences (see version 2) [WordGen/Cpp].

The differences in replacement string handling are more significant. In summary:

Finite State Machines

Finally, I will describe the syntax through which finite-state transducers are represented in the grammar. From now on, these will be referred to as “FSMs”, for brevity. Each FSM is represented by (optionally) a number named reversed, and a sequence of states. The start state is named S, and must be present for an FSM to be recognized. Each state contains a set of rules, where each rule is one of the following:

  1. A mapping from a single character to a string, with optional state transition.

    Type: array[ t-string, state (optional) ]

  2. An array of set rules, which apply a replacement to any matching character (with optional transition).

    Type: array[ array[ tr-set, t-string, state (optional) ] ]

  3. An array of map rules, which replace a character in their first set with the corresponding character in their second set (with optional transition).

    Type: array[ array[ tr-set, tr-set, state (optional) ] ]

  4. An array of match rules, which echo their input and perform a transition if the character is in their first set (equivalent to a map rule with the first and second sets equal and a non-optional transition).

    Type: array[ array[ tr-set, state ] ]

  5. A default rule, which is a character rule that matches any character (with optional transition).

    Type: array[ t-string, state (optional) ]

  6. [Note: If no rule has matched and there was no default rule, the character is simply copied to the output as-is. – End note]

  7. A return rule, which specifies a transition if the implicit copy behavior was used.

    Type: state

  8. An end rule, which is a string to append to the output if the end of the string is reached in that state.

    Type: string

Rules are applied in the order they are listed above, regardless of their ordering in the datafile.

Every string labelled as t-string above is templated: "{}" will be replaced with the matched character (see set and default below for the main use-cases of this feature.)

If present, reversed shall be interpreted as a bitfield, for which the lowest-order bit (&1) reverses the input tape, and the next-lowest-order bit (&2) reverses the output tape. [Note that reversed is therefore not a valid state name, though this is not specifically checked.] If the value of reversed is larger than 3, any higher-order bits will be ignored and [WordGen/Cpp only] a warning issued at load time.

A “character set”, in the context of map, match, and set rules, is a specification of a set of characters to match, in a format much like that accepted by the tr UNIX program, however, a character is a Unicode code point, not a byte. Any failure to implement the tr syntax otherwise than pertaining to Unicode is a bug.

For an example, the following FSMs from english.yml will replace any @ in its input with either “a” or “an” depending on whether the following letter is a vowel or a consonant, and then uppercase the first letter of the input:

    - reversed: 3
      S: 
        match:
          - ["aeiou", V]
          - ["bcdfghjklmnpqrstvwxyz", C]
      V:
        '@': ["na", S]
        match:
          - ["bcdfghjklmnpqrstvwxyz", C]
      C:
        '@': ["a", S]
        match:
          - ["aeiou", V]
    - S:
        map:
          - ["a-z","A-Z", E]
        return: E
      # States are required to be mappings, not null, but they can be empty
      E: {}

set rules may be used to replace a large number of similar character rules, such as in this example also from the english.yml datafile, which for any digit writes that digit followed by " ~", and then returns to the start state. This would otherwise have required eight separate repetitive character rules.

    "~V01":
      set:
        - ["01234567", "{} ~", S]

[Note that the following FSM will reverse its input:

replace:
  val:
    - S: {}
      reversed: 1

– End note.]

Format Strings

TBD.

Algorithm Synopsis

The algorithm accepts as input: a datafile, containing a specification in YAML of the grammar; the root node; the mode of operation (we will assume gen because it is the most interesting as well as being the default); and a number of options, of which the only two that aren’t presentation are “-n”, which specifies how many words to generate, and “-d, –depth” and “-e, –expansions” [WordGen/Cpp only], which are used to prevent the program from looping infinitely and/or producing excessively large output for pathological cases. If neither is specified, the defaults are:

WordGen/Py:
d=16, e=256
WordGen/Cpp:
d=24, e=576

If only -d is specified, e = d^2. If only -e is specified, [WordGen/Cpp only] d = e. WordGen/Cpp has higher defaults because it is faster and uses less memory in the first place. The default number of words to generate if -n is not specified is one.

The actual word generation process proceeds as follows:

  1. Expand the root node.

    1. Expand the templates in all freq values in the node, and then use their values to form a categorical distribution and select a random number from that distribution. That number determines the index of the branch that will be taken. (Or, in other words: concatenate all of the freq values as ranges on the number line and select a random point within the total range, then select the alternative corresponding to the subrange including the selected point.)
    2. Increment the expansion count (which begins at 0 and is global). If either the maximum expansion count or the maximum depth have been exceeded, log a diagnostic warning and then return the literal, unexpanded text of val and all other channels to the caller.
    3. Expand the templates in the val of the selected node.
    4. If val is empty, simply return the other channels literally. Otherwise, parse val into a list of pairs of literal-strings and references. For each element:
      1. If it is a reference:

        1. If it has a frequency-override, apply that override to a temporary copy of the node and use that copy as the referenced node. [Note: The actual algorithm does not create copies of nodes, but instead passes a frequency-override, which may be empty, to all node expansions, including the root. This logically-equivalent, but less performant, rule is used to simplify the explanation of the algorithm. – End note]
        2. Recurse on the referenced node, passing an incremented copy of depth.
        3. For each channel in the returned word, append the literal-string in that channel to the same channel of the output.
      2. If it is a literal-string:
        1. For each channel: append the literal-string in that channel to the output.
    5. return the output to the caller.
  2. If [WordGen/cpp only: control.replace or] replace is in the datafile:
    For each channel in replace: [Note that channels freq and path are reserved for internal use, and cannot be transformed.]

    1. For each stage in the replacement sequence:

      1. If the stage is an FSM:

        1. Start at state S.
        2. If stage.reversed & 1: reverse the input.
        3. For each character c in the input:
          1. If a rule for c is in the current state:

            1. Append the string in that rule to the output (with any “{}” replaced by c).
            2. If that rule names a state: transition to that state.
          2. If a map or match rule matches:

            1. Perform the given mapping. (Note that a match rule is exactly equivalent to a map rule with identical match and replace strings, but that it must name a transition state.)
            2. If that rule names a state: transition to that state.
          3. If a default rule exists:

            1. Append the string in that rule to the output (with any “{}” replaced by c).
            2. If that rule names a state: transition to that state.
          4. Otherwise:

            1. Append c to the output.
            2. if return is specified: transition to that state.
        4. If an end string is specified in the current state after the last character is processed: append that string to the output.

        5. If stage.reversed & 2: Reverse the output.

      2. If the stage is a list of regexes:

        1. For each regex, globally substitute that regex in the input with the provided string.
      3. If the stage is an assignment:

        1. Replace the current value of the input with the result of the assignment expression, as described above.
  3. Otherwise: [Deprecated]

    1. If replacement is in the datafile:
      Do 2.a for channel val w.r.t. node replacement.
    2. If replaceIPA is in the datafile:
      Do 2.a for channel ipa w.r.t. node replaceIPA.
  4. Format the word for printing.