WordGen Specification v0.1.01
Word generator based on weighted-random expansion of a context-free grammar followed by text rewriting. The grammar specification language is idiosyncratic, but powerful and expressive.
This specification covers both WordGen/Py and WordGen/Cpp. They are equivalently-featured unless otherwise noted. Most differences, even those noted here, are considered bugs and will be fixed as time allows. Neither version can currently be held as authoritative, however.
This program is intended primarily for linguistic use, as exemplified by syllables.yml
, which is the current complete specification of the phonotactics (and some other things) of my conlang Firen, with some simple grammar and some of the lexicon for sentences. sajemtan.yml
, and ffb.yml
are similar. english.yml
is a WIP English sentence generator.
However, this program is much more general than that. In fact, it is capable of expanding any context-free grammar and then applying arbitrary transformations to it (with any sequence of regular expressions and Mealy-type finite-state transducers), which probably means that it can (in a roundabout way) implement even context-sensitive grammars. (However, the replacement phase is deterministic, so perhaps not. I have not put in the effort of proving this claim one way or the other.)
Other datafiles included in the distribution include:
CFGs.yml
: a testbed for new features of WordGen, and a collection of unrelated fairly simple grammars. Additionally, it includes the grammar for templated strings.recursive.yml
: some stress tests for WordGen.cmap.yml
: a grammar corresponding to a parser for color palettes in another of my programs, called MapGen. Used to generate random color maps for testing the parser.numbers.yml
: an old datafile, which generates numbers, including phone numbers, and their English names. Still uses the deprecated replaceIPA
syntax.cpppre.yml
: The official grammar for the ISO C++ preprocessor, (poorly) adapted to WordGen.jokes.yml
: A file written as a joke after someone posted a CFG for funny gender-describing sentences online.The grammar to be expanded is specified in a YAML 1.2 document, rather than an annotated EBNF (or some other standard form) so as to express what I call “channels”, which are an essential part of the functionality of the generator. There are two “special” channels: val
and freq
; and one internal-use channel: path
. val
is the channel that specifies the grammar (its type is a format-string), while freq
is the channel that assigns weights to each alternative (its type is a floating-point value). path
is generated automatically by the program as it traverses the nodes and represents the “parse tree” that corresponds to the generated word (its type is a type of tree for which I do not have a name). [Note: Currently, path
does not contain all information necessary to fully reconstruct a generated word, because it does not include template evaluation, and some template functions have random results. WordGen/py currently only has one of those functions, oneof
. – End note]
Another reason not to use a standard grammar form is that almost all work on grammars is in the domain of parsing them, however, as a generative program, WordGen has no interest whatever in parsing its output. Instead of trying to express how to map between a structure and a sentence, WordGen’s datafile format exists solely to express how to generate a sentence that looks like what the author wants.
Logically, a grammar (as I use the term in the context of WordGen) is composed of:
val
, freq
, and path
, along with an arbitrary number of channels containing text;channels
which contains a mapping from channel-names to strings, which controls the presentation;replace
which contains a mapping of channel-names to sequences of “replacement stages”, which are either
m
to a regex and r
to a replacement string) orThis document assumes that the reader is familiar with regular expressions, and does not explain them other than to specify the variant implemented, although external links to specifications are provided.
[WordGen/Cpp only, WIP] Either or both of channels
and replace
may be nested within a node named control
, with slightly different semantics for replace
in that case.
[Note: [WordGen/Py only] There is currently a work-in-progress feature to allow literal nodes, which contain only a single “alternative” directly, and which do not count toward the depth and expansion limits. This may be removed entirely or replaced with a different syntax, however. Literals only work with “gen”, not with “list”. Because WordGen/Cpp does not support literal nodes at the current time, a diagnostic will be issued, and the node will be loaded and used as if it were a regular node with a single branch. – End note]
For backwards-compatibility, instead of a node named replace
, either or both of replacement
and replaceIPA
may be given instead, which will function in exactly the same way as replace.val
and replace.ipa
respectively. These two approaches may not be combined.
[Note: This is the only part of the code (except the “-p” command-line option, equivalent to “-c ipa”) that hardcodes a channel other than the three named above, and for this reason it is deprecated. – End note]
In the actual data file, that looks something like this (excerpted from ffb.yml
):
replace:
val:
- - m: 'ss'
r: 'z'
- m: 'xx'
r: 'j'
Consonant:
- val: p
- val: b
- val: k
- val: g
- val: f
- val: v
- val: t
- val: d
- val: s
- val: x
This example uses only literal-strings in val
, for simplicity. However, val
is allowed to contain (and in most interesting cases will contain) references to other switching-nodes. These borrow the syntax of Python format-strings, but since the semantics are entirely different, I will specify them here anyway. As used in this program, a format-string consists of a sequence of literal-strings and references, where references are notated "{as such}"
.
References are also allowed to contain frequency-lists (aka ‘flists’), which look like this: "{Syllable:3 1 0}"
, or index-lists (aka ‘ilists’), which look like this: "{Syllable!1:3 2}"
.
Frequency-lists overwrite the freq
s of each alternative of the named switching-node in order. If there are more elements in the flist than alternatives in the node, the excess elements are silently ignored. If there are more more alternatives in the node than elements in the flist, then the remainder are simply not overwritten.
Index-lists assign freq: 1 to the specified branches, unless provided with a frequency overrider, specified with a colon, and freq: 0 to all other branches. It is an error for an ilist to contain an index corresponding to a nonexistent branch.
WordGen gives special meaning to certain top-level nodes in a datafile. These are:
After template evaluation (which will be elaborated upon later in this document), the text of the val
channel is parsed into literal text segments, and node references. Literal text is anything not enclosed in {}
. Braces can be escaped in literal text via either doubling ({{
) or backslash-escaping (\{
). A node reference has any of the following forms:
{
name
}
{
name
:
flist
}
flist
: a sequence of numbers separated by spaces. ex: 3 1 0
{
name
!
ilist
}
ilist
: a sequence of integers, optionally paired with numbers using :
. ex: 0 1:2 5
{
name
|
args
}
args
: a sequence of strings to be passed as arguments to a node template, separated by pipes.{
name
|
args
:
flist
}
{
name
|
args
!
ilist
}
name
is, simply, the name of the node being referenced. args
are the arguments to pass to the node, if it is a template node. (These are described in the next section.)
flist
and ilist
are collectively referred to as frequency-overrides, and work in different ways. flist
s are simpler, being simply a sequence of non-negative floating-point numbers separated by spaces. Each number replaces the freq
of the corresponding alternative in the named node before an alternative is selected. ilist
s instead filter on the indices of the alternatives. The first (integral) number in an ilist
element is the alternative’s index (starting from 0, in the order they appear in the datafile), and the second number, which is optional, is a value to overwrite the frequency of the indicated alternative. When the second number is omitted, the alternative retains its normal freq
value. If an alternative’s index never appears in an ilist
, then that alternative is assigned a freq
of zero, in effect removing them from consideration.
Many strings in a datafile may be templated, which is a means of programmatically producing a string based on provided inputs. A node may be declared as taking arguments by suffixing it with pipes equal to the number of arguments to be given, and if the node can accept more arguments than it strictly requires, it may declare that with the ...
suffix. In short, a node named ‘node’ which takes one argument is declared as node|
. If it takes three arguments, it is declared as node|||
. If it takes two or more arguments, it is declared as node||...
. Nodes can overload on argument configurations, and the overload resolution rule is:
If the three nodes given above are all defined, then the following list gives the overload resolutions for various numbers of arguments: - 0: Failure. - 1: node|
- 2: node||...
- 3: node|||
- 4+: node||...
Template expressions in template-strings must be explicitly escaped into, using angle brackets (<>
). The allowed syntaxes for template expressions are:
<
_arg-name_>
<
_function-expression_>
<raw(
_raw-char-sequence_)>
An arg-name is either an integer or one of a predefined set of “special argument” names. In the integer case, it refers to a “numeric argument” of the node, or one of the ones passed explicitly. Special arguments are passed implicitly, and are available even in nodes which take no arguments. The special arguments are listed below:
d
: Current expansion depth. [integer]D
: The expansion depth limit specified on the command line. [integer]e
: Expansion count. [integer]E
: Expansion limit specified on the command line. [integer]c
: The number of numeric arguments passed to the node. [integer]C
: The number of arguments the node has explicitly declared it requires. [integer]...
: All optional numeric arguments. [array]a
: All required numeric arguments. [array]A
: All numeric arguments. Concatenation of ...
and a
. [array]p
: Contains the character |
. It may be used as an escape.lt
: Contains the character <
.gt
: Contains the character >
.Strictly speaking, ...
is a third type of arg-name, and does not belong in this list. However, semantically it is a member of the A
family. The <
arg-name>
syntax simply evaluates to the indicated argument.
[WordGen/Cpp only] Raw strings are documented below, they function identically as arguments inside a function expression as they do at the top level.
function-expressions allow much greater complexity. [Compatibility note: WordGen/Py currently does not support nested functions. WordGen/Cpp supports arbitrary nesting of function-expressions, and has some other additional features as well, which will be annotated when described. – End note]
The syntax of a function expression is:
(
[ argument-list ] )
|
argument-list ]
#
arg-nameraw(
raw-char-sequence)
A function-expression applies a function to a sequence of arguments to yield a result. There is currently no way to add additional functions beyond those defined by WordGen. The available functions are:
.
: Concatenate all arguments together. Arrays will be converted to strings by joining all elements of the array with pipes.cat
: Concatenate all argument together. Arrays will be converted to strings by joining all elements of the array directly. [WordGen/Cpp only].*
: For each pair of arguments A, C, repeat S (interpreted as string) C times.flatten
: [WordGen/Cpp only]len
: Given an array argument, returns the length of said array. Otherwise, returns 1.+
: Return the sum of all arguments, as numbers.*
: Return the product of all arguments.-
: Perform a left fold of the arguments, as numbers, over subtraction. In other words, given arguments A, B, C…, return (((A - B) - C) - …)/
: Perform a left fold over division. Normally right folds are more often performed over division, but there is no right fold operation in WordGen.^
: Perform a left fold over exponentiation.if
: If the first argument is truthy [See below], return the second argument. Otherwise, return the third argument.ifn
: If the first argument is not truthy, return the second argument. Otherwise, return the third argument.not
: If the argument is truthy, return 1. Otherwise, return 0.?
: Return the argument corresponding to the first argument. 0
refers to the second argument, and so on.which
: Find the first argument among the following arguments, and return its position in the argument list ignoring the first argument.num
: Cast the argument to number.gt
: Given (A, B, T, F), if A > B, return T. Else, return F.lt
: As above, but substitute less-than for greater-than.eq
: As above, but substitute equal-to,ne
: As above, but substitute not-equal-to.=
: An abbreviation of eq
.!=
: An abbreviation of ne
.math
, calc
, pn
: Interpret the arguments as a series of tokens, and evaluate them as a polish notation mathematical expression. Array arguments are expanded. Supported operators are: +
, -
, *
, /
, ^
, .
. All have the meanings assigned above for functions, however they are restricted to exactly two arguments each, as per the rules of polish notation. All array arguments are flattened.rpn
: Like math
, however it interprets reverse polish notation instead.replace
: Given (S, (M, R)…), where (M, R)… is a sequence of pairs, for each (M, R) perform S = regex_replace(S, M, R)
.set
: [WordGen/Cpp only]get
: [WordGen/Cpp only]rand
: [WordGen/Cpp only] Given A: return a random real number uniformly distributed in the range [0, A]. Given A, B: return a random real number uniformly distributed in the range [A, B]. [WordGen/Cpp only]randnorm
: [WordGen/Cpp only] Given A, S: return a random real number normally distributed about the average A with standard deviation S. [WordGen/Cpp only]randint
: [WordGen/Cpp only] Integer version of rand
. [WordGen/Cpp only]randcat
: [WordGen/Cpp only] Given a sequence of weights for buckets, return a number corresponding to the index of the selected bucket in a categorical distribution. This is analogous to how the freq
values are used. [WordGen/Cpp only]oneof
: Given a number of alternatives, selects and returns one at random.invoke
: [WordGen/Cpp only] Given the name of a function and any number of arguments, invokes the named function with those arguments.The following functions have preliminary implementations in WordGen/Cpp, but are not actually usable.
node
: [WordGen/Cpp only]branches
: [WordGen/Cpp only]freqs
: [WordGen/Cpp only]makenode
: [WordGen/Cpp only][]
: [WordGen/Cpp only] Given an array argument A and an integer N, returns the Nth element of A._literal
: [WordGen/Cpp only] Inline multi-channel literal text. Synthesizes an implementation-defined token which will be expanded during the node-expansion step. Takes as arguments a series of paies (C, D), and interprets each C as a channel and each D as the contents thereof. Any template-expression or noderef syntax inside of the values of C or D is fully uninterpreted by later stages.Note that within a function-expression, an arg-name must be preceeded by #
, unless it is ...
. Anything else is interpreted as a string. Strings which look like arguments may be escaped with the backslash character, such as \#0
or \...
.
There is another form of argument, called a raw string [WordGen/Cpp only]. A raw string looks superficially like a function, but it is parsed differently.
raw(
raw-char-sequence)
)
. (
, |
, <
, >
, and \
are all allowed, with no special meaning. There is no way at all to escape a )
character in a raw string.Arguments are dynamically typed and may be either strings or arrays. Strings can be interpreted as numbers by some functions. In the C++ version, arguments of functions may be further function calls.
Truthiness is defined as:
Certain cases cause arguments to change type. For string-to-number conversions, this works by interpreting the string as a decimal number. For array-to-string conversions, the elements of the array are concatenated, sometimes with separating |
characters, depending on the context.
After node expansion, the resulting word is transformed. Transformations may be specified in the grammar with any of the following techniques, taking the first one of the list which is found and ignoring any following.
control.replace
, with the following semantics: [WordGen/Cpp only]
replace
, with the following semantics:
replacement
or replaceIPA
, where replacement
behaves exactly as replace.val
, and replaceIPA
behaves exactly as replace.ipa
. This method is deprecated.Each replacement sequence is a list of transformations to be applied in order. The following types of transformations are understood by WordGen:
assign: "
format-str"
Regular expression transformation rules may be specified as a list of mappings of m
and r
. m
shall contain the regular expression to match against. r
shall contain a replacement string, for which one of two syntaxes are accepted, “backslash” and “dollar”.
Regular expression matching differs in detail between WordGen/Cpp and WordGen/Py, however these differences are minor and for the most part any regex will be interpreted the same way in either. For more information, see https://docs.python.org/3/library/re.html#regular-expression-syntax [WordGen/Py]; and http://ecma-international.org/ecma-262/5.1/#sec-15.10.1 and http://www.akenotsuki.com/misc/srell/en/#differences (see version 2) [WordGen/Cpp].
The differences in replacement string handling are more significant. In summary:
In the default “backslash” syntax (aka “Python”, case-insensitive), backreferences are specified using a backslash followed by the group number. A literal backslash is written as \\
. This is the only syntax supported by WordGen/Py. For further information, see the standard here: https://docs.python.org/3/library/re.html#re.sub. [Note: Despite being the default syntax mode for compatibility with WordGen/Py, support for “backslash” syntax in WordGen/Cpp is limited and using any syntax more advanced than what is explicitly allowed for above may fail. If this proves to be an issue, please contact me and I will attempt to implement a more complete functionality. – End note]
In the “dollar” syntax (aka “ECMAScript”, case-insensitive), [WordGen/Cpp only] backreferences are specified as the dollar sign followed by the group number, while a literal dollar sign is written as $$
. For further information, see the standard here: http://ecma-international.org/ecma-262/5.1/#sec-15.5.4.11.
Finally, I will describe the syntax through which finite-state transducers are represented in the grammar. From now on, these will be referred to as “FSMs”, for brevity. Each FSM is represented by (optionally) a number named reversed
, and a sequence of states. The start state is named S
, and must be present for an FSM to be recognized. Each state contains a set of rules, where each rule is one of the following:
Type: array[ t-string, state (optional) ]
set
rules, which apply a replacement to any matching character (with optional transition).Type: array[ array[ tr-set, t-string, state (optional) ] ]
map
rules, which replace a character in their first set with the corresponding character in their second set (with optional transition).Type: array[ array[ tr-set, tr-set, state (optional) ] ]
match
rules, which echo their input and perform a transition if the character is in their first set (equivalent to a map
rule with the first and second sets equal and a non-optional transition).Type: array[ array[ tr-set, state ] ]
default
rule, which is a character rule that matches any character (with optional transition).Type: array[ t-string, state (optional) ]
[Note: If no rule has matched and there was no default
rule, the character is simply copied to the output as-is. – End note]
return
rule, which specifies a transition if the implicit copy behavior was used.Type: state
end
rule, which is a string to append to the output if the end of the string is reached in that state.Type: string
Rules are applied in the order they are listed above, regardless of their ordering in the datafile.
Every string labelled as t-string above is templated: "{}"
will be replaced with the matched character (see set
and default
below for the main use-cases of this feature.)
If present, reversed
shall be interpreted as a bitfield, for which the lowest-order bit (&1
) reverses the input tape, and the next-lowest-order bit (&2
) reverses the output tape. [Note that reversed
is therefore not a valid state name, though this is not specifically checked.] If the value of reversed
is larger than 3
, any higher-order bits will be ignored and [WordGen/Cpp only] a warning issued at load time.
A “character set”, in the context of map, match, and set rules, is a specification of a set of characters to match, in a format much like that accepted by the tr
UNIX program, however, a character is a Unicode code point, not a byte. Any failure to implement the tr
syntax otherwise than pertaining to Unicode is a bug.
For an example, the following FSMs from english.yml
will replace any @ in its input with either “a” or “an” depending on whether the following letter is a vowel or a consonant, and then uppercase the first letter of the input:
- reversed: 3
S:
match:
- ["aeiou", V]
- ["bcdfghjklmnpqrstvwxyz", C]
V:
'@': ["na", S]
match:
- ["bcdfghjklmnpqrstvwxyz", C]
C:
'@': ["a", S]
match:
- ["aeiou", V]
- S:
map:
- ["a-z","A-Z", E]
return: E
# States are required to be mappings, not null, but they can be empty
E: {}
set
rules may be used to replace a large number of similar character rules, such as in this example also from the english.yml
datafile, which for any digit writes that digit followed by " ~"
, and then returns to the start state. This would otherwise have required eight separate repetitive character rules.
[Note that the following FSM will reverse its input:
– End note.]
TBD.
The algorithm accepts as input: a datafile, containing a specification in YAML of the grammar; the root node; the mode of operation (we will assume gen
because it is the most interesting as well as being the default); and a number of options, of which the only two that aren’t presentation are “-n”, which specifies how many words to generate, and “-d, –depth” and “-e, –expansions” [WordGen/Cpp only], which are used to prevent the program from looping infinitely and/or producing excessively large output for pathological cases. If neither is specified, the defaults are:
If only -d
is specified, e
= d
^2. If only -e
is specified, [WordGen/Cpp only] d
= e
. WordGen/Cpp has higher defaults because it is faster and uses less memory in the first place. The default number of words to generate if -n
is not specified is one.
The actual word generation process proceeds as follows:
Expand the root node.
freq
values in the node, and then use their values to form a categorical distribution and select a random number from that distribution. That number determines the index of the branch that will be taken. (Or, in other words: concatenate all of the freq
values as ranges on the number line and select a random point within the total range, then select the alternative corresponding to the subrange including the selected point.)val
and all other channels to the caller.val
of the selected node.val
is empty, simply return the other channels literally. Otherwise, parse val
into a list of pairs of literal-strings and references. For each element:
If it is a reference:
If [WordGen/cpp only: control.replace
or] replace
is in the datafile:
For each channel in replace
: [Note that channels freq
and path
are reserved for internal use, and cannot be transformed.]
For each stage in the replacement sequence:
If the stage is an FSM:
stage.reversed & 1
: reverse the input.c
in the input:
If a rule for c
is in the current state:
c
).If a map
or match
rule matches:
match
rule is exactly equivalent to a map
rule with identical match and replace strings, but that it must name a transition state.)If a default
rule exists:
c
).Otherwise:
c
to the output.return
is specified: transition to that state.If an end
string is specified in the current state after the last character is processed: append that string to the output.
If stage.reversed & 2
: Reverse the output.
If the stage is a list of regexes:
If the stage is an assignment:
Otherwise: [Deprecated]
replacement
is in the datafile:val
w.r.t. node replacement
.replaceIPA
is in the datafile:ipa
w.r.t. node replaceIPA
.Format the word for printing.