SUMMARY

Wordgen is, mathematically speaking, a sentence generator for arbitrary annotated context-free grammars, with an additional transformation step which may, or may not (I haven't yet proven it), allow a subset of context-sensitive grammars to be expressed.

It has been primarily designed for linguistic use, and the fullest use of its features is in syllables.yml, the datafile for Firen. sajemtan.yml and ffb.yml are similar but much less complex. english.yml takes a distinctly different approach, and makes use of different features than the others, so it is worth a look even though it is not complete. CFGs.yml is the main 'playground' datafile, while recursive.yml contains a number of stress tests. cmap.yml is a simple fuzzer for a different project. numbers.yml is a very simple file meant to generate a variety of interesting numbers along with reasonably natural English renderings, but it hasn't been touched in years.

DATAFILE SYNTAX

wordgen datafiles use YAML 1.2, rather than the more standard format for grammars, EBNF, because wordgen makes extensive use of annotations which would be annoying to express in EBNF. However, conversions between EBNF and the datafile format are planned for a future version.

A datafile is structured as a number of optional special structures, and several "nodes", which are equivalent to nonterminals in a more traditional CFG paradigm. Each node is a sequence of alternatives, and each alternative is a collection of "channels", of which three are given special meaning:

"val" is the structure channel, which contains the actual CFG. In addition to the templatization which applies to other channels, val-strings have a particular format, which is detailed later. If not present, it defaults to the empty string.

"freq" is the weighting channel, which controls how wordgen selects from the alternatives. Its data can be either a floating-point number, or a templatized string which evaluates to a number. If not present, it defaults to 1.

"path" is a reserved internal channel, used for representing the parse tree of a generated sentence. It is an error to mention it in a datafile.

Additionally, the "ipa" channel has an abbreviated print flag "-p", as well as having a backwards-compatibility alternate replacement description syntax, along with "val". However, "ipa" is not treated specially in any other way by the program.

All other channel names are available to the user. Their type is a templatized string with no particular restrictions.

VAL-STRING SYNTAX

Val-strings use an interpolation syntax based on Python format strings, wherein a node name is enclosed in {}. To illustrate, a simple node is excerpted from CFGs.yml below.

```
binPalindrome:
  - val: ""
    freq: .1
  - val: "0"
    freq: .15
  - val: "1"
    freq: .15
  - val: "0{binPalindrome}0"
  - val: "1{binPalindrome}1"
```

This node produces binary palindromes, that is, sequences of 0 and 1 that read the same forwards or backwards. The first three alternatives do not recurse, and simply produce themselves. The latter two, however, contain "{binPalindrome}", which is a node reference ("noderef") and is replaced by an expansion of the named node; in this case, the reference is recursive.

The special characters { and } can be escaped as either "\{ \}" or as "{{ }}". In the case of an odd number of { or }, they are scanned from the left, every pair being collapsed, and the last one is interpreted. If different behavior is needed, use the unambiguous \ form.

Noderefs are not limited to this simple case, as the following example shows.

```
Dyck:
  - val: "{Dyck:.8 1 1.2}{Dyck:.8 1 1.2}"
  - val: "[{Dyck:1 .8 1.1}]"
  - val: ""
    freq: .1
```

The noderef "{Dyck:.8 1 1.2}" contains an annotation called an "flist", short for "frequency list", which overrides the frequencies of the alternatives in the referenced node. An flist is simply a list of floating-point numbers, separated by spaces.
If the flist contains fewer values than there are alternatives, the remaining alternatives simply keep their old frequencies. If there are excess elements, they are ignored.

Another frequency control mechanism is the "ilist", which is used to select only certain alternatives from a node, and optionally override their normal frequencies. A simple example is "{Cons/Start!0:.5 3}", which refers to either the first or the fourth alternative of the node "Cons/Start" (indices count from 0), using .5 as the frequency of the first alternative, and the regular frequency of the fourth.

The frequency control mechanisms are intended to reduce duplication of alternatives between related nodes differing only in frequency, or in nodes having different subsets of the full list.

The full syntax of a noderef is (in datafile format):

```
NodeRef:
  - val: "\{{text}{args}{NRSuf}\}"
args:
  - val: ""
  - val: "|{text}{args}"
NRSuf:
  - val: ""
  - val: ":{flist}"
  - val: "!{ilist}"
flist:
  - val: "{float}"
  - val: "{float} {flist}"
ilist:
  - val: "{number}"
  - val: "{number}:{float}"
  - val: "{number} {ilist}"
  - val: "{number}:{float} {ilist}"
```

Nodes "text", "number", and "float" are not included, for brevity.

TEMPLATE SYNTAX

Almost every string in a datafile can include templatized expressions, or "argrefs", which are dependent on arguments passed to the node. These are introduced with < and terminated with >, and there are two main forms: short and function-style.

The short form consists of < followed by an argument number or name followed by >, and it is simply replaced by the specified argument. Numeric arguments are user-defined, and passed by the caller. Named arguments are defined by wordgen implicitly, and the full list is presented below:

List args:
  a    All declared numeric arguments (not varargs)
  ...  All varargs.
  A    All numeric arguments (including varargs)

Scalar args:
  d    The current expansion depth
  D    The maximum expansion depth (see -d option)
  e    The current expansion count
  E    The maximum expansion count (see -e option)
  c    The number of numeric arguments passed to the node
  C    The number of declared numeric arguments for this node (this is a constant expression)
  p    The '|' character (may be used for escaping)
  lt   The '<' character
  gt   The '>' character
  b    The '\' character

The other form of argref is function-style. A function is a name followed by '(' followed by an arbitrary number of arguments separated by '|' followed by ')'. A function argument may be one of: an argument name or number, which must be prefixed by # (except ..., which is not prefixed with #); a function expression; or, in the remaining case, a string.

Currently, functions cannot be user-defined, and only the builtin set is supported. This set is detailed below.

  +        Flatten the arguments and return their sum, interpreted as floating-point numbers. The empty sum is 0.
  *        Flatten the arguments and return their product, interpreted as floating-point numbers. The empty product is 1.
  -        Interpret all arguments as floating-point numbers and return a chained difference. The empty difference is 0.
  /        Interpret all arguments as floating-point numbers and return a chained division. Note that this is a left fold, rather than the mathematically typical right fold for division. The empty division is 1.
  ^        Interpret all arguments as floating-point numbers and return a chained exponentiation. The empty power is 1.
  len      Flatten the arguments, and then return the number of arguments passed.
  flatten  Produce a single list which consists of all of the arguments passed to flatten, such that all arguments are interpreted as lists and then concatenated together. This function is used in the definitions of many other functions.

Additionally, there is a 'pseudo-function', raw, which is not a function but rather a means of escaping a string.
It can be used like `` to produce the literal text "some|text\".
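
To make the builtin arithmetic functions described above more concrete, here is a minimal Python sketch of how they might behave. It is not wordgen's actual implementation. The names flatten, chain, and BUILTINS are hypothetical, arguments are modelled as strings or lists of strings, and several points are assumptions: that the empty product is 1 and the empty difference is 0, that "-" folds from the left like "/", and that "^" also folds from the left (the text above does not say which direction it uses).

```python
from functools import reduce
import operator


def flatten(*args):
    """Interpret every argument as a list and concatenate the lists."""
    out = []
    for arg in args:
        if isinstance(arg, (list, tuple)):
            out.extend(arg)   # list argument: splice its items in
        else:
            out.append(arg)   # scalar argument: treat as a one-item list
    return out


def chain(op, values, empty):
    """Left-fold op over values read as floats; return empty if there are none."""
    nums = [float(v) for v in values]
    return reduce(op, nums) if nums else empty


# Hypothetical dispatch table keyed by the function names used in datafiles.
BUILTINS = {
    "+": lambda *a: chain(operator.add, flatten(*a), 0.0),
    "*": lambda *a: chain(operator.mul, flatten(*a), 1.0),
    "-": lambda *a: chain(operator.sub, a, 0.0),
    "/": lambda *a: chain(operator.truediv, a, 1.0),
    "^": lambda *a: chain(operator.pow, a, 1.0),  # fold direction is an assumption
    "len": lambda *a: len(flatten(*a)),
}

print(BUILTINS["+"](["1", "2"], "3"))   # 6.0
print(BUILTINS["/"]("8", "2", "2"))     # 2.0, i.e. (8 / 2) / 2
```

The division example shows why the fold direction matters: a left fold gives (8 / 2) / 2 = 2, whereas a right fold would give 8 / (2 / 2) = 8.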