Wiktionary supports server-side scripting to generate content for pages, using the Scribunto extension. It is used as a complement to templates, in particular parser functions like {{#if:}}, {{#switch:}} and so on. Scripts are divided into modules located in the Module: namespace, and are written in the Lua programming language.

Getting started

Here are some helpful links to get you started with Lua and Scribunto.

  • Learning Lua – If you're not yet familiar with the language, or with programming in general, this is a good place to start. It covers only "generic" Lua, not the parts that are specific to using Lua within a wiki.
  • Scribunto/Lua tutorial – A short tutorial to explain how to use Scribunto/Lua within the wiki.
  • Scribunto/Lua reference manual – A reference manual for Lua as it applies to the Scribunto extension. This also lists Wiki-specific things that don't exist in normal Lua.
  • Official reference manual for Lua 5.1 – A quick reference to the language, for more experienced programmers. Again, this is generic Lua, and does not cover specific details about using it on Wiktionary, but it has information that the Scribunto-specific manual lacks.
  • lua-users wiki – A user-written wiki with many articles on various aspects of Lua.
  • Programming in Lua by Roberto Ierusalimschy, one of the creators of Lua – in-depth discussion of the basics of Lua 5.0.

Information about using Scribunto on Wiktionary specifically:

How Scribunto interacts with the wiki

A Scribunto module is in effect one large function: it runs from top to bottom and is expected to return a value. Normally, the return value is a table of functions keyed by name, which can then be called from another module or from "wikispace". In theory, a module could return something else: a table of strings, a table containing other tables, or even a single value. However, only a module that returns a table of functions can be invoked from wikitext; anything else can only be imported and used from within Scribunto itself.
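
The structure described above can be sketched as follows. This is only an illustration: the wrapper function moduleBody and the exported function hello are hypothetical names, used so that the example runs outside the wiki. On a real module page, the body is written directly on the page and return p is its last line.

```lua
-- Stand-in for a module page: on the wiki, the body of this function
-- would be the whole page, and "return p" would be its last line.
local function moduleBody()
    local p = {}  -- the table of exported functions

    -- On the wiki, this would be callable as {{#invoke:moduleName|hello}}.
    function p.hello(frame)
        return "Hello, '''world'''!"
    end

    return p
end

-- Another module would obtain the same table with require.
local exported = moduleBody()
print(exported.hello())
```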

A function in a Scribunto module is called from wikitext as if it were a parser function, using this notation: {{#invoke:moduleName|functionName}}. The function then returns wikitext as its output. This wikitext can contain HTML-style wikitext (such as <b> and <table> and so on) and wiki-specific markup (such as '''…''' for bolding and [[…]]), but cannot invoke templates, parser functions, magic words or parser extension tags (for example, something like {{!}} will be interpreted as meaning the literal string {{!}}, rather than be expanded out to |). If a module needs to invoke a template or a parser function, it has to use the methods provided for this purpose by the frame object described below, such as frame:expandTemplate, frame:callParserFunction and frame:preprocess. But it is hoped that these will not be needed too often.

Since, in order to be useful, the function needs information about the context in which it was invoked, the Scribunto extension will pass it a single argument, customarily named frame. This argument can be used to obtain various key bits of information; in particular, if the parameter paramName=paramValue was passed to the template that is invoking the Scribunto function, then inside the function, frame:getParent().args.paramName will be the string 'paramValue'. (Any numbered/unnamed parameters can be accessed by number; for example, the template's {{{1}}} becomes the module's frame:getParent().args[1].) Even in the absence of the frame parameter, Scribunto code can use mw.getCurrentFrame() to obtain its value; therefore, functions don't actually need to pass frame around to each other. A module can call process in Module:parameters to clean up the parameters, and do some checking on their validity.

It is also possible (though not usually necessary) for a template to pass further arguments as part of the invocation, in which case these will be available via frame.args. For example, if the module was invoked using {{#invoke:moduleName|functionName|paramName=paramValue}}, then frame.args.paramName will be the string 'paramValue'. (Numbered/unnamed parameters work analogously.)
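
To make the difference between the two argument tables concrete, here is a small sketch that runs outside the wiki. The helper makeMockFrame and the function describe are hypothetical: real frame objects are created by the Scribunto extension, and the mock only imitates frame.args and frame:getParent().args.

```lua
-- Hypothetical stand-in for the frame object that Scribunto passes in.
-- It only mimics frame.args and frame:getParent().args.
local function makeMockFrame(invokeArgs, templateArgs)
    local parent = { args = templateArgs }
    local frame = { args = invokeArgs }
    function frame:getParent()
        return parent
    end
    return frame
end

local p = {}

-- Reads one named argument from the #invoke call and the first
-- unnamed argument from the template call.
function p.describe(frame)
    local fromInvoke = frame.args.paramName
    local fromTemplate = frame:getParent().args[1]
    return tostring(fromInvoke) .. " / " .. tostring(fromTemplate)
end

-- Simulates {{#invoke:moduleName|describe|paramName=paramValue}}
-- written inside a template that was called as {{templateName|first}}.
local frame = makeMockFrame({ paramName = "paramValue" }, { "first" })
print(p.describe(frame))
```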

Debugging and error reporting

When Lua encounters an error in a script, it aborts the script and shows "Module error" in large red, clickable text on the page. Click on this text in order to see what caused the error.

A module error also adds the page to Category:Pages with module errors. When writing modules or converting templates, it is a good idea to check this category to see whether any pages that use it are triggering errors. It is also possible to trigger errors yourself, using the following:

error("You forgot to supply a required parameter!")

This can be used to check whether a module and its accompanying template(s) are being used correctly, and to show an error to the user otherwise. It is highly recommended that you use this whenever possible, to make your modules more robust and to make it easier to find mistakes.
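
As a sketch of such a check (the function show and its argument handling are hypothetical), a function can validate its input before doing any work. The standard Lua function pcall is used here only so that the error can be inspected instead of aborting the example.

```lua
local p = {}

-- Hypothetical function that requires a non-empty first argument.
function p.show(args)
    local term = args[1]
    if term == nil or term == "" then
        error("You forgot to supply a required parameter!")
    end
    return "Term: " .. term
end

-- pcall runs a function in protected mode: it returns false plus the
-- error message if the function raises an error.
local ok, message = pcall(p.show, {})
print(ok, message)
print(p.show({ "word" }))
```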

While you are working on a script, it may occasionally be useful to generate debug messages so that you can see what is going on at particular points in your script. You can do this with the mw.log function:

mw.log("Testing the script. The value of the variable 'a' is : " .. a)

This function will output its argument to the Scribunto debug console if you run the module in the debug console, e.g. by typing p.main() if the function you want to run is named main and takes no arguments. It automatically adds a newline to the end of the message.

The function os.clock can be used for simple benchmarking of a given function. It can be used like this:

function p.foo(frame)
    local start = os.clock()
    -- do whatever the function needs to do here
    mw.log("Function took " .. os.clock() - start .. " seconds.")
    -- return
end

An error also occurs when the time allocated for running scripts expires before all scripts on a page can be run. If you are making a complex and potentially time-consuming edit to a module, you can use the "Preview page with this template" feature to preview a very large, module-heavy page like [[a]] to check if your script slows it down too much.

English Wiktionary also has its own purpose-made debugging module, aptly named Module:debug. The function track can be used to track entries that fulfill a particular condition without interfering with the operation of a function or template. It is similar in purpose to Category:Template tracking.

"Frame" and "parent frame"

There are actually two ways that values can be passed to a Scribunto module. The first is to pass the values as parameters directly to the module invocation. For example, suppose there is a Lua function LanguageData.getLangName that generates (say) English for en. A template {{langname}} invokes the function and passes its own arguments on, and other pages access the function by writing (e.g.) {{langname|en}}. With this approach, the Lua function needs to access the arguments that were passed to #invoke; to that end, the template and the function might be written like this:

In the template:
{{#invoke:LanguageData|getLangName|{{{1}}}}}
In the module:
function LanguageData.getLangName(frame)
    local args = frame.args
    local langCode = args[1]
    local langName = ... -- some code to determine langName
    return langName
end

However, there is another way, which is recommended because it is faster. Every module also has access to its so-called "parent frame", which contains the collection of arguments passed not to the module, but to the template that called it. So rather than invoking the module and passing the values on explicitly, the template invokes the module with no parameters. The module itself can access the parameters that were passed to the template, using the parent frame. The example above would then be written like this:

In the template:
{{#invoke:LanguageData|getLangName}}
In the module:
function LanguageData.getLangName(frame)
    local args = frame:getParent().args
    local langCode = args[1]
    local langName = ... -- some code to determine langName
    return langName
end

As you can see, the only real difference is the use of frame:getParent().args to get the arguments of the "parent frame" (i.e., the template-call), rather than the arguments of the module invocation itself.

It's possible to write a function that supports both approaches (by using invokeArgs[1] or templateArgs[1]). This may occasionally be useful for simple functions that can be called more or less intuitively from a template as well as another module. But for more complicated functions, it's better to write the "main" code in one function, and write another function that can be invoked from a template, which then gathers the parameters and calls the main function.
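
The split recommended above might look like the following sketch. The lookup table langNames, the entry-point name getLangNameFromFrame and the mock frame are all assumptions made for illustration; only the names invokeArgs and templateArgs come from the text.

```lua
local p = {}

-- Hypothetical stand-in for real language data.
local langNames = { en = "English", fr = "French" }

-- "Main" code: takes plain Lua values, so other modules can call it directly.
function p.getLangName(langCode)
    local langName = langNames[langCode]
    if not langName then
        error("Unknown language code: " .. tostring(langCode))
    end
    return langName
end

-- Entry point for #invoke: gathers the parameters from the frame,
-- preferring the #invoke arguments and falling back to the template's.
function p.getLangNameFromFrame(frame)
    local invokeArgs = frame.args
    local templateArgs = frame:getParent().args
    return p.getLangName(invokeArgs[1] or templateArgs[1])
end

-- Mock frame simulating a template call such as {{langname|fr}}.
local mockFrame = {
    args = {},
    getParent = function(self)
        return { args = { "fr" } }
    end,
}
print(p.getLangNameFromFrame(mockFrame))
print(p.getLangName("en"))
```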

Note that an empty parameter passed on from a template "counts"; i.e. the template call {{MyTemplate||MySecondArgument}} will cause the condition

if args[1] then
-- <do something>
end

to be satisfied, because Lua treats an empty string as true (only nil and false count as false). The code

if args[1] and args[1] ~= '' then
-- <do something>
end

on the other hand, will only respond to a non-empty first argument.
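
When many parameters need the same treatment, one option is to replace empty strings with nil once, up front. The helper removeEmpty below is a hypothetical sketch that operates on a plain Lua table; it is not the behaviour of Module:parameters, which is the tool this page mentions for real parameter handling.

```lua
-- Returns a copy of args in which empty strings are left out, so that
-- "if args[1] then" is true only for non-empty values.
local function removeEmpty(args)
    local cleaned = {}
    for key, value in pairs(args) do
        if value ~= "" then
            cleaned[key] = value
        end
    end
    return cleaned
end

-- Simulates the arguments of {{MyTemplate||MySecondArgument}}.
local args = removeEmpty({ "", "MySecondArgument" })
print(args[1], args[2])
```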

Efficiency

The efficiency of Lua can be checked through the template preview feature: After pressing "preview", right click on the preview of the page and request the page source. In the page source, search for "NewPP" in order to see how much time the execution of the Lua module took (example: Lua time usage: 0.004s). Search for "served" in order to see how long it took to render the entire page (example: Served by mw1035 in 0.498 secs). This latter time can be used to compare how long it takes to render the same page with and without Lua modules.

Various techniques can be used to increase efficiency. The following come from a chapter in Lua Programming Gems:

  • Avoid creating functions and tables inside loops.
  • Use local variables.
  • Memoize expensive functions.
  • Avoid a large number of separate string concatenation operations by inserting strings into a table with table.insert and creating the final string with table.concat.
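
The string-building technique mentioned above can be sketched as follows. The function joinNumbers is a hypothetical example; it assigns to an explicit index rather than calling table.insert, but the idea of collecting the pieces and joining them once is the same.

```lua
-- Builds "1,2,...,n" by collecting the pieces in a table and joining
-- them once, instead of creating a new string on every iteration.
local function joinNumbers(n)
    local pieces = {}
    for i = 1, n do
        pieces[i] = tostring(i)
    end
    return table.concat(pieces, ",")
end

print(joinNumbers(5))
```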

Each individual concatenation operation (whether it involves two strings, "a" .. "b", or several, "a" .. "b" .. "c" .. "d") generates a single new string (blog post), which is stored in Lua memory. Many concatenation operations (for instance, in loops) can use a lot of memory because many intermediate strings are created.

To memoize a function that has one argument and one return value, you may be able to use the memoize function from Module:fun. Beware that it does not work as the third argument to gsub, though, because it returns a table.
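
For illustration, memoization of a one-argument function can be sketched in a few lines. This is not the implementation in Module:fun: it is a generic version that returns a plain function (a closure) rather than a table, and it assumes that the wrapped function never returns nil.

```lua
-- Wraps func so that each distinct argument is computed only once.
local function memoize(func)
    local cache = {}
    return function(arg)
        local result = cache[arg]
        if result == nil then
            result = func(arg)
            cache[arg] = result
        end
        return result
    end
end

-- Counts how often the "expensive" function actually runs.
local calls = 0
local function slowSquare(n)
    calls = calls + 1
    return n * n
end

local fastSquare = memoize(slowSquare)
local first = fastSquare(12)
local second = fastSquare(12)  -- served from the cache
print(first, second, calls)
```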

For Scribunto specifically, it increases efficiency to use basic string functions instead of mw.ustring ones when performing a large number of string operations. The Ustring functions are implemented mostly in PHP (see UstringLibrary.php on Phabricator), and they must parse the string (sequence of bytes) into codepoints, and the pattern matching functions must convert the Lua pattern into PHP regex, before they are able to do their job. The basic string functions operate on bytes, so they eliminate this intermediate step. Read more on this below.

The mw.text.gsplit function uses the mw.ustring functions (source code). In many cases, the function string.gmatch can be used instead, and will be much faster. For example, to iterate over lines of wikitext, use for line in string.gmatch(wikitext, "[^\n]*") do --[[ something or other ]] end instead of for line in mw.text.gsplit(wikitext, "\n") do --[[ something or other ]] end. Note that in Lua 5.1 the pattern "[^\n]*" also produces an empty match after each non-empty line, so the loop body should skip empty strings; if empty lines are not needed at all, the pattern "[^\n]+" can be used instead. The function mw.text.split can be replicated by using this same loop to insert the items into a table.
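
A sketch of such a replacement, using only the basic string library, is shown below. The function splitLines is a hypothetical name, and the pattern "[^\n]+" is a deliberate simplification: it skips empty lines, whereas mw.text.split keeps them, so this version is suitable only when empty lines are not needed.

```lua
-- Splits text into its non-empty lines using only the basic string library.
local function splitLines(text)
    local lines = {}
    local i = 0
    for line in string.gmatch(text, "[^\n]+") do
        i = i + 1
        lines[i] = line
    end
    return lines
end

local lines = splitLines("first\nsecond\n\nfourth")
print(#lines, lines[1], lines[2], lines[3])
```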

Methods of creating an array (a table with consecutive integer keys: { "a", "b", "c" }, for instance), vary in speed. The following three methods are ranked from fastest to slowest. The first method is used in Module:table, which is used frequently enough that it needs to be as efficient as possible.

local str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
----
local t = {}
local i = 0
for character in string.gmatch(str, ".") do
	i = i + 1
	t[i] = character
end
----
local t = {}
for character in string.gmatch(str, ".") do
	t[#t + 1] = character
end
----
local t = {}
for character in string.gmatch(str, ".") do
	table.insert(t, character)
end

The first method is fastest because all that must be done in each iteration is an addition operation and creation of an index. In the second, the length of the table must be newly calculated for each iteration. The function table.insert operates in a similar way to the second method, but it must first determine whether it has been supplied a third argument or not.

Lua tables contain an array and a hash part. The array part takes less memory per field than the hash part, so using arrays rather than hashes is a good idea when memory is an issue. According to World of Warcraft wiki, array fields use 16 bytes, while hash fields use 40 bytes.

Both the array and the hash part have a size that is a power of two. For the array part, the size is the smallest power of two that is greater than or equal to the greatest index in the array part; for the hash part, the size is the smallest power of two that is greater than or equal to the number of elements in the hash part. The array part can only contain fields that are indexed by a positive integer (t[1], t[2], t[3]), while the hash part contains fields with any type of index.

Array fields are added when an element in a table literal does not have an explicit index (t = { "a", "b", "c" } creates a table with an array part with a size of four), or, under certain conditions, when an element is added with the indexing operator (t = {}; t[1] = "a"; t[2] = "b"; t[3] = "c"). When a table literal contains explicit numerical indices, hash fields are added: t = { [1] = "a", [2] = "b", [3] = "c" } creates a table whose hash part has a size of four. (These fields may be shifted to the array part of the table if more fields are added to the table.)

In vanilla Lua and in Scribunto, there is no way to check how many fields in a table are in the array part and the hash part. The length operator doesn't check if positive integer–indexed fields are in the array part or the hash part.

Unicode

Lua itself does not understand Unicode; whereas there are more than a million possible Unicode characters, a "string" in Lua is just a sequence of bytes in the range 0–255. (Unfortunately, the Lua documentation refers to these bytes as "characters", but don't be deceived.)

To address this lack, the Scribunto extension does (at least) four things for us:

  • whenever any text is passed into a Lua module (e.g., as a template parameter), the original character-string is transformed into a byte-string using UTF-8. UTF-8 is a variable-width encoding: ASCII characters are transformed into just a single byte, while other Unicode characters are transformed into two, three, or four bytes.
  • the text returned by a Lua module is interpreted as UTF-8, and transformed back into a Unicode-character string. This means, for example, that if a module receives a bit of text and returns it unmodified, then all will be well.
    Technical notes:
    1. In the event that the string passed back from Lua is not valid UTF-8, invalid sequences will be replaced by the replacement character U+FFFD (�). The same is also done for some valid UTF-8 characters, such as many of the control characters in the range U+0000 to U+001F.
    2. In addition to being UTF-8-decoded, the characters in the string will be modified so that they conform with the Normalization Form Canonical Composition (NFC). For further explanation, see below.
  • the source-code of a Scribunto module is encoded using UTF-8, so we can use Unicode characters inside Lua string literals.
  • the Scribunto extension includes a mw.ustring ("Unicode string") module, which is always available. This module provides UTF-8-aware analogues of Lua's built-in string functions. In essence, the functions in this module allow you to operate on a UTF-8-encoded byte-string as though it were still the original Unicode character-string.

Even so, when using the mw.ustring library, there are some caveats that you need to pay attention to. Although the library is capable of interpreting a sequence of several bytes as a single Unicode character, there may still be more than one Unicode character in a single logical character. For example, although я́ appears to us as a single logical character, it is really encoded as two distinct Unicode characters: the Cyrillic letter я (U+044F) followed by a combining acute accent (U+0301). Therefore, the code mw.ustring.len("я́") will actually return 2, not 1. More subtly, the following will also return a valid result: mw.ustring.find("я", "[я́]"). This happens because the character class in the pattern "[я́]" actually contains two characters (the Cyrillic letter and the accent mark); the function searches for each character individually, and finds the first one (the Cyrillic letter).
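
The same distinction between bytes, codepoints and logical characters can be observed with the basic string library alone. In this sketch, countCodepoints is a hypothetical helper that assumes valid UTF-8 input; on the wiki, mw.ustring.len does this job. The string is written with byte escapes so that the combining accent stays visible in the source.

```lua
-- "я́" as byte escapes: U+044F is "\209\143", U+0301 is "\204\129".
local text = "\209\143\204\129"

-- Counts codepoints by counting every byte that is not a UTF-8
-- continuation byte ("\128" to "\191"). Assumes valid UTF-8.
local function countCodepoints(s)
    local _, count = string.gsub(s, "[^\128-\191]", "")
    return count
end

-- One logical character, but four bytes and two codepoints.
print(#text, countCodepoints(text))
```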

MediaWiki converts Unicode characters to the canonical composition normalization form (NFC) when they are entered into a textbox or displayed on a page (see the MediaWiki page on Unicode normalization considerations). Among other things, this means that a bare letter character plus a combining character changes to the composed form, if possible, and some individual characters are changed to a character with a similar appearance. For example, the two-character sequence a (U+0061) + ◌́ (combining acute, U+0301) becomes á (U+00E1), and a CJK Compatibility Ideograph changes into the corresponding character from one of the CJK Unified Ideograph blocks (豈 (U+F900) → 豈 (U+8C48)). To display characters that would otherwise be transformed, use numeric character references such as &#xF900; (for U+F900).

Beware of normalization forms when testing the output of module functions that return decomposed forms (NFD) using a module such as Module:UnitTests. Even if the "actual" and "expected" fields are identical when displayed on the page, they may be different in the module, in which case the tests will fail. (For instance, the "actual" field may contain sequences of a base letter followed by a combining character, while the "expected" field contains the corresponding precomposed characters.) Convert them to the same normalization form (NFC or NFD) using mw.ustring.toNFC or mw.ustring.toNFD to make sure that the comparison is done correctly.

Generating Unicode characters

There are several ways to type a Unicode character in a Lua module: add the character itself to a Lua string, add a decimal escape sequence representing the bytes in the UTF-8 encoding to a Lua string, or place the codepoint (in hexadecimal or decimal base) into mw.ustring.char. For example, the letter á (Latin small letter a with acute, codepoint U+00E1) can be entered as:

  1. "á"
  2. "\195\161"
  3. mw.ustring.char(0xE1)
  4. mw.ustring.char(225)

The Scribunto extension currently uses Lua version 5.1 (with a few features from 5.2), so hexadecimal escape sequences (added in Lua 5.2) and Unicode escape sequences (added in Lua 5.3) are not supported. In Lua 5.3, the escape sequences "\xc3\xa1", "\xC3\xA1", "\u{e1}" and "\u{E1}" all yield the character á, while in Scribunto they yield xc3xa1, xC3xA1, u{e1} and u{E1}.

Byte sequences (method 2) should be avoided, because they are hard to read and write and susceptible to errors. They are different from codepoints: for instance, the byte sequence for the combining acute accent (displayed over a dotted circle: ◌́) is "\204\129", or 0xCC, 0x81 in hexadecimal base, while the codepoint is U+0301 (769 in decimal). There is no correspondence unless one looks at the individual bits. The byte sequence can be converted to the codepoint and vice-versa, but that is difficult to do without a program.
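
To show what such a conversion program involves, here is a sketch of a codepoint-to-UTF-8 encoder in plain Lua. The function utf8Encode is a hypothetical helper written for illustration; it uses arithmetic instead of bit operations because Lua 5.1 has no bitwise operators, and it does not validate its input (for example, it does not reject surrogates or values above U+10FFFF). On the wiki, mw.ustring.char does this job.

```lua
-- Encodes a Unicode codepoint as a UTF-8 byte string.
local function utf8Encode(cp)
    local floor = math.floor
    if cp < 0x80 then
        -- 0xxxxxxx
        return string.char(cp)
    elseif cp < 0x800 then
        -- 110xxxxx 10xxxxxx
        return string.char(
            0xC0 + floor(cp / 0x40),
            0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        -- 1110xxxx 10xxxxxx 10xxxxxx
        return string.char(
            0xE0 + floor(cp / 0x1000),
            0x80 + floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40)
    else
        -- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return string.char(
            0xF0 + floor(cp / 0x40000),
            0x80 + floor(cp / 0x1000) % 0x40,
            0x80 + floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40)
    end
end

-- The combining acute accent and the letter á from the text above.
print(utf8Encode(0x301) == "\204\129")
print(utf8Encode(0xE1) == "\195\161")
```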

Although codepoints can be entered into mw.ustring.char using decimal base (method 4), hexadecimal base (method 3) is more recognizable, because that is the way codepoints are usually represented. For instance, U+00E1 stands for the letter á, and corresponds to the Lua code mw.ustring.char(0xE1).

Combining characters are best not entered on their own. For example, a combining acute accent added directly inside quotes ("́" or '́') is impossible to read, as it displays directly on top of one of the quotes.

Strings are fed through Unicode composition normalization before being given to the invoked function as arguments, and also when returned as output. Consequently, strings may be modified on the way in and on the way out. For example, the two codepoints U+0061 and U+0301 (Latin lowercase a followed by combining acute accent) are automatically converted to the single codepoint U+00E1 (Latin lowercase a with acute accent, a single character). To analyse strings on a character-by-character basis, you need to do it within Lua using the mw.ustring.gcodepoint function; you cannot rely on the on-page output containing exactly the characters you returned.

String functions

Scribunto contains the basic Lua string functions and the mw.ustring functions. Some of the Ustring functions are copies of the basic string functions, others are equivalent functions that are modified to work with strings containing Unicode characters beyond the basic ASCII character set, and there are some new functions.

The modified functions include mw.ustring.char, mw.ustring.codepoint, mw.ustring.find, mw.ustring.gmatch, mw.ustring.gsub, mw.ustring.lower, mw.ustring.sub, mw.ustring.upper.

The basic Lua string functions look at bytes, while the Ustring functions look at codepoints encoded in UTF-8.

For the basic Lua functions, length means the number of bytes. Anything beyond basic ASCII will have a length greater than the number of displayed characters.

string.len("a")				--> 1
string.match("a", ".")		--> "a"

string.len("á")				--> 2
string.match("á", "..")		--> "á" (U+00E1, LATIN SMALL LETTER A WITH ACUTE); a two-byte character

string.len("ἀ")				--> 3
string.match("ἀ", "...")	--> "ἀ" (U+1F00, GREEK SMALL LETTER ALPHA WITH PSILI); a three-byte character

string.len("𐌀")				--> 4
string.match("𐌀", "....")	--> "𐌀" (U+10300, OLD ITALIC LETTER A); a four-byte character

Patterns

Note: The following section only holds true for the UTF-8 encoding, which is used on Wiktionary (as well as other MediaWiki projects). Other encodings follow different rules.

In the discussion below, ASCII refers to Unicode characters in the codepoint range U+0000 to U+007F. They are encoded as one byte each: bytes "\0" to "\127" (0xxxxxxx in binary). Non-ASCII refers to Unicode characters in the codepoint range U+0080 to U+10FFFF. They are encoded using two, three, or four bytes. The first byte (the leading byte) is in the range "\194" to "\244" (110xxxxx, 1110xxxx, and 11110xxx) and the following one to three bytes (continuation bytes) are in the range "\128" to "\191" (10xxxxxx). Hence ASCII is synonymous with single-byte, and non-ASCII with multi-byte.

Note also that different bytes are used in ASCII and non-ASCII, so it is easy to determine whether an arbitrary byte belongs to one or the other.
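
The byte ranges given above translate directly into a small classifier. The function classifyByte is a hypothetical helper for illustration; it looks at one byte value at a time and does not check whether a whole sequence is well-formed.

```lua
-- Classifies a single byte value (0-255) according to its role in UTF-8.
local function classifyByte(byte)
    if byte <= 127 then
        return "ascii"          -- 0xxxxxxx
    elseif byte <= 191 then
        return "continuation"   -- 10xxxxxx
    elseif byte >= 194 and byte <= 244 then
        return "leading"        -- 110xxxxx, 1110xxxx, 11110xxx
    else
        return "invalid"        -- 192, 193 and 245-255 never occur in UTF-8
    end
end

-- "á" is "\195\161": a leading byte followed by a continuation byte.
print(classifyByte(string.byte("a")))
print(classifyByte(195), classifyByte(161))
```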

Basic string patterns

A pattern will behave identically in both the basic string and the Ustring functions if it fulfills certain conditions: it must only contain ASCII or simple sequences of non-ASCII characters. Thus the patterns "abc", "[abc]" or "[^abc]", "αβγ" will all work correctly whether they are used in the basic string function or the Ustring function.

But quantifiers or sets containing non-ASCII characters will fail. They act on individual bytes, not characters. A set containing a non-ASCII character will match any one of the bytes in the encoding of the character. A quantifier will act on the last byte immediately before it.

For instance, in the basic Lua string functions, the quantified item "á+" does not match a sequence of one or more of the character "á" ("á", "áá", "ááá", ...). The character á is a two-byte sequence, equivalent to the byte escape sequence "\195\161", so the pattern "á+" is really "\195\161+", and it matches the byte "\195" plus one or more of the byte "\161": "\195\161", "\195\161\161", "\195\161\161\161", .... (Only the first of these options is valid UTF-8. The rest would display as á�, á�� if they were published on a Wiktionary page, and they are unlikely ever to occur in a module.)

Similarly, the set "[áé]" does not match "the character á or the character é". Rather, it is equivalent to "[\195\161\195\169]", which matches any single one of the bytes used to encode á or é in UTF-8. If it is applied to "é" (= "\195\169"), it will match only the first byte, "\195".

See Module:User:Erutuon/patterns for a function that determines whether a pattern will behave in the same way in the basic string functions as in the Ustring functions.

Ustring patterns

The ustring functions fix these problems. They deal with codepoints rather than bytes. So any multi-byte sequence that encodes a Unicode character is considered as a unit.

The Ustring functions must be used if the pattern contains quantifiers acting on non-ASCII characters, character classes that are meant to find Unicode characters, or sets with non-ASCII characters. Using the basic string functions is likely to return incorrect results. Some examples:

string.gsub("áéíóúý", "[áéíóú]", "")		--> "\189" (invalid UTF-8)
mw.ustring.gsub("áéíóúý", "[áéíóú]", "")	--> "ý"

Here the pattern is equivalent to "[\161\169\173\179\186\195]", if duplicates are removed and bytes are sorted. "áéíóúý" is equivalent to "\195\161\195\169\195\173\195\179\195\186\195\189"; the only "new" byte is "\189", the second byte in the encoding of "ý".

string.match("ábc", ".b")					--> "\161b" (invalid UTF-8); "ábc" is "\195\161bc" and "." is equivalent to "[\0-\255]"
mw.ustring.match("ábc", ".b")				--> "áb"

string.match("ábc", "%a+")					--> "bc"; "%a" is equivalent to "[a-zA-Z]"
mw.ustring.match("ábc", "%a+")				--> "ábc"

string.match("ááábc", "á+")					--> "á"; "á+" is equivalent to "\195\161+"
mw.ustring.match("ááábc", "á+")				--> "ááá"

To match a single UTF-8 character in the basic string functions, you can use the pattern "[%z\1-\127\194-\244][\128-\191]*". For instance, the two expressions below give the same result. string.gsub will be faster, because it has no processing to do before it compares the string "áéíóúý" to the pattern, while mw.ustring.gsub has to parse both the string and the pattern into codepoints before matching them.

local repl = { ["á"] = "a", ["é"] = "e", ["í"] = "i", ["ó"] = "o", ["ú"] = "u", ["ý"] = "y", }

string.gsub("áéíóúý", "[%z\1-\127\194-\244][\128-\191]*", repl) --> "aeiouy"
mw.ustring.gsub("áéíóúý", ".", repl) --> "aeiouy"

Organizing Lua modules

Document Lua modules on a /documentation subpage. The documentation will appear at the top of the module page.

Categories cannot be entered into modules directly. Put a category on the documentation page, separated from the documentation by <includeonly> tags on the top and bottom:

(documentation)
<includeonly>
[[Category:Ukrainian modules]]
[[Category:Transliteration modules]]
</includeonly>

See also