Milestones:
User story:
As a Wikidata editor,
I want to avoid repeating identical labels in hundreds of languages
in order to reduce the amount of redundant content that needs to be maintained on Wikidata.
Problem:
We have many labels that are in principle identical across different languages (see the Examples section). This has some negative consequences:
- editors have to create and maintain redundant content (copying the same label to most or all languages creates a massive number of edits and is a huge waste of resources)
- redundant information has to be stored, which burdens our systems (e.g. the Query Service)
Solution:
Introduce a new language code that all languages fall back to. This will be particularly helpful for Unicode characters, scientific articles, and codes, as well as for names in Latin script (as we do not have an elaborate fallback system for that script yet). We will test whether this solution (a single new language code) is good enough, or whether we need more specific language codes after all to model a useful fallback chain.
This task
- Add "mul" as a new monolingual language code.
- Have other languages fall back to it (Translatewiki fallback chain > "mul" > "en")
Community takes over
- Community creates guidelines and help pages on how to use the new code, e.g.
  - What if one Latin-script language prefers one form (e.g. "Philip L. Brown") and another Latin-script language prefers a different form (e.g. "Philip Larry Brown" or "Philip Brown")?
  - In what cases should the Latin-script label be used for "mul" instead of the native label (while still making sure that re-users can identify the native label via a property)?
  - etc.
- Community gives feedback after some months about how the new code and guidelines work
- Based on the feedback we might iterate on the approach if necessary.
Ideas for the future
- start to show a warning if someone wants to add a label in another language that is identical to the "mul" label
- include the experience in a possible future solution for multilingual descriptions (Abstract Descriptions)
- re-evaluate whether the final fallback to "en" is still appropriate
Mockup:
Examples:
This will be useful in many different places:
Names
- persons (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q5, as of now 9.2M) have in most cases the same label and the same aliases repeated in different languages, e.g. https://www.wikidata.org/wiki/Q42 .
- given names and family names (https://w.wiki/3zWT, which counts Q202444 and Q101352 including subclasses, as of now 590k): in all cases, the same labels are repeated in different same-script languages, e.g. https://www.wikidata.org/wiki/Q21448867.
- astronomical objects (11M): e.g. the galaxy "SDSS J151017.36+160605.3" has "SDSS J151017.36+160605.3" as its label 411 times.
- taxa (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q16521, as of now 3.1M): e.g. the species "Neotrogla curvata" has "Neotrogla curvata" as its label 411 times.
Unicode characters
- Unicode character "♣" - has "♣" as the label and "U+2663" as an alias 446 times
Codes
- Switzerland - has "CH" as an alias 403 times
- carbon - has "C" as an alias 187 times
- the disambiguation page "C" - has "C" as the label 104 times
- the Danish men's national road cycling team 2021 - has "DEN 2021" as an alias 411 times
Scientific articles
- scientific articles (https://www.wikidata.org/wiki/Special:Search/haswbstatement:P31=Q13442814, as of now 42M): in many cases the same label is repeated in different languages (e.g. https://www.wikidata.org/wiki/Q27860672).
- in some cases, there could be articles with parallel titles in different languages (e.g. https://www.wikidata.org/wiki/Q59238742).
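To make the kind of redundancy in these examples concrete, here is a minimal Python sketch that counts how often one label text is repeated within an item and lists the language codes that would become candidates for removal once a "mul" label exists. The function names `most_repeated_label` and `redundant_languages` and the sample data are hypothetical, not part of Wikibase. The sketch also ignores intermediate fallbacks: a language whose Translatewiki chain reaches a different label before "mul" would still need its own label.

```python
from collections import Counter


def most_repeated_label(labels):
    """Return (text, count) for the label text shared by the most language codes.

    `labels` maps language code -> label text. Returns None for an empty mapping.
    """
    if not labels:
        return None
    counts = Counter(labels.values())
    return counts.most_common(1)[0]


def redundant_languages(labels, mul_label):
    """Return the sorted language codes whose label is identical to `mul_label`.

    These are only candidates: intermediate fallbacks are not checked here.
    """
    return sorted(
        code
        for code, text in labels.items()
        if code != "mul" and text == mul_label
    )


# Hypothetical, heavily shortened label set for a taxon item.
sample = {
    "en": "Neotrogla curvata",
    "de": "Neotrogla curvata",
    "fr": "Neotrogla curvata",
    "ja": "ネオトログラ",
}

text, count = most_repeated_label(sample)
candidates = redundant_languages(sample, text)
```

With the sample above, `text` is "Neotrogla curvata", `count` is 3, and `candidates` is `["de", "en", "fr"]`; the "ja" label differs and would be kept.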
Translatewiki fallback chain:
Examples:
ami > zh-tw, zh-hant, zh-hans
zh-tw > zh-hant, zh-hans
zh-hant > zh-hans
zh-hans > []
de-at > de
de > []
en-gb > en
en > []
Hard-coded fallback chain:
old
- Translatewiki fallback chain > "en"
new
- Translatewiki fallback chain > "mul" > "en"
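The new chain can be sketched in Python as follows. This is an illustration under stated assumptions, not the Wikibase implementation: the `TRANSLATEWIKI_FALLBACKS` table holds only the example chains listed above, `fallback_chain` and `resolve_label` are hypothetical helper names, and the sketch assumes that a code already present in the Translatewiki chain (such as "en" for "en-gb") keeps its earlier position instead of being repeated after "mul".

```python
# Hypothetical table; it contains only the example chains from this task.
TRANSLATEWIKI_FALLBACKS = {
    "ami": ["zh-tw", "zh-hant", "zh-hans"],
    "zh-tw": ["zh-hant", "zh-hans"],
    "zh-hant": ["zh-hans"],
    "zh-hans": [],
    "de-at": ["de"],
    "de": [],
    "en-gb": ["en"],
    "en": [],
}

# Hard-coded tail appended after the Translatewiki chain (new behaviour).
HARD_CODED_TAIL = ["mul", "en"]


def fallback_chain(lang):
    """Return the language codes to try for `lang`, in order."""
    chain = [lang] + TRANSLATEWIKI_FALLBACKS.get(lang, [])
    for code in HARD_CODED_TAIL:
        # Assumption: do not repeat a code that is already in the chain.
        if code not in chain:
            chain.append(code)
    return chain


def resolve_label(labels, lang):
    """Return (code, text) for the first code in the chain that has a label.

    `labels` maps language code -> label text. Returns None if nothing matches.
    """
    for code in fallback_chain(lang):
        if code in labels:
            return code, labels[code]
    return None


# An item that only carries a "mul" label still resolves for a de-at reader.
labels = {"mul": "Neotrogla curvata"}
result = resolve_label(labels, "de-at")
```

Here `fallback_chain("de-at")` gives `["de-at", "de", "mul", "en"]`, so `result` is `("mul", "Neotrogla curvata")` without any per-language copies of the label.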
Community communication:
- The interested Community needs to be aware of the new code and of the necessity to create guidelines and help pages on how to use it.
- We need to be available for the Community when they create guidelines and to collect feedback.
Original:
This task is to add support for a "mul" language code for labels and aliases. For any benefits of this code to be properly reaped, all language codes should ultimately fall back to "mul"—which I believe would be achieved by adding it as a fallback for the "en" code.
(If it is more desirable, codes for "mul-latn", "mul-cyrl", etc. could be created, in which case e.g. only those codes using the Latin script would fall back to "mul-latn".)