
Cannot search partial Javanese script titles
Open, Medium · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Searching Commons for the full book title "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀ꦭꦩꦶ" works: the PDF is found
  • Searching for a partial title such as "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁꦥꦿꦗꦚ꧀ꦗꦶꦪꦤ꧀", "ꦥꦼꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁ", "ꦥꦼꦥꦼꦛꦶꦏ꧀", or any other part of the full title does not work

What happens?:

  • First identified 10 years ago in T46350, which was marked "won't fix" at the time because the migration from Lucene to CirrusSearch was underway.
  • I identified a problem with the scriptio continua nature of Javanese script (there is no word delimiter).
  • The CirrusSearch ticket T58505 was closed as resolved.

What should have happened instead?:
Search on Commons (and on other wikis) should be able to find the file from any part of the full title.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

A bit background on the way the script is written:

  • Scriptio continua is a hassle to display on the web because there is no obvious line-break point. In projects such as Wikisource, if it is not handled properly, it breaks the page view of transcribed documents (the text becomes too wide).
  • The way we handle the problem, including in certain keyboards, is to automatically insert a ZWS (zero-width space) after certain characters (e.g. the comma- and period-like marks; I can give you the full list); a sketch of this insertion appears after this list. That way line breaking still works, except in the very rare cases where none of those characters occur (so no ZWS is auto-inserted). [Note that the ZWS is not always equivalent to a Latin space.]
  • AFAIK ZWS is not supported in page titles (e.g. when I upload books whose Javanese-script titles contain a ZWS), so none of the titles on jv wiki projects contain ZWS, and thus CirrusSearch has no way of knowing where the word boundaries are.
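
To make the mechanism concrete, here is a minimal Python sketch of that kind of auto-insertion, assuming for illustration that the trigger characters are the pada lingsa and pada lungsi marks (the actual list is the one offered above):

```python
# Sketch: auto-insert a ZWS (U+200B) after "breakable" characters so that
# browsers have somewhere to wrap lines in scriptio continua text.
# The trigger set here is hypothetical, for illustration only.
BREAK_AFTER = {
    "\uA9C8",  # ꧈ JAVANESE PADA LINGSA (comma-like)
    "\uA9C9",  # ꧉ JAVANESE PADA LUNGSI (period-like)
}
ZWS = "\u200B"

def insert_break_opportunities(text: str) -> str:
    out = []
    for ch in text:
        out.append(ch)
        if ch in BREAK_AFTER:
            out.append(ZWS)  # invisible, but gives the renderer a break point
    return "".join(out)
```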

Event Timeline

I have a bit of technical info on ZWSs.

There are pages with ZWSs in their titles, though the titles are redirects. I found one on Commons and two on English Wikipedia (and zero on Javanese Wikipedia). I had to use regexes to search, so results are incomplete.

I was also able to create a new page title with ZWSs:

Note that you usually can't see the ZWSs, but they are there in the URLs (%E2%80%8B).

Cirrus does treat ZWSs as spaces (at least the standard and ICU tokenizers split on them). However, adding them to titles does mean that searches without them would fail. So, searching for zerowidthspacetest on MediaWiki doesn't find my test page (searching with capitals works, but that's because we split on CamelCase in English-language contexts, which wouldn't help with Javanese).
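
For anyone who wants to check that tokenizer behavior themselves, a minimal sketch against Elasticsearch's _analyze API (it assumes a local instance on port 9200 and uses a made-up sample string):

```python
# Sketch: ask Elasticsearch's _analyze API how the standard tokenizer
# handles a zero-width space (assumes a local instance at :9200).
import requests

ZWS = "\u200B"
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "standard", "text": f"zerowidth{ZWS}spacetest"},
)
print([t["token"] for t in resp.json()["tokens"]])
# Per the comment above, this should print ['zerowidth', 'spacetest'],
# while the same text without the ZWS stays a single token.
```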

I don't know what kinds of regularization and normalization happen during file upload, and I can imagine a well-intentioned automated process that removes ZWSs (though I think we agree that it should convert them to spaces).

If there is a list of punctuation marks where ZWSs are automatically inserted by Javanese-savvy systems, we can try to replicate that in the language analyzers in Cirrus. Since the punctuation is specifically Javanese, I would argue that we should enable it either everywhere, or at least on Javanese (obviously relevant) and English (often used as the default on multilingual/multi-script sites like Commons). I could imagine a global_punctuation filter that adds spaces after punctuation in any script where the standard or ICU tokenizer doesn't recognize them as punctuation. (I'm tempted to throw a \p{P} in there and call it a day, but those Unicode regex properties are never quite 100% what you expect them to be.)
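
To sketch what that could look like (the filter name, the character list, and the analyzer wiring below are illustrative, not the actual CirrusSearch configuration), Elasticsearch's built-in mapping char filter can append a plain space after the marks in question:

```python
# Sketch of a hypothetical "global_punctuation" character filter: append a
# plain space after punctuation that the standard/ICU tokenizers don't
# split on. Character list and naming are illustrative only.
analysis_settings = {
    "analysis": {
        "char_filter": {
            "global_punctuation": {
                "type": "mapping",
                "mappings": [
                    "\uA9C8 => \uA9C8 ",  # ꧈ JAVANESE PADA LINGSA
                    "\uA9C9 => \uA9C9 ",  # ꧉ JAVANESE PADA LUNGSI
                ],
            }
        },
        "analyzer": {
            "jv_text_sketch": {
                "type": "custom",
                "char_filter": ["global_punctuation"],
                "tokenizer": "standard",
            }
        },
    }
}
```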

Thank you for the insight.

I had this issue as well when I created my Javanese transliterator: where to insert the ZWS so that line breaking would work (semi-)naturally.
It's not always possible to put the ZWS at the end of a word (as in Latin script), because of the rule that a syllable ending in a virama followed by a syllable starting with a consonant merges the two together. (If whoever reads this is not familiar with how Indic-derived scripts work, think of it like French liaison, only in writing.) So for a phrase like "pethikan saking", the components are written as /pe thi kka nsa king/ ꦥꦼꦛꦶꦏ꧀ꦏꦤ꧀ꦱꦏꦶꦁ; compare the form of /nsa/ when it's separated by a ZWS, ꦤ꧀​ꦱ, which looks totally different. Therefore putting a ZWS after a virama is generally frowned upon.

Luckily, there are at least three consonant endings that do not use the virama, namely -ng, -h, and -r. So for words like "saking", "omah", and "anyar" ("ꦱꦏꦶꦁ", "ꦲꦺꦴꦩꦃ", "ꦲꦚꦂ"), I put a ZWS after each "ꦁ", "ꦃ", and "ꦂ". (Those three are merely syllable endings, not strictly word endings, so adding ZWS to a word like "angkringan" ꦲꦁ​ꦏꦿꦶꦁ​ꦔꦤ꧀​ wouldn't change anything visually, but would technically break it into three tokens: "ang", "kring", and "ngan".)
If those three, plus some others such as "꧈" (the comma-like separator) and "꧉" (the period-like separator), could be added to the "global_punctuation" filter for the tokenizer, that would be great (although their occurrences in titles are quite rare).
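
Roughly, the rule as described above could be sketched like this in Python (illustrative only, not the transliterator's actual code):

```python
ZWS = "\u200B"

# Signs after which a ZWS is safe to insert, per the comments above.
# U+A9C0 ꧀ PANGKON (virama) is deliberately NOT in this set: the following
# consonant merges with it, so breaking there is frowned upon.
SAFE_BREAK_AFTER = {
    "\uA981",  # ꦁ JAVANESE SIGN CECAK   (-ng)
    "\uA983",  # ꦃ JAVANESE SIGN WIGNYAN (-h)
    "\uA982",  # ꦂ JAVANESE SIGN LAYAR   (-r)
    "\uA9C8",  # ꧈ JAVANESE PADA LINGSA
    "\uA9C9",  # ꧉ JAVANESE PADA LUNGSI
}

def add_zws_breaks(text: str) -> str:
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in SAFE_BREAK_AFTER and nxt and nxt != ZWS:
            out.append(ZWS)
    return "".join(out)
```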

MPhamWMF moved this task from needs triage to Language Stuff on the Discovery-Search board.

This has been rolling around in my head for a while and something related came up today, so I wanted to jot down some notes to my future self, or to anyone else who may work on this.

One approach would be a bigram tokenizer that takes long tokens in otherwise unhandled scripts (like Javanese) and breaks them into overlapping bigrams (for bigram this would be bi + ig + gr + ra + am). Bigrams often aren't great on precision, but they do improve recall. They also don't do great on non-exact strings—for example, bigrams won't directly match bigram because the ms bigram is missing; worse, uglyredbigram won't match reduglybigram because the word boundary bigrams (yr + db vs du + yb) don't match. But!—an exact substring will match—so bigram as a query would match both uglyredbigram and reduglybigram as text. Much better than nothing on a wiki whose primary language is not in the otherwise unhandled script.
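
As a toy illustration of that behavior (plain Python, not Cirrus code), overlapping bigrams with AND semantics over the query bigrams reproduce both the miss for bigrams vs. bigram and the exact-substring matches:

```python
def bigrams(s):
    """Overlapping character bigrams: 'bigram' -> ['bi', 'ig', 'gr', 'ra', 'am']."""
    return [s[i:i + 2] for i in range(len(s) - 1)]

def bigram_match(query, text):
    """Naive AND semantics: every query bigram must occur somewhere in the text."""
    text_bigrams = set(bigrams(text))
    return all(b in text_bigrams for b in bigrams(query))

print(bigram_match("bigrams", "bigram"))               # False: the 'ms' bigram is missing
print(bigram_match("uglyredbigram", "reduglybigram"))  # False: boundary bigrams differ
print(bigram_match("bigram", "uglyredbigram"))         # True: exact substrings always match
print(bigram_match("bigram", "reduglybigram"))         # True
```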

In another discussion @Isaac recommended SentencePiece for spaceless languages in general. It's not in Java, so we'd have to wrap it (possible) or port it (unlikely). We'd have to test it on relevant scripts and languages and see how it performs (we could do that offline before wrapping into a plugin), figure out how to incorporate it into our analysis chains (letting it break up English, French, or Swahili text into "subwords" might or might not be beneficial), and we'd have to worry about performance, depending on how complex the processing is—but it's worth looking at if it might tokenize much better than the simpler bigram approach.