Tilde Tilde Tilde not found by search
Open, Low, Public

Description

The article https://en.wikipedia.org/wiki/Tilde_Tilde_Tilde is not found by a search for "~~~" even though it clearly contains three tildes. The issue isn't just that symbol-only searches aren't supported, because the article https://en.wikipedia.org/wiki/Double_tilde oddly is found.

Event Timeline

because the article https://en.wikipedia.org/wiki/Double_tilde oddly is

Because there is a page named "~~".

dr0ptp4kt claimed this task.
dr0ptp4kt subscribed.

It looks like this isn't presently possible, per https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)#Three_or_more_consecutive_tildes . I'll mark this as Resolved for now, but please advise in case this needs to be more thoroughly investigated.

Pppery reopened this task as Open. Edited Aug 5 2024, 4:15 PM
Pppery removed dr0ptp4kt as the assignee of this task.

You are misunderstanding this ticket entirely. That link says the article can't have the title "~~~", which is annoying but expected. This bug is reporting that search can't find the article.

What I expected is for search either to give up entirely and return no results, like it does when I search for "<--->", which would allow me to use the local interface customization at https://en.wikipedia.org/wiki/MediaWiki:Search-nonefound to point to the album, or to actually find things, like it does when I search for "M|A|R|R|S". The fact that it does neither of those, and finds an incorrect page while not finding a correct one, seems like a very obvious bug.

(Finally, please don't close tickets as resolved unless they are actually resolved - if you were right then "declined" or "invalid" would be the status to use)

I personally think "~~~" (and square brackets, and angle brackets, and maybe even number signs) should just be allowed title characters. They would be inconvenient to link to, but being inconvenient is better than being impossible and forcing the community to produce ugly workarounds. But that's neither here nor there.

Search queries prefixed with ~ have a special meaning for Special:Search: the prefix instructs the UI to go to Special:Search rather than to the article page if it exists. That is the reason why ~~ is found when searching ~~~.
This is sadly not the sole reason why the article is not found. The ~ characters are likely ignored in the fulltext search index, so search relies only on titles or redirects to find ~~. Given that there's no way to add titles or redirects containing ~~~, I don't see an easy way to solve this issue, because search needs to pull this data from somewhere.

One possibility could be to use defaultsort but that implies:

  • Setting defaultsort to ~~~; it is currently ~~ for this page (is this possible?)
  • Indexing and querying the defaultsort field in fulltext search (quite some effort)
  • Enabling the use of defaultsort in completion (a capability we added but did not enable by default, T134978)

But overall I think there will always be an ambiguity here. When searching for ~~~, does it mean that I want to see search results for ~~, or that I want to navigate to ~~~?

Does this mean that if I were to create a redirect "~~ (album)" -> Tilde Tilde Tilde, for example, then it would be found by that search? That's hacky but tolerable.

I was unaware of the meaning of "~" as a search operator. Is that documented somewhere?

Does this mean that if I were to create a redirect "~~ (album)" -> Tilde Tilde Tilde, for example, then it would be found by that search? That's hacky but tolerable.

Sadly I doubt this would work. I think the redirect would have to be "~~~ (album)" for completion to work, and even that would not make Special:Search find it. You could experiment a bit on test.wikipedia.org first to see (note that the completion indices are rebuilt daily, which might make testing a bit messy).

I was unaware of the meaning of "~" as a search operator. Is that documented somewhere?

Yes, from https://en.wikipedia.org/wiki/Help:Searching#Search_box

If your search matches a page name the search box may navigate instead of search. To get search results instead, prepend the tilde ~ character to the first word of the title. (Or choose "Search for pages containing" from the suggestions that drop down as you type.)

And from https://www.mediawiki.org/wiki/Help:CirrusSearch#Search_suggestions:

Search suggestions can be skipped and queries will go directly to the search results page. Add a tilde ~ before the query. Example "~Frida Kahlo". The search suggestions will still appear, but hitting the Enter key at any time will take you to the search results page.
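
For illustration, the prefix behaviour described in these two help pages can be sketched roughly as follows. This is a simplified model, not the MediaWiki code: the function `resolve_search_box` and the `existing_titles` set are hypothetical names, and the assumption that exactly one leading ~ is stripped comes from the explanation earlier in this thread.

```python
from __future__ import annotations


def resolve_search_box(query: str, existing_titles: set[str]) -> tuple[str, str]:
    """Return ("go", title) to navigate, or ("search", terms) for a results page."""
    query = query.strip()
    if query.startswith("~"):
        # Assumption: a leading tilde forces the results page and only the
        # first tilde is consumed as the operator.
        return ("search", query[1:])
    if query in existing_titles:
        # Exact title match: navigate instead of searching.
        return ("go", query)
    return ("search", query)


titles = {"~~", "Frida Kahlo"}
print(resolve_search_box("Frida Kahlo", titles))   # navigates to the page
print(resolve_search_box("~Frida Kahlo", titles))  # results page for "Frida Kahlo"
print(resolve_search_box("~~~", titles))           # results page for "~~"
```

Under this model a query of ~~~ is never seen by the search backend as three tildes, which matches the observation that it returns results for ~~.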

You are misunderstanding this ticket entirely. That link says the article can't have the title "~~~", which is annoying but expected. This bug is reporting that search can't find the article.

Thanks for the clarification.

(Finally, please don't close tickets as resolved unless they are actually resolved - if you were right then "declined" or "invalid" would be the status to use)

Thanks for the reminder.

I created the redirect "~~ (album)". It does show up in the completion index for "~~~" (after 5 irrelevant results), but does not show up in the search index. Odd.

I created the redirect "~~ (album)". It does show up in the completion index for "~~~" (after 5 irrelevant results), but does not show up in the search index. Odd.

Thanks for testing. The extra ~ in ~~~ is probably treated as a typo, which explains why the redirect is returned behind other, less relevant results. The fact that it does not show up in Special:Search results is sadly expected. When matching titles or redirects we use either a tokenized version of the content (where the tokenizer will most probably swallow the ~, improperly treating it as meaningless punctuation) or an untokenized version of the title/redirect, for which the search query has to be exactly ~~ (album). In other words, the search token ~~ is never indexed as is.

Why doesn't whatever logic allows "Double tilde" to be found work here then, though? I was assuming that was found via the redirect "~~", which you seem to be implying should have the same problem.

Why doesn't whatever logic allows "Double tilde" to be found work here then, though? I was assuming that was found via the redirect "~~", which you seem to be implying should have the same problem.

Sorry, I might not have been very clear but sadly this is somewhat complicated to explain.

If you look at the indexed content at https://en.wikipedia.org/wiki/Double_tilde?action=cirrusDump you'll see that the redirect array is:

"redirect": [
  {
    "namespace": 0,
    "title": "~~"
  },
  {
    "namespace": 0,
    "title": "Double tilde (disambiguation)"
  }
],

Note the "title": "~~"
For https://en.wikipedia.org/wiki/Tilde_Tilde_Tilde?action=cirrusDump you'll see:

"redirect": [
  {
    "namespace": 0,
    "title": "~~ (album)"
  }
]

Now when searching for ~~ we build a complex search query that we send to elasticsearch and you'll see the section:

"multi_match": {
  "query": "~~",
  "fields": [
    "all_near_match^10",
    "all_near_match.asciifolding^7.5"
  ]
}

Which is the one responsible for finding almost perfect matches to the title or its redirects.

The fields all_near_match and all_near_match.asciifolding contain the untokenized version of the title and redirects. Untokenized here means that we don't split the words on spaces, so the redirect ~~ for Double_tilde is indexed as-is and thus matches the ~~ query. The redirect you created is indexed as ~~ (album) and is thus not found by this section of the search query when searching for ~~.
Other sections of the complex query that could possibly match the title or redirects are handled via:

{
  "bool": {
    "minimum_should_match": 1,
    "should": [
      {
        "match": {
          "all": {
            "query": "\\~\\~",
            "operator": "AND"
          }
        }
      },
      {
        "match": {
          "all.plain": {
            "query": "\\~\\~",
            "operator": "AND"
          }
        }
      }
    ]
  }
}

Which uses the all and all.plain fields. Each of these is a concatenation of all the textual content of the page (including the title and redirects), but both fields are tokenized and, as I said, the ~ characters are swallowed during tokenization.

So the indexed token ~~, which must be present somewhere in the search index for a page to be found when searching ~~, is only there for Double_tilde, in the all_near_match fields, thanks to its redirect ~~. For Tilde_Tilde_Tilde the indexed token is ~~ (album), which does not match ~~, and sadly the token ~~ is not indexed alone anywhere for that page.

I'm not sure that this makes things clearer for you, but this is why it works like that. In other words, for the search query to find ~~~ we have to index it that way somewhere, and sadly there does not seem to be a way to do this. In general, finding sequences of special characters is solved thanks to the title and redirects, but here we're limited by the restriction on page titles.
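
To make the two matching paths concrete, here is a small simulation of the behaviour described above. It is only a sketch under stated assumptions: the `tokenize` function is a stand-in that drops everything except word characters (the real analyzers are more involved), the `pages` data is copied from the two cirrusDump excerpts above, and `near_match` and `tokenized_match` are hypothetical names rather than CirrusSearch functions.

```python
import re

# Title -> redirects, taken from the cirrusDump excerpts quoted above.
pages = {
    "Double tilde": ["~~", "Double tilde (disambiguation)"],
    "Tilde Tilde Tilde": ["~~ (album)"],
}


def tokenize(text: str) -> list[str]:
    # Assumption: the analyzer keeps only word characters, so "~" vanishes.
    return re.findall(r"\w+", text.lower())


def near_match(query: str) -> list[str]:
    # Untokenized path: a whole title or redirect must equal the query.
    return [title for title, redirects in pages.items()
            if query in [title, *redirects]]


def tokenized_match(query: str) -> list[str]:
    # Tokenized path with AND semantics: every query token must be indexed.
    wanted = tokenize(query)
    if not wanted:
        return []  # the tildes were swallowed, nothing is left to match
    hits = []
    for title, redirects in pages.items():
        indexed = set(tokenize(" ".join([title, *redirects])))
        if all(token in indexed for token in wanted):
            hits.append(title)
    return hits


print(near_match("~~"))               # only the page with the bare "~~" redirect
print(near_match("~~ (album)"))       # requires the full redirect text
print(tokenized_match("~~"))          # empty: no tokens survive
print(tokenized_match("tilde album")) # ordinary words still work
```

In this model the only way for a query of ~~ to reach Tilde Tilde Tilde would be a redirect that is exactly ~~, which is already taken by Double tilde.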

Thanks, that makes perfect sense and explains why the counterintuitive results are happening.

(TL;DR: there are two separate kinds of match used in the search query. One looks for an untokenized match, and thus can't extract the base title from "~~ (album)". The other looks for a tokenized match, and thus can't use tildes to find anything, since those get stripped out in tokenization.)

One (mostly unrelated) suggestion, though: it would make more sense IMO to move the tilde stripping upstream, filtering out the tilde when you press Enter in the search box but leaving it as is when you go to Special:Search. That would half-resolve this.


Also, I realized the subject I'm trying to find is an EP, not an album, so I should move the redirect to "~~ (EP)". (That won't change any of the search stuff unless the shortness of the word EP triggers something, which I have no reason to expect.)

dcausse triaged this task as Low priority. Edited Aug 8 2024, 1:50 PM
dcausse moved this task from needs triage to elastic / cirrus on the Discovery-Search board.

~ is definitely a confusing character for search and is heavily overloaded:

  • used to force entering the search results page when used as a prefix
  • treated as punctuation and ignored by many text analysis components
  • used to trigger fuzziness word~
  • used to control the phrase slop in "foo bar"~2
  • used to perform a phrase search on stems "foos bars"~
  • has some restrictions on page titles (impossible to create a page named ~~~)
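
As a rough illustration of how overloaded the character is, the query-syntax roles from the list above can be told apart with a few patterns. This is a hedged sketch, not the CirrusSearch parser: `tilde_roles` is a hypothetical helper and the regular expressions only cover the example shapes given in the list.

```python
import re


def tilde_roles(query: str) -> list[str]:
    """Name the query-syntax roles a tilde appears to play in a query string."""
    roles = []
    if query.startswith("~"):
        roles.append("prefix: force results page")
    if re.search(r'"[^"]+"~\d+', query):
        roles.append("phrase slop")          # "foo bar"~2
    if re.search(r'"[^"]+"~(?!\d)', query):
        roles.append("phrase search on stems")  # "foos bars"~
    if re.search(r"\w~", query):
        roles.append("fuzziness")            # word~
    return roles


for example in ['~Frida Kahlo', 'word~', '"foo bar"~2', '"foos bars"~', '~~~']:
    print(example, "->", tilde_roles(example))
```

Note that ~~~ is classified purely as a prefix operator here, which is the ambiguity discussed earlier in the thread.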

All this makes searching for this kind of page particularly challenging. I think one solution would be to find a way to tell the search index that ~~~ is an important token and must be kept. Since this can't be done via page titles, we'd have to find another place to store this information:

  • use defaultsort but I can't convince it to do what I want using something like {{DEFAULTSORT:<nowiki>~~~</nowiki>}}
    • EDIT: explore the possibility of using HTML entities like {{DEFAULTSORT:&#x7E;&#x7E;&#x7E;}} (this does not entirely work for the search use case: the characters are not resolved when search pulls them from the parser output)
  • use displaytitle, but same problem with {{DISPLAYTITLE:<nowiki>~~~</nowiki>}}
  • use a dedicated magic word

Tentatively marking as Low, this seems quite a lot of work to get right.