User Details
- User Since
- Jun 9 2015, 9:03 AM (484 w, 6 d)
- Availability
- Available
- IRC Nick
- dcausse
- LDAP User
- DCausse
- MediaWiki User
- DCausse (WMF)
Yesterday
Fri, Sep 20
Wed, Sep 18
Was looking at a dashboard that already filtered non-mediasearch poolcounter errors
Closed wikis are still being served and search is available there. CirrusSearch is still scanning some documents to do some automatic cleanup, which I suspect is the source of these API requests.
Tue, Sep 17
In the RFC I read:
In some cases, the URI is specified as an IP address rather than a hostname. In this case, the iPAddress subjectAltName must be present in the certificate and must exactly match the IP in the URI.
Thanks for looking into this!
Now failing with javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching kafka-main-eqiad.external-services.svc.cluster.local found. Tried use_all_dns_ips and now it seems that it wants to validate the hostname passed via bootstrap.servers...
I'll investigate more to see if there are more options. If we fail to work around this, do you think it would be acceptable to add kafka-main-eqiad.external-services.svc.cluster.local as a valid alternative name in the cert?
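To make the failure mode concrete, here is a minimal sketch of the kind of subjectAltName check a TLS client performs. It is not the Java/Kafka implementation; the function name and the sample hostnames are hypothetical, and the (kind, value) pairs follow the shape Python's ssl module returns under 'subjectAltName' in getpeercert().

```python
import ipaddress
from typing import Iterable, Tuple


def _is_ip(value: str) -> bool:
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def host_matches_san(host: str, sans: Iterable[Tuple[str, str]]) -> bool:
    """Simplified check of a connect host against subjectAltName entries.

    An IP host must exactly match an 'IP Address' entry; a DNS host must
    match a 'DNS' entry (only a leftmost-label wildcard is supported here).
    """
    sans = list(sans)
    if _is_ip(host):
        wanted = ipaddress.ip_address(host)
        return any(
            kind == "IP Address" and _is_ip(value)
            and ipaddress.ip_address(value) == wanted
            for kind, value in sans
        )
    host = host.lower().rstrip(".")
    for kind, value in sans:
        if kind != "DNS":
            continue
        pattern = value.lower().rstrip(".")
        if pattern == host:
            return True
        # "*.example.org" matches "a.example.org" but not "example.org"
        if pattern.startswith("*.") and "." in host:
            if host.split(".", 1)[1] == pattern[2:]:
                return True
    return False


# Hypothetical certificate contents, for illustration only.
cert_sans = [("DNS", "broker1.example.org"), ("IP Address", "10.0.0.5")]
print(host_matches_san("broker1.example.org", cert_sans))
print(host_matches_san(
    "kafka-main-eqiad.external-services.svc.cluster.local", cert_sans))
```

In this sketch the service alias fails the check because it appears in neither a DNS nor an IP entry, which is why adding it as an alternative name to the cert would be one way out.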
Not sure that check_categories.py --ping is necessary; imo it could be dropped, since it should already be covered by some other sensors.
We could perhaps adapt modules/query_service/files/monitor/prometheus-blazegraph-exporter.py to take care of running this query, possibly re-using the same gauge blazegraph_lastupdated but adapting the query depending on the namespace it's running against.
Mon, Sep 16
Unsure if related, but we recently found that some MW requests might last for several hours (T374662), so depending on how the event is created it's possible that late events are created by MW:
- Event is created by setting meta.dt (https://gerrit.wikimedia.org/g/mediawiki/extensions/EventBus/+/79790fdff075c7fa0e4a359e6f359cb35ddfac36/includes/EventFactory.php#354)
- A DeferredUpdate is used to delay the push to EventGate (see for instance revision-create: https://gerrit.wikimedia.org/g/mediawiki/extensions/EventBus/+/79790fdff075c7fa0e4a359e6f359cb35ddfac36/includes/EventBusHooks.php#407)
I believe that this might possibly lead to late events being sent by MW.
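A small sketch of the timing issue described above: the event timestamp is fixed when the event is built, but the send happens later, so a long-running request shows up as lateness downstream. The function names and the 5-minute threshold are hypothetical and not taken from EventBus or any consumer.

```python
from datetime import datetime, timedelta, timezone


def lateness(event_dt: datetime, received_dt: datetime) -> timedelta:
    """Gap between the timestamp stamped on the event and its arrival."""
    return received_dt - event_dt


def is_late(event_dt: datetime, received_dt: datetime,
            allowed: timedelta = timedelta(minutes=5)) -> bool:
    """True when the event arrives after the (assumed) allowed lateness."""
    return lateness(event_dt, received_dt) > allowed


# meta.dt is set when the event object is created ...
created = datetime(2024, 9, 16, 8, 0, tzinfo=timezone.utc)
# ... but the deferred push only runs once the request finishes,
# here assumed to be three hours later.
pushed = created + timedelta(hours=3)

print(lateness(created, pushed))
print(is_late(created, pushed))
```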
Fri, Sep 13
The current hypothesis is that the problem happens right after the node comes online and is advertised by the cluster as usable, but that node is not yet allowed by the egress rule. The kafka-client is then too confused and the job enters a crash loop. Other jobs seem to be more tolerant to this setup. To be precise, the search job also suffered some blips during the process, impacting the search update lag, but I think this is totally acceptable.
Thu, Sep 12
Wed, Sep 11
@bking everything looks good, thanks!
Tue, Sep 10
I think that all patches have been merged, most of them deployed except one which should get deployed via the train tomorrow for group1 (re-enable fine-tuning per language).
Mon, Sep 9
Untagging WDQS as I believe there are no issues with WDQS. I think the query does not work because https://www.wikidata.org/wiki/Q126787117 was created without a trailing slash for the P856 property.
Thu, Sep 5
Wed, Sep 4
The data seems to be indexed so it might be trivial to implement these keywords. Moving to needs triage to raise visibility.
Seems to be fixed, tentatively closing
Unfortunately wdqs2021 is still consuming from the wrong topic after the transfer.
Looking closer it appears that the service definition for the wdqs-updater is duplicated in two locations:
- /etc/systemd/system/wdqs-updater.service containing the wrong topic codfw.rdf-streaming-updater.mutation
- /lib/systemd/system/wdqs-updater.service with the right topic codfw.rdf-streaming-updater.mutation-main
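For context, systemd prefers a unit file under /etc/systemd/system over one of the same name under /lib/systemd/system, which would explain why the wrong topic wins. The sketch below models that precedence with a simplified directory list; the helper name is hypothetical and the real search path has more entries.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Simplified precedence order, highest first (the real unit search path
# contains more directories than this).
UNIT_DIRS = [
    "/etc/systemd/system",
    "/run/systemd/system",
    "/usr/lib/systemd/system",
    "/lib/systemd/system",
]


def effective_unit(candidates: Iterable[str]) -> str:
    """Pick the unit file that takes precedence among same-named candidates."""
    def rank(path: str) -> int:
        parent = str(PurePosixPath(path).parent)
        # Unknown directories sort last.
        return UNIT_DIRS.index(parent) if parent in UNIT_DIRS else len(UNIT_DIRS)
    return min(candidates, key=rank)


print(effective_unit([
    "/lib/systemd/system/wdqs-updater.service",
    "/etc/systemd/system/wdqs-updater.service",
]))
```

On a live host, `systemctl cat wdqs-updater.service` prints the file that is actually in use.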
Tue, Sep 3
My understanding is that https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/1064792 is going to fix both this ticket and T371929.
From this ticket description it is not entirely clear if the ask is also to index the full P18 statements or just the flag that indicates the presence of this property. For the former I'm with Erik: this possibly adds a lot of new tokens that might be particularly hard to search for (untokenized URLs) and thus probably not very useful.
Should be resolved with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/1064792
Mon, Sep 2
Interesting, I thought as well that the 1k limit would apply to nested bool queries (which is probably one reason it was set to 256 initially). It means that we can probably safely bump the limit to 1k without even nesting bool queries. I'm not clear why it has such an impact when getting past 2.5k, and I have no clue if a terms query would perform significantly better. It's less costly for sure since there's no need to analyze & rewrite the query; we could probably test this as well to see the impact?
So perhaps we can at least bump to 1k right now with a simple config change and ponder what to do next based on some testing of the terms query? If the terms query does not show a significant gain compared to nested bool queries we might just use this?
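For reference, the two query shapes being compared could be built as follows. This is only a sketch of the Elasticsearch query DSL as plain dicts; the field name, the helper names and the chunk size are hypothetical and do not reflect how CirrusSearch builds its queries.

```python
from typing import Any, Dict, List, Sequence

Query = Dict[str, Any]


def bool_should(field: str, values: Sequence[Any]) -> Query:
    """One bool query with one term clause per value."""
    return {"bool": {"should": [{"term": {field: v}} for v in values]}}


def nested_bool_should(field: str, values: Sequence[Any],
                       max_clauses: int) -> Query:
    """Split values into chunks so no single bool exceeds max_clauses."""
    chunks: List[Sequence[Any]] = [
        values[i:i + max_clauses] for i in range(0, len(values), max_clauses)
    ]
    if len(chunks) <= 1:
        return bool_should(field, values)
    return {"bool": {"should": [bool_should(field, c) for c in chunks]}}


def terms_query(field: str, values: Sequence[Any]) -> Query:
    """A single terms query: values are matched as-is, without analysis."""
    return {"terms": {field: list(values)}}


ids = [f"Q{i}" for i in range(2500)]
nested = nested_bool_should("some_field", ids, max_clauses=1024)
print(len(nested["bool"]["should"]))  # number of inner bool queries
print(len(terms_query("some_field", ids)["terms"]["some_field"]))
```

Timing both shapes against the same index would be one way to run the comparison suggested above.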
Aug 9 2024
I was able to trigger the backfill for wikidatawiki_content by running another re-index for an unrelated wiki on both eqiad and codfw. It seems to me that there's an early stop when all the re-indexes are done; it should perhaps double check that no remaining backfills are needed before quitting?
Current status:
- wikidata has been re-indexed
- haslabel:mul should work
- search autocomplete should use the mul label when it's part of the fallback chain
- full text search results might benefit from better recall after merging https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060433 (this one might or might not need commons to be re-indexed first)
- https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060433 appears to be not-needed in the end, I don't see where we use these manually tuned profiles
This should work now, we had to re-index wikidata
Aug 8 2024
~ is definitely a confusing character for search and is heavily overloaded:
- used to force entering the search results page when used as a prefix
- considered as a punctuation and ignored by many text analysis components
- used to trigger fuzziness word~
- used to control the phrase slop in "foo bar"~2
- used to perform a phrase search on stems "foos bars"~
- has some restrictions on page titles (impossible to create a page named ~~~)
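To illustrate how overloaded the character is, here is a toy classifier for the query-syntax roles listed above. It is a hypothetical sketch built on simple regular expressions, not the CirrusSearch parser, and the role names are made up.

```python
import re
from typing import List


def tilde_roles(query: str) -> List[str]:
    """Return the roles ~ appears to play in a query (toy heuristics)."""
    roles = []
    if query.startswith("~"):
        roles.append("force-search-page")   # ~foo
    if re.search(r'"[^"]+"~\d+', query):
        roles.append("phrase-slop")         # "foo bar"~2
    if re.search(r'"[^"]+"~(?!\d)', query):
        roles.append("stemmed-phrase")      # "foos bars"~
    if re.search(r"\w~", query):
        roles.append("fuzzy")               # word~
    return roles


for q in ["~foo", "word~", '"foo bar"~2', '"foos bars"~', "~~~"]:
    print(repr(q), tilde_roles(q))
```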
Aug 7 2024
Regarding fallbacks, WikibaseCirrusSearch is relying on \Wikibase\Lib\TermLanguageFallbackChain::getFetchLanguageCodes. The order in which these languages are returned is quite important as well, as the weight attributed to such matches is inversely proportional to their position in this array.
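As a rough illustration of why the order matters, the sketch below assigns each language a weight of base / (position + 1). Both the formula and the example chain are assumptions made for illustration; the actual weighting used by WikibaseCirrusSearch may differ.

```python
from typing import Dict, Sequence


def fallback_weights(codes: Sequence[str], base: float = 1.0) -> Dict[str, float]:
    """Assumed weighting: earlier languages in the chain score higher."""
    return {code: base / (pos + 1) for pos, code in enumerate(codes)}


# Hypothetical fallback chain, in the order it would be returned.
chain = ["de-at", "de", "mul", "en"]
for code, weight in fallback_weights(chain).items():
    print(f"{code}: {weight:.2f}")
```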
The procedure should be:
- merge & deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060430
- reindex wikibase wikis: wikidata, testwikidata, commonswiki (at this point search should be already better in Special:Search and T371352 should be fixed)
- merge & deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060433 (which should slightly increase recall in some cases when using Special:Search)
- merge & deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1060449 (which should start using the mul fallback for manually tuned profiles: en, fr, de, es)
The mul labels and descriptions (can we have mul descriptions?) are currently not indexed, which explains to some degree why search is behaving poorly on these items. We'll index those and see how it performs; tuning search might come as a separate step.
If I'm not mistaken mul is considered a fallback for all languages so it should always be queried.
Aug 6 2024
Dashboard is up at https://grafana.wikimedia.org/d/8xDerelVz/search-update-lag-slo?orgId=1
A patch to grafana-grizzly is uploaded at https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/1060150 (but not sure how to move forward with it)
Unchecked the prerequisite regarding kafka topics: the split graph hosts are currently configured to consume from the full graph topic. The reload should not start (and should probably be stopped/restarted on wdqs1021) before https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060049 is merged and applied to the corresponding hosts.
As part of the work to expose two new endpoints serving the split graph (T364363) we are configuring new wdqs hosts to run blazegraph.
The way maxlag is propagated from WDQS to mediawiki is by measuring the most lagged wdqs host that is online.
In order to know which wdqs hosts are online we measure the number of queries that each one serves.
We also run some "monitoring queries" internally to measure the health of the system and in order for these monitoring queries to not interfere with this system we flag such internal user-agents so that they're ignored.
This is where we made a mistake, meaning that a monitoring user-agent was not properly flagged as internal and caused a new host (not yet fully loaded) to be considered online and thus taken into account by maxlag.
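The mechanism and the mistake can be sketched as follows. This is a simplified model with hypothetical host names, user-agents and numbers, not the actual monitoring code.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class Host:
    name: str
    lag_seconds: float
    # Number of queries served, keyed by user-agent.
    queries_by_agent: Dict[str, int] = field(default_factory=dict)


def external_queries(host: Host, internal_agents: Set[str]) -> int:
    """Queries served, ignoring user-agents flagged as internal."""
    return sum(n for agent, n in host.queries_by_agent.items()
               if agent not in internal_agents)


def max_lag(hosts: Iterable[Host], internal_agents: Set[str]) -> Optional[float]:
    """Lag of the most lagged host that is considered online."""
    online = [h for h in hosts if external_queries(h, internal_agents) > 0]
    return max((h.lag_seconds for h in online), default=None)


hosts = [
    Host("wdqs-serving", 30.0, {"browser": 100, "monitor-a": 5}),
    # New host, still loading: only monitoring traffic reaches it.
    Host("wdqs-new", 90000.0, {"monitor-b": 5}),
]

# monitor-b is not flagged as internal: the new host counts as online.
print(max_lag(hosts, {"monitor-a"}))
# With both monitoring agents flagged, the new host is ignored.
print(max_lag(hosts, {"monitor-a", "monitor-b"}))
```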
Aug 5 2024
Search queries prefixed with ~ have a special meaning for Special:Search: the prefix instructs the UI to go to Special:Search rather than the article page if it exists. That's the reason why ~~ is found when searching ~~~.
This is sadly not the sole reason why it's not found: ~ characters are likely ignored in the fulltext search index, so search can only rely on titles or redirects to find ~~. Given that there's no way to add such titles or redirects with ~~~, I don't see an easy way to solve this issue because search needs to pull this data from somewhere.