@aaron, @ori thank you for your work on the emergency parsercache key implementation. I want to track the pending tasks related to it here, starting with some discussion:
- Should we slowly change the keys to something more reasonable, e.g., the names of the shards (pc1, pc2, pc3), changing one key per server pair at a time until all old keys have expired? Will changing one key at a time affect the sharding function for the others, too?
- As a longer-term question, how should parsercache be handled for active-active? Is that something the parsercache architecture should know about, or should we resolve it at the MediaWiki "routing" layer?
- The sharding method should be as stable as reasonable when adding or removing servers: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/284023
- We need a key-aware method of handling maintenance and failover that 1) minimizes errors sent to the upper layer, 2) maximizes the chances of getting a hit, as much as reasonable, and 3) requires little human intervention. For example, when a host is down, that state should be detected automatically, and another host should start being used automatically (or the faulty one should stop being used) and be as pre-warmed as reasonable (1/3 of the keys of the other hosts?). At the moment, there is a spare host that has to be switched in manually on failure, and pre-warmed on maintenance. This may need a different key sharding strategy?
- One server performing badly results in the rest of the servers experiencing a big increase in idle connections. More context at T247788#5976651
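
To make the stability and failover questions above more concrete, here is a minimal sketch of one possible strategy, rendezvous (highest-random-weight) hashing. This is not a description of what MediaWiki currently does, and it is not necessarily what the linked Gerrit change implements. The function names, the key strings, and the `down` parameter are all made up for illustration. Under this scheme, a key's placement depends only on the per-shard scores, so removing or renaming one shard label never moves keys between the remaining shards. Whether the same holds for the current sharding function depends on how it hashes the labels.

```python
import hashlib


def score(shard: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (shard label, cache key) pair."""
    digest = hashlib.sha256(f"{shard}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_shard(key, shards, down=frozenset()):
    """Return the label of the live shard with the highest score for this key.

    `down` is a hypothetical set of labels that some health check has marked
    as unavailable; keys owned by those shards fall through to their
    next-best live shard, and all other keys stay where they are.
    """
    candidates = [s for s in shards if s not in down]
    if not candidates:
        raise RuntimeError("no parsercache shard available")
    return max(candidates, key=lambda s: score(s, key))


def moved_keys(keys, before, after):
    """List the keys whose shard differs between two shard configurations."""
    return [k for k in keys if pick_shard(k, before) != pick_shard(k, after)]


if __name__ == "__main__":
    keys = [f"example-key-{i}" for i in range(3000)]  # made-up key names
    shards = ["pc1", "pc2", "pc3"]

    # Adding a fourth shard: only the keys that the new shard wins are moved.
    moved = moved_keys(keys, shards, shards + ["pc4"])
    print(f"adding pc4 moves {len(moved)} of {len(keys)} keys")

    # One shard marked as down: only the keys it owned are redistributed.
    on_pc2 = [k for k in keys if pick_shard(k, shards) == "pc2"]
    print(f"pc2 down redistributes {len(on_pc2)} of {len(keys)} keys")
```

With a scheme like this, a failed host's keys are spread over the surviving hosts instead of all landing on a single cold spare, and changing one label at a time only affects keys that were on the old label or that the new label wins. The trade-off is that every lookup computes one hash per shard, which should be negligible for a handful of parsercache hosts.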