When updating slow special pages from terbium, such as Listredirects, certain rows of the querycache table are deleted and then inserted.
Under normal circumstances, those updates do not create a problem. However, I believe a combination of factors can make them cause lag on the slaves:
- Special pages of s3 are updated, which means hundreds of updates regardless of wiki size, multiplied by the hundreds of wikis on that shard. No other shard has more than 800 wikis.
- WAN latency is higher than same-datacenter replication
- Other writes are happening at the same time, such as updating pagelinks or wbc_entity_usage
- ROW-based replication is used
- The replication topology is not very flat (there are now 4 tiers, which is not desirable)
Given that special page updates are not time-sensitive, I would like to:
a) Introduce pauses between wiki updates or, better, check that those updates have been applied to >50% of the slaves (including remote slaves) before continuing
b) Make the updates non-transactional, splitting the filling of those tables into several smaller transactions
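
A rough sketch of how (a) and (b) could fit together, in Python for illustration only; it is not the actual maintenance script. All names here (`fill_in_batches`, `write_batch`, `get_replica_lags`) and the default batch size and lag threshold are assumptions made up for the example. The caller would supply `write_batch` (commits one small transaction) and `get_replica_lags` (returns the current lag in seconds of each slave, local and remote).

```python
import time
from typing import Callable, Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def majority_caught_up(lags: Iterable[float], max_lag: float) -> bool:
    """Return True when more than 50% of replicas report lag <= max_lag."""
    lags = list(lags)
    if not lags:
        # No replicas to wait for (assumption: treat as caught up).
        return True
    caught_up = sum(1 for lag in lags if lag <= max_lag)
    return caught_up * 2 > len(lags)


def fill_in_batches(
    rows: Sequence[T],
    write_batch: Callable[[Sequence[T]], None],
    get_replica_lags: Callable[[], Iterable[float]],
    batch_size: int = 100,      # hypothetical default
    max_lag: float = 1.0,       # seconds, hypothetical default
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write rows in small transactions, pausing until a majority of
    replicas have caught up before sending the next batch.

    Returns the number of batches written.
    """
    batches = 0
    for batch in chunked(rows, batch_size):
        write_batch(batch)  # one small transaction instead of one big one
        batches += 1
        while not majority_caught_up(get_replica_lags(), max_lag):
            sleep(poll_interval)
    return batches
```

The same wait could also be placed only between wikis rather than between batches, which would be closer to the plain "pause between wiki updates" variant of (a).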