FileBackendMultiWrite is a MediaWiki file backend that can write to multiple destinations; we currently use it in production. It makes a few assumptions, namely:
- There is one "master" datacenter
- Objects get unidirectionally synced from the master to the replica
- We write/delete from both locations
- We only read from the master datacenter for simplicity
The "master" datacenter is determined by configuration; we're currently following MediaWiki's active datacenter.
When it comes to thumbnail handling, this is problematic:
- Thumbnails are served to the public directly from the object storage, with no interaction with MediaWiki. If the thumbnail at the desired size is not present, the object storage calls thumbor via its not-found handler. This happens locally in the datacenter the CDN directed the request to, which means we can have different thumbnails generated in the two datacenters for the same image
- Thumbnails are pre-generated via a job at certain given sizes, but only in the nearest datacenter where swift is pooled in discovery DNS (the request will be sent to the CDN, and geolocated to the nearest available datacenter to where we're calling from)
- When someone reuploads an image, Filerepo::LocalFile::getThumbnails is called to find which thumbnails are present and should be purged. This operation calls FileBackendMultiWrite::getFileList which lists the thumbnails present in the current "master" datacenter. Then a delete command to purge the thumbnails is sent to both datacenters
Making things worse, we used to have some form of syncing of thumbnails from the master DC to the other one; this was stopped in the past as it was deemed useless in our new multi-dc setup.
This means two big inconsistencies are created:
- If the list of thumbnails in the two datacenters differs, some thumbnails at the old version of the image will be left behind and not invalidated
- If/when we switch datacenter for swift and/or mediawiki at the DNS discovery layer, we might end up not cleaning up even the pregenerated thumbnails, giving a very confusing UX for the uploader (see the parent task).
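The first inconsistency can be reproduced with a toy model. This is only an illustrative sketch: the datacenter names, the file name, and the function `purge_from_master_listing` are hypothetical, and the dictionaries stand in for the swift containers; it is not MediaWiki code.

```python
# Toy model: each "datacenter" maps an image name to the set of thumbnail
# sizes currently stored for it. All names here are hypothetical.

def purge_from_master_listing(backends, master, image):
    """Model the current behaviour: list thumbnails in the master only,
    then send the delete for that list to every datacenter."""
    to_purge = set(backends[master].get(image, set()))
    for thumbs_by_image in backends.values():
        # Deleting a thumbnail that does not exist is a no-op.
        thumbs_by_image.get(image, set()).difference_update(to_purge)
    return to_purge


backends = {
    # dc_a is the current "master"; 120px was generated only here.
    "dc_a": {"Example.jpg": {"120px", "320px"}},
    # 800px was generated on demand only in dc_b.
    "dc_b": {"Example.jpg": {"320px", "800px"}},
}

purged = purge_from_master_listing(backends, "dc_a", "Example.jpg")
left_behind = {dc: sorted(files["Example.jpg"]) for dc, files in backends.items()}
print(left_behind)  # {'dc_a': [], 'dc_b': ['800px']}
```

The 800px thumbnail in dc_b never appears in the master's listing, so it survives the purge and keeps showing the old version of the image.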
What would fix the issue
- FileBackendMultiWrite needs to be able to get separate lists of thumbnail files for each backend, and I suspect the only way to do that cleanly is to add a new method specifically to list files by backend. Anything short of that would not consistently purge thumbnails from all backends.
- ThumbnailRenderJob needs to be able to hit the pregeneration in all backends, not just the one closest to the job execution
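A rough sketch of what the two fixes above amount to, using the same kind of toy model (dictionaries standing in for the swift containers). The function names `list_thumbnails_by_backend`, `purge_all_backends` and `pregenerate_everywhere` are hypothetical and do not correspond to existing MediaWiki methods.

```python
# Toy model: each "datacenter" maps an image name to its set of thumbnail sizes.
# All function and datacenter names are hypothetical.

def list_thumbnails_by_backend(backends, image):
    """Return one listing per backend instead of the master's listing only."""
    return {dc: set(files.get(image, set())) for dc, files in backends.items()}


def purge_all_backends(backends, image):
    """Purge from each backend exactly what that backend reports having."""
    per_backend = list_thumbnails_by_backend(backends, image)
    for dc, thumbs in per_backend.items():
        backends[dc].get(image, set()).difference_update(thumbs)
    return per_backend


def pregenerate_everywhere(backends, image, sizes):
    """Model pregeneration hitting every backend, not just the nearest one."""
    for files in backends.values():
        files.setdefault(image, set()).update(sizes)


backends = {
    "dc_a": {"Example.jpg": {"120px", "320px"}},
    "dc_b": {"Example.jpg": {"320px", "800px"}},
}

purge_all_backends(backends, "Example.jpg")
after_purge = {dc: sorted(files["Example.jpg"]) for dc, files in backends.items()}

pregenerate_everywhere(backends, "Example.jpg", {"320px", "640px"})
after_pregen = {dc: sorted(files["Example.jpg"]) for dc, files in backends.items()}
```

With per-backend listings the purge no longer depends on which datacenter is "master", and pregenerating in every backend keeps the set of standard sizes identical across datacenters.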
What can we do to reduce the impact of the issue while the above is being fixed
- Add an additional variable in etcd to tell mediawiki where the swift "master" is (meaning: the direction in which originals are replicated)
- We can ensure that swift, unless we're in an emergency, is always pooled in DNS in the datacenter where mediawiki is master, and that this datacenter is marked as "master" accordingly.
- We can maybe turn thumbnail syncing back on; its direction will have to be carefully determined.