
Decide how to improve parsercache replication, sharding and HA
Closed, Resolved · Public

Description

@aaron, @ori, thank you for your work on the emergency parsercache key implementation. I want to track here the pending tasks related to it, starting with some discussion:

  • Should we slowly change the keys to something more reasonable (e.g. the names of the shards: pc1, pc2, pc3), changing one key per server pair until all old keys have expired? Will changing one key at a time affect the sharding function for the others, too?
  • As a longer-term question, how should parsercache be handled for active-active? Is that something the parsercache architecture should know about, or should we resolve it at the mediawiki "routing" layer?
  • The sharding method should be as stable as reasonable when adding or removing servers: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/284023
  • We need a key-aware method of handling maintenance and failover that 1) minimizes errors sent to the upper layer, 2) maximizes the chances of getting a hit, as much as reasonable, and 3) requires little human intervention, so that, for example, when a host goes down this state is detected automatically, another host starts being used automatically (or the faulty one stops being used) and is as pre-warmed as reasonable (1/3 of the keys of the other hosts?). At the moment there is a spare host that has to be switched to manually on failure, and pre-warmed on maintenance. This may need a different key sharding strategy?
  • One server performing badly results in the rest of the servers experiencing a big increase in idle connections. More context at T247788#5976651

Event Timeline


One option would be to shift sharding to the same model as external storage: shard by a static key value and separate the contents into different tables, so that they can be moved anywhere while keeping the subset of keys per table fixed.

'pc1' => 'server1',
'pc2' => 'server1',
'pc3' => 'server3',

{pc1-001, pc1-002, pc2-001, pc2-002, ...}

That way there is something constant when a server is down/under maintenance, which can later be cleaned up/migrated/replicated.
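
For illustration, a minimal sketch of this idea (the map and function names below are invented here, not anything that exists in MediaWiki): each key hashes to a fixed shard and table, and moving a shard during maintenance only means editing the shard => server map.

$shardMap = [
	'pc1' => 'server1',
	'pc2' => 'server1', // pc2 temporarily co-hosted on server1
	'pc3' => 'server3',
];

// A key always lands in the same shard and table (e.g. "pc2-013"),
// no matter which server currently holds that shard.
function pcShardAndTable( string $key, int $tablesPerShard = 256 ): array {
	$n = hexdec( substr( md5( $key ), 0, 8 ) );
	$shard = 'pc' . ( ( $n % 3 ) + 1 );                          // pc1..pc3, fixed per key
	$table = sprintf( '%s-%03d', $shard, $n % $tablesPerShard ); // e.g. "pc2-013"
	return [ $shard, $table ];
}

[ $shard, $table ] = pcShardAndTable( 'enwiki:pcache:idhash:12345-0!canonical' );
$server = $shardMap[$shard]; // moving a shard only touches the map above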

Another option worth considering would be to store the parser cache in Cassandra, and leverage its multi-DC and sharding functionality.

Another option worth considering would be to store the parser cache in Cassandra, and leverage its multi-DC and sharding functionality.

I was tempted to suggest this already per NIH.

I would consider not just Cassandra (which has had reliability issues), but yes, something outside of MySQL.

chasemp triaged this task as Medium priority. May 5 2016, 8:45 PM

All of this is good, but realistically I would be happy with a couple of fixes for now:

  1. Server connection failures should not be fatal, and servers should be "depooled" the same way they are when they have lag
  2. Query failures on parsercache, both reads and writes, should not be fatal (read only, killed queries, out-of-band changes, any other issue)
  3. Move parsercache replication to statement-based, set all servers to read-write mode, accept eventual consistency (loose consistency) and turn misses into INSERT IGNORE/REPLACE (see the sketch below)
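
A rough sketch of what point 3 amounts to (illustrative only; this is not the actual SqlBagOStuff code, though the column names follow MediaWiki's objectcache table, and the function name is invented):

// Writing with REPLACE means a row already written by the other datacenter
// (or replayed via statement-based replication) is silently overwritten
// instead of raising a duplicate-key error: last-write-wins semantics.
function writeParserCacheRow( PDO $dbw, string $table, string $key, string $value, string $exptime ): void {
	$sql = "REPLACE INTO $table (keyname, value, exptime) VALUES (:k, :v, :e)";
	$stmt = $dbw->prepare( $sql );
	$stmt->execute( [ ':k' => $key, ':v' => $value, ':e' => $exptime ] );
}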

All of this holds no matter whether we replace the storage backend. I can start working on this because "it should be just a couple of line fixes", but we should agree first that it is a good model (there is a danger in transparent failure). Some of it may already be true, but then we need to tune the logging. Also, this is tightly associated with T132317.

Read/write exceptions are already supposed to be caught in handleReadError()/handleWriteError(). Are there backtraces of exceptions that incorrectly bubbled up anyway?

Writes generate errors at a rate of 1,000-10,000 per second; I am concerned about the logging infrastructure, specifically this:

https://logstash.wikimedia.org/#dashboard/temp/AVSZZJTzctfB5Pje7jwt (the time since one datacenter was marked as read only and traffic was directed to the other datacenter)

I see that REPLACE is already run, which means I could set both sites replicating AND in read-write mode, either by setting STATEMENT-based replication or by setting the slaves to IDEMPOTENT exec mode (so they will not be guaranteed to have exactly the same contents). OK with that? But I am still worried about a SPOF for each individual key, especially given that those servers do not have redundant disks.

Change 288362 had a related patch set uploaded (by Jcrespo):
Change binlog format for parsercaches to STATEMENT

https://gerrit.wikimedia.org/r/288362

Writes generate errors at a rate of 1,000-10,000 per second; I am concerned about the logging infrastructure, specifically this:

https://logstash.wikimedia.org/#dashboard/temp/AVSZZJTzctfB5Pje7jwt (the time since one datacenter was marked as read only and traffic was directed to the other datacenter)

I see that REPLACE is already run, which means I could set both sites replicating AND in read-write mode, either by setting STATEMENT-based replication or by setting the slaves to IDEMPOTENT exec mode (so they will not be guaranteed to have exactly the same contents). OK with that? But I am still worried about a SPOF for each individual key, especially given that those servers do not have redundant disks.

Sounds OK to me, since MediaWiki validates cache validity against the main DBs on cache get(), so conflicts getting ignored don't matter since no strict chronology model is needed. Either r/w method seems fine to me.

Change 288362 merged by Jcrespo:
Change binlog format for parsercaches to STATEMENT

https://gerrit.wikimedia.org/r/288362

I've just applied the STATEMENT/REPLACE option. This basically avoids all the issues around the passive datacenter being in read-only mode. Now both disk-based caches are in read-write, which means the passive datacenter can overwrite the active one's cache (misses get inserted on the local datacenter). As aaron mentions, I also believe it will not create issues, and it will both simplify the failover process and allow multi-dc.

I will document the current model with T133337.

The only thing I have pending here is the sharding key, but that may take some time and we may want to solve it "properly" (with a different data storage). I would like to run a "what would happen if one server stopped responding" test. Even if it were handled "automatically", I am not sure it would be very transparent, because of the timeout plus a large amount of misses (aka a SPOF). These being caches, I decided not to have disk redundancy, and that is an issue for availability.

@jcrespo What's the current status of this task? Is this intended to be a general request for comments on how to handle various operational concerns, or should this also be handled by TechCom? The first two points in this task seem like changes/questions that can be handled within configuration changes and wouldn't need/require TechCom involvement. The third one is a bigger question regarding active-active that perhaps would make sense to have a TechCom-RFC about.

See my latest comments on: T167784#3961866

The third one is a bigger question regarding active-active that perhaps would make sense to have a TechCom-RFC about.

Funnily enough, the third question is already resolved: we already can and do have a write-anywhere parsercache. MySQL, contrary to popular belief, can have a master-master configuration; the problem is getting consistency with complex table dependencies. With a NoSQL-like configuration, as for the parsercaches, this is already a reality (writes on codfw appear on eqiad and the other way around). This is right now disabled because it allows for easier testing and maintenance, but we had been running like that for quite some time (over a year).

We can have a conversation, but I do not think it is a mediawiki conversation so much as a WMF one: whether to give SRE more control over things like sharding and failover for an improved operational experience. But I believe most of that would not involve code/app changes.

@jcrespo Thanks, I'll untag our team for now then. Let me know if there's anything we can do.

A bit of a recap on the original questions:

  • Parsercache keys were renamed to pc1, pc2, pc3 at: T210725
  • Parsercaches are write-write, and (can) send the results to the other datacenter. At the moment replication in the codfw -> eqiad direction is disabled until both dcs are active-active for reads
  • A redundant host was purchased for each datacenter for maintenance, but at the moment the switch is not automatic. It would be nice to have an automatic method to switch to it, either in the mediawiki load balancer or through a proxy, so it doesn't need manual interaction

3 additional items/proposals regarding purging:

  • Smarter purging, maybe priority-queue based, while respecting TTLs; not sure about the details
  • Review purging so it works well for passive hosts and for our host failover scenario (which is memcache-like: the remaining hosts take over the keys of the failed one)
  • Potentially enabling purging on both datacenters, as this service is write-write, active-active
jcrespo renamed this task from [RFC] improve parsercache replication and sharding handling to [RFC] improve parsercache replication, sharding and HA. Oct 9 2019, 11:13 AM
jcrespo updated the task description.
Krinkle renamed this task from [RFC] improve parsercache replication, sharding and HA to Decide how to improve parsercache replication, sharding and HA. May 21 2020, 1:31 AM
Krinkle moved this task from Limbo to Perf recommendation on the Performance-Team (Radar) board.

Playing around with

mwscript shell.php aawiki

...I noticed that SHOW SLAVE STATUS is empty in eqiad for the 'pc3' slot server. Both have SHOW MASTER STATUS output and read_only = 0. Any reason the eqiad DBs are not listening to the codfw DB binlogs?

Ideally the SqlBagOStuff hashing would use HashRing, though any naive transition would involve a lot of misses/churn at first.

Purges can probably be optimized a bit. I don't see any reason that the purging script couldn't be enabled in both DCs. Stochastic purging on writes already happens as long as mysql read_only = 0.
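
For illustration, a rough sketch of what stochastic purge-on-write amounts to (the function name is invented and this is not the actual SqlBagOStuff code):

// Roughly one in every $period writes also deletes a bounded batch of
// expired rows, so expired data gets cleaned up on any writable server
// (read_only = 0) even without a dedicated cron.
function maybePurgeExpired( PDO $dbw, string $table, int $period = 100, int $limit = 50 ): void {
	if ( mt_rand( 1, $period ) !== 1 ) {
		return; // do nothing on most writes
	}
	$stmt = $dbw->prepare(
		"DELETE FROM $table WHERE exptime < :now ORDER BY exptime LIMIT $limit"
	);
	$stmt->execute( [ ':now' => gmdate( 'YmdHis' ) ] );
}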

Playing around with

mwscript shell.php aawiki

...I noticed that SHOW SLAVE STATUS is empty in eqiad for the 'pc3' slot server. Both have SHOW MASTER STATUS output and read_only = 0. Any reason the eqiad DBs are not listening to the codfw DB binlogs?

As nothing writes to codfw we keep codfw -> eqiad replication disconnected everywhere (pcX, sX, esX) so we can do maintenance on codfw a lot faster and without worrying.

Krinkle added a subscriber: Kormat.

The concern raised by @Kormat is that the current setup causes lag spikes, […]

Could you elaborate on this? What is the cause-effect we see, and how is it caused or aggravated by having bi-directional replication?

[…] with the question whether we actually need this or whether a locally kept cache would suffice.

My gut feeling is that we do not need this indeed. I think the main reason we have it is for switchovers and disaster recovery so that our secondary DC has an effectively "warm" standby parser cache.

However, when we operate read requests from both DCs (multi-DC) this probably isn't a concern, perhaps even comparable to how we keep other caches local to the DC as well (like Memcached). And unlike Memcached/WANObjectCache, the parser cache does not have a need for proactively sending purges (as far as I know).

The only minor caveat that comes to mind is that garbage collection is a maintenance cron; if the parser cache becomes dc-local, we presumably want to run it in all DCs, which would probably make it the first maintenance cron of its kind. The others are only enabled in the primary DC, I think.

Assigning to @Marostegui (but he might not get to it until he's back from vacation in a couple of weeks).

Removing assignment as I don't believe Manuel will be looking into this in the near future.

@jcrespo wrote in the task description:
  • Should we implement a more deterministic sharding function? As far as I know, the server now depends on the key and the number of servers, but that means that during maintenance (it is very typical to depool one server at a time), keys are redistributed randomly among the servers.

One option would be to shift sharding to the same model as external storage: shard by a static key value and separate the contents into different tables, so that they can be moved anywhere while keeping the subset of keys per table fixed.

'pc1' => 'server1',
'pc2' => 'server1',
'pc3' => 'server3',

{pc1-001, pc1-002, pc2-001, pc2-002, ...}

That way there is something constant when a server is down/under maintenance, which can later be cleaned up/migrated/replicated.

I've reviewed the current SqlBagOStuff implementation in MediaWiki, which we use for ParserCache at WMF.

ExternalStore

ExternalStore is quite different, as I understand it. The labels es1 to es5 are not server labels, in the sense that MediaWiki does not consider ExternalStore one cluster of servers with these as static labels for the hosts within it. Rather, MediaWiki is configured such that es1-5 are their own sections, similar to s1-8 and x1-2. This also matches each of es1-5 having both primaries and replicas within it. And while it is possible to build sharding by cluster label, we do not have such a mechanism in MediaWiki today. Afaik we only shard (pseudo-randomly) by dbhost name or by a static dbhost label (e.g. redis, memcached, and parser cache).

Writes to ExternalStore are generally append-only so the destination for new writes is typically only the "last" table, as statically configured in $wgDefaultExternalStore (source) and mapped in $wgLBFactoryConf['externalLoads'] (source: etcd.php, for some reason).

We do actually have two destinations at any given time, presumably for improved availability, but these are picked randomly without any seeding or input. The address of the destination is what the MW revision database tracks, so it need not be deterministic. Making it deterministic would not remove the need for tracking writes, since the candidates for writing do change over time and there is meaningful and necessary state in the list of ES servers: they can become full and then turn read-only.
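
For illustration, the write configuration has roughly this shape (the cluster names below are placeholders, not the current production values):

// New blobs go to a randomly picked entry of the list, and the resulting
// "DB://clusterNN/id" address is stored in revision metadata, so the pick
// itself does not need to be deterministic.
$wgDefaultExternalStore = [ 'DB://cluster26', 'DB://cluster27' ];

// Roughly what the random pick amounts to:
$writeDestination = $wgDefaultExternalStore[ array_rand( $wgDefaultExternalStore ) ];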

ParserCache

When SqlBagOStuff is doing a pseudo-random hash on the key and picking one of the db hosts, it appears to already perform a deterministic hash that is not (entirely) dependent on the number of servers.

For example, with an input of $key = 'foobar'; and $servers = [ 'db001', 'db003', 'db004' ]; it will hash each combination of hostname and key and pick the first entry after sorting those hashes in ascending order. I believe this means that, even without labels or tags, it avoids redistributing keys merely for adding or temporarily removing a server. However, replacing a server with a replicated copy under a different name would indeed re-hash.
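
A simplified, standalone version of that pick (the real logic lives in ArrayUtils::consistentHashSort() and SqlBagOStuff; this sketch assumes the md5( tag . key ) form mentioned further down in this task):

function pickServerTag( array $tags, string $key ): string {
	$hashes = [];
	foreach ( $tags as $tag ) {
		$hashes[$tag] = md5( $tag . $key );
	}
	asort( $hashes ); // ascending by hash value
	return array_key_first( $hashes ); // first entry wins
}

// Removing 'db003' only remaps the keys whose current winner is 'db003';
// every other key keeps the same winning tag.
echo pickServerTag( [ 'db001', 'db003', 'db004' ], 'foobar' ), "\n";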

However, parsercache is not configured with a list of hostnames. It is already configured with tags (static labels) in ProductionServices.php#138

		'parsercache-dbs' => [
			'pc1' => '10.64.0.57',   # pc1011, ..
			'pc2' => '10.64.16.65',  # pc1012, ..
			'pc3' => '10.64.32.163', # pc1013, ..

And SqlBagOStuff honours these labels for hashing purposes.

Change 816014 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] ArrayUtils: Add coverage for consistentHashSort()

https://gerrit.wikimedia.org/r/816014

source: etcd.php, for some reason)

This is because it allows 100% hot es failovers: we can dynamically stop writing to one of the 2 servers and perform a topology change while it is in read-only/under other maintenance.

And SqlBagOStuff honours these labels for hashing purposes.

Yes, the label issue was solved thanks to the work of Tim or Aaron, I think (first checkbox), but as far as I understand that doesn't work when the number of servers changes/gets updated. You may ask, why would you change the number of servers? XD You wouldn't, but one can go down at any time, or you need maintenance, and that makes labels get shuffled around.

To overcome this, at the moment we have a passive host replicating from one of the active servers. Ideally, you would have all 4 servers active, or at least in an active configuration where, with some magic(TM), one takes over from the broken one automatically in a way that makes sense and still keeps caching as effective as possible (e.g. existing keys are reused within reason). Normally that would require some kind of finer-grained sharding that can be redistributed, or a proxy to fail over the host automatically to standbys, or something else that doesn't require a DBA to go and update the mw config/dbctl. :-)

E.g.

		'parsercache-dbs' => [
			'pc1' => '10.64.0.57',   # pc1011, ..
			'pc2' => '10.64.16.65',  # pc1012, ..
			'pc3' => '10.64.32.163', # pc1013, ..
			'spare' => 'XXXXXX',     # pc1014

Note this is not dependent on active-active; it is a weakness of the setup in general. And this is just an example; there are many ways to solve this.

I explained in my previous comment why the labeling is not deterministic enough. I can edit the checkbox to make the pending problem clearer, if that works (and merge it with the outstanding HA issue)?

I have merged the "needs" for HA+sharding into a single bullet point so the language is clearer. In case it is not clear: when a single pc host goes down, bad things happen (it is not transparent for the end user), as proven in past outages.

I explained in my previous comment why the labeling is not deterministic enough.

I do not consider myself an expert in hash rings. I'm aware of libketama's existence and am aware that we support it in MediaWiki but are not using it in SqlBagOStuff. Based on the complexity and fame around libketama, I'm sure it does something clever that the older basic md5 sort doesn't solve. However, I do note that much of the writing about libketama seems to compare it to taking a hash and mapping it to an unlabelled list of servers, which is not what we do.

The following is my current understanding of ParserCache/SqlBagOStuff/ArrayUtils flow. Please help me learn if I am wrong or incomplete:

wmf-config ProductionServices
'parsercache-dbs' => [
	'pc1' => '10.64.0.57',   # pc1011, ..
	'pc2' => '10.64.16.65',  # pc1012, ..
	'pc3' => '10.64.32.163', # pc1013, ..

This feeds to SqlBagOStuff:

wmf-config CommonSettings.php
$pcServers = [];
foreach ( $wmgLocalServices['parsercache-dbs'] as $tag => $host ) {
	$pcServers[$tag] = [ .., 'host' => $host, .. ];
}

$wgObjectCaches['mysql-multiwrite'] = [
	.. => [
		'class' => 'SqlBagOStuff',
		'servers' => $pcServers,

Which interprets the labels as server tags and picks the label after hash sorting:

mediawiki-core SqlBagOStuff.php
$this->serverInfos = array_values( $params['servers'] ); // simplified
$this->serverTags = array_keys( $params['servers'] ); // simplified

// .. 

$sortedTags = $this->serverTags;
ArrayUtils::consistentHashSort( $sortedTags, $key );
reset( $sortedTags );
$i = key( $sortedTags );

$server = $this->serverInfos[$i]; // simplified

I've added a unit test to confirm this behaviour, and from what I can tell, additions or removals do not impact anything beyond the naturally affected set of keys of the removed server, or a small portion of existing keys when adding a new server.

My test only covers a few basic inputs, so feel free to point out any flaw. The implementation in ArrayUtils is md5( $tag . $key ), and we take the first sorted entry. I feel inexperienced in this area, but it seems this does not require the number of servers to stay equal, nor does it require the stand-by replica that we have today, at least not for sorting reasons. (The reason for the standby replica, afaik, isn't sorting but improved availability of data, so that a host can be taken out of rotation for maintenance without us going cold for any portion of keys.)

Based on my unit tests and a quick read-up on libketama, it appears to be very similar, but without a fixed node count, and with implicit clumping together of nodes to the same server.
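
For illustration, a small invented experiment in the spirit of that unit test (not the actual test in the patch above): it counts how many keys would pick a different tag after a fourth one is added. With the min-hash pick, roughly a quarter of the keys should move, and only onto the new server.

function pickServerTag( array $tags, string $key ): string {
	$hashes = [];
	foreach ( $tags as $tag ) {
		$hashes[$tag] = md5( $tag . $key );
	}
	asort( $hashes );
	return array_key_first( $hashes );
}

$before = [ 'pc1', 'pc2', 'pc3' ];
$after = [ 'pc1', 'pc2', 'pc3', 'pc4' ];
$moved = 0;
$total = 10000;
for ( $i = 0; $i < $total; $i++ ) {
	$key = "enwiki:pcache:idhash:$i";
	if ( pickServerTag( $before, $key ) !== pickServerTag( $after, $key ) ) {
		$moved++;
	}
}
printf( "%.1f%% of keys moved\n", 100 * $moved / $total ); // ~25% expected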

I discussed with @Krinkle on IRC, as I believe we had a misunderstanding: consistent hashing was implemented by @aaron at https://gerrit.wikimedia.org/r/c/mediawiki/core/+/284023, but what worried me was the behaviour on failover/maintenance, which may or may not require further sharding changes (depending on the implementation chosen to solve it), but at least requires sharding awareness.

Change 816014 merged by jenkins-bot:

[mediawiki/core@master] ArrayUtils: Add coverage for consistentHashSort()

https://gerrit.wikimedia.org/r/816014

akosiaris subscribed.

Removing SRE, has already been triaged to a more specific SRE subteam

Ladsgroup moved this task from Meta/Epic to Done on the DBA board.
Ladsgroup subscribed.

T350367: Add pc4 to mediawiki's parsercache clusters put a lot of these assumptions to the test. We added pc4 and expected some misplaced rows, but not many, thanks to the smart algorithm used for consistent hashing. The disruption was indeed minimal in terms of misplaced rows, but not really in its overall effect.

The issue came from the fact that, to look up a PC entry, you first hit the options row to get the address and then you hit the actual HTML. As a result, misplacing 1/4 of the rows roughly halved the parser cache hit rate (and doubled the amount of parsing for a while), presumably because both rows have to hit for a cached parse: 0.75 × 0.75 ≈ 0.56. Still not as bad as all rows being misplaced (and we survived without any threat to reliability). I also think this indirection will go away as Parsoid for reads is rolled out.

For now, we decided to close this ticket, as the major points have been addressed and we are even confident about how it behaves in production. We will create tickets for any narrow and specific needs/ideas/bugs.