Batch updates create slave lag on s3 over WAN
Closed, ResolvedPublic

Assigned To

Authored By

	jcrespo
	Dec 25 2015, 11:04 AM

Description

When updating slow special pages from terbium, such as Listredirects, certain rows of the querycache table are deleted and then inserted.

Under normal circumstances, those updates do not create a problem. However, I believe a combination of factors can make them lag the slaves:

Special pages of s3 are updated, which means hundreds of updates per wiki, independently of the wiki size, multiplied by the hundreds of wikis on that shard. No other shard has >800 wikis.
WAN latency is higher than same-datacenter replication latency
Other writes are happening at the same time, such as updates to pagelinks or wbc_entity_usage
ROW-based replication is used
The replication topology is not very flat (there are now 4 tiers, which is not desirable)

Given that special page update is not time-sensitive, I would like to:
a) Introduce pauses between wiki updates or, better, check that the previous updates have been applied to >50% of the slaves (including remote slaves) before continuing
b) Make the updates non-transactional, splitting the filling of those tables into several smaller transactions
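
To make (a) and (b) concrete, here is a rough sketch of the idea, not the actual MediaWiki maintenance code. All names in it (write_in_batches, get_replica_lags, the batch size and the lag threshold) are hypothetical; the caller would supply a function that writes one small batch in its own transaction and a function that returns the current lag, in seconds, of every slave, including the remote ones.

```python
import time
from typing import Callable, Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(rows: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def enough_replicas_caught_up(lags: Iterable[float], max_lag: float) -> bool:
    """True when more than 50% of the replicas report a lag <= max_lag."""
    lags = list(lags)
    if not lags:
        return True  # nothing to wait for
    caught_up = sum(1 for lag in lags if lag <= max_lag)
    return caught_up / len(lags) > 0.5


def write_in_batches(
    rows: Sequence[T],
    write_batch: Callable[[Sequence[T]], None],       # hypothetical: one small transaction
    get_replica_lags: Callable[[], Iterable[float]],  # hypothetical: lag per replica, seconds
    batch_size: int = 100,   # assumed value, tune per shard
    max_lag: float = 1.0,    # assumed threshold in seconds
    poll_interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write rows in small batches, pausing until most replicas have caught up."""
    batches = 0
    for batch in chunked(rows, batch_size):
        write_batch(batch)
        batches += 1
        # Do not start the next batch until >50% of the replicas are in sync.
        while not enough_replicas_caught_up(get_replica_lags(), max_lag):
            sleep(poll_interval)
    return batches
```

Since special page updates are not time-sensitive, the extra waiting only makes the batch job slower; it does not affect readers.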

Related Objects

Status	Assigned	Task
		· · ·
Resolved	aaron	T95501 Fix causes of replica lag and get it to under 5 seconds at peak
Resolved	None	T109179 Migrate MySQLs to use ROW-based replication
Resolved	jcrespo	T122429 Batch updates create slave lag on s3 over WAN
Resolved	hoo	T125838 Implement usage tracking without eu_touched
		· · ·

Event Timeline

jcrespo created this task. Dec 25 2015, 11:04 AM

jcrespo set the priority of this task to Needs Triage.

jcrespo updated the task description.

jcrespo added projects: MediaWiki-Special-pages, Performance Issue, DBA.

jcrespo subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 25 2015, 11:04 AM

jcrespo added subtasks: T95501: Fix causes of replica lag and get it to under 5 seconds at peak, T109179: Migrate MySQLs to use ROW-based replication. Dec 25 2015, 11:05 AM

I cannot say for sure whether the cause is the special page updates or the wbc_entity_usage updates, but it is one of the two.

I see lots of:

UPDATE /* Wikibase\Client\Usage\Sql\EntityUsageTable::touchUsageBatch 127.0.0.1 */  `wbc_entity_usage` SET eu_touched = '20151225132251' WHERE eu_row_id IN ('613395','476260','476261','613397','523272','476258','476259','525131','476254','394381','476252','476253','543080')

Setting db2018 to MIXED binlog format temporarily to see if that helps.

jcrespo renamed this task from Batch update of special pages creates slave lag on s3 over WAN to Batch updated create slave lag on s3 over WAN. Dec 25 2015, 1:27 PM

jcrespo set Security to None.

jcrespo mentioned this in T111769: [Bug] EntityUsageTable::touchUsageBatch slow query.

jcrespo renamed this task from Batch updated create slave lag on s3 over WAN to Batch updates create slave lag on s3 over WAN. Dec 25 2015, 1:32 PM

Glaisher subscribed. Dec 25 2015, 3:45 PM

jcrespo added a subscriber: hoo. Jan 6 2016, 8:41 PM

hoo mentioned this in T124737: [RfC] Implement usage tracking without eu_touched. Jan 26 2016, 1:33 AM

JanZerebecki added a project: Wikidata. Feb 1 2016, 12:17 PM

JanZerebecki subscribed.

jcrespo mentioned this in T95501: Fix causes of replica lag and get it to under 5 seconds at peak. Feb 4 2016, 11:18 AM

jcrespo removed subtasks: T109179: Migrate MySQLs to use ROW-based replication, T95501: Fix causes of replica lag and get it to under 5 seconds at peak. Feb 4 2016, 11:21 AM

jcrespo added a parent task: T109179: Migrate MySQLs to use ROW-based replication.

jcrespo added a parent task: T95501: Fix causes of replica lag and get it to under 5 seconds at peak.

jcrespo mentioned this in T123867: Repeated reports of wikidatawiki (s5) API going read only. Feb 4 2016, 11:31 AM

hoo added a subtask: T125838: Implement usage tracking without eu_touched. Feb 4 2016, 5:10 PM

Switching to MIXED fixed the ongoing issue as a workaround, but the root causes are still there and have to be fixed: pagelinks and/or wbc_entity_usage write activity.

hoo closed subtask T125838: Implement usage tracking without eu_touched as Resolved. Apr 13 2016, 8:46 PM

	F3146353: lag.png
	Dec 25 2015, 1:22 PM
