The storage backend for stashing Parsoid output for VE edits in the page/html endpoint needs to be configurable. The requirements for persistence and latency are still unclear, though.
Outcome:
- On the Cassandra keyspace used by RESTbase for stashing edits, we are seeing about 100 writes per second across all wikis (but only about 10 reads/s, indicating that 90% of edits are abandoned)
- At a TTL of 24h, this amounts to about 7 million entries at any given time
- Assuming an average of 20KB for each HTML blob, this works out to be 140GB.
- Since this is essentially a key/value store, not much extra space is needed for indexes.
- The storage requirement will be multiplied by the replication factor (see the back-of-envelope calculation after this list)
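As a sanity check, a quick back-of-envelope calculation using the figures above. The replication factor of 3 is an assumption (a typical Cassandra setting), not something stated in this task:

```python
# Back-of-envelope storage estimate based on the figures above.
WRITES_PER_SEC = 100           # observed across all wikis
TTL_SECONDS = 24 * 60 * 60     # 24h retention
AVG_BLOB_KB = 20               # assumed average HTML blob size
REPLICATION_FACTOR = 3         # assumption (typical Cassandra setting), not from this task

# Upper bound if every write survives the full TTL; the observed steady state
# is somewhat lower, around 7 million entries.
max_entries = WRITES_PER_SEC * TTL_SECONDS            # ~8.6 million
raw_gb = 7_000_000 * AVG_BLOB_KB / 1_000_000          # ~140 GB, matching the estimate above
replicated_gb = raw_gb * REPLICATION_FACTOR           # ~420 GB total on disk across replicas

print(f"upper bound: {max_entries / 1e6:.1f}M entries, "
      f"raw: {raw_gb:.0f} GB, replicated: {replicated_gb:.0f} GB")
```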
Backend tech choice:
- Replication requirement: we need the stashed data to be available across DCs. Candidate tech: MemCached via mcrouter, Cassandra, MySQL (Redis as well, but it is being phased out).
- Retention requirement: if stashed data vanishes, this directly impacts users by causing edits to fail. We don't want that. Candidate tech: Cassandra, MySQL
- Performance requirement: high write rate. Candidate tech: MemCached via mcrouter, Cassandra
- Space requirement: we need hundreds of GB with no unexpected eviction. Candidate tech: Cassandra, MySQL
- Ease of deployment/maintenance: use what we have. Candidate tech: MemCached via mcrouter, MySQL.
Given the requirements above, the choice is between Cassandra and MySQL. Cassandra would require a significant effort (bundling and deploying a driver, implementing an adapter, setting up and running the Cassandra cluster), while using the ParserCache MySQL cluster only requires a small config change. So we should try MySQL first, and pivot to Cassandra if needed.
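To illustrate why the pivot stays cheap, here is a minimal sketch of what a configurable key/value stash interface could look like. The names (StashBackend, MySQLStash, CassandraStash, make_stash) are hypothetical and are not MediaWiki's actual API; the point is only that the backend choice sits behind a small key/value contract with TTL-based expiry, so swapping MySQL for Cassandra stays a configuration concern.

```python
from abc import ABC, abstractmethod
from typing import Optional


class StashBackend(ABC):
    """Key/value store for stashed Parsoid HTML, expiring entries after a TTL."""

    @abstractmethod
    def set(self, key: str, html: bytes, ttl_seconds: int) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class MySQLStash(StashBackend):
    """Backed by the ParserCache MySQL cluster (the first choice)."""

    def set(self, key: str, html: bytes, ttl_seconds: int) -> None:
        # A real implementation would insert a row with an expiry timestamp
        # and rely on a periodic purge job; this sketch only defines the shape.
        raise NotImplementedError

    def get(self, key: str) -> Optional[bytes]:
        # A real implementation would select the row and return None if expired.
        raise NotImplementedError


class CassandraStash(StashBackend):
    """Backed by Cassandra (the fallback if MySQL does not hold up)."""

    def set(self, key: str, html: bytes, ttl_seconds: int) -> None:
        # A real implementation would use Cassandra's native per-write TTL.
        raise NotImplementedError

    def get(self, key: str) -> Optional[bytes]:
        # A real implementation would select by primary key.
        raise NotImplementedError


# The "small config change" then reduces to picking an implementation by name.
BACKENDS = {"mysql": MySQLStash, "cassandra": CassandraStash}

def make_stash(config: dict) -> StashBackend:
    return BACKENDS[config.get("stash_backend", "mysql")]()
```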
See T308511: [SPIKE] Determine necessity of edit session continuity during data center switchovers