The storage backend for stashing Parsoid output for VE edits in the page/html endpoint needs to be configurable. The requirements for persistence and latency are still unclear, though.
Outcome:
- On the Cassandra keyspace used by RESTbase for stashing edits, we are seeing about 100 writes per second across all wikis (but only about 10 reads/s, indicating that 90% of edits are abandoned)
- At a TTL of 24h, this amounts to about 7 million entries at any given time
- Assuming an average of 20KB for each HTML blob, this works out to be 140GB.
- Since this is essentially a key/value store, not much extra space is needed for indexes.
- The storage requirement will be multiplied by the replication factor (see the back-of-envelope calculation after this list)
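As a sanity check, a quick back-of-envelope calculation using the figures above. The replication factor of 3 is an assumption (a typical Cassandra setting), not something stated in this task:

```python
# Back-of-envelope storage estimate based on the figures above.
WRITES_PER_SEC = 100           # observed across all wikis
TTL_SECONDS = 24 * 60 * 60     # 24h retention
AVG_BLOB_KB = 20               # assumed average HTML blob size
REPLICATION_FACTOR = 3         # assumption (typical Cassandra setting), not from this task

# Upper bound if every write survives the full TTL; the observed steady state
# is somewhat lower, around 7 million entries.
max_entries = WRITES_PER_SEC * TTL_SECONDS            # ~8.6 million
raw_gb = 7_000_000 * AVG_BLOB_KB / 1_000_000          # ~140 GB, matching the estimate above
replicated_gb = raw_gb * REPLICATION_FACTOR           # ~420 GB total on disk across replicas

print(f"upper bound: {max_entries / 1e6:.1f}M entries, "
      f"raw: {raw_gb:.0f} GB, replicated: {replicated_gb:.0f} GB")
```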
Backend tech choice:
- Replication requirement: we need the stashed data to be available across DCs. Candidate tech: MemCached via mcrouter, Cassandra, MySQL (Redis as well, but it is being phased out).
- Retention requirement: if stashed data vanishes, this directly impacts users by causing edits to fail. We don't want that. Candidate tech: Cassandra, MySQL
- Performance requirement: high write rate. Candidate tech: MemCached via mcrouter, Cassandra
- Space requirement: we need hundreds of GB with no unexpected eviction. Candidate tech: Cassandra, MySQL
- Ease of deployment/maintenance: use what we have. Candidate tech: MemCached via mcrouter, MySQL.
Given the requirements above, the choice is between Cassandra and MySQL. Cassandra would require a significant effort (bundling and deploying a driver, implementing an adapter, setting up and running the Cassandra cluster), while using the ParserCache MySQL cluster only requires a small config change. So we should try MySQL first, and pivot to Cassandra if needed.
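To illustrate why the pivot stays cheap, here is a minimal sketch of what a configurable key/value stash interface could look like. The names (StashBackend, MySQLStash, CassandraStash, make_stash) are hypothetical and are not MediaWiki's actual API; the point is only that the backend choice sits behind a small key/value contract with TTL-based expiry, so swapping MySQL for Cassandra stays a configuration concern.

```python
from abc import ABC, abstractmethod
from typing import Optional


class StashBackend(ABC):
    """Key/value store for stashed Parsoid HTML, expiring entries after a TTL."""

    @abstractmethod
    def set(self, key: str, html: bytes, ttl_seconds: int) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class MySQLStash(StashBackend):
    """Backed by the ParserCache MySQL cluster (the first choice)."""

    def set(self, key: str, html: bytes, ttl_seconds: int) -> None:
        # A real implementation would insert a row with an expiry timestamp
        # and rely on a periodic purge job; this sketch only defines the shape.
        raise NotImplementedError

    def get(self, key: str) -> Optional[bytes]:
        # A real implementation would select the row and return None if expired.
        raise NotImplementedError


class CassandraStash(StashBackend):
    """Backed by Cassandra (the fallback if MySQL does not hold up)."""

    def set(self, key: str, html: bytes, ttl_seconds: int) -> None:
        # A real implementation would use Cassandra's native per-write TTL.
        raise NotImplementedError

    def get(self, key: str) -> Optional[bytes]:
        # A real implementation would select by primary key.
        raise NotImplementedError


# The "small config change" then reduces to picking an implementation by name.
BACKENDS = {"mysql": MySQLStash, "cassandra": CassandraStash}

def make_stash(config: dict) -> StashBackend:
    return BACKENDS[config.get("stash_backend", "mysql")]()
```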
See T308511: [SPIKE] Determine necessity of edit session continuity during data center switchovers