
DBA review: data structure for storing permalinks
Closed, Resolved · Public

Description

This task is about the work involved with DBA reviewing/approving the new table(s) we'll have implemented for storing permalinks to talk page comments and topics/discussions/sections in T296801.

Table structure

@matmarex to document here or link to where this is documented elsewhere (e.g. gerrit: 763690).

Done

  • Tag DBA once the "Table structure" section above has been defined, per the guidance @Marostegui shared in T260372#6482475
  • The usual DBA questions have been answered
  • DBA communicates what – if any – changes need to be made to the new table(s) prior to them being deployed to production.
  • Table is created. This will happen in TICKET NEEDED.


Event Timeline

ppelberg updated the task description.
ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.
matmarex added a subscriber: Ladsgroup.

Info for DBA review:

Work-in-progress patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/763690


Responses to questions from https://wikitech.wikimedia.org/wiki/Creating_new_tables#Preparation:

  • Should this table be replicated to wiki replicas (does it not contain private data)?
    • Yes, but since the data in the tables is generated from the contents of page revisions, it would need to be filtered in case the revisions or pages are deleted. I think this is possible to handle in our replication setup using views (see the sketch after this list), and I could probably create the necessary patches.
  • Will you be doing cross-joins with the wiki metadata?
    • Do you mean cross-joins as in CROSS JOIN, or just joins in general? No cross joins, but we are joining against revision, page and actor.
  • Size of the table (number of rows expected).
  • Expected growth per year (number of rows).
    • discussiontools_items: 1 row per comment posted on wikitext talk pages.
    • discussiontools_item_ids: Almost always 1 row per comment posted on wikitext talk pages, but rarely more than 1 row if the comment was edited in a particular way.
    • discussiontools_item_pages: 1 row per comment posted on wikitext talk pages, for each page that the comment appears on (usually 1, or 2 if the comment has been archived, sometimes 3+ if it is also transcluded on another page).
    • discussiontools_item_revisions: 2 rows per comment posted on wikitext talk pages, for each page that the comment appears on: 1 for the oldest revision, 1 for the latest. (The table schema supports storing a row for each revision, but that would probably be excessive. Even if we only store two, it seems nicer to do it this way than having two sets of identical columns.)
    • A quick estimate shows roughly 11,500 comments per day on enwiki (based on 'discussiontools-added-comment' tags in March), corresponding to about 1–2 MB of data per day across all tables (based on disk usage on my local wiki). We will probably want to backfill some historical data (details TBD). (A rough per-year extrapolation follows this list.)
  • Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok).
    • Writes: Some inserts/deletes/updates to each table for each edit to a discussion page, and for each discussion page that gets purged because it transcludes an edited page (using the existing LinksUpdate jobs).
    • Reads: When generating notifications for talk page edits, when checking for talk page edit conflicts, and when following links from notifications.
      • Hard to estimate, maybe 2x-10x writes?
  • Examples of queries that will be using the table.
    • They're big and ugly because of the joins; it's probably easier to read the query builder code (but I can catalog them all if you really want).
    • Example query used on Special:FindComment:
SELECT itr_id,
       it_itemname,
       it_timestamp,
       it_actor,
       itid_itemid,
       itr_parent_id,
       itr_transcludedfrom,
       itr_level,
       itr_headinglevel,
       itr_revision_id,
       page_id,
       page_namespace,
       page_title,
       page_is_redirect,
       page_is_new,
       page_touched,
       page_links_updated,
       page_latest,
       page_len,
       page_content_model,
       page_lang,
       page_restrictions,
       actor_id,
       actor_name,
       actor_user,
       itr_parent__itid_itemid
FROM `discussiontools_item_revisions`
JOIN `discussiontools_items` ON ((it_id = itr_items_id))
JOIN `discussiontools_item_ids` ON ((itid_id = itr_itemid_id))
JOIN `discussiontools_item_pages` ON ((itp_items_id = itr_items_id)
                                      AND (itp_newest_revision_id = itr_revision_id))
LEFT JOIN `page` ON ((page_id = itr_transcludedfrom))
LEFT JOIN `actor` ON ((actor_id = it_actor))
LEFT JOIN
  (SELECT itr_id AS `itr_parent__itr_id`,
          itid_itemid AS `itr_parent__itid_itemid`
   FROM `discussiontools_item_revisions`
   JOIN `discussiontools_items` ON ((it_id = itr_items_id))
   JOIN `discussiontools_item_ids` ON ((itid_id = itr_itemid_id))) `sqb1` ON ((itr_parent_id = itr_parent__itr_id))
WHERE (it_itemname IN
         (SELECT it_itemname
          FROM `discussiontools_item_revisions`
          JOIN `discussiontools_items` ON ((it_id = itr_items_id))
          JOIN `discussiontools_item_ids` ON ((itid_id = itr_itemid_id))
          WHERE itid_itemid = 'c-Matma_Rex-2020-02-20T20:25:00.000Z-Matma_Rex-2020-02-20T20:17:00.000Z' ))
ORDER BY itr_id ASC
  • The release plan for the feature (are there specific wikis you'd like to test first etc).
    • Not determined yet
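
A very rough sketch of the view idea mentioned above, using discussiontools_item_revisions as an example (illustrative only: the view name and the exact filtering conditions are assumptions, and the real definitions would live in the wiki replicas' maintain-views configuration):
CREATE VIEW discussiontools_item_revisions_filtered AS
SELECT itr.*
FROM discussiontools_item_revisions itr
JOIN revision ON rev_id = itr.itr_revision_id  -- rows for deleted pages drop out (their revision rows are archived)
WHERE rev_deleted = 0;                          -- rows for revision-deleted revisions are hidden
The other tables could presumably get similar views joining against page and/or revision.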
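
For a rough sense of the per-year growth asked about above, a back-of-envelope extrapolation of the daily figures (enwiki only, assuming the March rate holds year-round and ignoring any backfill):
-- All numbers are approximations derived from the estimates above.
SELECT 11500 * 365     AS comments_per_year,             -- ≈ 4.2 million; ~1 row each in items and item_ids
       11500 * 365 * 2 AS item_revisions_rows_per_year,  -- ≈ 8.4 million (2 rows per comment)
       1 * 365         AS low_estimate_mb_per_year,      -- ≈ 0.4 GB across all tables
       2 * 365         AS high_estimate_mb_per_year;     -- ≈ 0.7 GB across all tables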

I'm trying to review the schema, and at first glance it looks good. I need a bit of time to set it up locally and check the data in detail (or possibly check it in the beta cluster instead), and possibly come back with suggestions for improvements to the schema or the queries.

The only thing that needs fixing is that the index prefix must be the same as the field prefix (so discussiontools_item_pages_items_id_page_id -> itp_item_pages_items_id_page_id, or even better itp_items_id_page_id). We do have a test in core to ensure this stays consistent, but extensions are not wired into it yet (I have plans to make that possible; see T270033 for the core work). The good thing is that since this is not deployed to production yet and not in a release, you can simply change it in the code without doing an ALTER TABLE and all that jazz. The beta cluster needs fixing, but I can take care of that.
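
For concreteness, the resulting index definition would be equivalent to the following (the column list is assumed from the old index name; in practice this is just a rename in the extension's schema files, not a live ALTER TABLE):
-- Same index as before, only the name now carries the table's itp_ field prefix.
CREATE INDEX itp_items_id_page_id
    ON discussiontools_item_pages (itp_items_id, itp_page_id);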

> The only thing that needs fixing is that the index prefix must be the same as the field prefix

Done


Also, about doing something to avoid storing the long strings (which we talked about briefly, if you remember) – we definitely can't omit the username (it's generated from the page text rather than actor table), we definitely can't omit suffixes, and it seems impractical to omit the timestamps (it would make querying more difficult). But we are planning to make them somewhat shorter, if that's any consolation: T304595.

So I have tried installing it locally to play with the new schema, but DT requires VisualEditor and that needs Parsoid to work properly. Because of that, I have to speculate a bit here. Sorry if I'm missing something super obvious.

> Also, about doing something to avoid storing the long strings (which we talked about briefly, if you remember) – we definitely can't omit the username (it's generated from the page text rather than actor table), we definitely can't omit suffixes, and it seems impractical to omit the timestamps (it would make querying more difficult). But we are planning to make them somewhat shorter, if that's any consolation: T304595.

I can think of two notes here:

  • You do store the timestamp and username already. My issue is that the UUID repeats that; you could still store the timestamp and username in your tables (not the core ones) and build the UUID on the fly (unless we are in one of those edge cases you mentioned, where storing them would make sense).
  • Or you could store a shorter hash of the UUID you built (and basically use that as the UUID). That would have lots of benefits, including much faster lookups, since the data will be more evenly distributed (so the B-tree lookups in MySQL will be much faster), and it will be shorter as well. Imagine something like Flow topic names (which I know are not great, but at least they are mostly hidden from users). A rough sketch of both ideas follows.
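
A rough SQL-level sketch of both ideas (the itid_author / itid_timestamp / itid_parent_* columns do not exist in the proposed schema and are purely hypothetical, and the 16-character hash length is arbitrary):
-- (1) Hypothetical: store author/timestamp columns and rebuild the long item ID on the fly.
SELECT CONCAT('c-', itid_author, '-', itid_timestamp, '-',
              itid_parent_author, '-', itid_parent_timestamp) AS itemid
FROM discussiontools_item_ids;

-- (2) Hypothetical: store and index a short fixed-width hash of the full ID instead.
SELECT SUBSTRING(SHA1('c-Matma_Rex-2020-02-20T20:25:00.000Z-Matma_Rex-2020-02-20T20:17:00.000Z'),
                 1, 16) AS short_key;  -- 16 hex characters instead of a ~70-character string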