
DBA review: data structure for storing permalinks
Closed, Resolved · Public

Description

This task is about the work involved with DBA reviewing/approving the new table(s) we'll have implemented for storing permalinks to talk page comments and topics/discussions/sections in T296801.

Table structure

@matmarex to document here or link to where this is documented elsewhere (e.g. gerrit: 763690).

Done

  • Tag DBA once the "Table structure" section above has been defined, per the guidance @Marostegui shared in T260372#6482475
  • The usual DBA questions have been answered
  • DBA communicates what – if any – changes need to be made to the new table(s) prior to them being deployed to production.
  • Table is created. This will happen in TICKET NEEDED.


Event Timeline

ppelberg updated the task description.
ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.
matmarex added a subscriber: Ladsgroup.

Info for DBA review:

Work-in-progress patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/763690


Responses to questions from https://wikitech.wikimedia.org/wiki/Creating_new_tables#Preparation:

  • Should this table be replicated to wiki replicas (does it not contain private data)?
    • Yes, but since the data in the tables is generated from the contents of page revisions, it would need to be filtered in case the revisions or pages are deleted. I think this is possible to handle in our replication setup using views (see the sketch after this list), and I could probably create the necessary patches.
  • Will you be doing cross-joins with the wiki metadata?
    • Do you mean cross-joins as in CROSS JOIN, or just joins in general? No cross joins, but we are joining against revision, page and actor.
  • Size of the table (number of rows expected).
  • Expected growth per year (number of rows).
    • discussiontools_items: 1 row per comment posted on wikitext talk pages.
    • discussiontools_item_ids: Almost always 1 row per comment posted on wikitext talk pages, but rarely more than 1 row if the comment was edited in a particular way.
    • discussiontools_item_pages: 1 row per comment posted on wikitext talk pages, for each page that the comment appears on (usually 1, or 2 if the comment has been archived, sometimes 3+ if it is also transcluded on another page).
    • discussiontools_item_revisions: 2 rows per comment posted on wikitext talk pages, for each page that the comment appears on: 1 for the oldest revision, 1 for the latest. (The table schema supports storing a row for each revision, but that would probably be excessive. Even if we only store two, it seems nicer to do it this way than having two sets of identical columns.)
    • A quick estimate shows roughly 11,500 comments per day on enwiki (based on 'discussiontools-added-comment' tags in March), corresponding to about 1–2 MB of data per day across all tables (based on disk usage on my local wiki). We will probably want to backfill some historical data (details TBD). (A rough per-year extrapolation follows this list.)
  • Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok).
    • Writes: Some inserts/deletes/updates to each table for each edit to a discussion page, and for each discussion page that gets purged because it transcludes an edited page (using the existing LinksUpdate jobs).
    • Reads: When generating notifications for talk page edits, when checking for talk page edit conflicts, and when following links from notifications.
      • Hard to estimate, maybe 2x-10x writes?
  • Examples of queries that will be using the table.
    • They're big and ugly because of the joins; it's probably easier to read the query builder code (but I can catalog them all if you really want).
    • Example query used on Special:FindComment:
SELECT itr_id,
       it_itemname,
       it_timestamp,
       it_actor,
       itid_itemid,
       itr_parent_id,
       itr_transcludedfrom,
       itr_level,
       itr_headinglevel,
       itr_revision_id,
       page_id,
       page_namespace,
       page_title,
       page_is_redirect,
       page_is_new,
       page_touched,
       page_links_updated,
       page_latest,
       page_len,
       page_content_model,
       page_lang,
       page_restrictions,
       actor_id,
       actor_name,
       actor_user,
       itr_parent__itid_itemid
FROM `discussiontools_item_revisions`
JOIN `discussiontools_items` ON ((it_id = itr_items_id))
JOIN `discussiontools_item_ids` ON ((itid_id = itr_itemid_id))
JOIN `discussiontools_item_pages` ON ((itp_items_id = itr_items_id)
                                      AND (itp_newest_revision_id = itr_revision_id))
LEFT JOIN `page` ON ((page_id = itr_transcludedfrom))
LEFT JOIN `actor` ON ((actor_id = it_actor))
LEFT JOIN
  (SELECT itr_id AS `itr_parent__itr_id`,
          itid_itemid AS `itr_parent__itid_itemid`
   FROM `discussiontools_item_revisions`
   JOIN `discussiontools_items` ON ((it_id = itr_items_id))
   JOIN `discussiontools_item_ids` ON ((itid_id = itr_itemid_id))) `sqb1` ON ((itr_parent_id = itr_parent__itr_id))
WHERE (it_itemname IN
         (SELECT it_itemname
          FROM `discussiontools_item_revisions`
          JOIN `discussiontools_items` ON ((it_id = itr_items_id))
          JOIN `discussiontools_item_ids` ON ((itid_id = itr_itemid_id))
          WHERE itid_itemid = 'c-Matma_Rex-2020-02-20T20:25:00.000Z-Matma_Rex-2020-02-20T20:17:00.000Z' ))
ORDER BY itr_id ASC
  • The release plan for the feature (are there specific wikis you'd like to test first etc).
    • Not determined yet
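
A very rough sketch of the view idea mentioned above, using discussiontools_item_revisions as an example (illustrative only: the view name and the exact filtering conditions are assumptions, and the real definitions would live in the wiki replicas' maintain-views configuration):
CREATE VIEW discussiontools_item_revisions_filtered AS
SELECT itr.*
FROM discussiontools_item_revisions itr
JOIN revision ON rev_id = itr.itr_revision_id  -- rows for deleted pages drop out (their revision rows are archived)
WHERE rev_deleted = 0;                          -- rows for revision-deleted revisions are hidden
The other tables could presumably get similar views joining against page and/or revision.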
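
For a rough sense of the per-year growth asked about above, a back-of-envelope extrapolation of the daily figures (enwiki only, assuming the March rate holds year-round and ignoring any backfill):
-- All numbers are approximations derived from the estimates above.
SELECT 11500 * 365     AS comments_per_year,             -- ≈ 4.2 million; ~1 row each in items and item_ids
       11500 * 365 * 2 AS item_revisions_rows_per_year,  -- ≈ 8.4 million (2 rows per comment)
       1 * 365         AS low_estimate_mb_per_year,      -- ≈ 0.4 GB across all tables
       2 * 365         AS high_estimate_mb_per_year;     -- ≈ 0.7 GB across all tables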

I'm trying to review the schema, and at first glance it looks good. I need a bit of time to set it up locally and check the data in detail (or possibly check it in the beta cluster instead), and possibly come back with suggestions for improvements to the schema or the queries.

The only thing that needs fixing is that the index prefix must be the same as the field prefix (so discussiontools_item_pages_items_id_page_id -> itp_item_pages_items_id_page_id, or even better itp_items_id_page_id). We do have a test in core to ensure this stays consistent, but extensions are not wired into it yet (I have plans to make that possible; see T270033 for the core work). The good thing is that since this is not deployed to production yet and not in a release, you can simply change it in the code without doing an ALTER TABLE and all that jazz. The beta cluster needs fixing, but I can take care of that.
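
For concreteness, the resulting index definition would be equivalent to the following (the column list is assumed from the old index name; in practice this is just a rename in the extension's schema files, not a live ALTER TABLE):
-- Same index as before, only the name now carries the table's itp_ field prefix.
CREATE INDEX itp_items_id_page_id
    ON discussiontools_item_pages (itp_items_id, itp_page_id);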

> The only thing that needs fixing is that the index prefix must be the same as the field prefix

Done


Also, about doing something to avoid storing the long strings (which we talked about briefly, if you remember) – we definitely can't omit the username (it's generated from the page text rather than actor table), we definitely can't omit suffixes, and it seems impractical to omit the timestamps (it would make querying more difficult). But we are planning to make them somewhat shorter, if that's any consolation: T304595.

So I have tried installing it locally to play with the new schema, but DT requires VisualEditor and that needs Parsoid to work properly. Because of that, I have to speculate a bit here. Sorry if I'm missing something super obvious.

> Also, about doing something to avoid storing the long strings (which we talked about briefly, if you remember) – we definitely can't omit the username (it's generated from the page text rather than actor table), we definitely can't omit suffixes, and it seems impractical to omit the timestamps (it would make querying more difficult). But we are planning to make them somewhat shorter, if that's any consolation: T304595.

I can think of two notes here:

  • You do store the timestamp and username already. My issue is that the UUID repeats that; you could still store the timestamp and username in your tables (not the core ones) and build the UUID on the fly (unless we are in one of those edge cases you mentioned, where storing them would make sense).
  • Or you could store a shorter hash of the UUID you built (and basically use that as the UUID). That would have lots of benefits, including much faster lookups, since the data will be more evenly distributed (so the B-tree lookups in MySQL will be much faster), and it will be shorter as well. Imagine something like Flow topic names (which I know are not great, but at least they are mostly hidden from users). A rough sketch of both ideas follows.
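
A rough SQL-level sketch of both ideas (the itid_author / itid_timestamp / itid_parent_* columns do not exist in the proposed schema and are purely hypothetical, and the 16-character hash length is arbitrary):
-- (1) Hypothetical: store author/timestamp columns and rebuild the long item ID on the fly.
SELECT CONCAT('c-', itid_author, '-', itid_timestamp, '-',
              itid_parent_author, '-', itid_parent_timestamp) AS itemid
FROM discussiontools_item_ids;

-- (2) Hypothetical: store and index a short fixed-width hash of the full ID instead.
SELECT SUBSTRING(SHA1('c-Matma_Rex-2020-02-20T20:25:00.000Z-Matma_Rex-2020-02-20T20:17:00.000Z'),
                 1, 16) AS short_key;  -- 16 hex characters instead of a ~70-character string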