
Analyze permalinks table to see how many duplicates exist
Closed, Resolved · Public

Description

T315510 will start a maintenance script to populate the talk page comment database.

This task involves analyzing said database to learn what percentage of the comments stored in it are duplicates.

Knowing the answer to the above will enable us to estimate the likelihood that someone tapping/clicking a permalink will be directed to Special:GoToComment instead of being taken directly to the comment they expect to see.

With the probability described above in hand, we'll be able to decide whether adjustments need to be made to:

  • A) How we're generating permalinks, to lower the rate of duplicates
  • B) How the user experience looks/functions, to help people develop more accurate expectations about what will happen when they tap a permalink

Requirements

  • Once the talk page comment database contains a sufficiently large and representative number of comments, calculate the percentage of said comments that are duplicates of one another

Open questions

  • 1. When and how will we know the comment database is filled with a large and representative enough sample of comments for us to analyze its contents and draw conclusions from that analysis?

Done

  • Answers to all Open questions are documented
  • Requirements are met

Event Timeline

ppelberg moved this task from Backlog to Triaged on the DiscussionTools board.
ppelberg moved this task from Untriaged to Upcoming on the Editing-team board.

> what percentage of comments stored in the database are duplicates?

I started looking into this. It seems to be normal for medium to large wikis to have 5-20% duplicate comments. Most of the duplicates are "boring" bot messages or mass notifications, but some are "real" comments that we can't uniquely identify.

(Note that I'm counting each occurrence of a duplicated comment separately. If you count all of them as just one comment, the number comes out to 2-8%.)

On wikis with less human activity it can be even higher. On the Cebuano Wikipedia (infamous for its bot-created articles), it's 97.36%, because many of the bot-created articles have bot-created talk pages (example, example).
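
For concreteness, the two rates above could be computed along these lines. This is only a sketch against a hypothetical flattened table comments(page_id, item_name), with one row per comment occurrence, where item_name stands for the author + timestamp key used in permalinks; it is not our actual schema.

  -- Rate counting each occurrence of a duplicated comment separately
  -- (the 5-20% figure):
  SELECT 100.0 * SUM(CASE WHEN n > 1 THEN n ELSE 0 END) / SUM(n)
    AS pct_duplicate_occurrences
  FROM (
    SELECT item_name, COUNT(*) AS n
    FROM comments
    GROUP BY item_name
  ) grouped;

  -- Rate counting each set of duplicates as just one comment
  -- (the 2-8% figure):
  SELECT 100.0 * SUM(CASE WHEN n > 1 THEN 1 ELSE 0 END) / COUNT(*)
    AS pct_duplicate_names
  FROM (
    SELECT item_name, COUNT(*) AS n
    FROM comments
    GROUP BY item_name
  ) grouped;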

I'll run the queries on all wikis and make some kind of report later, because Special:FindComment is currently still disabled on many wikis, which makes it a real chore to spot-check the results (and to share interesting examples).

There are several kinds of comments that can't be uniquely identified by author and time in our database (not necessarily "duplicates", but let's use that as a shorthand):

  • Identical messages posted to many users at once
  • Multiple comments posted in one edit (or within a minute) on a single page
  • Similar comments posted on separate pages within a minute (e.g. closing multiple deletion discussions, or welcoming multiple users)
  • Serial comments posted by a bot (e.g. notices about broken external links)
  • Mishaps with mass replacements of signatures (example, example)
  • A particular deletion discussion on enwiktionary concerning 81 related pages that has apparently been archived to each of those pages' talk pages (example)

Queries I used:
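
A sketch of the general shape such a query might take, assuming the DiscussionTools persistent schema (the discussiontools_items / discussiontools_item_pages table and column names below are assumptions, and this is not necessarily one of the exact queries):

  -- Sketch only. Assumes discussiontools_items stores one row per distinct
  -- comment name (it_itemname, derived from author + timestamp) and
  -- discussiontools_item_pages links each item to the pages it occurs on.
  SELECT it.it_itemname, COUNT(*) AS occurrences
  FROM discussiontools_items it
  JOIN discussiontools_item_pages itp ON itp.itp_items_id = it.it_id
  GROUP BY it.it_itemname
  HAVING COUNT(*) > 1
  ORDER BY occurrences DESC
  LIMIT 50;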