
Run A/B test to evaluate impact of Reply tool
Closed, ResolvedPublic

Description

This test is intended to help us understand what impact the Reply Tool is having on Junior Contributors' likelihood to start (activation) and continue (retention) participating on Wikipedia talk pages.

Decision to be made

The decision this analysis is intended to help us make:
Should the Reply tool be offered to all people, at all wikis, as an opt-out user preference?

Hypotheses

To help evaluate the impact of the Reply tool, we would like to analyze whether adding a more intuitive workflow for replying to specific comments on Wikipedia talk pages:

ID | Hypothesis | Metric(s) for evaluation
KPI | ...causes a greater percentage of Junior Contributors to publish the comments they start, without a significant increase in disruption (see "Guardrail" below). | Comment completion rate: of the people who click the [ reply ] link (action = init), the percentage who successfully publish the comment they were drafting (action = saveSuccess).
Guardrail | ...does not cause a significant increase in the number of disruptive edits being made to talk pages. | The number of edits made to talk pages that are reverted within 48 hours; the number of editors who are blocked after making an edit to a talk page.
Curiosity #1 | ...causes a greater number of Junior Contributors to start participating productively on talk pages. | The number of distinct Junior Contributors who make at least one edit to a page in a talk namespace that is not reverted within 48 hours.
Curiosity #2 | ...causes a greater percentage of Junior Contributors to continue participating productively on talk pages. | The percentage of Junior Contributors who make at least one edit to a page in a talk namespace that is not reverted within 48 hours in each of the following intervals after their first edit: 2 to 7 days (the first week), 8 to 14 days (the second week), and 15 to 30 days (the third or fourth week).
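The Curiosity #2 retention windows can be sketched as a small computation over a table of unreverted talk-namespace edits. The column names and sample data below are hypothetical illustrations, not the actual schema used in the analysis.

```python
import pandas as pd

# Hypothetical input: one row per unreverted talk-namespace edit,
# with the editor's user id and the edit timestamp.
edits = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "timestamp": pd.to_datetime([
        "2021-02-12", "2021-02-15", "2021-03-01",   # user 1
        "2021-02-13", "2021-02-14",                 # user 2
        "2021-02-20",                               # user 3
    ]),
})

# Days elapsed since each user's first edit.
first_edit = edits.groupby("user_id")["timestamp"].transform("min")
days_since = (edits["timestamp"] - first_edit).dt.days

# The three retention windows from the measurement plan.
windows = {"2-7 days": (2, 7), "8-14 days": (8, 14), "15-30 days": (15, 30)}
n_users = edits["user_id"].nunique()
for label, (lo, hi) in windows.items():
    retained = edits.loc[days_since.between(lo, hi), "user_id"].nunique()
    print(f"{label}: {retained} of {n_users} users retained")
```

With the sample data, only user 1 has follow-up edits: one 3 days after the first edit (first window) and one 17 days after (third window).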

Decision matrix

ID | Scenario | Plan of action
1. | People are "meaningfully" more likely to publish edits using the Reply Tool than they are using full-page editing. | Continue with plans to make the Reply Tool available at all Wikipedias, by default. See T269062 for more detail.
2. | People are "meaningfully" less likely to publish edits using the Reply Tool than they are using full-page editing. | Investigate where within the Reply Tool comment funnel people are dropping off and what could be contributing to this drop-off. In parallel, we will pause plans to make the Reply Tool available at all Wikipedias by default.
3. | People are as likely to publish edits using the Reply Tool as they are using full-page editing. | Continue with plans to offer the feature as an opt-out preference at all Wikipedias, considering we have meaningful qualitative feedback and quantitative data suggesting the tool leads people to find participating on talk pages easier and more efficient.[ii]

Open questions

  1. Should edits to non-talk namespace pages be included in this analysis?
  2. What wikis should be included in the test? See: T267379.

Done


i. Editor experience buckets

  • Logged out
  • 0 cumulative edits
  • 1-4 cumulative edits
  • 5-99 cumulative edits
  • 100-999 cumulative edits
  • 1000+ cumulative edits
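The buckets above can be expressed as a small helper that maps a cumulative edit count to its bucket label (logged-out users would be handled separately, before any count is available). This is an illustrative sketch, not code from the analysis.

```python
def experience_bucket(cumulative_edits: int) -> str:
    """Map a registered user's cumulative edit count to an experience bucket."""
    if cumulative_edits == 0:
        return "0 cumulative edits"
    if cumulative_edits <= 4:
        return "1-4 cumulative edits"
    if cumulative_edits <= 99:
        return "5-99 cumulative edits"
    if cumulative_edits <= 999:
        return "100-999 cumulative edits"
    return "1000+ cumulative edits"

print(experience_bucket(3))    # 1-4 cumulative edits
print(experience_bucket(250))  # 100-999 cumulative edits
```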

ii. An example of said "quantitative data": T247139

Related Objects

Event Timeline

Task description update
I've added a Hypotheses section to the task description, which contains the metrics we will use to compare the two test groups and, by extension, determine the impact the Reply Tool is having on Junior Contributor activation and retention.

Note: the above is the outcome of the conversation @MNeisler and I had on 4-Nov wherein we revisited the Reply Tool measurement plan and identified the metrics we will prioritize as part of this A/B test.

Task description update
I've updated the task description to reflect the updates to the test KPI @MNeisler and I decided upon during the meeting we had on 2-December.

ppelberg updated the task description.

Deployment update
The A/B test officially started today, 11-February-2021. [i]

This means the analysis can "start" as early as 25-February per the conversation @MNeisler and I had yesterday (10-February).


i. T273554#6825381

I read the A/B test has officially started. The task T273406 hasn't been updated yet. Is it okay if I inform Dutch Wikipedia?

I read the A/B test has officially started...Is it okay if I inform Dutch Wikipedia?

I'm sorry for the delayed response, @AdHuikeshoven. Yes, it is okay to inform Dutch Wikipedia.

Note: it looks like @Whatamidoing-WMF has already made an announcement at nl.wiki per T273406#6827497.

Question: should the KPI be “percentage of people” or “percentage of edits/posts”? If a person makes a mixture of successful and unsuccessful posts, do they get counted as a success or a failure overall?

You could do something funky like number of people weighted by their personal success rate. Person A posts 30 comments, all successful, and is scored as 1; person B makes two successful posts and also scores 1; person C, with one success and one failure, scores 0.5. (It's people-focussed, and avoids the problem with edit counts where prolific commenters would skew the results.)

Or maybe you’re only interested in their first n attempts before they learn the ropes? But if someone keeps trying and gets better with time (instead of giving up) then that retention factor is important to count in.
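The "people weighted by personal success rate" scheme from the comment above can be sketched in a few lines; each person contributes one score in [0, 1], so prolific commenters cannot dominate the aggregate. The names and counts are the hypothetical examples from the comment.

```python
# Per-person attempt counts from the example: A (30/30), B (2/2), C (1/2).
attempts = {
    "A": {"success": 30, "fail": 0},
    "B": {"success": 2,  "fail": 0},
    "C": {"success": 1,  "fail": 1},
}

# Each person is scored by their personal success rate.
scores = {
    person: c["success"] / (c["success"] + c["fail"])
    for person, c in attempts.items()
}

# The group metric is the unweighted mean of per-person scores.
group_score = sum(scores.values()) / len(scores)

print(scores)       # {'A': 1.0, 'B': 1.0, 'C': 0.5}
print(group_score)  # (1.0 + 1.0 + 0.5) / 3 ≈ 0.833
```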

Meta
Per the conversation @MNeisler and I had today, we are going to break this analysis into two parts:

  1. Report on the KPIs
    • Components: KPI and Guardrail metrics defined in the Hypotheses section of the task description.
  2. Full analysis
    • Components: Curiosity metrics defined in the Hypotheses section of the task description.
MNeisler triaged this task as Medium priority.Mar 2 2021, 9:34 PM
MNeisler moved this task from Next 2 weeks to Doing on the Product-Analytics (Kanban) board.

Question: should the KPI be “percentage of people” or “percentage of edits/posts”? If a person makes a mixture of successful and unsuccessful posts, do they get counted as a success or a failure overall?

@Pelagic: you identified the part of the test design @MNeisler and I discussed most.

Ultimately, we decided to take, as you described it, a "people-focused" approach to evaluate the impact of the Reply Tool. The reason we decided to take this approach instead of an "edit-focused" approach was to lower the likelihood that the behavior of a small and non-representative group of people could unduly skew the test results.

I should mention that in taking the approach that we have, we will lose insight into, as you described, the number of edit attempts that comprise a given individual's success rate. We are comfortable accepting this tradeoff because, for right now, we are more interested in learning whether people reach the point of successfully posting a comment than we are in the effort (read: number of attempts) required to reach that success.

Please tell me if anything above prompts new thoughts/questions or leaves anything about what you asked in T252057#6827836 unanswered.

cc @MNeisler who can: A) correct any details I might've misconstrued and/or B) offer additional context.

Here is a quick look at the overall edit completion rate for replies on talk pages by editor type. Data is based on edit attempts by users in the AB test recorded in EditAttemptStep from 12 February 2021 through 10 March 2021. Note: This data currently reflects logged-in users across all experience levels. I will review edit completion rate by experience level when I complete the preliminary analysis on KPIs.

There's variance in the percent difference, but editors using the Reply Tool have a higher completion rate than editors using existing reply workflows ("Page Editing") across all wikis participating in the A/B test.

I am currently doing some research into different models to correctly infer the impact of the reply tool on these completion rates. Such a model can account for random effects due to user and wiki, as well as variables such as user experience level.

I'll follow-up with details regarding the estimated impact and further insights when available.
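One way to fit a model of this shape is a mixed-effects logistic regression with a random intercept per wiki; the sketch below uses statsmodels' Bayesian mixed GLM. The column names and toy data are assumptions for illustration, not the actual schema or model used in the analysis.

```python
import pandas as pd
import statsmodels.genmod.bayes_mixed_glm as bmg

# Toy data: one row per user in the test. `completed` is whether the user
# published at least one comment; `reply_tool` is their test bucket.
df = pd.DataFrame({
    "completed":  [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0],
    "reply_tool": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
    "wiki":       ["frwiki", "frwiki", "eswiki", "eswiki", "itwiki", "itwiki",
                   "frwiki", "eswiki", "itwiki", "frwiki", "eswiki", "itwiki"],
})

model = bmg.BinomialBayesMixedGLM.from_formula(
    "completed ~ reply_tool",   # fixed effect: test bucket
    {"wiki": "0 + C(wiki)"},    # random intercept per wiki
    df,
)
result = model.fit_vb()         # variational Bayes fit
print(result.summary())
```

With a per-user random effect as well, the model moves closer to the hierarchical structure described above; on real data that term would absorb repeated attempts by the same user.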

Overall Edit Completion Rate by Reply Editor Type

reply_type   | n_users | n_users_completed | completion_rate
Page Editing |    2646 |              1361 |             51%
Reply Tool   |    2037 |              1400 |             69%

Edit Completion Rate by Participating Wiki and Reply Editor Type

wiki    | reply_type   | n_users | n_users_completed | completion_rate
afwiki  | Page Editing |       4 |                 3 |             75%
afwiki  | Reply Tool   |       1 |                 1 |            100%
arzwiki | Page Editing |       8 |                 2 |             25%
arzwiki | Reply Tool   |       3 |                 2 |             67%
bnwiki  | Page Editing |      28 |                16 |             57%
bnwiki  | Reply Tool   |      18 |                14 |             78%
eswiki  | Page Editing |     423 |               198 |             47%
eswiki  | Reply Tool   |     322 |               230 |             71%
fawiki  | Page Editing |     134 |                71 |             53%
fawiki  | Reply Tool   |      89 |                51 |             57%
frwiki  | Page Editing |     394 |               228 |             58%
frwiki  | Reply Tool   |     422 |               325 |             77%
hewiki  | Page Editing |     199 |               119 |             60%
hewiki  | Reply Tool   |     104 |                62 |             60%
hiwiki  | Page Editing |      27 |                 7 |             26%
hiwiki  | Reply Tool   |      13 |                 6 |             46%
idwiki  | Page Editing |      64 |                16 |             25%
idwiki  | Reply Tool   |      25 |                15 |             60%
itwiki  | Page Editing |     398 |               220 |             55%
itwiki  | Reply Tool   |     355 |               239 |             67%
jawiki  | Page Editing |     218 |                97 |             44%
jawiki  | Reply Tool   |     121 |                67 |             55%
kowiki  | Page Editing |      41 |                17 |             41%
kowiki  | Reply Tool   |      19 |                 9 |             47%
nlwiki  | Page Editing |     109 |                56 |             51%
nlwiki  | Reply Tool   |      77 |                65 |             84%
plwiki  | Page Editing |     148 |                79 |             53%
plwiki  | Reply Tool   |     126 |                72 |             57%
ptwiki  | Page Editing |     144 |                65 |             45%
ptwiki  | Reply Tool   |     179 |               142 |             79%
swwiki  | Page Editing |       3 |                 2 |             67%
swwiki  | Reply Tool   |       1 |                 1 |            100%
thwiki  | Page Editing |      19 |                10 |             53%
thwiki  | Reply Tool   |       9 |                 7 |             78%
ukwiki  | Page Editing |     114 |                75 |             66%
ukwiki  | Reply Tool   |      50 |                34 |             68%
viwiki  | Page Editing |      53 |                22 |             42%
viwiki  | Reply Tool   |      23 |                12 |             52%
zhwiki  | Page Editing |     122 |                60 |             49%
zhwiki  | Reply Tool   |      83 |                46 |             55%

Notes:
(1) Reply Tool events sampled at 100%. Page Editing events (event.integration = 'page') sampled at a rate of 1/16, or 6.25%.
(2) Excludes edits to create new sections or pages.
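The per-user completion rate behind these tables can be sketched from init/saveSuccess events. The column names and sample rows below are hypothetical stand-ins for the EditAttemptStep data, not the actual schema; note that because the metric is a rate within each bucket, the unequal sampling (100% vs. 1/16) affects the user counts but not the rates themselves.

```python
import pandas as pd

# Hypothetical event log: one row per funnel event.
events = pd.DataFrame({
    "user_id":    [1, 1, 2, 3, 3, 4, 4],
    "reply_type": ["Reply Tool", "Reply Tool", "Reply Tool",
                   "Page Editing", "Page Editing",
                   "Page Editing", "Page Editing"],
    "action":     ["init", "saveSuccess", "init",
                   "init", "saveSuccess", "init", "init"],
})

# A user "completes" if at least one of their attempts reaches saveSuccess.
per_user = (
    events.assign(saved=events["action"].eq("saveSuccess"))
          .groupby(["reply_type", "user_id"])["saved"].any()
)
rates = per_user.groupby("reply_type").mean()
print(rates)
# Page Editing: 1 of 2 users completed -> 0.5
# Reply Tool:   1 of 2 users completed -> 0.5
```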

Here is the draft report on the KPIs and guardrails for the Reply Tool AB Test for review.

A few highlights and key findings from the preliminary analysis:

  • Overall, across all participating Wikipedias, Junior Contributors had a significantly higher comment completion rate using the reply tool compared to using non-reply tool workflows. 72.9% of all Junior Contributors that made a comment attempt were able to successfully publish at least 1 comment with the reply tool, while only 27.6% of Junior Contributors successfully saved a non-reply tool comment. This represents a 164% observed increase in comment completion rate.
  • On a per-Wikipedia basis, the percent increases vary; however, Junior Contributors had a higher comment completion rate using the reply tool compared to non-reply tool editor interfaces on every participating Wikipedia. Indonesian, Japanese, Dutch, and Spanish Wikipedias saw the highest percent increases in comment completion rates with the reply tool. We observed the two lowest percent increases in the comment completion rate for Persian (42% increase) and Hebrew Wikipedias (55% increase). Note: These are both right-to-left languages, which might be worth exploring as a potential reason for the lower increases observed on these wikis.
  • To infer the impact of the reply tool on these comment completion rates, we used a Hierarchical Regression Model which accounts for any random effects due to the user and wiki. Based on estimates from the model, we found that there is an average 45.5% increase (maximum 49.4% increase) in the probability of a Junior Contributor publishing a comment when they use the reply tool instead of a non-reply tool editing interface.
  • I'm currently looking into the impact of experience level on the use of the reply tool. When looking at comment completion rates across all contributors' experience levels, the comment completion rates using the reply tool were not too different from the rates found for Junior Contributors. Overall, 68.9% of contributors across all experience levels were able to publish at least one comment using the reply tool. However, experience level appears to have a much more significant impact on the ability of a Contributor to publish a comment using non-reply tool methods. 57.8% of Contributors across all experience levels were able to complete at least one comment using non-reply tool editing interfaces, compared to only 27.6% of Junior Contributors. Further analysis will help quantify and confirm the impact.
  • Guardrail Analysis: Initial data does not indicate any significant disruption caused by the reply tool. For Junior Contributors using the reply tool, 1.65% of their comments were reverted within 48 hours, and only 1.81% of those contributors were blocked. (NOTE: The guardrail analysis metrics rely on data available in mediawiki_history, which is updated monthly. At the time of this analysis, March 2021 editing attempts had not yet been logged, so the data in the section below reflects edits recorded from the start of the AB Test on 2021 February 12 through the end of February. I will update the data when available but do not anticipate any significant changes to the reported metrics.)
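The 48-hour revert guardrail can be sketched as a comparison of edit and revert timestamps, loosely modeled on the mediawiki_history revert fields; the column names and rows here are hypothetical illustrations.

```python
import pandas as pd

# Hypothetical edit log: revert_ts is NaT when the edit was never reverted.
edits = pd.DataFrame({
    "edit_ts":   pd.to_datetime(["2021-02-12 10:00", "2021-02-12 11:00",
                                 "2021-02-13 09:00", "2021-02-14 12:00"]),
    "revert_ts": pd.to_datetime(["2021-02-12 15:00", None,
                                 "2021-02-16 09:00", None]),
})

delta = edits["revert_ts"] - edits["edit_ts"]   # NaT if never reverted
within_48h = delta <= pd.Timedelta(hours=48)    # NaT compares as False
revert_rate = within_48h.mean()
print(f"{revert_rate:.2%} of comments reverted within 48 hours")
```

In the sample, one edit is reverted after 5 hours (counts), one after 72 hours (does not count), and two are never reverted, giving a 25% guardrail rate.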

Remaining TODOs for Final Report:

  • Update Guardrail Analysis once the March 2021 snapshot of mediawiki_history is available.
  • Complete analysis of completion rates across all contributors' experience levels.
  • Complete analysis of the curiosity metrics defined in the Hypotheses section of the task description.
  • Final cleanup of the report for publishing: formatting of charts and tables, finalize data synopsis, add a high-level summary of findings, etc.

@ppelberg - Please let me know if you have any questions or changes.

Source Code File

To show I read your report: in the fifth bullet it reads "reflect sedits" instead of "reflects edits". Nice results!

Here is the updated report.

Some key insights and conclusions:

  • Junior contributors had a much higher comment completion rate using the reply tool compared to page editing.
  • Using a regression model, we confirmed there is an average 45% increase in the probability of a Junior Contributor publishing a comment when they switch from using page editing to the reply tool. This model accounted for any random effects by the user and the wiki.
  • We found experience level has a significant effect on the comment completion rate of a contributor. The comment completion rate for Junior Contributors (defined as having under 100 edits) using page editing is significantly lower than the comment completion rate observed for non-junior contributors (defined as having over 100 edits) using page editing. However, using the reply tool, Junior Contributors' comment completion rate was roughly the same as Non-Junior Contributors' comment completion rate using page editing.
  • Overall, across all participating Wikipedias, we observed a 79.5% decrease in the revert rate for comments Junior Contributors made with the reply tool compared to page editing. The reply tool seems to enable Junior Contributors not only to successfully complete a comment but also to reduce the number of errors in the published comment that might lead to it being reverted.
  • In addition to the decrease in revert rate, under 2% of Junior Contributors using the reply tool were blocked after making a comment on a talk page, indicating that the tool did not result in any significant increase in disruptive edits to talk pages.

@ppelberg - Please let me know if you have any questions.

Codebase

Here is the updated report.

Excellent

  • Overall, across all participating Wikipedia, we observed a 79.5% decrease in the revert rate for comments made with the reply tool compared to page editing. The reply tool seems to enable Junior Contributors to not only successfully complete a comment but reduce the number of errors in the published comment that might lead to the comment being reverted.
  • @MNeisler: to be doubly sure, would it be more accurate to say, "...we observed a 79.5% decrease in the revert rate for comments Junior Contributors made..." vs. "...we observed a 79.5% decrease in the revert rate for comments made..."?

@ppelberg - Please let me know if you have any questions.

Before resolving this task, can you please review the update I've posted on the Reply Tool project page [i] to ensure it is accurate?


i. https://www.mediawiki.org/w/index.php?title=Talk_pages_project/Replying&type=revision&diff=4542988&oldid=4541142&diffmode=source

@MNeisler: to be doubly sure, would it be more accurate to say, "...we observed a 79.5% decrease in the revert rate for comments Junior Contributors made..." vs. "...we observed a 79.5% decrease in the revert rate for comments made..."?

Yes, that's correct. I've added text to that statement to clarify.

Before resolving this task, can you please review the update I've posted on the Reply Tool project page [i] to ensure it is accurate?

Yes, I'll plan to review the update on Monday and post an update to this ticket once complete.

@ppelberg - I've reviewed the Reply Tool project page and made some edits [i] to clarify the results.


i. https://www.mediawiki.org/w/index.php?title=Talk_pages_project/Replying&diff=4546318&oldid=4542988

Great – thank you, @MNeisler. And with this edit, I think this task can be resolved.