Some things to work out:
- Where on stat1008 should the canonical repo live, or does that matter? (So far various engineers have run the pipeline from their home directories; that seems fine as long as the repo is up-to-date.)
- How often should we re-run the pipeline for an existing dataset? Re-runs are needed because we currently query datasets by page title rather than page ID, so renamed pages (e.g. T274195) would otherwise drop out of the results. Should we automate the process via a cronjob? (A sketch of what that could look like follows this list.)
- What QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but does not publish them to analytics.wikimedia.org; publishing is intentionally a separate step. (A possible pre-publish sanity check is also sketched below.)
- [your questions / concerns here]
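
If we do go the cronjob route, the automation could be as small as the entry below. This is only a sketch: the repo path, schedule, and log location are placeholders, not decisions (and depend on the "where should the canonical repo live" question above).

```
# Hypothetical crontab entry on stat1008: re-run the pipeline monthly.
# Repo path, schedule, and log file are placeholders.
0 4 1 * * cd /srv/pipeline-repo && ./run-pipeline.sh >> "$HOME/pipeline-rerun.log" 2>&1
```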
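On the QA question, since publishing is already a deliberate separate step, a lightweight pre-publish sanity check might be enough to start with. The sketch below assumes the generated datasets land as files in an output directory; the glob and size threshold are made up for illustration.

```bash
#!/bin/bash
# Hypothetical pre-publish QA check: fail if any generated dataset is
# missing or suspiciously small. Output glob and threshold are placeholders.
set -euo pipefail

MIN_BYTES=1024
for f in output/*.tsv; do
    size=$(stat -c%s "$f")
    if [ "$size" -lt "$MIN_BYTES" ]; then
        echo "QA FAIL: $f is only ${size} bytes" >&2
        exit 1
    fi
done
echo "QA OK: all datasets present and non-trivial"
```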