Establish processes for running the dataset pipeline
Open, Low, Public

Description

Some things to work out:

  • Where on stat1008 should the canonical repo live, or does that matter? (Currently various engineers have run the pipeline from their home directories; this doesn't seem like a problem so long as the repo is up to date.)
  • How often should we re-run the pipeline for an existing dataset? This is needed because page titles get renamed (for example T274195), and we currently use page title instead of page ID for querying datasets. Should we automate the process via a cronjob?
  • What type of QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but doesn't publish them to analytics.wikimedia.org; publishing intentionally requires a separate step.
  • [your questions / concerns here]

Event Timeline

We should answer these questions and document them on https://wikitech.wikimedia.org/wiki/Add_Link

This also relates to plans for scaling the service to new wikis.

Some thoughts from my side.

Where on stat1008 should the canonical repo live, or does that matter?

  • Not sure there needs to be a dedicated location, since the data will be moved to the published datasets; but I don't have a good sense of what the recommended setup for such a pipeline is.

How often should we re-run the pipeline for an existing dataset? Should we automate the process via a cronjob?

  • Less often than once a month, since we use the dumps for training the pipeline and these are generated on a monthly basis.
  • More often than once a year, say every 3 months, though this is based on speculation/intuition. While we will always generate link recommendations for the current version of a page, the underlying data used for training will change as links (and pages) are added or removed. I don't think we have a good understanding of what the rate of change is, or if and how much it would impact model performance; this is an interesting research question for the future.
  • One option would be to monitor user feedback on the link recommendations and check whether the rejection rate of suggested links increases over time.

What type of QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but doesn't publish them to analytics.wikimedia.org; publishing intentionally requires a separate step.

  • We could make a query on a sample page (or sample text) where we would expect some links to be recommended. Requiring a specific output might be tricky, since the page itself and the model can change, and thus also its output. However, we could require at least one recommended link, to make sure the datasets/model work.
  • The evaluation of the backtesting data provides numbers for precision and recall on a test dataset with existing links. These should match (at least roughly, say within 10 or 20%) what we expect from the previous analysis done before deployment (for example here).
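
To make the first bullet concrete, here is a minimal smoke-test sketch. The service URL and the name of the response field are placeholders I'm assuming for illustration, not the confirmed API shape:

  # Smoke test: ask the service for recommendations on a sample page and
  # require at least one suggested link before signing off on the dataset.
  WIKI="cswiki"
  TITLE="Sample_page"   # a page where we would expect at least one recommendation
  # Placeholder endpoint; substitute the real service URL.
  URL="https://linkrecommendation.example.org/v1/linkrecommendations/${WIKI}/${TITLE}"
  # Assumes the response contains a "links" array; adjust to the real schema.
  COUNT=$(curl -sf "$URL" | jq '.links | length')
  if [ "${COUNT:-0}" -ge 1 ]; then
    echo "OK: ${COUNT} link(s) recommended for ${TITLE} on ${WIKI}"
  else
    echo "FAIL: no links recommended for ${TITLE} on ${WIKI}" >&2
    exit 1
  fi
  # For the second bullet, the backtesting precision/recall could be compared
  # against the pre-deployment numbers and flagged if they drop by more than
  # the agreed tolerance (say 10-20%).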

Where on stat1008 should the canonical repo live, or does that matter?

  • Not sure there needs to be a dedicated location, since the data will be moved to the published datasets; but I don't have a good sense of what the recommended setup for such a pipeline is.

I think we should consolidate on always running ./run-pipeline.sh and the other scripts from @MGerlach's repo, which is at /home/mgerlach/REPOS/mwaddlink-gerrit. (@MGerlach, there are three other repos with mwaddlink in the title in that directory; maybe we should rename those or put them into a subdirectory so it's more obvious that this is the canonical location?)

I've updated the wikitech docs with the repo location.
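
For reference, a typical run from the canonical checkout would look roughly like the sketch below; the run-pipeline.sh arguments are illustrative only (check the repo's README for the actual usage):

  # Run the dataset pipeline from the canonical checkout on stat1008.
  cd /home/mgerlach/REPOS/mwaddlink-gerrit
  git pull                   # make sure the checkout is up to date first
  ./run-pipeline.sh cswiki   # illustrative arguments: generate the dataset for one wiki
  # Publishing to analytics.wikimedia.org intentionally stays a separate,
  # manual step (see the QA question below).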

How often should we re-run the pipeline for an existing dataset? Should we automate the process via a cronjob?

  • Less often than once a month, since we use the dumps for training the pipeline and these are generated on a monthly basis.
  • More often than once a year, say every 3 months, though this is based on speculation/intuition. While we will always generate link recommendations for the current version of a page, the underlying data used for training will change as links (and pages) are added or removed. I don't think we have a good understanding of what the rate of change is, or if and how much it would impact model performance; this is an interesting research question for the future.
  • One option would be to monitor user feedback on the link recommendations and check whether the rejection rate of suggested links increases over time.

Why don't we go with every 4 months? @MMiller_WMF, does this sound OK to you?
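
If we do automate this, a hypothetical crontab entry on stat1008 could look like the line below (02:00 on the 1st of every fourth month, i.e. January/May/September); the path, wiki, and log location are illustrative only:

  # m  h  dom  mon  dow  command
  0  2  1  */4  *  cd /home/mgerlach/REPOS/mwaddlink-gerrit && ./run-pipeline.sh cswiki >> "$HOME/mwaddlink-cron.log" 2>&1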

What type of QA, if any, should be done after generating a new dataset? run-pipeline.sh generates the datasets but doesn't publish them to analytics.wikimedia.org; publishing intentionally requires a separate step.

  • We could make a query on a sample page (or sample text) where we would expect some links to be recommended. Requiring a specific output might be tricky, since the page itself and the model can change, and thus also its output. However, we could require at least one recommended link, to make sure the datasets/model work.
  • The evaluation of the backtesting data provides numbers for precision and recall on a test dataset with existing links. These should match (at least roughly, say within 10 or 20%) what we expect from the previous analysis done before deployment (for example here).

That sounds good. Is this something that Research or the Growth team should own? Maybe when we are refreshing the models/datasets, we could post some of this data in the task to get sign-off from at least one person in Research and one from Growth? cc @MMiller_WMF & @MGerlach

@MGerlach is transitioning maintenance of the link recommendation algorithm from Research to ML-team.

@calbon and I will be the ML-team contacts on this. Thanks @MGerlach!

Sounds good. @kevinbazira, do you need any information from the Growth team on this? https://wikitech.wikimedia.org/wiki/Add_Link hopefully describes what we are doing now, and the README has further documentation, though it could use some improvement (T290067: Improve the documentation in the README of the mwaddlink-repo).

Will your team also maintain the existing service deployment to Kubernetes? If so, T278083: Define SLIs/SLOs for link recommendation service is probably one for us to discuss as well.

I'm going to remove this from the Growth team's work under the assumption that Machine-Learning-Team will handle it from here on out. Please let us know if you have questions!

How often should we re-run the pipeline for an existing dataset?

I noticed today that cswiki's dataset was last generated in October 2021.

Should we re-run the pipelines every 6 months or so? @MGerlach, what do you think makes sense?

@kostajh I agree that we should re-run the pipelines after some time. If possible, updating every 6 months seems reasonable (though I don't have any quantitative insight into how quickly the model gets outdated). But the almost 1.5 years since October 2021 seems too long.

That sounds good; I made T327212: Establish process for periodically refreshing link recommendation models to discuss further.
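
In the meantime, something like the following could flag datasets that have gone stale; the directory is a placeholder for wherever the generated per-wiki datasets actually live on stat1008:

  # List dataset files last modified more than ~6 months (180 days) ago.
  DATASET_DIR="/home/mgerlach/REPOS/mwaddlink-gerrit/data"   # placeholder path
  find "$DATASET_DIR" -maxdepth 2 -type f -mtime +180 -printf '%TY-%Tm-%Td  %p\n' | sort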

@kevinbazira: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

Growth is working on surfacing link recommendations in new ways (T362584), so I'm trying to get a grasp on how this service is evolving. Where can I get insight into the latest datasets?

For now, I've only found the 2021-06 dataset for eswiki from the link above: https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/eswiki/
But that's in a folder called "one-off".
Is that the latest dataset used?

@kevinbazira might be able to provide pointers to the latest datasets. He generated the models/datasets for most (if not all) of the wikis in which the feature is deployed (e.g. T308144#8853071).

I have been working on setting up a better, more scalable pipeline to generate the models/datasets (T361926) as part of a project to improve the multilingual support of the add-a-link models (including reducing the overall number of models to maintain). However, this is not finished yet, as we haven't validated the output of the pipeline sufficiently.

@Michael I am curious to understand your use case better: are you looking for a specific dataset, or do you want to update the datasets for all models (or something different)? I am happy to help.

Right now, I just want to understand how the process here works and how I can get insight into its current state. My core motivations:

  1. improving my mental model of the link-recommendation pipeline (making "the process", "the dataset" and "the ML model" a bit less of a black box).
  2. thinking ahead about upcoming communications around link recommendations as we roll out this feature both in new ways (recommendations right inside the article while reading) and to new audiences (most notably enwiki). It would probably be nice if we could say "Those recommendations came from a model trained on wiki articles up to 2024-XX-YY".

Down the line, I wonder if it would make sense to look at some reverts, or maybe particular bad recommendations highlighted by the community, and see if we could add them to the training data and/or test data as counter-examples?

Hi @Michael, the latest datasets in use can be found here. Additionally, here is a report of all the add-a-link models trained and published so far, in 18 batches.

These models were all trained using the old pipeline, which you can take a look at here.

As @MGerlach mentioned, there is ongoing work to:

  • Improve support for languages whose add-a-link models were not published
  • Train one language-agnostic model, which will simplify maintenance by replacing ~300+ language-specific models