fkaelin
User

User Details

User Since
Nov 12 2020, 6:16 PM (211 w, 3 d)
Availability
Available
LDAP User
Fabian Kaelin
MediaWiki User
FKaelin (WMF) [ Global Accounts ]

Recent Activity

Tue, Nov 26

fkaelin added a project to T380874: Incremental HTML dataset to support "Who are moderators" SDS 1.2.3: Data-Engineering.
Tue, Nov 26, 4:46 PM · Data-Engineering, Research
fkaelin updated the task description for T378761: HTML diff dataset for SDS 1.2.3.
Tue, Nov 26, 2:25 PM · Research-engineering, Research
fkaelin added a parent task for T380871: Create one-off HTML dataset for "Who are moderators" SDS 1.2.3: T378761: HTML diff dataset for SDS 1.2.3.
Tue, Nov 26, 2:25 PM · Research, Research-engineering
fkaelin added a parent task for T380874: Incremental HTML dataset to support "Who are moderators" SDS 1.2.3: T378761: HTML diff dataset for SDS 1.2.3.
Tue, Nov 26, 2:25 PM · Data-Engineering, Research
fkaelin added subtasks for T378761: HTML diff dataset for SDS 1.2.3: T380871: Create one-off HTML dataset for "Who are moderators" SDS 1.2.3, T380874: Incremental HTML dataset to support "Who are moderators" SDS 1.2.3.
Tue, Nov 26, 2:25 PM · Research-engineering, Research
fkaelin created T380874: Incremental HTML dataset to support "Who are moderators" SDS 1.2.3.
Tue, Nov 26, 2:23 PM · Data-Engineering, Research
fkaelin moved T380871: Create one-off HTML dataset for "Who are moderators" SDS 1.2.3 from Backlog to In Progress on the Research board.
Tue, Nov 26, 2:06 PM · Research, Research-engineering
fkaelin added a project to T380871: Create one-off HTML dataset for "Who are moderators" SDS 1.2.3: Research.
Tue, Nov 26, 2:06 PM · Research, Research-engineering
fkaelin changed the status of T380871: Create one-off HTML dataset for "Who are moderators" SDS 1.2.3 from Open to In Progress.
Tue, Nov 26, 2:06 PM · Research, Research-engineering
fkaelin created T380871: Create one-off HTML dataset for "Who are moderators" SDS 1.2.3.
Tue, Nov 26, 2:05 PM · Research, Research-engineering

Mon, Nov 25

fkaelin created T380773: Time-based partitioning for wikitext history for Dumps2 .
Mon, Nov 25, 4:52 PM · Data-Engineering

Tue, Nov 5

AndrewTavis_WMDE awarded T348999: Add linter and formatter to wmfdata-python (and link check) a Party Time token.
Tue, Nov 5, 10:29 AM · Patch-For-Review, Wikidata Analytics, Movement-Insights, Wikidata, Data-Engineering, Wmfdata-Python

Mon, Nov 4

isarantopoulos awarded T377496: Phase 1: LLM inference - base metrics a Yellow Medal token.
Mon, Nov 4, 9:06 AM · Research-engineering, Research

Oct 31 2024

fkaelin added a comment to T378761: HTML diff dataset for SDS 1.2.3.

Adding a quick snippet to look at the percentage of revisions that have the previous revision as their parent. Generally above 99%, for enwiki it is 99.98%.
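
A minimal sketch of such a check in plain Python, not the snippet from the task itself. It assumes each revision is a (page_id, rev_id, parent_rev_id) tuple and that revision ids increase chronologically within a page; the function name and the tuple layout are illustrative assumptions.

```python
from collections import defaultdict


def share_with_previous_as_parent(revisions):
    """Fraction of non-first revisions whose parent is the immediately
    preceding revision of the same page."""
    by_page = defaultdict(list)
    for page_id, rev_id, parent_id in revisions:
        by_page[page_id].append((rev_id, parent_id))

    matching = 0
    total = 0
    for revs in by_page.values():
        # Assumption: ordering by rev_id approximates chronological order.
        revs.sort(key=lambda r: r[0])
        for (prev_id, _), (_, parent_id) in zip(revs, revs[1:]):
            total += 1
            if parent_id == prev_id:
                matching += 1
    return matching / total if total else 0.0


# Hypothetical toy data: the last revision of page 2 points at an older parent.
example = [
    (1, 10, None), (1, 11, 10), (1, 12, 11),
    (2, 20, None), (2, 21, 20), (2, 22, 20),
]
print(f"{share_with_previous_as_parent(example):.2%}")
```

On the real tables the same logic would more likely be expressed as a window function over the revision history, but that is outside the scope of this sketch.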

Oct 31 2024, 8:40 PM · Research-engineering, Research
fkaelin created T378761: HTML diff dataset for SDS 1.2.3.
Oct 31 2024, 7:48 PM · Research-engineering, Research
fkaelin added a comment to T366369: MaxMind seems to be mapping the same IP to different countries.

The snippet pasted above now returns the same maxmind metadata for all hosts the job ran on: 2024-10-29 19:58:23.

Oct 31 2024, 1:43 AM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Data-Engineering

Oct 29 2024

fkaelin added a comment to T377602: Validate that we can submit Spark jobs with Skein in Kubernetes .

Agreed that the SimpleSkeinOperator on kubernetes airflow is not needed anymore (and that the code should use kubernetes native operator instead).

Oct 29 2024, 5:18 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
fkaelin added a comment to T289532: Add more languages to Wikipedia Clickstream.

This dataset has been reviewed and approved by privacy and legal (details on asana). Note that since the clickstream dataset is older than the L3SC process, both the dataset itself and the expansion of the set of languages were reviewed and approved.

Oct 29 2024, 12:36 PM · Data-Engineering (Q2 2024 October 1st - December 31th), Data Products, Privacy Engineering, Data Pipelines, Epic

Oct 28 2024

fkaelin moved T377594: Fix Dumps - errors exporting good revisions from Backlog to Watching on the Research board.
Oct 28 2024, 3:56 PM · MW-1.44-notes (1.44.0-wmf.3; 2024-11-12), wmde-wikidata-tech, Wikidata, Wikipedia-Android-App-Backlog, Research, Dumps-Generation, Data-Platform
fkaelin added a project to T377594: Fix Dumps - errors exporting good revisions: Research.
Oct 28 2024, 3:55 PM · MW-1.44-notes (1.44.0-wmf.3; 2024-11-12), wmde-wikidata-tech, Wikidata, Wikipedia-Android-App-Backlog, Research, Dumps-Generation, Data-Platform
fkaelin moved T305688: Make HTML Dumps available in hadoop from Support Needed to Watching on the Research board.
Oct 28 2024, 3:55 PM · Data-Engineering, Research, Structured-Data-Backlog
fkaelin moved T356701: Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables from Backlog to Watching on the Research board.
Oct 28 2024, 3:54 PM · Research, Product-Analytics, Movement-Insights, Temporary accounts, Data Products, Data-Engineering, Data-Platform
fkaelin added a project to T356701: Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables: Research.
Oct 28 2024, 3:54 PM · Research, Product-Analytics, Movement-Insights, Temporary accounts, Data Products, Data-Engineering, Data-Platform

Oct 21 2024

fkaelin added a comment to T356701: Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables.

To follow up my previous comment:

Another point for discussion: the mediawiki history dumps are published as tsv files (without a header) for the community. Changing the definition of user_is_anonymous could have an impact on consumers in the community. It would likely require prior notice to the community; cc @KinneretG, who is working on an announcement about temp accounts to the research list.

Oct 21 2024, 12:37 PM · Research, Product-Analytics, Movement-Insights, Temporary accounts, Data Products, Data-Engineering, Data-Platform

Oct 17 2024

fkaelin updated the task description for T377159: [SDS 1.2.1 B] Test existing AI models for internal use-cases.
Oct 17 2024, 6:51 PM · Research (FY2024-25-Research-October-December)
fkaelin added a subtask for T377159: [SDS 1.2.1 B] Test existing AI models for internal use-cases: T377498: Phase 2: Article categorization metrics, fine-tuning metrics, optimization tooling.
Oct 17 2024, 6:47 PM · Research (FY2024-25-Research-October-December)
fkaelin added a parent task for T377498: Phase 2: Article categorization metrics, fine-tuning metrics, optimization tooling: T377159: [SDS 1.2.1 B] Test existing AI models for internal use-cases.
Oct 17 2024, 6:47 PM · Research-engineering, Research
fkaelin created T377498: Phase 2: Article categorization metrics, fine-tuning metrics, optimization tooling.
Oct 17 2024, 6:47 PM · Research-engineering, Research
fkaelin added a subtask for T377159: [SDS 1.2.1 B] Test existing AI models for internal use-cases: T377496: Phase 1: LLM inference - base metrics.
Oct 17 2024, 6:42 PM · Research (FY2024-25-Research-October-December)
fkaelin added a parent task for T377496: Phase 1: LLM inference - base metrics: T377159: [SDS 1.2.1 B] Test existing AI models for internal use-cases.
Oct 17 2024, 6:42 PM · Research-engineering, Research
fkaelin created T377496: Phase 1: LLM inference - base metrics.
Oct 17 2024, 6:41 PM · Research-engineering, Research

Oct 16 2024

fkaelin updated subscribers of T305688: Make HTML Dumps available in hadoop.

Summarizing my take-away from this slack thread about how to use html datasets in airflow dags.

Oct 16 2024, 3:13 PM · Data-Engineering, Research, Structured-Data-Backlog

Oct 15 2024

fkaelin created T377267: Consolidate article based data pipelines.
Oct 15 2024, 8:51 PM · Research-Freezer, Research-engineering
fkaelin created T377266: DSE kubernetes namespace for Research.
Oct 15 2024, 8:34 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Research-engineering, Data-Platform, Research
fkaelin created T377265: Simplify dependencies between research code repositories for ML.
Oct 15 2024, 8:33 PM · Research-Freezer, Research-engineering

Oct 4 2024

fkaelin closed T368613: Essential work - Research tooling as Resolved.

Thank you @MunizaA, your mr is merged.

Oct 4 2024, 7:36 PM · Essential-Work, Research-engineering, Research (FY2024-25-Research-July-September)

Oct 3 2024

fkaelin closed T357038: reference model engineering work as Resolved.

This is completed.

Oct 3 2024, 3:15 PM · Research, Research-engineering
fkaelin closed T357038: reference model engineering work, a subtask of T357037: References Model: Language-agnostic Reference Risk, as Resolved.
Oct 3 2024, 3:14 PM · Research

Oct 2 2024

fkaelin closed T376207: Updates to CentralNotice for TempAccount as Resolved.

Great - thank you @TAndic. I am closing this as resolved, as this means it will work the same as before, which should be fine for Research.

Oct 2 2024, 2:27 AM · Temporary accounts, MediaWiki-extensions-CentralNotice, Research

Oct 1 2024

fkaelin added a comment to T376207: Updates to CentralNotice for TempAccount.

@TAndic, do you have suggestions for who to tag here?

Oct 1 2024, 8:11 PM · Temporary accounts, MediaWiki-extensions-CentralNotice, Research
fkaelin created T376207: Updates to CentralNotice for TempAccount.
Oct 1 2024, 8:11 PM · Temporary accounts, MediaWiki-extensions-CentralNotice, Research
fkaelin added a comment to T376206: Quicksurvey audience selection with TempAccounts.

@TAndic, do you have suggestions for who to tag here?

Oct 1 2024, 8:08 PM · Temporary accounts, QuickSurveys, Research
fkaelin created T376206: Quicksurvey audience selection with TempAccounts.
Oct 1 2024, 8:07 PM · Temporary accounts, QuickSurveys, Research
fkaelin created T376204: TempAccount updates to research pipelines.
Oct 1 2024, 7:59 PM · Research (FY2024-25-Research-January-March), Research-engineering
fkaelin updated subscribers of T356701: Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables.

Thanks for the clarifications @Mayakp.wiki, though in my opinion we should still consider adding a user_is_named flag as a replacement for the previous definition of user_is_anonymous, to minimize the downstream implications of this change.

Oct 1 2024, 1:34 PM · Research, Product-Analytics, Movement-Insights, Temporary accounts, Data Products, Data-Engineering, Data-Platform

Sep 27 2024

fkaelin added a comment to T368613: Essential work - Research tooling.

Weekly updates

  • Held a hands-on Tuesday meeting for research scientists to use and contribute to the shared research code base
  • Notebooks to get started and for contributing to research codebase
  • Notebook for strategies to process webrequest logs at scale
Sep 27 2024, 9:00 PM · Essential-Work, Research-engineering, Research (FY2024-25-Research-July-September)

Sep 25 2024

fkaelin added a comment to T356701: Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables.

Instead, we want each consumer to stop and think how they want to handle temporary users.

Sep 25 2024, 5:50 PM · Research, Product-Analytics, Movement-Insights, Temporary accounts, Data Products, Data-Engineering, Data-Platform
fkaelin added a comment to T356701: Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables.

@Ottomata thanks for sharing - the intricacies of naming/classifying the user types are real!

Sep 25 2024, 5:30 PM · Research, Product-Analytics, Movement-Insights, Temporary accounts, Data Products, Data-Engineering, Data-Platform
fkaelin added a comment to T356701: Temporary Accounts Initiative (IP Masking) - Add user_is_temporary and user_is_permanent to data tables.

From a data/metrics usage perspective, the user_is_anonymous field seems to be mostly used for the binary anon/editor classification (e.g. in wikistats editor types are "Anonymous - a user that is not logged in" and "User - a registered, logged in", in research we create datasets/models that use this as a feature). In my understanding this binary nature will not change (we might want to update some naming), i.e. the temp accounts will eventually replace all anonymous edits (once the feature is rolled out to all wikis).

Sep 25 2024, 1:27 PM · Research, Product-Analytics, Movement-Insights, Temporary accounts, Data Products, Data-Engineering, Data-Platform

Sep 23 2024

fkaelin closed T356729: Research API repository as Resolved.

This is done, research-api-template repository

Sep 23 2024, 3:20 PM · Research-engineering, Research

Sep 11 2024

fkaelin added a comment to T372747: Repeat Automoderator testing process with Multilingual Revert Risk data.

@KCVelaga_WMF I misread this code previously - for now the model loading/inference for the various variants of the revert risk model is not unified (we plan to do that though). There are separate loading and classify methods for the multi-lingual model. The feature extraction pipeline for all models is generalized, but the inference step needs some more work. I started updating this notebook to work with the multi-lingual model but ran into some torch/transformer version issues.

Sep 11 2024, 1:45 PM · Moderator-Tools-Team, Product-Analytics (Kanban), Automoderator

Sep 9 2024

fkaelin added a comment to T368613: Essential work - Research tooling.

Weekly updates

  • preparation for Tuesday hands-on session scheduled for Sep 17th
  • developing teaching notebooks
Sep 9 2024, 2:18 PM · Essential-Work, Research-engineering, Research (FY2024-25-Research-July-September)

Sep 4 2024

fkaelin closed T345446: AQS content gap metrics ingestion job as Declined.
Sep 4 2024, 2:57 PM · AQS2.0, Data Products, Patch-For-Review, Research
fkaelin closed T345446: AQS content gap metrics ingestion job, a subtask of T331158: Knowledge Gaps Datasets and APIs, as Declined.
Sep 4 2024, 2:57 PM · Research

Aug 26 2024

fkaelin removed Due Date on T354958: Additional data release - aggregated survey form.
Aug 26 2024, 1:40 PM · Research
fkaelin added a comment to T354958: Additional data release - aggregated survey form.

A process for publishing survey data has been established, so I am inclined to close this task as resolved and track the release of new datasets in other phabs. Do you have a preference @Miriam?

Aug 26 2024, 1:40 PM · Research

Aug 24 2024

fkaelin closed T361929: [Research Engineering Request] Building end-to-end training pipeline for the add-a-link model as Resolved.

This work has been merged at last, with this MR.

Aug 24 2024, 2:27 AM · Research (FY2024-25-Research-July-September), Research-engineering
fkaelin added a comment to T368613: Essential work - Research tooling.

No updates

Aug 24 2024, 2:26 AM · Essential-Work, Research-engineering, Research (FY2024-25-Research-July-September)
fkaelin closed T361929: [Research Engineering Request] Building end-to-end training pipeline for the add-a-link model, a subtask of T361926: Improve training and inference pipeline for multilingual link recommendation model, as Resolved.
Aug 24 2024, 2:26 AM · Research, Essential-Work

Aug 19 2024

fkaelin added a comment to T372747: Repeat Automoderator testing process with Multilingual Revert Risk data.

There is no precomputed dataset available. The implementation is general, e.g. by passing a multi-lingual model url it could create predictions for that model. See the risk observatory pipeline code as an example that uses this transformation end-to-end; this can be run via a notebook (pip install the repo and import/execute the run method). To create a pipeline that generates predictions for the multi-lingual model via an airflow dag, we would need to create a research engineering request.

Aug 19 2024, 6:09 PM · Moderator-Tools-Team, Product-Analytics (Kanban), Automoderator
fkaelin added a comment to T368613: Essential work - Research tooling.

Weekly updates

  • Defined scoping for Q1, updated description
  • Started defining scope/style of Tuesday meeting
Aug 19 2024, 5:12 PM · Essential-Work, Research-engineering, Research (FY2024-25-Research-July-September)
fkaelin updated the task description for T368613: Essential work - Research tooling.
Aug 19 2024, 5:09 PM · Essential-Work, Research-engineering, Research (FY2024-25-Research-July-September)

Jul 23 2024

fkaelin closed T349755: Training pipeline for Revert Risk Language Agnostic (RRLA) model as Resolved.

This is resolved, including the training of the model. Code: pipeline / dag

Jul 23 2024, 10:07 AM · Knowledge-Integrity, Research
fkaelin closed T349755: Training pipeline for Revert Risk Language Agnostic (RRLA) model, a subtask of T314384: Develop a ML-based service to predict reverts on Wikipedia(s), as Resolved.
Jul 23 2024, 10:07 AM · Machine-Learning-Team, Research, Epic

Jul 8 2024

fkaelin added a comment to T354958: Additional data release - aggregated survey form.

Finally got around to this. Thank you @YLiou_WMF for the data file, this looks good to me in general.

Jul 8 2024, 1:11 PM · Research

Jun 19 2024

fkaelin updated subscribers of T367757: Request to add mnz to analytics-research-admins.

I confirm that this request is legit, also adding @XiaoXiao-WMF as manager.

Jun 19 2024, 3:07 PM · Patch-For-Review, SRE, SRE-Access-Requests

Jun 17 2024

fkaelin added a comment to T340494: Create keyspace and table for Knowledge Gaps.

Thanks. Is this now using AQS 2? It has been a moment, can you point to a current/good example job that writes to an AQS cassandra dataset from airflow?

Jun 17 2024, 4:10 PM · Cassandra, Data-Engineering
fkaelin added a comment to T351009: Develop an ML training workflow for ongoing work.

Summary of developments:

  • implementation of an end-to-end ml training workflows for
  • airflow dags that execute pipelines (scheduled for retraining pipelines, manual trigger for development)
  • discussions for how new model versions can be deployed
    • for now, continue with manual process established by ML platform
    • T366528 to track automation, as a manual process will not scale as research puts more training pipelines into production
  • guide for contributing to repository containing training workflows
  • future work in collab with ML platform
    • GPU support
      • enable using new ML boxes once they become available
      • use gpu available on existing infra in production airflow job (maybe with a sprint with ML platform that we didn't get to in Q4 FY24)
    • standards for ML training
      • there is a style guide and existing ml training pipelines to base new work on, but we refrained from introducing "abstractions/framework" like code or language - instead we used the existing infrastructure.
      • led by the ML Platform team, we should revisit this once the new ML boxes become available, as there will be a need for new tooling at that point
        • related: the current tooling for end-to-end ML training workflows is not convenient for iterative research/development (setup/deployment is error prone and too involved for one-off use cases), research engineering has a goal in FY25 to improve researcher tooling
Jun 17 2024, 2:15 PM · Research-engineering, Research (FY2023-24-Research-April-June)
fkaelin added a comment to T361929: [Research Engineering Request] Building end-to-end training pipeline for the add-a-link model.

Weekly updates

  • initial review on the MR
  • meeting with Aisha/Martin to discuss MR and how to approach remaining work
Jun 17 2024, 1:37 PM · Research (FY2024-25-Research-July-September), Research-engineering
fkaelin added a comment to T357316: Develop pipelines for research datasets - Q3/Q4.

Weekly update

  • pipelines are merged
  • airflow dags are deployed, final testing in progress
Jun 17 2024, 1:37 PM · Research (FY2024-25-Research-July-September), Research-engineering

Jun 13 2024

fkaelin added a comment to T358373: [Dumps 2] Reconciliation mechanism to detect and fetch missing/mismatched revisions.

In T358366#9831389 I asked if other fields could be added to the schema; in particular the diff between two revisions, which is frequently used by research (wikidiff). I agree with @xcollazo's concerns, but this led me to think about the implications of computing the diff separately with regard to reconciliation.

  • the diff is expensive to compute, as the parent revision might be at any moment in the past and is not necessarily the most recent previous revision. The wikidiff pipeline batches jobs by page (i.e. a batch contains the full history of the pages in the batch).
  • the full diff dataset is computed for each snapshot to follow the "snapshot pattern". However, it is not significantly cheaper to make this pipeline incremental (e.g. only append diffs for the new month of revisions), as any revision in the past can be a parent revision, so the join is still expensive
  • so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? the job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.
  • this would look similar to the existing "page change" job, e.g. query mediawiki for the current and parent revision text and compute the diff (maybe with a cache for the previous wikitext for each page which is the most common parent revision)
  • however, this leads to the question of correctness/reconciliation, since this diff dataset would not be derived from wmf_dumps.wikitext_raw_rc2 and would thus require its own reconciliation mechanism? That would be an argument in favour of the "Is wmf_dumps.wikitext_raw the right 'place' to check whether we are missing events or not? Shouldn't we do these checks upstream?" point raised above.
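
A rough sketch of the streaming "enrichment" idea from the bullets above, using only the Python standard library. It is an illustration under stated assumptions, not the wikidiff or "page change" implementation: the class name, the `fetch_text` callable (standing in for a query to mediawiki) and the per-page cache layout are all hypothetical.

```python
import difflib


def unified_diff_text(parent_text, current_text):
    """Unified diff between the parent and the current wikitext."""
    return "".join(
        difflib.unified_diff(
            parent_text.splitlines(keepends=True),
            current_text.splitlines(keepends=True),
            fromfile="parent",
            tofile="current",
        )
    )


class DiffEnricher:
    """Computes a diff per incoming revision, caching the latest text per page."""

    def __init__(self, fetch_text):
        # Hypothetical callable rev_id -> wikitext, used on cache misses.
        self.fetch_text = fetch_text
        # page_id -> (rev_id, text) of the most recent revision seen.
        self.latest = {}

    def enrich(self, page_id, rev_id, parent_id, text):
        cached = self.latest.get(page_id)
        if parent_id is None:
            parent_text = ""  # first revision of a page
        elif cached is not None and cached[0] == parent_id:
            parent_text = cached[1]  # common case: parent is the previous revision
        else:
            parent_text = self.fetch_text(parent_id)  # parent lies further in the past
        self.latest[page_id] = (rev_id, text)
        return unified_diff_text(parent_text, text)
```

The cache only helps in the common case where the parent is the previous revision of the page; any other parent still needs a lookup, and the sketch says nothing about reconciliation.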
Jun 13 2024, 6:58 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)
fkaelin created T367446: Consolidate duplicated configuration/constants.
Jun 13 2024, 4:48 PM · Research-Freezer, Research-engineering

Jun 3 2024

fkaelin created T366528: Deployment of model updates .
Jun 3 2024, 7:47 PM · Research-engineering, Machine-Learning-Team, Research

May 31 2024

fkaelin added a comment to T366369: MaxMind seems to be mapping the same IP to different countries.

Indeed, different versions of the database seem to be present on cluster hosts.

May 31 2024, 4:44 PM · Data-Platform-SRE (2024.10.19 - 2024.11.08), Data-Engineering

May 23 2024

fkaelin assigned T351118: [Research Engineering Request] Produce regular snapshots of all Wikipedia article topics to MunizaA.
  • This pipeline is implemented (MR)
  • Remaining work: schedule an airflow dag to regularly compute new topics dataset
May 23 2024, 6:57 PM · Research-engineering, Research
fkaelin claimed T354958: Additional data release - aggregated survey form.
May 23 2024, 6:54 PM · Research
fkaelin closed T294380: Storage request for datasets published by research team as Resolved.

Closing this task as resolved as the storage request was handled.

May 23 2024, 6:53 PM · SRE-swift-storage
fkaelin closed T304425: Test LiftWing API/Predictions from Hadoop as Resolved.

Closing this as resolved. After more discussion and some experimentation, it was decided that doing batch inference within the distributed jobs (e.g. by broadcasting the model to the workers) is preferable. Pasting the comments from the relevant slack thread here.

May 23 2024, 6:52 PM · Lift-Wing
fkaelin closed T304425: Test LiftWing API/Predictions from Hadoop , a subtask of T290173: Orchestration of end-to-end machine learning workloads, as Resolved.
May 23 2024, 6:52 PM · Research-Freezer
fkaelin changed the status of T351009: Develop an ML training workflow for ongoing work from Open to In Progress.
May 23 2024, 6:52 PM · Research-engineering, Research (FY2023-24-Research-April-June)

May 22 2024

fkaelin added a comment to T358366: Consult with Product and Research team on schema and data retention expectations for wmf_dumps.wikitext_raw.

More on "Availability" / time travel. This question is not easy to answer, as it also relates to the current snapshot approach, which forces a pipeline to reason about the past in a rather limiting way. Aka "do you want the data as it looked today, or 1 month ago, or 2 months ago?", and finding out if/how the past data is different is not trivial and rarely practical. Generally pipelines either

  1. offload dealing with the snapshot semantics to the consumers by producing snapshotted datasets themselves
  2. implement a pseudo-incremental dataset by disregarding the "new past" and any changes it might contain.

For this reason I find it hard to define requirements for time travel, it is basically a new capability (for example, the replacement for "mediawiki_wikitext_current" could be a transformation of a time travel query). Starting with 90 days should be sufficient as it is strictly an improvement to what one can do now.

May 22 2024, 8:14 PM · Dumps 2.0 (Kanban Board)
fkaelin added a comment to T358366: Consult with Product and Research team on schema and data retention expectations for wmf_dumps.wikitext_raw.
  • Availability: Research is mostly treating the current history dumps as a pseudo-incremental dataset, i.e. pipelines that depend on the history wait for a new snapshot to be released and then only use the "new" data from that snapshot (aka the revisions created in the month since the last snapshot was generated). This means that wmf_dumps.wikitext_raw significantly reduces the latency - roughly from 1 month (wait for snapshot interval to trigger) + 12 days (dump processing) to a few hours.
  • Schema: As the schemas are almost identical, my main question is about extending the existing dataset in ways that depend on the snapshot mechanism. For example research has a number of use cases that involve comparing the revision text with the parent revision text. This involves a computationally expensive self join and some pitfalls, so there is a wikidiff job that creates (yet another) version of the wikitext history that includes a column with the unified diff between the current and parent revision.
    • Could we add the diff to the proposed wmf_dumps.wikitext_raw? As the parent revision could be at any point in the past, this would likely involve the equivalent to the wmf.mediawiki_wikitext_current available when new revisions are ingested into the dataset.
    • More generally, what is the replacement for the wmf.mediawiki_wikitext_current?
  • Data quality: the discussion around correctness of the events data T120242 also applies in this context. For research in particular, many use cases don't have high requirements (e.g. for training datasets for ML, or for metrics datasets that involve models that can also be "incorrect"), and we could/would migrate existing jobs to the new dumps table once it is available/supported in prod.
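
To make the pseudo-incremental pattern from the Availability bullet concrete, here is a small sketch in plain Python. The row layout (dicts with a `rev_timestamp` key) and the function name are assumptions for illustration; real pipelines would apply the same filter to the snapshot table.

```python
from datetime import datetime


def new_revisions(snapshot_rows, previous_snapshot, current_snapshot):
    """Keep only revisions created since the previous snapshot was generated.

    Older rows are ignored even if they changed between snapshots.
    """
    return [
        row
        for row in snapshot_rows
        if previous_snapshot <= row["rev_timestamp"] < current_snapshot
    ]


# Hypothetical rows from a monthly snapshot.
rows = [
    {"rev_id": 1, "rev_timestamp": datetime(2024, 3, 15)},
    {"rev_id": 2, "rev_timestamp": datetime(2024, 4, 2)},
    {"rev_id": 3, "rev_timestamp": datetime(2024, 4, 30)},
]
april = new_revisions(rows, datetime(2024, 4, 1), datetime(2024, 5, 1))
print([row["rev_id"] for row in april])
```

The filter is cheap, but it is also why such a dataset is only pseudo-incremental: corrections to revisions older than the previous snapshot never reach the consumer.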
May 22 2024, 8:11 PM · Dumps 2.0 (Kanban Board)

Apr 30 2024

fkaelin closed T355440: PoC - general model training support (Cloud GPU) as Resolved.

I am closing this as done - a summary:

Apr 30 2024, 4:59 PM · Research (FY2023-24-Research-April-June)
fkaelin edited projects for T344830: Incremental knowledge gap dataset, added: Research-Freezer; removed Research.

This task requires design/implementation. Given that the current implementation is stable, I am moving this task to the freezer until there is a more urgent need for an incremental dataset.

Apr 30 2024, 1:19 PM · Research-Freezer
fkaelin added a comment to T341515: Team Interface: Working on.

Update on the use of gitlab issues:

  • Research doesn't use them for team internal planning, work is tracked in Phabricator.
  • Some researchers use gitlab issues for managing tasks with external collaborators (e.g. outreachy internships) as it is more convenient than depending on another tool.
Apr 30 2024, 1:04 PM · Research-management, Research

Apr 16 2024

fkaelin closed T348826: Integrate with WMF deployment pipeline as Declined.

Closing this. Deploying on CloudVPS is supported, blubber integration to be done when a kubernetes deploy is needed.

Apr 16 2024, 3:10 PM · Research
fkaelin closed T348826: Integrate with WMF deployment pipeline, a subtask of T348820: Tooling to work with embeddings, as Declined.
Apr 16 2024, 3:10 PM · Epic, Research
fkaelin closed T348367: Create a python package to compute wikitext embeddings in the WMF data infra as Resolved.

Done - code

Apr 16 2024, 3:04 PM · Research
fkaelin closed T348367: Create a python package to compute wikitext embeddings in the WMF data infra, a subtask of T348819: Develop pipelines for research datasets - Q2, as Resolved.
Apr 16 2024, 3:04 PM · Research (FY2023-24-Research-October-December)
fkaelin closed T348823: Tooling to create an index from a dataset of vectors as Resolved.
Apr 16 2024, 3:02 PM · Research
fkaelin closed T348823: Tooling to create an index from a dataset of vectors, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Apr 16 2024, 3:02 PM · Epic, Research
fkaelin removed Due Date on T343061: Denylist for language agnostic revert risk model.
Apr 16 2024, 2:57 PM · Research-Freezer, Research-engineering
fkaelin moved T343061: Denylist for language agnostic revert risk model from Staged to Backlog on the Research board.

Removing due date and moving to backlog to prioritize.

Apr 16 2024, 2:56 PM · Research-Freezer, Research-engineering
fkaelin closed T342915: Generate training/evaluation datasets using airflow , a subtask of T341817: Standardize research pipelines - Dataset generation, as Resolved.
Apr 16 2024, 2:54 PM · Research-engineering, Epic, Research
fkaelin closed T342915: Generate training/evaluation datasets using airflow as Resolved.
Apr 16 2024, 2:54 PM · Research
fkaelin added a comment to T343065: Scheduled risk observatory pipeline.

@Pablo can this ticket be closed as well, as the work was tracked with T341777?

Apr 16 2024, 1:35 AM · Research (FY2023-24-Research-April-June)

Apr 3 2024

fkaelin added a comment to T341777: Automate the data collection process.

@Pablo thanks for flagging - there was indeed an issue with the wikidiff table: it is an external hive table, the required data was on hdfs and triggered the risk observatory dag, but the hive table itself was not being correctly updated, so no data was ingested. This is fixed now, and the dashboard shows data until Feb 24 now.

Apr 3 2024, 11:10 AM · Research

Mar 21 2024

fkaelin added a comment to T305688: Make HTML Dumps available in hadoop.

Pasting this reply from a slack thread for context

Mar 21 2024, 3:15 PM · Data-Engineering, Research, Structured-Data-Backlog

Mar 5 2024

fkaelin updated the task description for T356729: Research API repository.
Mar 5 2024, 4:27 PM · Research-engineering, Research

Mar 4 2024

fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

  • Interesting development with the ml team: there is a conversation with a European HPC infra provider about getting compute resources, and research projects are good candidates. Naturally this is relevant to this cloud GPU initiative, and research is very interested.
Mar 4 2024, 3:17 PM · Research (FY2023-24-Research-April-June)