
Adapt the wdqs data-transfer cookbook to operate with federated subgraphs
Closed, Resolved, Public

Description

The current data-transfer cookbook assumes that a single graph is served from all wdqs nodes. This will no longer be the case once the graph is split.
Most of the script should work as before, but a few important configuration settings may need to vary:

The main one is the Kafka topic the updater consumes from, which will vary depending on which subgraph the machine serves. This information might be available from puppet and could be exposed via a config file readable by the cookbook.
The cookbook should also make sure not to transfer the data of subgraph A onto a machine configured to serve subgraph B.
We should also explore and document a procedure for switching a machine that serves subgraph A to serve subgraph B.
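A minimal sketch of how the cookbook could pick the updater's Kafka topic from a per-host config file. Everything here is hypothetical: the `graph_type=...` file format, the function names, and the topic names in `EXAMPLE_TOPICS` are placeholders, not the real puppet output or the real topics.

```python
"""Sketch: choose the updater's Kafka topic from the host's graph type.

Assumptions (hypothetical, not the real cookbook or puppet layout):
- puppet writes a small key=value file on each host, e.g. "graph_type=main";
- the topic names in EXAMPLE_TOPICS are placeholders.
"""

# Hypothetical mapping; real topic names would come from puppet/config.
EXAMPLE_TOPICS = {
    "main": "example.updater.main",
    "scholarly": "example.updater.scholarly",
    "wcqs": "example.updater.wcqs",
}


def parse_host_config(text: str) -> dict[str, str]:
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed config line: {line!r}")
        config[key.strip()] = value.strip()
    return config


def topic_for_host(config_text: str, topics: dict[str, str] = EXAMPLE_TOPICS) -> str:
    """Return the Kafka topic this host's updater should consume from."""
    graph_type = parse_host_config(config_text).get("graph_type")
    if graph_type is None:
        raise ValueError("host config does not declare a graph_type")
    try:
        return topics[graph_type]
    except KeyError:
        raise ValueError(f"unknown graph type: {graph_type!r}") from None
```

For example, `topic_for_host("graph_type=scholarly\n")` returns `"example.updater.scholarly"`. Failing loudly on a missing or unknown type keeps the cookbook from silently falling back to the wrong topic.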

AC:
The wdqs data-transfer cookbook can operate in a federated subgraphs setup

  • Correctly sets Kafka consumer offsets based on the instance (graph) type
  • In addition to transferring the actual data, also transfers metadata about the instance (graph) type. The data-reload cookbook should likewise set this metadata. We'll use the existing data_loaded flag file for this: previously its mere presence told the wdqs updater that it could start working, but now it will also contain the instance type.
  • The cookbook checks that the source and target host the same type of journal (main, scholarly, wcqs, ...)
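A minimal sketch of the flag-file metadata and the source/target check described in the criteria above. The single-line file format, the treatment of an empty file as a legacy "loaded, type unknown" flag, and all function names are assumptions for illustration, not the merged cookbook code.

```python
"""Sketch: store the graph type in the data_loaded flag file and refuse
transfers between hosts whose journals hold different graph types.

Assumptions (hypothetical): the flag file holds a single line with the graph
type (e.g. "main"); an empty file is a legacy flag that only signals
"data is loaded" without recording a type.
"""
from pathlib import Path
from typing import Optional


def write_data_loaded(path: Path, graph_type: str) -> None:
    """Create or overwrite the flag file, storing the graph type inside it."""
    path.write_text(graph_type + "\n")


def read_graph_type(path: Path) -> Optional[str]:
    """Return the stored graph type, or None for a legacy empty flag file.

    Raises FileNotFoundError if the flag file is missing, i.e. the host
    has no loaded data at all.
    """
    content = path.read_text().strip()
    return content or None


def check_same_graph_type(source: Optional[str], target: Optional[str]) -> None:
    """Refuse a transfer unless source and target declare the same type."""
    if source is None or target is None:
        raise RuntimeError(
            f"cannot verify graph types (source={source!r}, target={target!r})")
    if source != target:
        raise RuntimeError(
            f"refusing transfer: source holds {source!r}, target expects {target!r}")
```

Treating an unknown type as an error, rather than allowing the transfer, errs on the side of never copying subgraph A's journal onto a subgraph B host.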

Event Timeline

Change #1048060 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] sre.wdqs.data-transfer: new graph split instances

https://gerrit.wikimedia.org/r/1048060

Change #1048060 merged by Ryan Kemper:

[operations/cookbooks@master] sre.wdqs.data-transfer: new graph split instances

https://gerrit.wikimedia.org/r/1048060

@RKemper : can we close this task? Or does it make sense to keep it open for some reason? I think we did some testing by reloading a full graph on our test servers; could you link that in or add a comment about it?

The part of the task about setting the appropriate kafka topic / consumer offset is done.

I did just notice the task mentions:

The cookbook should also make sure to not transfer the data of subgraph A into a machine configured to serve the subgraph B.

& that feature isn't yet implemented. I can work on a patch for that and do another data transfer to validate.

Change #1053205 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs: store metadata about graph split type

https://gerrit.wikimedia.org/r/1053205

Change #1053778 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs restart envoy: support graph split aliases

https://gerrit.wikimedia.org/r/1053778

Change #1053778 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs restart envoy: support graph split aliases

https://gerrit.wikimedia.org/r/1053778

Change #1059931 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: make wdqs-all include graph splits

https://gerrit.wikimedia.org/r/1059931

Change #1059931 merged by Ryan Kemper:

[operations/puppet@production] wdqs: make wdqs-all include graph splits

https://gerrit.wikimedia.org/r/1059931

Mentioned in SAL (#wikimedia-operations) [2024-08-09T04:24:59Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, transfer to unpooled host (1022) to test cookbook changes) xfer wikidata from wdqs1012.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling source-only afterwards

Mentioned in SAL (#wikimedia-operations) [2024-08-09T04:32:24Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T364077, transfer to unpooled host (1022) to test cookbook changes) xfer wikidata from wdqs1012.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling source-only afterwards

Change #1063067 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs graph-split: data xfer needs python3-snappy

https://gerrit.wikimedia.org/r/1063067

Change #1063067 merged by Ryan Kemper:

[operations/puppet@production] wdqs graph-split: data xfer needs python3-snappy

https://gerrit.wikimedia.org/r/1063067

New data-transfer cookbook confirmed working properly. It does not yet implement the "don't allow transfers between graph split hosts of different types" logic, so that still needs to be added.

Change #947930 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs: store graph type in data_loaded file

https://gerrit.wikimedia.org/r/947930

Change #1053205 merged by jenkins-bot:

[operations/cookbooks@master] wdqs: store metadata about graph split type

https://gerrit.wikimedia.org/r/1053205

Change #947930 abandoned by Ryan Kemper:

[operations/cookbooks@master] wdqs: store graph type in data_loaded file

Reason:

obsoleted by 1053205

https://gerrit.wikimedia.org/r/947930

Change #1077051 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs.data-transfer: log now mentions bg instance

https://gerrit.wikimedia.org/r/1077051

Change #1077051 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs.data-transfer: log now mentions bg instance

https://gerrit.wikimedia.org/r/1077051

Change #1077059 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs.data-transfer: refuse xfer on differing jnl

https://gerrit.wikimedia.org/r/1077059

Mentioned in SAL (#wikimedia-operations) [2024-10-01T15:53:59Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, this test transfer should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1023.eqiad.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-01T15:54:02Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, this test transfer should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs1023.eqiad.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-01T15:56:27Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, this test transfer should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-01T15:59:02Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, this test transfer should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Change #1077066 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs.data-transfer: add missing self.

https://gerrit.wikimedia.org/r/1077066

Change #1077066 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs.data-transfer: add missing self.

https://gerrit.wikimedia.org/r/1077066

Mentioned in SAL (#wikimedia-operations) [2024-10-01T16:11:25Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, test out new flag) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-01T16:16:05Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T364077, test out new flag) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Change #1077112 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.wdqs.data-transfer: fix CI

https://gerrit.wikimedia.org/r/1077112

Change #1077112 merged by Volans:

[operations/cookbooks@master] sre.wdqs.data-transfer: fix CI

https://gerrit.wikimedia.org/r/1077112

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:45:01Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:45:07Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:45:57Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:45:59Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:46:53Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:46:55Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:47:32Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:47:34Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2023.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:51:29Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:51:32Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new behavior; this should fail) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:52:12Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:52:14Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:53:08Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-transfer (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Mentioned in SAL (#wikimedia-operations) [2024-10-03T15:58:03Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T364077, testing new flag; this should succeed) xfer wikidata_main from wdqs1021.eqiad.wmnet -> wdqs2021.codfw.wmnet, repooling both afterwards

Change #1077059 merged by Ryan Kemper:

[operations/cookbooks@master] wdqs.data-transfer: refuse xfer on differing jnl

https://gerrit.wikimedia.org/r/1077059