
Migrate node-based services in production to node18
Open, Needs Triage, Public

Description

T308371: Migrate node-based services in production to node16 | T364779: Migrate node-based services in production to node20

Note that you may wish to complete the migration to node16 first, rather than make the migration in one go.

Imperfect search:

  • Abstract Wikipedia team
    • services/function-evaluator
    • services/function-orchestrator
  • Content Transformation
    • mediawiki/services/chromium-render
    • mediawiki/services/geoshapes [never deployed - stalled work]
    • mediawiki/services/kartotherian [not in k8s yet]
    • mediawiki/services/mobileapps
    • mediawiki/services/push-notifications
    • mediawiki/services/wikifeeds T358017: Migrate wikifeeds from Node16 to Node18
  • Language Engineering
    • mediawiki/services/cxserver
  • MediaWiki Engineering
    • mediawiki/services/example-node-api [being decommissioned?]
    • mediawiki/services/restbase [being decommissioned]
  • Web
    • wikimedia/portals
  • Wikidata
    • wikibase/termbox
  • ???
    • mediawiki/services/mathoid
    • mediawiki/services/change-propagation
    • mediawiki/services/recommendation-api
    • ~~mediawiki/services/service-runner~~ (unmaintained)
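For each service in the list above, the migration typically lands as a bump of the Node engine pin in the repo itself, plus the matching image bump in operations/deployment-charts. A minimal sketch of the package.json side of that change (the `engines` field is standard npm; the exact version range is illustrative):

```json
{
  "engines": {
    "node": ">=18"
  }
}
```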

Details

Repo                                     Branch   Lines +/-
mediawiki/services/mobileapps            master   +5 K -5 K
operations/deployment-charts             master   +1 -1
mediawiki/services/push-notifications    master   +9 K -8 K
operations/deployment-charts             master   +1 -1
mediawiki/services/zotero                master   +5 -5
operations/deployment-charts             master   +1 -1
wikimedia/portals                        master   +1 K -331
operations/deployment-charts             master   +2 -2
operations/deployment-charts             master   +1 -1
mediawiki/services/wikifeeds             master   +347 -5 K
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
wikibase/termbox                         master   +7 -5
operations/deployment-charts             master   +2 -7
operations/deployment-charts             master   +6 -0
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
mediawiki/services/recommendation-api    master   +13 -10
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +9 -9
operations/deployment-charts             master   +1 -1
mediawiki/services/cxserver              master   +1 K -1 K

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 992387 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] termbox(test): update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992387

Change 992387 merged by jenkins-bot:

[operations/deployment-charts@master] termbox(test): update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992387

Change 992446 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] termbox: update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992446

Change 992446 merged by jenkins-bot:

[operations/deployment-charts@master] termbox: update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992446

Change 992452 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] Revert "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/992452

Change 992452 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/992452

Change 999024 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/services/wikifeeds@master] wikifeeds: upgrade to node18 from node16

https://gerrit.wikimedia.org/r/999024

Change 1000296 had a related patch set uploaded (by Func; author: Func):

[wikimedia/portals@master] build: Update eslint-config-wikimedia to 0.26.0 for node18

https://gerrit.wikimedia.org/r/1000296

Change 1002592 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/services/wikifeeds@master] wikifeeds: upgrade to node18 from node16

https://gerrit.wikimedia.org/r/1002592

Change 999024 abandoned by Sbailey:

[mediawiki/services/wikifeeds@master] wikifeeds: upgrade to node18 from node16

Reason:

replaced due to git problem

https://gerrit.wikimedia.org/r/999024

Change 1003400 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] Reapply "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/1003400

Change 1003400 merged by jenkins-bot:

[operations/deployment-charts@master] Reapply "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/1003400

Change 1003880 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/services/mobileapps@master] WIP mobileapps: upgrade to node18 from node12

https://gerrit.wikimedia.org/r/1003880

Change 1007353 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2024-02-12-160222 to 2024-02-26-150300

https://gerrit.wikimedia.org/r/1007353

Change 1007353 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2024-02-12-160222 to 2024-02-26-150300

https://gerrit.wikimedia.org/r/1007353

Change 1000296 merged by jenkins-bot:

[wikimedia/portals@master] build: Update eslint-config-wikimedia to 0.26.0 for node18

https://gerrit.wikimedia.org/r/1000296

Change 1011269 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/zotero@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1011269

Change #1011269 merged by jenkins-bot:

[mediawiki/services/zotero@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1011269

Change #1016728 had a related patch set uploaded (by Mvolz; author: Mvolz):

[operations/deployment-charts@master] Update zotero to node18

https://gerrit.wikimedia.org/r/1016728

Change #1016728 merged by jenkins-bot:

[operations/deployment-charts@master] Update zotero to node18

https://gerrit.wikimedia.org/r/1016728

Change #1037480 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/push-notifications@master] Project maintenance

https://gerrit.wikimedia.org/r/1037480

Ottomata updated the task description. (Show Details)

Change #1047201 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457

https://gerrit.wikimedia.org/r/1047201

Change #1047201 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457

https://gerrit.wikimedia.org/r/1047201

Mentioned in SAL (#wikimedia-operations) [2024-06-20T11:21:00Z] <akosiaris> upgrade mathoid to 2024-06-18-233457-production T349118

Change #1037480 merged by jenkins-bot:

[mediawiki/services/push-notifications@master] Project maintenance

https://gerrit.wikimedia.org/r/1037480

Change #1049891 had a related patch set uploaded (by Mvolz; author: Mvolz):

[operations/deployment-charts@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1049891

Change #1049891 merged by jenkins-bot:

[operations/deployment-charts@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1049891

FYI Citoid upgrade is blocked by dependency on service-runner: https://github.com/wikimedia/service-runner/pull/251

On update to 18, certain types of requests to the TestRunner cause the worker to die unexpectedly, resulting in a 504 instead of the expected error.

Zotero upgrade to 18 is blocked by: T361728 (mystery issue)

FYI, service-utils (a replacement for service-runner) is nearing its first production deployment. In case you want to wait for it instead of solving service-runner problems :) cc @tchin

> FYI, service-utils (a replacement for service-runner) is nearing its first production deployment. In case you want to wait for it instead of solving service-runner problems :) cc @tchin

What service are you using this for?

Unfortunately we need both workers and also Open API spec generation. Would you be open to including those?

> What service are you using this for?

https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams_HTTP_Service
T361769: Migrate and re-deploy eventstreams using service-utils

> Unfortunately we need [...] Open API spec generation.

OpenAPI spec generation sounds great. File a task...and help implement it!?

I don't believe service-runner or service-template-node ever supported this though, did it? FWIW, EventStreams also generates some spec, but it is manual, and not handled by a lib.

> Unfortunately we need [...] workers

Curious, why do you need workers? Because WMF services are deployed on k8s now, we decided to rely on k8s pod replicas and k8s routing, instead of node.js worker / clustering.

>> What service are you using this for?

> https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams_HTTP_Service
> T361769: Migrate and re-deploy eventstreams using service-utils

>> Unfortunately we need [...] Open API spec generation.

> OpenAPI spec generation sounds great. File a task...and help implement it!?

> I don't believe service-runner or service-template-node ever supported this though, did it? FWIW, EventStreams also generates some spec, but it is manual, and not handled by a lib.

It was part of the list of things "not included" by service-utils so I assumed it was service-runner, but yeah now that I look that's in service-template-node, not service-runner: https://github.com/wikimedia/service-template-node/blob/main/lib/swagger-ui.js
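For reference, the spec-serving piece is small either way; a hand-rolled version along the lines of what the comment describes for EventStreams might look like the sketch below. The helper name and shape are hypothetical illustrations, not service-template-node's actual API:

```javascript
// Hypothetical helper that assembles a minimal OpenAPI 3 document by hand,
// roughly what "manual, not handled by a lib" spec generation amounts to.
function buildSpec({ title, version, paths }) {
  return {
    openapi: '3.0.0',
    info: { title, version },
    paths: paths || {},
  };
}

// A service would then serve this object as JSON from a spec endpoint.
// The values here are purely illustrative:
const spec = buildSpec({
  title: 'eventstreams',
  version: '1.0.0',
  paths: {
    '/v2/stream/{streams}': {
      get: { responses: { 200: { description: 'SSE stream' } } },
    },
  },
});
```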

>> Unfortunately we need [...] workers

> Curious, why do you need workers? Because WMF services are deployed on k8s now, we decided to rely on k8s pod replicas and k8s routing, instead of node.js worker / clustering.

Well, I'm not sure we need them but we have the default production setting which is one worker per CPU, and also I believe service-runner kills them if memory is exceeded (though apparently that hasn't happened as far back as the logs go). Having a look at things recently, we have about 100 workers die a week: https://logstash.wikimedia.org/goto/8063d32bd662f2c916bf3dba2fa57368

@akosiaris maybe that is something we can get rid of? I'm afraid I'm not sure if we lose anything by getting rid of it; I just assumed that because we intermittently have them restarting, they're useful. Maybe the k8s pods are sufficient?

> Well, I'm not sure we need them but we have the default production setting which is one worker per CPU,

It's 1 worker per pod. 1 worker per CPU (the ncpu setting) is a very bad pattern for deploying to a k8s cluster. The main reason is that it makes the workload's size (CPU usage) dependent on the hardware it happens to be deployed on, and thus impossible to schedule reliably across a set of different hardware nodes.

The default setting of 1 worker in each pod was a product of trying to figure out a sane default for that value, without entering into discussions as to whether the functionality makes sense in the k8s environment. I can add that I ran some basic benchmarking tests back then, testing num_workers=0 vs num_workers=1 vs num_workers=2, focused on availability. There are some minor gains in having num_workers=1 as far as that aspect is concerned, mostly because the Node.js cluster mechanism reacts faster than Kubernetes' default liveness probes in killing and restarting workers. I don't think those gains are justified in the majority of cases.

> and also I believe service-runner kills them if memory is exceeded (though apparently that hasn't happened as far back as the logs go).

Yes, it does, for heap size violations. It also kills workers if they fail the internal heartbeat comms.

> Having a look at things recently, we have about 100 workers die a week: https://logstash.wikimedia.org/goto/8063d32bd662f2c916bf3dba2fa57368

> @akosiaris maybe that is something we can get rid of? I'm afraid I'm not sure if we lose anything by getting rid of it; I just assumed that because we intermittently have them restarting, they're useful. Maybe the k8s pods are sufficient?

Yes, I think it can be removed. It is not contributing much in Citoid's use case. The worker monitoring it performs is more beneficial than standard k8s liveness probes under heavy concurrent traffic, which isn't the case for Citoid. Kubernetes pods should be sufficient.

First thing would be to set num_workers to 0 in the chart. This short-circuits service-runner's worker logic and has the main process do everything.

> First thing would be to set num_workers to 0 in the chart. This short-circuits service-runner's worker logic and has the main process do everything.

+1. FWIW, long ago, when service-runner-based services I was involved with were migrated to k8s, we set num_workers: 0, and things have been fine :)
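The chart-side change described above can be sketched as a service-runner config fragment. `num_workers` and `worker_heap_limit_mb` are real service-runner config keys; the surrounding values are illustrative:

```yaml
# Fragment of a service-runner config.yaml (values illustrative)
num_workers: 0             # no forked workers; the main process serves traffic
worker_heap_limit_mb: 500  # heap-kill threshold, only relevant when workers exist
```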