
Migrate node-based services in production to node18
Open, Needs Triage, Public

Description

T308371: Migrate node-based services in production to node16 | T364779: Migrate node-based services in production to node20

Note that you may wish to complete the migration to node16 first, rather than make the migration in one go.

Imperfect search:

  • Abstract Wikipedia team
    • services/function-evaluator
    • services/function-orchestrator
  • Content Transformation
    • mediawiki/services/chromium-render
    • mediawiki/services/geoshapes [never deployed - stalled work]
    • mediawiki/services/kartotherian [not in k8s yet]
    • mediawiki/services/mobileapps
    • mediawiki/services/push-notifications
    • mediawiki/services/wikifeeds T358017: Migrate wikifeeds from Node16 to Node18
  • Language Engineering
    • mediawiki/services/cxserver
  • MediaWiki Engineering
    • mediawiki/services/example-node-api [being decommissioned?]
    • mediawiki/services/restbase [being decommissioned]
  • Web
    • wikimedia/portals
  • Wikidata
    • wikibase/termbox
  • ???
    • mediawiki/services/mathoid
    • mediawiki/services/change-propagation
    • mediawiki/services/recommendation-api
    • ~~mediawiki/services/service-runner~~ (unmaintained)
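For each service in the list above, the migration typically lands as a bump of the Node engine pin in the repo itself, plus the matching image bump in operations/deployment-charts. A minimal sketch of the package.json side of that change (the `engines` field is standard npm; the exact version range is illustrative):

```json
{
  "engines": {
    "node": ">=18"
  }
}
```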

Details

Repo                                     Branch   Lines +/-
mediawiki/services/mobileapps            master   +5 K -5 K
operations/deployment-charts             master   +1 -1
mediawiki/services/push-notifications    master   +9 K -8 K
operations/deployment-charts             master   +1 -1
mediawiki/services/zotero                master   +5 -5
operations/deployment-charts             master   +1 -1
wikimedia/portals                        master   +1 K -331
operations/deployment-charts             master   +2 -2
operations/deployment-charts             master   +1 -1
mediawiki/services/wikifeeds             master   +347 -5 K
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
wikibase/termbox                         master   +7 -5
operations/deployment-charts             master   +2 -7
operations/deployment-charts             master   +6 -0
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +1 -1
mediawiki/services/recommendation-api    master   +13 -10
operations/deployment-charts             master   +1 -1
operations/deployment-charts             master   +9 -9
operations/deployment-charts             master   +1 -1
mediawiki/services/cxserver              master   +1 K -1 K

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 992387 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] termbox(test): update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992387

Change 992387 merged by jenkins-bot:

[operations/deployment-charts@master] termbox(test): update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992387

Change 992446 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] termbox: update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992446

Change 992446 merged by jenkins-bot:

[operations/deployment-charts@master] termbox: update to 2024-01-22-163619-production

https://gerrit.wikimedia.org/r/992446

Change 992452 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] Revert "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/992452

Change 992452 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/992452

Change 999024 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/services/wikifeeds@master] wikifeeds: upgrade to node18 from node16

https://gerrit.wikimedia.org/r/999024

Change 1000296 had a related patch set uploaded (by Func; author: Func):

[wikimedia/portals@master] build: Update eslint-config-wikimedia to 0.26.0 for node18

https://gerrit.wikimedia.org/r/1000296

Change 1002592 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/services/wikifeeds@master] wikifeeds: upgrade to node18 from node16

https://gerrit.wikimedia.org/r/1002592

Change 999024 abandoned by Sbailey:

[mediawiki/services/wikifeeds@master] wikifeeds: upgrade to node18 from node16

Reason:

replaced due to git problem

https://gerrit.wikimedia.org/r/999024

Change 1003400 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/deployment-charts@master] Reapply "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/1003400

Change 1003400 merged by jenkins-bot:

[operations/deployment-charts@master] Reapply "termbox: update to 2024-01-22-163619-production"

https://gerrit.wikimedia.org/r/1003400

Change 1003880 had a related patch set uploaded (by Sbailey; author: Sbailey):

[mediawiki/services/mobileapps@master] WIP mobileapps: upgrade to node18 from node12

https://gerrit.wikimedia.org/r/1003880

Change 1007353 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2024-02-12-160222 to 2024-02-26-150300

https://gerrit.wikimedia.org/r/1007353

Change 1007353 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Upgrade evaluators from 2024-02-12-160222 to 2024-02-26-150300

https://gerrit.wikimedia.org/r/1007353

Change 1000296 merged by jenkins-bot:

[wikimedia/portals@master] build: Update eslint-config-wikimedia to 0.26.0 for node18

https://gerrit.wikimedia.org/r/1000296

Change 1011269 had a related patch set uploaded (by Mvolz; author: Mvolz):

[mediawiki/services/zotero@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1011269

Change #1011269 merged by jenkins-bot:

[mediawiki/services/zotero@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1011269

Change #1016728 had a related patch set uploaded (by Mvolz; author: Mvolz):

[operations/deployment-charts@master] Update zotero to node18

https://gerrit.wikimedia.org/r/1016728

Change #1016728 merged by jenkins-bot:

[operations/deployment-charts@master] Update zotero to node18

https://gerrit.wikimedia.org/r/1016728

Change #1037480 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/push-notifications@master] Project maintenance

https://gerrit.wikimedia.org/r/1037480

Ottomata updated the task description. (Show Details)

Change #1047201 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457

https://gerrit.wikimedia.org/r/1047201

Change #1047201 merged by jenkins-bot:

[operations/deployment-charts@master] mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457

https://gerrit.wikimedia.org/r/1047201

Mentioned in SAL (#wikimedia-operations) [2024-06-20T11:21:00Z] <akosiaris> upgrade mathoid to 2024-06-18-233457-production T349118

Change #1037480 merged by jenkins-bot:

[mediawiki/services/push-notifications@master] Project maintenance

https://gerrit.wikimedia.org/r/1037480

Change #1049891 had a related patch set uploaded (by Mvolz; author: Mvolz):

[operations/deployment-charts@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1049891

Change #1049891 merged by jenkins-bot:

[operations/deployment-charts@master] Update Zotero to node 18

https://gerrit.wikimedia.org/r/1049891

FYI Citoid upgrade is blocked by dependency on service-runner: https://github.com/wikimedia/service-runner/pull/251

On update to 18, certain types of requests to the TestRunner cause the worker to die unexpectedly, resulting in a 504 instead of the expected error.

Zotero upgrade to 18 is blocked by: T361728 (mystery issue)

FYI, service-utils (a replacement for service-runner) is nearing its first production deployment. In case you want to wait for it instead of solving service-runner problems :) cc @tchin

> FYI, service-utils (a replacement for service-runner) is nearing its first production deployment. In case you want to wait for it instead of solving service-runner problems :) cc @tchin

What service are you using this for?

Unfortunately we need both workers and also Open API spec generation. Would you be open to including those?

> What service are you using this for?

https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams_HTTP_Service
T361769: Migrate and re-deploy eventstreams using service-utils

> Unfortunately we need [...] Open API spec generation.

OpenAPI spec generation sounds great. File a task...and help implement it!?

I don't believe service-runner or service-template-node ever supported this though, did it? FWIW, EventStreams also generates some spec, but it is manual, and not handled by a lib.

> Unfortunately we need [...] workers

Curious, why do you need workers? Because WMF services are deployed on k8s now, we decided to rely on k8s pod replicas and k8s routing, instead of node.js worker / clustering.

>> What service are you using this for?

> https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams_HTTP_Service
> T361769: Migrate and re-deploy eventstreams using service-utils

>> Unfortunately we need [...] Open API spec generation.

> OpenAPI spec generation sounds great. File a task...and help implement it!?

> I don't believe service-runner or service-template-node ever supported this though, did it? FWIW, EventStreams also generates some spec, but it is manual, and not handled by a lib.

It was part of the list of things "not included" by service-utils so I assumed it was service-runner, but yeah now that I look that's in service-template-node, not service-runner: https://github.com/wikimedia/service-template-node/blob/main/lib/swagger-ui.js
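For reference, the spec-serving piece is small either way; a hand-rolled version along the lines of what the comment describes for EventStreams might look like the sketch below. The helper name and shape are hypothetical illustrations, not service-template-node's actual API:

```javascript
// Hypothetical helper that assembles a minimal OpenAPI 3 document by hand,
// roughly what "manual, not handled by a lib" spec generation amounts to.
function buildSpec({ title, version, paths }) {
  return {
    openapi: '3.0.0',
    info: { title, version },
    paths: paths || {},
  };
}

// A service would then serve this object as JSON from a spec endpoint.
// The values here are purely illustrative:
const spec = buildSpec({
  title: 'eventstreams',
  version: '1.0.0',
  paths: {
    '/v2/stream/{streams}': {
      get: { responses: { 200: { description: 'SSE stream' } } },
    },
  },
});
```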

>> Unfortunately we need [...] workers

> Curious, why do you need workers? Because WMF services are deployed on k8s now, we decided to rely on k8s pod replicas and k8s routing, instead of node.js worker / clustering.

Well, I'm not sure we need them but we have the default production setting which is one worker per CPU, and also I believe service-runner kills them if memory is exceeded (though apparently that hasn't happened as far back as the logs go). Having a look at things recently, we have about 100 workers die a week: https://logstash.wikimedia.org/goto/8063d32bd662f2c916bf3dba2fa57368

@akosiaris maybe that is something we can get rid of? I'm afraid I'm not sure if we lose anything by getting rid of it; I just assumed that because we intermittently have them restarting, they're useful. Maybe the k8s pods are sufficient?

> Well, I'm not sure we need them but we have the default production setting which is one worker per CPU,

It's 1 worker per pod. 1 worker per CPU (the ncpu setting) is a very bad pattern for deploying to a k8s cluster. The main reason is that it makes the workload's size (CPU usage) dependent on the hardware it happens to be deployed on, and thus impossible to schedule reliably across a set of different hardware nodes.

The default setting of 1 worker in each pod was a product of trying to figure out a sane default for that value, without entering into discussions as to whether the functionality makes sense in the k8s environment. I can add that I ran some basic benchmarking tests back then, testing num_workers=0 vs num_workers=1 vs num_workers=2, focused on availability. There are some minor gains in having num_workers=1 as far as that aspect is concerned, mostly because the Node.js cluster mechanism reacts faster than Kubernetes' default liveness probes in killing and restarting workers. I don't think those gains are justified in the majority of cases.

> and also I believe service-runner kills them if memory is exceeded (though apparently that hasn't happened as far back as the logs go).

Yes, it does, for heap size violations. It also kills workers if they fail the internal heartbeat comms.

> Having a look at things recently, we have about 100 workers die a week: https://logstash.wikimedia.org/goto/8063d32bd662f2c916bf3dba2fa57368

> @akosiaris maybe that is something we can get rid of? I'm afraid I'm not sure if we lose anything by getting rid of it; I just assumed that because we intermittently have them restarting, they're useful. Maybe the k8s pods are sufficient?

Yes, I think it can be removed. It is not contributing much in Citoid's use case. The worker monitoring it performs is more beneficial than standard k8s liveness probes under heavy concurrent traffic, which isn't the case for Citoid. Kubernetes pods should be sufficient.

First thing would be to set num_workers to 0 in the chart. This short-circuits service-runner's worker logic and has the main process do everything.

> First thing would be to set num_workers to 0 in the chart. This short-circuits service-runner's worker logic and has the main process do everything.

+1. FWIW, long ago, when service-runner-based services I was involved with were migrated to k8s, we set num_workers: 0, and things have been fine :)
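The chart-side change described above can be sketched as a service-runner config fragment. `num_workers` and `worker_heap_limit_mb` are real service-runner config keys; the surrounding values are illustrative:

```yaml
# Fragment of a service-runner config.yaml (values illustrative)
num_workers: 0             # no forked workers; the main process serves traffic
worker_heap_limit_mb: 500  # heap-kill threshold, only relevant when workers exist
```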