

eventgate: eventstreams: update nodejs and OS
Closed, Resolved · Public

Description

We run services on deprecated nodejs and debian versions. We should update the docker images and codebases to nodejs 18 (LTS) or 21 (current), and to debian bookworm.

@elukey started working on the EventStreams image. Let's sync up.

Why the OS update?

Quoting @elukey

The OS is deprecated and removed from our infrastructure, so there is no way to build a stretch-based image anymore.

Details

Repo                             Branch   Lines +/-
operations/deployment-charts     master   +4 -4
operations/deployment-charts     master   +1 -1
operations/deployment-charts     master   +3 -3
operations/deployment-charts     master   +5 -2
operations/deployment-charts     master   +2 -2
operations/deployment-charts     master   +3 -3
operations/deployment-charts     master   +1 -1
operations/deployment-charts     master   +1 -19
operations/deployment-charts     master   +7 -1
operations/deployment-charts     master   +12 -0
operations/deployment-charts     master   +9 -1
operations/deployment-charts     master   +1 -2
operations/deployment-charts     master   +2 -2
operations/deployment-charts     master   +20 -6
operations/deployment-charts     master   +2 -2
operations/deployment-charts     master   +17 -12
operations/deployment-charts     master   +8 -5
operations/deployment-charts     master   +4 -4
operations/deployment-charts     master   +3 -3
operations/deployment-charts     master   +1 -1
eventgate-wikimedia              master   +43 -33
node-rdkafka-factory             master   +1 -1
node-rdkafka-factory             master   +2 K -1 K
operations/deployment-charts     master   +10 -2
operations/deployment-charts     master   +4 -2
mediawiki/services/eventstreams  master   +36 -32
mediawiki/services/eventstreams  master   +5 -3
mediawiki/services/eventstreams  master   +47 -54

Event Timeline


stream-beta works for me! Go for it!

Change 963411 had a related patch set uploaded (by Ottomata; author: Ottomata):

[node-rdkafka-factory@master] v1.0.1 - Update to node-rdkafka 2.17

https://gerrit.wikimedia.org/r/963411

Change 963411 abandoned by Ottomata:

[node-rdkafka-factory@master] v1.0.1 - Update to node-rdkafka 2.17

Reason:

No one uses this code outside of EventGate. Copying this code there.

https://gerrit.wikimedia.org/r/963411

EventGate status update:

  • EventGate is updated and should work with node 18.
  • I tried to remove the @wikimedia/node-rdkafka-factory extra dependency. The code is simple enough, but I think I might have messed up the optionality of node-rdkafka for eventgate-wikimedia when doing so (see the sketch after this list). I'll have to revisit this.
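For context, the "optionality" here presumably refers to the usual Node pattern of requiring a native module lazily so the service can still start when it is absent. A minimal hypothetical sketch of that pattern (not the actual eventgate-wikimedia code; names are illustrative):

'use strict';

// Hypothetical sketch of an optional native dependency: node-rdkafka is a
// compiled addon, so the service should still be able to load when it is not
// installed and a non-Kafka produce implementation is configured.
function loadKafka() {
    try {
        // Only resolves if node-rdkafka (and its native build) is installed.
        return require('node-rdkafka');
    } catch (err) {
        return null;
    }
}

const kafka = loadKafka();
if (!kafka) {
    console.warn('node-rdkafka not available; Kafka-based produce disabled');
}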

Will continue next week.

> Will continue next week.

Err, umm, I'm off next week. After that! :)

Change 964848 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: upgrade Docker image for eventstreams services

https://gerrit.wikimedia.org/r/964848

Change 964848 merged by Elukey:

[operations/deployment-charts@master] services: upgrade Docker image for eventstreams services

https://gerrit.wikimedia.org/r/964848

Tried to deploy es-internal in staging, and got:

{"name":"eventstreams","hostname":"eventstreams-main-58d6df6f8c-jnm7v","pid":141,"level":"FATAL","err":{"message":"connect ECONNREFUSED ::1:6500","name":"HTTPError","stack":"HTTPError: connect ECONNREFUSED ::1:6500\n    at /srv/service/node_modules/preq/index.js:246:19\n    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n    at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:725:18)\n    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n    at Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n    at process.processImmediate (node:internal/timers:471:21)","status":504,"headers":{"content-type":"application/problem+json"},"body":{"type":"internal_http_error","detail":"connect ECONNREFUSED ::1:6500","internalStack":"Error: connect ECONNREFUSED ::1:6500\n    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1481:16)","internalURI":"http://localhost:6500/w/api.php?format=json&action=streamconfigs","internalErr":"connect ECONNREFUSED ::1:6500","internalMethod":"get"},"levelPath":"fatal/service-runner/worker"},"msg":"connect ECONNREFUSED ::1:6500","time":"2023-10-12T10:05:25.944Z","v":0}

For some reason localhost gets resolved to ::1 (IPv6) and we get the failure. Probably forcing 127.0.0.1 should be enough.
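For reference, this matches a known Node.js behavior change: starting with Node 17, dns.lookup() returns resolver results verbatim instead of reordering them IPv4-first, so localhost can resolve to ::1 before 127.0.0.1. A minimal sketch of the behavior and a possible process-wide alternative (illustrative only; the fix that follows simply hardcodes 127.0.0.1 in the chart):

'use strict';
// Sketch only: demonstrates why connections to "localhost" started hitting
// ::1 on node 18. Not part of eventstreams.
const dns = require('dns');

dns.lookup('localhost', { all: true }, (err, addresses) => {
    if (err) throw err;
    // On node >= 17 this typically lists ::1 before 127.0.0.1, because the
    // resolver results are no longer reordered to prefer IPv4.
    console.log(addresses);
});

// A process-wide alternative to hardcoding 127.0.0.1 in config:
// dns.setDefaultResultOrder('ipv4first');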

Change 965481 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: force ipv4 for eventstreams when using the local tls proxy

https://gerrit.wikimedia.org/r/965481

Change 965481 merged by Elukey:

[operations/deployment-charts@master] services: force ipv4 for eventstreams when using the local tls proxy

https://gerrit.wikimedia.org/r/965481

Deployed ES to production, so far all good! I observed some higher latency only for codfw (which takes way less traffic than eqiad); I'll circle back in a few days to see if it is still the case.

Change 963411 restored by Ottomata:

[node-rdkafka-factory@master] v1.0.1 - Update to node-rdkafka 2.17

https://gerrit.wikimedia.org/r/963411

Change 963411 merged by Ottomata:

[node-rdkafka-factory@master] v1.0.1 - Update to node-rdkafka 2.17

https://gerrit.wikimedia.org/r/963411

Change 966591 had a related patch set uploaded (by Ottomata; author: Ottomata):

[node-rdkafka-factory@master] Bump version to 1.1.0

https://gerrit.wikimedia.org/r/966591

Change 966591 merged by Ottomata:

[node-rdkafka-factory@master] Bump version to 1.1.0

https://gerrit.wikimedia.org/r/966591

Change 966640 had a related patch set uploaded (by Ottomata; author: Ottomata):

[eventgate-wikimedia@master] Updates for running with upgraded EventGate library and node 18

https://gerrit.wikimedia.org/r/966640

Change 966640 merged by jenkins-bot:

[eventgate-wikimedia@master] Updates for running with upgraded EventGate library and node 18

https://gerrit.wikimedia.org/r/966640

Change 968304 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-logging-external - upgrade to debian bookworm and nodejs 18

https://gerrit.wikimedia.org/r/968304

I've upgraded eventgate instances in deployment-prep / beta. Things looking good from there.

I'll proceed to upgrading production instances. Plan:

  • eventgate-logging-external
    • canary first, then production
  • eventgate-analytics
    • canary first, then production
  • eventgate-analytics-external
  • eventgate-main
    • canary first, then production

Will likely leave a dayish in between upgrading eventgate-analytics and the rest.

Change 968304 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-logging-external - upgrade to debian bookworm and nodejs 18

https://gerrit.wikimedia.org/r/968304

Change 968327 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-logging-external - use 127.0.0.1 instead of localhost

https://gerrit.wikimedia.org/r/968327

Change 968327 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-logging-external - use 127.0.0.1 instead of localhost

https://gerrit.wikimedia.org/r/968327

Change 968330 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - upgrade to debian bookworm and nodejs 18

https://gerrit.wikimedia.org/r/968330

Change 968330 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics - upgrade to debian bookworm and nodejs 18

https://gerrit.wikimedia.org/r/968330

Change 968334 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics-external - upgrade to debian bookworm and nodejs 18

https://gerrit.wikimedia.org/r/968334

@Ottomata I checked some of the eventgate-analytics metrics since I was curious about CPU usage (I am trying to upgrade ChangeProp to node 18 and noticed some changes, so I was wondering if they were node-18 related), and something looks odd:

  • There is a constant rate of HTTP 404s in the error metrics; they seem to have started after your deploy.
  • Latency and CPU usage increased

I see something similar for logging-external, but I don't see the same CPU increase.

I did some profiling of ChangeProp in T348950#9276824; the only thing I noticed is that maybe with a newer version of librdkafka/node-rdkafka the poll callbacks are called more frequently and cause more load, but that is only a theory.

Let's pause the rollout, if you agree, to figure out what's wrong (maybe it is nothing but better safe than sorry).

Sounds good, let's pause the rollout. eventgate-analytics is under much heavier throughput than eventgate-logging-external, so maybe that's why we don't see quite the same behavior there.

I also saw the 404s, but couldn't find evidence of what they were. They seemed minimal (even if they were 0 before), so I wasn't too worried about it. Maybe the upgrades of service-template-node dependencies along the way caused 404s to start being instrumented correctly? I wasn't sure.

I think there might be some debug mode that will log 404s, I'll look into that.

Mentioned in SAL (#wikimedia-operations) [2023-10-25T17:02:19Z] <ottomata> temporarily increasing log level to trace for eventgate-logging-external in eqiad canary release only - T347477

Okay, the 404s are all to '/' AKA path: "root"
https://grafana.wikimedia.org/goto/jDZIEEnSz?orgId=1

I don't know why the upgraded image is now showing these, but they have always been 404s from eventgate's perspective.

To compare, I curled 'https://eventgate-main.discovery.wmnet:4492' (which responds with a 404) in a loop, but didn't see any 404s in prometheus.

I think something must have just gotten 'fixed' in whatever updates we did for service-template-node?

> Latency and CPU usage increased

At least memory usage went down?

I don't mind the slight CPU increase...I don't love the latency increase though.

What's strange is that most of the latency increase is in irrelevant GET requests, e.g. to /_info, /robots.txt, etc. POSTing to the /v1/events endpoint does increase a bit, but not by as much.

Hm, actually, on second glance, they all increase by around the same factor of 2x? /v1/events: 88us -> 158us, /_info: 827us -> 2.1ms

Change 969961 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - disable SYS_PTRACE on wmfdebug container

https://gerrit.wikimedia.org/r/969961

Change 969961 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - disable SYS_PTRACE on wmfdebug container

https://gerrit.wikimedia.org/r/969961

Change 969963 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - separate config for wmfdebug container from nodejs profiler

https://gerrit.wikimedia.org/r/969963

Change 969963 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - separate config for wmfdebug container from nodejs profiler

https://gerrit.wikimedia.org/r/969963

Change 969964 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - fix debug mode CLI args

https://gerrit.wikimedia.org/r/969964

Change 969964 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - fix debug mode CLI args

https://gerrit.wikimedia.org/r/969964

Change 970371 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - debug mode: add some perf settings

https://gerrit.wikimedia.org/r/970371

Change 970371 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - debug mode: add some perf settings

https://gerrit.wikimedia.org/r/970371

Change 970372 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - fix missing comma

https://gerrit.wikimedia.org/r/970372

Change 970372 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - fix missing comma

https://gerrit.wikimedia.org/r/970372

Change 970374 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - remove --prof-process flag from debug mode

https://gerrit.wikimedia.org/r/970374

Change 970374 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - remove --prof-process flag from debug mode

https://gerrit.wikimedia.org/r/970374

Did a little investigating today. Got flame graphs for node10 and node18 versions following Luca's method here. Documented this method at https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Profiling_nodejs.

Here are the flame graphs:

  • node10: (flame graph attachment)
  • node18: (flame graph attachment)

I don't see anything specifically obvious here. There are differences, but they don't (at least at a glance) seem to be node-rdkafka related, at least for eventgate, which is not a Kafka consumer.

However, in the v8 profiler summaries we do see a difference in time spent in GC:

node10:

[Summary]:
   ticks  total  nonlib   name
   75622  14.2%   85.9%  JavaScript
       0   0.0%    0.0%  C++
    2624   0.5%    3.0%  GC
  443020  83.4%          Shared libraries
   12396   2.3%          Unaccounted

node18:

[Summary]:
   ticks  total  nonlib   name
   60089   9.2%   85.5%  JavaScript
       0   0.0%    0.0%  C++
   12149   1.9%   17.3%  GC
  579597  89.2%          Shared libraries
   10180   1.6%          Unaccounted

Node18 is working harder doing GC than node10. This tracks with the fact that we are seeing lower memory usage in node18.

Still investigating...
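As an aside, GC activity can also be watched from inside the process via perf_hooks, which is lighter weight than a full --prof run. A minimal sketch (an assumption of how one could instrument this, not something eventgate currently does):

'use strict';
// Sketch: log every GC pause with its type and duration. Works on node >= 16,
// where the GC kind lives in entry.detail.kind.
const { PerformanceObserver, constants } = require('perf_hooks');

const GC_KINDS = {
    [constants.NODE_PERFORMANCE_GC_MINOR]: 'scavenge',
    [constants.NODE_PERFORMANCE_GC_MAJOR]: 'mark-sweep',
    [constants.NODE_PERFORMANCE_GC_INCREMENTAL]: 'incremental',
    [constants.NODE_PERFORMANCE_GC_WEAKCB]: 'weak-callbacks',
};

const obs = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
        const kind = GC_KINDS[entry.detail.kind] || `kind ${entry.detail.kind}`;
        console.log(`gc ${kind}: ${entry.duration.toFixed(2)}ms`);
    }
});
obs.observe({ entryTypes: ['gc'] });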

Change 970784 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - add nodejs_extra_opts value

https://gerrit.wikimedia.org/r/970784

Change 970784 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - add nodejs_extra_opts value

https://gerrit.wikimedia.org/r/970784

I added --trace-gc to the nodejs CLI for the eventgate-analytics canary (node18) and the eventgate-analytics-external canary (still on node10) and compared the output. I do see some differences in the Scavenge GC phase.

NodeJS 10:

  • Scavenges happen only because of 'allocation failure' (normal)
  • Regular scavenges happening around every 2000ms, each taking about 2ms to complete
  • Occasional bursts of quick subsequent (every few ms) 'allocation failure' scavenges, each taking about 0.3ms to complete.

NodeJS 18:

  • Scavenges happen often because of 'task', not just 'allocation failure'
  • 'task' scavenges happen about every 500ms, each taking about 2ms to complete
  • When 'allocation failure' scavenges happen, they happen in quick succession
  • Occasional bursts of quick subsequent (every few ms) 'allocation failure' scavenges, each taking about 0.3ms to complete.

I don't know what a 'task' Scavenge is. I'm googling but not finding much.

I'm curious to try increasing --max_semi_space_size, but I'm kind of just guessing.
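For background on why the semi-space size matters here: the Scavenge GC only collects the young generation, so a bigger semi-space fills up (and triggers a scavenge) less often, at the cost of a bit more memory and potentially longer individual pauses. A hedged sketch of how one might experiment with it (the flag value and script name are placeholders, not a recommendation):

// Run with, e.g.:
//   node --max_semi_space_size=32 --trace-gc server.js
// (--max_semi_space_size is in MiB per semi-space.)
'use strict';
const v8 = require('v8');

// Inspect the young-generation ("new space") size the process ended up with.
const newSpace = v8.getHeapSpaceStatistics()
    .find((s) => s.space_name === 'new_space');
console.log('new space size (bytes):', newSpace.space_size);
console.log('new space used (bytes):', newSpace.space_used_size);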

Hm, what happened to eventgate cpu and mem resource requests and limits in the helmfiles? Long ago we did some benchmarking and tuning and set these. I see these set for eventgate-logging-external, but that's all!

I'm going to try setting these first.

Change 970796 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - set cpu and mem requests and limits

https://gerrit.wikimedia.org/r/970796

Change 970796 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics - set cpu and mem requests and limits

https://gerrit.wikimedia.org/r/970796

Change 970798 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - also set cpu and mem requests and limits for canary

https://gerrit.wikimedia.org/r/970798

Change 970798 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics - also set cpu and mem requests and limits for canary

https://gerrit.wikimedia.org/r/970798

> Hm, what happened to eventgate cpu and mem resource requests and limits in the helmfiles?

Ah, they are in the chart values defaults, main_app.requests, etc.

Okay...

Change 970804 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics - revert cpu and mem requests and limits

https://gerrit.wikimedia.org/r/970804

Change 970804 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics - revert cpu and mem requests and limits

https://gerrit.wikimedia.org/r/970804

I deployed eventgate-analytics in eqiad with --max_semi_space_size=32. Looking at the --trace-gc logs, it looks like the frequency of 'task' scavenges is reduced, maybe down to about every 2000ms, just like it was in node10.

Will let this go for a while and then compare cpu and latency.

Looks like CPU has gone down VERY slightly, and memory usage is up very slightly, which I suppose makes sense. No effect on request latency though, or at least a negligible one.

Going to revert and stick with default --max_semi_space_size.

@elukey, I'd like to proceed with the deployment as is. Going to move forward with eventgate-analytics-external today, but hold on eventgate-main for now.

Change 970811 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics-external - upgrade to nodejs 18

https://gerrit.wikimedia.org/r/970811

Change 970811 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics-external - upgrade to nodejs 18

https://gerrit.wikimedia.org/r/970811

Change 970813 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics-external - use 127.0.0.1 in urls instead of localhost

https://gerrit.wikimedia.org/r/970813

Change 970813 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics-external - use 127.0.0.1 in urls instead of localhost

https://gerrit.wikimedia.org/r/970813

Change 970814 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-analytics-external - fix typo in schema uris

https://gerrit.wikimedia.org/r/970814

Change 970814 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-analytics-external - fix typo in schema uris

https://gerrit.wikimedia.org/r/970814

Same CPU and latency and memory pattern after deploying to eventgate-analytics-external.

Hm, there is CPU throttling on eventgate-analytics-external (there is also a small amount on eventgate-analytics).
It was there before today too, but now that CPU usage is a little higher, there is also a little more throttling.

I'm confused as to why there would be throttling though. The max pod CPU usage is around 250m, and CPU limits are set at 2000m. Reading https://stackoverflow.com/questions/54099425/pod-cpu-throttling indicates that perhaps setting CPU limits is not really a good practice?

Mentioned in SAL (#wikimedia-operations) [2023-11-02T15:51:39Z] <ottomata> eventgate-analytics in eqiad: setting service-runner num_workers: 0 to run with one process and reduce # of threads used by container processes. Should reduce throttling and perhaps help with latency. If works, will make this the default in the chart. - T347477

Mentioned in SAL (#wikimedia-operations) [2023-11-02T16:30:29Z] <ottomata> eventgate-analytics in codfw: setting service-runner num_workers: 0 to run with one process and reduce # of threads used by container processes. Should reduce throttling and perhaps help with latency. If works, will make this the default in the chart. - T347477

Mentioned in SAL (#wikimedia-operations) [2023-11-02T16:30:38Z] <ottomata> eventgate-analytics-external: setting service-runner num_workers: 0 to run with one process and reduce # of threads used by container processes. Should reduce throttling and perhaps help with latency. If works, will make this the default in the chart. - T347477

Luca and I met today, and after reading https://wikitech.wikimedia.org/wiki/Kubernetes/Resource_requests_and_limits (from @JMeybohm :) ), we understood that, basically, the more threads there are, the more likely throttling becomes, since the allocated CPU time is divided between all the threads more or less evenly.

eventgate-wikimedia's docker image is built with UV_THREADPOOL_SIZE=128 set. This causes all NodeJS processes to have at least this many threads (we saw around 145 on k8s nodes). The eventgate helm chart sets the default service-runner num_workers to 1. This means there will be 2 processes, one 'master' and one 'worker'. I believe this might be a vestigial decision made when we migrated eventgate from bare metal to k8s. Previously, worker process pools were useful for parallelizing request processing on bare metal; now we just use k8s replica pods. With 2 processes (master and worker), there were almost 300 threads allocated per container.

We set num_workers: 0 (meaning master and worker run in the same process) for eventgate-analytics and eventgate-analytics-external and reduced the number of threads to around 150. This eliminated CPU throttling, but did not significantly affect latency or CPU usage. Container memory usage also went down, likely because there is one less process managing its own heap.
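To make the thread math above concrete: libuv sizes its worker-thread pool from UV_THREADPOOL_SIZE at process start, so an image that exports UV_THREADPOOL_SIZE=128 gives each node process roughly that many extra threads, and with service-runner running separate master and worker processes the per-container count roughly doubles. A small illustrative sketch (Linux-only, not part of the chart) of how one can sanity-check the count from inside a process:

'use strict';
// Illustrative only: count the threads of the current process on Linux by
// listing /proc/self/task, and show the libuv threadpool size in effect.
const fs = require('fs');

const threadCount = fs.readdirSync('/proc/self/task').length;
const uvPoolSize = process.env.UV_THREADPOOL_SIZE || '4 (libuv default)';

console.log(`threads in this process: ${threadCount}`);
console.log(`UV_THREADPOOL_SIZE: ${uvPoolSize}`);

Note that libuv spawns its pool threads lazily, so a freshly started process only shows the full count once some threadpool work (fs, dns, crypto) has run.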

I'll make num_workers: 0 the default in the eventgate chart.

I feel comfortable moving forward with this deployment in eventgate-main, but we'll wait until Monday to do so.

Change 971243 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - set default service-runner num_workers to 0

https://gerrit.wikimedia.org/r/971243

Change 971243 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - set default service-runner num_workers to 0

https://gerrit.wikimedia.org/r/971243

Change 972050 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-main - use 127.0.0.1 instead of localhost

https://gerrit.wikimedia.org/r/972050

Change 972050 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main - use 127.0.0.1 instead of localhost

https://gerrit.wikimedia.org/r/972050

Change 972418 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate chart - Increase default cpu limits to 1500m

https://gerrit.wikimedia.org/r/972418

Change 972418 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate chart - Increase default cpu limits to 1500m

https://gerrit.wikimedia.org/r/972418

Change 968334 abandoned by Ottomata:

[operations/deployment-charts@master] eventgate-analytics-external - upgrade to debian bookworm and nodejs 18

Reason:

Done elsewhere.

https://gerrit.wikimedia.org/r/968334