[Observation]: Workers on the same process report Ready every time a different worker completes. #2911

yacman · 2024-11-18T16:22:38Z

Version

v5.26.2

Platform

NodeJS

What happened?

We've added metric collection to our services and have been observing the worker.on('ready') event.

https://github.com/taskforcesh/bullmq/blob/8204ea3635b3042e8537c1d3f92584c572735a31/src/classes/worker.ts#L137C1-L143C1

While listening to this event, we can observe that if we have a worker formation like:

Pod A | Ready Count
WorkerA | 1
WorkerB | 1
WorkerC | 1

Pod B | Ready Count
WorkerA | 1
WorkerB | 1
WorkerC | 1

After a single job is processed on any worker (say, WorkerC on Pod A), the counts become:

Pod A | Ready Count
WorkerA | 2
WorkerB | 2
WorkerC | 1

Pod B | Ready Count
WorkerA | 1
WorkerB | 1
WorkerC | 1

The other workers' ready events are triggered when WorkerC completes.

Are these counts expected? If the events are OK, why would the other workers report becoming unblocked?

How to reproduce.

No response

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

manast · 2024-11-18T17:00:37Z

This "ready" event is emitted when the "blocking" (dedicated) Redis connection is ready. The event is actually generated by IORedis, signalling that the connection has been correctly established and is ready to accept Redis commands:

bullmq/src/classes/worker.ts

Line 323 in 8204ea3

this.blockingConnection.on('ready', () =>

So in your case I am not sure why this happens, but there are situations where we need to force-disconnect the blocking connection and reconnect, which would generate a new "ready" event:

bullmq/src/classes/worker.ts

Line 698 in 8204ea3

bclient.disconnect(!this.closing);

Having said that, I suspect you are using this event for something else, as the event is not particularly useful as is. If you describe your use case, we may be able to find a better way to do it than using this event.

yacman · 2024-11-18T17:32:24Z

Thanks for the details on where the event originates. We are looking to identify the number of worker threads that are globally connected and working, by Worker and Queue. For example, if there are 10 instances of Worker A on Pod A and 5 instances of Worker A on Pod B, we would observe two values, 10 and 5. If 2 workers on Pod B had issues with their Redis connections (for example a memory overflow, or something limiting CPU and causing timeouts or missed heartbeats), the values would start reporting as 10 and 3.

We are looking to alert when worker instance counts go outside of a boundary.

manast · 2024-11-18T17:59:35Z

What about the "getWorkers" and "getWorkersCount" APIs, are they not useful for your case?
https://api.docs.bullmq.io/classes/v5.Queue.html#getWorkers

In BullMQ you can also set a specific name for your workers if you want and you will get this name back when calling "getWorkers": https://api.docs.bullmq.io/interfaces/v5.WorkerOptions.html#name
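The per-worker aggregation described above could be sketched roughly as follows. `countBy`, `countWorkersBy`, and `WorkerLister` are hypothetical helpers, not part of BullMQ; the only library call relied on is `getWorkers()` from the API linked above. Which field of each returned entry carries the worker name is an assumption to verify by logging one entry, so the key function is left to the caller.

```typescript
// Hypothetical helper (not part of BullMQ): counts entries per key.
function countBy<T>(
  entries: readonly T[],
  keyOf: (entry: T) => string,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const entry of entries) {
    const key = keyOf(entry);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Structural type covering only the one method this sketch relies on,
// so any object exposing getWorkers() can be passed in.
interface WorkerLister<T> {
  getWorkers(): Promise<T[]>;
}

// Fetches the worker list once and groups it by the caller-chosen key.
async function countWorkersBy<T>(
  queue: WorkerLister<T>,
  keyOf: (entry: T) => string,
): Promise<Map<string, number>> {
  return countBy(await queue.getWorkers(), keyOf);
}
```

Passing a key function rather than hard-coding a field name keeps the sketch independent of the exact shape of the entries, which may differ between versions.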

yacman · 2024-11-18T21:05:02Z

Thank you, these are definitely the aggregations we are looking for.

Would calling these operations on a timer, say every 5 seconds, be a problem?

manast · 2024-11-18T21:08:57Z

Probably not a problem; it depends on how many workers you have. You can run some benchmarks to see how long the call takes. I would try to increase the interval as much as possible; every 30 seconds would be even better.
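A minimal sketch of such a timer, assuming alerts should fire when the count leaves a fixed range. `watchWorkerCount`, `isOutOfBounds`, `Bounds`, and `WorkerCounter` are hypothetical names introduced for illustration; the only library call assumed is `getWorkersCount()`, and the 30 second default follows the suggestion above.

```typescript
// Structural type covering only the one method this sketch relies on.
interface WorkerCounter {
  getWorkersCount(): Promise<number>;
}

// Hypothetical alerting range, inclusive on both ends.
interface Bounds {
  min: number;
  max: number;
}

// True when the count falls outside the inclusive [min, max] range.
function isOutOfBounds(count: number, bounds: Bounds): boolean {
  return count < bounds.min || count > bounds.max;
}

// Polls the worker count on a fixed interval and calls onAlert when it
// leaves the range. Returns a function that stops the polling.
function watchWorkerCount(
  queue: WorkerCounter,
  bounds: Bounds,
  onAlert: (count: number) => void,
  intervalMs = 30_000,
): () => void {
  let inFlight = false;
  const timer = setInterval(async () => {
    if (inFlight) return; // skip this tick if the previous poll is still running
    inFlight = true;
    try {
      const count = await queue.getWorkersCount();
      if (isOutOfBounds(count, bounds)) onAlert(count);
    } catch (err) {
      console.error('worker count poll failed', err);
    } finally {
      inFlight = false;
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

The in-flight guard is a design choice for the case where a poll takes longer than the interval: ticks are dropped rather than stacked, so a slow Redis does not receive overlapping requests.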

yacman · 2024-11-18T21:20:49Z

Thank you for the time and feedback!

yacman added the bug label Nov 18, 2024

yacman closed this as completed Nov 18, 2024
