shellbox-video pods being restarted prematurely
Open, HighPublic
Actions

Assigned To

Authored By

	hnowlan
	Aug 28 2024, 12:54 PM

Description

Since migrating to use shellbox for video, we're seeing a few behaviours where pods are being killed. There are (at least) two different behaviours being seen in relation to this.

One is the pathological case of the healthchecking workaround of using the min_avail_workers parameter where Kubernetes assigns two encoding jobs to a single pod, thus ensuring that the healthcheck eventually fails (sometimes after exhausting the Apache workers) due to no active php workers being able to answer the calls to the liveness probe.

The other is less understood but appears to be a timeout or maximum execution time of some sort in the shelling out side of shellbox or our existing php-fpm configuration where the p99 for shellbox-video tasks appears to be pegged at a maxium of 1 hour. Jobs will of course take a lot longer than this usually, and the p99 for metal-based videoscalers frequently ran in excess of 1 day (admittedly we expect to see different behaviour here but at the very least we'd expect to see multiples of an hour). Manually testing encoding of some longer videos has yielded failure.

Pods failing in the latter manner seem to indicate that Shellbox itself has exited.

Last State:     Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Wed, 28 Aug 2024 11:44:46 +0000
  Finished:     Wed, 28 Aug 2024 12:09:24 +0000

Note the time differences here - healthz is called regularly up until 14:21:04 when it is not called again until we have restarted. The final connection refused errors repeat hundreds of times. Multiple video transcodes are requested, and succeed. It seems likely that multiple healthchecks are piling up, causing us to eventually run out of apache workers. I'm not sure what causes the subsequent failure to connect to shellbox but this coincides with the pod itself being restarted.

{"timestamp": "2024-08-28T14:21:04", "RequestTime": "977", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "47", "Method": "GET", "Url": "http://shellbox-video.discovery.wmnet/healthz", "MimeType": "application/json", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "Blackbox Exporter/0.23.0", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "514fa4b6-1c6f-4c69-98f6-faf2ed380af2", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:22:36", "RequestTime": "1530909", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "47074", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "de36ae53-cece-45ff-a8ea-3d326e3646dd", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:22:45", "RequestTime": "1836476", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "51956", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "02bc930e-c51e-40b4-9168-390cb2593a18", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:24:25", "RequestTime": "4912284", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "177539", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "6d911dbe-841c-484b-8e25-3ab2b82b5a64", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:24:26", "RequestTime": "8506017", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "206093", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "6d911dbe-841c-484b-8e25-3ab2b82b5a64", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:26:01", "RequestTime": "2124755", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "108835", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "533917f9-7638-4a4d-9593-7ef7ab58132c", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:26:30", "RequestTime": "353200", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "196331", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "e4325c43-eaf1-4de9-b9b5-386f90268a66", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:27:00", "RequestTime": "127289", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "24507", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "fefa4cfe-8d6c-486a-bd45-49d382f0e4f6", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:27:25", "RequestTime": "171475", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "42171", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "c077503c-32ad-4d4c-ad51-89e3a739cdab", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:28:55", "RequestTime": "1264936", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "42849", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "8c744e10-6211-4c26-aa46-f12ca42d77b5", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:29:30", "RequestTime": "3175974", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "135303", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "aa6f9e1e-e905-43aa-a4dc-876e9f449653", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:29:50", "RequestTime": "3216191", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "144011", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "457ea603-dd24-4591-94bb-17e4ab1cb4c7", "X-Client-IP": "-"}
{"timestamp": "2024-08-28T14:29:54", "RequestTime": "2568358", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "133561", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "30b19410-9034-48f1-a9fc-18dd75c0f2dd", "X-Client-IP": "-"}
[Wed Aug 28 14:32:43.285213 2024] [mpm_worker:error] [pid 1:tid 139886347531392] AH00287: server is within MinSpareThreads of MaxRequestWorkers, consider raising the MaxRequestWorkers setting
[Wed Aug 28 14:33:04.303606 2024] [mpm_worker:error] [pid 1:tid 139886347531392] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
{"timestamp": "2024-08-28T14:30:45", "RequestTime": "559977136", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/503", "ResponseSize": "299", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "text/html", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "ffec46cf-9dce-4e23-9eed-ce48e5b378ff", "X-Client-IP": "-"}
[Wed Aug 28 14:40:05.669283 2024] [proxy:error] [pid 20:tid 139886204020480] (111)Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1:9000 (localhost:9000) failed
[Wed Aug 28 14:40:05.669731 2024] [proxy:error] [pid 20:tid 139885641971456] (111)Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1:9000 (localhost:9000) failed

Details

Subject	Repo	Branch	Lines +/-
services_proxy: add tcp_keepalive parameter, enable for shellbox-video	operations/puppet	production	+17 -0
Set a content-length for multipart responses	mediawiki/libs/Shellbox	master	+5 -1
Apply videoscaler request limits and wall clock time limits to shellbox-video	operations/mediawiki-config	master	+2 -2
envoy: set downstream and upstream idle timeout hiera keys for cloud	operations/puppet	production	+2 -0
envoyproxy: add route-level idle timeout for tlsproxy	operations/puppet	production	+18 -0
shellbox-video: bypass mesh temporarily	operations/mediawiki-config	master	+1 -1
jobrunner: set stream_idle_timeout appropriately	operations/puppet	production	+1 -0
videoscaler: bump idle_timeouts for envoy tls terminator	operations/puppet	production	+2 -0
videoscalers: enable error logging on tls terminator envoy	operations/puppet	production	+1 -0
shellbox-video: use correct command in process check	operations/deployment-charts	master	+1 -1
shellbox-video: add process-based readiness check	operations/deployment-charts	master	+13 -1
Bump to rebuild image	mediawiki/libs/Shellbox	master	+1 -1
php:common: sleep briefly when checking for busy workers	operations/docker-images/production-images	master	+17 -0
timedmediahandler: revert using shellbox for commonswiki	operations/mediawiki-config	master	+1 -0
shellbox-video: healthcheck every 1s	operations/deployment-charts	master	+1 -1

Related Objects
Search...

Status	Assigned	Task
In Progress	None	T290536 Serve production traffic via Kubernetes
In Progress	Clement_Goubert	T321786 Deploy mediawiki kubernetes services
Open	None	T355292 Port videoscaling to kubernetes
In Progress	hnowlan	T356241 Move video transcoding to use Shellbox
Open	hnowlan	T373517 shellbox-video pods being restarted prematurely

Event Timeline

hnowlan created this task.Aug 28 2024, 12:54 PM

Restricted Application removed a project: Patch-For-Review. · View Herald TranscriptAug 28 2024, 12:54 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

As a stopgap measure we've already introduced retries to the videoscaling job

Change #1067992 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video: healthcheck every 1s

https://gerrit.wikimedia.org/r/1067992

gerritbot added a project: Patch-For-Review.Aug 28 2024, 1:08 PM

hnowlan updated the task description. (Show Details)Aug 28 2024, 1:11 PM

Change #1067992 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: healthcheck every 1s

https://gerrit.wikimedia.org/r/1067992

hnowlan triaged this task as High priority.Aug 28 2024, 3:04 PM

hnowlan updated the task description. (Show Details)

Change #1068025 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] timedmediahandler: revert using shellbox for commonswiki

https://gerrit.wikimedia.org/r/1068025

bvibber subscribed.Aug 28 2024, 4:08 PM

Change #1068025 merged by jenkins-bot:

[operations/mediawiki-config@master] timedmediahandler: revert using shellbox for commonswiki

https://gerrit.wikimedia.org/r/1068025

Mentioned in SAL (#wikimedia-operations) [2024-08-28T16:17:36Z] <hnowlan@deploy1003> Started scap sync-world: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]]

Mentioned in SAL (#wikimedia-operations) [2024-08-28T16:20:01Z] <hnowlan@deploy1003> hnowlan: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-08-28T16:24:50Z] <hnowlan@deploy1003> Finished scap sync-world: Backport for [[gerrit:1068025|timedmediahandler: revert using shellbox for commonswiki (T373517)]] (duration: 07m 13s)

bvibber mentioned this in T373546: Migrate off HLS mov/mp4 experiment to a flat mov back-compat with WebM and MPEG-DASH.Aug 28 2024, 5:29 PM

Krinkle subscribed.Aug 29 2024, 4:27 AM

The healthcheck endpoint is not consistently returning a 503 when workers are busy - this could be some kind of a race condition. When all of the following were executed the pod was actively running an ffmpeg process:

Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1                          
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1                                                                             OKroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1        
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1                        
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
OKroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1
Service Unavailableroot@mw1482:/home/hnowlan# nsenter -t 313825 -n curl 10.67.139.145:9181/healthz?min_avail_workers=1

Change #1072174 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/docker-images/production-images@master] php:common: sleep briefly when checking for busy workers

https://gerrit.wikimedia.org/r/1072174

From php-fpm's fpm-status we can even see this behaviour so our check isn't at fault:

root@mw1451:/home/hnowlan# for i in `seq 200`; do curl -s 10.67.165.241:9181/fpm-status| grep ^active; sleep 0.2; done | sort | uniq -c
     18 active processes:     1
    182 active processes:     2

Change #1072174 merged by Hnowlan:

[operations/docker-images/production-images@master] php:common: sleep briefly when checking for busy workers

https://gerrit.wikimedia.org/r/1072174

Change #1072238 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[mediawiki/libs/Shellbox@master] Bump to rebuild image

https://gerrit.wikimedia.org/r/1072238

Change #1072238 merged by jenkins-bot:

[mediawiki/libs/Shellbox@master] Bump to rebuild image

https://gerrit.wikimedia.org/r/1072238

Change #1072549 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video: add process-based readiness check

https://gerrit.wikimedia.org/r/1072549

Change #1072549 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: add process-based readiness check

https://gerrit.wikimedia.org/r/1072549

Change #1072571 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] shellbox-video: use correct command in process check

https://gerrit.wikimedia.org/r/1072571

Change #1072571 merged by jenkins-bot:

[operations/deployment-charts@master] shellbox-video: use correct command in process check

https://gerrit.wikimedia.org/r/1072571

We have at least partially addressed the healthchecking issues by introducing a second readiness probe on the shellbox app container that checks for an ffmpeg process running, which appears to be working quite well.

More confusing details have emerged though - we see pods running the ffmpeg processes seemingly successfully, but with a very early error from the envoy on the videoscaler side (note the UC status code)

[2024-09-13T14:49:53.037Z] "POST /shell/tmh-ffmpegencode HTTP/1.1" 503 UC 211365618 95 1200021 - "-" "MediaWiki/1.43.0-wmf.22" "28a856bc-112d-4a4c-b569-383f1dfe4d39" "localhost:6036" "10.2.2.68:4080"

Additionally we see this in the logs of the pod envoy (note the 200 and the DC status code)

[2024-09-13T14:49:53.045Z] "POST /shell/tmh-ffmpegencode HTTP/1.1" 200 DC 211365618 1612706 2139390 1198106 "-" "MediaWiki/1.43.0-wmf.22" "28a856bc-112d-4a4c-b569-383f1dfe4d39" "localhost:6036" "127.0.0.1:8080"

Also in the apache log we see a 200 and a long requesttime, indicating that the task was seen as successful on the shellbox side:

{"timestamp": "2024-09-13T14:49:53", "RequestTime": "2139390677", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "24829952", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.22", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "28a856bc-112d-4a4c-b569-383f1dfe4d39", "X-Client-IP": "-"}

The second Envoy log message occurs long after the 503 log entry is seen on the videoscaler side. It logically follows that we'd see a DC on the pod envoy if the videoscaler envoy assumes that the connection has been a disconnect (which is signified by the UC).

Change #1073192 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] videoscalers: enable error logging on tls terminator envoy

https://gerrit.wikimedia.org/r/1073192

Change #1073192 merged by Hnowlan:

[operations/puppet@production] videoscalers: enable error logging on tls terminator envoy

https://gerrit.wikimedia.org/r/1073192

Change #1073209 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] videoscaler: bump idle_timeouts for envoy tls terminator

https://gerrit.wikimedia.org/r/1073209

Change #1073209 merged by Hnowlan:

[operations/puppet@production] videoscaler: bump idle_timeouts for envoy tls terminator

https://gerrit.wikimedia.org/r/1073209

Change #1073219 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] envoyproxy: add route-level idle timeout for tlsproxy

https://gerrit.wikimedia.org/r/1073219

Change #1073219 merged by Hnowlan:

[operations/puppet@production] envoyproxy: add route-level idle timeout for tlsproxy

https://gerrit.wikimedia.org/r/1073219

Change #1073262 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] jobrunner: set stream_idle_timeout appropriately

https://gerrit.wikimedia.org/r/1073262

Change #1073262 merged by Hnowlan:

[operations/puppet@production] jobrunner: set stream_idle_timeout appropriately

https://gerrit.wikimedia.org/r/1073262

Change #1073770 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] shellbox-video: bypass mesh temporarily

https://gerrit.wikimedia.org/r/1073770

Change #1073770 merged by jenkins-bot:

[operations/mediawiki-config@master] shellbox-video: bypass mesh temporarily

https://gerrit.wikimedia.org/r/1073770

Change #1073840 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/mediawiki-config@master] Apply videoscaler request limits and wall clock time limits to shellbox-video

https://gerrit.wikimedia.org/r/1073840

Change #1074198 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] envoy: set downstream and upstream idle timeout hiera keys for cloud

https://gerrit.wikimedia.org/r/1074198

Change #1074198 merged by Dzahn:

[operations/puppet@production] envoy: set downstream and upstream idle timeout hiera keys for cloud

https://gerrit.wikimedia.org/r/1074198

Just to note, I've been testing by forcing a reencode of this video in VP9 format. This can also be tested by grabbing a job from kafka using kafkacat (kafkacat -b kafka-main1004.eqiad.wmnet:9092 -t eqiad.mediawiki.job.webVideoTranscode -o -200) and then POSTing the inner parts of the event via curl to a specific videoscaler to test logging changes etc:

time curl -H "Host: videoscaler.discovery.wmnet" -k -v -v -X POST -d '{"database":"testwiki","type":"webVideoTranscode","params": {"transcodeMode":"derivative" ,"transcodeKey":"240p.vp9.webm","prioritized":false,"manualOverride":true,"remux":false,"requestId":"A_REQ_ID","namespace":6,"title":"CC_1916_10_02_ThePawnshop.mpg"},"mediawiki_signature":"A_SIG"}' https://mw1437.eqiad.wmnet/rpc/RunSingleJob.php

Currently the error we're seeing from the RunSingleJob/TimedMediaHandler invocation is Exception: cURL error 56: OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104 (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://shellbox-video.discovery.wmnet:4080/shell/tmh-ffmpegencode. We also see timeouts in the debug log messages from the shellbox pod Envoy when a response is sent to clients. Close to the exact same time as the response is sent, the 500 status is returned to the client. Given the relative lack of logging from RunSingleJob I suspect that we are hitting this error rather than the general exception handling case. More logging from the JobExecutor would be handy here if we can get it.

Given that the request appears to be being terminated mid-flight there is still a chance that some timeout or limit is being hit. This patch is a bit of a desperate test of the request time limits, but joe has made the point that these limits are not applied to calls made by C libraries such as curl.

We need to try to get more insight into the responses that we're seeing in the Shellbox client and in what TMH/RunSingleJob are seeing as far as responses and similar.

Change #1073840 merged by jenkins-bot:

[operations/mediawiki-config@master] Apply videoscaler request limits and wall clock time limits to shellbox-video

https://gerrit.wikimedia.org/r/1073840

Mentioned in SAL (#wikimedia-operations) [2024-09-24T13:02:54Z] <lucaswerkmeister-wmde@deploy1003> Started scap sync-world: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]]

Mentioned in SAL (#wikimedia-operations) [2024-09-24T13:08:19Z] <lucaswerkmeister-wmde@deploy1003> lucaswerkmeister-wmde, hnowlan: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-09-24T13:20:06Z] <lucaswerkmeister-wmde@deploy1003> Finished scap sync-world: Backport for [[gerrit:1073840|Apply videoscaler request limits and wall clock time limits to shellbox-video (T373517)]] (duration: 17m 12s)

akosiaris subscribed.Sep 24 2024, 1:26 PM

Change #1076212 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[mediawiki/libs/Shellbox@master] Set a content-length for multipart responses

https://gerrit.wikimedia.org/r/1076212

A_smart_kitten subscribed.Oct 13 2024, 12:49 PM

TheDJ moved this task from To sort to In progress on the TimedMediaHandler board.Oct 16 2024, 1:55 PM

Minor datapoint that hasn't been noted - when testing with a larger file that takes longer to convert, we're seeing the same behaviour. This adds credence to the idea that this issue is not caused by a timeout, and is most likely caused by some kind of issue with the handling of and reading of responses, most likely beyond shellbox.

In T373517#10161968, @hnowlan wrote:

Given the relative lack of logging from RunSingleJob I suspect that we are hitting this error rather than the general exception handling case. More logging from the JobExecutor would be handy here if we can get it.

I suspect this error is never actually hit, as an event in EventBus\RunSingleJobHandler returning a response['status'] of false will always cause an exception. This means the nearby readonly backoff behaviour will never be triggered, but that's for another issue.

Are the http requests using chunked transfer encoding. or not ? (I'm assuming its all http 1.1 and not 2.0)

hnowlan mentioned this in T377830: RunSingleJob.php's readonly backoff behaviour will never be triggered.Tue, Oct 22, 11:36 AM

In T373517#10249262, @TheDJ wrote:

Are the http requests using chunked transfer encoding. or not ? (I'm assuming its all http 1.1 and not 2.0)

Yeah, all HTTP 1.1 at the moment. Shellbox does not use chunked transfer encoding but does use multipart responses .

I've mocked up a horrible Frankenstein script that mimics the TimedMediaHandler behaviour - when directly calling shellbox-video.discovery.wmnet via it, we see the exact same behaviour. This means that at the very least we can rule out failures at the Jobqueue or RunSingleJob layer:

hnowlan@mwmaint1002:~/shellbox$ php src/client.php                                                                                                                                                 Fatal error: Uncaught GuzzleHttp\Exception\RequestException: cURL error 56: OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104 (see https://curl.haxx.se/libcurl/c/libcurl-errors.html) for https://she
llbox-video.discovery.wmnet:4080/shell/tmh-ffmpegencode in /home/hnowlan/shellbox/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:276
Stack trace:                                                                                                                                                                                       
#0 /home/hnowlan/shellbox/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(205): GuzzleHttp\Handler\CurlFactory::createRejection(Object(GuzzleHttp\Handler\EasyHandle), Array)                
#1 /home/hnowlan/shellbox/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(157): GuzzleHttp\Handler\CurlFactory::finishError(Object(GuzzleHttp\Handler\CurlHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))                                                                                                                                               
#2 /home/hnowlan/shellbox/vendor/guzzlehttp/guzzle/src/Handler/CurlHandler.php(47): GuzzleHttp\Handler\CurlFactory::finish(Object(GuzzleHttp\Handler\CurlHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))  
#3 /home/hnowlan/shel in /home/hnowlan/shellbox/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php on line 276

As before, this error occurs immediately after the encoding is completed on the shellbox end.

When connecting the same client to a k8s pod IP, the encoding and download of the file complete successfully, so some point of the communication between is definitely at fault here. We can now say with reasonable confidence that Envoy and Apache are not at fault here. Isolating which part will be a bit of a challenge but it's a clearer task.

Running the client directly against a k8s worker IP also succeeds, which means that kube-proxy most likely isn't to blame here.

jijiki moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.Wed, Oct 23, 10:56 AM

Change #1083207 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] services_proxy: add tcp_keepalive parameter, enable for shellbox-video

https://gerrit.wikimedia.org/r/1083207

Change #1076212 abandoned by Hnowlan:

[mediawiki/libs/Shellbox@master] Set a content-length for multipart responses

Reason:

Not the fix we needed - tcp keepalives were to blame here.