Since migrating to use shellbox for video, we're seeing a few behaviours where pods are being killed. There are (at least) two different behaviours being seen in relation to this.
One is the pathological case of the healthchecking workaround of using the min_avail_workers parameter where Kubernetes assigns two encoding jobs to a single pod, thus ensuring that the healthcheck eventually fails (sometimes after exhausting the Apache workers) due to no active php workers being able to answer the calls to the liveness probe.
The other is less understood but appears to be a timeout or maximum execution time of some sort in the shelling out side of shellbox or our existing php-fpm configuration where the p99 for shellbox-video tasks appears to be pegged at a maxium of 1 hour. Jobs will of course take a lot longer than this usually, and the p99 for metal-based videoscalers frequently ran in excess of 1 day (admittedly we expect to see different behaviour here but at the very least we'd expect to see multiples of an hour). Manually testing encoding of some longer videos has yielded failure.
Pods failing in the latter manner seem to indicate that Shellbox itself has exited.
Last State: Terminated Reason: Completed Exit Code: 0 Started: Wed, 28 Aug 2024 11:44:46 +0000 Finished: Wed, 28 Aug 2024 12:09:24 +0000
Note the time differences here - healthz is called regularly up until 14:21:04 when it is not called again until we have restarted. The final connection refused errors repeat hundreds of times. Multiple video transcodes are requested, and succeed. It seems likely that multiple healthchecks are piling up, causing us to eventually run out of apache workers. I'm not sure what causes the subsequent failure to connect to shellbox but this coincides with the pod itself being restarted.
{"timestamp": "2024-08-28T14:21:04", "RequestTime": "977", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "47", "Method": "GET", "Url": "http://shellbox-video.discovery.wmnet/healthz", "MimeType": "application/json", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "Blackbox Exporter/0.23.0", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "514fa4b6-1c6f-4c69-98f6-faf2ed380af2", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:22:36", "RequestTime": "1530909", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "47074", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "de36ae53-cece-45ff-a8ea-3d326e3646dd", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:22:45", "RequestTime": "1836476", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "51956", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "02bc930e-c51e-40b4-9168-390cb2593a18", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:24:25", "RequestTime": "4912284", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "177539", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "6d911dbe-841c-484b-8e25-3ab2b82b5a64", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:24:26", "RequestTime": "8506017", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "206093", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "6d911dbe-841c-484b-8e25-3ab2b82b5a64", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:26:01", "RequestTime": "2124755", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "108835", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "533917f9-7638-4a4d-9593-7ef7ab58132c", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:26:30", "RequestTime": "353200", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "196331", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "e4325c43-eaf1-4de9-b9b5-386f90268a66", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:27:00", "RequestTime": "127289", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "24507", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "fefa4cfe-8d6c-486a-bd45-49d382f0e4f6", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:27:25", "RequestTime": "171475", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "42171", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "c077503c-32ad-4d4c-ad51-89e3a739cdab", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:28:55", "RequestTime": "1264936", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "42849", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "8c744e10-6211-4c26-aa46-f12ca42d77b5", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:29:30", "RequestTime": "3175974", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "135303", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "aa6f9e1e-e905-43aa-a4dc-876e9f449653", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:29:50", "RequestTime": "3216191", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "144011", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "457ea603-dd24-4591-94bb-17e4ab1cb4c7", "X-Client-IP": "-"} {"timestamp": "2024-08-28T14:29:54", "RequestTime": "2568358", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/200", "ResponseSize": "133561", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "multipart/mixed", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "30b19410-9034-48f1-a9fc-18dd75c0f2dd", "X-Client-IP": "-"} [Wed Aug 28 14:32:43.285213 2024] [mpm_worker:error] [pid 1:tid 139886347531392] AH00287: server is within MinSpareThreads of MaxRequestWorkers, consider raising the MaxRequestWorkers setting [Wed Aug 28 14:33:04.303606 2024] [mpm_worker:error] [pid 1:tid 139886347531392] AH00286: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting {"timestamp": "2024-08-28T14:30:45", "RequestTime": "559977136", "Connect-IP": "127.0.0.1", "Handle/Status": "proxy:fcgi://127.0.0.1:9000/503", "ResponseSize": "299", "Method": "POST", "Url": "http://localhost:6036/shell/tmh-ffmpegencode", "MimeType": "text/html", "Referer": "-", "X-Forwarded-For": "-", "User-Agent": "MediaWiki/1.43.0-wmf.20", "Accept-Language": "-", "X-Analytics": "-", "User": "-", "UserHeader": "-", "X-Request-Id": "ffec46cf-9dce-4e23-9eed-ce48e5b378ff", "X-Client-IP": "-"} [Wed Aug 28 14:40:05.669283 2024] [proxy:error] [pid 20:tid 139886204020480] (111)Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1:9000 (localhost:9000) failed [Wed Aug 28 14:40:05.669731 2024] [proxy:error] [pid 20:tid 139885641971456] (111)Connection refused: AH00957: FCGI: attempt to connect to 127.0.0.1:9000 (localhost:9000) failed