Reported by various people (via Telegram, email and talk page): wikiloves has not been accessible for a few days.
The UWSGI logs are a vast loop of
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:27:45 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:27:58 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:28:27 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:29:12 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:30:20 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:32:05 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:35:01 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:40:26 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:45:46 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:51:11 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:56:35 2024] ***
...
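For anyone wanting to read the loop above more precisely, here is a minimal sketch that measures the time between consecutive start banners. The function name restart_gaps and the sample variable are made up for illustration; the banner text is copied from the log above.

```python
import re
from datetime import datetime

# Captures the timestamp inside a uWSGI start banner, e.g.
# "*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 14:27:45 2024] ***"
BANNER = re.compile(r"Starting uWSGI .*? on \[([^\]]+)\]")


def restart_gaps(log_text):
    """Return the seconds elapsed between consecutive uWSGI start banners."""
    starts = [
        datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        for stamp in BANNER.findall(log_text)
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(starts, starts[1:])]


# Start times copied from the log excerpt in this task.
sample = "\n".join(
    f"*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov 4 {t} 2024] ***"
    for t in ["14:27:45", "14:27:58", "14:28:27", "14:29:12", "14:30:20",
              "14:32:05", "14:35:01", "14:40:26", "14:45:46", "14:51:11",
              "14:56:35"]
)

print(restart_gaps(sample))
```

The gaps grow and then level off at a little over five minutes, which is what one would expect if Kubernetes is restarting the container with its capped back-off rather than uWSGI restarting on its own.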
Yesterday, just in case, I recreated the virtual-environment using
toolforge webservice python3.9 shell
webservice-python-bootstrap --fresh
which did not help.
@dhinus on IRC took some steps some 8 hours ago:
<dhinus> the pod is indeed in CrashLoopBackOff, and has already restarted 44 times
<dhinus> kubectl describe pod shows "Back-off restarting failed container webservice in pod wikiloves-6849f4ccb4-9w6b6_tool-wikiloves"
<dhinus> I will try the stop+start myself while I'm here
<dhinus> the pod is now rescheduled on tools-k8s-worker-nfs-74 and it seems more healthy
Mentioned in SAL (#wikimedia-cloud) [2024-11-09T08:18:52Z] <wmbot~jeanfred@tools-sgebastion-10> Run webservice stop ; webservice --backend=kubernetes python3.9 start for T379452
The service was returning 403 Forbidden (at least something different!). When I tried to stop/start, it started as a PHP service; I then ran the command logged above in SAL, and edited webservice.template to specify web: python3.9.
The tool is still returning a 404, and the logs are looping again :/
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov 9 08:18:22 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov 9 08:18:38 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov 9 08:19:16 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov 9 08:19:58 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov 9 08:20:58 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov 9 08:22:34 2024] ***
Weird. I would try stopping and starting a couple more times. If that doesn't work, is there a way to add some extra debug logs in the app itself, to understand better where exactly it gets stuck? Could it be a memory issue, maybe?
Mentioned in SAL (#wikimedia-cloud) [2024-11-09T12:33:10Z] <wmbot~lucaswerkmeister@tools-bastion-13> added mem: 1Gi to service.template (T379452); webservice stop && webservice start
kubectl describe pods showed
Last State: Terminated
  Reason: OOMKilled
so I tried bumping the memory limit. (Bit annoying that the OOMKilled wasn’t shown in kubectl logs nor kubectl get events.) So far, the pod looks more stable:
tools.wikiloves@tools-bastion-13:~$ kubectl get pods
NAME                         READY   STATUS    RESTARTS        AGE
wikiloves-575bb94984-cp7sp   1/1     Running   1 (2m44s ago)   2m46s
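Since the OOMKilled reason only appeared in kubectl describe pods, a small helper along these lines could surface it from the pod's JSON. This is a rough sketch, not an existing Toolforge tool: the function name last_termination_reasons and the trimmed sample pod are hypothetical, while the status.containerStatuses[].lastState.terminated.reason path is the standard Kubernetes pod status field.

```python
import json


def last_termination_reasons(pod):
    """Map container name -> reason of its most recent termination, if any."""
    reasons = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            reasons[status["name"]] = terminated.get("reason")
    return reasons


# Hypothetical, heavily trimmed pod object, for illustration only.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "webservice",
        "restartCount": 44,
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(last_termination_reasons(sample_pod))
```

In practice the input would be the output of kubectl get pod with -o json for the pod in question, loaded with json.load.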
https://wikiloves.toolforge.org/ just shows me a 403, I don’t know if that’s correct or not.
Mentioned in SAL (#wikimedia-cloud) [2024-11-09T12:38:43Z] <wmbot~lucaswerkmeister@tools-bastion-13> sed -i s/^web:/type:/ service.template # T379452
Mentioned in SAL (#wikimedia-cloud) [2024-11-09T12:39:00Z] <wmbot~lucaswerkmeister@tools-bastion-13> webservice stop && webservice start # T379452
Turns out it was defaulting to the php7.4 webservice type because of a mistake in service.template. Now it seems to be working \o/
"Warn about unknown keys in service.template" might help with noticing future mistakes like this sooner (hopefully).
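To illustrate what such a warning could look like, here is a minimal sketch that flags unrecognised top-level keys. It is not the webservice implementation: KNOWN_KEYS is an assumed, illustrative list rather than the authoritative set of accepted keys, and the sample template contents are hypothetical apart from the web: python3.9 and mem: 1Gi lines mentioned above.

```python
# ASSUMPTION: illustrative subset of keys, not the authoritative list
# that webservice actually accepts in service.template.
KNOWN_KEYS = {"backend", "type", "mem", "cpu", "replicas"}


def unknown_keys(template_text, known=KNOWN_KEYS):
    """Return top-level keys in a simple 'key: value' template not in `known`."""
    found = []
    for line in template_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and indented (nested) lines.
        if not stripped or stripped.startswith("#") or line[0].isspace():
            continue
        key, sep, _ = stripped.partition(":")
        if sep and key not in known:
            found.append(key)
    return found


# Hypothetical template reproducing the mistake from this task.
broken = "backend: kubernetes\nweb: python3.9\nmem: 1Gi\n"
fixed = broken.replace("web:", "type:")

for key in unknown_keys(broken):
    print(f"warning: unknown key {key!r} in service.template")
```

With a check like this, the web: line would have been reported immediately instead of the service silently falling back to the default type.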