wikiloves inaccessible since November 4th
Open, Needs Triage, Public

Description

Reported by various people (via Telegram, email and talk page): wikiloves has been inaccessible for a few days.

Event Timeline

The uWSGI logs are an endless loop of

*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:27:45 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:27:58 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:28:27 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:29:12 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:30:20 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:32:05 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:35:01 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:40:26 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:45:46 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:51:11 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:56:35 2024] ***
...
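
The gaps between the start lines above can be measured to see the restart pattern. The following is a minimal sketch, not part of the original report: the function name `restart_gaps` and the regular expression are illustrative assumptions, and it only understands the exact line format shown above.

```python
import re
from datetime import datetime

# Captures the timestamp inside
# "*** Starting uWSGI ... on [Mon Nov  4 14:27:45 2024] ***"
START_RE = re.compile(r"\*\*\* Starting uWSGI .* on \[(.+?)\] \*\*\*")


def restart_gaps(log_text):
    """Return the seconds elapsed between consecutive uWSGI start lines."""
    starts = []
    for line in log_text.splitlines():
        match = START_RE.search(line)
        if match:
            # Collapse the double space that pads single-digit days.
            stamp = " ".join(match.group(1).split())
            starts.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return [int((b - a).total_seconds()) for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    sample = """\
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:27:45 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:27:58 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:28:27 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Mon Nov  4 14:29:12 2024] ***
"""
    print(restart_gaps(sample))  # [13, 29, 45]
```

Applied to the full excerpt, the gaps grow from seconds to a little over five minutes and then stay there, which is what one would expect from a container that exits shortly after starting and is restarted by Kubernetes with an increasing, capped back-off delay.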

Two days ago, I ran webservice restart a few times, which did not help

Yesterday, just in case, I recreated the virtual environment using

toolforge webservice python3.9 shell
webservice-python-bootstrap --fresh

which did not help.

When I asked for help on Telegram, I was told this might be related to T362867#10292196

@dhinus took some steps on IRC about 8 hours ago:
<dhinus> the pod is indeed in CrashLoopBackOff, and has already restarted 44 times
<dhinus> kubectl describe pod shows "Back-off restarting failed container webservice in pod wikiloves-6849f4ccb4-9w6b6_tool-wikiloves"
<dhinus> I will try the stop+start myself while I'm here
<dhinus> the pod is now rescheduled on tools-k8s-worker-nfs-74 and it seems more healthy

Mentioned in SAL (#wikimedia-cloud) [2024-11-09T08:18:52Z] <wmbot~jeanfred@tools-sgebastion-10> Run webservice stop ; webservice --backend=kubernetes python3.9 start for T379452

The service was returning 403 Forbidden (at least something different!). When I tried to stop/start, it started as a PHP service; I then ran the command logged above in SAL, and edited service.template to specify web: python3.9.

The tool is still returning 404, and the logs are looping again :/

*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov  9 08:18:22 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov  9 08:18:38 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov  9 08:19:16 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov  9 08:19:58 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov  9 08:20:58 2024] ***
*** Starting uWSGI 2.0.19.1-debian (64bit) on [Sat Nov  9 08:22:34 2024] ***

Weird. I would try stopping and starting a couple more times. If that doesn't work, is there a way to add some extra debug logs in the app itself, to better understand where exactly it gets stuck? Could it be a memory issue, maybe?

Mentioned in SAL (#wikimedia-cloud) [2024-11-09T12:33:10Z] <wmbot~lucaswerkmeister@tools-bastion-13> added mem: 1Gi to service.template (T379452); webservice stop && webservice start

kubectl describe pods showed

Last State:     Terminated
Reason:       OOMKilled

so I tried bumping the memory limit. (Bit annoying that the OOMKilled wasn’t shown in kubectl logs or kubectl get events.) So far, the pod looks more stable:

tools.wikiloves@tools-bastion-13:~$ kubectl get pods
NAME                         READY   STATUS    RESTARTS        AGE
wikiloves-575bb94984-cp7sp   1/1     Running   1 (2m44s ago)   2m46s
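
Since the OOMKilled reason only showed up in kubectl describe pods, a small helper that pulls the last termination reason out of the pod status can make it easier to spot. This is a hedged sketch: the function names `last_termination_reasons` and `fetch_pod_list` are hypothetical, and it assumes `kubectl get pods -o json` is run in the tool's own namespace. It reads the `lastState.terminated.reason` field of each container status.

```python
import json
import subprocess


def last_termination_reasons(pod_list):
    """Map "pod/container" to the reason its previous instance terminated."""
    reasons = {}
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            terminated = status.get("lastState", {}).get("terminated")
            if terminated:
                key = f"{pod_name}/{status['name']}"
                reasons[key] = terminated.get("reason")
    return reasons


def fetch_pod_list():
    """Run kubectl in the current namespace and parse its JSON output."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    for name, reason in last_termination_reasons(fetch_pod_list()).items():
        print(f"{name}: {reason}")
```

Containers that have never been restarted have an empty `lastState` and are simply skipped.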

https://wikiloves.toolforge.org/ just shows me a 403, I don’t know if that’s correct or not.

Mentioned in SAL (#wikimedia-cloud) [2024-11-09T12:38:43Z] <wmbot~lucaswerkmeister@tools-bastion-13> sed -i s/^web:/type:/ service.template # T379452

Mentioned in SAL (#wikimedia-cloud) [2024-11-09T12:39:00Z] <wmbot~lucaswerkmeister@tools-bastion-13> webservice stop && webservice start # T379452

Turns out it was defaulting to the php7.4 webservice type because of a mistake in service.template. Now it seems to be working \o/

"Warn about unknown keys in service.template" might help with noticing future mistakes like this sooner (hopefully).
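
To illustrate what such a warning could look like, here is a rough sketch of a check that would have flagged the `web:` key. It is not the webservice implementation: the function name `unknown_keys` and the `KNOWN_KEYS` set are hypothetical. Only `type` and `mem` are confirmed by this task; the other entries are assumptions and the list is certainly incomplete. It also handles only flat `key: value` lines rather than full YAML.

```python
# Illustrative and incomplete: only "type" and "mem" come from this task,
# the remaining keys are assumptions about what webservice accepts.
KNOWN_KEYS = {"type", "mem", "cpu", "backend", "replicas"}


def unknown_keys(template_text, known=KNOWN_KEYS):
    """Return top-level keys of a flat "key: value" template not in `known`."""
    found = []
    for line in template_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and indented (nested) lines.
        if not stripped or stripped.startswith("#") or line[0].isspace():
            continue
        key, sep, _ = stripped.partition(":")
        if sep and key.strip() not in known:
            found.append(key.strip())
    return found


if __name__ == "__main__":
    broken = "web: python3.9\nmem: 1Gi\n"
    for key in unknown_keys(broken):
        print(f"warning: unknown key {key!r} in service.template")
```

With the template as it stood before the sed fix, this reports `web`; after replacing `web:` with `type:`, it reports nothing.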