Migrate zoomviewer from Toolforge GridEngine to Toolforge Kubernetes
Closed, Resolved, Public

Description

Kindly migrate your tool (https://grid-deprecation.toolforge.org/t/zoomviewer) from Toolforge GridEngine to Toolforge Kubernetes.

Toolforge GridEngine is getting deprecated.
See: https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/

Please note that a volunteer may perform this migration if this has not been done after some time.
If you have already migrated this tool, kindly mark this as resolved.

If you would rather shut down this tool, kindly do so and mark this as resolved.

Useful Resources:
Migrating Jobs from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework#Grid_Engine_migration
Migrating Web Services from GridEngine to Kubernetes
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Move_a_grid_engine_webservice
Python
https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Rebuild_virtualenv_for_python_users

Event Timeline

My apologies if this ticket comes as a surprise to you. In order to ensure WMCS can provide a stable, secure and supported platform, it’s important we migrate away from GridEngine. I want to assure you that while it is WMCS’s intention to shut down GridEngine as outlined in the blog post https://techblog.wikimedia.org/2022/03/14/toolforge-and-grid-engine/, a shutdown date for GridEngine has not yet been set. The goal of the migration is to move as many tools as possible onto Kubernetes and ensure as smooth a transition as possible for everyone. Once the majority of tools have migrated, discussion of a shutdown date will be more appropriate. See T314664: [infra] Decommission the Grid Engine infrastructure.

As noted in https://techblog.wikimedia.org/2022/03/16/toolforge-gridengine-debian-10-buster-migration/ some use cases are already supported by Kubernetes and should be migrated. If your tool can migrate, please do plan a migration. Reach out if you need help or find you are blocked by missing features. Most of all, WMCS is here to support you.

However, it’s possible your tool needs a mixed runtime environment or some other features that aren't yet present in https://techblog.wikimedia.org/2022/03/18/toolforge-jobs-framework/. We’d love to hear of this or any other blocking issues so we can work with you once a migration path is ready. Thanks for your hard work as volunteers and help in this migration!

It might be possible for maintainers to work on this now that https://wikitech.wikimedia.org/wiki/Help:Toolforge/Build_Service has support for installing apt packages. The existing PHP code submits image download jobs to the grid engine backend which will need to be rewritten somehow to work on Kubernetes.

This migration would also be a great time to add a clear OSI-approved software license and some documentation on how to restart the service and check for other common problems. Tasks like T343796: ZoomViewer produces a 503 error would be easier for Toolforge admins to help with if we knew how things are supposed to work on the happy path.

I can work on this if someone makes me a maintainer. It'll be the same as what I did for panoviewer (except hopefully less complicated).

{{Done}} on the assumption that @dschwen and @Tacsipacsi would like the help. This is a tiny bit of abuse of my Toolforge admin powers without invoking the Toolforge adoption policy, so if any of y'all are mad that I did it feel free to yell at me. I will likely respond by making Tim a Toolforge root. ;)

It'll be the same as what I did for panoviewer (except hopefully less complicated).

I don't think it's going to be less complicated anymore, but at least I have a better idea of what I'm doing now.

ZoomViewer uses iipsrv, which is a FastCGI server written in C. Luckily there is a package for it in Ubuntu, so I didn't have to build it from source. The GridEngine version of the tool builds it from source.

With some effort, I was able to make an image which adds mod_fcgid to the Apache server started by the Heroku PHP buildpack. I tested the image and confirmed that Apache can run both PHP and iipsrv requests. Next I'll update the PHP and shell scripts, and then migrate the webservice and the jobs.

If I was doing it in production, I'd have separate containers for PHP and iipsrv, but I gather it is not possible for a tool to have multiple webservices.

This is basically done, but performance seems very bad. Please test to confirm that it's not just me.

The server is not showing significant load while I am waiting tens of seconds for tiles to load.

The full page load on https://zoomviewer.toolforge.org/index.php?f=Seattle+7.jpg&flash=no took 48 seconds for me as well, but it didn’t feel very long – it’s a huge picture after all. Thanks for porting the tool!

It's really not a lot of data -- a previous maintainer turned the JPEG quality down to 50. In Chromium with a viewport width of 1373, reloading with cache disabled, I got 57 requests totalling 419KB in 13.44 seconds. I can download the 23MB original in 15s so it feels pretty slow to me.
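To put those numbers side by side, here is a rough sketch of the arithmetic. It only uses the figures quoted above and assumes KB and MB are decimal units; the variable and function names are made up for illustration.

```python
# Figures quoted in the comment above; KB/MB are assumed to be decimal units.
TILE_BYTES = 419 * 1000            # 57 tile requests, 419KB in total
TILE_SECONDS = 13.44
TILE_REQUESTS = 57
ORIGINAL_BYTES = 23 * 1000 * 1000  # the 23MB original file
ORIGINAL_SECONDS = 15


def throughput_kb_per_s(num_bytes, seconds):
    """Average throughput in KB/s over the whole transfer."""
    return num_bytes / seconds / 1000


tiles = throughput_kb_per_s(TILE_BYTES, TILE_SECONDS)
original = throughput_kb_per_s(ORIGINAL_BYTES, ORIGINAL_SECONDS)

print(f"tiles:    {tiles:.1f} KB/s, {TILE_SECONDS / TILE_REQUESTS:.2f} s per request on average")
print(f"original: {original:.1f} KB/s")
print(f"the single large download is roughly {original / tiles:.0f}x faster")
```

On these figures the tile requests average about 31 KB/s while the single download manages about 1.5 MB/s over the same connection, so raw bandwidth is unlikely to be the limit.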

With strace I can see iipsrv waiting for incoming connections while my browser makes requests. After a few initial requests it settles down to exactly 2 req/s, with the fractional part of the timestamp staying about the same, with an accept beginning at about 0.96 past each second. Then after a few more requests (1707730396 in the paste) it drops to exactly 1 req/s. So something is throttling it.

root@tools-k8s-worker-73:~# strace -p 300 -T -ttt -e trace=accept,shutdown
strace: Process 300 attached
1707730363.783749 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <18.702767>
1707730382.499504 shutdown(5, SHUT_WR) = 0 <0.000040>
1707730382.499854 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.246508>
1707730382.760030 shutdown(5, SHUT_WR) = 0 <0.000048>
1707730382.760543 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.184346>
1707730382.958574 shutdown(5, SHUT_WR) = 0 <0.000264>
1707730382.959335 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.001164>
1707730382.968778 shutdown(5, SHUT_WR) = 0 <0.000026>
1707730382.969155 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.785306>
1707730383.764575 shutdown(5, SHUT_WR) = 0 <0.000281>
1707730383.765368 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.181063>
1707730383.957324 shutdown(5, SHUT_WR) = 0 <0.000217>
1707730383.957945 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.003578>
1707730383.969714 shutdown(5, SHUT_WR) = 0 <0.000032>
1707730383.970149 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.786016>
1707730384.767770 shutdown(5, SHUT_WR) = 0 <0.000240>
1707730384.768547 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.179901>
1707730384.959841 shutdown(5, SHUT_WR) = 0 <0.000400>
1707730384.960719 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.002657>
1707730384.972134 shutdown(5, SHUT_WR) = 0 <0.000249>
1707730384.972795 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.978208>
1707730385.959629 shutdown(5, SHUT_WR) = 0 <0.000133>
1707730385.959966 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.005561>
1707730385.970641 shutdown(5, SHUT_WR) = 0 <0.000097>
1707730385.971054 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.000108>
1707730385.976570 shutdown(5, SHUT_WR) = 0 <0.000108>
1707730385.976936 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.974743>
1707730386.960760 shutdown(5, SHUT_WR) = 0 <0.000233>
1707730386.961419 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.005400>
1707730386.975159 shutdown(5, SHUT_WR) = 0 <0.000165>
1707730386.975668 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.977758>
1707730387.965000 shutdown(5, SHUT_WR) = 0 <0.000270>
1707730387.965806 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.002661>
1707730387.976134 shutdown(5, SHUT_WR) = 0 <0.000191>
1707730387.976746 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.978093>
1707730388.969666 shutdown(5, SHUT_WR) = 0 <0.000035>
1707730388.970097 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.000152>
1707730388.977433 shutdown(5, SHUT_WR) = 0 <0.000030>
1707730388.977874 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.979438>
1707730389.967674 shutdown(5, SHUT_WR) = 0 <0.000056>
1707730389.968021 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.004158>
1707730389.981124 shutdown(5, SHUT_WR) = 0 <0.000044>
1707730389.981480 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.976210>
1707730390.969582 shutdown(5, SHUT_WR) = 0 <0.000045>
1707730390.969919 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.005159>
1707730390.984007 shutdown(5, SHUT_WR) = 0 <0.000055>
1707730390.984314 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.976064>
1707730391.975522 shutdown(5, SHUT_WR) = 0 <0.000036>
1707730391.975872 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.000730>
1707730391.984432 shutdown(5, SHUT_WR) = 0 <0.000052>
1707730391.984764 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.977255>
1707730392.973918 shutdown(5, SHUT_WR) = 0 <0.000134>
1707730392.974446 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.001828>
1707730392.985904 shutdown(5, SHUT_WR) = 0 <0.000063>
1707730392.986222 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.976683>
1707730393.970193 shutdown(5, SHUT_WR) = 0 <0.000043>
1707730393.970452 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.008402>
1707730393.987182 shutdown(5, SHUT_WR) = 0 <0.000048>
1707730393.987459 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.977525>
1707730394.978793 shutdown(5, SHUT_WR) = 0 <0.000047>
1707730394.979151 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.001238>
1707730394.988088 shutdown(5, SHUT_WR) = 0 <0.000033>
1707730394.988414 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.977409>
1707730395.976684 shutdown(5, SHUT_WR) = 0 <0.000045>
1707730395.977041 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.006274>
1707730395.993846 shutdown(5, SHUT_WR) = 0 <0.000047>
1707730395.994162 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.973232>
1707730396.978081 shutdown(5, SHUT_WR) = 0 <0.000036>
1707730396.978415 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.990780>
1707730397.984147 shutdown(5, SHUT_WR) = 0 <0.000029>
1707730397.984419 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.986857>
1707730398.985494 shutdown(5, SHUT_WR) = 0 <0.000037>
1707730398.985893 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.986072>
1707730399.982356 shutdown(5, SHUT_WR) = 0 <0.000041>
1707730399.982781 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.990874>
1707730400.986972 shutdown(5, SHUT_WR) = 0 <0.000043>
1707730400.987339 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.987759>
1707730401.986275 shutdown(5, SHUT_WR) = 0 <0.000057>
1707730401.986637 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.990359>
1707730402.989961 shutdown(5, SHUT_WR) = 0 <0.000070>
1707730402.990569 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.988429>
1707730403.991380 shutdown(5, SHUT_WR) = 0 <0.000037>
1707730403.992053 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.988544>
1707730404.991476 shutdown(5, SHUT_WR) = 0 <0.000048>
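For anyone who wants to check the req/s reading without eyeballing timestamps, here is a minimal sketch that counts how many accept() calls return in each whole second. It assumes the `strace -T -ttt` line layout shown in the paste (start timestamp first, call duration in angle brackets last). The function name and regex are mine, not part of the tool, and the sample is a short excerpt of the paste around the point where the rate drops.

```python
import re
from collections import Counter

# Excerpt from the strace paste, around the drop from 2 req/s to 1 req/s.
SAMPLE = """\
1707730393.987459 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.977525>
1707730394.978793 shutdown(5, SHUT_WR) = 0 <0.000047>
1707730394.979151 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.001238>
1707730394.988088 shutdown(5, SHUT_WR) = 0 <0.000033>
1707730394.988414 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.977409>
1707730395.976684 shutdown(5, SHUT_WR) = 0 <0.000045>
1707730395.977041 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.006274>
1707730395.993846 shutdown(5, SHUT_WR) = 0 <0.000047>
1707730395.994162 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.973232>
1707730396.978081 shutdown(5, SHUT_WR) = 0 <0.000036>
1707730396.978415 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.990780>
1707730397.984147 shutdown(5, SHUT_WR) = 0 <0.000029>
1707730397.984419 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.986857>
1707730398.985494 shutdown(5, SHUT_WR) = 0 <0.000037>
1707730398.985893 accept(0, {sa_family=AF_UNIX}, [112->2]) = 5 <0.986072>
"""

# Start timestamp, then an accept(...) call, then the -T duration in <...>.
ACCEPT_RE = re.compile(r"^(\d+\.\d+) accept\(.*<(\d+\.\d+)>$")


def accepts_per_second(trace):
    """Map each whole second to the number of accept() calls returning in it."""
    counts = Counter()
    for line in trace.splitlines():
        match = ACCEPT_RE.match(line.strip())
        if match is None:
            continue  # shutdown() lines and anything else are ignored
        start, duration = float(match.group(1)), float(match.group(2))
        # With -ttt the timestamp is when the call started, so the
        # connection actually arrived at start + duration.
        counts[int(start + duration)] += 1
    return dict(sorted(counts.items()))


for second, count in accepts_per_second(SAMPLE).items():
    print(second, count)
```

On this excerpt the count is 2 for seconds 1707730394 and 1707730395, then 1 from 1707730396 onward, which matches the reading above.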

It's fast when I take my network out of the loop by doing requests with ab on the server.

Also, it's 2x faster (13s -> 7s) when I disable HTTP/2 in my browser. So I guess the round numbers of requests per second were a coincidence.

Performance seems unrelated to the migration, so it can be discussed elsewhere.

I migrated prune.sh to a scheduled job. I removed the rest of the crontab for now -- in future check.sh and restart.sh might be replaced by a health check as described in T341919. The old crontab is archived at old/crontab in the NFS home. The old restart.sh and webstart.sh are in the git history.

I deleted some logs.

I moved various old files to old/.

I think this is done.