
Outreachdashboard.wmflabs.org is down frequently with database problems
Closed, ResolvedPublicBUG REPORT

Assigned To
Authored By
Ijon
Mar 17 2021, 11:17 AM

Description

Programs & Events Dashboard has been going down repeatedly in the last several weeks. Even after moving the database to a separate server, the system has encountered repeated cases of database corruption, which may be related to the data update cycle for programs with very large numbers of edits.

Current status

The system appears to be stable when updates for the largest, slowest programs are disabled. Sage Ross is investigating options for bringing back updates for these programs.

Initial report

VPS: programs-and-events-dashboard.globaleducation.eqiad.wmflabs

Error

Accessing the service yields:
"This website is under heavy load (queue full)
We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later."

Impact

Wikimedia program organizers rely on the tool to track the impact of their programs; while the service is down, they cannot do so.

Notes

I understand this is a cloud service and not MW core. It is still a "production error" in my understanding, as it's not a functionality bug but an operational failure (itself possibly caused by a bug, but that's TBD) of a mature service many volunteers rely on for their daily work.

Details

Request URL
https://outreachdashboard.wmflabs.org/

Event Timeline


Oh, yeah, that does look like disk corruption! Ugh.

One problem, possibly at the root of it, appears to be a corrupted index of the revisions table in the Dashboard's database.

We have a backup copy of the database from March 14 (unfortunately from just before I finished cleaning up a number of duplicate articles records, a cleanup I needed to do before I could deploy a migration; I finished it yesterday morning).

I'm attempting to repair the revisions table index, but it's not been successful so far.

From /var/log/mysql/error.log:

2021-03-17 12:51:12 140131319138048 [ERROR] InnoDB: Database page corruption on disk or a failed file read of tablespace dashboard/revisions page  [page id: space=687, page number=756582]. You may have to recover from a backup.
[giant hex dump omitted]
InnoDB: page type 17855 meaning INDEX
InnoDB: Page may be an update undo log page
InnoDB: Page may be an index page where index id is 1308
InnoDB: (index "PRIMARY" of table "dashboard"."revisions")
2021-03-17 12:51:12 140131319138048 [Note] InnoDB: It is also possible that your operating system has corrupted its own file cache and rebooting your computer removes the error. If the corrupt page is an index page. You can also try to fix the corruption by dumping, dropping, and reimporting the corrupt table. You can use CHECK TABLE to scan your table for corruption. Please refer to http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html for information about forcing recovery.
2021-03-17 12:51:12 7f72dd7fc700  InnoDB: Assertion failure in thread 140131319138048 in file buf0lru.cc line 2121
InnoDB: Failing assertion: bpage->buf_fix_count == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.

Trying the innodb recovery seems reasonable, but you might want to have a backup of /srv/mysql first.

/srv is over half full, so I'm not sure I can safely make a backup on the same block device.

I've set innodb_force_recovery = 1 in /etc/mysql/mariadb.conf.d/50-server.cnf, run systemctl restart mysqld, and I'm running the backup script again.

The backup script ran without issue in recovery mode, so I have an up-to-date, possibly valid backup of the whole DB now. I'm currently running a dump of just the revisions table (mysqldump dashboard revisions > revisions.sql) in preparation for dropping and importing it.

(OPTIMIZE TABLE revisions; won't run in recovery mode because it's read-only.)

After dumping the revisions table, I've done drop table revisions;. Then I took mysqld out of recovery mode and restarted it. Now I'm running mysql dashboard < revisions.sql to restore the revisions table from the dump.
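For reference, the dump-and-reload cycle described above amounts to roughly the following (a minimal sketch; the config path and dump filename match the ones mentioned in this thread, the rest is illustrative):

```bash
# 1. Put MariaDB into forced-recovery mode so the corrupt page can be read past:
#    add under [mysqld] in /etc/mysql/mariadb.conf.d/50-server.cnf:
#      innodb_force_recovery = 1
sudo systemctl restart mysqld

# 2. Dump only the corrupted table while recovery mode is active.
mysqldump dashboard revisions > revisions.sql

# 3. Drop the table, take the server out of recovery mode, and reload from the dump.
mysql dashboard -e "DROP TABLE revisions;"
#    (remove innodb_force_recovery from 50-server.cnf again)
sudo systemctl restart mysqld
mysql dashboard < revisions.sql
```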

We give students direct links to their program, and now they find a "Not Found" error. It would be helpful to point everyone to the main page so they can see that there is a known error.

@Theklan done. (I added .htaccess, enabled mod_rewrite, and set a Directory stanza with AllowOverride All in the site config.)
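A minimal sketch of what that kind of maintenance redirect can look like, assuming a hypothetical /var/www/dashboard/public document root (the exact rules used aren't recorded in this task):

```bash
# Enable mod_rewrite and allow .htaccess overrides for the app's public dir.
sudo a2enmod rewrite
# The site config needs a Directory stanza permitting overrides, e.g.:
#   <Directory /var/www/dashboard/public>
#     AllowOverride All
#   </Directory>

# Example .htaccess that sends visitors of any page to the main page while the app is down.
cat > /var/www/dashboard/public/.htaccess <<'EOF'
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/$
RewriteRule .* / [R=302,L]
EOF

sudo systemctl reload apache2
```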

💙ragesoss 💙fast response

People whom I never knew cared contacted me today, frantic that they could not access the dashboard.

Also, a couple of regular classes contacted me, which is great, but sometimes I forget that people get excited to watch the stats.

The wiki is survivable without the dashboard, but when it is gone it is sorely missed.

Reloading the revisions table has completed. I'm going to restart the frontend.

The Dashboard is back up and running. I haven't enabled or restarted any of the background processes yet except for the 'default' one that handles transactional background jobs (like wiki edits and error reporting), so it hasn't started pulling in new data yet.

The cron jobs related to background tasks are also disabled for now.

I'm going to leave it up for a little while to see whether any database errors or other problems recur, before turning on more background processes gradually.

Things are going fine so far. I've just re-enabled the constant_update and schedule_course_updates cron jobs, and the constant_update Sidekiq process that handles them.

Still going fine, so I've enabled the short and medium sidekiq processes. The slowest programs, and the daily_update job, are still disabled.

Thank you for working on this, especially since it looks to have taken a substantial amount of time! As Bluerasberry said, we can cope when it's down but it is very much appreciated when it's working.

The system is still stable, so I've just enabled the final Sidekiq processes for long stats updates and the daily job. I've also re-enabled the weekly database backup script (which runs on Sunday), and deployed a minor change to help with logging slower courses.

To wrap up, my best guess at what happened is:

  • Sometime on March 16, the index of the revisions table became corrupted.
  • Whenever the database looked at the affected index entry, the database server would crash, interrupting any web requests and background stats jobs that were running at the time.
  • By the time the database could restart, the web server had a full queue of unprocessed requests waiting.
  • For reasons I don't understand yet, this also tied up the system such that it stopped processing web requests altogether and became unresponsive via SSH.
  • Upon restarting the server, it would run fine for a few minutes until once again encountering the corrupted database index entry.

Repairing the revisions index appears to have fixed the problem.

The revisions table is probably unsustainably large (and growing). In the medium- to long-term, it might be better to calculate all the necessary stats without saving them to the Dashboard, perhaps by moving some of the data processing over to Toolforge where it can be done more directly via the Replica databases (and by serving recent revisions via client-side queries directly from the wikis instead of via the Dashboard).

Ragesoss claimed this task.

> Repairing the revisions index appears to have fixed the problem.

\o/

> The revisions table is probably unsustainably large (and growing). In the medium- to long-term, it might be better to calculate all the necessary stats without saving them to the Dashboard, perhaps by moving some of the data processing over to Toolforge where it can be done more directly via the Replica databases (and by serving recent revisions via client-side queries directly from the wikis instead of via the Dashboard).

Note that there's nothing stopping you from querying the wikireplicas directly from a VPS project, you just need to create a Toolforge tool to get credentials (replica.my.cnf), which you can then stick in your VM.

> Note that there's nothing stopping you from querying the wikireplicas directly from a VPS project, you just need to create a Toolforge tool to get credentials (replica.my.cnf), which you can then stick in your VM.

Oh right! I've done this before from a VPS, although I don't remember when/where. :-)

However, we still need to do the same data flow for dashboard.wikiedu.org as well, so having some kind of separate service for doing that processing outside of the Dashboard server that could be shared by both production instances will probably be necessary.
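For anyone following along, querying the wiki replicas from a VPS instance with Toolforge credentials looks roughly like this (hostname and database follow the documented <wiki>_p convention; the query itself is just an example):

```bash
# replica.my.cnf is the credentials file generated for a Toolforge tool;
# copy it onto the VPS instance and point the client at it.
mysql --defaults-file="$HOME/replica.my.cnf" \
      -h enwiki.analytics.db.svc.wikimedia.cloud enwiki_p \
      -e "SELECT rev_id, rev_timestamp FROM revision ORDER BY rev_id DESC LIMIT 5;"
```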

Urbanecm subscribed.

image.png (743×1 px, 76 KB)

the dashboard is back down again :/

image.png (969×1 px, 71 KB)

the dashboard is back down again :/

The disk was full due to a large leftover file from the previous repair effort. Looks like things went back to normal as soon as I freed up space.

Although it's mostly only affecting the update processes for slow-updating programs, it appears that problems with the Revisions index are back. I'm not sure what to do about this in the short term. I'm investigating the idea of moving more processing outside of the Dashboard, but that's not something that could be done quickly.

I'm currently unsure what the cause of the corruption is, but I've read this: "Most InnoDB corruptions are hardware-related. Corrupted page writes can be caused by power failures or bad memory. The issue also can be caused by using network-attached storage (NAS) and allocating InnoDB databases on it." (source)

If it's not hardware-related, then the most likely cause is that the database is somehow failing because it is being asked to do more than it can handle for some of the biggest, slowest-updating programs.

I'm not sure what the best path forward is here. I can migrate the system to a new VPS (which would be an opportunity to upgrade to a newer version of MariaDB at the same time), although I may need to do that again soon as part of a switch to a multi-node architecture that I'm currently working with a consultant on.

@bd808 and I discussed this briefly on IRC.

> I'm currently unsure what the cause of the corruption is, but I've read this: "Most InnoDB corruptions are hardware-related. Corrupted page writes can be caused by power failures or bad memory. The issue also can be caused by using network-attached storage (NAS) and allocating InnoDB databases on it." (source)

If it actually is a filesystem corruption issue, then moving to Cinder won't really make a difference.

> If it's not hardware-related, then the most likely cause is that the database is somehow failing because it is being asked to do more than it can handle for some of the biggest, slowest-updating programs.

FWIW, mysqld segfaulted twice today (/var/log/kern.log):

Mar 23 14:15:51 programs-and-events-dashboard kernel: [252301.994981] mysqld[2753]: segfault at 18 ip 000055a4f885ac00 sp 00007f48137a93f8 error 4 in mysqld[55a4f7e7c000+fb3000]
Mar 23 14:15:51 programs-and-events-dashboard kernel: [252301.994994] Code: e5 0f c8 89 c0 5d c3 0f 1f 00 55 48 89 f0 40 88 77 19 48 c1 e8 08 48 89 e5 88 47 18 5d c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 <0f> b7 47 18 55 48 89 e5 5d 66 c1 c0 08 0f b7 c0 c3 0f 1f 44 00 00
Mar 23 20:15:27 programs-and-events-dashboard kernel: [273878.162922] mysqld[10795]: segfault at 18 ip 0000558373ffdc00 sp 00007f4a3dfc93f8 error 4 in mysqld[55837361f000+fb3000]
Mar 23 20:15:27 programs-and-events-dashboard kernel: [273878.162933] Code: e5 0f c8 89 c0 5d c3 0f 1f 00 55 48 89 f0 40 88 77 19 48 c1 e8 08 48 89 e5 88 47 18 5d c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 <0f> b7 47 18 55 48 89 e5 5d 66 c1 c0 08 0f b7 c0 c3 0f 1f 44 00 00

Not sure what triggered the segfaults... I also see no other segfaults (except apache2, once) in the archived logs, so this could be unrelated to the original problem.

> I'm not sure what the best path forward is here. I can migrate the system to a new VPS (which would be an opportunity to upgrade to a newer version of MariaDB at the same time), although I may need to do that again soon as part of a switch to a multi-node architecture that I'm currently working with a consultant on.

@bd808 suggested moving *just* the database to another VM, backed by cinder. This would let you take snapshots, etc. without affecting the rest of the instance. And you could also upgrade to newer MariaDB without needing to mess with everything else.

Thanks @Legoktm. I just took the system down to run a dump. I'm not sure exactly how to hook up the database to a separate VM, but I think I can figure it out. I'll attempt to go with the move @bd808 suggests.

On that note, there's an open Cinder quota request: https://phabricator.wikimedia.org/T278132

As before, running a dump errors out on the revisions table, so I've edited /etc/mysql/mariadb.conf.d/50-server.cnf again to set innodb_force_recovery = 1. This time I'll do a complete backup and then use that .sql file to populate a new database on a new VM (with MariaDB 10.3, instead of the 10.1 it's currently using).

With T278132: Request increased quota for globaleducation Cloud VPS project granted, I think I would personally try creating a 150G Cinder volume, attaching it to programs-and-events-dashboard.globaleducation.eqiad1.wikimedia.cloud, and copying the clean database dump there. Depending on the urgency you feel you are under you could then either a) spin up a new Buster instance in your project to run the db from, or b) try to get the current system back up and running first.

I guess the big question is whether your db corruption is triggered by the memory pressure on the overloaded programs-and-events-dashboard instance or by a MariaDB bug. Either way, I think your only way out is redistributing the load across more instances. I would hope that moving the db to its own instance is relatively easy. I don't know enough about the rest of your stack to make informed guesses about separating other bits. I do see a redis running there that looks like another strong candidate to move to a separate instance.

The backup is complete. I've spun up a new VM to put the database on. Setting up the Cinder volume was a breeze; I used 180 of the 200GiB quota for this volume (an uncompressed backup is huge, so this doesn't actually provide a huge amount of headroom, but it should be fine for now).

Next steps are to transfer a database backup there (in progress), set up mariadb to use /srv to hold all the data, and then try to hook up the current production server to that database. Since the database doesn't need to restart during deployments, I think it shouldn't take any extra work for now to run it like this.
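The volume-and-datadir move can be sketched roughly like this (device name and paths are assumptions; check lsblk on the instance for the real device):

```bash
# Format and mount the attached Cinder volume at /srv.
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /srv
echo '/dev/sdb /srv ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /srv

# Move the MariaDB data directory onto the volume.
sudo systemctl stop mysqld
sudo rsync -a /var/lib/mysql/ /srv/mysql/
# In /etc/mysql/mariadb.conf.d/50-server.cnf under [mysqld]:
#   datadir = /srv/mysql
sudo systemctl start mysqld
```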

I don't think Redis itself contributes very much system load. The total amount of data stored in Redis is tiny, and it looks like its memory footprint is tiny too; it basically just holds the queue of background jobs that are waiting to run. The actual background job processes will make a much bigger difference when we move them to a separate VM, but that will require figuring out the deployment and coordination story more fully.

If things go smoothly, I expect to bring up the system tomorrow morning with the database on a separate VM.

I've got the backup file transferred to the new VM, I've configured mariadb to use /srv/mysql, and I've started loading the data (pv 2021-03-23_22h10m_12.sql.gz | gunzip | mysql dashboard). It looks like it's going to take all day to complete the load, so it will probably be tomorrow when I can try to bring the Dashboard back online with the new database server.

Hi, can we have more details on when the dashboard problem will be resolved? Just to know if it will be operational this weekend.

> Hi, can we have more details on when the dashboard problem will be resolved? Just to know if it will be operational this weekend.

I'm currently in the process of using a database backup to populate a new database on a separate server. This looks like it will take about 10 more hours, which means I won't have a chance to do more with it until tomorrow.

Once that is complete, I can attempt to bring the system back online with the database running on this new server, and the rest of the system running on the same server as before. I've never run the system in this configuration before, and I don't know if there will be additional complications. If everything goes smoothly, I hope to have it running about 25 hours from now. If things don't go smoothly, I don't have an estimate for how much longer it might take.

Meanwhile, I'm setting up a pair of test VMs to try out running the system in this two-server configuration to identify any extra hurdles to expect once the database finishes loading.

Database population seems to have sped up some, and is now estimated to complete in 2.5 more hours (which means I might be able to get the Dashboard up and running this evening).

I've successfully booted the Dashboard app on a test VM with the database on a different test VM, so I think this is going to work. It just requires configuring mysql for remote access, and updating the app's database.yml file to point to the database server.

@bd808: One thing I'm not sure about is how to handle the bind-address setting for mysqld on the database VM. It works with bind-address = 0.0.0.0, but it doesn't work with bind-address = dashboard-testing (where dashboard-testing is the name of the app server that needs to connect), nor with bind-address = 172.16.6.83 (that server's local network IP). Is using 0.0.0.0 fine here, since the database VM isn't web accessible?

> @bd808: One thing I'm not sure about is how to handle the bind-address setting for mysqld on the database VM. It works with bind-address = 0.0.0.0, but it doesn't work with bind-address = dashboard-testing (where dashboard-testing is the name of the app server that needs to connect), nor with bind-address = 172.16.6.83 (that server's local network IP). Is using 0.0.0.0 fine here, since the database VM isn't web accessible?

Yes, the default 0.0.0.0 (all interfaces) bind is perfectly ok. The Horizon managed "Security Groups" will keep any instance outside of your project from communicating with your database server.
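A minimal sketch of the two-VM wiring (the grant, password, and hostnames here are placeholders, not the real credentials):

```bash
# Database VM: listen on all interfaces; Security Groups restrict who can connect.
# /etc/mysql/mariadb.conf.d/50-server.cnf under [mysqld]:
#   bind-address = 0.0.0.0
sudo systemctl restart mysqld

# Limit the application account to the project's internal network range.
mysql -e "GRANT ALL ON dashboard.* TO 'outreachdashboard'@'172.16.%' IDENTIFIED BY 'changeme';"

# App VM: point Rails at the database host in config/database.yml (production section):
#   host: <database-vm>.globaleducation.eqiad1.wikimedia.cloud
```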

The database population step is complete (using up 61G out of 177G on the Cinder volume). I've turned the Dashboard web server back on, and I'll gradually enable related services.

My current plan is to hold off on enabling the long queue, which is responsible for the ~114 programs that take longer than 10 minutes per update (some of which take 10+ hours), identify the longest-updating programs that represent the most danger of bringing the system down, and figure out a strategy for them before starting to churn through them again. (If it has been 9+ days since a program's last update, that program is in this 'long' queue. Others should start receiving updates later today. If you are eagerly awaiting data for one of these slow-to-update programs, I'd like to hear from you so I can understand your use case and try to figure out a good path forward.)

Amazing work...thank you so much for this!

Thank you so much! We were very concerned this problem could jeopardize our programs, and it is amazing you were able to solve it so fast. Cheers!

The web app is largely working now, but there seems to be a problem with the database configuration when the system tries to run some of the larger queries, which trigger a connection error. I'm investigating it today.

I've set log_warnings=4 to surface logs from these connection errors on the mysqld side. They look like this: 2021-03-25 15:47:13 10 [Warning] Aborted connection 10 to db: 'dashboard' user: 'outreachdashboard' host: 'programs-and-events-dashboard.globaleducation.eqiad1.wikimed' (Got an error reading communication packets)

I tried increasing max_allowed_packet, but that didn't help.

I tried these settings as well, to no avail:

innodb_log_buffer_size=128M
innodb_log_file_size=2047M

Also tried:
innodb_buffer_pool_size=12GB

@bd808 any insight into what might be going wrong communicating between the VMs for these larger queries?

> @bd808 any insight into what might be going wrong communicating between the VMs for these larger queries?

Sadly my MariaDB admin knowledge is pretty close to "I guess I'll try a web search". I found this blog post from Percona, but I'm not sure if it will be helpful to you at all: https://www.percona.com/blog/2016/05/16/mysql-got-an-error-reading-communication-packet-errors/

Yeah, I've been looking at that. It points to the problem most likely being network related, rather than server related. I ran tcpdump through a disconnection and then used strings to see what the traffic looked like, but it offers no useful hints that I can glean.

> Yeah, I've been looking at that. It points to the problem most likely being network related, rather than server related. I ran tcpdump through a disconnection and then used strings to see what the traffic looked like, but it offers no useful hints that I can glean.

https://www.percona.com/blog/2007/07/08/mysql-net_write_timeout-vs-wait_timeout-and-protocol-notes/ seems potentially relevant as it sounds like the problems you are seeing are related to "large" writes.

I already tried increasing net_write_timeout as well, but that didn't help.

All the problems seem to be with queries where the query itself is very long (but not necessarily expensive), for example because of a long list of IN values. Some cases are not writing to the database, just reading. (And some are doing a large batch update.)

I'm still stumped on why larger query strings are causing the app to lose its database connection. I've started putting in place some workarounds to avoid needing such long query strings in the places that have failed so far.
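One generic way to avoid a single enormous IN (...) list is to chunk the values and issue several smaller queries. The sketch below expresses the idea in shell purely for illustration (file name, chunk size, and columns are assumptions; the Dashboard's actual workaround lives in the Rails code):

```bash
# Split a long list of article IDs (one per line) into chunks of 500,
# then query each chunk separately instead of building one huge IN list.
split -l 500 article_ids.txt chunk_
for f in chunk_*; do
  ids=$(paste -sd, "$f")
  mysql dashboard -e "SELECT id FROM revisions WHERE article_id IN ($ids);"
done
rm -f chunk_*
```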

The adjustments I made to avoid long query strings are working so far, and all update processes are currently back online. I'll continue monitoring closely for additional problems.

> @bd808 any insight into what might be going wrong communicating between the VMs for these larger queries?
>
> Sadly my MariaDB admin knowledge is pretty close to "I guess I'll try a web search". I found this blog post from Percona, but I'm not sure if it will be helpful to you at all: https://www.percona.com/blog/2016/05/16/mysql-got-an-error-reading-communication-packet-errors/

Maybe @Marostegui could chime in with some insight? He's one of the WMF's DBAs, taking care of the production databases, and while this is about a WMCS service, the database side wouldn't differ much :-).

The system had looked quite stable, but about 4 hours ago, all the update processes once again started erroring due to a corrupted revisions index. So far, I haven't found any hints at what the acute cause is. Given that this happened again after moving the database to a new server and a Cinder volume, it seems like a hardware problem is a very unlikely reason.

Right now, it looks like much of the frontend is working fine; just queries that touch the revisions table (such as the Activity tab for a program) will error. I think this means that events happening today will still be able to add users without issue, so I'll leave the system up for now (which will allow us to get stats for those events later on).

My best guess is that this is connected to one of the biggest programs (revisions-wise) going through the update process. My current plan is to take the system down again tomorrow, dump and rebuild the revisions table again, and this time leave the long update queue turned off. On Friday I added a separate very_long_update queue to remove from the update system individual programs that I know are too large; I manually marked 3 such programs previously. This time I'll go in the opposite direction, enabling one by one the events that have reasonable update times and recent successful updates.

This error at 11:31 UTC March 28 happened right before the index corruption:

Mysql2::Error: Got error 192 'Table encrypted but decryption failed. This could be because correct encryption management plugin is not loaded, used encryption key is not available or encryption method does not match.' from InnoDB

I've taken the Dashboard offline, put the database in innodb_force_recovery=1 mode, and started mysqldump dashboard revisions > revisions.sql. It'll take perhaps a couple of hours to drop and reload the revisions table, after which I'll switch out of recovery mode and bring the frontend back online while I work on isolating the events that take too long.

I've completed the revisions.sql dump (33G), dropped the revisions table, taken it out of recovery mode, and started restoring the table (mysql dashboard < revisions.sql).

I've finished reloading the revisions table, brought the web server back up, and turned on the background processes for short and medium events and 'constant' updates. I'll turn on the 'daily' updates process in a little while.

I've now disabled updates for all the 'long' programs, which are the ones that have taken longer than 10 minutes for a data update. Many of those that take between 10 and 30 minutes are probably fine, and I'll start re-enabling updates for such pages. I'll be adding a message shortly to indicate when a particular program has updates disabled. For the longest ones, I plan to do more work to limit the time-consuming parts of the process (likely by disabling some features) so that updates can eventually be re-enabled.

I still don't have a good idea of exactly what has repeatedly caused the revision index corruption. My best guess at this point is that a database query for one or more of the very long courses can use up all the memory on the server and crash the database while it is in the middle of writing new revisions. I plan to test the system's limits by replicating very long courses on a test server to see if I can recreate and pin down the problem, but it may be a while before the largest ones can be re-enabled. (I suspect that many of these aren't actively needed anyway; some of these very long courses were created one or more years ago and set up to track a set of very active users for 10+ years.)

Ragesoss renamed this task from Outreachdashboard.wmflabs.org is down with "queue full" to Outreachdashboard.wmflabs.org is down frequently with database problems.Mar 30 2021, 5:28 PM
Ragesoss updated the task description. (Show Details)

> I've now disabled updates for all the 'long' programs, which are the ones that have taken longer than 10 minutes for a data update. Many of those that take between 10 and 30 minutes are probably fine, and I'll start re-enabling updates for such pages. I'll be adding a message shortly to indicate when a particular program has updates disabled. For the longest ones, I plan to do more work to limit the time-consuming parts of the process (likely by disabling some features) so that updates can eventually be re-enabled.
>
> I still don't have a good idea of exactly what has repeatedly caused the revision index corruption. My best guess at this point is that a database query for one or more of the very long courses can use up all the memory on the server and crash the database while it is in the middle of writing new revisions. I plan to test the system's limits by replicating very long courses on a test server to see if I can recreate and pin down the problem, but it may be a while before the largest ones can be re-enabled. (I suspect that many of these aren't actively needed anyway; some of these very long courses were created one or more years ago and set up to track a set of very active users for 10+ years.)

Sage, have you checked if a memory module is failing in the respective host server?

Could you update the MariaDB server to the latest minor version?

> Sage, have you checked if a memory module is failing in the respective host server?
>
> Could you update the MariaDB server to the latest minor version?

Given that we've seen the same database corruption problem occur once since moving the database to a separate server, I think failing hardware is very unlikely to be the problem.

When I set up a new server just for the database, I installed the latest stable MariaDB version available on Debian Buster, which is 10.3, a bump from the 10.1 that was used previously. There are newer versions of MariaDB, but none that are part of Debian Buster. At this point, I don't have any strong reason to think newer versions will address the specific problem we're having; if the problem is that some queries are running the database server out of memory and crashing it, the solution needs to be to fix the app code to avoid such over-long queries. One thing I'll probably try soon is to lower the query time limit setting; this will likely throw errors during the update process for large programs, which will let me pinpoint the most dangerous queries.
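For reference, MariaDB's per-query time limit can be lowered with max_statement_time (value in seconds, 0 = unlimited); the threshold below is only an example:

```bash
# Abort any statement that runs longer than 60 seconds.
mysql -e "SET GLOBAL max_statement_time = 60;"
# To persist it, add under [mysqld] in 50-server.cnf:
#   max_statement_time = 60
```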

For now, I'm making progress with re-enabling updates for many of the slow-updating programs, along with some performance adjustments that seem to be helping a lot with reducing the amount of unnecessary work involved with some of them.

I've now re-enabled updates for all programs except for the following ones, which are disproportionately responsible for high system load and which I need to study more to find possible bottlenecks before I'll feel safe re-enabling them. Many of them track very large numbers of articles over long time periods; I may need to add some limitations, such as not allowing programs that track activity for more than 1 or 2 years.

https://outreachdashboard.wmflabs.org/courses/SEI/Communication_of_environment_SDGs
https://outreachdashboard.wmflabs.org/courses/Musée_national_des_beaux-arts_du_Québec/GLAM_MNBAQ
https://outreachdashboard.wmflabs.org/courses/DidacREssources/Bicentenaire_de_l'Harmonie_de_l'Estaque
https://outreachdashboard.wmflabs.org/courses/Skin_Group/Leprosy_WikiJournal_of_Medicine_Submission
https://outreachdashboard.wmflabs.org/courses/Open_Foundation_West_Africa/OFWA_Membership_Across_Board
https://outreachdashboard.wmflabs.org/courses/AAU/International_Economics_(09-2018)
https://outreachdashboard.wmflabs.org/courses/Biblioteca_Europea_di_Informazione_e_Cultura/GLAM-BEIC
https://outreachdashboard.wmflabs.org/courses/State_Library_of_New_South_Wales/SLNSW_Home_Page
https://outreachdashboard.wmflabs.org/courses/State_Library_of_New_South_Wales/R_and_D_Research_Updates
https://outreachdashboard.wmflabs.org/courses/State_Library_NSW/Serials_2018-2019
https://outreachdashboard.wmflabs.org/courses/Mujeres_latinoamericanas_en_Wikimedia/Mujeres_latinoamericanas_en_Wikimedia
https://outreachdashboard.wmflabs.org/courses/Istituto_Centrale_per_gli_Archivi/Wiki_ICAR
https://outreachdashboard.wmflabs.org/courses/Polskojęzyczna_Wikipedia/Wikiprojekt_Nauki_medyczne
https://outreachdashboard.wmflabs.org/courses/Projet_Québec_Canada/BAnQ-Wikisource
https://outreachdashboard.wmflabs.org/courses/WMFr/PERSONNES_FORMEES_2017
https://outreachdashboard.wmflabs.org/courses/Hindi_Wikimedians_User_Group/Hindi_Wiki_Conference_2020_(21-23_February_2020)
https://outreachdashboard.wmflabs.org/courses/WMFr/EDITATHONS_2017
https://outreachdashboard.wmflabs.org/courses/WM-Affiliate_Members/AffiliateMembers-Cohort_1_(2020)

Some of these probably don't actually need to continue tracking, in which case we can just update the end date. I'll attempt to contact the facilitators to understand better which programs would still benefit from updates.

There's also one more remaining case of too-long SQL queries that I can't work around as easily as the others: for an ArticleScopedProgram that is tracking based on a large category or PetScan query, we use the array of article titles for that category/query to find which edited articles are in scope, and if there are thousands of titles it makes the SQL too long and the database connection cuts off. Here's an example of a program being affected by this problem: https://outreachdashboard.wmflabs.org/courses/wikidata/Museum_Day_2021_-_Museums_(2021)/articles/edited

I posted the problem to StackOverflow in case someone there can pinpoint the problem: https://stackoverflow.com/questions/66921479/database-connection-lost-whenever-query-string-is-too-long

My MySQL docs point to this possible problem:

Use of Ethernet protocol with Linux, both half and full duplex. Some Linux Ethernet drivers have this bug. You should test for this bug by transferring a huge file using FTP between the client and server machines. If a transfer goes in burst-pause-burst-pause mode, you are experiencing a Linux duplex syndrome. Switch the duplex mode for both your network card and hub/switch to either full duplex or to half duplex and test the results to determine the best setting.

I tried looking into this with ethtool but it doesn't return any useful info.

I fixed the too-long-SQL-queries problem! After the switch to a separate database server, the Rails app was still connecting to the database using the mysql2 gem built against MariaDB 10.1, but the database server is now running MariaDB 10.3. I upgraded libmariadb-dev to 10.3 on the Rails server, rebuilt the gem, and the problem went away. :-D
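Roughly what that fix involves on the app server (package and gem names as mentioned above; the exact commands are an assumption):

```bash
sudo apt-get install libmariadb-dev   # client headers matching the 10.3 server
gem pristine mysql2                   # recompile the gem's native extension against them
sudo systemctl restart apache2        # restart Passenger so workers pick up the rebuilt gem
```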

Hi @Ragesoss, my colleague and I have been getting a "500 Internal Server Error" for some time now (today again) when trying to log into the P&E Dashboard. Is it related, or should I start a new issue?

> Hi @Ragesoss, my colleague and I have been getting a "500 Internal Server Error" for some time now (today again) when trying to log into the P&E Dashboard. Is it related, or should I start a new issue?

I think this is very similar to T281334: Intermittent 500s when loading course.json, an issue I started a few hours ago.

I'm not sure what is wrong, but it might be related to the recent NFS outage on cloud services... I can't currently log in to the server, as I'm getting Permission denied (publickey)

I'm back into the system now after some help with restarting the server and fixing LDAP integration. My work preparing for a distributed architecture is still in progress, but it's looking good; I hope to have it much more stable and resilient a few months from now.

But at the rate it's been hitting problems lately, it might be good for me to set up a fresh application VM sooner than that. Now that the database is on a separate, up-to-date VM, standing up a new application server to replace the current one will be relatively quick and easy.

I set up and switched over to a fresh VM today; with the separate database server, this was completely seamless as all I needed to do once it was set up was to change the Web Proxy on Horizon to point to the new VM.

A next intermediate step will be to use a separate VM for most of the background jobs, because I think running out of memory is the cause of most of the downtime we've had recently.

Today I set up an additional node to process the long program update queue. My best guess is that running out of memory is the source of the problems we've had lately (i.e., the server going down until an Apache restart), so moving that most resource-intensive process to a separate server should help.

The system now has three VMs:

  • p-and-e-dashboard-database - running the MariaDB database, with the data living on the attached Cinder volume
  • p-and-e-dashboard-web - running Apache2/Passenger and the Rails web app, as well as Redis and memcached, plus 5 Sidekiq services for processing background jobs
  • p-and-e-dashboard-sidekiq - running 2 Sidekiq services, including the only one set to process data updates for programs in the long_update queue (plus an extra one for the short_update queue to improve latency for short-duration events)

This is just a short-term arrangement until we're ready to switch to a Nomad-based distributed architecture within the next few months, since in the meantime it makes deploying updates to the Dashboard more complicated.
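For the record, a dedicated Sidekiq worker pinned to one queue can be wired up with a systemd unit along these lines (paths, user, unit name, and the Redis URL are assumptions; Sidekiq's -q/-c flags are real):

```bash
sudo tee /etc/systemd/system/sidekiq-long.service > /dev/null <<'EOF'
[Unit]
Description=Sidekiq worker for the long_update queue
After=network.target

[Service]
WorkingDirectory=/var/www/dashboard
Environment=REDIS_URL=redis://p-and-e-dashboard-web:6379/0
ExecStart=/usr/local/bin/bundle exec sidekiq -q long_update -c 1
User=dashboard
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now sidekiq-long.service
```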

Yesterday the system was broken for several hours because of a full disk on p-and-e-dashboard-database. This happened because I hadn't configured the database to use the Cinder volume for its tmpdir, so temporary tables filled up the small root partition.
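The fix amounts to pointing tmpdir at the Cinder-backed volume; a minimal sketch (the directory name is an assumption):

```bash
sudo mkdir -p /srv/mysql-tmp
sudo chown mysql:mysql /srv/mysql-tmp
# In /etc/mysql/mariadb.conf.d/50-server.cnf under [mysqld]:
#   tmpdir = /srv/mysql-tmp
sudo systemctl restart mysqld
```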

It's been in good shape lately, so I'm closing this; we can re-open it if we start facing problems again.

JJMC89 changed the subtype of this task from "Production Error" to "Bug Report".Sep 18 2023, 5:16 PM