Puppet failure emails sent to non-admin members of tools project causing user confusion
Closed, Resolved, Public

Description

Got the following automated email from Cloud VPS a few moments ago. Should we be concerned about it? I don't have much of a way to look into this right now.

Thanks!

Puppet is failing to run on the "tools-exec-1430.tools.eqiad.wmflabs" instance in Wikimedia Cloud VPS.

Working Puppet runs are needed to maintain instance security and logins.
As long as Puppet continues to fail, this system is in danger of becoming
unreachable.

You are receiving this email because you are listed as member for the
project that contains this instance. Please take steps to repair
this instance or contact a Cloud VPS admin for assistance.

For further support, visit #wikimedia-cloud on freenode or
https://wikitech.wikimedia.org

Event Timeline

Got another email stating that puppet is now failing on tools-sgeexec-0905.tools.eqiad.wmflabs as well, so updating phab title.

Az1568 renamed this task from Puppet failure on tools-exec-1430.tools.eqiad.wmflabs to Puppet failure on tools-exec-1430.tools.eqiad.wmflabs and tools-sgeexec-0905.tools.eqiad.wmflabs. Mar 11 2019, 8:57 AM

I also received two automated emails; the numbers are different, but I am sure that they are related to this bug.

tools-sgeexec-0905.tools.eqiad.wmflabs

tools-exec-1430.tools.eqiad.wmflabs

Change 495670 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] profile::base::labs - Ability to disable Puppet failure emails

https://gerrit.wikimedia.org/r/495670

notify_maintainers.py emails all members of a project.

Apparently, Toolforge users are added to the tools project so they can SSH into the bastions. So it seems things are working as expected, if only in an annoying way for Toolforge (which is a special Cloud VPS project).

I'm puzzled about why this only became a problem today. There have been many Puppet failures in various Toolforge servers and I don't remember users being notified of those.

I was a bit confused as well, not being aware of T210432 and @GTirloni's fix there.

I am still confused, as I did not receive these Puppet failure emails today. (I checked my email filters, my spam folder, and whether LDAP has the correct email address.)

On the whole, I believe it is a good idea to have members receive failure emails (instead of only admins) on most projects. The standard infrastructure projects maintained by WMCS are exceptions though (the bastion and tools projects atm).

On the other projects we assume members are there for a reason (usually to maintain or use some part of the VPS project). It makes perfect sense to have them be notified about failures in that context (especially if they can opt out).

The bastion and tools projects, however, probably have most users as members (all? Perhaps a mail server overload is why I didn't get an email?) and only WMCS team members and other cloud roots can do anything to fix things.

Emailing to all members of the project is not desirable for at least the bastion and tools projects. In practice it is probably not useful for most projects. A slightly more ideal mailing list would be just the projectadmin members of each project. This would require us to check with the OpenStack Keystone service to get that list as LDAP only tracks "membership" broadly for each project without distinction by permission level.
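The recipient policy discussed above can be sketched in a few lines of Python. This is only an illustration, not the real notify_maintainers.py: the `RoleAssignment` shape, the `puppet_alert_recipients` name, and the idea of passing role assignments in as plain data (instead of querying Keystone) are all assumptions made for the example. Only the `projectadmin` role name and the tools/bastion exception come from this thread.

```python
from dataclasses import dataclass

# Hypothetical names throughout; this is not the real notify_maintainers.py.
# Projects where nearly every user is a member, so alerts are suppressed.
EXCLUDED_PROJECTS = frozenset({"tools", "bastion"})
ADMIN_ROLE = "projectadmin"


@dataclass(frozen=True)
class RoleAssignment:
    """One user/role pairing inside a project (assumed shape)."""
    user_email: str
    role: str


def puppet_alert_recipients(project, assignments):
    """Return sorted, de-duplicated emails that should get a failure alert."""
    if project in EXCLUDED_PROJECTS:
        return []
    return sorted({a.user_email for a in assignments if a.role == ADMIN_ROLE})


if __name__ == "__main__":
    # "user" stands in for a plain membership role; the name is an assumption.
    demo = [
        RoleAssignment("admin@example.org", "projectadmin"),
        RoleAssignment("member@example.org", "user"),
    ]
    print(puppet_alert_recipients("someproject", demo))
    print(puppet_alert_recipients("tools", demo))
```

Keeping the role lookup outside the function means the filtering rule can be checked without a live Keystone or LDAP connection.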

Change 495729 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvps: Do not send Puppet failure emails for tools/bastion projects

https://gerrit.wikimedia.org/r/495729

Change 495729 merged by GTirloni:
[operations/puppet@production] cloudvps: Do not send Puppet failure emails for tools/bastion projects

https://gerrit.wikimedia.org/r/495729

bd808 renamed this task from Puppet failure on tools-exec-1430.tools.eqiad.wmflabs and tools-sgeexec-0905.tools.eqiad.wmflabs to Puppet failure emails sent to non-admin members of tools project causing user confusion. Mar 11 2019, 6:36 PM
Krenair subscribed.

I'll have a go at this

Change 495757 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] puppet_alert: Email projectadmins instead of members

https://gerrit.wikimedia.org/r/495757

Change 495757 merged by GTirloni:
[operations/puppet@production] puppet_alert: Email projectadmins instead of members

https://gerrit.wikimedia.org/r/495757

GTirloni triaged this task as Medium priority.
bd808 reopened this task as Open. Edited Mar 15 2019, 8:22 PM

Reports of unintended fallout via IRC:

[20:08]  <  Izhidez>	why are we forcing old versions of mysql client onto instances, causing dependency conflicts and uninstalling our entire database?
[20:17]  <    bd808>	Izhidez: can you provide a bit more context?
[20:18]  < Oresrian>	This puppet change (  https://gerrit.wikimedia.org/r/c/operations/puppet/+/495757/4/modules/role/manifests/labs/instance.pp#8  ) forced the installation of the mysql-client-5.5 package, which forced the removal of the mariadb-server package
[20:19]  <    bd808>	That is better context! Thank you.
[20:19]  <    bd808>	I would also say that was unintended and a bug we need to figure out how to correct
[20:20]  < Oresrian>	I've temporarily disabled puppet on the affected instance, reinstalled mariadb-server, and I'm now rebuilding that instance with the puppet role::mariadb; hopefully that will let puppet see the conflict.
[20:25]  <    bd808>	Oresrian: what operating system (Jessie? Stretch?) is the instance that you are fixing running?
[20:25]  < Oresrian>	yeah, accounts-db3 is the instance
[20:25]  < Oresrian>	I *think* it's jessie
[20:25]  < Oresrian>	yup: Description:    Debian GNU/Linux 8.6 (jessie)

/var/log/syslog
Mar 15 13:43:50 accounts-db3 systemd[1]: Stopping LSB: Start and stop the mysql database server daemon...
Mar 15 13:43:50 accounts-db3 systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
Mar 15 13:43:50 accounts-db3 mysqld: 190315 13:43:50 [Note] /usr/sbin/mysqld: Normal shutdown
Mar 15 13:43:50 accounts-db3 mysqld: 190315 13:43:50 [Note] Event Scheduler: Purging the queue. 0 events
Mar 15 13:43:50 accounts-db3 mysqld: 190315 13:43:50 [Note] InnoDB: FTS optimize thread exiting.
Mar 15 13:43:50 accounts-db3 mysqld: 190315 13:43:50 [Note] InnoDB: Starting shutdown...
Mar 15 13:43:51 accounts-db3 mysqld: 190315 13:43:51 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
Mar 15 13:43:52 accounts-db3 mysqld: 190315 13:43:52 [Note] InnoDB: Shutdown completed; log sequence number 26725902627
Mar 15 13:43:52 accounts-db3 mysqld: 190315 13:43:52 [Note] /usr/sbin/mysqld: Shutdown complete
Mar 15 13:43:52 accounts-db3 mysqld: 
Mar 15 13:43:52 accounts-db3 mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended
Mar 15 13:43:52 accounts-db3 mysql[12038]: Stopping MariaDB database server: mysqld.
Mar 15 13:43:52 accounts-db3 systemd[1]: Stopped LSB: Start and stop the mysql database server daemon.
Mar 15 13:43:54 accounts-db3 puppet-agent[9283]: (/Stage[main]/Openstack::Clientpackages::Mitaka::Jessie/Package[mysql-client-5.5]/ensure) created

/var/log/apt/history.log
Start-Date: 2019-03-15  13:43:50
Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install mysql-client-5.5
Install: mysql-client-5.5:amd64 (5.5.62-0+deb8u1)
Remove: mariadb-client:amd64 (10.0.38-0+deb8u1), mariadb-client-core-10.0:amd64 (10.0.38-0+deb8u1), mariadb-server-core-10.0:amd64 (10.0.38-0+deb8u1), mariadb-client-10.0:amd64 (10.0.38-0+deb8u1), mariadb-server-10.0:amd64 (10.0.38-0+deb8u1), mariadb-server:amd64 (10.0.38-0+deb8u1)
End-Date: 2019-03-15  13:43:54
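
The apt history entry above is what ties the MariaDB removal to the Puppet-driven install. A minimal Python sketch for pulling the removed packages out of such a stanza follows. The function names are hypothetical, and the field layout is assumed from the excerpt above (one `Key: value` pair per line, package entries written as `name:arch (version)`).

```python
import re

# Matches "name:arch (version...)" entries. Commas can also occur inside the
# parentheses, so a plain split(",") would not be safe.
_PKG = re.compile(r"([^\s,()]+)\s*\(([^)]*)\)")

# Stanza copied from the history.log excerpt in this task.
SAMPLE = (
    "Start-Date: 2019-03-15  13:43:50\n"
    "Commandline: /usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold"
    " install mysql-client-5.5\n"
    "Install: mysql-client-5.5:amd64 (5.5.62-0+deb8u1)\n"
    "Remove: mariadb-client:amd64 (10.0.38-0+deb8u1),"
    " mariadb-client-core-10.0:amd64 (10.0.38-0+deb8u1),"
    " mariadb-server-core-10.0:amd64 (10.0.38-0+deb8u1),"
    " mariadb-client-10.0:amd64 (10.0.38-0+deb8u1),"
    " mariadb-server-10.0:amd64 (10.0.38-0+deb8u1),"
    " mariadb-server:amd64 (10.0.38-0+deb8u1)\n"
    "End-Date: 2019-03-15  13:43:54\n"
)


def parse_stanza(text):
    """Turn one history.log stanza into a {field: value} dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Split on the first colon only; values may contain more colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def removed_packages(stanza):
    """Return the package names listed on the stanza's Remove: line."""
    return [name for name, _version in _PKG.findall(stanza.get("Remove", ""))]


if __name__ == "__main__":
    stanza = parse_stanza(SAMPLE)
    print(stanza["Commandline"])
    for name in removed_packages(stanza):
        print("removed:", name)
```

A check like this could flag any apt run where an install requested by Puppet also removed packages, which is the symptom reported on accounts-db3.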

@aborrero I tracked down mysql packages in openstack clientpackages to seemingly have been introduced in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475326/ but it's not clear why it got included. Is there any chance you still remember what that was for?

Change 497210 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] cloud-vps: Remove mysql packages from openstack::clientpackages::mitaka::*

https://gerrit.wikimedia.org/r/497210

@aborrero I tracked down mysql packages in openstack clientpackages to seemingly have been introduced in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475326/ but it's not clear why it got included. Is there any chance you still remember what that was for?

When I did that refactor I tried to minimize the Puppet catalog delta on the affected servers. I don't remember all the details, since it was a large refactor, but I would say there was a Package[mysql-client] somewhere in the servers' Puppet catalog before my refactor (perhaps by means of require_package()).

I think this theory is right because I just found this comment in the patch you mentioned:

+    # we are moving away from require_package() in this factorization, so put
+    # this here to have a minimal catalog diff. This could be dropped, but
+    # probably better to just wait until we deprecate this deployment.
+    # Why? because we switched to 'virtual-mysql-client', which is a more
+    # robust way of expressing this dependency.
+    $mainpackages = [
+        'mysql-client-5.5',
+        'mysql-common',
+    ]
+

As I mentioned to you on IRC, to try to keep the refactor as simple as possible (and as rollback-friendly as possible) I tried to minimize the catalog diff, meaning that I didn't evaluate if we required all the configs, just factorized it with the idea of doing that evaluation later on (as you are doing right now).
So this is kinda expected, sadly.
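
To illustrate what "minimizing the catalog diff" means here, a small Python sketch that compares the resource lists of two catalogs. It assumes a catalog is available as a dict with a `resources` list whose entries carry `type` and `title` keys; the helper names and the sample data are hypothetical and only mirror the package names quoted in the comment above.

```python
def resource_refs(catalog):
    """Collect 'Type[title]' references from a catalog-like dict."""
    return {f"{r['type']}[{r['title']}]" for r in catalog.get("resources", [])}


def catalog_delta(before, after):
    """Return (added, removed) resource references between two catalogs."""
    old, new = resource_refs(before), resource_refs(after)
    return sorted(new - old), sorted(old - new)


if __name__ == "__main__":
    # Hypothetical catalogs: the packages exist before the refactor and
    # would vanish if the refactored class did not list them explicitly.
    before = {"resources": [
        {"type": "Package", "title": "mysql-client-5.5"},
        {"type": "Package", "title": "mysql-common"},
    ]}
    after_without_list = {"resources": [
        {"type": "Package", "title": "mysql-common"},
    ]}
    added, removed = catalog_delta(before, after_without_list)
    print("added:", added)
    print("removed:", removed)
```

An empty delta is what the refactor aimed for, which is why the packages were carried over without re-checking whether they were still needed.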

ack, sounds like we should try getting rid of those mysql packages from the manifest then. thanks.

Change 497210 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud-vps: Remove mysql packages from openstack::clientpackages::mitaka::*

https://gerrit.wikimedia.org/r/497210

@stwalkerster this should be better now

Change 497445 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] Follow-up I71678b27: Remove stray MariaDB reference in openstack::clientpackages::newton::stretch

https://gerrit.wikimedia.org/r/497445

Change 497445 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack: Follow-up I71678b27: Remove stray MariaDB reference

https://gerrit.wikimedia.org/r/497445

Change 495670 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] profile::base::labs - Ability to disable Puppet failure emails

https://gerrit.wikimedia.org/r/495670

Change 495670 merged by GTirloni:
[operations/puppet@production] profile::base::labs - Ability to disable Puppet failure emails

https://gerrit.wikimedia.org/r/495670

Change 497445 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: Follow-up I71678b27: Remove stray MariaDB reference

https://gerrit.wikimedia.org/r/497445