User Details
- User Since
- Feb 10 2016, 11:25 AM (457 w, 3 d)
- Availability
- Available
- IRC Nick
- volans
- LDAP User
- Volans
- MediaWiki User
- RCoccioli (WMF)
Thu, Nov 14
Did you go through https://wikitech.wikimedia.org/wiki/Management_Interfaces#Troubleshooting_Commands ?
Could a Phabricator search with the hostname for open+closed tasks cover the needs?
Mon, Nov 11
With the new requestctl web UI I think it would be very useful if the current requestctl generator ( https://superset.wikimedia.org/requestctl-generator?q= ) were adapted to work with the new web UI or (even better) ditched, with the same functionality embedded directly into the requestctl web UI, just by providing the URL of a superset dashboard with filters.
For 3. an EASY way to link the superset dashboard from a given requestctl rule is to generate links of this type:
Thu, Nov 7
@Marostegui I've added some notes to those two pages and removed one paragraph that I think was obsolete due to active/active MediaWiki. I think, though, that some of the steps listed there might be outdated.
I think we could put the code directly into the move-vlan cookbook: if the host is present in dbctl, update its IP. I don't see much use for a database-specific update-ip cookbook, but let me know if you see other use cases for it.
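Something along these lines, purely as a rough sketch (the dbctl accessor, lookup and attribute names below are my assumptions, not the actual Spicerack dbctl API, and would need to be checked before implementing):
# Hypothetical helper inside the move-vlan cookbook; all dbctl method and
# attribute names below are assumptions to verify against Spicerack.
def update_dbctl_ip(spicerack, hostname: str, new_ip: str) -> None:
    """If the host is present in dbctl, update its IP; otherwise do nothing."""
    dbctl = spicerack.dbctl()  # assumed accessor for the dbctl support
    instance = dbctl.instance.get(hostname)  # assumed lookup by instance name
    if instance is None:
        return  # host not managed by dbctl, nothing to do
    instance.host_ip = new_ip  # assumed attribute holding the instance IP
    dbctl.instance.write(instance)  # assumed write-back, followed by a dbctl commit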
Wed, Nov 6
In our Netbox config we have this among the logging formatters:
'django.server': {
    '()': 'django.utils.log.ServerFormatter',
    'format': '[%(server_time)s] %(message)s',
},
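For context, this is roughly where such a formatter sits in a standard Django dictConfig LOGGING setting (a minimal sketch, not our actual Netbox config; the handler and logger entries are illustrative):
# Minimal Django LOGGING layout showing where the formatter above plugs in.
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'django.server': {
            '()': 'django.utils.log.ServerFormatter',
            'format': '[%(server_time)s] %(message)s',
        },
    },
    'handlers': {
        'console': {'class': 'logging.StreamHandler', 'formatter': 'django.server'},
    },
    'loggers': {
        'django.server': {'handlers': ['console'], 'level': 'INFO', 'propagate': False},
    },
}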
Mon, Nov 4
@bking we had a brief chat in the I/F meeting today about this. We think that this would mostly be a step backward instead of forward.
Wed, Oct 30
I've tried on the es7 eqiad master, which is pooled with a weight of 10 and gets only 4.7% of the requests.
I have some comments/questions on the above:
Mon, Oct 28
Should it be part of the sre.ganeti.addnode cookbook or done at reimage time?
Wed, Oct 23
The sre.mysql.pool and sre.mysql.depool cookbooks have been merged and are ready to be used. Future expansions will be handled separately from this task.
Mon, Oct 21
Tentatively resolving; I can't reproduce it anymore. If you encounter the same issue please re-open it.
De-assigning this from me as I haven't worked on it and don't plan to do so soon.
Now that we have dbctl support in Spicerack it should be doable to add the step to modify the IP in dbctl when needed.
Fri, Oct 18
All pending patches have been tested and merged. Resolving.
This seems like a perfect opportunity to re-evaluate why we're hardcoding those values in Puppet's hiera and to check whether that can be avoided.
Some possible options could be to gather them via a PuppetDB query, from DNS, or by exporting them from Netbox.
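As a concrete example of the PuppetDB option (a rough sketch; the endpoint URL, fact name and certname pattern are placeholders rather than a concrete proposal):
# Gather the values from PuppetDB instead of hardcoding them in hiera.
import requests

PUPPETDB = 'https://puppetdb.example.wmnet/pdb/query/v4'  # placeholder URL
# PQL query; the fact name and the certname regex are placeholders.
query = 'facts[certname, value] { name = "ipaddress" and certname ~ "^ganeti" }'
response = requests.get(PUPPETDB, params={'query': query}, timeout=30)
response.raise_for_status()
for row in response.json():
    print(row['certname'], row['value'])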
Oct 17 2024
Oct 16 2024
The cause of that dry-run failure was the check, added here, that replication of MASTER_FROM from MASTER_TO is working.
Oct 15 2024
A forced reboot via mgmt seems to have put it back in a working state for now.
Oct 11 2024
@jcrespo I had it almost finished yesterday but then I had to step out; I've sent the patches. If you test with test-cookbook using the last one (1079537) as the CR, you'll also be testing all the other pending improvements that were done but not yet merged.
The last one also allows testing on a custom section, so you can pass --section test-s4 and it should do the right thing.
Oct 10 2024
Thanks @jcrespo for the detailed request, I'll get to it. Only one question: are you sure we want to use REPLACE and not INSERT? I thought that REPLACE contributed to the issue.
Oct 7 2024
This has been released and tested. Resolving.
@ssingh what do you think of the above draft patch proposal? If that works for you I'll complete it and get it included into spicerack.
Spicerack has support for Prometheus, so why not get the metrics directly from there in the cookbooks?
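Roughly something like this inside a cookbook (a sketch; the accessor and query() signature are from memory and should be double-checked against the Spicerack docs, and the metric is just a placeholder):
# Inside a cookbook runner: read the metric straight from Prometheus.
prometheus = self.spicerack.prometheus()  # accessor name to be verified
results = prometheus.query('sum(rate(haproxy_frontend_http_requests_total[5m]))', 'eqiad')
for result in results:
    print(result['metric'], result['value'])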
Oct 2 2024
Is there a plan to try to get away from the very long hardcoded lists in hiera?
How often do you expect the data to change? This might affect how we want to trigger the changes.
Sep 30 2024
The above was me aborting the leftover execution of the cookbook that had been left waiting for user input.
Sep 25 2024
With the current API that's not possible because allowed_aliases is an instance property (not a class property) of the runner class, not of the cookbook class.
This means that we need an instance of the runner to get the list of allowed aliases, and to get that instance we need an instance of Spicerack, which is passed only to the get_runner method and not to argument_parser.
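To make the shape of the problem concrete, the class-based cookbook API looks roughly like this (a simplified sketch, with the mysql-specific parts reduced to placeholders):
# Simplified sketch of why allowed_aliases isn't reachable at argument-parsing time.
from spicerack.cookbook import CookbookBase, CookbookRunnerBase

class Pool(CookbookBase):
    def argument_parser(self):
        # No runner exists yet here, so allowed_aliases is not available
        # when building the argparse choices.
        ...

    def get_runner(self, args):
        # Only here do we get to build the runner, using self.spicerack.
        return PoolRunner(args, self.spicerack)

class PoolRunner(CookbookRunnerBase):
    def __init__(self, args, spicerack):
        # allowed_aliases is computed per instance, e.g. from live data.
        self.allowed_aliases = []  # placeholder

    def run(self):
        ...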
Sep 24 2024
If this is not super urgent, do you think it could wait for an "upstream" solution in Spicerack, as discussed in T375014?
Sep 23 2024
AFAIK pc1015 should be the candidate host if we want to fail it over, from dbctl:
"note": "Hot spare for pc4 and cold spare for pc3",
Thanks for the summary @ssingh. I have a local proposal that I will send out when ready. There is one main point to decide, which is the "caching" time of this information:
Sep 18 2024
Thanks for the task. I think the main decision to make is how fresh the data needs to be. If we opt for refreshing it every time it's called, then we need to think carefully about how to inject it into Spicerack, as it would not be possible to just have it as a helper in the Spicerack class.
Sep 17 2024
The idea was that this is all part of the A->B datacenter switchover, and the finalize steps are still part of that process, hence to be called with the same A->B flow. If you prefer otherwise we can invert it and/or improve the help messages.
This is the updated list as of today: clouddb2002-dev,cloudlb2004-dev,clouddb[1013-1020].
I guess the clouddb hosts are expected and none of them has AAAA records. That leaves only cloudlb2004-dev as the outlier.
Sep 16 2024
As there was no agreement here on the task and multiple years have passed, we decided to close it. Feel free to reopen if more consensus emerges.
The alertmanager support has been in place for a long time. Resolving. Any additional feature will be developed separately.
Sep 10 2024
With a view to a cookbook replacing puppet merge, I'd try to use https://doc.wikimedia.org/spicerack/master/api/spicerack.reposync.html#spicerack.reposync.RepoSync
I don't think it does anymore unfortunately...
While the above is totally true, the probability that a rename+reimage happens exactly when the once-a-day timer runs is fairly low.
Sep 5 2024
I need to dig up my old memories and check local branches; the hardest part IIRC is not the code changes but the grammar changes needed to support it. Do you have a timeline in mind, so that I can plan accordingly?
Thanks for the task, we'll evaluate the various options and come up with a final proposal.
Sep 4 2024
Netbox 4.1 is out, published yesterday.
FYI Cumin's puppet backend too will need to be refactored to support structured facts.
The patch was needed; the last error was at Sep 03 17:08:32. After that it ran smoothly, except for one run at Sep 03 20:10:57 that failed with ERROR:/usr/local/bin/vrts_aliases:Connection unexpectedly closed.
Sep 3 2024
We have 3.2 in prod right now ( https://docs.djangoproject.com/en/3.2/ref/models/querysets/#bulk-update ), but yes, this is an option, although save() will not be called, so we need to verify whether that would skip some steps.
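For reference, a minimal sketch of the difference (the model, filter and field names are made up):
# bulk_update() issues UPDATEs in batches but bypasses Model.save(),
# so custom save() logic and pre/post_save signals will not run.
devices = list(Device.objects.filter(site__slug='eqiad'))
for device in devices:
    device.status = 'offline'
Device.objects.bulk_update(devices, ['status'])

# Per-object loop that does go through save() (and its extra steps):
for device in devices:
    device.status = 'offline'
    device.save(update_fields=['status'])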
@LSobanski the current failures are because there are 2 email addresses in the config that are managed by gsuite (I've redacted some part):
Sep 2 2024
Updated list of ganeti hosts without AAAA records (all the others have them): ganeti[1009-1022,2009-2024]
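For whoever wants to re-check this later, the list can be regenerated with something along these lines (a sketch; the host list is a placeholder and a real run should take the FQDNs from Netbox or the DNS zone files):
# List hosts that do not resolve to any AAAA record.
import socket

def has_aaaa(fqdn: str) -> bool:
    try:
        socket.getaddrinfo(fqdn, None, socket.AF_INET6)
        return True
    except socket.gaierror:
        return False

hosts = ['ganeti1009.eqiad.wmnet', 'ganeti2009.codfw.wmnet']  # placeholder list
print('\n'.join(host for host in hosts if not has_aaaa(host)))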
Aug 1 2024
Final update before vacations :)
Status update:
Some additions to Spicerack were made in this patch. I would actually be tempted to rename Instance.run_vertical_query() to Instance.run_query() and replace it, as it's the only way to sensibly parse some output (with the caveats explained in the docstring).
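For context, "vertical" refers to MySQL's \G output format; parsing it boils down to roughly this (a simplified toy sketch, not the actual Spicerack implementation and ignoring the caveats mentioned in the docstring):
# Toy parser for MySQL "\G" output: each row starts with a
# "*************************** N. row ***************************" header,
# followed by "key: value" lines.
import re

def parse_vertical(output: str) -> list[dict[str, str]]:
    rows: list[dict[str, str]] = []
    for line in output.splitlines():
        if re.match(r'^\*+ \d+\. row \*+$', line):
            rows.append({})
        elif rows and ':' in line:
            key, value = line.split(':', 1)
            rows[-1][key.strip()] = value.strip()
    return rows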
With the puppet patch that saves the replication credentials on the cumin hosts in a way accessible to the cookbooks, it will be possible to start testing (with test-cookbook, passing the DRY-RUN mode option) the two new cookbooks introduced here as a good draft, but still WIP, for the pre and post steps required for the datacenter switch.
Jul 31 2024
@Papaul We've debugged this live during the tooling and automation office hours and we think it's a race condition due to the fact that late_command.sh still defaults to puppet 5 if it's unable to read the version. Luca is changing that.
@Papaul what's the timeline for deciding not to reimage buster hosts anymore, so we can just have puppet 7 and solve all the problems?
Jul 30 2024
I've just cleaned the host from both the puppetmaster and the puppetserver to be sure it was not there and ran:
sudo cookbook sre.hosts.reimage -t T369654 --os bookworm -p 7 --force-dhcp-tftp --new db2227
The reimage cookbook did set it to puppet 7 each time:
$ grep "node=db2227.codfw.wmnet, command='printf" reimage-extended.log 2024-07-29 12:27:45,345 jhancock 2656784 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version' 2024-07-29 17:59:29,599 pt1979 3022287 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version' 2024-07-29 19:35:24,239 pt1979 3092430 [DEBUG clustershell.py:590 in ev_pickup] node=db2227.codfw.wmnet, command='printf 7 > /tmp/puppet_version'
Was it cleaned from puppet 5 between each reimage? After the first reimage, if the host is in PuppetDB, the reimage cookbook will use the current puppet version of the host. That could explain it. Let me know if you want me to do something with it.
That's the same error it had during the last reboot+reimage cycle, see T370304#10016765. That's supposedly the NIC, which surely needs investigating, but at first sight it doesn't look like the culprit to me (though it could be, of course).
Jul 29 2024
I've noticed from the logs that we haven't been setting any additional values since May 31st; from SAL I saw that we didn't have any reimages between the 31st and when we merged the above patch.
@Papaul can you give us some provision cookbook runs that didn't set it, so we can check the logs please? Hostname and date/time if it was run multiple times.
FYI 86% of MaxMind's French prefixes have the subdivision information. How accurate they are, both in terms of geolocation and, as a consequence, of routing for this specific case, remains to be seen.
Yep, I thought it was already :)
My 2 cents would be to keep both for now, as in Icinga you can also see which Icinga checks are configured for that device.
If deemed useful, these are the subdivisions in the MaxMind DB for France (NDA-only) and the related docs for gdnsd are here (assuming they're up to date ;)):
https://github.com/gdnsd/gdnsd/wiki/GdnsdPluginGeoip#geoip2-location-data-hierarchy
Jul 26 2024
During the last occurrence I gathered a full processlist in cumin1002:/home/volans/2024-07-26_01.18_s4_master.processlist for those interested.
Some general timeline from the host PoV.
Jul 25 2024
As a side note, there were 2 other alerts that were failing, and for those the solution was just to add the missing grant:
GRANT SELECT ON `heartbeat`.`heartbeat` TO `prometheus`@`localhost`
EDIT: sorry I noticed the task was closed only afterwards. Created T371049 as a follow up.
Jul 24 2024
Let's just make sure that requestctl works fine on the puppetmasters; it's installed, but the pyparsing version on bookworm is technically outside of the range allowed in setup.py.
@Ladsgroup that looks very useful, I didn't know about it. Is it mentioned anywhere? I can't find it on Wikitech.
Having glanced at it, I also have some questions and potential concerns about the data it uses. Where would be a good place to raise them?
Without much previous experience from past migrations, I think we could tackle it per DB section (aka shard), moving all the easily depoolable hosts in a section first and leaving the harder-to-depool ones (like masters) for last.
Let me know if I should extract a distribution of hosts per section and row/rack.
@akosiaris with the Wikimedia account we have, we do have access to the Add Verified Bot form and we could potentially fill it in. The verification methods in that case are:
- Reverse DNS
- IP List
- ASN
@joanna_borun I think this should probably be discussed in the next SRE I/F meeting for approval.