T250206 Deploy a proof of concept prometheus server in cloudvps to replace shinken

Subject	Repo	Branch	Lines +/-
cloudvps: metricsinfra alert on puppet agent disabled state	operations/puppet	production	+8 -1
cloudvps: enable monitoring for projects using shinken	operations/puppet	production	+9 -1
cloudvps: enable project monitoring for the metricsinfra project	operations/puppet	production	+1 -0
cloudvps: metricsinfra add project label and default alert rules	operations/puppet	production	+48 -17
cloudvps: metricsinfra add prometheus alert manager and email notifications	operations/puppet	production	+187 -17
cloudvps: Add new role for metricsinfra	operations/puppet	production	+8 -0
cloudvps: update prometheus rule annotations	operations/puppet	production	+3 -3
cloudvps: add prometheus alert rules for project instances	operations/puppet	production	+50 -7
cloudvps: update project service discovery prometheus config	operations/puppet	production	+15 -38
cloudvps: Add metricsinfra prometheus server	operations/puppet	production	+91 -0

Status	Assigned	Task
Resolved	fgiunchedi	T205862 Expand modern metrics infrastructure coverage (2018-19 Q2 goal)
Resolved	colewhite	T183454 Deprovision Diamond collectors no longer in use
Resolved	MoritzMuehlenhoff	T210993 Deprecate Diamond collectors in Cloud VPS
Resolved	taavi	T336774 Current status of cloudmetrics and its components
Resolved	taavi	T326266 Remove the WMCS statsd/Graphite service
Open	dcaro	T313444 Streamline WMCS Alerting and Paging
Resolved	taavi	T317032 Remove Diamond?
Resolved	taavi	T264920 Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus
Open	None	T194333 [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants
Resolved	Andrew	T236547 "shinken" Cloud VPS project jessie deprecation
Resolved	taavi	T266050 Build Prometheus service for use by all Cloud VPS projects and their instances
Resolved	• JHedden	T250206 Deploy a proof of concept prometheus server in cloudvps to replace shinken
Resolved	• JHedden	T250210 Request creation of metricsinfra VPS project
Resolved	• Bstorm	T252260 Try out removing the nfs client probes from the node exporter on VMs
Resolved	bd808	T256134 wikistats.analytics.eqiad.wmflabs blocking Prometheus scraping from metricsinfra
Resolved	Andrew	T128615 Get rid of Toolforge home page check from shinken
Declined	None	T128716 Make icinga-wm report Tools homepage check at #wikimedia-cloud, too

In T250206#6056467, @JHedden wrote:
The new VM is created with a dedicated network port. Having a dedicated port reserves an IP address, which makes future architecture changes or rebuilds more flexible.
$ OS_PROJECT_ID=metricsinfra openstack port create --network 7425e328-560c-4f00-8e99-706f3fb90bb4 --description "reserved address for monitoring" prometheus01.metricsinfra.eqiad.wikimedia.cloud

$ OS_PROJECT_ID=metricsinfra openstack server create --image debian-10.0-buster --flavor bigdisk2 --nic port-id=2e67a486-e840-4800-b974-d9220f5e107a prometheus01

Doesn't it also make it impossible to fully replace the instance without ops intervention? I'd prefer we had a special DNS name that stayed the same.

In T250206#6056475, @Krenair wrote:

Doesn't it also make it impossible to fully replace the instance without ops intervention? I'd prefer we had a special DNS name that stayed the same.

It does, we'd need to detach and reattach the port to the new instance. But I think either way we'll still require ops intervention. If we fully replaced an instance we'd have to update each project's security groups with the new IP address.

Maybe a new port with a reserved address and DNS name in addition to the local VM's network interface would be better?

In T250206#6056556, @JHedden wrote:

In T250206#6056475, @Krenair wrote:

Doesn't it also make it impossible to fully replace the instance without ops intervention? I'd prefer we had a special DNS name that stayed the same.

It does, we'd need to detach and reattach the port to the new instance. But I think either way we'll still require ops intervention. If we fully replaced an instance we'd have to update each project's security groups with the new IP address.

Ugh, yeah, you're right re: IP and security groups I think. This is probably the right solution, and it doesn't necessarily make sense to try to fix the problem of special IPs like this for ordinary tenants as this is really specific to cloudinfra-type systems.

In T250206#6056556, @JHedden wrote:

Maybe a new port with a reserved address and DNS name in addition to the local VM's network interface would be better?

I'm not sure I understand the distinction beyond the addition of a particular DNS name?

In T250206#6056586, @Krenair wrote:

In T250206#6056556, @JHedden wrote:

Maybe a new port with a reserved address and DNS name in addition to the local VM's network interface would be better?

I'm not sure I understand the distinction beyond the addition of a particular DNS name?

I was thinking of it more like a dedicated service address and name pair. Something that we could detach from the underlying host without losing full network connectivity. We're probably getting in the weeds here though :)

Yeah, I think what you've got so far sounds like the right thing for now at least.
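
The detach-and-reattach step discussed above could look roughly like the following sketch. It only builds the command lines and does not run anything. `openstack server remove port` and `openstack server add port` are standard OpenStack client subcommands; the helper name and the replacement instance `prometheus02` are hypothetical and not from this task.

```python
import shlex


def move_port_commands(project, port, old_server, new_server):
    """Return the CLI invocations that move a reserved port between servers."""
    env = f"OS_PROJECT_ID={shlex.quote(project)}"
    return [
        # Detach the reserved port from the instance being replaced.
        f"{env} openstack server remove port {shlex.quote(old_server)} {shlex.quote(port)}",
        # Attach the same port (and therefore the same IP) to the new instance.
        f"{env} openstack server add port {shlex.quote(new_server)} {shlex.quote(port)}",
    ]


if __name__ == "__main__":
    commands = move_port_commands(
        "metricsinfra",
        "2e67a486-e840-4800-b974-d9220f5e107a",  # port ID from the server create command in this thread
        "prometheus01",
        "prometheus02",  # hypothetical replacement instance
    )
    for command in commands:
        print(command)
```

Because the port keeps its address, the security group rules in other projects would not need to change after such a move.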

Change 588803 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: Add metricsinfra prometheus server

https://gerrit.wikimedia.org/r/588803

gerritbot added a project: Patch-For-Review.Apr 14 2020, 10:42 PM

Krenair updated the task description.Apr 15 2020, 2:05 AM

For the record, we have a grafana dashboard per project https://grafana-labs.wikimedia.org/d/000000059/cloud-vps-project-board?orgId=1&var-project=project-proxy&var-server=All&from=now-2d&to=now which I think is using data from graphite instead of prometheus.

Ah, is the plan to use the existing grafana-labs in prod?

In T250206#6059740, @Krenair wrote:

Ah, is the plan to use the existing grafana-labs in prod?

Yeah, we can add this Prometheus server as a datasource to Grafana running on cloudmetrics. (Similar setup to the current tools Prometheus server)

Change 588803 merged by Jhedden:
[operations/puppet@production] cloudvps: Add metricsinfra prometheus server

https://gerrit.wikimedia.org/r/588803

Mentioned in SAL (#wikimedia-cloud) [2020-04-15T20:09:08Z] <jeh> update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206

Mentioned in SAL (#wikimedia-cloud) [2020-04-15T20:10:10Z] <jeh> update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 T250206

Maintenance_bot removed a project: Patch-For-Review.Apr 15 2020, 8:10 PM

Command used to update security groups for tools and cloudinfra

OS_PROJECT_ID=tools openstack security group rule create default --protocol tcp --dst-port 9100:9100 --remote-ip 172.16.0.229/32
OS_PROJECT_ID=cloudinfra openstack security group rule create default --protocol tcp --dst-port 9100:9100 --remote-ip 172.16.0.229/32
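
If more projects need the same rule, the two commands above can be generated per project. This is a minimal dry-run sketch that only prints the command lines; the flags and the address are copied from the commands above, and the helper name is an assumption.

```python
import shlex

NODE_EXPORTER_PORT = 9100  # node-exporter port used in the commands above


def allow_scrape_commands(projects, prometheus_ip):
    """Build one 'security group rule create' command per project (dry run)."""
    commands = []
    for project in projects:
        commands.append(
            f"OS_PROJECT_ID={shlex.quote(project)} "
            "openstack security group rule create default "
            f"--protocol tcp --dst-port {NODE_EXPORTER_PORT}:{NODE_EXPORTER_PORT} "
            f"--remote-ip {prometheus_ip}/32"
        )
    return commands


if __name__ == "__main__":
    for command in allow_scrape_commands(["tools", "cloudinfra"], "172.16.0.229"):
        print(command)
```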

• JHedden updated the task description.Apr 15 2020, 8:14 PM

The new prometheus server is up and scraping node-exporter metrics from all the VMs in tools and cloudinfra

jeh@prometheus01:~$ curl -s http://$(hostname -f)/cloud/api/v1/targets | jq '.data.activeTargets | .[] | .labels.instance+" "+.health' | grep -c up
171
jeh@prometheus01:~$ curl -s http://$(hostname -f)/cloud/api/v1/targets | jq '.data.activeTargets | .[] | .labels.instance+" "+.health' | grep -v up
"tools-sgeexec-0912 down"

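The same health summary can be computed from the targets API response directly. The sketch below assumes the response shape used by the jq filter above (`data.activeTargets[].labels.instance` and `.health`); the sample payload is made up for illustration, and `tools-puppetmaster-02` is a hypothetical instance name.

```python
from collections import Counter

# Hypothetical, trimmed-down response from /cloud/api/v1/targets.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"instance": "tools-puppetmaster-02"}, "health": "up"},
            {"labels": {"instance": "tools-sgeexec-0912"}, "health": "down"},
        ]
    },
}


def summarize_targets(payload):
    """Count targets per health state and list the instances that are not up."""
    targets = payload["data"]["activeTargets"]
    counts = Counter(target["health"] for target in targets)
    not_up = sorted(
        target["labels"]["instance"] for target in targets if target["health"] != "up"
    )
    return counts, not_up


if __name__ == "__main__":
    counts, not_up = summarize_targets(SAMPLE)
    print(dict(counts))
    print(not_up)
```

Comparing the `health` field directly avoids a pitfall of `grep up`: an instance whose name contains "up" (such as a puppetmaster host) matches the pattern even when it is down.
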
• JHedden updated the task description.Apr 15 2020, 9:22 PM

• JHedden updated the task description.

Krenair updated the task description.Apr 16 2020, 12:33 PM

Krenair updated the task description.

Change 589398 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: update project service discovery prometheus config

https://gerrit.wikimedia.org/r/589398

gerritbot added a project: Patch-For-Review.Apr 16 2020, 6:44 PM

Change 589398 merged by Jhedden:
[operations/puppet@production] cloudvps: update project service discovery prometheus config

https://gerrit.wikimedia.org/r/589398

Maintenance_bot removed a project: Patch-For-Review.Apr 16 2020, 10:10 PM

Change 589716 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: add prometheus alert rules for project instances

https://gerrit.wikimedia.org/r/589716

gerritbot added a project: Patch-For-Review.Apr 17 2020, 9:53 PM

• JHedden updated the task description.Apr 17 2020, 9:57 PM

• JHedden updated the task description.Apr 17 2020, 10:11 PM

Change 589716 merged by Jhedden:
[operations/puppet@production] cloudvps: add prometheus alert rules for project instances

https://gerrit.wikimedia.org/r/589716

Maintenance_bot removed a project: Patch-For-Review.Apr 18 2020, 3:10 PM

Krenair updated the task description.Apr 18 2020, 3:16 PM

Change 589864 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: update prometheus rule annotations

https://gerrit.wikimedia.org/r/589864

gerritbot added a project: Patch-For-Review.Apr 18 2020, 3:44 PM

Change 589864 merged by Jhedden:
[operations/puppet@production] cloudvps: update prometheus rule annotations

https://gerrit.wikimedia.org/r/589864

Maintenance_bot removed a project: Patch-For-Review.Apr 18 2020, 4:10 PM

Krenair moved this task from Backlog to Subtasks on the Cloud-VPS (Debian Jessie Deprecation) board.Apr 18 2020, 8:40 PM

Change 591053 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: Add new role for metricsinfra

https://gerrit.wikimedia.org/r/591053

gerritbot added a project: Patch-For-Review.Apr 20 2020, 2:21 PM

Change 591053 merged by Jhedden:
[operations/puppet@production] cloudvps: Add new role for metricsinfra

https://gerrit.wikimedia.org/r/591053

Maintenance_bot removed a project: Patch-For-Review.Apr 20 2020, 3:10 PM

Change 591202 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra add prometheus alert manager and email notifications

https://gerrit.wikimedia.org/r/591202

gerritbot added a project: Patch-For-Review.Apr 20 2020, 10:24 PM

Change 591202 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra add prometheus alert manager and email notifications

https://gerrit.wikimedia.org/r/591202

Maintenance_bot removed a project: Patch-For-Review.Apr 27 2020, 6:10 PM

Email based alert notifications are now enabled for the tools and cloudinfra projects.

Note that the OpenStack service discovery has support for monitoring all projects, but unfortunately our current packaged version of Prometheus drops the instance's project name, which breaks multi-tenancy. Once our version of Prometheus includes this upstream patch [0], we have the option to enable monitoring for every project.

[0] https://github.com/prometheus/prometheus/commit/9c5370fdfe7cf51fd5d58151bb745ac10f6c2dac

• JHedden updated the task description.Apr 27 2020, 7:11 PM

It looks like things will be noisy if we add the disk space alert rules right now.
https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr=100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1

Going to hold off adding that and see what I can clean up first.
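
For readability, the query embedded in the URL above can be decoded, and its arithmetic reproduced, with a short sketch like this one. The helper names are assumptions; the expression and the threshold of 80 come from the link itself.

```python
from urllib.parse import parse_qs, urlsplit

GRAPH_URL = (
    "https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr="
    "100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D"
    "%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1"
)


def graph_expression(url):
    """Extract and percent-decode the PromQL expression from a graph URL."""
    return parse_qs(urlsplit(url).query)["g0.expr"][0]


def percent_used(avail_bytes, size_bytes):
    """Mirror the query: 100 - (avail / size * 100)."""
    return 100 - (avail_bytes / size_bytes * 100)


def would_alert(avail_bytes, size_bytes, threshold=80):
    """True when the filesystem is at or above the threshold percentage."""
    return percent_used(avail_bytes, size_bytes) >= threshold


if __name__ == "__main__":
    print(graph_expression(GRAPH_URL))
    print(would_alert(avail_bytes=10, size_bytes=100))
```

Note that this expression divides available space by total size, so it can report a slightly different percentage than the `Use%` column of `df`, which excludes reserved blocks from its calculation.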

I poked at tools-sgeexec-0901 just out of curiosity, and the culprit was the apt cache. After running sudo apt clean:

/dev/vda3                                                           19G   12G  6.2G  66% /

That's down from 80%. That's pretty common in Toolforge (seen it before), but I don't have a strong opinion on how to fix it. Just noting that the condition was out there in case that is useful to you.

Added a Grafana dashboard for detailed instance metrics using the metricsinfra prometheus server: https://grafana-labs.wikimedia.org/d/000000590/metricsinfra-cloudvps-instance-details

Change 593042 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra add project label and default alert rules

https://gerrit.wikimedia.org/r/593042

gerritbot added a project: Patch-For-Review.Apr 28 2020, 7:33 PM

Change 593042 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra add project label and default alert rules

https://gerrit.wikimedia.org/r/593042

Change 593048 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: enable project monitoring for the metricsinfra project

https://gerrit.wikimedia.org/r/593048

Change 593048 merged by Jhedden:
[operations/puppet@production] cloudvps: enable project monitoring for the metricsinfra project

https://gerrit.wikimedia.org/r/593048

Change 593054 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: enable monitoring for projects using shinken

https://gerrit.wikimedia.org/r/593054

Change 593054 merged by Jhedden:
[operations/puppet@production] cloudvps: enable monitoring for projects using shinken

https://gerrit.wikimedia.org/r/593054

Change 593342 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] cloudvps: metricsinfra alert on puppet agent disabled state

https://gerrit.wikimedia.org/r/593342

Change 593342 merged by Jhedden:
[operations/puppet@production] cloudvps: metricsinfra alert on puppet agent disabled state

https://gerrit.wikimedia.org/r/593342

Added some documentation at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS

• Bstorm closed subtask T252260: Try out removing the nfs client probes from the node exporter on VMs as Resolved.May 13 2020, 11:47 PM

bd808 removed • JHedden as the assignee of this task.May 31 2020, 8:20 PM

bd808 edited projects, added cloud-services-team (Kanban); removed Patch-For-Review.

bd808 merged a task: T107297: shinken is too "volatile" and imprecise to be of use.Jun 1 2020, 5:00 AM

bd808 added subscribers: scfc, tom29739, Krinkle and 2 others.

• Bstorm triaged this task as High priority.Jun 2 2020, 4:16 PM

• Bstorm added a subtask: T186552: Create alerts for bastion hosts - Usage and latency.Jun 22 2020, 6:06 PM

bd808 added a subtask: T256134: wikistats.analytics.eqiad.wmflabs blocking Prometheus scraping from metricsinfra.Jun 23 2020, 2:54 PM

bd808 closed subtask T256134: wikistats.analytics.eqiad.wmflabs blocking Prometheus scraping from metricsinfra as Resolved.Jun 26 2020, 9:15 PM

• Bstorm added a subtask: T128615: Get rid of Toolforge home page check from shinken.Jul 7 2020, 4:05 PM

• Bstorm added a subtask: T128715: Add all Cloud VPS project administrators to the Prometheus notification group for each project.

taavi subscribed.Jul 7 2020, 4:10 PM

Since implementing apt cache autocleaning for T127374: Avoid indefinite growing of apt caches and old kernel images, I think we probably should enable the disk size monitor.

Rerunning https://prometheus.wmflabs.org/cloud/graph?g0.range_input=1h&g0.expr=100%20-%20(node_filesystem_avail_bytes%7Bfstype%3D%22ext4%22%7D%2Fnode_filesystem_size_bytes%20*%20100)%20%3E%3D%2080&g0.tab=1 shows that the only things in tools that would alert are the docker registries (which we should be concerned about and need to clean up). I'm not sure we want to alert on deployment-prep's disks. They are quite full, it seems.

For historical purposes, the old shinken config files can be found here:

https://download.wmflabs.org/etcshinken.tgz
https://download.wmflabs.org/varshinken.tgz

Andrew closed subtask T128615: Get rid of Toolforge home page check from shinken as Resolved.Aug 17 2020, 3:50 PM

MSantos subscribed.Aug 20 2020, 11:55 AM

• nskaggs subscribed.Sep 8 2020, 4:33 PM

Things that I think we could do next here:

set up a project puppetmaster so that we can have secret storage for things like volunteers' email addresses
refactor the puppet module so that the alertmanager config merges public and private hiera hashes, so that email addresses, etc. can be kept non-public
set up an IRC relay on the metricsinfra node
change reverse proxy restrictions so that the vpsalertmanager.toolforge.org deployment can silence alerts. We had an email discussion and decided that abuse of random silences
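
The public/private merge mentioned in the list above might behave like the recursive merge sketched here. This is only an illustration in Python, not the Puppet or hiera implementation; the key names and the example.org address are hypothetical.

```python
from copy import deepcopy


def deep_merge(public, private):
    """Return a new dict where private values are layered over public ones."""
    merged = deepcopy(public)
    for key, value in private.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are hashes: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the private side replace the public value.
            merged[key] = deepcopy(value)
    return merged


if __name__ == "__main__":
    # Hypothetical hashes: the public one lives in the open repository,
    # the private one on the project puppetmaster.
    public = {"tools": {"send_resolved": True, "email_to": []}}
    private = {"tools": {"email_to": ["admin@example.org"]}}
    print(deep_merge(public, private))
```

Keeping the public hash complete apart from the addresses means the non-secret alert routing stays reviewable in the open repository.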

Krinkle unsubscribed.Sep 9 2020, 2:43 PM

Dzahn mentioned this in T220531: Get the clouddb-services systems into Shinken and possibly icinga.Oct 5 2020, 10:41 PM

• Bstorm merged a task: T220531: Get the clouddb-services systems into Shinken and possibly icinga.Oct 5 2020, 11:45 PM

• Bstorm added subscribers: Dzahn, Framawiki.

bd808 added a parent task: T264920: Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus.Oct 7 2020, 6:22 PM

bd808 mentioned this in T266050: Build Prometheus service for use by all Cloud VPS projects and their instances.Oct 20 2020, 5:02 PM

bd808 removed a parent task: T264920: Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus.

bd808 added a parent task: T266050: Build Prometheus service for use by all Cloud VPS projects and their instances.

bd808 removed a subtask: T186552: Create alerts for bastion hosts - Usage and latency.Oct 20 2020, 5:07 PM

bd808 removed a subtask: T128715: Add all Cloud VPS project administrators to the Prometheus notification group for each project.