Try out removing the nfs client probes from the node exporter on VMs
Closed, Resolved (Public)

Assigned To

Authored By

	• Bstorm
	May 8 2020, 11:29 PM

Description

The up-down monitor on our new metricsinfra Prometheus flaps often because the node exporter gets wedged. Repeated investigations suggest this is caused by NFS hangs on the client side.

Based on https://github.com/prometheus/node_exporter/issues/578 and, later, https://github.com/prometheus/node_exporter/pull/1166, it seems the most upstream aims to do about this is make the exporter return 503s instead of blowing up the host during such events. That is a reasonable thing to do, but it also makes the host report as down and stop reporting all of our metrics.

It seems like a good idea to monitor NFS at the host level instead of on the client, so that the somewhat frequent NFS client hangs do not wedge our entire monitoring setup.

Details

	Subject	Repo	Branch	Lines +/-
	cloud-node-exporter: ignore NFS on the cloud client side	operations/puppet	production	+3 -0


Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T205862 Expand modern metrics infrastructure coverage (2018-19 Q2 goal)
Resolved	colewhite	T183454 Deprovision Diamond collectors no longer in use
Resolved	MoritzMuehlenhoff	T210993 Deprecate Diamond collectors in Cloud VPS
Resolved	taavi	T336774 Current status of cloudmetrics and its components
Resolved	taavi	T326266 Remove the WMCS statsd/Graphite service
Open	dcaro	T313444 Streamline WMCS Alerting and Paging
Resolved	taavi	T317032 Remove Diamond?
Resolved	taavi	T264920 Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus
Open	None	T194333 [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants
Resolved	Andrew	T236547 "shinken" Cloud VPS project jessie deprecation
Resolved	taavi	T266050 Build Prometheus service for use by all Cloud VPS projects and their instances
Resolved	• JHedden	T250206 Deploy a proof of concept prometheus server in cloudvps to replace shinken
Resolved	• Bstorm	T252260 Try out removing the nfs client probes from the node exporter on VMs

Event Timeline

• Bstorm renamed this task from Try out removing the nfs client probes from VM the node exporter on VMs to Try out removing the nfs client probes from the node exporter on VMs. May 8 2020, 11:29 PM

• Bstorm created this task.

Mentioned in SAL (#wikimedia-cloud) [2020-05-09T00:28:50Z] <bstorm_> added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera T252260
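
For illustration, here is a minimal sketch of what adding `nfs.*` to the ignored filesystem types does, assuming the list is joined into one anchored regex and matched against each mount's filesystem type (roughly how node_exporter's ignored-fs-types option behaves). The base list and the helper names are hypothetical, not the actual Puppet or exporter defaults.

```python
import re

# Hypothetical base list; the real defaults vary by node_exporter version
# and by the Puppet profile in use.
BASE_IGNORED_FS_TYPES = ["autofs", "proc", "sysfs", "tmpfs"]


def build_ignored_pattern(extra_types):
    """Join filesystem-type patterns into one anchored alternation."""
    return "^(" + "|".join(BASE_IGNORED_FS_TYPES + list(extra_types)) + ")$"


def is_ignored(fs_type, pattern):
    """Return True if the filesystem type would be skipped by the collector."""
    return re.match(pattern, fs_type) is not None


# Mirrors the SAL entry: add nfs.* so NFS client mounts are never probed.
pattern = build_ignored_pattern(["nfs.*"])

for fs_type in ("nfs", "nfs4", "ext4", "tmpfs"):
    print(fs_type, "ignored" if is_ignored(fs_type, pattern) else "probed")
```

Because the pattern is `nfs.*`, it covers `nfs` and `nfs4` alike, and it would also match any other type that starts with `nfs`.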

• Bstorm claimed this task. May 9 2020, 12:29 AM

• Bstorm triaged this task as Medium priority.

• Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

This seems to have stopped the bleeding for the flapping "instance down" notifications.

Change 596063 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud-node-exporter: ignore NFS on the cloud client side

https://gerrit.wikimedia.org/r/596063

gerritbot added a project: Patch-For-Review. May 12 2020, 9:59 PM

Change 596063 merged by Bstorm:
[operations/puppet@production] cloud-node-exporter: ignore NFS on the cloud client side

https://gerrit.wikimedia.org/r/596063

I think it is safe to say this fixed the flapping instance-down notification.
