Build Prometheus service for use by all Cloud VPS projects and their instances
Closed, Resolved (Public)

Assigned To

Authored By

	bd808
	Oct 20 2020, 5:02 PM

Description

In T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken, we built a proof of concept (POC) for replacing Shinken with Prometheus to monitor a subset of projects. We later realized that T210993: Deprecate Diamond collectors in Cloud VPS means we also need to migrate basic dashboards to Prometheus, as tracked in T264920: Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus.

We now need to redesign and build out the POC so that it scales to collecting at least basic instance health information for all instances in all projects, with metrics retained for a reasonable amount of time (at least 3 months, ideally 1 year or more).
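As a rough aid for the retention question, here is a back-of-the-envelope sizing sketch. It uses the capacity formula from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample). The instance count comes from this task; the series count per instance, the scrape interval, and the 2 bytes per sample figure are assumptions picked for illustration, not measurements from metricsinfra.

```python
# Back-of-the-envelope Prometheus disk sizing.
# Formula (Prometheus storage docs):
#   needed_disk_space = retention_seconds * ingested_samples_per_second * bytes_per_sample

GIB = 1024 ** 3


def estimate_disk_bytes(instances, series_per_instance, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Return the estimated on-disk size in bytes for the given retention."""
    # Every series yields one sample per scrape interval.
    samples_per_second = instances * series_per_instance / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # 700 instances is from this task; 500 series per instance and a
    # 60 second scrape interval are ASSUMED values for illustration.
    for label, days in (("3 months", 90), ("1 year", 365)):
        size = estimate_disk_bytes(700, 500, 60, days)
        print(f"{label}: ~{size / GIB:.0f} GiB")
```

The estimate scales linearly with every input, so halving the scrape frequency or trimming the exported series per instance halves the required disk.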

Features needed beyond POC:

Storage for metrics from 700+ instances
Redundancy
Configure the alert rules to monitor disk capacity
Project-local puppetmaster for secrets storage (e.g. volunteers' email addresses)
~~Refactor puppet module to merge public and private hiera config~~
IRC relay for per-project IRC alerting
Karma alert management dashboard with the ability to silence alerts (T250206#6443507, T285055: Allow project members/admins to ack/silence alerts of that project)
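
For the disk capacity item above, an alert rule could look roughly like the sketch below. It builds a Prometheus rule group as a Python dict and prints it as JSON. The metric names are the standard node exporter filesystem metrics; the group name, alert name, 10% threshold, `for` duration, and excluded filesystem types are hypothetical choices, not the rules deployed in metricsinfra. JSON should load as a YAML rule file, but that is worth confirming with `promtool check rules` before relying on it.

```python
import json


def disk_capacity_rule(threshold=0.10, for_duration="30m"):
    """Build a rule group that fires when free space drops below `threshold`."""
    # Ratio of available to total bytes per filesystem, ignoring
    # pseudo filesystems (the exclusion list is an assumption).
    expr = (
        'node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}'
        ' / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}'
        f" < {threshold}"
    )
    return {
        "groups": [{
            "name": "instance-health",  # hypothetical group name
            "rules": [{
                "alert": "DiskSpaceLow",  # hypothetical alert name
                "expr": expr,
                "for": for_duration,
                "labels": {"severity": "warning"},
                "annotations": {
                    "summary": (
                        "Only {{ $value | humanizePercentage }} free on "
                        "{{ $labels.instance }} ({{ $labels.mountpoint }})"
                    ),
                },
            }],
        }]
    }


if __name__ == "__main__":
    print(json.dumps(disk_capacity_rule(), indent=2))
```

Keeping the threshold and duration as parameters would make it straightforward to generate per-project variants later.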

Related Objects

Status	Assigned	Task
Resolved	fgiunchedi	T205862 Expand modern metrics infrastructure coverage (2018-19 Q2 goal)
Resolved	colewhite	T183454 Deprovision Diamond collectors no longer in use
Resolved	MoritzMuehlenhoff	T210993 Deprecate Diamond collectors in Cloud VPS
Resolved	taavi	T336774 Current status of cloudmetrics and its components
Resolved	taavi	T326266 Remove the WMCS statsd/Graphite service
Open	dcaro	T313444 Streamline WMCS Alerting and Paging
Resolved	taavi	T317032 Remove Diamond?
Resolved	taavi	T264920 Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus
Open	None	T194333 [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants
Resolved	taavi	T266050 Build Prometheus service for use by all Cloud VPS projects and their instances
Resolved	• JHedden	T250206 Deploy a proof of concept prometheus server in cloudvps to replace shinken
Resolved	• JHedden	T250210 Request creation of metricsinfra VPS project
Resolved	• Bstorm	T252260 Try out removing the nfs client probes from the node exporter on VMs
Resolved	bd808	T256134 wikistats.analytics.eqiad.wmflabs blocking Prometheus scraping from metricsinfra
Resolved	Andrew	T128615 Get rid of Toolforge home page check from shinken
Declined	None	T128716 Make icinga-wm report Tools homepage check at #wikimedia-cloud, too
Resolved	• nskaggs	T284973 Request increased quota for metricsinfra Cloud VPS project
Resolved	taavi	T285055 Allow project members/admins to ack/silence alerts of that project
Resolved	jbond	T286716 Create solution for developer account authentication for services hosted in Cloud VPS
Resolved	taavi	T286301 Scale up metricsinfra prometheus beyond one Prometheus instance (Thanos/Cortex/similar)
Resolved	taavi	T286335 Split metricsinfra alertmanager to separate hosts from prometheus
Resolved	taavi	T287148 Set up IRC relay for metricsinfra alerts
Resolved	taavi	T310799 Upgrade metricsinfra prometheus to bullseye
Resolved	dcaro	T310802 Request increased quota for metricsinfra Cloud VPS project
Resolved	Andrew	T288108 Figure out how to deal with security groups when rolling out metricsinfra scraping

Event Timeline

bd808 created this task. Oct 20 2020, 5:02 PM

Restricted Application added a subscriber: Aklapper. Oct 20 2020, 5:02 PM

bd808 added a parent task: T264920: Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus. Oct 20 2020, 5:03 PM

bd808 added a subtask: T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken.

bd808 added a parent task: T194333: [Epic] Provide logging/metrics/monitoring SaaS for Cloud VPS tenants. Oct 20 2020, 5:05 PM

bd808 added a subtask: T186552: Create alerts for bastion hosts - Usage and latency.

bd808 added a subtask: T128715: Add all Cloud VPS project administrators to the Prometheus notification group for each project. Oct 20 2020, 5:08 PM

bd808 closed subtask T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken as Resolved. Oct 20 2020, 5:15 PM

bd808 mentioned this in T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken.

The POC project was a bit ahead of work by the Observability folks on similar alerting for the production realm. This build out should include examining the profiles and modules that have been built for prod now to see how much of the POC can be replaced with shared setup.

bd808 triaged this task as High priority. Oct 20 2020, 5:23 PM

• Bstorm mentioned this in T186552: Create alerts for bastion hosts - Usage and latency. Oct 21 2020, 4:22 PM

• nskaggs subscribed. Oct 27 2020, 7:50 PM

bd808 mentioned this in T264920: Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus. Dec 3 2020, 8:54 PM

taavi subscribed. Dec 13 2020, 6:14 PM

taavi mentioned this in T95922: Adopt service status dashboard. Feb 8 2021, 4:59 PM

• Bstorm mentioned this in T284860: Prometheus alerting support on Toolforge. Jun 14 2021, 4:22 PM

taavi added a subtask: T284973: Request increased quota for metricsinfra Cloud VPS project. Jun 15 2021, 5:32 AM

• nskaggs closed subtask T284973: Request increased quota for metricsinfra Cloud VPS project as Resolved. Jun 15 2021, 4:12 PM

taavi updated the task description. Jun 16 2021, 1:05 PM

taavi updated the task description. Jun 16 2021, 2:42 PM

taavi mentioned this in T286299: Create initial scaffolding for Prometheus configuration automation. Jul 7 2021, 5:53 PM

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T10:35:46Z] <majavah> undeploy old ingress T266050

In T266050#7211735, @Stashbot wrote:

Mentioned in SAL (#wikimedia-cloud) [2021-07-14T10:35:46Z] <majavah> undeploy old ingress T266050

Wrong task, was meant to be T264221.

taavi closed subtask T286335: Split metricsinfra alertmanager to separate hosts from prometheus as Resolved. Jul 30 2021, 2:24 PM

taavi closed subtask T287148: Set up IRC relay for metricsinfra alerts as Resolved. Aug 4 2021, 3:00 PM

taavi removed a subtask: T128715: Add all Cloud VPS project administrators to the Prometheus notification group for each project. Aug 7 2021, 11:57 AM

• nnikkhoui subscribed. Sep 14 2021, 9:23 PM

taavi updated the task description. Oct 22 2021, 1:39 PM

dcaro subscribed. Mar 3 2022, 9:00 AM

Hydriz subscribed. Apr 28 2022, 7:19 AM

taavi mentioned this in T307655: Replacement needed for obsolete Diamond/Graphite monitoring of integration instances. May 5 2022, 7:51 AM

taavi mentioned this in T308555: Provide access to Thanos. May 18 2022, 7:36 AM

taavi updated the task description. Jun 16 2022, 4:06 PM

taavi closed subtask T285055: Allow project members/admins to ack/silence alerts of that project as Resolved. Jun 17 2022, 7:18 PM

taavi closed subtask T286301: Scale up metricsinfra prometheus beyond one Prometheus instance (Thanos/Cortex/similar) as Resolved. Jul 6 2022, 9:32 AM

taavi updated the task description. Jul 6 2022, 11:04 AM

taavi closed subtask T310799: Upgrade metricsinfra prometheus to bullseye as Resolved. Oct 28 2022, 3:07 PM

taavi closed subtask T288108: Figure out how to deal with security groups when rolling out metricsinfra scraping as Resolved. Nov 15 2022, 7:14 AM

taavi claimed this task. Nov 15 2022, 9:02 AM

taavi updated the task description.

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Jan 18 2023, 7:12 PM

fnegri moved this task from Kanban to Inbox on the cloud-services-team board.

While there are still open feature requests as subtasks, I'm closing this task as the metricsinfra Prometheus service is now monitoring all instances and can replace Diamond.

• nskaggs awarded a token. Feb 10 2023, 9:42 PM

I'm boldly reopening this to keep this "epic" task as the entry point for all the other subtasks.

No, please create new tasks to track future work. This task was for the initial buildout that is now complete.

taavi removed subtasks: T288168: metricsinfra: Build out default alert rules, T288053: Add external meta-monitoring for metricsinfra, T287349: Deploy prometheus pushgateway to metricsinfra, T284993: Enable self-service Prometheus configuration management for project administrators, T186552: Create alerts for bastion hosts - Usage and latency. Sep 28 2024, 1:18 PM
