In T250206: Deploy a proof of concept prometheus server in cloudvps to replace shinken, we built a proof of concept (POC) that uses Prometheus instead of Shinken to monitor a subset of projects. We later realized that T210993: Deprecate Diamond collectors in Cloud VPS means basic dashboards also need to move to Prometheus, for example T264920: Grafana "cloud-vps-project-board" needs to be migrated from Graphite to Prometheus.
We now need to redesign and build out the POC so that it scales to collecting at least basic instance health information for all instances in all projects, and retains it for a reasonable period (at least 3 months, ideally 1 year or more).
Features needed beyond POC:
- Storage for metrics from 700+ instances
- Redundancy
- Configure the alert rules to monitor disk capacity
- Project-local puppetmaster for secrets storage (e.g. volunteers' email addresses)
- Refactor puppet module to merge public and private hiera config
- IRC relay for IRC alerting per project
- karma alert management dashboard with ability to silence alerts (T250206#6443507, T285055: Allow project members/admins to ack/silence alerts of that project)
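To put rough numbers on the storage item above, here is a minimal sizing sketch based on the capacity formula from the Prometheus storage documentation (retention time in seconds × ingested samples per second × bytes per sample, with roughly 1-2 bytes per sample). The instance count and retention targets come from this task; the series count per instance and the scrape interval are placeholder assumptions, not measurements from the POC, and should be replaced with real values.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(instances, series_per_instance, scrape_interval_s,
                        retention_days, bytes_per_sample=2.0):
    """Estimate Prometheus TSDB disk usage in bytes.

    Follows the rule of thumb from the Prometheus storage docs:
    retention_seconds * ingested_samples_per_second * bytes_per_sample.
    """
    retention_seconds = retention_days * SECONDS_PER_DAY
    total_series = instances * series_per_instance
    # Each series yields one sample per scrape interval.
    samples_per_second = total_series / scrape_interval_s
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # 700 instances is from the task; 500 series per instance and a
    # 60 s scrape interval are ASSUMED values for illustration only.
    for days in (90, 365):
        size = estimate_disk_bytes(700, 500, 60, days)
        print(f"{days} days: {size / 1e9:.1f} GB")
```

The estimate scales linearly with every input, so a shorter scrape interval or a larger exporter footprint per instance changes the result proportionally. It also ignores WAL and compaction overhead, so it is best read as a lower bound when choosing volume sizes.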