
Discrepancy between Graphite & Prometheus editResponseTime counts
Open, High, Public

Description

I saw that T354905: migrate MediaWiki.timing.editResponseTime to statslib had been resolved for some time now, so I looked at converting the "Successful wiki edits" panels on the Grafana front page & www.wikimediastatus.net to use the version from Prometheus.

The original Graphite metric used was MediaWiki.timing.editResponseTime.sample_rate.

As best I can tell this ought to correspond to a sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[5m])) query against Thanos.
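For anyone who wants to run the same query outside Grafana, here is a minimal sketch in Python that builds an instant-query URL for the standard Prometheus HTTP API (`/api/v1/query`), which Thanos Query also serves. The base URL `https://thanos.example.org` is a placeholder and not the real endpoint, and `instant_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# The PromQL expression from the description above.
PROMQL = "sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[5m]))"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-compatible instant query URL (hypothetical helper)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder base URL; substitute the actual Thanos Query endpoint.
url = instant_query_url("https://thanos.example.org", PROMQL)
print(url)
```

Fetching that URL returns JSON with the current value of the expression, which can then be compared against the Graphite series over the same time range.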

However, comparing the results, the Prometheus metric is approximately half the expected value:
https://grafana.wikimedia.org/goto/sBKcZCBIg

Am I misunderstanding something or is there something wrong?

Event Timeline

Thanks for the report!

I'd hypothesize this is because Prometheus stats ingestion is not yet enabled on k8s hosts. The per-pod deployment strategy is convenient, but we've been concerned about turning it on in light of T359640: mediawiki_resourceloader_build_seconds_bucket big metric on Prometheus ops.
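To illustrate the hypothesis, here is a toy sketch in Python of how a `sum(rate(...))` comes out low when some hosts are not ingested. All host names and counter values are made up, and the even split between k8s and non-k8s traffic is an assumption chosen to mirror the "approx half" observation, not a measurement. The `per_second_rate` helper is a simplification of PromQL `rate()`, which also handles counter resets and extrapolation.

```python
WINDOW_SECONDS = 300  # matches the [5m] range in the query


def per_second_rate(count_start: int, count_end: int, window_seconds: int) -> float:
    """Simplified rate(): counter increase divided by the window length."""
    return (count_end - count_start) / window_seconds


# Hypothetical (start, end) samples of the _count counter per host.
samples = {
    "bare-metal-1": (1000, 1600),
    "bare-metal-2": (2000, 2600),
    "k8s-pod-1": (500, 1100),
    "k8s-pod-2": (700, 1300),
}

# Assumption: only the non-k8s hosts are ingested into Prometheus.
ingested = {"bare-metal-1", "bare-metal-2"}

true_rate = sum(per_second_rate(s, e, WINDOW_SECONDS) for s, e in samples.values())
seen_rate = sum(per_second_rate(*samples[h], WINDOW_SECONDS) for h in ingested)

print(f"all hosts: {true_rate} edits/s, ingested only: {seen_rate} edits/s")
print(f"ratio: {seen_rate / true_rate}")
```

Graphite receives the statsd metric from every host, so it would reflect the first figure, while the Prometheus query would only see the second.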

We're coordinating with ServiceOps to redesign the exporter deployment on k8s. How we do this should also take into account: T359497: StatsD Exporter: gracefully handle metric signature changes.

Indeed, I agree that the root cause is what @colewhite pointed out. Since (as far as I'm aware) we don't have an ETA for tweaking the statsd-exporter deployment on wikikube as described in T359640, I think we should go back to the Graphite/statsd metric for edits so that the numbers are accurate.

Now that T365265 is nearing completion, this may be worth another look, @CDanis?