Use caution when using $__rate_interval along with increase()

A few months ago, I was pairing with my coworker Graham on a Grafana dashboard, and we ran into a pitfall I thought it'd be nice to share. We were trying to create a dashboard that would show how many changes happened to a database table between points on the graph, and we were getting some mysterious results!

We were looking at increase(detected_changes_count[$__interval]), which is Grafana-speak for "how much did detected_changes_count increase between points on this graph?". One thing we changed pretty quickly was to replace $__interval with $__rate_interval, which I had recently read a post about. The summary of that post is that PromQL range vector functions like rate() and increase() can only do their job if they have at least two samples in the range - otherwise, you get "no data". Using $__interval can cause problems if you zoom in too far on your graph, or if there's data missing (e.g. due to a failed scrape) - in either case, detected_changes_count[$__interval] won't contain enough samples, and you'll get that dreaded "no data" message.

The fix for that is to use the aforementioned $__rate_interval - it's just like $__interval, but it is guaranteed to be at least four times the scrape interval, which leaves enough room for two samples even when you zoom in or a scrape fails.
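To make that concrete, here's a minimal sketch of how I understand Grafana to compute $__rate_interval, going by its documentation: the larger of "$__interval plus the scrape interval" and "four times the scrape interval". The function name and the example numbers are my own, purely for illustration.

```python
def rate_interval_seconds(interval_s: float, scrape_interval_s: float) -> float:
    """Approximate $__rate_interval, assuming Grafana's documented formula:
    max($__interval + scrape interval, 4 * scrape interval)."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: points 15s apart, scraping every 15s.
zoomed_in = rate_interval_seconds(15, 15)
# Zoomed out: points 120s apart, same scrape interval.
zoomed_out = rate_interval_seconds(120, 15)

print(zoomed_in, zoomed_out)
```

When you're zoomed in, the "four times the scrape interval" floor wins, and that's the case that matters for the rest of this post.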

However, there's a caveat here - we noticed that the table change counts looked a lot higher when we used $__rate_interval! In hindsight, the reason is clear: if you're looking at a graph using increase() where the points are placed fifteen seconds apart, you'll assume each point shows the table change count over that fifteen second interval. But remember - $__rate_interval is at least four times the scrape interval, so if your scrape interval is also fifteen seconds, each point actually shows the increase over a sixty second window, and you may come away thinking that increase happened over fifteen seconds!
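Here's the arithmetic as a small sketch. It assumes an idealized counter that grows at a constant pace (the 2 changes per second is a made-up number), and it ignores the extrapolation that Prometheus's real increase() does at the window edges.

```python
# Idealized counter: constant growth, no resets, no missed scrapes.
CHANGES_PER_SECOND = 2.0  # hypothetical workload


def ideal_increase(window_s: float) -> float:
    """Increase of the idealized counter over a window of window_s seconds."""
    return CHANGES_PER_SECOND * window_s


graph_interval_s = 15  # spacing between points on the graph
rate_interval_s = 60   # assumed value of $__rate_interval with a 15s scrape interval

expected = ideal_increase(graph_interval_s)  # what you assume each point means
plotted = ideal_increase(rate_interval_s)    # what each point actually covers
overstatement = plotted / expected

print(expected, plotted, overstatement)
```

Under these assumptions every point on the graph is four times larger than the fifteen second count you think you're reading.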

rate() doesn't suffer from this problem, since it divides by the duration of the range and so always reports a per-second value, but be mindful of this when using increase() with $__rate_interval. The same likely applies to other range vector functions, though I haven't looked into any others.

One way you can get around this behavior in the case of increase() is to use rate() instead and multiply by the graph interval in seconds - so rate(detected_changes_count[$__rate_interval]) * $__interval_ms / 1000 instead of increase(detected_changes_count[$__rate_interval]). ($__interval_ms is the graph interval in milliseconds, hence the division by 1000.)
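As a rough check of why this works, here's a sketch that mimics the query with plain numbers. The function names are mine, and the 120 changes over a 60 second window are hypothetical values rather than real data.

```python
def per_second_rate(increase_over_window: float, window_s: float) -> float:
    """Simplified stand-in for rate(): the increase divided by the window length."""
    return increase_over_window / window_s


def per_point_increase(increase_over_window: float, window_s: float,
                       interval_ms: float) -> float:
    """Mirrors rate(x[$__rate_interval]) * $__interval_ms / 1000."""
    return per_second_rate(increase_over_window, window_s) * interval_ms / 1000


# 120 changes seen over a 60s window, graph points 15s (15000 ms) apart.
scaled = per_point_increase(120, 60, 15000)
print(scaled)
```

The window length cancels out, so the result is scaled to the spacing between points on the graph no matter how wide $__rate_interval ends up being.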

Published on 2024-02-28