Beware staleness when creating recursive recording rules in Prometheus

At work, we have a Prometheus metric from our build system named autobuild_build_timestamp that exports the timestamp of the latest build for a given application and Git branch. The process that exports it is stateless, so if it restarts or a new version of it is deployed, it loses track of the builds it had previously reported - which leaves us out of luck if we want to do something like build a dashboard showing the latest build for every app. One thing we can do is reach for a range query wherever we need this data, but we can also use Prometheus' recording rules to give users who care about this data a nice abstraction, and to forgo a bunch of redundant recalculation of the same data.
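For concreteness, the raw series look something like this - one series per app and branch, where the value is the Unix timestamp of the latest build (the label values here are made up for illustration):

autobuild_build_timestamp{app_name="foo", branch="main"} 1730246400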

One way we could express this would be a simple range query:

record: autobuild_build_timestamp:last
expr: max(max_over_time(autobuild_build_timestamp[30d])) by(app_name, branch)

...but while this is simple and straightforward, loading 30 days' worth of data can get expensive - we're no longer paying that cost in every alert rule and dashboard that uses this data, but the rule itself is still expensive to evaluate!
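(For completeness: in an actual rules file, that record/expr pair lives inside a rule group - something like the sketch below, where the group name and evaluation interval are placeholders.)

groups:
  - name: autobuild
    interval: 1m
    rules:
      - record: autobuild_build_timestamp:last
        expr: max(max_over_time(autobuild_build_timestamp[30d])) by(app_name, branch)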

One pattern I've seen and used to work around this is to reference the metric produced by the recording rule inside the rule's own expression, so that each evaluation passes information along to its future self - here's an example of such a rule:

record: autobuild_build_timestamp:last
expr: max(autobuild_build_timestamp) by(app_name, branch) or autobuild_build_timestamp:last

...and here's an example timeline of how that would work:

| Event | Values for app_name label in autobuild_build_timestamp metric | Evaluation of the or expression | Result |
| --- | --- | --- | --- |
| First evaluation of recording rule | {foo} | {foo} or {} | {foo} |
| bar app gets built | {foo, bar} | {foo, bar} or {foo} | {foo, bar} |
| Build manager restarts, effectively clearing autobuild_build_timestamp | {} | {} or {foo, bar} | {foo, bar} |
| baz app gets built | {baz} | {baz} or {foo, bar} | {foo, bar, baz} |
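With the latched series in place, a dashboard that wants to show the latest builds only needs a cheap instant query against the recorded metric - for example (the branch value here is just illustrative):

sort_desc(autobuild_build_timestamp:last{branch="main"})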

I don't know if this pattern has a name - I've been calling them "latch" recording rules (after latches in electronics), but maybe a more familiar term would be "recursive recording rules", even though they're not really recursive.

This is handy, but there's a pitfall I ran into last week! I wrote the above recording rule to replace a more expensive one, so I was periodically checking count(autobuild_build_timestamp:last) against the count() of the legacy recording rule. However, during one of my checks, I saw that count go down. The count should never go down - new series should only ever be added to the metric, right? So what's going on here?

Now, an important detail in discovering the issue is that at work, we have a lot of data in our Prometheus instances. Sometimes, due to things like WAL replay, they can take a few minutes to start up. I realized that if that start-up time is greater than the staleness duration (Prometheus' query lookback window, which defaults to five minutes), then the first time the recording rule runs on that Prometheus, the autobuild_build_timestamp:last on the right-hand side of the or will have no series - and every series that was only being kept alive by the latch is gone!

So, how do we fix this? My solution was to use a one-hour range query along with last_over_time for the right-hand side:

record: autobuild_build_timestamp:last
expr: max(autobuild_build_timestamp) by(app_name, branch) or last_over_time(autobuild_build_timestamp:last[1h])

I figured that an hour wouldn't be super expensive, and if a Prometheus server is down for an hour, we probably should be getting paged!
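And if you want some reassurance that a latch rule like this hasn't silently dropped anything, one option is to occasionally compare it against the expensive 30-day expression - the unless below returns any series the latch is missing, so it should normally come back empty:

max(max_over_time(autobuild_build_timestamp[30d])) by(app_name, branch) unless autobuild_build_timestamp:last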

Published on 2024-10-30